CN117152416A - Sparse attention target detection method based on DETR improved model - Google Patents

Sparse attention target detection method based on DETR improved model

Info

Publication number
CN117152416A
CN117152416A (application CN202311122596.8A)
Authority
CN
China
Prior art keywords
attention
sparse
feature
target
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311122596.8A
Other languages
Chinese (zh)
Inventor
王文豪
伍言伦
付步颖
孙陈瑾
靳陶阳
牟孝志
陈鑫
赵丽娟
戚薇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN202311122596.8A
Publication of CN117152416A
Legal status: Pending (Current)

Classifications

    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/048 Activation functions
    • G06N3/0499 Feedforward networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/764 Image or video recognition or understanding using classification, e.g. of video objects
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06V2201/07 Target detection
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a sparse attention target detection method based on a DETR improved model, which builds on the Deformable DETR framework. The encoder is formed by stacking several encoder layers, each mainly comprising an instance-dependent sparse attention module, a gated linear control unit, and residual connection and normalization operations between them; the decoder is formed by stacking several decoder layers, each mainly comprising a multi-head self-attention module, an instance-dependent sparse attention module, a gated linear control unit, and residual connection and normalization operations between them. The application enhances the expressive power of attention by exploiting the dependency relationships among instances: the sparse attention dynamically adjusts the connectivity among features according to the content of the input image, captures semantic information better, and reduces computational complexity. The application can improve both the computational efficiency of the model and its detection performance.

Description

Sparse attention target detection method based on DETR improved model
Technical Field
The application relates to a target detection method, in particular to a sparse attention target detection method based on a DETR improved model.
Background
Target detection aims to accurately identify and locate specific target objects in images or video, and is an important research direction in the field of computer vision. Convolutional Neural Networks (CNNs) were the mainstay of target detection models for many years. However, the great success of Transformers in the field of Natural Language Processing (NLP) has prompted researchers to explore their potential in Computer Vision (CV). The Transformer architecture has proven effective at capturing long-range dependencies in sequence data.
In the field of target detection, the core model based on the Transformer architecture is DETR (DEtection TRansformer), which uses the Transformer encoder-decoder architecture to address the target detection problem. The encoder extracts features from the input image, and the decoder generates a target prediction for each query. The core idea of DETR is to convert target detection into a set prediction problem, i.e. predicting the class of the target, the bounding box coordinates and whether a target is present, thereby avoiding the anchor boxes required by traditional methods. Among the improved models based on DETR, many draw on the multi-scale feature fusion and deformable attention ideas of Deformable DETR. Deformable DETR proposes a new attention mechanism, Deformable Attention, for processing image feature maps and improving model performance. This idea is widely used in improved versions of DETR to address DETR's slow training convergence and the limited spatial resolution of its features.
The deformable attention mechanism in Deformable DETR is an innovative approach that reduces computational complexity by performing the attention calculation over only a few sampling points. However, this approach also has drawbacks and limitations. First, deformable attention operates over the entire feature map and requires an additional offset prediction for every sampling point, which raises the computational cost. Second, it may cause information loss: although the deformable attention mechanism reduces computational complexity, if the sampled points are too few or do not cover the target region or critical information well, ignoring other important information can degrade detection or recognition performance.
Disclosure of Invention
Purpose of the application: to provide a sparse attention target detection method based on a DETR improved model that has high computational efficiency and good detection performance.
The technical scheme is as follows: the sparse attention target detection method based on the DETR improved model provided by the application comprises the following steps:
(1) Inputting the training data set into the backbone network Swin Transformer V1 and extracting three layers of feature maps C3, C4 and C5;
(2) Converting the three feature maps C3, C4 and C5 into four feature layers through a multi-scale feature fusion module, fusing the four feature layers and adding relative position coding information to obtain a multi-scale fusion feature map;
(3) Taking the multi-scale fusion feature map as the input of the encoder, the encoder being formed by stacking several encoder layers, each layer mainly comprising an instance-dependent sparse attention module, a gated linear control unit, and residual connection and normalization operations between them; the input feature sequence is processed in turn by the instance-dependent sparse attention module, a residual connection and normalization operation and the gated linear control unit, and finally passes through one more residual connection and normalization operation to obtain the output of one encoder layer; the encoder is traversed several times to obtain the encoder output feature map;
(4) Taking the encoder output feature map as the input of the decoder, the decoder being formed by stacking several decoder layers, each layer mainly comprising a multi-head self-attention module, an instance-dependent sparse attention module, a gated linear control unit, and residual connection and normalization operations between them; the feature sequence with position coding is fed into the multi-head self-attention module, whose output, after a residual connection and normalization operation, is processed in turn by the instance-dependent sparse attention module, a residual connection and normalization operation and the gated linear control unit, and finally passes through one more residual connection and normalization operation to obtain the output of one decoder layer; the decoder is traversed several times to obtain the decoder output feature vector;
(5) Predicting the class and bounding box of the decoder output feature vector through a linear layer and a multi-layer perceptron respectively to obtain a predicted target set, wherein each target contains class and bounding-box coordinate information;
(6) Carrying out the overall network loss calculation between the predicted target set and the real target set, and optimizing the model through back propagation;
(7) Repeating steps (1) to (6) several times to obtain a trained target detection model.
Further, step (1) includes:
the original input feature map size is H multiplied by W multiplied by 3, H represents the height of the image, and W represents the width of the image;
three layers of feature maps C3, C4 and C5 are extracted through the backbone network Swin Transformer V1; their sizes decrease successively, with the spatial resolution halved from each level to the next.
Further, step (2) includes:
the three feature maps C3, C4 and C5 are each projected by a 1×1 convolution with stride 1 to feature maps of a common channel dimension, giving three feature layers; the last feature map C5 is additionally transformed by a 3×3 convolution with stride 1 to obtain a fourth feature layer;
coordinate information is added to the four feature layers: relative position coordinates are introduced to distinguish the position information of feature points at different levels, the absolute coordinates of the feature points of each layer being converted into relative coordinates by a position embedding method; the relative coordinates of the feature points of each layer are then concatenated with the scale information to obtain the multi-scale fusion feature map.
Further, the relative position coding includes a learnable scale-level embedding and a learnable position embedding.
Further, in step (3), the instance-dependent sparse attention module performs the following operations:
firstly, the multi-scale fusion feature map is partitioned into patches to obtain a feature vector sequence X = {x_1, x_2, ..., x_N}, X ∈ R^(N×d), where N denotes the length of the feature sequence, x_i ∈ R^d denotes the i-th feature vector in the sequence, R denotes the real field and d is the feature dimension; each feature vector is linearly transformed by three linear projections Q = XW_Q, K = XW_K and V = XW_V to obtain the query vectors Q = {q_1, q_2, ..., q_N}, the key vectors K = {k_1, k_2, ..., k_N} and the value vectors V = {v_1, v_2, ..., v_N}, where q_i, k_i, v_i ∈ R^d and W_Q, W_K, W_V ∈ R^(d×d) are learnable parameter matrices optimized by back propagation during training, so that the model can adaptively learn a representation of the input sequence;
then, a lightweight connection prediction module estimates a connection score between each pair of feature vectors, which reflects the semantic relevance of the two feature vectors; the connection prediction module performs the following operations:
the low-rank attention weight is calculated as
A^down = softmax( Q (W_down K)^T / √d )
where a low-rank approximation of the attention matrix is computed from the query Q and the down-projected key W_down K; W_down ∈ R^(n_down×N) is a learnable parameter matrix, n_down denotes the reduced dimension, N denotes the length of the input feature sequence, W_down K denotes projecting the token dimension of K down to the lower dimension, d denotes the feature dimension, softmax denotes the normalization function and (·)^T denotes matrix transposition;
the low-rank attention weights are sparsified with a threshold:
Ā^down_ij = A^down_ij if A^down_ij ≥ τ, and Ā^down_ij = 0 otherwise,
where A^down_ij denotes the low-rank attention weight between the pair of feature vectors i and j, and τ denotes the threshold; in the low-rank attention sparsification, values smaller than τ are discarded directly and zero values are not stored;
an up-projected sparse connection mask M is generated by the connection mask predictor:
M = 1[ Top-k( Ā^down W_up ) ]
where the connection mask predictor performs a sparse matrix multiplication with the up-projection matrix W_up, i.e. W_up is a learnable parameter matrix; the Top-k algorithm selects a limited number of similarity scores, i.e. the k most relevant feature vectors are selected as attention objects instead of computing all possible pairs; a binarization operation then yields the up-projected sparse connection mask M, where 1[·] denotes binarization, mapping elements of the selected subset to one and all other elements to zero; in the connection mask predictor, binarization is applied to the connection score of each pair of tokens, the score representing their relevance for attention;
then, under the guidance of the sparse connection mask M, the algorithm only computes the non-zero elements of the full-rank attention weight A, i.e. for each pair of feature vectors satisfying M_ij = 1, the two vectors are regarded as similar and the attention matching calculation is performed; the sparse full-rank attention matrix is
A_ij = softmax( q_i k_j^T / √d ) for M_ij = 1;
finally, for each query vector i, the corresponding output vector is computed as
y_i = Σ_j Â_ij v_j, i, j ∈ [1, N],
where Â_ij = 0 when M_ij ≠ 1 and Â_ij = A_ij otherwise, N is the length of the feature sequence, v_j denotes the j-th element of the value vectors V = {v_1, v_2, ..., v_N}, and Â_ij denotes the attention-weighted result between feature vectors i and j; the final output of the whole instance-dependent sparse attention module is Y = {y_1, y_2, ..., y_N}.
further, in step (3), the feature sequence obtained by the sparse attention module of the dependent instanceObtaining input data x through residual error connection and normalization operation;
the gating linear control unit performs the following operations:
the input data x is subjected to linear transformation:
h=W·x+b 1
wherein h represents an intermediate vector, and h is divided into two equal parts, namely a and b; w represents matrix multiplication, b 1 Representing offset term addition;
multiplying the input data X by a Bernoulli distribution Bernoulli (φ (X)) by a GELU activation function, subjecting the input data X to a standard normal distribution N (0, 1) by φ (X) =P (X.ltoreq.x), and calculating to obtain a gating vector g:
g=σ(W g ·x+b g )
wherein σ () represents a GELU activation function; w (W) g Is the weight of the gating mechanism, b g Is an offset term of the gating mechanism;
by multiplying the gating vector g with b, a gated nonlinear section h is obtained gated =g++b, ++represents product;
finally, the gating part and the linear part are added to obtain an output GLU (x) =h of the gating linear control unit gated +a。
Further, in step (4), the feature sequence with position coding can be regarded as a combination of a series of position embedding and coded image features, and the multi-head self-attention module performs the following operations:
for each position, calculating a Query, a Key and a Value through three different linear transformations, wherein the transformations are fully connected, the Query represents the characteristics of the current position, and the Key and the Value represent the characteristics of other positions; calculating attention scores for the Query of each position and keys of other positions, wherein the attention scores reflect the similarity between the current position and the other positions; finally, attention weight calculation and weighted sum are carried out, attention scores are scaled, and then attention weights are obtained through a softmax function, and the attention degree of each position to other positions is determined by the weights; and weighting and summing the Value by using the attention weight to obtain a self-attention output result.
Further, in step (4), the query matrix Q in the instance-dependent sparse attention module is obtained by a linear transformation of the encoder output feature sequence, while the key matrix K and the value matrix V are obtained by linear transformations of the output of the multi-head self-attention module after residual connection and normalization.
Further, in step (6), the relative coordinate information of the predicted target bounding boxes is decoded and mapped back to the original image size; a loss function is then defined, comprising a target class loss and a target box coordinate position loss; the two losses are combined by weighted summation, and the matching between the predicted box information and the ground-truth information is computed with the Hungarian algorithm.
Further, the target class loss measures the difference between the predicted target class and the ground-truth target class; for each predicted target, the class loss is calculated as
L_cls = -(1/N_pos) · Σ_i 1_{i∈pos} · ŷ_i · log(p_i)
where N_pos is the number of positive samples, pos is the index set of positive samples, 1_{i∈pos} is the indicator function, ŷ_i is the one-hot encoding of the corresponding ground-truth class label, and p_i is the predicted class probability of the i-th target box;
the target box coordinate position loss measures the difference between the predicted and ground-truth box coordinates; for each predicted target box, the coordinate position loss is calculated as
L_box = (1/N_pos) · Σ_i 1_{i∈pos} · SmoothL1(b_i - b̂_i)
where b_i is the predicted coordinate information of the target box, N_pos is the number of positive samples, pos is the index set of positive samples, 1_{i∈pos} is the indicator function, SmoothL1 is the smooth L1 loss function used to balance the effect of large and small deviations, and b̂_i is the coordinate information of the corresponding ground-truth target box.
The beneficial effects are that: compared with the prior art, the application has the following remarkable advantages:
(1) Deformable attention calculation requires an offset prediction for every sampling point, which increases computational complexity and memory consumption; the instance-dependent sparse attention of the application requires no offset prediction and computes the attention weights directly from instance features, which reduces the computational cost and improves computational efficiency.
(2) Deformable attention calculations may cause the attention position to deviate from the effective area, thereby degrading performance; the application uses a connection prediction mask module to limit the attention position, so that the attention position can be adaptively selected, the quality of the characteristic representation is improved, invalid areas are avoided, and the detection performance is better.
Drawings
FIG. 1 is a block flow diagram of a method for sparse attention target detection based on a DETR improved model provided by an embodiment of the application;
FIG. 2 is a diagram of a network architecture for a model in an embodiment of the present application;
FIG. 3 is a network block diagram of an instance-dependent sparse attention module in an embodiment of the application.
Detailed Description
The application is further described below with reference to the accompanying drawings.
As shown in fig. 1, an embodiment of the present application provides a sparse attention target detection method based on a DETR improvement model, including the steps of:
(1) Inputting the training data set into a backbone network Swin Transformer V1, and extracting three layers of feature graphs C3, C4 and C5;
referring to fig. 2, the original input feature map size is H×W×3, where H denotes the height of the image and W its width; the three layers of feature maps C3, C4 and C5 are extracted through the backbone network Swin Transformer V1, with the spatial resolution halved from each level to the next.
The three-layer feature maps C3, C4 and C5 serve as inputs for a subsequent multi-scale feature fusion module.
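For orientation, the following minimal PyTorch sketch illustrates the tensor shapes expected from step (1). The DummyBackbone below is a hypothetical stand-in built from plain strided convolutions rather than the actual Swin Transformer V1, and the channel counts are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DummyBackbone(nn.Module):
    """Hypothetical stand-in for the Swin Transformer V1 backbone: it only
    reproduces the output strides (8, 16, 32) of the three feature levels."""
    def __init__(self, channels=(192, 384, 768)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Conv2d(3 if i == 0 else channels[i - 1], c, kernel_size=3,
                      stride=8 if i == 0 else 2, padding=1)
            for i, c in enumerate(channels)
        ])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # [C3, C4, C5]

images = torch.randn(2, 3, 512, 512)          # batch of H x W x 3 images
c3, c4, c5 = DummyBackbone()(images)
print(c3.shape, c4.shape, c5.shape)           # strides 8, 16 and 32 w.r.t. the input
```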
(2) Three-layer feature images C3, C4 and C5 are converted into four feature layers through a multi-scale feature fusion module, the four feature layers are fused, corresponding relative position coding information is added, and a multi-scale fusion feature image is obtained;
referring to fig. 2, first, three-layer feature maps C3, C4, and C5 are sequentially transformed into a size by three convolutions with 1×1 step size of 1And->The feature map of (2) is a three-layer high-resolution map of s1, s2 and s 3. The final layer of characteristic diagram C5 is transformed into a convolution with a convolution kernel of 3 multiplied by 3 and a step length of 1And obtaining an s4 low-resolution graph as a fourth feature layer.
Then, adding coordinate information to four feature layers, and introducing relative position coordinates for distinguishing the position information of feature points of different layers in multi-scale feature fusion, wherein the position embedding method is to convert the absolute coordinates of the feature points of each layer into relative coordinates, namely the offset of the points relative to the center of an input image;
the relative position coding includes a learnable scale-level embedding and a learnable position embedding;
and splicing the relative coordinates of the feature points of each layer and the scale information to obtain a multi-scale fusion feature map.
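A minimal PyTorch sketch of this fusion step is shown below. The model dimension d_model, the input channel counts and the use of a learnable per-level embedding to carry the scale information are assumptions; the conversion of absolute to relative coordinates described above is omitted for brevity.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Sketch of step (2): 1x1 convolutions project C3-C5 to a common dimension,
    a 3x3 convolution on C5 adds a fourth level, and all levels are flattened and
    concatenated into one token sequence with a scale-level embedding added."""
    def __init__(self, in_channels=(192, 384, 768), d_model=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, d_model, kernel_size=1) for c in in_channels])
        self.extra = nn.Conv2d(in_channels[-1], d_model, kernel_size=3, stride=1, padding=1)
        self.level_embed = nn.Parameter(torch.zeros(4, d_model))  # learnable scale-level embedding

    def forward(self, c3, c4, c5):
        levels = [conv(x) for conv, x in zip(self.lateral, (c3, c4, c5))]
        levels.append(self.extra(c5))                      # fourth feature layer
        tokens = []
        for lvl, feat in enumerate(levels):
            t = feat.flatten(2).transpose(1, 2)            # (B, H*W, d_model)
            tokens.append(t + self.level_embed[lvl])       # attach scale information
        return torch.cat(tokens, dim=1)                    # multi-scale fused feature sequence
```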
(3) Taking the multi-scale fusion feature map as the input of the encoder, six encoder layers are stacked, each layer mainly comprising an instance-dependent sparse attention module, a gated linear control unit, and residual connection and normalization operations (Add & Norm) between them; the input fused feature sequence is processed in turn by the instance-dependent sparse attention module, a residual connection and normalization operation and the gated linear control unit, and finally passes through one more residual connection and normalization operation to obtain the output of one encoder layer; the encoder is traversed six times to obtain the encoder output feature map;
the encoder output feature map (output feature vector) is then used as the query input to the decoder.
In combination with fig. 2 and fig. 3, in the encoder part the instance-dependent sparse attention module mainly consists of two parts, a connection prediction mask module M and a sparse attention module: the connection prediction mask module M predicts the connection probability between each pair of instances and stores the connection score of each pair; the sparse attention module sparsifies the attention weights according to the connection scores of each pair of instances, retaining only a subset of important connections.
The connection prediction mask module M performs the following operations:
firstly, the multi-scale fusion feature map obtained in step (2) is taken as the input feature map of the encoder and partitioned into patches to obtain a feature vector sequence X = {x_1, x_2, ..., x_N}, X ∈ R^(N×d), where N denotes the length of the feature sequence, x_i ∈ R^d denotes the i-th feature vector in the sequence, R denotes the real field and d is the feature dimension; each feature vector is linearly transformed by three linear projections Q = XW_Q, K = XW_K and V = XW_V to obtain the query vectors Q = {q_1, q_2, ..., q_N}, the key vectors K = {k_1, k_2, ..., k_N} and the value vectors V = {v_1, v_2, ..., v_N}, where q_i, k_i, v_i ∈ R^d and W_Q, W_K, W_V ∈ R^(d×d) are learnable parameter matrices optimized by back propagation during training, so that the model can adaptively learn a representation of the input sequence;
then, a lightweight connection prediction module estimates a connection score between each pair of feature vectors, which reflects the semantic relevance of the two feature vectors; the connection prediction module performs the following operations:
the low-rank attention weight is calculated as
A^down = softmax( Q (W_down K)^T / √d )
where a low-rank approximation of the attention matrix is computed from the query Q and the down-projected key W_down K; W_down ∈ R^(n_down×N) is a learnable parameter matrix, n_down denotes the reduced dimension and is set to 32, N denotes the length of the feature sequence, W_down K denotes projecting the token dimension of K down to the lower dimension, d denotes the feature dimension, softmax denotes the normalization function and (·)^T denotes matrix transposition;
the low-rank attention weights are sparsified with a threshold:
Ā^down_ij = A^down_ij if A^down_ij ≥ τ, and Ā^down_ij = 0 otherwise,
where A^down_ij denotes the low-rank attention weight between the pair of feature vectors i and j, and τ denotes the threshold, set to 0.05; in the low-rank attention sparsification, values smaller than 0.05 are discarded directly and zero values are not stored;
secondly, an up-projected sparse connection mask M is generated by the connection mask predictor:
M = 1[ Top-k( Ā^down W_up ) ]
where the connection mask predictor performs a sparse matrix multiplication with the up-projection matrix W_up, i.e. W_up is a learnable parameter matrix; the Top-k algorithm reduces the computational effort and memory requirements by selecting a limited number of similarity scores, i.e. the k most relevant feature vectors are selected as attention objects instead of computing all possible pairs; a binarization operation then yields the up-projected sparse connection mask M, where 1[·] denotes binarization, mapping elements of the selected feature vector subset to one and all other elements to zero; in the connection mask predictor, binarization is applied to the connection score of each pair of tokens, the score representing their relevance for attention. With this processing, only the connections with the highest attention relevance are considered in subsequent calculations, further reducing the computation and memory usage. In summary, the connection prediction mask module M stores the connection scores between each pair of feature vectors: the score is one for pairs with a relevant dependency and zero otherwise.
The sparse attention module performs the following operations:
Under the guidance of the sparse connection mask M, the algorithm only computes the non-zero elements of the full-rank attention weight A, i.e. for a pair of feature vectors satisfying M_ij = 1, the two vectors are regarded as having a dependency relationship and the attention matching calculation is performed; the sparse full-rank attention matrix is computed as
A_ij = softmax( q_i k_j^T / √d ) for M_ij = 1;
finally, for each query vector i, the corresponding output vector is computed as
y_i = Σ_j Â_ij v_j, i, j ∈ [1, N],
where Â_ij = 0 when M_ij ≠ 1 and Â_ij = A_ij otherwise, N is the length of the feature sequence, v_j denotes the j-th element of the value vectors V = {v_1, v_2, ..., v_N}, and Â_ij denotes the attention-weighted result between feature vectors i and j; the final output of the whole instance-dependent sparse attention module is Y = {y_1, y_2, ..., y_N}.
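The computation described above can be sketched in PyTorch as follows. This is a dense reference implementation written for clarity: an actual implementation would evaluate the masked full-rank attention with sparse kernels so that only the entries selected by M are computed. The module sizes, the value of k and the parameter initialization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceDependentSparseAttention(nn.Module):
    """Sketch: low-rank attention, threshold sparsification, Top-k connection mask,
    then full-rank attention restricted to the masked positions."""
    def __init__(self, d_model, seq_len, n_down=32, tau=0.05, k=64):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        # W_down projects the N keys down to n_down summary tokens; W_up projects back up.
        self.w_down = nn.Parameter(torch.randn(n_down, seq_len) * 0.02)
        self.w_up = nn.Parameter(torch.randn(n_down, seq_len) * 0.02)
        self.tau, self.k, self.scale = tau, k, d_model ** -0.5

    def forward(self, x):                                   # x: (B, N, d), N == seq_len, k <= N
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        # 1) low-rank attention: softmax(Q (W_down K)^T / sqrt(d)) -> (B, N, n_down)
        k_down = torch.einsum('mn,bnd->bmd', self.w_down, k)
        a_down = F.softmax(q @ k_down.transpose(1, 2) * self.scale, dim=-1)
        # 2) threshold sparsification: scores below tau are dropped
        a_down = torch.where(a_down >= self.tau, a_down, torch.zeros_like(a_down))
        # 3) connection mask: project up to (B, N, N), keep the top-k scores per query, binarize
        scores = a_down @ self.w_up
        topk = scores.topk(self.k, dim=-1).indices
        mask = torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, topk, True)
        # 4) full-rank attention evaluated only where the mask is one
        attn = q @ k.transpose(1, 2) * self.scale
        attn = F.softmax(attn.masked_fill(~mask, float('-inf')), dim=-1)
        attn = torch.nan_to_num(attn)                       # guard against rows with no connection
        return attn @ v                                     # Y = {y_1, ..., y_N}
```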
The feature sequence Y obtained by the instance-dependent sparse attention module passes through a residual connection and normalization operation to obtain the input data x, which is fed into the gated linear control unit;
the gated linear control unit performs the following operations:
the input data x is first linearly transformed (matrix multiplication by W and addition of the bias b_1):
h = W·x + b_1
where h denotes the intermediate vector, which is split into two equal parts a and b;
the gating mechanism then computes a gating vector g with the GELU activation function, which multiplies its input x by the Bernoulli distribution Bernoulli(Φ(x)), where Φ(x) = P(X ≤ x) and X follows the standard normal distribution N(0, 1); the gating vector g controls the filtering of the information in the linear part b:
g = σ(W_g·x + b_g)
where σ(·) denotes the GELU activation function, W_g is the weight of the gating mechanism and b_g is the bias term of the gating mechanism;
next, the gated non-linear part is obtained by multiplying the gating vector g element-wise with b: h_gated = g ⊙ b, where ⊙ denotes the element-wise product;
finally, the gated part and the linear part are added to give the output of the gated linear control unit, GLU(x) = h_gated + a, which then passes through a residual connection and normalization operation to give the output of one encoder layer.
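A minimal sketch of the gated linear control unit described above, assuming the split projection and the gate are implemented with standard linear layers:

```python
import torch.nn as nn
import torch.nn.functional as F

class GatedLinearControlUnit(nn.Module):
    """Sketch: h = W x + b1 is split into a and b, a GELU gate g = GELU(W_g x + b_g)
    modulates b, and the output is g * b + a."""
    def __init__(self, d_model):
        super().__init__()
        self.linear = nn.Linear(d_model, 2 * d_model)   # produces h, split into a and b
        self.gate = nn.Linear(d_model, d_model)         # produces the gating vector g

    def forward(self, x):
        a, b = self.linear(x).chunk(2, dim=-1)
        g = F.gelu(self.gate(x))
        return g * b + a                                # gated non-linear part plus linear part
```

In each encoder layer this unit plays the role the feed-forward network usually plays in a Transformer layer, and it is followed by the Add & Norm operation described above.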
(4) Taking the encoder output feature map as the input of the decoder, six decoder layers are stacked, each layer mainly comprising a Multi-Head Self-Attention module, an instance-dependent sparse attention module, a gated linear control unit, and residual connection and normalization operations between them; the feature sequence with position coding is fed into the Multi-Head Self-Attention module, whose output, after a residual connection and normalization operation, is processed in turn by the instance-dependent sparse attention module, a residual connection and normalization operation and the gated linear control unit, and finally passes through one more residual connection and normalization operation to obtain the output of one decoder layer; the decoder is traversed six times to obtain the decoder output feature vector;
In connection with fig. 2, in the decoder part, the feature sequence with position coding, which can be regarded as a combination of a series of position embeddings and encoded image features, is first fed into the Multi-Head Self-Attention module.
The Multi-Head Self-Attention module performs the following operations:
for each position, calculating a Query, a Key and a Value through three different linear transformations, wherein the transformations are fully connected, the Query represents the characteristics of the current position, and the Key and the Value represent the characteristics of other positions; calculating attention scores for the Query of each position and keys of other positions, wherein the attention scores reflect the similarity between the current position and the other positions; finally, attention weight calculation and weighted sum are carried out, attention scores are scaled, and then attention weights are obtained through a softmax function, and the attention degree of each position to other positions is determined by the weights; and weighting and summing the Value by using the attention weight to obtain a self-attention output result. This will aggregate the information of the other locations to generate a new representation at the current location.
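For reference, this step can be sketched with PyTorch's built-in multi-head attention module as follows; the dimensions, the number of object queries and the way position encodings are added to the Query and Key are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_queries = 256, 8, 300
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

queries = torch.randn(2, n_queries, d_model)     # decoder feature sequence
query_pos = torch.randn(2, n_queries, d_model)   # position encodings

q = k = queries + query_pos                      # Query and Key carry position information
out, attn_weights = self_attn(q, k, value=queries)
# 'out' then passes through Add & Norm before the instance-dependent sparse attention
```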
After the self-attention calculation result passes through a residual connection and normalization (Add & Norm) operation, it serves as an input to the instance-dependent sparse attention module in the decoder: it provides the key and value of the sparse attention, while the query input is the output feature vector of the preceding encoder. After the instance-dependent sparse attention calculation, the subsequent steps of this decoder layer are identical to those of the encoder layer.
The instance-dependent sparse attention in step (4) differs from that in step (3) in that:
the query matrix Q in the instance-dependent sparse attention module of step (4) is obtained by a linear transformation of the encoder output feature sequence, while the key matrix K and the value matrix V are obtained by linear transformations of the output of the Multi-Head Self-Attention module after residual connection and normalization.
The subsequent operations are the same as in step (3): the output of one decoder layer is obtained through the gated linear control unit and a residual connection and normalization (Add & Norm) operation, and the decoder output feature vector is obtained after six traversals.
(5) Predicting the class and bounding box of the decoder output feature vector obtained in step (4) through a linear layer and a multi-layer perceptron respectively to obtain a predicted target set, wherein each target contains class and bounding-box coordinate information;
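A minimal sketch of these two prediction heads; d_model, the number of classes and the MLP depth are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, num_classes, num_queries = 256, 91, 300

class_head = nn.Linear(d_model, num_classes + 1)          # +1 for the "no object" class
bbox_head = nn.Sequential(                                # 3-layer MLP for box regression
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, 4), nn.Sigmoid(),                  # (cx, cy, w, h) in relative coordinates
)

decoder_out = torch.randn(2, num_queries, d_model)        # decoder output feature vectors
logits = class_head(decoder_out)                          # (2, 300, num_classes + 1)
boxes = bbox_head(decoder_out)                            # (2, 300, 4)
```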
(6) Carrying out network overall loss calculation between the predicted target set and the real target set, and optimizing a model through back propagation;
The predicted information is decoded: the predicted target bounding-box coordinates are in relative form and need to be mapped back to the original image size by decoding; a loss function is then defined to guide the learning process of the model, comprising two parts: a target class loss and a target box coordinate position loss; the class loss and the box coordinate position loss are combined by weighted summation, and the matching between predicted and ground-truth box information is computed with the Hungarian algorithm.
The target class loss measures the difference between the predicted target class and the ground-truth target class; for each predicted target, the class loss is calculated as
L_cls = -(1/N_pos) · Σ_i 1_{i∈pos} · ŷ_i · log(p_i)
where N_pos is the number of positive samples, pos is the index set of positive samples, 1_{i∈pos} is the indicator function, ŷ_i is the one-hot encoding of the corresponding ground-truth class label, and p_i is the predicted class probability of the i-th target box.
The target box coordinate position loss measures the difference between the predicted and ground-truth box coordinates; for each predicted target box, the coordinate position loss is calculated as
L_box = (1/N_pos) · Σ_i 1_{i∈pos} · SmoothL1(b_i - b̂_i)
where b_i is the predicted coordinate information of the target box, N_pos is the number of positive samples, pos is the index set of positive samples, 1_{i∈pos} is the indicator function, SmoothL1 is the smooth L1 loss function used to balance the effect of large and small deviations, and b̂_i is the coordinate information of the corresponding ground-truth target box.
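The following simplified sketch illustrates the matching and loss computation of step (6) for a single image, using SciPy's Hungarian solver. The matching cost, the box-loss weight and the omission of the "no object" term for unmatched queries are simplifying assumptions.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def detection_loss(logits, boxes, gt_labels, gt_boxes, box_weight=5.0):
    """logits: (Q, C+1), boxes: (Q, 4); gt_labels: (G,) long, gt_boxes: (G, 4); assumes G >= 1."""
    prob = logits.softmax(-1)
    # Matching cost: negative probability of the true class plus L1 box distance
    cost = -prob[:, gt_labels] + torch.cdist(boxes, gt_boxes, p=1)      # (Q, G)
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    pred_idx, gt_idx = torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)

    n_pos = len(gt_idx)                                                 # number of positive samples
    cls_loss = F.cross_entropy(logits[pred_idx], gt_labels[gt_idx], reduction='sum') / n_pos
    box_loss = F.smooth_l1_loss(boxes[pred_idx], gt_boxes[gt_idx], reduction='sum') / n_pos
    return cls_loss + box_weight * box_loss                             # weighted summation
```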
(7) Repeating the steps (1) to (6) for a plurality of times to obtain a trained target detection model, and using the trained target detection model for target detection.
The improved model of the application utilizes an instance-dependent sparse attention mechanism that focuses on the details of the relevant region of each object of interest while ignoring background and irrelevant regions. This sparsity improves computational efficiency while making the model more robust to challenges such as occlusion and deformation. By introducing an instance-dependent mechanism, the model supports individualized processing of each target instance, selectively attending to sampling points according to the feature information of each instance, which improves the accuracy of target detection.
The application is an improvement on the Deformable DETR framework, can better adapt to target detection tasks of different scenes and scales, has strong generalization capability, and is a general target detection model. The target detection method provided by the application has wide application scenarios and has application prospects in fields such as object recognition, video surveillance and autonomous driving.

Claims (10)

1. A DETR improvement model-based sparse attention target detection method, comprising:
(1) Inputting the training data set into the backbone network Swin Transformer V1 and extracting three layers of feature maps C3, C4 and C5;
(2) Converting the three feature maps C3, C4 and C5 into four feature layers through a multi-scale feature fusion module, fusing the four feature layers and adding relative position coding information to obtain a multi-scale fusion feature map;
(3) Taking the multi-scale fusion feature map as the input of the encoder, the encoder being formed by stacking several encoder layers, each layer mainly comprising an instance-dependent sparse attention module, a gated linear control unit, and residual connection and normalization operations between them; the input feature sequence is processed in turn by the instance-dependent sparse attention module, a residual connection and normalization operation and the gated linear control unit, and finally passes through one more residual connection and normalization operation to obtain the output of one encoder layer; the encoder is traversed several times to obtain the encoder output feature map;
(4) Taking the encoder output feature map as the input of the decoder, the decoder being formed by stacking several decoder layers, each layer mainly comprising a multi-head self-attention module, an instance-dependent sparse attention module, a gated linear control unit, and residual connection and normalization operations between them; the feature sequence with position coding is fed into the multi-head self-attention module, whose output, after a residual connection and normalization operation, is processed in turn by the instance-dependent sparse attention module, a residual connection and normalization operation and the gated linear control unit, and finally passes through one more residual connection and normalization operation to obtain the output of one decoder layer; the decoder is traversed several times to obtain the decoder output feature vector;
(5) Predicting the class and bounding box of the decoder output feature vector through a linear layer and a multi-layer perceptron respectively to obtain a predicted target set, wherein each target contains class and bounding-box coordinate information;
(6) Carrying out the overall network loss calculation between the predicted target set and the real target set, and optimizing the model through back propagation;
(7) Repeating steps (1) to (6) several times to obtain a trained target detection model.
2. The DETR improvement model based sparse attention target detection method of claim 1, wherein step (1) comprises:
the original input feature map size is H multiplied by W multiplied by 3, H represents the height of the image, and W represents the width of the image;
three layers of feature maps C3, C4 and C5 are extracted through the backbone network Swin Transformer V1; their sizes decrease successively, with the spatial resolution halved from each level to the next.
3. The DETR improvement model based sparse attention target detection method of claim 2, wherein step (2) comprises:
the three feature maps C3, C4 and C5 are each projected by a 1×1 convolution with stride 1 to feature maps of a common channel dimension, giving three feature layers; the last feature map C5 is additionally transformed by a 3×3 convolution with stride 1 to obtain a fourth feature layer;
coordinate information is added to the four feature layers: relative position coordinates are introduced to distinguish the position information of feature points at different levels, the absolute coordinates of the feature points of each layer being converted into relative coordinates by a position embedding method; the relative coordinates of the feature points of each layer are then concatenated with the scale information to obtain the multi-scale fusion feature map.
4. The DETR improvement model-based sparse attention target detection method according to claim 3, wherein the relative position coding comprises a learnable scale-level embedding and a learnable position embedding.
5. The DETR improvement model-based sparse attention target detection method of claim 1, wherein in step (3), the instance-dependent sparse attention module performs the following operations:
firstly, the multi-scale fusion feature map is partitioned into patches to obtain a feature vector sequence X = {x_1, x_2, ..., x_N}, X ∈ R^(N×d), where N denotes the length of the feature sequence, x_i ∈ R^d denotes the i-th feature vector in the sequence, R denotes the real field and d is the feature dimension; each feature vector is linearly transformed by three linear projections Q = XW_Q, K = XW_K and V = XW_V to obtain the query vectors Q = {q_1, q_2, ..., q_N}, the key vectors K = {k_1, k_2, ..., k_N} and the value vectors V = {v_1, v_2, ..., v_N}, where q_i, k_i, v_i ∈ R^d and W_Q, W_K, W_V ∈ R^(d×d) are learnable parameter matrices optimized by back propagation during training, so that the model can adaptively learn a representation of the input sequence;
then, a lightweight connection prediction module estimates a connection score between each pair of feature vectors, which reflects the semantic relevance of the two feature vectors; the connection prediction module performs the following operations:
the low-rank attention weight is calculated as
A^down = softmax( Q (W_down K)^T / √d )
where a low-rank approximation of the attention matrix is computed from the query Q and the down-projected key W_down K; W_down ∈ R^(n_down×N) is a learnable parameter matrix, n_down denotes the reduced dimension, N denotes the length of the input feature sequence, W_down K denotes projecting the token dimension of K down to the lower dimension, d denotes the feature dimension, softmax denotes the normalization function and (·)^T denotes matrix transposition;
the low-rank attention weights are sparsified with a threshold:
Ā^down_ij = A^down_ij if A^down_ij ≥ τ, and Ā^down_ij = 0 otherwise,
where A^down_ij denotes the low-rank attention weight between the pair of feature vectors i and j, and τ denotes the threshold; in the low-rank attention sparsification, values smaller than τ are discarded directly and zero values are not stored;
an up-projected sparse connection mask M is generated by the connection mask predictor:
M = 1[ Top-k( Ā^down W_up ) ]
where the connection mask predictor performs a sparse matrix multiplication with the up-projection matrix W_up, i.e. W_up is a learnable parameter matrix; the Top-k algorithm selects a limited number of similarity scores, i.e. the k most relevant feature vectors are selected as attention objects instead of computing all possible pairs; a binarization operation then yields the up-projected sparse connection mask M, where 1[·] denotes binarization, mapping elements of the selected subset to one and all other elements to zero; in the connection mask predictor, binarization is applied to the connection score of each pair of tokens, the score representing their relevance for attention;
then, under the guidance of the sparse connection mask M, the algorithm only computes the non-zero elements of the full-rank attention weight A, i.e. for each pair of feature vectors satisfying M_ij = 1, the two vectors are regarded as similar and the attention matching calculation is performed; the sparse full-rank attention matrix is
A_ij = softmax( q_i k_j^T / √d ) for M_ij = 1;
finally, for each query vector i, the corresponding output vector is computed as
y_i = Σ_j Â_ij v_j, i, j ∈ [1, N],
where Â_ij = 0 when M_ij ≠ 1 and Â_ij = A_ij otherwise, N is the length of the feature sequence, v_j denotes the j-th element of the value vectors V = {v_1, v_2, ..., v_N}, and Â_ij denotes the attention-weighted result between feature vectors i and j; the final output of the whole instance-dependent sparse attention module is Y = {y_1, y_2, ..., y_N}.
6. The DETR improvement model-based sparse attention target detection method according to claim 5, wherein in step (3), the feature sequence Y obtained by the instance-dependent sparse attention module passes through a residual connection and normalization operation to obtain the input data x;
the gated linear control unit performs the following operations:
the input data x is first linearly transformed:
h = W·x + b_1
where h denotes the intermediate vector, which is split into two equal parts a and b; W denotes the weight matrix of the linear transformation and b_1 the added bias term;
the gating vector g is computed with the GELU activation function, which multiplies its input x by the Bernoulli distribution Bernoulli(Φ(x)), where Φ(x) = P(X ≤ x) and X follows the standard normal distribution N(0, 1):
g = σ(W_g·x + b_g)
where σ(·) denotes the GELU activation function, W_g is the weight of the gating mechanism and b_g is the bias term of the gating mechanism;
the gated non-linear part is obtained by multiplying the gating vector g element-wise with b: h_gated = g ⊙ b, where ⊙ denotes the element-wise product;
finally, the gated part and the linear part are added to give the output of the gated linear control unit, GLU(x) = h_gated + a.
7. The DETR improvement model based sparse attention target detection method of claim 6, wherein in step (4), the feature sequence with position coding can be regarded as a combination of a series of position embedding and coded image features, and the multi-headed self-attention module performs the following operations:
for each position, calculating a Query, a Key and a Value through three different linear transformations, wherein the transformations are fully connected, the Query represents the characteristics of the current position, and the Key and the Value represent the characteristics of other positions; calculating attention scores for the Query of each position and keys of other positions, wherein the attention scores reflect the similarity between the current position and the other positions; finally, attention weight calculation and weighted sum are carried out, attention scores are scaled, and then attention weights are obtained through a softmax function, and the attention degree of each position to other positions is determined by the weights; and weighting and summing the Value by using the attention weight to obtain a self-attention output result.
8. The DETR improvement model-based sparse attention target detection method according to claim 7, wherein in step (4), the query matrix Q in the instance-dependent sparse attention module is obtained by a linear transformation of the encoder output feature sequence, while the key matrix K and the value matrix V are obtained by linear transformations of the output of the multi-head self-attention module after residual connection and normalization.
9. The DETR improvement model-based sparse attention target detection method according to claim 8, wherein in step (6), the relative coordinate information of the predicted target bounding boxes is decoded and mapped back to the original image size; a loss function is then defined, comprising a target class loss and a target box coordinate position loss; the two losses are combined by weighted summation, and the matching between the predicted box information and the ground-truth information is computed with the Hungarian algorithm.
10. The DETR improvement model-based sparse attention target detection method according to claim 9, wherein the target class loss measures the difference between the predicted target class and the ground-truth target class; for each predicted target, the class loss is calculated as
L_cls = -(1/N_pos) · Σ_i 1_{i∈pos} · ŷ_i · log(p_i)
where N_pos is the number of positive samples, pos is the index set of positive samples, 1_{i∈pos} is the indicator function, ŷ_i is the one-hot encoding of the corresponding ground-truth class label, and p_i is the predicted class probability of the i-th target box;
the target box coordinate position loss measures the difference between the predicted and ground-truth box coordinates; for each predicted target box, the coordinate position loss is calculated as
L_box = (1/N_pos) · Σ_i 1_{i∈pos} · SmoothL1(b_i - b̂_i)
where b_i is the predicted coordinate information of the target box, N_pos is the number of positive samples, pos is the index set of positive samples, 1_{i∈pos} is the indicator function, SmoothL1 is the smooth L1 loss function used to balance the effect of large and small deviations, and b̂_i is the coordinate information of the corresponding ground-truth target box.
CN202311122596.8A 2023-09-01 2023-09-01 Sparse attention target detection method based on DETR improved model Pending CN117152416A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311122596.8A CN117152416A (en) 2023-09-01 2023-09-01 Sparse attention target detection method based on DETR improved model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311122596.8A CN117152416A (en) 2023-09-01 2023-09-01 Sparse attention target detection method based on DETR improved model

Publications (1)

Publication Number Publication Date
CN117152416A true CN117152416A (en) 2023-12-01

Family

ID=88911432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311122596.8A Pending CN117152416A (en) 2023-09-01 2023-09-01 Sparse attention target detection method based on DETR improved model

Country Status (1)

Country Link
CN (1) CN117152416A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117370736A (en) * 2023-12-08 2024-01-09 暨南大学 Fine granularity emotion recognition method, electronic equipment and storage medium
CN117746233A (en) * 2023-12-08 2024-03-22 江苏海洋大学 Target lightweight detection method for unmanned cleaning ship in water area
CN117852974A (en) * 2024-03-04 2024-04-09 禾辰纵横信息技术有限公司 Online evaluation score assessment method based on artificial intelligence
CN117830874A (en) * 2024-03-05 2024-04-05 成都理工大学 Remote sensing target detection method under multi-scale fuzzy boundary condition
CN117830874B (en) * 2024-03-05 2024-05-07 成都理工大学 Remote sensing target detection method under multi-scale fuzzy boundary condition

Similar Documents

Publication Publication Date Title
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN111783705B (en) Character recognition method and system based on attention mechanism
Lu et al. Co-attending free-form regions and detections with multi-modal multiplicative feature embedding for visual question answering
CN117152416A (en) Sparse attention target detection method based on DETR improved model
CN112329760B (en) Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network
US20230186056A1 (en) Grabbing detection method based on rp-resnet
Xiao et al. Heterogeneous knowledge distillation for simultaneous infrared-visible image fusion and super-resolution
CN111259940B (en) Target detection method based on space attention map
CN110163286B (en) Hybrid pooling-based domain adaptive image classification method
CN112347888B (en) Remote sensing image scene classification method based on bi-directional feature iterative fusion
CN115496928B (en) Multi-modal image feature matching method based on multi-feature matching
CN113657124A (en) Multi-modal Mongolian Chinese translation method based on circulation common attention Transformer
CN114936623B (en) Aspect-level emotion analysis method integrating multi-mode data
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN113536925B (en) Crowd counting method based on attention guiding mechanism
CN110929080A (en) Optical remote sensing image retrieval method based on attention and generation countermeasure network
CN115222998B (en) Image classification method
CN114973222B (en) Scene text recognition method based on explicit supervision attention mechanism
CN115512096A (en) CNN and Transformer-based low-resolution image classification method and system
CN115131313A (en) Hyperspectral image change detection method and device based on Transformer
CN115909036A (en) Local-global adaptive guide enhanced vehicle weight identification method and system
CN115147601A (en) Urban street point cloud semantic segmentation method based on self-attention global feature enhancement
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
CN113642602B (en) Multi-label image classification method based on global and local label relation
CN117078956A (en) Point cloud classification segmentation network based on point cloud multi-scale parallel feature extraction and attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination