CN117152416A - Sparse attention target detection method based on DETR improved model - Google Patents

Sparse attention target detection method based on DETR improved model

Info

Publication number
CN117152416A
CN117152416A (application CN202311122596.8A)
Authority
CN
China
Prior art keywords
attention
sparse
feature
target
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311122596.8A
Other languages
Chinese (zh)
Inventor
王文豪
伍言伦
付步颖
孙陈瑾
靳陶阳
牟孝志
陈鑫
赵丽娟
戚薇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN202311122596.8A
Publication of CN117152416A
Legal status: Pending (Current)

Classifications

    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/048 Activation functions
    • G06N3/0499 Feedforward networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/764 Image or video recognition or understanding using classification, e.g. of video objects
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06V2201/07 Target detection
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a sparse attention target detection method based on a DETR improved model, which builds on the Deformable DETR framework. The encoder is formed by stacking several encoder layers, each mainly comprising an instance-dependent sparse attention module, a gated linear control unit, and residual connection and normalization operations between them; the decoder is formed by stacking several decoder layers, each mainly comprising a multi-head self-attention module, an instance-dependent sparse attention module, a gated linear control unit, and residual connection and normalization operations between them. The application enhances the expressive power of attention by exploiting the dependency relationships among instances: the sparse attention dynamically adjusts the connectivity among features according to the content of the input image, captures semantic information better, and reduces computational complexity. The application can improve both the computational efficiency of the model and its detection performance.

Description

Sparse attention target detection method based on DETR improved model
Technical Field
The application relates to a target detection method, in particular to a sparse attention target detection method based on a DETR improved model.
Background
Target detection aims to accurately identify and locate specific target objects in images or video, and is an important research direction in the field of computer vision. Convolutional Neural Networks (CNNs) were the mainstay of target detection models for many years. However, the great success of Transformers in the field of Natural Language Processing (NLP) has prompted researchers to explore their potential in Computer Vision (CV). The Transformer architecture has proven effective at capturing long-range dependencies in sequence data.
In the field of target detection, the core model based on the Transformer architecture is DETR (DEtection TRansformer), which uses the Transformer encoder-decoder architecture to address the target detection problem. The encoder extracts features from the input image, and the decoder generates a target prediction for each query. The core idea of DETR is to convert target detection into a set prediction problem, i.e. predicting the class of the target, the bounding box coordinates and whether a target is present, thereby avoiding the anchor boxes required by traditional methods. Among the improved models based on DETR, many draw on the multi-scale feature fusion and deformable attention ideas of Deformable DETR. Deformable DETR proposes a new attention mechanism, Deformable Attention, for processing image feature maps and improving model performance. This idea is widely used in improved versions of DETR to address DETR's slow training convergence and the limited spatial resolution of its features.
The deformable attention mechanism in Deformable DETR is an innovative approach that reduces computational complexity by performing the attention calculation over only a few sampling points. However, this approach also has drawbacks and limitations. First, deformable attention operates over the entire feature map and requires an additional offset prediction for every sampling point, which raises the computational cost. Second, it may cause information loss: although the deformable attention mechanism reduces computational complexity, if the sampled points are too few or do not cover the target region or critical information well, ignoring other important information can degrade detection or recognition performance.
Disclosure of Invention
Purpose of the application: to provide a sparse attention target detection method based on a DETR improved model that has high computational efficiency and good detection performance.
The technical scheme is as follows: the sparse attention target detection method based on the DETR improved model provided by the application comprises the following steps:
(1) Inputting the training data set into the backbone network Swin Transformer V1 and extracting three layers of feature maps C3, C4 and C5;
(2) Converting the three feature maps C3, C4 and C5 into four feature layers through a multi-scale feature fusion module, fusing the four feature layers and adding relative position coding information to obtain a multi-scale fusion feature map;
(3) Taking the multi-scale fusion feature map as the input of the encoder, the encoder being formed by stacking several encoder layers, each layer mainly comprising an instance-dependent sparse attention module, a gated linear control unit, and residual connection and normalization operations between them; the input feature sequence is processed in turn by the instance-dependent sparse attention module, a residual connection and normalization operation and the gated linear control unit, and finally passes through one more residual connection and normalization operation to obtain the output of one encoder layer; the encoder is traversed several times to obtain the encoder output feature map;
(4) Taking the encoder output feature map as the input of the decoder, the decoder being formed by stacking several decoder layers, each layer mainly comprising a multi-head self-attention module, an instance-dependent sparse attention module, a gated linear control unit, and residual connection and normalization operations between them; the feature sequence with position coding is fed into the multi-head self-attention module, whose output, after a residual connection and normalization operation, is processed in turn by the instance-dependent sparse attention module, a residual connection and normalization operation and the gated linear control unit, and finally passes through one more residual connection and normalization operation to obtain the output of one decoder layer; the decoder is traversed several times to obtain the decoder output feature vector;
(5) Predicting the class and bounding box of the decoder output feature vector through a linear layer and a multi-layer perceptron respectively to obtain a predicted target set, wherein each target contains class and bounding-box coordinate information;
(6) Carrying out the overall network loss calculation between the predicted target set and the real target set, and optimizing the model through back propagation;
(7) Repeating steps (1) to (6) several times to obtain a trained target detection model.
Further, step (1) includes:
the original input feature map size is H multiplied by W multiplied by 3, H represents the height of the image, and W represents the width of the image;
three layers of feature maps C3, C4 and C5 are extracted through the backbone network Swin Transformer V1; their sizes decrease successively, with the spatial resolution halved from each level to the next.
Further, step (2) includes:
the three feature maps C3, C4 and C5 are each projected by a 1×1 convolution with stride 1 to feature maps of a common channel dimension, giving three feature layers; the last feature map C5 is additionally transformed by a 3×3 convolution with stride 1 to obtain a fourth feature layer;
coordinate information is added to the four feature layers: relative position coordinates are introduced to distinguish the position information of feature points at different levels, the absolute coordinates of the feature points of each layer being converted into relative coordinates by a position embedding method; the relative coordinates of the feature points of each layer are then concatenated with the scale information to obtain the multi-scale fusion feature map.
Further, the relative position coding includes a learnable scale-level embedding and a learnable position embedding.
Further, in step (3), the instance-dependent sparse attention module performs the following operations:
firstly, the multi-scale fusion feature map is partitioned into patches to obtain a feature vector sequence X = {x_1, x_2, ..., x_N}, X ∈ R^(N×d), where N denotes the length of the feature sequence, x_i ∈ R^d denotes the i-th feature vector in the sequence, R denotes the real field and d is the feature dimension; each feature vector is linearly transformed by three linear projections Q = XW_Q, K = XW_K and V = XW_V to obtain the query vectors Q = {q_1, q_2, ..., q_N}, the key vectors K = {k_1, k_2, ..., k_N} and the value vectors V = {v_1, v_2, ..., v_N}, where q_i, k_i, v_i ∈ R^d and W_Q, W_K, W_V ∈ R^(d×d) are learnable parameter matrices optimized by back propagation during training, so that the model can adaptively learn a representation of the input sequence;
then, a lightweight connection prediction module estimates a connection score between each pair of feature vectors, which reflects the semantic relevance of the two feature vectors; the connection prediction module performs the following operations:
the low-rank attention weight is calculated as
A^down = softmax( Q (W_down K)^T / √d )
where a low-rank approximation of the attention matrix is computed from the query Q and the down-projected key W_down K; W_down ∈ R^(n_down×N) is a learnable parameter matrix, n_down denotes the reduced dimension, N denotes the length of the input feature sequence, W_down K denotes projecting the token dimension of K down to the lower dimension, d denotes the feature dimension, softmax denotes the normalization function and (·)^T denotes matrix transposition;
the low-rank attention weights are sparsified with a threshold:
Ā^down_ij = A^down_ij if A^down_ij ≥ τ, and Ā^down_ij = 0 otherwise,
where A^down_ij denotes the low-rank attention weight between the pair of feature vectors i and j, and τ denotes the threshold; in the low-rank attention sparsification, values smaller than τ are discarded directly and zero values are not stored;
an up-projected sparse connection mask M is generated by the connection mask predictor:
M = 1[ Top-k( Ā^down W_up ) ]
where the connection mask predictor performs a sparse matrix multiplication with the up-projection matrix W_up, i.e. W_up is a learnable parameter matrix; the Top-k algorithm selects a limited number of similarity scores, i.e. the k most relevant feature vectors are selected as attention objects instead of computing all possible pairs; a binarization operation then yields the up-projected sparse connection mask M, where 1[·] denotes binarization, mapping elements of the selected subset to one and all other elements to zero; in the connection mask predictor, binarization is applied to the connection score of each pair of tokens, the score representing their relevance for attention;
then, under the guidance of the sparse connection mask M, the algorithm only computes the non-zero elements of the full-rank attention weight A, i.e. for each pair of feature vectors satisfying M_ij = 1, the two vectors are regarded as similar and the attention matching calculation is performed; the sparse full-rank attention matrix is
A_ij = softmax( q_i k_j^T / √d ) for M_ij = 1;
finally, for each query vector i, the corresponding output vector is computed as
y_i = Σ_j Â_ij v_j, i, j ∈ [1, N],
where Â_ij = 0 when M_ij ≠ 1 and Â_ij = A_ij otherwise, N is the length of the feature sequence, v_j denotes the j-th element of the value vectors V = {v_1, v_2, ..., v_N}, and Â_ij denotes the attention-weighted result between feature vectors i and j; the final output of the whole instance-dependent sparse attention module is Y = {y_1, y_2, ..., y_N}.
further, in step (3), the feature sequence obtained by the sparse attention module of the dependent instanceObtaining input data x through residual error connection and normalization operation;
the gating linear control unit performs the following operations:
the input data x is subjected to linear transformation:
h=W·x+b 1
wherein h represents an intermediate vector, and h is divided into two equal parts, namely a and b; w represents matrix multiplication, b 1 Representing offset term addition;
multiplying the input data X by a Bernoulli distribution Bernoulli (φ (X)) by a GELU activation function, subjecting the input data X to a standard normal distribution N (0, 1) by φ (X) =P (X.ltoreq.x), and calculating to obtain a gating vector g:
g=σ(W g ·x+b g )
wherein σ () represents a GELU activation function; w (W) g Is the weight of the gating mechanism, b g Is an offset term of the gating mechanism;
by multiplying the gating vector g with b, a gated nonlinear section h is obtained gated =g++b, ++represents product;
finally, the gating part and the linear part are added to obtain an output GLU (x) =h of the gating linear control unit gated +a。
Further, in step (4), the feature sequence with position coding can be regarded as a combination of a series of position embedding and coded image features, and the multi-head self-attention module performs the following operations:
for each position, calculating a Query, a Key and a Value through three different linear transformations, wherein the transformations are fully connected, the Query represents the characteristics of the current position, and the Key and the Value represent the characteristics of other positions; calculating attention scores for the Query of each position and keys of other positions, wherein the attention scores reflect the similarity between the current position and the other positions; finally, attention weight calculation and weighted sum are carried out, attention scores are scaled, and then attention weights are obtained through a softmax function, and the attention degree of each position to other positions is determined by the weights; and weighting and summing the Value by using the attention weight to obtain a self-attention output result.
Further, in step (4), the query matrix Q in the instance-dependent sparse attention module is obtained by a linear transformation of the encoder output feature sequence, while the key matrix K and the value matrix V are obtained by linear transformations of the output of the multi-head self-attention module after residual connection and normalization.
Further, in step (6), the relative coordinate information of the predicted target bounding boxes is decoded and mapped back to the original image size; a loss function is then defined, comprising a target class loss and a target box coordinate position loss; the two losses are combined by weighted summation, and the matching between the predicted box information and the ground-truth information is computed with the Hungarian algorithm.
Further, the target class loss measures the difference between the predicted target class and the ground-truth target class; for each predicted target, the class loss is calculated as
L_cls = -(1/N_pos) · Σ_i 1_{i∈pos} · ŷ_i · log(p_i)
where N_pos is the number of positive samples, pos is the index set of positive samples, 1_{i∈pos} is the indicator function, ŷ_i is the one-hot encoding of the corresponding ground-truth class label, and p_i is the predicted class probability of the i-th target box;
the target box coordinate position loss measures the difference between the predicted and ground-truth box coordinates; for each predicted target box, the coordinate position loss is calculated as
L_box = (1/N_pos) · Σ_i 1_{i∈pos} · SmoothL1(b_i - b̂_i)
where b_i is the predicted coordinate information of the target box, N_pos is the number of positive samples, pos is the index set of positive samples, 1_{i∈pos} is the indicator function, SmoothL1 is the smooth L1 loss function used to balance the effect of large and small deviations, and b̂_i is the coordinate information of the corresponding ground-truth target box.
The beneficial effects are that: compared with the prior art, the application has the following remarkable advantages:
(1) Deformable attention calculation requires an offset prediction for every sampling point, which increases computational complexity and memory consumption; the instance-dependent sparse attention of the application requires no offset prediction and computes the attention weights directly from instance features, which reduces the computational cost and improves computational efficiency.
(2) Deformable attention calculations may cause the attention position to deviate from the effective area, thereby degrading performance; the application uses a connection prediction mask module to limit the attention position, so that the attention position can be adaptively selected, the quality of the characteristic representation is improved, invalid areas are avoided, and the detection performance is better.
Drawings
FIG. 1 is a block flow diagram of a method for sparse attention target detection based on a DETR improved model provided by an embodiment of the application;
FIG. 2 is a diagram of a network architecture for a model in an embodiment of the present application;
FIG. 3 is a network block diagram of an instance-dependent sparse attention module in an embodiment of the application.
Detailed Description
The application is further described below with reference to the accompanying drawings.
As shown in fig. 1, an embodiment of the present application provides a sparse attention target detection method based on a DETR improvement model, including the steps of:
(1) Inputting the training data set into a backbone network Swin Transformer V1, and extracting three layers of feature graphs C3, C4 and C5;
referring to fig. 2, the original input feature map size is H×W×3, where H denotes the height of the image and W its width; the three layers of feature maps C3, C4 and C5 are extracted through the backbone network Swin Transformer V1, with the spatial resolution halved from each level to the next.
The three-layer feature maps C3, C4 and C5 serve as inputs for a subsequent multi-scale feature fusion module.
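For orientation, the following minimal PyTorch sketch illustrates the tensor shapes expected from step (1). The DummyBackbone below is a hypothetical stand-in built from plain strided convolutions rather than the actual Swin Transformer V1, and the channel counts are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DummyBackbone(nn.Module):
    """Hypothetical stand-in for the Swin Transformer V1 backbone: it only
    reproduces the output strides (8, 16, 32) of the three feature levels."""
    def __init__(self, channels=(192, 384, 768)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Conv2d(3 if i == 0 else channels[i - 1], c, kernel_size=3,
                      stride=8 if i == 0 else 2, padding=1)
            for i, c in enumerate(channels)
        ])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # [C3, C4, C5]

images = torch.randn(2, 3, 512, 512)          # batch of H x W x 3 images
c3, c4, c5 = DummyBackbone()(images)
print(c3.shape, c4.shape, c5.shape)           # strides 8, 16 and 32 w.r.t. the input
```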
(2) Three-layer feature images C3, C4 and C5 are converted into four feature layers through a multi-scale feature fusion module, the four feature layers are fused, corresponding relative position coding information is added, and a multi-scale fusion feature image is obtained;
referring to fig. 2, first, three-layer feature maps C3, C4, and C5 are sequentially transformed into a size by three convolutions with 1×1 step size of 1And->The feature map of (2) is a three-layer high-resolution map of s1, s2 and s 3. The final layer of characteristic diagram C5 is transformed into a convolution with a convolution kernel of 3 multiplied by 3 and a step length of 1And obtaining an s4 low-resolution graph as a fourth feature layer.
Then, adding coordinate information to four feature layers, and introducing relative position coordinates for distinguishing the position information of feature points of different layers in multi-scale feature fusion, wherein the position embedding method is to convert the absolute coordinates of the feature points of each layer into relative coordinates, namely the offset of the points relative to the center of an input image;
the relative position coding includes a learnable scale-level embedding and a learnable position embedding;
and splicing the relative coordinates of the feature points of each layer and the scale information to obtain a multi-scale fusion feature map.
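A minimal PyTorch sketch of this fusion step is shown below. The model dimension d_model, the input channel counts and the use of a learnable per-level embedding to carry the scale information are assumptions; the conversion of absolute to relative coordinates described above is omitted for brevity.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Sketch of step (2): 1x1 convolutions project C3-C5 to a common dimension,
    a 3x3 convolution on C5 adds a fourth level, and all levels are flattened and
    concatenated into one token sequence with a scale-level embedding added."""
    def __init__(self, in_channels=(192, 384, 768), d_model=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, d_model, kernel_size=1) for c in in_channels])
        self.extra = nn.Conv2d(in_channels[-1], d_model, kernel_size=3, stride=1, padding=1)
        self.level_embed = nn.Parameter(torch.zeros(4, d_model))  # learnable scale-level embedding

    def forward(self, c3, c4, c5):
        levels = [conv(x) for conv, x in zip(self.lateral, (c3, c4, c5))]
        levels.append(self.extra(c5))                      # fourth feature layer
        tokens = []
        for lvl, feat in enumerate(levels):
            t = feat.flatten(2).transpose(1, 2)            # (B, H*W, d_model)
            tokens.append(t + self.level_embed[lvl])       # attach scale information
        return torch.cat(tokens, dim=1)                    # multi-scale fused feature sequence
```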
(3) Taking the multi-scale fusion feature map as the input of the encoder, six encoder layers are stacked, each layer mainly comprising an instance-dependent sparse attention module, a gated linear control unit, and residual connection and normalization operations (Add & Norm) between them; the input fused feature sequence is processed in turn by the instance-dependent sparse attention module, a residual connection and normalization operation and the gated linear control unit, and finally passes through one more residual connection and normalization operation to obtain the output of one encoder layer; the encoder is traversed six times to obtain the encoder output feature map;
the encoder output feature map (output feature vector) is then used as the query input to the decoder.
In combination with fig. 2 and fig. 3, in the encoder part the instance-dependent sparse attention module mainly consists of two parts, a connection prediction mask module M and a sparse attention module: the connection prediction mask module M predicts the connection probability between each pair of instances and stores the connection score of each pair; the sparse attention module sparsifies the attention weights according to the connection scores of each pair of instances, retaining only a subset of important connections.
The connection prediction mask module M performs the following operations:
firstly, the multi-scale fusion feature map obtained in step (2) is taken as the input feature map of the encoder and partitioned into patches to obtain a feature vector sequence X = {x_1, x_2, ..., x_N}, X ∈ R^(N×d), where N denotes the length of the feature sequence, x_i ∈ R^d denotes the i-th feature vector in the sequence, R denotes the real field and d is the feature dimension; each feature vector is linearly transformed by three linear projections Q = XW_Q, K = XW_K and V = XW_V to obtain the query vectors Q = {q_1, q_2, ..., q_N}, the key vectors K = {k_1, k_2, ..., k_N} and the value vectors V = {v_1, v_2, ..., v_N}, where q_i, k_i, v_i ∈ R^d and W_Q, W_K, W_V ∈ R^(d×d) are learnable parameter matrices optimized by back propagation during training, so that the model can adaptively learn a representation of the input sequence;
then, a lightweight connection prediction module estimates a connection score between each pair of feature vectors, which reflects the semantic relevance of the two feature vectors; the connection prediction module performs the following operations:
the low-rank attention weight is calculated as
A^down = softmax( Q (W_down K)^T / √d )
where a low-rank approximation of the attention matrix is computed from the query Q and the down-projected key W_down K; W_down ∈ R^(n_down×N) is a learnable parameter matrix, n_down denotes the reduced dimension and is set to 32, N denotes the length of the feature sequence, W_down K denotes projecting the token dimension of K down to the lower dimension, d denotes the feature dimension, softmax denotes the normalization function and (·)^T denotes matrix transposition;
the low-rank attention weights are sparsified with a threshold:
Ā^down_ij = A^down_ij if A^down_ij ≥ τ, and Ā^down_ij = 0 otherwise,
where A^down_ij denotes the low-rank attention weight between the pair of feature vectors i and j, and τ denotes the threshold, set to 0.05; in the low-rank attention sparsification, values smaller than 0.05 are discarded directly and zero values are not stored;
secondly, an up-projected sparse connection mask M is generated by the connection mask predictor:
M = 1[ Top-k( Ā^down W_up ) ]
where the connection mask predictor performs a sparse matrix multiplication with the up-projection matrix W_up, i.e. W_up is a learnable parameter matrix; the Top-k algorithm reduces the computational effort and memory requirements by selecting a limited number of similarity scores, i.e. the k most relevant feature vectors are selected as attention objects instead of computing all possible pairs; a binarization operation then yields the up-projected sparse connection mask M, where 1[·] denotes binarization, mapping elements of the selected feature vector subset to one and all other elements to zero; in the connection mask predictor, binarization is applied to the connection score of each pair of tokens, the score representing their relevance for attention. With this processing, only the connections with the highest attention relevance are considered in subsequent calculations, further reducing the computation and memory usage. In summary, the connection prediction mask module M stores the connection scores between each pair of feature vectors: the score is one for pairs with a relevant dependency and zero otherwise.
The sparse attention module performs the following operations:
Under the guidance of the sparse connection mask M, the algorithm only computes the non-zero elements of the full-rank attention weight A, i.e. for a pair of feature vectors satisfying M_ij = 1, the two vectors are regarded as having a dependency relationship and the attention matching calculation is performed; the sparse full-rank attention matrix is computed as
A_ij = softmax( q_i k_j^T / √d ) for M_ij = 1;
finally, for each query vector i, the corresponding output vector is computed as
y_i = Σ_j Â_ij v_j, i, j ∈ [1, N],
where Â_ij = 0 when M_ij ≠ 1 and Â_ij = A_ij otherwise, N is the length of the feature sequence, v_j denotes the j-th element of the value vectors V = {v_1, v_2, ..., v_N}, and Â_ij denotes the attention-weighted result between feature vectors i and j; the final output of the whole instance-dependent sparse attention module is Y = {y_1, y_2, ..., y_N}.
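The computation described above can be sketched in PyTorch as follows. This is a dense reference implementation written for clarity: an actual implementation would evaluate the masked full-rank attention with sparse kernels so that only the entries selected by M are computed. The module sizes, the value of k and the parameter initialization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceDependentSparseAttention(nn.Module):
    """Sketch: low-rank attention, threshold sparsification, Top-k connection mask,
    then full-rank attention restricted to the masked positions."""
    def __init__(self, d_model, seq_len, n_down=32, tau=0.05, k=64):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        # W_down projects the N keys down to n_down summary tokens; W_up projects back up.
        self.w_down = nn.Parameter(torch.randn(n_down, seq_len) * 0.02)
        self.w_up = nn.Parameter(torch.randn(n_down, seq_len) * 0.02)
        self.tau, self.k, self.scale = tau, k, d_model ** -0.5

    def forward(self, x):                                   # x: (B, N, d), N == seq_len, k <= N
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        # 1) low-rank attention: softmax(Q (W_down K)^T / sqrt(d)) -> (B, N, n_down)
        k_down = torch.einsum('mn,bnd->bmd', self.w_down, k)
        a_down = F.softmax(q @ k_down.transpose(1, 2) * self.scale, dim=-1)
        # 2) threshold sparsification: scores below tau are dropped
        a_down = torch.where(a_down >= self.tau, a_down, torch.zeros_like(a_down))
        # 3) connection mask: project up to (B, N, N), keep the top-k scores per query, binarize
        scores = a_down @ self.w_up
        topk = scores.topk(self.k, dim=-1).indices
        mask = torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, topk, True)
        # 4) full-rank attention evaluated only where the mask is one
        attn = q @ k.transpose(1, 2) * self.scale
        attn = F.softmax(attn.masked_fill(~mask, float('-inf')), dim=-1)
        attn = torch.nan_to_num(attn)                       # guard against rows with no connection
        return attn @ v                                     # Y = {y_1, ..., y_N}
```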
The feature sequence Y obtained by the instance-dependent sparse attention module passes through a residual connection and normalization operation to obtain the input data x, which is fed into the gated linear control unit;
the gated linear control unit performs the following operations:
the input data x is first linearly transformed (matrix multiplication by W and addition of the bias b_1):
h = W·x + b_1
where h denotes the intermediate vector, which is split into two equal parts a and b;
the gating mechanism then computes a gating vector g with the GELU activation function, which multiplies its input x by the Bernoulli distribution Bernoulli(Φ(x)), where Φ(x) = P(X ≤ x) and X follows the standard normal distribution N(0, 1); the gating vector g controls the filtering of the information in the linear part b:
g = σ(W_g·x + b_g)
where σ(·) denotes the GELU activation function, W_g is the weight of the gating mechanism and b_g is the bias term of the gating mechanism;
next, the gated non-linear part is obtained by multiplying the gating vector g element-wise with b: h_gated = g ⊙ b, where ⊙ denotes the element-wise product;
finally, the gated part and the linear part are added to give the output of the gated linear control unit, GLU(x) = h_gated + a, which then passes through a residual connection and normalization operation to give the output of one encoder layer.
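A minimal sketch of the gated linear control unit described above, assuming the split projection and the gate are implemented with standard linear layers:

```python
import torch.nn as nn
import torch.nn.functional as F

class GatedLinearControlUnit(nn.Module):
    """Sketch: h = W x + b1 is split into a and b, a GELU gate g = GELU(W_g x + b_g)
    modulates b, and the output is g * b + a."""
    def __init__(self, d_model):
        super().__init__()
        self.linear = nn.Linear(d_model, 2 * d_model)   # produces h, split into a and b
        self.gate = nn.Linear(d_model, d_model)         # produces the gating vector g

    def forward(self, x):
        a, b = self.linear(x).chunk(2, dim=-1)
        g = F.gelu(self.gate(x))
        return g * b + a                                # gated non-linear part plus linear part
```

In each encoder layer this unit plays the role the feed-forward network usually plays in a Transformer layer, and it is followed by the Add & Norm operation described above.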
(4) Taking the encoder output feature map as the input of the decoder, six decoder layers are stacked, each layer mainly comprising a Multi-Head Self-Attention module, an instance-dependent sparse attention module, a gated linear control unit, and residual connection and normalization operations between them; the feature sequence with position coding is fed into the Multi-Head Self-Attention module, whose output, after a residual connection and normalization operation, is processed in turn by the instance-dependent sparse attention module, a residual connection and normalization operation and the gated linear control unit, and finally passes through one more residual connection and normalization operation to obtain the output of one decoder layer; the decoder is traversed six times to obtain the decoder output feature vector;
In connection with fig. 2, in the decoder part, the feature sequence with position coding, which can be regarded as a combination of a series of position embeddings and encoded image features, is first fed into the Multi-Head Self-Attention module.
The Multi-Head Self-Attention module performs the following operations:
for each position, calculating a Query, a Key and a Value through three different linear transformations, wherein the transformations are fully connected, the Query represents the characteristics of the current position, and the Key and the Value represent the characteristics of other positions; calculating attention scores for the Query of each position and keys of other positions, wherein the attention scores reflect the similarity between the current position and the other positions; finally, attention weight calculation and weighted sum are carried out, attention scores are scaled, and then attention weights are obtained through a softmax function, and the attention degree of each position to other positions is determined by the weights; and weighting and summing the Value by using the attention weight to obtain a self-attention output result. This will aggregate the information of the other locations to generate a new representation at the current location.
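For reference, this step can be sketched with PyTorch's built-in multi-head attention module as follows; the dimensions, the number of object queries and the way position encodings are added to the Query and Key are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_queries = 256, 8, 300
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

queries = torch.randn(2, n_queries, d_model)     # decoder feature sequence
query_pos = torch.randn(2, n_queries, d_model)   # position encodings

q = k = queries + query_pos                      # Query and Key carry position information
out, attn_weights = self_attn(q, k, value=queries)
# 'out' then passes through Add & Norm before the instance-dependent sparse attention
```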
After the self-attention calculation result passes through a residual connection and normalization (Add & Norm) operation, it serves as an input to the instance-dependent sparse attention module in the decoder: it provides the key and value of the sparse attention, while the query input is the output feature vector of the preceding encoder. After the instance-dependent sparse attention calculation, the subsequent steps of this decoder layer are identical to those of the encoder layer.
The instance-dependent sparse attention in step (4) differs from that in step (3) in that:
the query matrix Q in the instance-dependent sparse attention module of step (4) is obtained by a linear transformation of the encoder output feature sequence, while the key matrix K and the value matrix V are obtained by linear transformations of the output of the Multi-Head Self-Attention module after residual connection and normalization.
The subsequent operations are the same as in step (3): the output of one decoder layer is obtained through the gated linear control unit and a residual connection and normalization (Add & Norm) operation, and the decoder output feature vector is obtained after six traversals.
(5) Predicting the class and bounding box of the decoder output feature vector obtained in step (4) through a linear layer and a multi-layer perceptron respectively to obtain a predicted target set, wherein each target contains class and bounding-box coordinate information;
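A minimal sketch of these two prediction heads; d_model, the number of classes and the MLP depth are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, num_classes, num_queries = 256, 91, 300

class_head = nn.Linear(d_model, num_classes + 1)          # +1 for the "no object" class
bbox_head = nn.Sequential(                                # 3-layer MLP for box regression
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, 4), nn.Sigmoid(),                  # (cx, cy, w, h) in relative coordinates
)

decoder_out = torch.randn(2, num_queries, d_model)        # decoder output feature vectors
logits = class_head(decoder_out)                          # (2, 300, num_classes + 1)
boxes = bbox_head(decoder_out)                            # (2, 300, 4)
```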
(6) Carrying out network overall loss calculation between the predicted target set and the real target set, and optimizing a model through back propagation;
The predicted information is decoded: the predicted target bounding-box coordinates are in relative form and need to be mapped back to the original image size by decoding; a loss function is then defined to guide the learning process of the model, comprising two parts: a target class loss and a target box coordinate position loss; the class loss and the box coordinate position loss are combined by weighted summation, and the matching between predicted and ground-truth box information is computed with the Hungarian algorithm.
The target class loss measures the difference between the predicted target class and the ground-truth target class; for each predicted target, the class loss is calculated as
L_cls = -(1/N_pos) · Σ_i 1_{i∈pos} · ŷ_i · log(p_i)
where N_pos is the number of positive samples, pos is the index set of positive samples, 1_{i∈pos} is the indicator function, ŷ_i is the one-hot encoding of the corresponding ground-truth class label, and p_i is the predicted class probability of the i-th target box.
The target box coordinate position loss measures the difference between the predicted and ground-truth box coordinates; for each predicted target box, the coordinate position loss is calculated as
L_box = (1/N_pos) · Σ_i 1_{i∈pos} · SmoothL1(b_i - b̂_i)
where b_i is the predicted coordinate information of the target box, N_pos is the number of positive samples, pos is the index set of positive samples, 1_{i∈pos} is the indicator function, SmoothL1 is the smooth L1 loss function used to balance the effect of large and small deviations, and b̂_i is the coordinate information of the corresponding ground-truth target box.
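The following simplified sketch illustrates the matching and loss computation of step (6) for a single image, using SciPy's Hungarian solver. The matching cost, the box-loss weight and the omission of the "no object" term for unmatched queries are simplifying assumptions.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def detection_loss(logits, boxes, gt_labels, gt_boxes, box_weight=5.0):
    """logits: (Q, C+1), boxes: (Q, 4); gt_labels: (G,) long, gt_boxes: (G, 4); assumes G >= 1."""
    prob = logits.softmax(-1)
    # Matching cost: negative probability of the true class plus L1 box distance
    cost = -prob[:, gt_labels] + torch.cdist(boxes, gt_boxes, p=1)      # (Q, G)
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    pred_idx, gt_idx = torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)

    n_pos = len(gt_idx)                                                 # number of positive samples
    cls_loss = F.cross_entropy(logits[pred_idx], gt_labels[gt_idx], reduction='sum') / n_pos
    box_loss = F.smooth_l1_loss(boxes[pred_idx], gt_boxes[gt_idx], reduction='sum') / n_pos
    return cls_loss + box_weight * box_loss                             # weighted summation
```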
(7) Repeating the steps (1) to (6) for a plurality of times to obtain a trained target detection model, and using the trained target detection model for target detection.
The improved model of the application utilizes an instance-dependent sparse attention mechanism that focuses on the details of the relevant region of each object of interest while ignoring background and irrelevant regions. This sparsity improves computational efficiency while making the model more robust to challenges such as occlusion and deformation. By introducing an instance-dependent mechanism, the model supports individualized processing of each target instance, selectively attending to sampling points according to the feature information of each instance, which improves the accuracy of target detection.
The application is an improvement on the Deformable DETR framework, can better adapt to target detection tasks of different scenes and scales, has strong generalization capability, and is a general target detection model. The target detection method provided by the application has wide application scenarios and has application prospects in fields such as object recognition, video surveillance and autonomous driving.

Claims (10)

1. A DETR improvement model-based sparse attention target detection method, comprising:
(1) Inputting the training data set into the backbone network Swin Transformer V1 and extracting three layers of feature maps C3, C4 and C5;
(2) Converting the three feature maps C3, C4 and C5 into four feature layers through a multi-scale feature fusion module, fusing the four feature layers and adding relative position coding information to obtain a multi-scale fusion feature map;
(3) Taking the multi-scale fusion feature map as the input of the encoder, the encoder being formed by stacking several encoder layers, each layer mainly comprising an instance-dependent sparse attention module, a gated linear control unit, and residual connection and normalization operations between them; the input feature sequence is processed in turn by the instance-dependent sparse attention module, a residual connection and normalization operation and the gated linear control unit, and finally passes through one more residual connection and normalization operation to obtain the output of one encoder layer; the encoder is traversed several times to obtain the encoder output feature map;
(4) Taking the encoder output feature map as the input of the decoder, the decoder being formed by stacking several decoder layers, each layer mainly comprising a multi-head self-attention module, an instance-dependent sparse attention module, a gated linear control unit, and residual connection and normalization operations between them; the feature sequence with position coding is fed into the multi-head self-attention module, whose output, after a residual connection and normalization operation, is processed in turn by the instance-dependent sparse attention module, a residual connection and normalization operation and the gated linear control unit, and finally passes through one more residual connection and normalization operation to obtain the output of one decoder layer; the decoder is traversed several times to obtain the decoder output feature vector;
(5) Predicting the class and bounding box of the decoder output feature vector through a linear layer and a multi-layer perceptron respectively to obtain a predicted target set, wherein each target contains class and bounding-box coordinate information;
(6) Carrying out the overall network loss calculation between the predicted target set and the real target set, and optimizing the model through back propagation;
(7) Repeating steps (1) to (6) several times to obtain a trained target detection model.
2. The DETR improvement model based sparse attention target detection method of claim 1, wherein step (1) comprises:
the original input feature map size is H multiplied by W multiplied by 3, H represents the height of the image, and W represents the width of the image;
three layers of feature maps C3, C4 and C5 are extracted through the backbone network Swin Transformer V1; their sizes decrease successively, with the spatial resolution halved from each level to the next.
3. The DETR improvement model based sparse attention target detection method of claim 2, wherein step (2) comprises:
the three feature maps C3, C4 and C5 are each projected by a 1×1 convolution with stride 1 to feature maps of a common channel dimension, giving three feature layers; the last feature map C5 is additionally transformed by a 3×3 convolution with stride 1 to obtain a fourth feature layer;
coordinate information is added to the four feature layers: relative position coordinates are introduced to distinguish the position information of feature points at different levels, the absolute coordinates of the feature points of each layer being converted into relative coordinates by a position embedding method; the relative coordinates of the feature points of each layer are then concatenated with the scale information to obtain the multi-scale fusion feature map.
4. The DETR improvement model-based sparse attention target detection method according to claim 3, wherein the relative position coding comprises a learnable scale-level embedding and a learnable position embedding.
5. The DETR improvement model-based sparse attention target detection method of claim 1, wherein in step (3), the instance-dependent sparse attention module performs the following operations:
firstly, the multi-scale fusion feature map is partitioned into patches to obtain a feature vector sequence X = {x_1, x_2, ..., x_N}, X ∈ R^(N×d), where N denotes the length of the feature sequence, x_i ∈ R^d denotes the i-th feature vector in the sequence, R denotes the real field and d is the feature dimension; each feature vector is linearly transformed by three linear projections Q = XW_Q, K = XW_K and V = XW_V to obtain the query vectors Q = {q_1, q_2, ..., q_N}, the key vectors K = {k_1, k_2, ..., k_N} and the value vectors V = {v_1, v_2, ..., v_N}, where q_i, k_i, v_i ∈ R^d and W_Q, W_K, W_V ∈ R^(d×d) are learnable parameter matrices optimized by back propagation during training, so that the model can adaptively learn a representation of the input sequence;
then, a lightweight connection prediction module estimates a connection score between each pair of feature vectors, which reflects the semantic relevance of the two feature vectors; the connection prediction module performs the following operations:
the low-rank attention weight is calculated as
A^down = softmax( Q (W_down K)^T / √d )
where a low-rank approximation of the attention matrix is computed from the query Q and the down-projected key W_down K; W_down ∈ R^(n_down×N) is a learnable parameter matrix, n_down denotes the reduced dimension, N denotes the length of the input feature sequence, W_down K denotes projecting the token dimension of K down to the lower dimension, d denotes the feature dimension, softmax denotes the normalization function and (·)^T denotes matrix transposition;
the low-rank attention weights are sparsified with a threshold:
Ā^down_ij = A^down_ij if A^down_ij ≥ τ, and Ā^down_ij = 0 otherwise,
where A^down_ij denotes the low-rank attention weight between the pair of feature vectors i and j, and τ denotes the threshold; in the low-rank attention sparsification, values smaller than τ are discarded directly and zero values are not stored;
an up-projected sparse connection mask M is generated by the connection mask predictor:
M = 1[ Top-k( Ā^down W_up ) ]
where the connection mask predictor performs a sparse matrix multiplication with the up-projection matrix W_up, i.e. W_up is a learnable parameter matrix; the Top-k algorithm selects a limited number of similarity scores, i.e. the k most relevant feature vectors are selected as attention objects instead of computing all possible pairs; a binarization operation then yields the up-projected sparse connection mask M, where 1[·] denotes binarization, mapping elements of the selected subset to one and all other elements to zero; in the connection mask predictor, binarization is applied to the connection score of each pair of tokens, the score representing their relevance for attention;
then, under the guidance of the sparse connection mask M, the algorithm only computes the non-zero elements of the full-rank attention weight A, i.e. for each pair of feature vectors satisfying M_ij = 1, the two vectors are regarded as similar and the attention matching calculation is performed; the sparse full-rank attention matrix is
A_ij = softmax( q_i k_j^T / √d ) for M_ij = 1;
finally, for each query vector i, the corresponding output vector is computed as
y_i = Σ_j Â_ij v_j, i, j ∈ [1, N],
where Â_ij = 0 when M_ij ≠ 1 and Â_ij = A_ij otherwise, N is the length of the feature sequence, v_j denotes the j-th element of the value vectors V = {v_1, v_2, ..., v_N}, and Â_ij denotes the attention-weighted result between feature vectors i and j; the final output of the whole instance-dependent sparse attention module is Y = {y_1, y_2, ..., y_N}.
6. The DETR improvement model-based sparse attention target detection method according to claim 5, wherein in step (3), the feature sequence Y obtained by the instance-dependent sparse attention module passes through a residual connection and normalization operation to obtain the input data x;
the gated linear control unit performs the following operations:
the input data x is first linearly transformed:
h = W·x + b_1
where h denotes the intermediate vector, which is split into two equal parts a and b; W denotes the weight matrix of the linear transformation and b_1 the added bias term;
the gating vector g is computed with the GELU activation function, which multiplies its input x by the Bernoulli distribution Bernoulli(Φ(x)), where Φ(x) = P(X ≤ x) and X follows the standard normal distribution N(0, 1):
g = σ(W_g·x + b_g)
where σ(·) denotes the GELU activation function, W_g is the weight of the gating mechanism and b_g is the bias term of the gating mechanism;
the gated non-linear part is obtained by multiplying the gating vector g element-wise with b: h_gated = g ⊙ b, where ⊙ denotes the element-wise product;
finally, the gated part and the linear part are added to give the output of the gated linear control unit, GLU(x) = h_gated + a.
7. The DETR improvement model based sparse attention target detection method of claim 6, wherein in step (4), the feature sequence with position coding can be regarded as a combination of a series of position embedding and coded image features, and the multi-headed self-attention module performs the following operations:
for each position, calculating a Query, a Key and a Value through three different linear transformations, wherein the transformations are fully connected, the Query represents the characteristics of the current position, and the Key and the Value represent the characteristics of other positions; calculating attention scores for the Query of each position and keys of other positions, wherein the attention scores reflect the similarity between the current position and the other positions; finally, attention weight calculation and weighted sum are carried out, attention scores are scaled, and then attention weights are obtained through a softmax function, and the attention degree of each position to other positions is determined by the weights; and weighting and summing the Value by using the attention weight to obtain a self-attention output result.
8. The DETR improvement model-based sparse attention target detection method according to claim 7, wherein in step (4), the query matrix Q in the instance-dependent sparse attention module is obtained by a linear transformation of the encoder output feature sequence, while the key matrix K and the value matrix V are obtained by linear transformations of the output of the multi-head self-attention module after residual connection and normalization.
9. The DETR improvement model-based sparse attention target detection method according to claim 8, wherein in step (6), the relative coordinate information of the predicted target bounding boxes is decoded and mapped back to the original image size; a loss function is then defined, comprising a target class loss and a target box coordinate position loss; the two losses are combined by weighted summation, and the matching between the predicted box information and the ground-truth information is computed with the Hungarian algorithm.
10. The DETR improvement model-based sparse attention target detection method according to claim 9, wherein the target class loss measures the difference between the predicted target class and the ground-truth target class; for each predicted target, the class loss is calculated as
L_cls = -(1/N_pos) · Σ_i 1_{i∈pos} · ŷ_i · log(p_i)
where N_pos is the number of positive samples, pos is the index set of positive samples, 1_{i∈pos} is the indicator function, ŷ_i is the one-hot encoding of the corresponding ground-truth class label, and p_i is the predicted class probability of the i-th target box;
the target box coordinate position loss measures the difference between the predicted and ground-truth box coordinates; for each predicted target box, the coordinate position loss is calculated as
L_box = (1/N_pos) · Σ_i 1_{i∈pos} · SmoothL1(b_i - b̂_i)
where b_i is the predicted coordinate information of the target box, N_pos is the number of positive samples, pos is the index set of positive samples, 1_{i∈pos} is the indicator function, SmoothL1 is the smooth L1 loss function used to balance the effect of large and small deviations, and b̂_i is the coordinate information of the corresponding ground-truth target box.
CN202311122596.8A 2023-09-01 2023-09-01 Sparse attention target detection method based on DETR improved model Pending CN117152416A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311122596.8A CN117152416A (en) 2023-09-01 2023-09-01 Sparse attention target detection method based on DETR improved model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311122596.8A CN117152416A (en) 2023-09-01 2023-09-01 Sparse attention target detection method based on DETR improved model

Publications (1)

Publication Number Publication Date
CN117152416A true CN117152416A (en) 2023-12-01

Family

ID=88911432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311122596.8A Pending CN117152416A (en) 2023-09-01 2023-09-01 Sparse attention target detection method based on DETR improved model

Country Status (1)

Country Link
CN (1) CN117152416A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117370736A (en) * 2023-12-08 2024-01-09 暨南大学 Fine granularity emotion recognition method, electronic equipment and storage medium
CN117746233A (en) * 2023-12-08 2024-03-22 江苏海洋大学 Target lightweight detection method for unmanned cleaning ship in water area
CN117852974A (en) * 2024-03-04 2024-04-09 禾辰纵横信息技术有限公司 Online evaluation score assessment method based on artificial intelligence
CN117830874A (en) * 2024-03-05 2024-04-05 成都理工大学 Remote sensing target detection method under multi-scale fuzzy boundary condition
CN117830874B (en) * 2024-03-05 2024-05-07 成都理工大学 Remote sensing target detection method under multi-scale fuzzy boundary condition

Similar Documents

Publication Publication Date Title
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN111783705B (en) Character recognition method and system based on attention mechanism
Lu et al. Co-attending free-form regions and detections with multi-modal multiplicative feature embedding for visual question answering
CN117152416A (en) Sparse attention target detection method based on DETR improved model
CN112329760B (en) Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network
US20230186056A1 (en) Grabbing detection method based on rp-resnet
Xiao et al. Heterogeneous knowledge distillation for simultaneous infrared-visible image fusion and super-resolution
CN111259940B (en) Target detection method based on space attention map
CN110163286B (en) Hybrid pooling-based domain adaptive image classification method
CN112347888B (en) Remote sensing image scene classification method based on bi-directional feature iterative fusion
CN115496928B (en) Multi-modal image feature matching method based on multi-feature matching
CN113657124A (en) Multi-modal Mongolian Chinese translation method based on circulation common attention Transformer
CN114936623B (en) Aspect-level emotion analysis method integrating multi-mode data
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN113536925B (en) Crowd counting method based on attention guiding mechanism
CN110929080A (en) Optical remote sensing image retrieval method based on attention and generation countermeasure network
CN115222998B (en) Image classification method
CN114973222B (en) Scene text recognition method based on explicit supervision attention mechanism
CN115512096A (en) CNN and Transformer-based low-resolution image classification method and system
CN115131313A (en) Hyperspectral image change detection method and device based on Transformer
CN115909036A (en) Local-global adaptive guide enhanced vehicle weight identification method and system
CN115147601A (en) Urban street point cloud semantic segmentation method based on self-attention global feature enhancement
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
CN113642602B (en) Multi-label image classification method based on global and local label relation
CN117078956A (en) Point cloud classification segmentation network based on point cloud multi-scale parallel feature extraction and attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination