CN115063833A - Machine room personnel detection method based on image layered vision - Google Patents
- Publication number
- CN115063833A (application number CN202210529776.7A)
- Authority
- CN
- China
- Prior art keywords
- loss
- attention
- self
- rpn
- regression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V40/10: Recognition of human or animal bodies in image or video data, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
- G06V10/454: Local feature extraction with biologically inspired filters integrated into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/82: Image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention discloses a machine room personnel detection method based on image layered vision, and particularly relates to a cascade detector based on a shifted-window hierarchical vision Transformer. The application designs a practical self-attention method: the size of the input token is reduced through depthwise convolution to lower the complexity of the self-attention calculation, and a channel interaction module is adopted to compute the V value, solving the problem that local-window self-attention lacks direction perception and position information. Secondly, a balanced L1 loss is used and the losses of different stages are assigned weights in the total loss function, to address the imbalance between simple and difficult samples. Relative to the original Swin Transformer, the improved method raises the detection accuracy mAP@0.5 by 3.2 percentage points.
Description
Technical Field
The invention belongs to the field of images, and particularly relates to a machine room personnel detection method based on image layered vision.
Background
Machine room personnel detection is one of the key tasks in the field of computer vision, and convolutional neural networks (CNN) such as the R-CNN series and the YOLO series are widely applied to it. CNNs are very powerful at extracting locally valid information, but they lack the ability to extract long-range features from global information. Recently, Transformers with global computation have been widely applied to computer vision tasks and achieve remarkable results. The Transformer adopts a self-attention method to mine long-term dependencies in text. Many computer vision tasks at this stage use a self-attention (SA) mechanism to overcome the limitations of CNNs, obtaining relationships between distant elements more quickly. Therefore, it is very important to explore the potential of the Transformer in the field of machine room personnel detection.
The recently proposed Swin Transformer can easily adapt to feature pyramids and the like by constructing a hierarchical feature structure, and it reduces complexity from quadratic to linear through local-window self-attention computation. These properties allow the Swin Transformer to be used as a general model for a variety of visual tasks. However, the Swin Transformer still has three problems in machine room personnel detection: (1) performing self-attention within non-overlapping windows may still have high computational complexity; (2) self-attention calculations are performed within non-overlapping windows, which may lack direction perception and position information, i.e., cross-channel information may not be well captured; (3) in the training process, there is an imbalance between simple and difficult samples, and the gradient contribution of simple samples is too small when the gradient is back-propagated.
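The quadratic-to-linear complexity reduction mentioned above can be made concrete with the standard FLOPs estimates for global versus window attention. A sketch using the formulas from the Swin Transformer paper; the stage-1 feature-map shape is illustrative:

```python
# Hedged sketch: FLOPs estimates for global multi-head self-attention (MSA)
# versus window-based self-attention (W-MSA), illustrating why W-MSA scales
# linearly in h*w while global MSA scales quadratically.
# h, w: feature-map size in patches; C: channels; M: window size.
def msa_flops(h, w, C):
    return 4 * h * w * C**2 + 2 * (h * w)**2 * C

def wmsa_flops(h, w, C, M=7):
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

h, w, C = 56, 56, 96  # illustrative stage-1 resolution of Swin-T
print(msa_flops(h, w, C) / wmsa_flops(h, w, C))  # ~14x fewer FLOPs
```

Doubling h doubles the W-MSA cost (linear), while the global-MSA cost is dominated by the (h·w)² term.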
Disclosure of Invention
Based on the technical problems of the Swin Transformer, the present application provides a machine room personnel detection method, specifically a machine room personnel detection method based on an image layered vision Transformer.
a machine room personnel detection method based on image layered vision comprises the following steps:
the present application designs a practical self-attention module, adding two key designs to the standard window-based attention module (W-MSA): (1) a self-attention mechanism with smaller calculation amount is designed, and the calculation complexity of the self-attention mechanism is reduced. (2) Considering that convolutional layers aim to simulate local relations, the problem of lack of direction perception and position information of local window self-attention is solved by adding a channel interaction module, using parallel deep convolution (global computation) and local window-based self-attention computation. These two key points are integrated and an improved self-attention module is constructed herein. Details are described next. As shown in FIG. 2(c), by inputtingLinear projection derived queryWherein n is H × W. Will inputReshaped into a spatial vector (d) m H, W), the size of the input X is reduced by a deep convolution with a convolution kernel of s × s and a step size of s. The size of the token is represented by (d) m H, W) is changed toThe height and the width are reduced by s times and are obtained by linear conversionWhere X is the input token, n is the number of blocks, H is the number of blocks of the high-direction image of the input image, W is the number of blocks of the wide-direction image of the input image, d m Is the embedding dimension of each image block, the embedding dimension of the query vector dimension, the key vector and the value vector is d k And n' is the number of blocks.
For the value V, we add a channel interaction module to the calculation. Inspired by channel attention (SE), the channel interaction comprises a depthwise convolution DW_2, a global average pooling layer (GAP), and then two consecutive 1 × 1 convolutional layers with batch normalization (BN) and a SiLU activation between them. Finally, Sigmoid is used to generate attention in the channel dimension. The calculation formula of V is as follows:
V = FC(LN(DW_1(X))) · Sigmoid(conv(SILU(BN(conv(GAP(DW_2(X)))))))   (1)
where FC is a fully connected layer, LN is layer normalization, BN is batch normalization, DW_1 is a depthwise convolution, X is the input token vector, conv is a 1 × 1 convolution, GAP is global average pooling, and DW_2 is a depthwise convolution.
This finally gives V ∈ R^(n'×d_k), where DW_2 is a depthwise convolution with a 3 × 3 kernel. Note the difference between DW_1 and DW_2: DW_1 reduces the size of the input X by a factor of s, while DW_2 changes neither the size nor the number of channels of the input X, so that more channel information is retained; conv is a 1 × 1 convolution. Then the self-attention function of Q, K and V is calculated by the following formula:

Attention(Q, K, V) = Softmax(QK^T / √d_k)V   (2)
and finally, adding the linear transformation and the X to obtain final output. The channel interaction module and the SE layer are similar in design, but they are mainly distinguished by the following two points: (1) the inputs to the modules are different. Note that the two deep convolutions do not share weights, and the input for the channel interaction here comes from another parallel branch. (2) The present application applies channel interaction only to the local window self-attention V value calculation in the module, rather than to the output of the module as in the SE layer.
The loss function employed in the present application is as follows:
(1) RPN classification loss and cascade detection head loss. A multivariate cross entropy loss function is used herein; the goal of bounding box classification is to assign one of C + 1 class labels to each bounding box, represented by probability p, where C labels are object categories and the remaining one is background. For a training sample (x_i, y_i), where y_i is the label of input x_i, the multivariate cross entropy loss function is as in equation (3):
where W_j is as in equation (4):
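The classification term can be sketched as a plain multiclass cross-entropy over C + 1 classes. A hedged sketch: the weight W_j of equation (4) is not reproduced here and defaults to 1, and the probability values are illustrative.

```python
import numpy as np

# Hedged sketch: multiclass cross-entropy over C foreground classes plus one
# background class, as used for RPN / cascade-head classification. The class
# weight w stands in for W_j of equation (4); its exact form is not assumed.
def cross_entropy(p, y, w=None):
    """p: (N, C+1) softmax probabilities; y: (N,) integer class labels."""
    w = np.ones(p.shape[1]) if w is None else w
    return float(-(w[y] * np.log(p[np.arange(len(y)), y])).mean())

p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])   # two boxes, C = 2 classes + background
y = np.array([0, 1])
print(round(cross_entropy(p, y), 4))  # -(ln 0.7 + ln 0.8)/2 ~ 0.2899
```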
(2) RPN bounding box regression loss. Bounding box regression aims to use a regression function to regress the candidate bounding box b = (b_x, b_y, b_w, b_h) to the target bounding box g = (g_x, g_y, g_w, g_h), minimizing the loss function L_loc(b_i, g_i):

L_loc(b_i, g_i) = (1/N_reg) Σ_i p_i* Σ_{j∈{x,y,w,h}} smooth_L1(b_i^j − g_i^j)
where the Smooth L1 loss is defined as:

smooth_L1(x) = 0.5x², if |x| < 1;  |x| − 0.5, otherwise
where N_reg indicates the number of anchor locations, p_i* is 1 when the candidate box is a positive sample and 0 when the candidate box is a negative sample, b_i represents the bounding box regression parameters predicted for the i-th anchor, and g_i represents the ground-truth box corresponding to the i-th anchor.
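The Smooth L1 loss described above is standard and can be written directly; a short sketch with illustrative inputs:

```python
import numpy as np

# Hedged sketch of the standard Smooth L1 loss used for RPN box regression:
# quadratic near zero (|x| < 1), linear beyond, so outlier boxes contribute
# a bounded gradient of magnitude 1.
def smooth_l1(x):
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x**2, np.abs(x) - 0.5)

print(smooth_l1([0.5, 1.0, 3.0]))  # [0.125 0.5   2.5  ]
```

The two branches meet at |x| = 1 with matching value and slope, which is what keeps training stable on outliers.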
(3) Cascade detection head bounding box regression loss. Directly increasing the weight of the localization loss (i.e., the regression loss) may make the model more sensitive to some localization outliers. After adding a gradient limit to the derivative of the Smooth L1 loss, the gradient formula of the balanced L1 loss can be defined as follows:

∂L_b/∂x = α ln(b|x| + 1), if |x| < 1;  γ, otherwise   (8)
where α adjusts the gradient contribution of inliers and γ is the upper limit of the outlier error. Herein the balanced L1 loss is as follows:

L_b(x) = (α/b)(b|x| + 1) ln(b|x| + 1) − α|x|, if |x| < 1;  γ|x| + C, otherwise   (9)
where b is chosen to guarantee that L_b(x) is continuous at |x| = 1, and C is a constant; the condition between the parameters is as follows:

αln(b+1)=γ (10)
where α and γ are hyperparameters, with defaults set to 0.5 and 1.5. A small α makes the back-propagated gradient of inliers larger, while γ adjusts the upper bound of the regression error: the back-propagated gradient does not exceed γ.
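The balanced L1 loss and the parameter coupling α ln(b + 1) = γ can be sketched and checked numerically. This follows the Libra R-CNN formulation that the text describes, with the stated defaults α = 0.5 and γ = 1.5:

```python
import numpy as np

# Hedged sketch of the balanced L1 loss, matching equations (8)-(10):
# gradient alpha*ln(b|x|+1) for |x| < 1, capped at gamma beyond; b is fixed
# by alpha*ln(b+1) = gamma so the gradient is continuous at |x| = 1, and C
# makes the loss itself continuous there.
def balanced_l1(x, alpha=0.5, gamma=1.5):
    x = np.abs(np.asarray(x, dtype=float))
    b = np.expm1(gamma / alpha)  # from alpha*ln(b+1) = gamma, eq. (10)
    C = (alpha / b) * (b + 1) * np.log(b + 1) - alpha - gamma
    inner = (alpha / b) * (b * x + 1) * np.log(b * x + 1) - alpha * x
    return np.where(x < 1, inner, gamma * x + C)

# continuous at |x| = 1, then linear with slope gamma
print(balanced_l1([0.999999, 1.0, 2.0]))
```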
(4) The total loss. The total loss includes the classification and regression losses of the RPN stage and the classification and regression losses of the three cascade stages. The balanced L1 loss is applied herein to the first, second and third stages of the cascade detection head, and corresponding weights are assigned to the losses of the three stages and the loss of the RPN. The total loss function can be written as:
L = aL_RPN + bL_stage1 + cL_stage2 + dL_stage3   (11)
where:

L_RPN = L_RPN_cls + L_RPN_reg,  L_stagei = L_stagei_cls + L_stagei_reg (i = 1, 2, 3)

a = 1, b = 0.75, c = 0.5 and d = 0.25, where a, b, c and d are the loss weight coefficients; L_RPN_cls represents the RPN classification loss and L_RPN_reg the RPN regression loss; L_stage1, L_stage2 and L_stage3 represent the total losses of the three stages; L_stage1_cls, L_stage2_cls and L_stage3_cls are the classification losses of each stage, and L_stage1_reg, L_stage2_reg and L_stage3_reg are the regression losses of each stage.
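The weighted combination of equation (11) can be sketched as follows; the individual loss values are illustrative placeholders:

```python
# Hedged sketch of the total loss: the RPN loss and the three cascade-stage
# losses combined under the decaying weights a=1, b=0.75, c=0.5, d=0.25,
# each term being classification loss + regression loss.
def total_loss(rpn, stages, weights=(1.0, 0.75, 0.5, 0.25)):
    """rpn and each stages[i]: (cls_loss, reg_loss) pair; values illustrative."""
    terms = [rpn] + list(stages)
    return sum(w * (cls + reg) for w, (cls, reg) in zip(weights, terms))

L = total_loss((0.4, 0.2), [(0.3, 0.1), (0.2, 0.1), (0.1, 0.05)])
print(round(L, 4))  # 1*0.6 + 0.75*0.4 + 0.5*0.3 + 0.25*0.15 = 1.0875
```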
Compared with the prior art, the technical scheme has the following advantages:
the utility model provides a practical self-attention method, adopt the channel interaction module to the calculation of V value can solve the problem that local window lacks direction perception and positional information from the attention, reduces the size of input token through the degree of depth convolution, reduces the complexity from the attention calculation. With balanced L 1 Loss and weights lost in different stages are configured in a total loss function to solve the problem of gradient imbalance of simple samples and difficult samples, the improved method has an excellent effect in personnel detection of the machine room, the maintenance efficiency of the unattended machine room is improved, and the normal, safe and reliable operation of the machine room is guaranteed.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is an overall framework diagram of the algorithm;
fig. 2 is a backbone network.
Detailed Description
Based on the above technical problems of the Swin Transformer, the present application provides a machine room personnel detection method, specifically a machine room personnel detection method based on an image layered vision Transformer.
the utility model provides a practical self-attention method, adopt the channel interaction module to the calculation of V value can solve the problem that local window lacks direction perception and positional information from the attention, reduces the size of input token through the degree of depth convolution, reduces the complexity from the attention calculation. With balanced L 1 Loss and weights lost in different stages are configured in a total loss function to solve the problem of gradient imbalance of simple samples and difficult samples, the improved method has an excellent effect in personnel detection of the machine room, the maintenance efficiency of the unattended machine room is improved, and the normal, safe and reliable operation of the machine room is guaranteed.
The application provides a computer room personnel detection method based on image layered vision Transformer, which specifically comprises the following steps:
the Network structure of the detection method consists of four parts, including a Swin-T backbone Network, a Feature Pyramid (FPN), an area Proposal Network (RPN) and a cascade detection head. As shown in fig. 1, Swin Transformer is used to extract image features, and FPN is mainly used to extract multi-scale features. The RPN is a combination Of several convolutional layers that produces a Region Of Interest (ROI) where an object may be present. And the cascade detection head classifies and positions the region of interest and outputs a final detection result. In the cascade detection head, FC is the full connectivity layer, C is the classification probability, and B is the regression of the candidate box.
The backbone network herein is shown in FIG. 2(a). First, the picture is input into a patch partition module for splitting: every 4 × 4 adjacent pixels form a patch, which is then flattened in the channel direction. Assuming an RGB three-channel picture is input, each patch has 4 × 4 = 16 pixels, and each pixel has R, G and B values, so flattening gives 16 × 3 = 48 values; the image shape after patch partition thus changes from (H, W, 3) to (H/4, W/4, 48). Then the channel data of each patch is linearly transformed from 48 to 96 dimensions through a linear embedding layer, i.e., the shape changes from (H/4, W/4, 48) to (H/4, W/4, 96). Feature maps of different sizes are then constructed through four stages; except that stage 1 first passes through the linear embedding layer, the remaining three stages first downsample through a patch merging layer. All stages are then repeated stacks of Swin Transformer blocks; note that the blocks come in two configurations. As shown in FIG. 2(b), a standard window-based multi-head self-attention (W-MSA) module and a shifted-window multi-head self-attention (SW-MSA) module are used in series in a Swin Transformer block. Each (S)W-MSA and MLP is preceded by layer normalization (LN), and the MLP consists of two fully connected layers with a GELU nonlinear activation. The connection of W-MSA and SW-MSA can be shown by the following formulas:

ẑ^l = W-MSA(LN(z^(l−1))) + z^(l−1)
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)

where ẑ^l is the output of (S)W-MSA, z^l is the output of the MLP, and l − 1, l and l + 1 denote block positions.
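The patch-partition arithmetic above can be checked with a short sketch; a random projection stands in for the learned linear embedding layer:

```python
import numpy as np

# Hedged sketch of the patch-partition step: a (H, W, 3) image is split into
# 4x4 patches, each flattened to 4*4*3 = 48 values, then linearly projected
# to the embedding dimension 96. The projection weights are random and
# purely illustrative.
def patch_partition(img, patch=4, embed_dim=96, rng=None):
    H, W, C = img.shape
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(H // patch, W // patch, patch * patch * C)
    rng = rng or np.random.default_rng(0)
    proj = rng.standard_normal((patch * patch * C, embed_dim))  # linear embedding
    return x @ proj

img = np.zeros((224, 224, 3))  # illustrative input resolution
tokens = patch_partition(img)
print(tokens.shape)  # (56, 56, 96), i.e. (H/4, W/4, 96)
```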
The present application designs a practical self-attention module, adding two key designs to the standard window-based multi-head self-attention (W-MSA) module: (1) a self-attention mechanism with a smaller amount of calculation, which reduces the computational complexity of self-attention; (2) considering that convolutional layers aim to model local relations, a channel interaction module is added, using parallel depthwise convolution (global computation) alongside the local-window self-attention computation, to solve the lack of direction perception and position information in local-window self-attention. These two key points are integrated to construct an improved self-attention module. Details are described next. As shown in FIG. 2(c), the query Q ∈ R^(n×d_k) is derived by a linear projection of the input X ∈ R^(n×d_m), where n = H × W. The input X is reshaped into a spatial tensor of shape (d_m, H, W), and its size is reduced by a depthwise convolution with an s × s kernel and stride s, changing the token shape from (d_m, H, W) to (d_m, H/s, W/s); the height and width are each reduced by a factor of s. The key K ∈ R^(n'×d_k) and value V ∈ R^(n'×d_k) are then obtained by linear transformation, where n' = (H/s) × (W/s). Here X is the input token sequence, n is the number of image patches, H and W are the numbers of patches along the height and width of the input image, d_m is the embedding dimension of each image patch, and d_k is the embedding dimension of the query, key and value vectors.
For the value V, we add a channel interaction module to the calculation. Inspired by channel attention (SE), the channel interaction comprises a depthwise convolution DW_2, a global average pooling layer (GAP), and then two consecutive 1 × 1 convolutional layers with batch normalization (BN) and a SiLU activation between them. Finally, Sigmoid is used to generate attention in the channel dimension. The calculation formula of V is as follows:
V = FC(LN(DW_1(X))) · Sigmoid(conv(SILU(BN(conv(GAP(DW_2(X)))))))   (1)
to obtain finallyIn which DW 2 Is a deep convolution with a convolution kernel of 3 × 3, where attention is paid to DW 1 And DW 2 Difference of (2), DW 1 The input X is reduced in size by a factor of s. DW (DW) 2 The size and number of channels of input X are not changed, more channel information is retained, conv is a convolution of 1 × 1. Then, the self-attention function of Q, K and V is calculated by the following formula:
and finally, adding the linear transformation and the X to obtain final output. The channel interaction module and the SE layer are similar in design, but they are mainly distinguished by the following two points: 1. the inputs to the modules are different. Note that the two deep convolutions do not share weights, and the input for channel interaction here comes from another parallel branch. 2. The present application applies channel interaction only to the local window self-attention V value calculation in the module, rather than to the output of the module as in the SE layer.
The loss function employed in the present application is as follows:
(1) RPN classification loss and cascade detection head loss. A multivariate cross entropy loss function is used herein; the goal of bounding box classification is to assign one of C + 1 class labels to each bounding box, represented by probability p, where C labels are object categories and the remaining one is background. For a training sample (x_i, y_i), where y_i is the label of input x_i, the multivariate cross entropy loss function is as in equation (3):
where W_j is as in equation (4):
(2) RPN bounding box regression loss. Bounding box regression aims to use a regression function to regress the candidate bounding box b = (b_x, b_y, b_w, b_h) to the target bounding box g = (g_x, g_y, g_w, g_h), minimizing the loss function L_loc(b_i, g_i):

L_loc(b_i, g_i) = (1/N_reg) Σ_i p_i* Σ_{j∈{x,y,w,h}} smooth_L1(b_i^j − g_i^j)
where the Smooth L1 loss is defined as:

smooth_L1(x) = 0.5x², if |x| < 1;  |x| − 0.5, otherwise
where N_reg indicates the number of anchor locations, p_i* is 1 when the candidate box is a positive sample and 0 when the candidate box is a negative sample, b_i represents the bounding box regression parameters predicted for the i-th anchor, and g_i represents the ground-truth box corresponding to the i-th anchor.
(3) Cascade detection head bounding box regression loss. Directly increasing the weight of the localization loss (i.e., the regression loss) may make the model more sensitive to some localization outliers. After adding a gradient limit to the derivative of the Smooth L1 loss, the gradient formula of the balanced L1 loss can be defined as follows:

∂L_b/∂x = α ln(b|x| + 1), if |x| < 1;  γ, otherwise   (8)
where α adjusts the gradient contribution of inliers and γ is the upper limit of the outlier error. Herein the balanced L1 loss is as follows:

L_b(x) = (α/b)(b|x| + 1) ln(b|x| + 1) − α|x|, if |x| < 1;  γ|x| + C, otherwise   (9)
where b is chosen to guarantee that L_b(x) is continuous at |x| = 1, and C is a constant; the condition between the parameters is as follows:

αln(b+1)=γ (10)
where α and γ are hyperparameters, with defaults set to 0.5 and 1.5. A small α makes the back-propagated gradient of inliers larger, while γ adjusts the upper bound of the regression error: the back-propagated gradient does not exceed γ.
(4) The total loss. The total loss includes the classification and regression losses of the RPN stage and the classification and regression losses of the three cascade stages. The balanced L1 loss is applied here to the first, second and third stages of the cascade detection head, and corresponding weights are assigned to the losses of the three stages and the loss of the RPN. The total loss function can be written as:
L = aL_RPN + bL_stage1 + cL_stage2 + dL_stage3   (11)
where:

L_RPN = L_RPN_cls + L_RPN_reg,  L_stagei = L_stagei_cls + L_stagei_reg (i = 1, 2, 3)

a = 1, b = 0.75, c = 0.5 and d = 0.25, where a, b, c and d are the loss weight coefficients; L_RPN_cls represents the RPN classification loss and L_RPN_reg the RPN regression loss; L_stage1, L_stage2 and L_stage3 represent the total losses of the three stages; L_stage1_cls, L_stage2_cls and L_stage3_cls are the classification losses of each stage, and L_stage1_reg, L_stage2_reg and L_stage3_reg are the regression losses of each stage.
In order to verify the performance of the method for machine room personnel detection, it is compared with currently popular target detection algorithms; the detection performance of DETR based on ResNet50, Deformable DETR based on ResNet50, YOLOX-x, RetinaNet based on ResNeXt101 with FPN, YOLOF based on ResNet50, the Swin Transformer-T cascade algorithm and other algorithms was tested, as shown in Table 1. First, considering the mAP at IoU = 0.5, the detection accuracy of the Swin Transformer is better than that of DETR, Deformable DETR and YOLOF, and is 0.05 points higher than that of RetinaNet, a one-stage algorithm. This seems to indicate that two-stage algorithms achieve higher detection accuracy in the field of machine room personnel detection than end-to-end and one-stage algorithms, although this does not include YOLOX-x, since YOLOX is mainly an integrated detection algorithm that combines various techniques; the detection accuracy of YOLOX-x is better than that of the Swin Transformer. The detection accuracy of Deformable DETR is 1.4 points higher than that of DETR. The mAP@0.5 of the detection network based on the improved Swin Transformer is 89.7%, 3.2 points higher than Swin Transformer-T and 0.6 points higher than YOLOX-x. Second, the algorithm improved herein is second only to YOLOX in detecting small objects, while the DETR-series algorithms perform worst on small targets; Deformable DETR has higher detection performance than DETR on small objects. Apart from YOLOX, the two-stage detection algorithms are significantly better than the one-stage algorithms in detecting small objects.
Finally, in terms of model complexity, in order to guarantee fairness, the input sizes of all target detection networks are set to (3, 1280, 800); compared with the Swin Transformer, GFLOPs are reduced by 5.43 G while the parameter count increases by 3.42 M. The improved method thus brings a substantial improvement in detection accuracy over the Swin Transformer.
TABLE 1 comparison of test results for different models
The application also tested the detection performance of each improvement separately, measured by mAP@0.5. The experimental results are shown in Table 2: using the improved Swin Transformer block (ISTB) raises the detection accuracy by 1.83 points, and using the balanced loss function (BLOSS) raises it by 1.4 points. Therefore, using the improved methods in combination can improve the detection accuracy of the Swin Transformer.
TABLE 2 results of the experiments using the improved method in combination
In this description, the parts are described in a progressive manner; each part focuses on its differences from the others, and the same or similar parts of the various sections can be cross-referenced.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (3)
1. A machine room personnel detection method based on image layered vision, characterized in that the network structure of the method consists of four parts, including a Swin Transformer backbone network, a Feature Pyramid Network (FPN), a Region Proposal Network (RPN) and a cascade detection head, wherein the Swin Transformer is used for extracting image features related to machine room personnel, the FPN is mainly used for extracting multi-scale features, the RPN is a combination of several convolutional layers that generates regions of interest (ROI) where objects may exist, and the cascade detection head classifies and localizes the regions of interest and outputs the final detection result; in the cascade detection head, FC is a fully connected layer, C is the classification probability, and B is the regression of the candidate box;
two key designs were added in the standard window-based multi-headed self-attention (W-MSA) module in the Swin Transformer backbone network: (1) a self-attention mechanism with smaller calculation amount is designed, and the calculation complexity of the self-attention mechanism is reduced; (2) considering that the convolutional layer aims at simulating local relations, the problem that the self-attention of local windows lacks direction perception and position information is solved by adding a channel interaction module and using parallel deep convolution (global calculation) and self-attention calculation based on the local windows;
in addition, balanced L is used 1 Losing and configuring weights lost in different stages in a total loss function to solve the problem of gradient imbalance of simple samples and difficult samples;
the method comprises the following steps of designing a self-attention mechanism with smaller calculation amount and reduced self-attention mechanism calculation complexity, and specifically comprises the following steps:
by pairing input tokensLinear projection derived queryWhere n is H × W, and then inputReshaped into a space vector (d) m H, W) by a deep convolution (DW) with a convolution kernel size s × s and a step size s 1 ) To reduce the size of the input X, the size of the token is reduced by (d) m H, W) is changed toThe height and the width are reduced by s times and are obtained by linear conversion Reducing the complexity of self-attention computation by reducing the size of the input token by deep convolution; where X is the input token, n is the number of blocks, H is the number of blocks of the high-direction image of the input image, W is the number of blocks of the wide-direction image of the input image, d m Is the embedding dimension of each image block, the embedding dimension of the query vector dimension, the key vector and the value vector is d k And n' is the number of blocks.
2. The method according to claim 1, characterized in that, considering that the convolutional layer aims at simulating local relations, the problem of the self-attention lack of direction perception and position information of local windows is solved by adding channel interaction modules, using parallel deep convolution (global computation) and self-attention computation based on local windows, specifically comprising:
for the value V, a channel interaction module is added to the computation. Inspired by channel attention (SE), the channel interaction consists of a depthwise convolution, a global average pooling layer (GAP), and then two successive 1 × 1 convolution layers with batch normalization (BN) and an activation function (SiLU) between them; finally, a Sigmoid is used to generate attention in the channel dimension. The formula for V is shown below:
V = FC(LN(DW_1(X))) · Sigmoid(conv(SiLU(BN(conv(GAP(DW_2(X))))))) (1)
where FC is a fully connected layer, LN is layer normalization, BN is batch normalization, DW_1 and DW_2 are depthwise convolutions, X is the input token vector, conv is a 1 × 1 convolution, and GAP is global average pooling;
V is finally obtained as above, in which DW_2 is a depthwise convolution with a 3 × 3 convolution kernel. Note the distinction between DW_1 and DW_2: after DW_1 the input X is reduced in size by a factor of s, whereas DW_2 does not change the size or the number of channels of the input X, so more channel information is retained; conv is a 1 × 1 convolution. The self-attention function of Q, K and V is then calculated by the following formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V (2)
Finally, the final output is obtained by linear transformation and addition with X. The design of the channel interaction module is similar to that of the SE layer, but differs in two main respects: first, the inputs to the modules are different; note that the two depthwise convolutions do not share weights, and the input to the channel interaction comes from another parallel branch. Second, the channel interaction is applied to the computation of the value V in the local-window self-attention module, rather than to the module's output as in the SE layer.
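A hedged PyTorch sketch of the channel-interaction computation of V in formula (1) might look as follows. The class name, the channel-reduction ratio of the squeeze, and the gate's output dimension are illustrative assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn

class ChannelInteractionV(nn.Module):
    """Sketch of formula (1): V = FC(LN(DW_1(X))) * Sigmoid(conv(SiLU(BN(conv(GAP(DW_2(X)))))))."""
    def __init__(self, d_m, d_k, s, reduction=4):
        super().__init__()
        # main branch: DW_1 (stride s) -> LayerNorm -> fully connected
        self.dw1 = nn.Conv2d(d_m, d_m, kernel_size=s, stride=s, groups=d_m)
        self.ln = nn.LayerNorm(d_m)
        self.fc = nn.Linear(d_m, d_k)
        # gate branch: DW_2 (3x3, padding 1 keeps H and W unchanged) -> GAP -> squeeze
        self.dw2 = nn.Conv2d(d_m, d_m, kernel_size=3, padding=1, groups=d_m)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.squeeze = nn.Sequential(
            nn.Conv2d(d_m, d_m // reduction, 1),
            nn.BatchNorm2d(d_m // reduction),
            nn.SiLU(),
            nn.Conv2d(d_m // reduction, d_k, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (B, d_m, H, W)
        main = self.dw1(x)                         # (B, d_m, H/s, W/s)
        main = main.flatten(2).transpose(1, 2)     # (B, n', d_m)
        main = self.fc(self.ln(main))              # (B, n', d_k)
        gate = self.squeeze(self.gap(self.dw2(x))) # (B, d_k, 1, 1)
        return main * gate.flatten(1).unsqueeze(1) # channel-wise rescaling of V
```

Note that the two depthwise convolutions are separate modules with unshared weights, matching the remark above, and that the Sigmoid gate rescales channels of V rather than the block's output.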
3. The method according to claim 1, characterized in that a balanced L1 loss is used and weights are configured for the losses of the different stages in the total loss function, to solve the problem of gradient imbalance between easy and hard samples, specifically comprising:
(1) RPN classification loss and cascade detection head classification loss, using a multiclass cross-entropy loss function. The goal of bounding box classification is to assign one of C + 1 class labels to each bounding box, represented by a probability p, where C is the number of object classes and one class is the background. For a training sample x_i with label y_i, where y_i is the class label of input x_i, the multiclass cross-entropy loss function is as in equation (3):
where W_j is as in equation (4):
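Equations (3) and (4) are not reproduced on this page, so the following numpy sketch shows only the general form of a class-weighted multiclass cross-entropy; W_j is left as a caller-supplied weight vector rather than implementing the patent's (unseen) equation (4), and uniform weights are used in the example.

```python
import numpy as np

def weighted_cross_entropy(p, y, w):
    """p: (N, C+1) predicted class probabilities, y: (N,) integer labels,
    w: (C+1,) per-class weights W_j (definition left to the caller)."""
    eps = 1e-12  # avoid log(0)
    return -np.mean(w[y] * np.log(p[np.arange(len(y)), y] + eps))
```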
(2) RPN bounding box regression loss. Bounding box regression aims to use a regression function to regress the candidate bounding box b = (b_x, b_y, b_w, b_h) to the target bounding box g = (g_x, g_y, g_w, g_h), minimizing the loss function L_loc(b_i, g_i):
where:
the Smooth L1 loss is defined as:

smooth_L1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise
where N_reg denotes the number of anchor locations, the indicator p_i* is 1 when the candidate box is a positive sample and 0 when the candidate box is a negative sample, b_i denotes the predicted bounding box regression parameters of the i-th anchor, and g_i denotes the ground-truth box corresponding to the i-th anchor;
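For reference, the standard Smooth L1 definition above can be written as a small numpy function, quadratic for small errors and linear for large ones:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)
```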
(3) Cascade detection head bounding box regression loss. Directly increasing the weight of the localization loss (i.e. the regression loss) makes the model more sensitive to outlier localization values; after adding a gradient limit to the derivative of the Smooth L1 loss, the gradient formula of the balanced L1 loss can be defined as follows:

∂L_b/∂x = α ln(b|x| + 1), if |x| < 1; γ, otherwise
where α controls the gradient contribution of inliers and γ is the upper bound of the outlier error; the balanced L1 loss is then:

L_b(x) = (α/b)(b|x| + 1)ln(b|x| + 1) − α|x|, if |x| < 1; γ|x| + C, otherwise
where the parameter b is used to guarantee that the two branches of L_b(x) are continuous at |x| = 1 and C is a constant; the condition between the parameters is as follows:
α ln(b + 1) = γ (10)
where α and γ are hyper-parameters with default values of 0.5 and 1.5 respectively; a small α makes the back-propagated gradient larger, and γ adjusts the upper bound of the regression error so that the back-propagated gradient does not exceed γ;
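A numpy sketch of the balanced L1 loss described above (the Libra R-CNN formulation) follows; b is solved from equation (10), α ln(b + 1) = γ, so that the two branches and their gradients join at |x| = 1:

```python
import numpy as np

def balanced_l1(x, alpha=0.5, gamma=1.5):
    """Balanced L1 loss with defaults alpha = 0.5, gamma = 1.5."""
    x = np.abs(np.asarray(x, dtype=float))
    b = np.expm1(gamma / alpha)  # b = e^(gamma/alpha) - 1, from alpha*ln(b+1) = gamma
    inlier = (alpha / b) * (b * x + 1) * np.log(b * x + 1) - alpha * x
    # constant C chosen so the outlier branch meets the inlier branch at |x| = 1
    C = (alpha / b) * (b + 1) * np.log(b + 1) - alpha - gamma
    outlier = gamma * x + C
    return np.where(x < 1, inlier, outlier)
```

With the default hyper-parameters the gradient at |x| = 1 equals α ln(b + 1) = γ, so the outlier branch simply continues at the capped slope γ.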
(4) Total loss. The total loss includes the classification loss and regression loss of the RPN stage and the classification losses and regression losses of the three cascade stages. The balanced L1 loss is applied to the first, second and third stages of the cascade detection head, and corresponding weights are assigned to the losses of the three stages and the RPN loss; the total loss function can be written as:
L = aL_RPN + bL_stage1 + cL_stage2 + dL_stage3 (11)
where:

L_RPN = L_RPN_cls + L_RPN_reg, L_stagek = L_stagek_cls + L_stagek_reg (k = 1, 2, 3)

a = 1, b = 0.75, c = 0.5, d = 0.25, where a, b, c and d are the loss weight coefficients; L_RPN_cls is the RPN classification loss and L_RPN_reg is the RPN regression loss; L_stage1, L_stage2 and L_stage3 are the total losses of the three stages, L_stage1_cls, L_stage2_cls and L_stage3_cls are the classification losses of each stage, and L_stage1_reg, L_stage2_reg and L_stage3_reg are the regression losses of each stage.
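A minimal sketch of the weighted combination in equation (11), assuming each stage loss is the sum of its classification and regression terms:

```python
def total_loss(rpn, stages, a=1.0, weights=(0.75, 0.5, 0.25)):
    """rpn: (cls, reg) pair for the RPN stage; stages: three (cls, reg)
    pairs for the cascade head; a and weights are the coefficients
    a, (b, c, d) of equation (11)."""
    L = a * sum(rpn)
    for w, (cls, reg) in zip(weights, stages):
        L += w * (cls + reg)
    return L
```

The decaying weights (1, 0.75, 0.5, 0.25) progressively downweight the later, stricter cascade stages.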
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210529776.7A CN115063833A (en) | 2022-05-16 | 2022-05-16 | Machine room personnel detection method based on image layered vision |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115063833A true CN115063833A (en) | 2022-09-16 |
Family
ID=83198297
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112949673A (en) * | 2019-12-11 | 2021-06-11 | 四川大学 | Feature fusion target detection and identification method based on global attention |
CN111259930A (en) * | 2020-01-09 | 2020-06-09 | 南京信息工程大学 | General target detection method of self-adaptive attention guidance mechanism |
WO2021139069A1 (en) * | 2020-01-09 | 2021-07-15 | 南京信息工程大学 | General target detection method for adaptive attention guidance mechanism |
US11270124B1 (en) * | 2020-11-16 | 2022-03-08 | Branded Entertainment Network, Inc. | Temporal bottleneck attention architecture for video action recognition |
CN113888744A (en) * | 2021-10-14 | 2022-01-04 | 浙江大学 | Image semantic segmentation method based on Transformer visual upsampling module |
CN114241307A (en) * | 2021-12-09 | 2022-03-25 | 中国电子科技集团公司第五十四研究所 | Synthetic aperture radar aircraft target identification method based on self-attention network |
Non-Patent Citations (3)
Title |
---|
严娟; 方志军; 高永彬: "3D object detection combining mixed-domain attention and dilated convolution", Journal of Image and Graphics (中国图象图形学报), no. 06, 16 June 2020 (2020-06-16) * |
周幸; 陈立福: "Remote sensing image object detection based on a dual attention mechanism", Computer and Modernization (计算机与现代化), no. 08, 15 August 2020 (2020-08-15) * |
宁尚明; 滕飞; 李天瑞: "Entity relation extraction from electronic medical records based on a multi-channel self-attention mechanism", Chinese Journal of Computers (计算机学报), no. 05, 15 May 2020 (2020-05-15) * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115631513A (en) * | 2022-11-10 | 2023-01-20 | 杭州电子科技大学 | Multi-scale pedestrian re-identification method based on Transformer |
CN115631513B (en) * | 2022-11-10 | 2023-07-11 | 杭州电子科技大学 | Transformer-based multi-scale pedestrian re-identification method |
CN116740790A (en) * | 2023-06-21 | 2023-09-12 | 北京科技大学 | Face detection method and device based on transducer |
CN116740790B (en) * | 2023-06-21 | 2024-02-09 | 北京科技大学 | Face detection method and device based on transducer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||