CN115063833A - Machine room personnel detection method based on image layered vision - Google Patents

Machine room personnel detection method based on image layered vision

Info

Publication number
CN115063833A
Authority
CN
China
Prior art keywords
loss
attention
self
rpn
regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210529776.7A
Other languages
Chinese (zh)
Inventor
苏丹
那琼澜
贺惠民
杨艺西
邢宁哲
庞思睿
李信
金燊
来骥
万莹
张辉
任建伟
吴舜
刘昀
于然
赵欣
魏秀静
赵琦
王艺霏
纪雨彤
张实君
赵子兰
尚芳剑
杨睿
于蒙
申昉
李欣怡
曾婧
张翼
温馨
张天颖
张海明
李宇鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
Information and Telecommunication Branch of State Grid Jibei Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Information and Telecommunication Branch of State Grid Jibei Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, Information and Telecommunication Branch of State Grid Jibei Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202210529776.7A
Publication of CN115063833A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention discloses a machine room personnel detection method based on image layered vision, and particularly relates to a cascade detector built on a shifted-window hierarchical vision Transformer. A practical self-attention method is designed: the size of the input tokens is reduced through depthwise convolution to lower the complexity of the self-attention computation, and a channel interaction module is adopted to compute the value V, addressing the lack of direction perception and position information in local-window self-attention. Secondly, a balanced L1 loss is used and the losses of the different stages are given weights in the total loss function to address the imbalance between easy and hard samples. The detection accuracy mAP@0.5 of the improved method is 3.2 percentage points higher than that of the original Swin Transformer.

Description

Machine room personnel detection method based on image layered vision
Technical Field
The invention belongs to the field of images, and particularly relates to a machine room personnel detection method based on image layered vision.
Background
Computer room personnel detection is one of the key tasks in the field of computer vision, and convolutional neural networks (CNNs) are widely applied to it, for example the RCNN series and the YOLO series. CNNs are very powerful at extracting locally effective information, but they lack the ability to capture long-range features from global information. Recently, Transformers with global computation have been widely applied to computer vision tasks and have achieved remarkable results. The Transformer uses self-attention to mine long-range dependencies, originally in text. Many current computer vision tasks use a self-attention (SA) mechanism to overcome the limitations of CNNs and to obtain relationships between distant elements more directly. It is therefore important to explore the potential of the Transformer in the field of machine room personnel detection.
The recently proposed Swin Transformer constructs a hierarchical feature structure that adapts easily to feature pyramids and similar designs, and its local-window self-attention reduces the computational complexity from quadratic to linear. These properties allow the Swin Transformer to serve as a general-purpose backbone for a variety of visual tasks. However, the Swin Transformer still has three problems in machine room personnel detection: (1) performing self-attention within non-overlapping windows may still have high computational complexity; (2) self-attention computed within non-overlapping windows may lack direction perception and position information, i.e., cross-channel information may not be well captured; (3) during training there is an imbalance between easy and hard samples, and the gradient contribution of easy samples is too small during back-propagation.
Disclosure of Invention
In view of the above technical problems of the Swin Transformer, the present application provides a machine room personnel detection method, specifically a machine room personnel detection method based on an image layered vision Transformer.
The machine room personnel detection method based on image layered vision comprises the following steps:
the present application designs a practical self-attention module, adding two key designs to the standard window-based attention module (W-MSA): (1) a self-attention mechanism with smaller calculation amount is designed, and the calculation complexity of the self-attention mechanism is reduced. (2) Considering that convolutional layers aim to simulate local relations, the problem of lack of direction perception and position information of local window self-attention is solved by adding a channel interaction module, using parallel deep convolution (global computation) and local window-based self-attention computation. These two key points are integrated and an improved self-attention module is constructed herein. Details are described next. As shown in FIG. 2(c), by inputting
Figure BDA0003645718580000021
Linear projection derived query
Figure BDA0003645718580000022
Wherein n is H × W. Will input
Figure BDA0003645718580000023
Reshaped into a spatial vector (d) m H, W), the size of the input X is reduced by a deep convolution with a convolution kernel of s × s and a step size of s. The size of the token is represented by (d) m H, W) is changed to
Figure BDA0003645718580000024
The height and the width are reduced by s times and are obtained by linear conversion
Figure BDA0003645718580000025
Where X is the input token, n is the number of blocks, H is the number of blocks of the high-direction image of the input image, W is the number of blocks of the wide-direction image of the input image, d m Is the embedding dimension of each image block, the embedding dimension of the query vector dimension, the key vector and the value vector is d k And n' is the number of blocks.
For the value $V$, a channel interaction module is added to the computation. Inspired by channel attention (SE), the channel interaction branch consists of a depthwise convolution $DW_2$, a global average pooling layer (GAP), then two consecutive $1 \times 1$ convolution layers with batch normalization (BN) and a SiLU activation between them. Finally, a Sigmoid is used to generate attention in the channel dimension. $V$ is computed as follows:

$$V = FC(LN(DW_1(X))) \cdot \mathrm{Sigmoid}(conv(\mathrm{SiLU}(BN(conv(GAP(DW_2(X))))))) \tag{1}$$

where $FC$ is a fully connected layer, $LN$ is layer normalization, $BN$ is batch normalization, $DW_1$ and $DW_2$ are depthwise convolutions, $X$ is the input token tensor, $conv$ is a $1 \times 1$ convolution, and $GAP$ is global average pooling. This finally yields $V \in \mathbb{R}^{n' \times d_k}$. $DW_2$ is a depthwise convolution with a $3 \times 3$ kernel; note the difference between $DW_1$ and $DW_2$: $DW_1$ reduces the size of the input $X$ by a factor of $s$, whereas $DW_2$ does not change the size or the number of channels of the input $X$, so more channel information is retained. The self-attention over $Q$, $K$ and $V$ is then computed by the following formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{2}$$

Finally, the output is passed through a linear transformation and added to $X$ to obtain the final output. The channel interaction module is similar in design to the SE layer, but they mainly differ in two points: (1) the inputs to the modules are different: the two depthwise convolutions do not share weights, and the input of the channel interaction comes from a separate parallel branch; (2) the present application applies channel interaction only to the computation of the local-window self-attention value $V$, rather than to the output of the module as in the SE layer.
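To make the data flow of this module concrete, the following is a minimal, single-head PyTorch-style sketch written against the shapes defined above. It is an illustration under stated assumptions, not the patented implementation: window partitioning, multi-head splitting and the exact hyperparameters are omitted, and the class name ReducedWindowSelfAttention and the reduction ratio dim // 4 inside the channel gate are illustrative choices.

```python
import torch
import torch.nn as nn


class ReducedWindowSelfAttention(nn.Module):
    """Illustrative sketch of the improved self-attention (not the patented code).

    Q is a linear projection of all tokens; K and V come from tokens spatially reduced by a
    stride-s depthwise convolution (DW_1); V is additionally gated by a channel-interaction
    branch (DW_2 -> GAP -> 1x1 conv -> BN -> SiLU -> 1x1 conv -> Sigmoid), as in equation (1).
    """

    def __init__(self, dim: int, reduce_stride: int = 2):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.dw1 = nn.Conv2d(dim, dim, kernel_size=reduce_stride,
                             stride=reduce_stride, groups=dim)               # DW_1: s-fold reduction
        self.norm = nn.LayerNorm(dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.dw2 = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # DW_2: size-preserving
        self.channel_gate = nn.Sequential(       # GAP -> 1x1 conv -> BN -> SiLU -> 1x1 conv -> Sigmoid
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // 4, kernel_size=1),
            nn.BatchNorm2d(dim // 4),
            nn.SiLU(),
            nn.Conv2d(dim // 4, dim, kernel_size=1),
            nn.Sigmoid(),
        )
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # x: (B, n, d_m) with n = H * W tokens of dimension d_m
        B, n, d = x.shape
        q = self.q_proj(x)                                        # (B, n, d)
        x_spatial = x.transpose(1, 2).reshape(B, d, H, W)         # (d_m, H, W) layout
        reduced = self.dw1(x_spatial)                             # (B, d, H/s, W/s)
        tokens = self.norm(reduced.flatten(2).transpose(1, 2))    # (B, n', d), n' = HW / s^2
        k, v = self.k_proj(tokens), self.v_proj(tokens)
        gate = self.channel_gate(self.dw2(x_spatial))             # (B, d, 1, 1) channel attention
        v = v * gate.flatten(1).unsqueeze(1)                      # gate V along the channel dimension
        attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # equation (2)
        return x + self.out_proj(attn @ v)                        # linear transform + residual with X


# Example: a 7 x 7 window of 96-dimensional tokens
out = ReducedWindowSelfAttention(dim=96)(torch.randn(2, 49, 96), H=7, W=7)
```

In this sketch the n query tokens attend to only n' = n / s^2 reduced tokens, which is where the complexity saving of the depthwise reduction comes from, while the Sigmoid gate from the parallel DW_2 branch rescales the channels of V as in equation (1).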
The loss function employed in the present application is as follows:
(1) RPN classification loss and cascade detection head classification loss. A multi-class cross entropy loss function is used; the goal of bounding box classification is to assign one of $C + 1$ class labels to each bounding box, represented by a probability $p$, where $C$ counts the object categories and the extra class is the background. For a training sample $x_i$ with label $y_i$, the multi-class cross entropy loss function is as in equation (3):

$$L_{cls}(p, y_i) = -\sum_{j=0}^{C} W_j \, \mathbb{1}[y_i = j] \log p_j \tag{3}$$

where $W_j$ is a per-class weighting coefficient given by equation (4).
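As a concrete reference for equation (3), the weighted multi-class cross entropy can be sketched in PyTorch as follows; the per-class weights W_j are passed in externally because their definition (equation (4)) is specific to the patent, and the function and variable names are illustrative:

```python
import torch
import torch.nn.functional as F


def rpn_classification_loss(logits: torch.Tensor,
                            labels: torch.Tensor,
                            class_weights: torch.Tensor) -> torch.Tensor:
    """Weighted multi-class cross entropy over C + 1 classes (one class is the background).

    logits:        (N, C + 1) raw scores for N candidate boxes
    labels:        (N,) integer class labels in [0, C]
    class_weights: (C + 1,) per-class weights W_j, as given by equation (4)
    """
    return F.cross_entropy(logits, labels, weight=class_weights)


# Example with 2 foreground classes + background and uniform weights
logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
loss = rpn_classification_loss(logits, labels, torch.ones(3))
```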
(2) RPN bounding box regression loss. The goal of bounding box regression is to use a regression function to regress the candidate bounding box $b = (b_x, b_y, b_w, b_h)$ toward the target bounding box $g = (g_x, g_y, g_w, g_h)$ by minimizing the loss function $L_{loc}(b_i, g_i)$:

$$L_{loc} = \frac{1}{N_{reg}} \sum_i p_i^{*} \, \mathrm{Smooth}_{L_1}(b_i - g_i) \tag{5}$$

where the $\mathrm{Smooth}_{L_1}$ loss is defined as:

$$\mathrm{Smooth}_{L_1}(x) = \begin{cases} 0.5x^{2}, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \tag{7}$$

Here $N_{reg}$ denotes the number of anchor locations, $p_i^{*}$ is 1 when the candidate box is a positive sample and 0 when it is a negative sample, $b_i$ represents the bounding box regression parameters predicted for the i-th anchor, and $g_i$ represents the ground-truth box corresponding to the i-th anchor.
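A minimal sketch of the RPN regression loss of equations (5) and (7), assuming the predicted and target box offsets and the per-anchor indicator p_i^* are already available (all names are illustrative):

```python
import torch


def smooth_l1(x: torch.Tensor) -> torch.Tensor:
    """Equation (7): 0.5 * x^2 for |x| < 1, |x| - 0.5 otherwise."""
    abs_x = x.abs()
    return torch.where(abs_x < 1.0, 0.5 * x ** 2, abs_x - 0.5)


def rpn_regression_loss(pred_deltas: torch.Tensor,      # (N, 4) predicted offsets b_i
                        target_deltas: torch.Tensor,    # (N, 4) ground-truth offsets g_i
                        positive_mask: torch.Tensor,    # (N,) p_i^* as 0/1 floats
                        num_anchor_locations: int       # N_reg
                        ) -> torch.Tensor:
    per_box = smooth_l1(pred_deltas - target_deltas).sum(dim=1)
    return (positive_mask * per_box).sum() / num_anchor_locations


# Example: 6 anchors, half of them positive
loss = rpn_regression_loss(torch.randn(6, 4), torch.randn(6, 4),
                           torch.tensor([1., 0., 1., 0., 1., 0.]), num_anchor_locations=6)
```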
(3) Cascade detection head bounding box regression loss. Directly increasing the weight of the localization loss (i.e., the regression loss) would make the model more sensitive to some abnormal localization values. After adding a gradient limit to the derivative of the Smooth $L_1$ loss, the gradient of the balanced $L_1$ loss can be defined as follows:

$$\frac{\partial L_b}{\partial x} = \begin{cases} \alpha \ln(b|x| + 1), & \text{if } |x| < 1 \\ \gamma, & \text{otherwise} \end{cases} \tag{8}$$

where $\alpha$ controls the gradient contribution of the small-error samples and $\gamma$ is the upper limit of the error of the outliers. The balanced $L_1$ loss is then:

$$L_b(x) = \begin{cases} \dfrac{\alpha}{b}(b|x| + 1)\ln(b|x| + 1) - \alpha|x|, & \text{if } |x| < 1 \\ \gamma|x| + C, & \text{otherwise} \end{cases} \tag{9}$$

where $b$ is used to guarantee that the two branches of equation (9) take the same value at $|x| = 1$ and $C$ is a constant; the parameters satisfy the following condition:

$$\alpha \ln(b + 1) = \gamma \tag{10}$$

where $\alpha$ and $\gamma$ are hyperparameters with default values 0.5 and 1.5; a small $\alpha$ makes the back-propagated gradient of the small-error samples larger, and $\gamma$ adjusts the upper bound of the regression error so that the back-propagated gradient does not exceed $\gamma$.
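For reference, equations (8) to (10) can be turned into the following sketch; the defaults α = 0.5 and γ = 1.5 follow the text, b is solved from α·ln(b + 1) = γ, and the constant C is chosen so that the two branches of equation (9) meet at |x| = 1, consistent with the role of b stated above:

```python
import math
import torch


def balanced_l1_loss(pred: torch.Tensor,
                     target: torch.Tensor,
                     alpha: float = 0.5,
                     gamma: float = 1.5) -> torch.Tensor:
    """Balanced L1 loss of equation (9); b is fixed by alpha * ln(b + 1) = gamma (equation (10))."""
    b = math.exp(gamma / alpha) - 1.0
    x = (pred - target).abs()
    inlier = alpha / b * (b * x + 1.0) * torch.log(b * x + 1.0) - alpha * x
    c = gamma / b - alpha            # makes the two branches of equation (9) equal at |x| = 1
    outlier = gamma * x + c
    return torch.where(x < 1.0, inlier, outlier).sum()


# Example: per-coordinate loss for 4 predicted boxes
loss = balanced_l1_loss(torch.randn(4, 4), torch.randn(4, 4))
```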
(4) Total loss. The total loss includes the classification loss and regression loss of the RPN stage and the classification losses and regression losses of the three detection stages. The balanced $L_1$ loss is applied to the first, second and third stages of the cascade detection head, and corresponding weights are assigned to the losses of the three stages and the RPN loss. The total loss function can be written as:

$$L = aL_{RPN} + bL_{stage1} + cL_{stage2} + dL_{stage3} \tag{11}$$

where:

$$L_{RPN} = L_{RPN\_cls} + L_{RPN\_reg}, \qquad L_{stage\,i} = L_{stage\,i\_cls} + L_{stage\,i\_reg}, \; i = 1, 2, 3 \tag{12}$$

with $a = 1$, $b = 0.75$, $c = 0.5$ and $d = 0.25$, where $a$, $b$, $c$ and $d$ are the loss weighting coefficients. $L_{RPN\_cls}$ denotes the RPN classification loss and $L_{RPN\_reg}$ the RPN regression loss. $L_{stage1}$, $L_{stage2}$ and $L_{stage3}$ denote the total losses of the three stages, $L_{stage1\_cls}$, $L_{stage2\_cls}$ and $L_{stage3\_cls}$ the classification losses of each stage, and $L_{stage1\_reg}$, $L_{stage2\_reg}$ and $L_{stage3\_reg}$ the regression losses of each stage.
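The way the weighted total loss of equations (11) and (12) is assembled can be sketched as follows; the individual loss terms are assumed to be scalar tensors already computed by the RPN and the three cascade stages, and the function name is illustrative:

```python
import torch


def total_detection_loss(rpn_cls: torch.Tensor, rpn_reg: torch.Tensor,
                         stage_cls: list, stage_reg: list,
                         a: float = 1.0, b: float = 0.75,
                         c: float = 0.5, d: float = 0.25) -> torch.Tensor:
    """Equations (11)-(12): L = a*L_RPN + b*L_stage1 + c*L_stage2 + d*L_stage3."""
    l_rpn = rpn_cls + rpn_reg                                          # RPN cls + reg
    l_stages = [cls + reg for cls, reg in zip(stage_cls, stage_reg)]   # per-stage cls + reg
    return a * l_rpn + b * l_stages[0] + c * l_stages[1] + d * l_stages[2]


# Example with dummy scalar losses
total = total_detection_loss(torch.tensor(0.3), torch.tensor(0.2),
                             [torch.tensor(0.4)] * 3, [torch.tensor(0.1)] * 3)
```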
Compared with the prior art, the technical scheme has the following advantages:
the utility model provides a practical self-attention method, adopt the channel interaction module to the calculation of V value can solve the problem that local window lacks direction perception and positional information from the attention, reduces the size of input token through the degree of depth convolution, reduces the complexity from the attention calculation. With balanced L 1 Loss and weights lost in different stages are configured in a total loss function to solve the problem of gradient imbalance of simple samples and difficult samples, the improved method has an excellent effect in personnel detection of the machine room, the maintenance efficiency of the unattended machine room is improved, and the normal, safe and reliable operation of the machine room is guaranteed.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is an overall framework diagram of the algorithm;
fig. 2 is a backbone network.
Detailed Description
Based on the above technical problems of the Swin Transformer, the present application provides a machine room personnel detection method, specifically a machine room personnel detection method based on an image layered vision Transformer.
The application provides a practical self-attention method: the channel interaction module used in the computation of the value V addresses the local-window self-attention's lack of direction perception and position information, and reducing the size of the input tokens through depthwise convolution lowers the complexity of the self-attention computation. A balanced L1 loss and stage-specific loss weights in the total loss function address the gradient imbalance between easy and hard samples. The improved method performs excellently in machine room personnel detection, improves the maintenance efficiency of unattended machine rooms, and helps guarantee the normal, safe and reliable operation of the machine room.
The application provides a computer room personnel detection method based on image layered vision Transformer, which specifically comprises the following steps:
the Network structure of the detection method consists of four parts, including a Swin-T backbone Network, a Feature Pyramid (FPN), an area Proposal Network (RPN) and a cascade detection head. As shown in fig. 1, Swin Transformer is used to extract image features, and FPN is mainly used to extract multi-scale features. The RPN is a combination Of several convolutional layers that produces a Region Of Interest (ROI) where an object may be present. And the cascade detection head classifies and positions the region of interest and outputs a final detection result. In the cascade detection head, FC is the full connectivity layer, C is the classification probability, and B is the regression of the candidate box.
The backbone network used herein is shown in FIG. 2(a). First, the picture is fed into a patch partition module for splitting: every 4 × 4 adjacent pixels form one patch, which is then flattened along the channel direction. Assuming an RGB three-channel picture is input, each patch contains 4 × 4 = 16 pixels, and each pixel has three values (R, G, B), so a flattened patch has 16 × 3 = 48 values; the image shape after patch partition therefore changes from (H, W, 3) to (H/4, W/4, 48). A linear embedding layer then linearly transforms the channel dimension of each patch from 48 to 96, i.e. the feature map changes from (H/4, W/4, 48) to (H/4, W/4, 96). Feature maps of different sizes are then constructed through four stages; stage 1 first passes through the linear embedding layer, while the remaining three stages first downsample through a patch merging layer. Each stage then stacks Swin Transformer blocks, noting that these blocks come in two configurations. As shown in FIG. 2(b), a standard window-based multi-head self-attention (W-MSA) module and a shifted-window multi-head self-attention (SW-MSA) module are used in series within a pair of Swin Transformer blocks. Each (S)W-MSA module is preceded by a layer normalization (LN) and followed by another LN and a two-layer MLP with GELU non-linear activation. The connection of W-MSA and SW-MSA can be expressed by the following formulas, where $\hat{z}^{l}$ is the output of the (S)W-MSA module, $z^{l}$ is the output of the MLP of block $l$, and $l-1$, $l$ and $l+1$ index consecutive blocks:

$$\hat{z}^{l} = \text{W-MSA}(\mathrm{LN}(z^{l-1})) + z^{l-1}$$
$$z^{l} = \mathrm{MLP}(\mathrm{LN}(\hat{z}^{l})) + \hat{z}^{l}$$
$$\hat{z}^{l+1} = \text{SW-MSA}(\mathrm{LN}(z^{l})) + z^{l}$$
$$z^{l+1} = \mathrm{MLP}(\mathrm{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}$$
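The patch partition and linear embedding step described above can be sketched as follows; implementing the 4 × 4 non-overlapping partition as a stride-4 convolution is an equivalent formulation (the 48 → 96 dimensions follow the text, everything else is an illustrative assumption):

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """4x4 patch partition + linear embedding: (B, 3, H, W) -> (B, H/4 * W/4, 96)."""

    def __init__(self, in_channels: int = 3, patch_size: int = 4, embed_dim: int = 96):
        super().__init__()
        # A kernel-4, stride-4 convolution is equivalent to: split into 4x4 patches,
        # flatten each patch to 4 * 4 * 3 = 48 values, then apply a 48 -> 96 linear layer.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # (B, 96, H/4, W/4)
        x = x.flatten(2).transpose(1, 2)      # (B, H/4 * W/4, 96)
        return self.norm(x)


# e.g. a 224x224 RGB image -> 56 * 56 = 3136 tokens of dimension 96
tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
```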
The present application designs a practical self-attention module that adds two key designs to the standard window-based multi-head self-attention module (W-MSA): (1) a self-attention mechanism with a smaller amount of computation is designed, reducing the computational complexity of self-attention; (2) considering that convolutional layers are intended to model local relations, a channel interaction module is added, and parallel depthwise convolution (global computation) and local-window-based self-attention computation are used to address the lack of direction perception and position information in local-window self-attention. These two key points are combined to construct the improved self-attention module described next. As shown in FIG. 2(c), a query $Q \in \mathbb{R}^{n \times d_k}$ is obtained from the input $X \in \mathbb{R}^{n \times d_m}$ by linear projection, where $n = H \times W$. The input $X$ is reshaped into a spatial tensor of shape $(d_m, H, W)$, and its size is reduced by a depthwise convolution with an $s \times s$ kernel and stride $s$, so the token shape changes from $(d_m, H, W)$ to $(d_m, H/s, W/s)$, i.e. the height and width are reduced by a factor of $s$. The key and value $K, V \in \mathbb{R}^{n' \times d_k}$ are then obtained by linear transformation. Here $X$ is the input token tensor, $n$ is the number of tokens, $H$ and $W$ are the numbers of patches along the height and width of the input image, $d_m$ is the embedding dimension of each image patch, $d_k$ is the dimension of the query, key and value vectors, and $n' = (H/s) \times (W/s)$ is the reduced number of tokens.
For the value $V$, a channel interaction module is added to the computation. Inspired by channel attention (SE), the channel interaction branch consists of a depthwise convolution $DW_2$, a global average pooling layer (GAP), then two consecutive $1 \times 1$ convolution layers with batch normalization (BN) and a SiLU activation between them. Finally, a Sigmoid is used to generate attention in the channel dimension. $V$ is computed as follows:

$$V = FC(LN(DW_1(X))) \cdot \mathrm{Sigmoid}(conv(\mathrm{SiLU}(BN(conv(GAP(DW_2(X))))))) \tag{1}$$

where $FC$ is a fully connected layer, $LN$ is layer normalization, $BN$ is batch normalization, $DW_1$ and $DW_2$ are depthwise convolutions, $X$ is the input token tensor, $conv$ is a $1 \times 1$ convolution, and $GAP$ is global average pooling. This finally yields $V \in \mathbb{R}^{n' \times d_k}$. $DW_2$ is a depthwise convolution with a $3 \times 3$ kernel; note the difference between $DW_1$ and $DW_2$: $DW_1$ reduces the size of the input $X$ by a factor of $s$, whereas $DW_2$ does not change the size or the number of channels of the input $X$, so more channel information is retained. The self-attention over $Q$, $K$ and $V$ is then computed by the following formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{2}$$

Finally, the output is passed through a linear transformation and added to $X$ to obtain the final output. The channel interaction module is similar in design to the SE layer, but they mainly differ in two points: 1. The inputs to the modules are different: the two depthwise convolutions do not share weights, and the input of the channel interaction comes from a separate parallel branch. 2. The present application applies channel interaction only to the computation of the local-window self-attention value $V$, rather than to the output of the module as in the SE layer.
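To show the layer ordering of the channel interaction branch in equation (1) explicitly, here is a hedged stand-alone sketch of that branch and of how its output gates V; the reduction ratio and all names are assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn


class ChannelInteraction(nn.Module):
    """SE-like branch of equation (1): DW_2 -> GAP -> 1x1 conv -> BN -> SiLU -> 1x1 conv -> Sigmoid.

    Unlike a standard SE layer, the resulting gate rescales only the value tensor V of the
    local-window self-attention, not the module output. The reduction ratio is an assumed value.
    """

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.dw2 = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # DW_2, size-preserving
        self.gap = nn.AdaptiveAvgPool2d(1)                                    # GAP
        self.fc1 = nn.Conv2d(dim, dim // reduction, kernel_size=1)            # first 1x1 conv
        self.bn = nn.BatchNorm2d(dim // reduction)                            # BN
        self.act = nn.SiLU()                                                  # SiLU activation
        self.fc2 = nn.Conv2d(dim // reduction, dim, kernel_size=1)            # second 1x1 conv
        self.gate = nn.Sigmoid()                                              # channel attention

    def forward(self, x_spatial: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # x_spatial: (B, d_m, H, W) input tokens in spatial layout; v: (B, n', d_m) value tokens
        w = self.gate(self.fc2(self.act(self.bn(self.fc1(self.gap(self.dw2(x_spatial)))))))
        return v * w.flatten(1).unsqueeze(1)   # broadcast the per-channel gate over the tokens


# Example: gate 9 reduced value tokens of dimension 96 using a 7x7 spatial input
v_gated = ChannelInteraction(dim=96)(torch.randn(2, 96, 7, 7), torch.randn(2, 9, 96))
```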
The loss function employed in the present application is as follows:
(1) RPN classification loss and cascade detection head classification loss. A multi-class cross entropy loss function is used; the goal of bounding box classification is to assign one of $C + 1$ class labels to each bounding box, represented by a probability $p$, where $C$ counts the object categories and the extra class is the background. For a training sample $x_i$ with label $y_i$, the multi-class cross entropy loss function is as in equation (3):

$$L_{cls}(p, y_i) = -\sum_{j=0}^{C} W_j \, \mathbb{1}[y_i = j] \log p_j \tag{3}$$

where $W_j$ is a per-class weighting coefficient given by equation (4).
(2) RPN bounding box regression loss. The goal of bounding box regression is to use a regression function to regress the candidate bounding box $b = (b_x, b_y, b_w, b_h)$ toward the target bounding box $g = (g_x, g_y, g_w, g_h)$ by minimizing the loss function $L_{loc}(b_i, g_i)$:

$$L_{loc} = \frac{1}{N_{reg}} \sum_i p_i^{*} \, \mathrm{Smooth}_{L_1}(b_i - g_i) \tag{5}$$

where the $\mathrm{Smooth}_{L_1}$ loss is defined as:

$$\mathrm{Smooth}_{L_1}(x) = \begin{cases} 0.5x^{2}, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \tag{7}$$

Here $N_{reg}$ denotes the number of anchor locations, $p_i^{*}$ is 1 when the candidate box is a positive sample and 0 when it is a negative sample, $b_i$ represents the bounding box regression parameters predicted for the i-th anchor, and $g_i$ represents the ground-truth box corresponding to the i-th anchor.
(3) Cascade detection head bounding box regression loss. Directly increasing the weight of the localization loss (i.e., the regression loss) would make the model more sensitive to some abnormal localization values. After adding a gradient limit to the derivative of the Smooth $L_1$ loss, the gradient of the balanced $L_1$ loss can be defined as follows:

$$\frac{\partial L_b}{\partial x} = \begin{cases} \alpha \ln(b|x| + 1), & \text{if } |x| < 1 \\ \gamma, & \text{otherwise} \end{cases} \tag{8}$$

where $\alpha$ controls the gradient contribution of the small-error samples and $\gamma$ is the upper limit of the error of the outliers. The balanced $L_1$ loss is then:

$$L_b(x) = \begin{cases} \dfrac{\alpha}{b}(b|x| + 1)\ln(b|x| + 1) - \alpha|x|, & \text{if } |x| < 1 \\ \gamma|x| + C, & \text{otherwise} \end{cases} \tag{9}$$

where $b$ is used to guarantee that the two branches of equation (9) take the same value at $|x| = 1$ and $C$ is a constant; the parameters satisfy the following condition:

$$\alpha \ln(b + 1) = \gamma \tag{10}$$

where $\alpha$ and $\gamma$ are hyperparameters with default values 0.5 and 1.5; a small $\alpha$ makes the back-propagated gradient of the small-error samples larger, and $\gamma$ adjusts the upper bound of the regression error so that the back-propagated gradient does not exceed $\gamma$.
(4) Total loss. The total loss includes the classification loss and regression loss of the RPN stage and the classification losses and regression losses of the three detection stages. The balanced $L_1$ loss is applied to the first, second and third stages of the cascade detection head, and corresponding weights are assigned to the losses of the three stages and the RPN loss. The total loss function can be written as:

$$L = aL_{RPN} + bL_{stage1} + cL_{stage2} + dL_{stage3} \tag{11}$$

where:

$$L_{RPN} = L_{RPN\_cls} + L_{RPN\_reg}, \qquad L_{stage\,i} = L_{stage\,i\_cls} + L_{stage\,i\_reg}, \; i = 1, 2, 3 \tag{12}$$

with $a = 1$, $b = 0.75$, $c = 0.5$ and $d = 0.25$, where $a$, $b$, $c$ and $d$ are the loss weighting coefficients. $L_{RPN\_cls}$ denotes the RPN classification loss and $L_{RPN\_reg}$ the RPN regression loss. $L_{stage1}$, $L_{stage2}$ and $L_{stage3}$ denote the total losses of the three stages, $L_{stage1\_cls}$, $L_{stage2\_cls}$ and $L_{stage3\_cls}$ the classification losses of each stage, and $L_{stage1\_reg}$, $L_{stage2\_reg}$ and $L_{stage3\_reg}$ the regression losses of each stage.
In order to verify the performance of the proposed method for machine room personnel detection, it is compared with currently popular object detection algorithms. The detection performance of DETR based on ResNet50, Deformable DETR based on ResNet50, YOLOX-x, RetinaNet based on ResNeXt101 with FPN, YOLOF based on ResNet50, the Swin Transformer-T cascade algorithm and other algorithms is tested, as shown in Table 1. First, considering the mAP value at IoU = 0.5, the detection accuracy of the Swin Transformer is better than that of DETR, Deformable DETR and YOLOF, and is 0.05 points higher than that of RetinaNet, a one-stage algorithm. This suggests that two-stage algorithms achieve higher detection accuracy in the field of machine room personnel detection than end-to-end and one-stage algorithms, although this does not hold for YOLOX-x, since YOLOX is mainly an integrated detection algorithm that combines various techniques; the detection accuracy of YOLOX-x is better than that of the Swin Transformer. The detection accuracy of Deformable DETR is 1.4 points higher than that of DETR. The mAP@0.5 of the detection network based on the improved Swin Transformer is 89.7%, which is 3.2 points higher than the Swin Transformer-T detection accuracy, and the detection accuracy of the improved algorithm is 0.6 points higher than that of YOLOX-x. Second, the algorithm improved herein is second only to YOLOX in detecting small objects, and the DETR-series algorithms perform the worst in detecting small targets; Deformable DETR has higher detection performance on small objects than DETR. Apart from YOLOX, the two-stage detection algorithms are significantly better than the one-stage algorithms in detecting small objects. Finally, in terms of model complexity, to ensure fairness the input size of all object detection networks is set to (3, 1280, 800); compared with the Swin Transformer, GFLOPs are reduced by 5.43G and the parameter count is increased by 3.42M. The improved method therefore brings a substantial improvement in detection accuracy over the Swin Transformer.
TABLE 1 comparison of test results for different models
The application also tested the detection performance of each improvement separately, measured by mAP@0.5. The experimental results are shown in Table 2: using the improved Swin Transformer module (ISTB) raises the detection accuracy by 1.83 points, and using the balanced loss function (BLOSS) raises the detection accuracy by 1.4 points. Therefore, using the improvements together increases the detection accuracy of the Swin Transformer.
TABLE 2 results of the experiments using the improved method in combination
The parts of this description are presented in a progressive manner; each part focuses on what differs from the other parts, and identical or similar parts can be understood by cross-reference.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (3)

1. A computer room personnel detection method based on image layered vision, characterized in that the network structure of the method consists of four parts: a Swin Transformer backbone network, a Feature Pyramid Network (FPN), a Region Proposal Network (RPN) and a cascade detection head; the Swin Transformer is used to extract image features related to machine room personnel, the FPN is mainly used to extract multi-scale features, the RPN is a combination of several convolutional layers and generates regions of interest (ROIs) where objects may exist, and the cascade detection head classifies and localizes the regions of interest and outputs the final detection result; in the cascade detection head, FC is a fully connected layer, C is the classification probability, and B is the regression of the candidate box;
two key designs are added to the standard window-based multi-head self-attention (W-MSA) module in the Swin Transformer backbone network: (1) a self-attention mechanism with a smaller amount of computation is designed, reducing the computational complexity of self-attention; (2) considering that the convolutional layer is intended to model local relations, a channel interaction module is added, and parallel depthwise convolution (global computation) and local-window-based self-attention computation are used to address the lack of direction perception and position information in local-window self-attention;
in addition, a balanced $L_1$ loss is used and the weights of the losses at different stages are configured in the total loss function to address the gradient imbalance between easy and hard samples;
the method comprises the following steps of designing a self-attention mechanism with smaller calculation amount and reduced self-attention mechanism calculation complexity, and specifically comprises the following steps:
by pairing input tokens
Figure FDA0003645718570000011
Linear projection derived query
Figure FDA0003645718570000012
Where n is H × W, and then input
Figure FDA0003645718570000013
Reshaped into a space vector (d) m H, W) by a deep convolution (DW) with a convolution kernel size s × s and a step size s 1 ) To reduce the size of the input X, the size of the token is reduced by (d) m H, W) is changed to
Figure FDA0003645718570000014
The height and the width are reduced by s times and are obtained by linear conversion
Figure FDA0003645718570000015
Figure FDA0003645718570000016
Reducing the complexity of self-attention computation by reducing the size of the input token by deep convolution; where X is the input token, n is the number of blocks, H is the number of blocks of the high-direction image of the input image, W is the number of blocks of the wide-direction image of the input image, d m Is the embedding dimension of each image block, the embedding dimension of the query vector dimension, the key vector and the value vector is d k And n' is the number of blocks.
2. The method according to claim 1, characterized in that, considering that the convolutional layer is intended to model local relations, the channel interaction module is added and parallel depthwise convolution (global computation) and local-window-based self-attention computation are used to address the lack of direction perception and position information in local-window self-attention, specifically comprising:
for the value $V$, a channel interaction module is added to the computation; inspired by channel attention (SE), the channel interaction consists of a depthwise convolution, a global average pooling layer (GAP), then two consecutive $1 \times 1$ convolution layers with batch normalization (BN) and a SiLU activation function between them; finally, a Sigmoid is used to generate attention in the channel dimension; $V$ is computed as follows:

$$V = FC(LN(DW_1(X))) \cdot \mathrm{Sigmoid}(conv(\mathrm{SiLU}(BN(conv(GAP(DW_2(X))))))) \tag{1}$$

where $FC$ is a fully connected layer, $LN$ is layer normalization, $BN$ is batch normalization, $DW_1$ and $DW_2$ are depthwise convolutions, $X$ is the input token tensor, $conv$ is a $1 \times 1$ convolution, and $GAP$ is global average pooling; this finally yields $V \in \mathbb{R}^{n' \times d_k}$; $DW_2$ is a depthwise convolution with a $3 \times 3$ kernel; note the difference between $DW_1$ and $DW_2$: $DW_1$ reduces the size of the input $X$ by a factor of $s$, whereas $DW_2$ does not change the size or the number of channels of the input $X$, so more channel information is retained; the self-attention over $Q$, $K$ and $V$ is then computed by the following formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{2}$$

finally, the output is passed through a linear transformation and added to $X$ to obtain the final output; the channel interaction module is similar in design to the SE layer, but differs mainly in two points: first, the inputs to the modules are different, the two depthwise convolutions do not share weights, and the input of the channel interaction comes from a separate parallel branch; second, the channel interaction is applied only to the computation of the local-window self-attention value $V$, rather than to the output of the module as in the SE layer.
3. The method according to claim 1, characterized in that the balanced $L_1$ loss is used and the weights of the losses at different stages are configured in the total loss function to address the gradient imbalance between easy and hard samples, specifically comprising:
(1) RPN classification loss and cascade detection head classification loss: a multi-class cross entropy loss function is used; the goal of bounding box classification is to assign one of $C + 1$ class labels to each bounding box, represented by a probability $p$, where $C$ counts the object categories and the extra class is the background; for a training sample $x_i$ with label $y_i$, the multi-class cross entropy loss function is as in equation (3):

$$L_{cls}(p, y_i) = -\sum_{j=0}^{C} W_j \, \mathbb{1}[y_i = j] \log p_j \tag{3}$$

where $W_j$ is a per-class weighting coefficient given by equation (4);
(2) RPN bounding box regression loss: bounding box regression aims to use a regression function to regress the candidate bounding box $b = (b_x, b_y, b_w, b_h)$ toward the target bounding box $g = (g_x, g_y, g_w, g_h)$ by minimizing the loss function $L_{loc}(b_i, g_i)$:

$$L_{loc} = \frac{1}{N_{reg}} \sum_i p_i^{*} \, \mathrm{Smooth}_{L_1}(b_i - g_i) \tag{5}$$

where the $\mathrm{Smooth}_{L_1}$ loss is defined as:

$$\mathrm{Smooth}_{L_1}(x) = \begin{cases} 0.5x^{2}, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \tag{7}$$

where $N_{reg}$ denotes the number of anchor locations, $p_i^{*}$ is 1 when the candidate box is a positive sample and 0 when it is a negative sample, $b_i$ represents the bounding box regression parameters predicted for the i-th anchor, and $g_i$ represents the ground-truth box corresponding to the i-th anchor;
(3) cascade detection head bounding box regression loss: directly increasing the weight of the localization loss (i.e., the regression loss) would make the model more sensitive to some abnormal localization values; after adding a gradient limit to the derivative of the Smooth $L_1$ loss, the gradient of the balanced $L_1$ loss can be defined as follows:

$$\frac{\partial L_b}{\partial x} = \begin{cases} \alpha \ln(b|x| + 1), & \text{if } |x| < 1 \\ \gamma, & \text{otherwise} \end{cases} \tag{8}$$

where $\alpha$ controls the gradient contribution of the small-error samples and $\gamma$ is the upper limit of the error of the outliers; the balanced $L_1$ loss is then:

$$L_b(x) = \begin{cases} \dfrac{\alpha}{b}(b|x| + 1)\ln(b|x| + 1) - \alpha|x|, & \text{if } |x| < 1 \\ \gamma|x| + C, & \text{otherwise} \end{cases} \tag{9}$$

where the parameter $b$ is used to guarantee that the two branches of equation (9) take the same value at $|x| = 1$ and $C$ is a constant; the parameters satisfy the following condition:

$$\alpha \ln(b + 1) = \gamma \tag{10}$$

where $\alpha$ and $\gamma$ are hyperparameters with default values 0.5 and 1.5; a small $\alpha$ makes the back-propagated gradient of the small-error samples larger, and $\gamma$ adjusts the upper bound of the regression error so that the back-propagated gradient does not exceed $\gamma$;
(4) total loss: the total loss includes the classification loss and regression loss of the RPN stage and the classification losses and regression losses of the three stages; the balanced $L_1$ loss is applied to the first, second and third stages of the cascade detection head, and corresponding weights are assigned to the losses of the three stages and the RPN loss; the total loss function can be written as:

$$L = aL_{RPN} + bL_{stage1} + cL_{stage2} + dL_{stage3} \tag{11}$$

where:

$$L_{RPN} = L_{RPN\_cls} + L_{RPN\_reg}, \qquad L_{stage\,i} = L_{stage\,i\_cls} + L_{stage\,i\_reg}, \; i = 1, 2, 3 \tag{12}$$

with $a = 1$, $b = 0.75$, $c = 0.5$ and $d = 0.25$, where $a$, $b$, $c$ and $d$ are the loss weighting coefficients; $L_{RPN\_cls}$ denotes the RPN classification loss and $L_{RPN\_reg}$ the RPN regression loss; $L_{stage1}$, $L_{stage2}$ and $L_{stage3}$ denote the total losses of the three stages, $L_{stage1\_cls}$, $L_{stage2\_cls}$ and $L_{stage3\_cls}$ the classification losses of each stage, and $L_{stage1\_reg}$, $L_{stage2\_reg}$ and $L_{stage3\_reg}$ the regression losses of each stage.
CN202210529776.7A 2022-05-16 2022-05-16 Machine room personnel detection method based on image layered vision Pending CN115063833A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210529776.7A CN115063833A (en) 2022-05-16 2022-05-16 Machine room personnel detection method based on image layered vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210529776.7A CN115063833A (en) 2022-05-16 2022-05-16 Machine room personnel detection method based on image layered vision

Publications (1)

Publication Number Publication Date
CN115063833A true CN115063833A (en) 2022-09-16

Family

ID=83198297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210529776.7A Pending CN115063833A (en) 2022-05-16 2022-05-16 Machine room personnel detection method based on image layered vision

Country Status (1)

Country Link
CN (1) CN115063833A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631513A (en) * 2022-11-10 2023-01-20 杭州电子科技大学 Multi-scale pedestrian re-identification method based on Transformer
CN116740790A (en) * 2023-06-21 2023-09-12 北京科技大学 Face detection method and device based on transducer

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259930A (en) * 2020-01-09 2020-06-09 南京信息工程大学 General target detection method of self-adaptive attention guidance mechanism
CN112949673A (en) * 2019-12-11 2021-06-11 四川大学 Feature fusion target detection and identification method based on global attention
CN113888744A (en) * 2021-10-14 2022-01-04 浙江大学 Image semantic segmentation method based on Transformer visual upsampling module
US11270124B1 (en) * 2020-11-16 2022-03-08 Branded Entertainment Network, Inc. Temporal bottleneck attention architecture for video action recognition
CN114241307A (en) * 2021-12-09 2022-03-25 中国电子科技集团公司第五十四研究所 Synthetic aperture radar aircraft target identification method based on self-attention network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949673A (en) * 2019-12-11 2021-06-11 四川大学 Feature fusion target detection and identification method based on global attention
CN111259930A (en) * 2020-01-09 2020-06-09 南京信息工程大学 General target detection method of self-adaptive attention guidance mechanism
WO2021139069A1 (en) * 2020-01-09 2021-07-15 南京信息工程大学 General target detection method for adaptive attention guidance mechanism
US11270124B1 (en) * 2020-11-16 2022-03-08 Branded Entertainment Network, Inc. Temporal bottleneck attention architecture for video action recognition
CN113888744A (en) * 2021-10-14 2022-01-04 浙江大学 Image semantic segmentation method based on Transformer visual upsampling module
CN114241307A (en) * 2021-12-09 2022-03-25 中国电子科技集团公司第五十四研究所 Synthetic aperture radar aircraft target identification method based on self-attention network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
严娟; 方志军; 高永彬: "3D object detection combining mixed-domain attention and dilated convolution", Journal of Image and Graphics (中国图象图形学报), no. 06, 16 June 2020 (2020-06-16) *
周幸; 陈立福: "Remote sensing image object detection based on a dual attention mechanism", Computer and Modernization (计算机与现代化), no. 08, 15 August 2020 (2020-08-15) *
宁尚明; 滕飞; 李天瑞: "Entity relation extraction from electronic medical records based on a multi-channel self-attention mechanism", Chinese Journal of Computers (计算机学报), no. 05, 15 May 2020 (2020-05-15) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631513A (en) * 2022-11-10 2023-01-20 杭州电子科技大学 Multi-scale pedestrian re-identification method based on Transformer
CN115631513B (en) * 2022-11-10 2023-07-11 杭州电子科技大学 Transformer-based multi-scale pedestrian re-identification method
CN116740790A (en) * 2023-06-21 2023-09-12 北京科技大学 Face detection method and device based on transducer
CN116740790B (en) * 2023-06-21 2024-02-09 北京科技大学 Face detection method and device based on transducer

Similar Documents

Publication Publication Date Title
CN112329658B (en) Detection algorithm improvement method for YOLOV3 network
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN111639692B (en) Shadow detection method based on attention mechanism
CN111898406B (en) Face detection method based on focus loss and multitask cascade
CN115063833A (en) Machine room personnel detection method based on image layered vision
CN110378222A (en) A kind of vibration damper on power transmission line target detection and defect identification method and device
CN111612017A (en) Target detection method based on information enhancement
CN116152254B (en) Industrial leakage target gas detection model training method, detection method and electronic equipment
CN113569788B (en) Building semantic segmentation network model training method, system and application method
CN112164077A (en) Cell example segmentation method based on bottom-up path enhancement
CN114360067A (en) Dynamic gesture recognition method based on deep learning
Su et al. Semantic segmentation of high resolution remote sensing image based on batch-attention mechanism
CN112766186A (en) Real-time face detection and head posture estimation method based on multi-task learning
CN111899203A (en) Real image generation method based on label graph under unsupervised training and storage medium
CN116740527A (en) Remote sensing image change detection method combining U-shaped network and self-attention mechanism
CN116434069A (en) Remote sensing image change detection method based on local-global transducer network
CN116824335A (en) YOLOv5 improved algorithm-based fire disaster early warning method and system
CN115565043A (en) Method for detecting target by combining multiple characteristic features and target prediction method
CN115035371A (en) Borehole wall crack identification method based on multi-scale feature fusion neural network
CN116977280A (en) Rail surface defect detection method based on improved UPerNet and connected domain analysis
CN116630387A (en) Monocular image depth estimation method based on attention mechanism
Chen et al. An improved pedestrian detection algorithm based on YOLOv3
CN116386042A (en) Point cloud semantic segmentation model based on three-dimensional pooling spatial attention mechanism
CN115601820A (en) Face fake image detection method, device, terminal and storage medium
CN113344005B (en) Image edge detection method based on optimized small-scale features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination