CN115063833A - Machine room personnel detection method based on image layered vision - Google Patents
- Publication number
- CN115063833A (application number CN202210529776.7A)
- Authority
- CN
- China
- Prior art keywords
- loss
- attention
- self
- rpn
- regression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V40/10: Recognition of human or animal bodies in image or video data, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
- G06V10/454: Local feature extraction with biologically inspired filters integrated into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/82: Image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention discloses a machine room personnel detection method based on image layered vision, and particularly relates to a cascade detector based on a shifted-window hierarchical vision Transformer. The application designs a practical self-attention method: the size of the input token is reduced through depthwise convolution to lower the complexity of the self-attention calculation, and a channel interaction module is adopted to compute the V value, solving the problem that local-window self-attention lacks direction perception and position information. Secondly, a balanced L1 loss is used and the losses of different stages are assigned weights in the total loss function, to address the imbalance between simple and difficult samples. Relative to the original Swin Transformer, the improved method raises the detection accuracy mAP@0.5 by 3.2 percentage points.
Description
Technical Field
The invention belongs to the field of images, and particularly relates to a machine room personnel detection method based on image layered vision.
Background
Machine room personnel detection is one of the key tasks in the field of computer vision, and convolutional neural networks (CNN) such as the R-CNN series and the YOLO series are widely applied to it. CNNs are very powerful at extracting locally valid information, but they lack the ability to extract long-range features from global information. Recently, Transformers with global computation have been widely applied to computer vision tasks and achieve remarkable results. The Transformer adopts a self-attention method to mine long-term dependencies in text. Many computer vision tasks at this stage use a self-attention (SA) mechanism to overcome the limitations of CNNs, obtaining relationships between distant elements more quickly. Therefore, it is very important to explore the potential of the Transformer in the field of machine room personnel detection.
The recently proposed Swin Transformer can easily adapt to feature pyramids and the like by constructing a hierarchical feature structure, and it reduces complexity from quadratic to linear through local-window self-attention computation. These properties allow the Swin Transformer to be used as a general model for a variety of visual tasks. However, the Swin Transformer still has three problems in machine room personnel detection: (1) performing self-attention within non-overlapping windows may still have high computational complexity; (2) self-attention calculations are performed within non-overlapping windows, which may lack direction perception and position information, i.e., cross-channel information may not be well captured; (3) in the training process, there is an imbalance between simple and difficult samples, and the gradient contribution of simple samples is too small when the gradient is back-propagated.
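The quadratic-to-linear complexity reduction mentioned above can be made concrete with the standard FLOPs estimates for global versus window attention. A sketch using the formulas from the Swin Transformer paper; the stage-1 feature-map shape is illustrative:

```python
# Hedged sketch: FLOPs estimates for global multi-head self-attention (MSA)
# versus window-based self-attention (W-MSA), illustrating why W-MSA scales
# linearly in h*w while global MSA scales quadratically.
# h, w: feature-map size in patches; C: channels; M: window size.
def msa_flops(h, w, C):
    return 4 * h * w * C**2 + 2 * (h * w)**2 * C

def wmsa_flops(h, w, C, M=7):
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

h, w, C = 56, 56, 96  # illustrative stage-1 resolution of Swin-T
print(msa_flops(h, w, C) / wmsa_flops(h, w, C))  # ~14x fewer FLOPs
```

Doubling h doubles the W-MSA cost (linear), while the global-MSA cost is dominated by the (h·w)² term.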
Disclosure of Invention
Based on the technical problems of the Swin Transformer, the present application provides a machine room personnel detection method, specifically a machine room personnel detection method based on an image layered vision Transformer.
a machine room personnel detection method based on image layered vision comprises the following steps:
the present application designs a practical self-attention module, adding two key designs to the standard window-based attention module (W-MSA): (1) a self-attention mechanism with smaller calculation amount is designed, and the calculation complexity of the self-attention mechanism is reduced. (2) Considering that convolutional layers aim to simulate local relations, the problem of lack of direction perception and position information of local window self-attention is solved by adding a channel interaction module, using parallel deep convolution (global computation) and local window-based self-attention computation. These two key points are integrated and an improved self-attention module is constructed herein. Details are described next. As shown in FIG. 2(c), by inputtingLinear projection derived queryWherein n is H × W. Will inputReshaped into a spatial vector (d) m H, W), the size of the input X is reduced by a deep convolution with a convolution kernel of s × s and a step size of s. The size of the token is represented by (d) m H, W) is changed toThe height and the width are reduced by s times and are obtained by linear conversionWhere X is the input token, n is the number of blocks, H is the number of blocks of the high-direction image of the input image, W is the number of blocks of the wide-direction image of the input image, d m Is the embedding dimension of each image block, the embedding dimension of the query vector dimension, the key vector and the value vector is d k And n' is the number of blocks.
For the value V, we add a channel interaction module to the calculation. Inspired by channel attention (SE), the channel interaction comprises a depthwise convolution DW_2, a global average pooling layer (GAP), and then two consecutive 1 × 1 convolutional layers with batch normalization (BN) and a SiLU activation between them. Finally, Sigmoid is used to generate attention in the channel dimension. The calculation formula of V is as follows:
V = FC(LN(DW_1(X))) · Sigmoid(conv(SILU(BN(conv(GAP(DW_2(X)))))))   (1)
where FC is a fully connected layer, LN is layer normalization, BN is batch normalization, DW_1 is a depthwise convolution, X is the input token vector, conv is a 1 × 1 convolution, GAP is global average pooling, and DW_2 is a depthwise convolution.
This finally gives V ∈ R^(n'×d_k), where DW_2 is a depthwise convolution with a 3 × 3 kernel. Note the difference between DW_1 and DW_2: DW_1 reduces the size of the input X by a factor of s, while DW_2 changes neither the size nor the number of channels of the input X, so that more channel information is retained; conv is a 1 × 1 convolution. Then the self-attention function of Q, K and V is calculated by the following formula:

Attention(Q, K, V) = Softmax(QK^T / √d_k)V   (2)
and finally, adding the linear transformation and the X to obtain final output. The channel interaction module and the SE layer are similar in design, but they are mainly distinguished by the following two points: (1) the inputs to the modules are different. Note that the two deep convolutions do not share weights, and the input for the channel interaction here comes from another parallel branch. (2) The present application applies channel interaction only to the local window self-attention V value calculation in the module, rather than to the output of the module as in the SE layer.
The loss function employed in the present application is as follows:
(1) RPN classification loss and cascade detection head loss. A multivariate cross entropy loss function is used herein; the goal of bounding box classification is to assign one of C + 1 class labels to each bounding box, represented by probability p, where C labels are object categories and the remaining one is background. For a training sample (x_i, y_i), where y_i is the label of input x_i, the multivariate cross entropy loss function is as in equation (3):
where W_j is as in equation (4):
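The classification term can be sketched as a plain multiclass cross-entropy over C + 1 classes. A hedged sketch: the weight W_j of equation (4) is not reproduced here and defaults to 1, and the probability values are illustrative.

```python
import numpy as np

# Hedged sketch: multiclass cross-entropy over C foreground classes plus one
# background class, as used for RPN / cascade-head classification. The class
# weight w stands in for W_j of equation (4); its exact form is not assumed.
def cross_entropy(p, y, w=None):
    """p: (N, C+1) softmax probabilities; y: (N,) integer class labels."""
    w = np.ones(p.shape[1]) if w is None else w
    return float(-(w[y] * np.log(p[np.arange(len(y)), y])).mean())

p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])   # two boxes, C = 2 classes + background
y = np.array([0, 1])
print(round(cross_entropy(p, y), 4))  # -(ln 0.7 + ln 0.8)/2 ~ 0.2899
```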
(2) RPN bounding box regression loss. Bounding box regression aims to use a regression function to regress the candidate bounding box b = (b_x, b_y, b_w, b_h) to the target bounding box g = (g_x, g_y, g_w, g_h), minimizing the loss function L_loc(b_i, g_i):

L_loc(b_i, g_i) = (1/N_reg) Σ_i p_i* Σ_{j∈{x,y,w,h}} smooth_L1(b_i^j − g_i^j)
where the Smooth L1 loss is defined as:

smooth_L1(x) = 0.5x², if |x| < 1;  |x| − 0.5, otherwise
where N_reg indicates the number of anchor locations, p_i* is 1 when the candidate box is a positive sample and 0 when the candidate box is a negative sample, b_i represents the bounding box regression parameters predicted for the i-th anchor, and g_i represents the ground-truth box corresponding to the i-th anchor.
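The Smooth L1 loss described above is standard and can be written directly; a short sketch with illustrative inputs:

```python
import numpy as np

# Hedged sketch of the standard Smooth L1 loss used for RPN box regression:
# quadratic near zero (|x| < 1), linear beyond, so outlier boxes contribute
# a bounded gradient of magnitude 1.
def smooth_l1(x):
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x**2, np.abs(x) - 0.5)

print(smooth_l1([0.5, 1.0, 3.0]))  # [0.125 0.5   2.5  ]
```

The two branches meet at |x| = 1 with matching value and slope, which is what keeps training stable on outliers.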
(3) Cascade detection head bounding box regression loss. Directly increasing the weight of the localization loss (i.e., the regression loss) may make the model more sensitive to some localization outliers. After adding a gradient limit to the derivative of the Smooth L1 loss, the gradient formula of the balanced L1 loss can be defined as follows:

∂L_b/∂x = α ln(b|x| + 1), if |x| < 1;  γ, otherwise   (8)
where α adjusts the gradient contribution of inliers and γ is the upper limit of the outlier error. Herein the balanced L1 loss is as follows:

L_b(x) = (α/b)(b|x| + 1) ln(b|x| + 1) − α|x|, if |x| < 1;  γ|x| + C, otherwise   (9)
where b is chosen to guarantee that L_b(x) is continuous at |x| = 1, and C is a constant; the condition between the parameters is as follows:

αln(b+1)=γ (10)
where α and γ are hyperparameters, with defaults set to 0.5 and 1.5. A small α makes the back-propagated gradient of inliers larger, while γ adjusts the upper bound of the regression error: the back-propagated gradient does not exceed γ.
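The balanced L1 loss and the parameter coupling α ln(b + 1) = γ can be sketched and checked numerically. This follows the Libra R-CNN formulation that the text describes, with the stated defaults α = 0.5 and γ = 1.5:

```python
import numpy as np

# Hedged sketch of the balanced L1 loss, matching equations (8)-(10):
# gradient alpha*ln(b|x|+1) for |x| < 1, capped at gamma beyond; b is fixed
# by alpha*ln(b+1) = gamma so the gradient is continuous at |x| = 1, and C
# makes the loss itself continuous there.
def balanced_l1(x, alpha=0.5, gamma=1.5):
    x = np.abs(np.asarray(x, dtype=float))
    b = np.expm1(gamma / alpha)  # from alpha*ln(b+1) = gamma, eq. (10)
    C = (alpha / b) * (b + 1) * np.log(b + 1) - alpha - gamma
    inner = (alpha / b) * (b * x + 1) * np.log(b * x + 1) - alpha * x
    return np.where(x < 1, inner, gamma * x + C)

# continuous at |x| = 1, then linear with slope gamma
print(balanced_l1([0.999999, 1.0, 2.0]))
```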
(4) The total loss. The total loss includes the classification and regression losses of the RPN stage and the classification and regression losses of the three cascade stages. The balanced L1 loss is applied herein to the first, second and third stages of the cascade detection head, and corresponding weights are assigned to the losses of the three stages and the loss of the RPN. The total loss function can be written as:
L = aL_RPN + bL_stage1 + cL_stage2 + dL_stage3   (11)
where:

L_RPN = L_RPN_cls + L_RPN_reg,  L_stagei = L_stagei_cls + L_stagei_reg (i = 1, 2, 3)

a = 1, b = 0.75, c = 0.5 and d = 0.25, where a, b, c and d are the loss weight coefficients; L_RPN_cls represents the RPN classification loss and L_RPN_reg the RPN regression loss; L_stage1, L_stage2 and L_stage3 represent the total losses of the three stages; L_stage1_cls, L_stage2_cls and L_stage3_cls are the classification losses of each stage, and L_stage1_reg, L_stage2_reg and L_stage3_reg are the regression losses of each stage.
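The weighted combination of equation (11) can be sketched as follows; the individual loss values are illustrative placeholders:

```python
# Hedged sketch of the total loss: the RPN loss and the three cascade-stage
# losses combined under the decaying weights a=1, b=0.75, c=0.5, d=0.25,
# each term being classification loss + regression loss.
def total_loss(rpn, stages, weights=(1.0, 0.75, 0.5, 0.25)):
    """rpn and each stages[i]: (cls_loss, reg_loss) pair; values illustrative."""
    terms = [rpn] + list(stages)
    return sum(w * (cls + reg) for w, (cls, reg) in zip(weights, terms))

L = total_loss((0.4, 0.2), [(0.3, 0.1), (0.2, 0.1), (0.1, 0.05)])
print(round(L, 4))  # 1*0.6 + 0.75*0.4 + 0.5*0.3 + 0.25*0.15 = 1.0875
```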
Compared with the prior art, the technical scheme has the following advantages:
the utility model provides a practical self-attention method, adopt the channel interaction module to the calculation of V value can solve the problem that local window lacks direction perception and positional information from the attention, reduces the size of input token through the degree of depth convolution, reduces the complexity from the attention calculation. With balanced L 1 Loss and weights lost in different stages are configured in a total loss function to solve the problem of gradient imbalance of simple samples and difficult samples, the improved method has an excellent effect in personnel detection of the machine room, the maintenance efficiency of the unattended machine room is improved, and the normal, safe and reliable operation of the machine room is guaranteed.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is an overall framework diagram of the algorithm;
fig. 2 is a backbone network.
Detailed Description
Based on the above technical problems of the Swin Transformer, the present application provides a machine room personnel detection method, specifically a machine room personnel detection method based on an image layered vision Transformer.
the utility model provides a practical self-attention method, adopt the channel interaction module to the calculation of V value can solve the problem that local window lacks direction perception and positional information from the attention, reduces the size of input token through the degree of depth convolution, reduces the complexity from the attention calculation. With balanced L 1 Loss and weights lost in different stages are configured in a total loss function to solve the problem of gradient imbalance of simple samples and difficult samples, the improved method has an excellent effect in personnel detection of the machine room, the maintenance efficiency of the unattended machine room is improved, and the normal, safe and reliable operation of the machine room is guaranteed.
The application provides a computer room personnel detection method based on image layered vision Transformer, which specifically comprises the following steps:
the Network structure of the detection method consists of four parts, including a Swin-T backbone Network, a Feature Pyramid (FPN), an area Proposal Network (RPN) and a cascade detection head. As shown in fig. 1, Swin Transformer is used to extract image features, and FPN is mainly used to extract multi-scale features. The RPN is a combination Of several convolutional layers that produces a Region Of Interest (ROI) where an object may be present. And the cascade detection head classifies and positions the region of interest and outputs a final detection result. In the cascade detection head, FC is the full connectivity layer, C is the classification probability, and B is the regression of the candidate box.
The backbone network herein is shown in FIG. 2(a). First, the picture is input into a patch partition module for splitting: every 4 × 4 adjacent pixels form a patch, which is then flattened in the channel direction. Assuming an RGB three-channel picture is input, each patch has 4 × 4 = 16 pixels, and each pixel has R, G and B values, so flattening gives 16 × 3 = 48 values; the image shape after patch partition thus changes from (H, W, 3) to (H/4, W/4, 48). Then the channel data of each patch is linearly transformed from 48 to 96 dimensions through a linear embedding layer, i.e., the shape changes from (H/4, W/4, 48) to (H/4, W/4, 96). Feature maps of different sizes are then constructed through four stages; except that stage 1 first passes through the linear embedding layer, the remaining three stages first downsample through a patch merging layer. All stages are then repeated stacks of Swin Transformer blocks; note that the blocks come in two configurations. As shown in FIG. 2(b), a standard window-based multi-head self-attention (W-MSA) module and a shifted-window multi-head self-attention (SW-MSA) module are used in series in a Swin Transformer block. Each (S)W-MSA and MLP is preceded by layer normalization (LN), and the MLP consists of two fully connected layers with a GELU nonlinear activation. The connection of W-MSA and SW-MSA can be shown by the following formulas:

ẑ^l = W-MSA(LN(z^(l−1))) + z^(l−1)
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)

where ẑ^l is the output of (S)W-MSA, z^l is the output of the MLP, and l − 1, l and l + 1 denote block positions.
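The patch-partition arithmetic above can be checked with a short sketch; a random projection stands in for the learned linear embedding layer:

```python
import numpy as np

# Hedged sketch of the patch-partition step: a (H, W, 3) image is split into
# 4x4 patches, each flattened to 4*4*3 = 48 values, then linearly projected
# to the embedding dimension 96. The projection weights are random and
# purely illustrative.
def patch_partition(img, patch=4, embed_dim=96, rng=None):
    H, W, C = img.shape
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(H // patch, W // patch, patch * patch * C)
    rng = rng or np.random.default_rng(0)
    proj = rng.standard_normal((patch * patch * C, embed_dim))  # linear embedding
    return x @ proj

img = np.zeros((224, 224, 3))  # illustrative input resolution
tokens = patch_partition(img)
print(tokens.shape)  # (56, 56, 96), i.e. (H/4, W/4, 96)
```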
The present application designs a practical self-attention module, adding two key designs to the standard window-based multi-head self-attention (W-MSA) module: (1) a self-attention mechanism with a smaller amount of calculation, which reduces the computational complexity of self-attention; (2) considering that convolutional layers aim to model local relations, a channel interaction module is added, using parallel depthwise convolution (global computation) alongside the local-window self-attention computation, to solve the lack of direction perception and position information in local-window self-attention. These two key points are integrated to construct an improved self-attention module. Details are described next. As shown in FIG. 2(c), the query Q ∈ R^(n×d_k) is derived by a linear projection of the input X ∈ R^(n×d_m), where n = H × W. The input X is reshaped into a spatial tensor of shape (d_m, H, W), and its size is reduced by a depthwise convolution with an s × s kernel and stride s, changing the token shape from (d_m, H, W) to (d_m, H/s, W/s); the height and width are each reduced by a factor of s. The key K ∈ R^(n'×d_k) and value V ∈ R^(n'×d_k) are then obtained by linear transformation, where n' = (H/s) × (W/s). Here X is the input token sequence, n is the number of image patches, H and W are the numbers of patches along the height and width of the input image, d_m is the embedding dimension of each image patch, and d_k is the embedding dimension of the query, key and value vectors.
For the value V, we add a channel interaction module to the calculation. Inspired by channel attention (SE), the channel interaction comprises a depthwise convolution DW_2, a global average pooling layer (GAP), and then two consecutive 1 × 1 convolutional layers with batch normalization (BN) and a SiLU activation between them. Finally, Sigmoid is used to generate attention in the channel dimension. The calculation formula of V is as follows:
V = FC(LN(DW_1(X))) · Sigmoid(conv(SILU(BN(conv(GAP(DW_2(X)))))))   (1)
to obtain finallyIn which DW 2 Is a deep convolution with a convolution kernel of 3 × 3, where attention is paid to DW 1 And DW 2 Difference of (2), DW 1 The input X is reduced in size by a factor of s. DW (DW) 2 The size and number of channels of input X are not changed, more channel information is retained, conv is a convolution of 1 × 1. Then, the self-attention function of Q, K and V is calculated by the following formula:
and finally, adding the linear transformation and the X to obtain final output. The channel interaction module and the SE layer are similar in design, but they are mainly distinguished by the following two points: 1. the inputs to the modules are different. Note that the two deep convolutions do not share weights, and the input for channel interaction here comes from another parallel branch. 2. The present application applies channel interaction only to the local window self-attention V value calculation in the module, rather than to the output of the module as in the SE layer.
The loss function employed in the present application is as follows:
(1) RPN classification loss and cascade detection head loss. A multivariate cross entropy loss function is used herein; the goal of bounding box classification is to assign one of C + 1 class labels to each bounding box, represented by probability p, where C labels are object categories and the remaining one is background. For a training sample (x_i, y_i), where y_i is the label of input x_i, the multivariate cross entropy loss function is as in equation (3):
where W_j is as in equation (4):
(2) RPN bounding box regression loss. Bounding box regression aims to use a regression function to regress the candidate bounding box b = (b_x, b_y, b_w, b_h) to the target bounding box g = (g_x, g_y, g_w, g_h), minimizing the loss function L_loc(b_i, g_i):

L_loc(b_i, g_i) = (1/N_reg) Σ_i p_i* Σ_{j∈{x,y,w,h}} smooth_L1(b_i^j − g_i^j)
where the Smooth L1 loss is defined as:

smooth_L1(x) = 0.5x², if |x| < 1;  |x| − 0.5, otherwise
where N_reg indicates the number of anchor locations, p_i* is 1 when the candidate box is a positive sample and 0 when the candidate box is a negative sample, b_i represents the bounding box regression parameters predicted for the i-th anchor, and g_i represents the ground-truth box corresponding to the i-th anchor.
(3) Cascade detection head bounding box regression loss. Directly increasing the weight of the localization loss (i.e., the regression loss) may make the model more sensitive to some localization outliers. After adding a gradient limit to the derivative of the Smooth L1 loss, the gradient formula of the balanced L1 loss can be defined as follows:

∂L_b/∂x = α ln(b|x| + 1), if |x| < 1;  γ, otherwise   (8)
where α adjusts the gradient contribution of inliers and γ is the upper limit of the outlier error. Herein the balanced L1 loss is as follows:

L_b(x) = (α/b)(b|x| + 1) ln(b|x| + 1) − α|x|, if |x| < 1;  γ|x| + C, otherwise   (9)
where b is chosen to guarantee that L_b(x) is continuous at |x| = 1, and C is a constant; the condition between the parameters is as follows:

αln(b+1)=γ (10)
where α and γ are hyperparameters, with defaults set to 0.5 and 1.5. A small α makes the back-propagated gradient of inliers larger, while γ adjusts the upper bound of the regression error: the back-propagated gradient does not exceed γ.
(4) The total loss. The total loss includes the classification and regression losses of the RPN stage and the classification and regression losses of the three cascade stages. The balanced L1 loss is applied here to the first, second and third stages of the cascade detection head, and corresponding weights are assigned to the losses of the three stages and the loss of the RPN. The total loss function can be written as:
L = aL_RPN + bL_stage1 + cL_stage2 + dL_stage3   (11)
where:

L_RPN = L_RPN_cls + L_RPN_reg,  L_stagei = L_stagei_cls + L_stagei_reg (i = 1, 2, 3)

a = 1, b = 0.75, c = 0.5 and d = 0.25, where a, b, c and d are the loss weight coefficients; L_RPN_cls represents the RPN classification loss and L_RPN_reg the RPN regression loss; L_stage1, L_stage2 and L_stage3 represent the total losses of the three stages; L_stage1_cls, L_stage2_cls and L_stage3_cls are the classification losses of each stage, and L_stage1_reg, L_stage2_reg and L_stage3_reg are the regression losses of each stage.
In order to verify the performance of the method for machine room personnel detection, it is compared with currently popular target detection algorithms; the detection performance of DETR based on ResNet50, Deformable DETR based on ResNet50, YOLOX-x, RetinaNet based on ResNeXt101 with FPN, YOLOF based on ResNet50, the Swin Transformer-T cascade algorithm and other algorithms was tested, as shown in Table 1. First, considering the mAP at IoU = 0.5, the detection accuracy of the Swin Transformer is better than that of DETR, Deformable DETR and YOLOF, and is 0.05 points higher than that of RetinaNet, a one-stage algorithm. This seems to indicate that two-stage algorithms achieve higher detection accuracy in the field of machine room personnel detection than end-to-end and one-stage algorithms, although this does not include YOLOX-x, since YOLOX is mainly an integrated detection algorithm that combines various techniques; the detection accuracy of YOLOX-x is better than that of the Swin Transformer. The detection accuracy of Deformable DETR is 1.4 points higher than that of DETR. The mAP@0.5 of the detection network based on the improved Swin Transformer is 89.7%, 3.2 points higher than Swin Transformer-T and 0.6 points higher than YOLOX-x. Second, the algorithm improved herein is second only to YOLOX in detecting small objects, while the DETR-series algorithms perform worst on small targets; Deformable DETR has higher detection performance than DETR on small objects. Apart from YOLOX, the two-stage detection algorithms are significantly better than the one-stage algorithms in detecting small objects.
Finally, in terms of model complexity, in order to guarantee fairness, the input sizes of all target detection networks are set to (3, 1280, 800); compared with the Swin Transformer, GFLOPs are reduced by 5.43 G while the parameter count increases by 3.42 M. The improved method thus brings a substantial improvement in detection accuracy over the Swin Transformer.
TABLE 1 comparison of test results for different models
The application also tested the detection performance of each improvement separately, measured by mAP@0.5. The experimental results are shown in Table 2: using the improved Swin Transformer block (ISTB) raises the detection accuracy by 1.83 points, and using the balanced loss function (BLOSS) raises it by 1.4 points. Therefore, using the improved methods in combination can improve the detection accuracy of the Swin Transformer.
TABLE 2 results of the experiments using the improved method in combination
In this description, the parts are described in a progressive manner; each part focuses on its differences from the others, and the same or similar parts of the various sections can be cross-referenced.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (3)
1. A machine room personnel detection method based on image layered vision, characterized in that the network structure of the method consists of four parts, including a Swin Transformer backbone network, a Feature Pyramid Network (FPN), a Region Proposal Network (RPN) and a cascade detection head, wherein the Swin Transformer is used for extracting image features related to machine room personnel, the FPN is mainly used for extracting multi-scale features, the RPN is a combination of several convolutional layers that generates regions of interest (ROI) where objects may exist, and the cascade detection head classifies and localizes the regions of interest and outputs the final detection result; in the cascade detection head, FC is a fully connected layer, C is the classification probability, and B is the regression of the candidate box;
two key designs were added in the standard window-based multi-headed self-attention (W-MSA) module in the Swin Transformer backbone network: (1) a self-attention mechanism with smaller calculation amount is designed, and the calculation complexity of the self-attention mechanism is reduced; (2) considering that the convolutional layer aims at simulating local relations, the problem that the self-attention of local windows lacks direction perception and position information is solved by adding a channel interaction module and using parallel deep convolution (global calculation) and self-attention calculation based on the local windows;
in addition, balanced L is used 1 Losing and configuring weights lost in different stages in a total loss function to solve the problem of gradient imbalance of simple samples and difficult samples;
the method comprises the following steps of designing a self-attention mechanism with smaller calculation amount and reduced self-attention mechanism calculation complexity, and specifically comprises the following steps:
by pairing input tokensLinear projection derived queryWhere n is H × W, and then inputReshaped into a space vector (d) m H, W) by a deep convolution (DW) with a convolution kernel size s × s and a step size s 1 ) To reduce the size of the input X, the size of the token is reduced by (d) m H, W) is changed toThe height and the width are reduced by s times and are obtained by linear conversion Reducing the complexity of self-attention computation by reducing the size of the input token by deep convolution; where X is the input token, n is the number of blocks, H is the number of blocks of the high-direction image of the input image, W is the number of blocks of the wide-direction image of the input image, d m Is the embedding dimension of each image block, the embedding dimension of the query vector dimension, the key vector and the value vector is d k And n' is the number of blocks.
2. The method according to claim 1, characterized in that, considering that the convolutional layer aims at simulating local relations, the problem of the self-attention lack of direction perception and position information of local windows is solved by adding channel interaction modules, using parallel deep convolution (global computation) and self-attention computation based on local windows, specifically comprising:
for the value V, a channel interaction module is added to the computation. Inspired by channel attention (SE), the channel interaction consists of a depthwise convolution, a global average pooling layer (GAP), and then two successive 1 × 1 convolution layers with batch normalization (BN) and an activation function (SiLU) between them; finally, a Sigmoid is used to generate attention in the channel dimension. The formula for V is shown below:
V = FC(LN(DW_1(X))) · Sigmoid(conv(SiLU(BN(conv(GAP(DW_2(X))))))) (1)
where FC is a fully connected layer, LN is layer normalization, BN is batch normalization, DW_1 and DW_2 are depthwise convolutions, X is the input token vector, conv is a 1 × 1 convolution, and GAP is global average pooling;
V is finally obtained as above, in which DW_2 is a depthwise convolution with a 3 × 3 convolution kernel. Note the distinction between DW_1 and DW_2: after DW_1 the input X is reduced in size by a factor of s, whereas DW_2 does not change the size or the number of channels of the input X, so more channel information is retained; conv is a 1 × 1 convolution. The self-attention function of Q, K and V is then calculated by the following formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V (2)
Finally, the final output is obtained by linear transformation and addition with X. The design of the channel interaction module is similar to that of the SE layer, but differs in two main respects: first, the inputs to the modules are different; note that the two depthwise convolutions do not share weights, and the input to the channel interaction comes from another parallel branch. Second, the channel interaction is applied to the computation of the value V in the local-window self-attention module, rather than to the module's output as in the SE layer.
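A hedged PyTorch sketch of the channel-interaction computation of V in formula (1) might look as follows. The class name, the channel-reduction ratio of the squeeze, and the gate's output dimension are illustrative assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn

class ChannelInteractionV(nn.Module):
    """Sketch of formula (1): V = FC(LN(DW_1(X))) * Sigmoid(conv(SiLU(BN(conv(GAP(DW_2(X)))))))."""
    def __init__(self, d_m, d_k, s, reduction=4):
        super().__init__()
        # main branch: DW_1 (stride s) -> LayerNorm -> fully connected
        self.dw1 = nn.Conv2d(d_m, d_m, kernel_size=s, stride=s, groups=d_m)
        self.ln = nn.LayerNorm(d_m)
        self.fc = nn.Linear(d_m, d_k)
        # gate branch: DW_2 (3x3, padding 1 keeps H and W unchanged) -> GAP -> squeeze
        self.dw2 = nn.Conv2d(d_m, d_m, kernel_size=3, padding=1, groups=d_m)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.squeeze = nn.Sequential(
            nn.Conv2d(d_m, d_m // reduction, 1),
            nn.BatchNorm2d(d_m // reduction),
            nn.SiLU(),
            nn.Conv2d(d_m // reduction, d_k, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (B, d_m, H, W)
        main = self.dw1(x)                         # (B, d_m, H/s, W/s)
        main = main.flatten(2).transpose(1, 2)     # (B, n', d_m)
        main = self.fc(self.ln(main))              # (B, n', d_k)
        gate = self.squeeze(self.gap(self.dw2(x))) # (B, d_k, 1, 1)
        return main * gate.flatten(1).unsqueeze(1) # channel-wise rescaling of V
```

Note that the two depthwise convolutions are separate modules with unshared weights, matching the remark above, and that the Sigmoid gate rescales channels of V rather than the block's output.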
3. The method according to claim 1, characterized in that a balanced L1 loss is used and weights are configured for the losses of the different stages in the total loss function, to solve the problem of gradient imbalance between easy and hard samples, specifically comprising:
(1) RPN classification loss and cascade detection head classification loss, using a multiclass cross-entropy loss function. The goal of bounding box classification is to assign one of C + 1 class labels to each bounding box, represented by a probability p, where C is the number of object classes and one class is the background. For a training sample x_i with label y_i, where y_i is the class label of input x_i, the multiclass cross-entropy loss function is as in equation (3):
where W_j is as in equation (4):
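Equations (3) and (4) are not reproduced on this page, so the following numpy sketch shows only the general form of a class-weighted multiclass cross-entropy; W_j is left as a caller-supplied weight vector rather than implementing the patent's (unseen) equation (4), and uniform weights are used in the example.

```python
import numpy as np

def weighted_cross_entropy(p, y, w):
    """p: (N, C+1) predicted class probabilities, y: (N,) integer labels,
    w: (C+1,) per-class weights W_j (definition left to the caller)."""
    eps = 1e-12  # avoid log(0)
    return -np.mean(w[y] * np.log(p[np.arange(len(y)), y] + eps))
```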
(2) RPN bounding box regression loss. Bounding box regression aims to use a regression function to regress the candidate bounding box b = (b_x, b_y, b_w, b_h) to the target bounding box g = (g_x, g_y, g_w, g_h), minimizing the loss function L_loc(b_i, g_i):
where:
the Smooth L1 loss is defined as:

smooth_L1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise
where N_reg denotes the number of anchor locations, the indicator p_i* is 1 when the candidate box is a positive sample and 0 when the candidate box is a negative sample, b_i denotes the predicted bounding box regression parameters of the i-th anchor, and g_i denotes the ground-truth box corresponding to the i-th anchor;
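For reference, the standard Smooth L1 definition above can be written as a small numpy function, quadratic for small errors and linear for large ones:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)
```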
(3) Cascade detection head bounding box regression loss. Directly increasing the weight of the localization loss (i.e. the regression loss) makes the model more sensitive to outlier localization values; after adding a gradient limit to the derivative of the Smooth L1 loss, the gradient formula of the balanced L1 loss can be defined as follows:

∂L_b/∂x = α ln(b|x| + 1), if |x| < 1; γ, otherwise
where α controls the gradient contribution of inliers and γ is the upper bound of the outlier error; the balanced L1 loss is then:

L_b(x) = (α/b)(b|x| + 1)ln(b|x| + 1) − α|x|, if |x| < 1; γ|x| + C, otherwise
where the parameter b is used to guarantee that the two branches of L_b(x) are continuous at |x| = 1 and C is a constant; the condition between the parameters is as follows:
α ln(b + 1) = γ (10)
where α and γ are hyper-parameters with default values of 0.5 and 1.5 respectively; a small α makes the back-propagated gradient larger, and γ adjusts the upper bound of the regression error so that the back-propagated gradient does not exceed γ;
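A numpy sketch of the balanced L1 loss described above (the Libra R-CNN formulation) follows; b is solved from equation (10), α ln(b + 1) = γ, so that the two branches and their gradients join at |x| = 1:

```python
import numpy as np

def balanced_l1(x, alpha=0.5, gamma=1.5):
    """Balanced L1 loss with defaults alpha = 0.5, gamma = 1.5."""
    x = np.abs(np.asarray(x, dtype=float))
    b = np.expm1(gamma / alpha)  # b = e^(gamma/alpha) - 1, from alpha*ln(b+1) = gamma
    inlier = (alpha / b) * (b * x + 1) * np.log(b * x + 1) - alpha * x
    # constant C chosen so the outlier branch meets the inlier branch at |x| = 1
    C = (alpha / b) * (b + 1) * np.log(b + 1) - alpha - gamma
    outlier = gamma * x + C
    return np.where(x < 1, inlier, outlier)
```

With the default hyper-parameters the gradient at |x| = 1 equals α ln(b + 1) = γ, so the outlier branch simply continues at the capped slope γ.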
(4) Total loss. The total loss includes the classification loss and regression loss of the RPN stage and the classification losses and regression losses of the three cascade stages. The balanced L1 loss is applied to the first, second and third stages of the cascade detection head, and corresponding weights are assigned to the losses of the three stages and the RPN loss; the total loss function can be written as:
L = aL_RPN + bL_stage1 + cL_stage2 + dL_stage3 (11)
where:

L_RPN = L_RPN_cls + L_RPN_reg, L_stagek = L_stagek_cls + L_stagek_reg (k = 1, 2, 3)

a = 1, b = 0.75, c = 0.5, d = 0.25, where a, b, c and d are the loss weight coefficients; L_RPN_cls is the RPN classification loss and L_RPN_reg is the RPN regression loss; L_stage1, L_stage2 and L_stage3 are the total losses of the three stages, L_stage1_cls, L_stage2_cls and L_stage3_cls are the classification losses of each stage, and L_stage1_reg, L_stage2_reg and L_stage3_reg are the regression losses of each stage.
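A minimal sketch of the weighted combination in equation (11), assuming each stage loss is the sum of its classification and regression terms:

```python
def total_loss(rpn, stages, a=1.0, weights=(0.75, 0.5, 0.25)):
    """rpn: (cls, reg) pair for the RPN stage; stages: three (cls, reg)
    pairs for the cascade head; a and weights are the coefficients
    a, (b, c, d) of equation (11)."""
    L = a * sum(rpn)
    for w, (cls, reg) in zip(weights, stages):
        L += w * (cls + reg)
    return L
```

The decaying weights (1, 0.75, 0.5, 0.25) progressively downweight the later, stricter cascade stages.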
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210529776.7A CN115063833A (en) | 2022-05-16 | 2022-05-16 | Machine room personnel detection method based on image layered vision |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115063833A true CN115063833A (en) | 2022-09-16 |
Family
ID=83198297
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112949673A (en) * | 2019-12-11 | 2021-06-11 | 四川大学 | Feature fusion target detection and identification method based on global attention |
CN111259930A (en) * | 2020-01-09 | 2020-06-09 | 南京信息工程大学 | General target detection method of self-adaptive attention guidance mechanism |
WO2021139069A1 (en) * | 2020-01-09 | 2021-07-15 | 南京信息工程大学 | General target detection method for adaptive attention guidance mechanism |
US11270124B1 (en) * | 2020-11-16 | 2022-03-08 | Branded Entertainment Network, Inc. | Temporal bottleneck attention architecture for video action recognition |
CN113888744A (en) * | 2021-10-14 | 2022-01-04 | 浙江大学 | Image semantic segmentation method based on Transformer visual upsampling module |
CN114241307A (en) * | 2021-12-09 | 2022-03-25 | 中国电子科技集团公司第五十四研究所 | Synthetic aperture radar aircraft target identification method based on self-attention network |
Non-Patent Citations (3)
Title |
---|
严娟; 方志军; 高永彬: "3D object detection combining mixed-domain attention and dilated convolution", Journal of Image and Graphics (中国图象图形学报), no. 06, 16 June 2020 (2020-06-16) * |
周幸; 陈立福: "Remote sensing image object detection based on a dual attention mechanism", Computer and Modernization (计算机与现代化), no. 08, 15 August 2020 (2020-08-15) * |
宁尚明; 滕飞; 李天瑞: "Entity relation extraction from electronic medical records based on a multi-channel self-attention mechanism", Chinese Journal of Computers (计算机学报), no. 05, 15 May 2020 (2020-05-15) * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115631513A (en) * | 2022-11-10 | 2023-01-20 | 杭州电子科技大学 | Multi-scale pedestrian re-identification method based on Transformer |
CN115631513B (en) * | 2022-11-10 | 2023-07-11 | 杭州电子科技大学 | Transformer-based multi-scale pedestrian re-identification method |
CN116740790A (en) * | 2023-06-21 | 2023-09-12 | 北京科技大学 | Face detection method and device based on transducer |
CN116740790B (en) * | 2023-06-21 | 2024-02-09 | 北京科技大学 | Face detection method and device based on transducer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||