CN115471776A - Helmet wearing identification method based on a multi-convolution-kernel residual module temporal Transformer model

Helmet wearing identification method based on a multi-convolution-kernel residual module temporal Transformer model

Info

Publication number
CN115471776A
Authority
CN
China
Prior art keywords
output
module
attention
query
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211214277.5A
Other languages
Chinese (zh)
Inventor
朱建宝
邓伟超
俞鑫春
陈宇
马青山
张才智
叶超
孙根森
陈鹏
曹雯佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong Power Supply Co Of State Grid Jiangsu Electric Power Co
Original Assignee
Nantong Power Supply Co Of State Grid Jiangsu Electric Power Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong Power Supply Co Of State Grid Jiangsu Electric Power Co filed Critical Nantong Power Supply Co Of State Grid Jiangsu Electric Power Co
Priority to CN202211214277.5A priority Critical patent/CN115471776A/en
Publication of CN115471776A publication Critical patent/CN115471776A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection


Abstract

The invention discloses a safety-helmet-wearing identification method based on a multi-convolution-kernel residual module temporal Transformer model. A temporal Transformer model built on multi-convolution-kernel residual modules is combined with deep learning to provide a method suited to detecting moving individuals in complex electric power operation environments, realizing automatic identification and tracking of whether workers in the electric power industry wear safety helmets during safe operation. The method can effectively improve the adaptability and efficiency of safety-helmet-wearing identification in the power industry, and explores an effective and feasible path for applying deep learning to the automatic identification and tracking of dynamic targets in electric power safety operations.

Description

Helmet wearing identification method based on a multi-convolution-kernel residual module temporal Transformer model
Technical Field
The invention relates to safe operation of power systems, and in particular to a helmet-wearing identification method based on a multi-convolution-kernel residual module temporal Transformer model.
Background
With the continued development of deep learning, video image processing technology has been widely applied across many areas of social life. In the electric power industry, safety is familiar to everyone: stable production can only be guaranteed when operations are safe, and the losses caused by safety accidents are enormous. A safety helmet offers a degree of head protection during power operations, so power workers must wear one throughout construction. In recent years, however, power safety accidents caused by workers violating safety regulations and not wearing helmets as required have still occurred from time to time. To prevent such accidents and protect the personal safety of power workers, it is increasingly important for the power industry to develop a system that can automatically identify abnormal situations such as an operator working without a safety helmet.
To address these problems, Chinese patent publication No. CN114387508A, for example, discloses a Transformer-based safety helmet identification method: features are first extracted from the image and image position information is added; attention feature information is then obtained through a Transformer encoding module; the attention feature information and target query information are input into a Transformer decoding module, which outputs an attention feature map; finally, a feed-forward neural network predicts the object class, center coordinates, and the height and width of the bounding box.
Similarly, Chinese patent publication No. CN114241247A discloses a transformer-substation helmet identification method and system based on a deep residual network, which identifies helmets by constructing a deep residual network with four residual blocks. This effectively improves detection and identification accuracy, makes deep networks easier to train, and avoids the gradient explosion and vanishing-gradient problems that conventional convolutional networks suffer as depth increases. The constructed deep residual network adopts a Dropout layer so that random neurons are disabled with a certain probability, which effectively prevents overfitting. During training, the loss value is calculated by combining a Softmax function with a cross-entropy function, which improves training efficiency and slows the performance degradation of the trained deep residual network model.
At present, existing safety-helmet-wearing identification techniques have the following shortcomings: the two patents above mainly focus on detecting and identifying safety helmets in static images; target detection efficiency is not high and accuracy still needs improvement; and accurate detection of dynamic targets operating in complex power environments remains unsolved. The prior art therefore still needs to be improved.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides a helmet-wearing identification method based on a multi-convolution-kernel residual module temporal Transformer model to solve the above problems.
To achieve this purpose, the invention is realized by the following technical scheme. The design mainly comprises four parts: a multi-convolution-kernel residual neural network (MK-RCN) backbone for extracting feature representations, a Transformer encoder-decoder, a temporal Transformer, and a feed-forward network (FFN). The temporal Transformer consists of three components: a temporal deformable Transformer encoder (TDTE), a temporal query encoder (TQE), and a temporal deformable Transformer decoder (TDTD).
The safety-helmet-wearing identification method based on the multi-convolution-kernel residual module temporal Transformer model mainly comprises the following steps:
S1: acquiring a dynamic video of power operation through on-site monitoring or other camera equipment;
S2: extracting feature representations from the video images of power operators using the MK-RCN structure;
S3: supplementing the features extracted in S2 with position coding to obtain position embeddings, where the position-coding vectors are

PE(pos, 2i) = sin( pos / 10000^{2i/d_model} )      (3-1)
PE(pos, 2i+1) = cos( pos / 10000^{2i/d_model} )    (3-2)

where PE is a two-dimensional matrix, pos denotes the position, and d_model denotes the vector dimension;
S4: passing the feature map obtained in S2 and the position embedding obtained in S3 to the Transformer encoder;
S5: taking the output obtained by object query and the output of the Transformer encoder in S4 as inputs of the Transformer decoder, converting the input embeddings with learned position coding into output embeddings using a multi-head attention module (MultiHeadAttn), and passing each output embedding of the Transformer encoder and decoder to the temporal Transformer, where the multi-head attention module implements

MultiHeadAttn(z_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k∈Ω_k} A_{mqk} · W'_m x_k ]

where m indexes the m-th attention head, W_m and W'_m are learnable weights, and the attention weight A_{mqk} ∝ exp( z_q^T U_m^T V_m x_k / √C_v ) is normalized over k so that Σ_{k∈Ω_k} A_{mqk} = 1, with U_m and V_m also learnable weights;
S6: inputting the feature map of S4 into the TDTE and encoding the spatio-temporal feature representation;
S7: inputting the output embeddings of step 5 into the TQE, and acquiring the spatial object queries of all reference frames to enhance the spatial output query of the current frame;
S8: inputting the outputs of step 6 and step 7 into the TDTD to learn the temporal context of different frames, where a TDTD layer comprises a self-attention module, a deformable aggregation attention module and a feed-forward layer, and the deformable aggregation attention module implements

DefAttn(z_q, p̂_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k=1}^{K} A_{mqk} · W'_m x( φ(p̂_q) + Δp_{mqk} ) ]

where x denotes the temporally aggregated feature map;
S9: passing each output embedding of step 8 to the FFN, which performs the final target detection and identification.
Further, the specific steps of S2 include:
S2.1: initializing the parameters of the MK-RCN structure, using ResNet-18 as the network backbone and three convolution kernels of sizes 3×3, 3×1 and 1×3 in the residual module, with a learning rate of 10^{-5} and a weight decay of 10^{-4};
S2.2: extracting frames (t−i) to t of the power-operation dynamic video and extracting their features, where the initial image has size 3×H_0×W_0 and the MK-RCN generates a new feature map of size C×H×W.
Further, the specific steps of S3 include:
S3.1: dividing the feature map obtained in S2.2 into three parts, one of which is used directly as the value vector V, while the other two are added directly to the position-coding vector to serve as the key vector K and the query vector Q respectively; according to the position-coding formulas (3-1) and (3-2), the vector PE(pos+k, 2i) can be expressed as a linear function of PE(pos, 2i):

PE(pos+k, 2i) = PE(pos, 2i)·PE(k, 2i+1) + PE(pos, 2i+1)·PE(k, 2i)
PE(pos+k, 2i+1) = PE(pos, 2i+1)·PE(k, 2i+1) − PE(pos, 2i)·PE(k, 2i)
further, the specific step of S4 includes:
s4.1: each layer of the transform encoder is composed of a multi-head attention mechanism module, add&The Norm module and the Forward propagation module are composed of 6 layers in total, normalization is respectively carried out after a multi-head attention layer and a Forward feedback layer (Feed-Forward), and the initial learning rate is 2 multiplied by 10 -4 Weight attenuation of 10 -4
S4.2: inputting the KVQ obtained in the S3.1 into a multi-head attention module, and outputting a new characteristic diagram;
s4.3: adding the new characteristic diagram obtained in the step S4.2 with the original characteristic diagram;
s4.4: performing linear reduction dimensionality and ReLU activation;
s4.5: and repeating the 6 transform encoder layers, finishing encoding and outputting.
Further, the specific steps of S5 include:
S5.1: the inputs of the Transformer decoder comprise the query embedding, the query position and the Transformer encoder output; each layer consists of a multi-head attention module, an Add & Norm module and a feed-forward module, with 6 layers in total; besides the output of the previous layer, the input of each layer includes the query position and the position coding from the Transformer encoder, with an initial learning rate of 2×10^{-4} and a weight decay of 10^{-4};
S5.2: inputting, through the object queries, the query embedding that encodes the anchors, and adding the query embedding to the query position to obtain K and Q, where the number of object queries is set to 300;
S5.3: inputting the K and Q obtained in S5.2 and the output of the object queries into the first multi-head attention module MultiHeadAttn to obtain an output;
S5.4: applying Dropout to the output of S5.3, adding the output of the object queries, and outputting the result;
S5.5: adding the output of the object queries to the query position to obtain Q, adding the output of S4.5 to the position-coding vector to obtain K, and inputting the output of S4.5 as V into the second multi-head attention module;
S5.6: performing linear dimensionality reduction and ReLU activation;
S5.7: after passing through 6 Transformer decoder layers, decoding ends and the result is output.
Further, the specific steps of S6 include:
S6.1: a TDTE layer includes a self-attention module (Self-Attention), a multi-head temporal deformable attention module (TempDefAttn) and a feed-forward layer;
S6.2: using the output of S4.5 as the input of the self-attention module;
S6.3: using the output of S6.2 as the input of the multi-head temporal deformable attention module;
S6.4: using the output of S6.3 as the input of the feed-forward layer.
Further, in S6.3, the multi-head temporal deformable attention module implements

TempDefAttn(z_q, p̂_q, {x^l}_{l=1}^{L}) = Σ_{m=1}^{M} W_m [ Σ_{l=1}^{L} Σ_{k=1}^{K} A_{mlqk} · W'_m x^l( φ_l(p̂_q) + Δp_{mlqk} ) ]

where m is the m-th attention head, l is the l-th frame of the same video sample, and k is the k-th sampling point; Δp_{mlqk} and A_{mlqk} denote the sampling offset and attention weight of the k-th sampling point of the m-th attention head in the l-th frame; the scalar attention weight A_{mlqk} lies in [0,1] and is normalized so that Σ_{l=1}^{L} Σ_{k=1}^{K} A_{mlqk} = 1; Δp_{mlqk} ∈ R^2 is a two-dimensional real offset with unconstrained range; x^l(φ_l(p̂_q) + Δp_{mlqk}) is computed by bilinear interpolation; Δp_{mlqk} and A_{mlqk} are obtained by linear projection of the query feature z_q; normalized coordinates p̂_q ∈ [0,1]^2 make the formula scale-independent, with (0,0) and (1,1) denoting the top-left and bottom-right image corners respectively; the function φ_l(p̂_q) rescales the normalized coordinates to the input feature map of the l-th frame. Note that multi-frame temporal deformable attention samples LK points from the L feature maps, rather than K points from a single-frame feature map.
Further, the specific steps of S7 include:
S7.1: a TQE layer comprises a self-attention module, a cross-attention module (Cross-Attention) and a feed-forward layer;
S7.2: using the output of step 5.7 as the input of the self-attention module;
S7.3: using the output of step 7.2 as the input of the cross-attention module, combined with the spatial object queries of all reference frames, denoted Q_ref; scoring and selection are performed in a coarse-to-fine manner, i.e. an additional feed-forward layer predicts class logits and the sigmoid score p = Sigmoid[FFN(Q_ref)] is computed; all reference points are sorted by their p values, the queries with the highest k scores are input into a shallow network, and those with lower scores are input into a deeper network;
S7.4: the output queries are updated iteratively.
Further, the specific steps of S8 include:
S8.1: taking the output of S6.4 as the input of the self-attention module;
S8.2: taking the refined temporal object queries of S7.4 and the output of S8.1 as the inputs of the deformable aggregation attention module;
S8.3: taking the output of S8.2 as the input of the feed-forward layer;
S8.4: taking the output of S8.3 as the input of the FFN to realize target detection and identification.
Further, in S8.3, the loss function is

L = λ_{cls} · L_{cls} + λ_{L1} · L_{L1} + λ_{GIoU} · L_{GIoU}

where L_{cls} denotes the focal loss used for classification, L_{L1} and L_{GIoU} denote the L1 loss and the generalized IoU loss used for localization, and λ_{cls}, λ_{L1} and λ_{GIoU} are the corresponding coefficients.
Compared with the prior art, the invention has the following beneficial effects: the disclosed helmet-wearing identification method selects a temporal Transformer model based on multi-convolution-kernel residual modules and, combined with deep learning, provides a method suited to detecting moving individuals in complex electric power operation environments; it realizes automatic identification and tracking of the helmet-wearing condition of workers in the electric power industry during safe operation, reduces the false alarm rate, and improves detection efficiency.
Drawings
FIG. 1 is a flow chart of the helmet-wearing identification method based on the multi-convolution-kernel residual module temporal Transformer model according to the present invention;
FIG. 2 shows the multi-convolution-kernel residual module temporal Transformer model of the method according to the present invention;
FIG. 3 shows the multi-convolution-kernel residual module of the MK-RCN used in the method according to the present invention;
FIG. 4 is an image frame from a video used by the method according to the present invention;
FIG. 5 shows the three convolution kernels used by the method according to the present invention;
FIG. 6 is a perception map with mosaic obtained during convolution in the method according to the present invention;
FIG. 7 shows the target detection result for a single frame of a video in the method according to the present invention;
FIG. 8 shows the tracking result of the method according to the present invention in state 1;
FIG. 9 shows the tracking result of the method according to the present invention in state 2;
FIG. 10 shows the tracking result of the method according to the present invention in state 3.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
The invention provides the following technical scheme. The safety-helmet-wearing identification method based on the multi-convolution-kernel residual module temporal Transformer model mainly comprises four parts: a multi-convolution-kernel residual neural network (MK-RCN) backbone for extracting feature representations, a Transformer encoder-decoder, a temporal Transformer, and a feed-forward network (FFN) that performs the final detection and identification. The temporal Transformer consists of three components: a temporal deformable Transformer encoder (TDTE), a temporal query encoder (TQE) and a temporal deformable Transformer decoder (TDTD).
The MK-RCN structure aims to improve the model's perception sensitivity to targets of varying shape and scale and to strengthen feature reuse, so as to address the low accuracy of small-target identification. The Transformer encoder-decoder encodes each frame (both reference frames and the current frame) into two compact representations: spatial object queries and a memory encoding. The TDTE encodes the spatio-temporal feature representation to provide location cues for the final decoder output. The TQE models the interaction between objects in the current frame and objects in the reference frames, and is used for fused object queries. The TDTD learns the temporal context of different frames to obtain the detection result of the current frame. The FFN structure realizes target detection and identification.
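The end-to-end data flow described above can be summarized in the following PyTorch-style sketch. It is only a wiring diagram with placeholder layers (standard nn.Transformer* layers and a 1×1 convolution stand in for the MK-RCN, TDTE, TQE and TDTD); the tensor shapes, hyper-parameters and module internals are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class PlaceholderPipeline(nn.Module):
    """Wiring sketch only: every sub-module is a stand-in for the real component."""
    def __init__(self, d_model=256, num_queries=300, num_classes=2):
        super().__init__()
        self.backbone = nn.Conv2d(3, d_model, 1)                                 # stands in for MK-RCN
        self.encoder = nn.TransformerEncoderLayer(d_model, 8, batch_first=True)  # spatial encoder
        self.decoder = nn.TransformerDecoderLayer(d_model, 8, batch_first=True)  # spatial decoder
        self.obj_query = nn.Embedding(num_queries, d_model)
        self.tdte = nn.TransformerEncoderLayer(d_model, 8, batch_first=True)     # TDTE stand-in
        self.tqe = nn.TransformerDecoderLayer(d_model, 8, batch_first=True)      # TQE stand-in
        self.tdtd = nn.TransformerDecoderLayer(d_model, 8, batch_first=True)     # TDTD stand-in
        self.class_head = nn.Linear(d_model, num_classes + 1)                    # FFN heads
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, frames):                       # frames: (L, 3, H, W), last frame = current frame
        memories, queries = [], []
        for f in frames:                             # per-frame spatial encoding / decoding
            feat = self.backbone(f.unsqueeze(0))     # (1, C, H, W)
            tokens = feat.flatten(2).transpose(1, 2)                   # (1, H*W, C)
            mem = self.encoder(tokens)                                 # memory encoding
            q = self.decoder(self.obj_query.weight.unsqueeze(0), mem)  # spatial object queries
            memories.append(mem)
            queries.append(q)
        temporal_mem = self.tdte(torch.cat(memories, dim=1))           # TDTE over all frames
        cur_q, ref_q = queries[-1], torch.cat(queries[:-1], dim=1)
        fused_q = self.tqe(cur_q, ref_q)                               # TQE: enhance current-frame queries
        out = self.tdtd(fused_q, temporal_mem)                         # TDTD: temporal context
        return self.class_head(out), self.box_head(out).sigmoid()

frames = torch.randn(4, 3, 16, 16)                   # tiny frames keep the toy run cheap
logits, boxes = PlaceholderPipeline()(frames)
print(logits.shape, boxes.shape)                     # torch.Size([1, 300, 3]) torch.Size([1, 300, 4])
```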
The safety-helmet-wearing identification method based on the multi-convolution-kernel residual module temporal Transformer model specifically comprises the following steps.
the method comprises the following steps that S1, a dynamic video of the power operation is obtained through field monitoring, a power monitoring system is used for centralized management, scheduling, control and data acquisition of a power supply system, and the field monitoring is that an intelligent system used for monitoring and controlling the power production and supply process obtains the dynamic video of the power operation field through a background workstation of the power monitoring system.
Step S2: feature representations are extracted from the video images of power operators using the MK-RCN structure. The specific steps of S2 include:
S2.1: initializing the parameters of the MK-RCN structure, using ResNet-18 as the network backbone and three convolution kernels of sizes 3×3, 3×1 and 1×3 in the residual module, with a learning rate of 10^{-5} and a weight decay of 10^{-4};
S2.2: extracting frames (t−i) to t of the power-operation dynamic video and extracting their features, where the initial image has size 3×H_0×W_0 and the MK-RCN generates a new feature map of size C×H×W.
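As an illustration of the residual module described in S2.1, the following is a minimal sketch of a multi-convolution-kernel residual block with the three kernel sizes 3×3, 3×1 and 1×3. How the three branches are combined and normalized is not specified above, so the element-wise sum and the BatchNorm/ReLU placement are assumptions.

```python
import torch
import torch.nn as nn

class MultiKernelResidualBlock(nn.Module):
    """Three parallel convolutions (3x3, 3x1, 1x3) summed with the identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv3x1 = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
        self.conv1x3 = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        branches = self.conv3x3(x) + self.conv3x1(x) + self.conv1x3(x)  # assumed combination: sum
        return self.relu(x + self.bn(branches))                        # residual shortcut

# toy usage: a clip of frames of size 3 x H0 x W0 mapped to C x H x W features
frames = torch.randn(5, 3, 224, 224)
stem = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)  # ResNet-18-style stem (assumed)
block = MultiKernelResidualBlock(64)
features = block(stem(frames))
print(features.shape)   # torch.Size([5, 64, 112, 112])
```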
Step S3: the features obtained in S2 are supplemented with position coding to obtain position embeddings. The specific steps of S3 include:
S3.1: dividing the feature map obtained in S2.2 into three parts, one of which is used directly as the value vector V, while the other two are added directly to the position-coding vector to serve as the key vector K and the query vector Q respectively, where the position-coding vectors are

PE(pos, 2i) = sin( pos / 10000^{2i/d_model} )      (3-1)
PE(pos, 2i+1) = cos( pos / 10000^{2i/d_model} )    (3-2)

where PE is a two-dimensional matrix, pos denotes the position, and d_model denotes the vector dimension.
From formulas (3-1) and (3-2), the vector PE(pos+k, 2i) can be expressed as a linear function of PE(pos, 2i):

PE(pos+k, 2i) = PE(pos, 2i)·PE(k, 2i+1) + PE(pos, 2i+1)·PE(k, 2i)
PE(pos+k, 2i+1) = PE(pos, 2i+1)·PE(k, 2i+1) − PE(pos, 2i)·PE(k, 2i)
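The position coding of formulas (3-1)/(3-2) and the linear property of PE(pos+k) can be checked with a short sketch, assuming the standard sinusoidal form with a PE matrix of shape (position, d_model):

```python
import torch

def sinusoidal_position_encoding(max_pos: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(max_pos, dtype=torch.float32).unsqueeze(1)   # (max_pos, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimension indices 2i
    div = torch.pow(10000.0, i / d_model)                           # 10000^(2i/d_model)
    pe = torch.zeros(max_pos, d_model)
    pe[:, 0::2] = torch.sin(pos / div)                              # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(pos / div)                              # PE(pos, 2i+1)
    return pe

pe = sinusoidal_position_encoding(max_pos=50, d_model=256)

# PE(pos+k) is a linear combination of PE(pos) and PE(k) (angle-addition identities),
# which lets attention reason about relative positions:
pos, k = 10, 7
lhs = pe[pos + k, 0::2]
rhs = pe[pos, 0::2] * pe[k, 1::2] + pe[pos, 1::2] * pe[k, 0::2]
print(torch.allclose(lhs, rhs, atol=1e-5))   # True
```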
Step S4: the feature map obtained in S2 and the position embedding obtained in S3 are input into the Transformer encoder. The specific steps of S4 include:
S4.1: each Transformer encoder layer consists of a multi-head attention module, an Add & Norm module and a feed-forward module, with 6 layers in total; normalization is applied after the multi-head attention layer and after the feed-forward (Feed-Forward) layer, with an initial learning rate of 2×10^{-4} and a weight decay of 10^{-4};
S4.2: inputting the K, V and Q obtained in S3.1 into the multi-head attention module and outputting a new feature map;
S4.3: adding the new feature map obtained in S4.2 to the original feature map;
S4.4: performing linear dimensionality reduction and ReLU activation;
S4.5: after repeating the above for 6 Transformer encoder layers, encoding ends and the result is output.
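A hedged sketch of one such encoder layer (S4.1-S4.4) follows; the head count, model width and feed-forward size are assumed values, and nn.MultiheadAttention stands in for the multi-head attention module:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=256, nhead=8, dim_ffn=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, dim_ffn), nn.ReLU(),
                                 nn.Linear(dim_ffn, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, src, pos):
        q = k = src + pos                       # S3.1: K and Q carry the position code, V does not
        attn_out, _ = self.attn(q, k, src)      # new feature map (S4.2)
        src = self.norm1(src + attn_out)        # add to the original features, then normalize (S4.3)
        return self.norm2(src + self.ffn(src))  # feed-forward with ReLU + Add & Norm (S4.4)

# usage: a flattened C x H x W feature map as a token sequence, stacked 6 times (S4.5)
tokens = torch.randn(1, 49, 256)                # e.g. a 7x7 feature map
pos = torch.randn(1, 49, 256)                   # position embedding
for layer in nn.ModuleList(EncoderLayer() for _ in range(6)):
    tokens = layer(tokens, pos)
print(tokens.shape)                             # torch.Size([1, 49, 256])
```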
Step S5: the output obtained by object query and the feature map obtained in S4 are input into the Transformer decoder; the Transformer decoder takes the output obtained by object query and the output of the Transformer encoder as inputs, and the input embeddings with learned position coding are converted into output embeddings using a multi-head attention module. The specific steps of S5 include:
S5.1: the inputs of the Transformer decoder comprise the query embedding, the query position and the Transformer encoder output; each layer consists of a multi-head attention module, an Add & Norm module and a feed-forward module, with 6 layers in total; besides the output of the previous layer, the input of each layer includes the query position and the position coding from the Transformer encoder, with an initial learning rate of 2×10^{-4} and a weight decay of 10^{-4};
S5.2: inputting, through the object queries, the query embedding that encodes the anchors, and adding the query embedding to the query position to obtain K and Q, where the number of object queries is set to 300;
S5.3: inputting the K and Q obtained in S5.2 and the output of the object queries into the first multi-head attention module (MultiHeadAttn) to obtain an output, where the multi-head attention module implements

MultiHeadAttn(z_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k∈Ω_k} A_{mqk} · W'_m x_k ]

where m indexes the m-th attention head, W_m and W'_m are learnable weights, and the attention weight A_{mqk} ∝ exp( z_q^T U_m^T V_m x_k / √C_v ) is normalized over k so that Σ_{k∈Ω_k} A_{mqk} = 1, with U_m and V_m also learnable weights;
S5.4: applying Dropout to the output of S5.3, adding the output of the object queries, and outputting the result;
S5.5: adding the output of the object queries to the query position to obtain Q, adding the output of S4.5 to the position-coding vector to obtain K, and inputting the output of S4.5 as V into the second multi-head attention module;
S5.6: performing linear dimensionality reduction and ReLU activation;
S5.7: after passing through 6 Transformer decoder layers, decoding ends and the result is output.
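The decoder layer of S5.2-S5.6 can be sketched in the same style; the 300 object queries, dropout rate and layer widths are assumptions, and the second attention block takes K from the encoder output plus the position code and V from the encoder output, as described above:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=256, nhead=8, dim_ffn=1024, p_drop=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.dropout = nn.Dropout(p_drop)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(d_model, dim_ffn), nn.ReLU(),
                                 nn.Linear(dim_ffn, d_model))

    def forward(self, tgt, query_pos, memory, mem_pos):
        q = k = tgt + query_pos                              # S5.2: query embedding + query position
        out, _ = self.self_attn(q, k, tgt)                   # first MultiHeadAttn (S5.3)
        tgt = self.norm1(tgt + self.dropout(out))            # Dropout + residual add (S5.4)
        out, _ = self.cross_attn(tgt + query_pos,            # Q: decoder output + query position (S5.5)
                                 memory + mem_pos,           # K: encoder output + position code
                                 memory)                     # V: encoder output
        tgt = self.norm2(tgt + out)
        return self.norm3(tgt + self.ffn(tgt))               # feed-forward with ReLU (S5.6)

query_embed = nn.Embedding(300, 256)                         # 300 object queries (S5.2)
tgt = torch.zeros(1, 300, 256)
memory, mem_pos = torch.randn(1, 49, 256), torch.randn(1, 49, 256)
out = DecoderLayer()(tgt, query_embed.weight.unsqueeze(0), memory, mem_pos)
print(out.shape)                                             # torch.Size([1, 300, 256])
```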
Step S6: the feature map of S4 is input into the TDTE, and the spatio-temporal feature representation is encoded. The specific steps of S6 include:
S6.1: a TDTE layer comprises a self-attention module (Self-Attention), a multi-head temporal deformable attention module (TempDefAttn) and a feed-forward layer;
S6.2: taking the output of S4.5 as the input of the self-attention module;
S6.3: taking the output of S6.2 as the input of the multi-head temporal deformable attention module, which implements

TempDefAttn(z_q, p̂_q, {x^l}_{l=1}^{L}) = Σ_{m=1}^{M} W_m [ Σ_{l=1}^{L} Σ_{k=1}^{K} A_{mlqk} · W'_m x^l( φ_l(p̂_q) + Δp_{mlqk} ) ]

where m is the m-th attention head, l is the l-th frame of the same video sample, and k is the k-th sampling point; Δp_{mlqk} and A_{mlqk} denote the sampling offset and attention weight of the k-th sampling point of the m-th attention head in the l-th frame; the scalar attention weight A_{mlqk} lies in [0,1] and is normalized so that Σ_{l=1}^{L} Σ_{k=1}^{K} A_{mlqk} = 1; Δp_{mlqk} ∈ R^2 is a two-dimensional real offset with unconstrained range; x^l(φ_l(p̂_q) + Δp_{mlqk}) is computed by bilinear interpolation; Δp_{mlqk} and A_{mlqk} are obtained by linear projection of the query feature z_q; normalized coordinates p̂_q ∈ [0,1]^2 make the formula scale-independent, with (0,0) and (1,1) denoting the top-left and bottom-right image corners respectively; the function φ_l(p̂_q) rescales the normalized coordinates to the input feature map of the l-th frame; note that multi-frame temporal deformable attention samples LK points from the L feature maps rather than K points from a single-frame feature map;
S6.4: taking the output of S6.3 as the input of the feed-forward layer.
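Since the temporal deformable attention of S6.3 is the least standard component, a simplified sketch is given below: each query predicts per-head, per-frame, per-point offsets Δp_{mlqk} and weights A_{mlqk} by linear projection, the L frame feature maps are sampled with bilinear interpolation (F.grid_sample), and the L·K samples are combined with softmax-normalized weights. Tensor layouts, the pixel-unit offset scaling and the per-head loop are simplifying assumptions; an optimized implementation (e.g. a fused CUDA kernel) would differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDeformableAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_frames=4, n_points=4):
        super().__init__()
        self.h, self.l, self.k = n_heads, n_frames, n_points
        self.d_head = d_model // n_heads
        self.offsets = nn.Linear(d_model, n_heads * n_frames * n_points * 2)   # predicts Δp_mlqk
        self.weights = nn.Linear(d_model, n_heads * n_frames * n_points)       # predicts A_mlqk (pre-softmax)
        self.value_proj = nn.Linear(d_model, d_model)                          # plays the role of W'_m
        self.out_proj = nn.Linear(d_model, d_model)                            # plays the role of W_m

    def forward(self, queries, ref_points, frame_feats):
        # queries: (B, Q, C); ref_points: (B, Q, 2) normalized to [0, 1] as (x, y);
        # frame_feats: (B, L, C, H, W) feature maps of the L frames
        B, Q, C = queries.shape
        H, W = frame_feats.shape[-2:]
        offs = self.offsets(queries).view(B, Q, self.h, self.l, self.k, 2)
        attn = self.weights(queries).view(B, Q, self.h, self.l * self.k)
        attn = attn.softmax(-1).view(B, Q, self.h, self.l, self.k)             # Σ_{l,k} A_mlqk = 1
        vals = self.value_proj(frame_feats.permute(0, 1, 3, 4, 2))             # (B, L, H, W, C)
        vals = vals.permute(0, 1, 4, 2, 3).reshape(B * self.l, self.h, self.d_head, H, W)

        # sampling locations φ_l(p̂_q) + Δp_mlqk, rescaled to grid_sample's [-1, 1] range
        loc = ref_points[:, :, None, None, None, :] + offs / torch.tensor([W, H], dtype=offs.dtype)
        grid = 2.0 * loc - 1.0                                                 # (B, Q, h, L, K, 2)
        grid = grid.permute(0, 3, 2, 1, 4, 5).reshape(B * self.l, self.h, Q * self.k, 2)

        out = []
        for m in range(self.h):                                                # per attention head
            sampled = F.grid_sample(vals[:, m], grid[:, m].unsqueeze(1),
                                    mode='bilinear', align_corners=False)      # (B*L, d_head, 1, Q*K)
            out.append(sampled.view(B, self.l, self.d_head, Q, self.k))
        sampled = torch.stack(out, dim=2)                                      # (B, L, h, d_head, Q, K)
        weighted = (sampled * attn.permute(0, 3, 2, 1, 4).unsqueeze(3)).sum(dim=(1, 5))
        return self.out_proj(weighted.permute(0, 3, 1, 2).reshape(B, Q, C))

attn = TemporalDeformableAttention()
y = attn(torch.randn(2, 300, 256), torch.rand(2, 300, 2), torch.randn(2, 4, 256, 16, 16))
print(y.shape)   # torch.Size([2, 300, 256])
```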
Step S7: the output of S5 is embedded and input into the TQE, and the spatial object queries of all reference frames are acquired to enhance the spatial output query of the current frame. The specific steps of S7 include:
S7.1: a TQE layer comprises a self-attention module, a cross-attention module (Cross-Attention) and a feed-forward layer;
S7.2: taking the output of S5.7 as the input of the self-attention module;
S7.3: taking the output of S7.2 as the input of the cross-attention module, combined with the spatial object queries of all reference frames, denoted Q_ref; scoring and selection are performed in a coarse-to-fine manner, i.e. an additional feed-forward layer predicts class logits and the sigmoid score p = Sigmoid[FFN(Q_ref)] is computed; all reference points are sorted by their p values, the queries with the highest k scores are input into a shallow network, and those with lower scores are input into a deeper network;
S7.4: the output queries are updated iteratively.
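The coarse-to-fine scoring in S7.3 reduces to a top-k selection over sigmoid scores; a small sketch follows, where the score-head size, the number of reference queries and the value of k are assumptions:

```python
import torch
import torch.nn as nn

d_model, num_ref_queries, top_k = 256, 900, 300
score_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                           nn.Linear(d_model, 1))             # the "additional feed-forward layer"

q_ref = torch.randn(1, num_ref_queries, d_model)              # spatial object queries of all reference frames
p = score_head(q_ref).squeeze(-1).sigmoid()                   # p = Sigmoid[FFN(Q_ref)], shape (1, 900)
order = p.argsort(dim=-1, descending=True)                    # sort all reference points by p

high_idx, low_idx = order[:, :top_k], order[:, top_k:]        # highest k -> shallow net, rest -> deeper net
q_high = torch.gather(q_ref, 1, high_idx.unsqueeze(-1).expand(-1, -1, d_model))
q_low = torch.gather(q_ref, 1, low_idx.unsqueeze(-1).expand(-1, -1, d_model))
print(q_high.shape, q_low.shape)   # torch.Size([1, 300, 256]) torch.Size([1, 600, 256])
```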
Step S8: the outputs of S6 and S7 are input into the TDTD to learn the temporal context of different frames. A TDTD layer comprises a self-attention module, a deformable aggregation attention module and a feed-forward layer, and the deformable aggregation attention module implements

DefAttn(z_q, p̂_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k=1}^{K} A_{mqk} · W'_m x( φ(p̂_q) + Δp_{mqk} ) ]

where x denotes the temporally aggregated feature map.
The specific steps of S8 include:
S8.1: taking the output of S6.4 as the input of the self-attention module;
S8.2: taking the refined temporal object queries of S7.4 and the output of S8.1 as the inputs of the deformable aggregation attention module;
S8.3: taking the output of S8.2 as the input of the feed-forward layer, where the loss function is

L = λ_{cls} · L_{cls} + λ_{L1} · L_{L1} + λ_{GIoU} · L_{GIoU}

with L_{cls} denoting the focal loss used for classification, L_{L1} and L_{GIoU} denoting the L1 loss and the generalized IoU loss used for localization, and λ_{cls}, λ_{L1} and λ_{GIoU} their coefficients;
S8.4: taking the output of S8.3 as the input of the FFN to realize target detection and identification.
Step S9: each output embedding of S8 is passed to the FFN for final target detection and identification.
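For the loss of S8.3, a minimal sketch of the weighted combination of focal, L1 and generalized-IoU terms is given below; the coefficient values, the torchvision helper functions and the omission of the bipartite matching between predictions and ground truth are choices made here for illustration, not details taken from the text above.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou

def detection_loss(pred_logits, pred_boxes, tgt_labels, tgt_boxes,
                   lambda_cls=2.0, lambda_l1=5.0, lambda_giou=2.0):
    # pred_logits: (N, num_classes); pred/tgt boxes: (N, 4) in (x1, y1, x2, y2) format
    cls_loss = sigmoid_focal_loss(pred_logits, tgt_labels, reduction='mean')        # focal loss term
    l1_loss = F.l1_loss(pred_boxes, tgt_boxes)                                       # L1 term
    giou_loss = (1.0 - torch.diag(generalized_box_iou(pred_boxes, tgt_boxes))).mean()  # GIoU term
    return lambda_cls * cls_loss + lambda_l1 * l1_loss + lambda_giou * giou_loss

# toy matched pairs: 2 classes (helmet worn / not worn), one-hot targets
pred_logits = torch.randn(4, 2)
tgt_labels = F.one_hot(torch.tensor([0, 1, 0, 1]), num_classes=2).float()
pred_boxes = torch.tensor([[0.10, 0.10, 0.40, 0.50]] * 4)
tgt_boxes = torch.tensor([[0.12, 0.10, 0.42, 0.55]] * 4)
print(detection_loss(pred_logits, pred_boxes, tgt_labels, tgt_boxes))
```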
FIG. 4 shows a frame extracted from a surveillance video recorded during construction in the power industry. The multi-convolution-kernel residual neural network is first used to extract feature representations; its three convolution kernels are shown in FIG. 5. This design improves the model's sensitivity to targets of varying shape and scale and addresses the low accuracy of small-target identification. The convolution result is shown in FIG. 6.
The surveillance video recorded during construction is manually annotated for helmet wearing, and the video data are then randomly split into training frames and test frames using the randperm function. The training frames are used to train the target detector, and the test frames are used to test target identification and individual tracking; on the test set the accuracy reaches 83.7% and the recall reaches 85.5%. The detection result for a single frame of the video is shown in FIG. 7, where the green boxes and the blue boxes indicate workers wearing the safety helmet correctly as required and workers not wearing it as required, respectively.
Accuracy P = (TP + TN)/(TP + FN + FP + TN); recall R = TP/(TP + FN), where TP is the number of frames in which a correctly worn helmet is predicted as worn, FN is the number of frames in which a correctly worn helmet is wrongly predicted as not worn, FP is the number of frames in which a missing helmet is wrongly predicted as worn, and TN is the number of frames in which a missing helmet is correctly predicted as not worn. This test procedure is a common means of validating machine-learning algorithms. Because deep learning algorithms are data-driven, many factors affect the accuracy and recall of the algorithm, and it is difficult to point precisely to which step of the algorithm should be improved.
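A small sketch of the accuracy and recall computation defined above, using illustrative counts only (not the experimental data reported here):

```python
def accuracy_and_recall(tp, fn, fp, tn):
    p = (tp + tn) / (tp + fn + fp + tn)   # accuracy P
    r = tp / (tp + fn)                    # recall R
    return p, r

# hypothetical frame counts, chosen only to exercise the formulas
p, r = accuracy_and_recall(tp=120, fn=20, fp=15, tn=45)
print(f"accuracy = {p:.3f}, recall = {r:.3f}")
```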
Frames (t−i) to t of the power-operation dynamic video are extracted and their features are extracted; the TDTE module encodes the temporal and spatial features, the TQE module enhances the spatial output query of the current frame, and the TDTD module learns the temporal context of different frames to obtain the detection result for the current frame, thereby identifying the wearing state of construction workers at different times. FIGS. 8 to 10 show the recognition results for construction workers in time states 1, 2 and 3, respectively.
Experiments on surveillance video of safe operation in the power industry verify the usability and efficiency of the scheme of the invention.
The above description covers only preferred embodiments of the present invention, and the protection scope of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical solutions and inventive concept of the present invention, shall fall within the protection scope of the present invention.

Claims (10)

1. A safety-helmet-wearing identification method based on a multi-convolution-kernel residual module temporal Transformer model, characterized in that the design mainly comprises four parts: a multi-convolution-kernel residual neural network backbone for extracting feature representations, a Transformer encoder-decoder, a temporal Transformer and a feed-forward network, wherein the temporal Transformer consists of three components, namely a temporal deformable Transformer encoder, a temporal query encoder and a temporal deformable Transformer decoder, and the method mainly comprises the following steps:
S1: acquiring a dynamic video of power operation through on-site monitoring or other camera equipment;
S2: extracting feature representations from the video images of power operators using the multi-convolution-kernel residual neural network backbone structure;
S3: supplementing the features extracted in S2 with position coding to obtain position embeddings, where the position-coding vectors are

PE(pos, 2i) = sin( pos / 10000^{2i/d_model} )      (3-1)
PE(pos, 2i+1) = cos( pos / 10000^{2i/d_model} )    (3-2)
where PE is a two-dimensional matrix, pos denotes the position, and d_model denotes the vector dimension;
S4: passing the feature map obtained in S2 and the position embedding obtained in S3 to the Transformer encoder;
S5: taking the output obtained by object query and the output of the Transformer encoder in S4 as inputs of the Transformer decoder, converting the input embeddings with learned position coding into output embeddings using a multi-head attention module MultiHeadAttn, and passing each output embedding of the Transformer encoder and decoder to the temporal Transformer, where the multi-head attention module implements

MultiHeadAttn(z_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k∈Ω_k} A_{mqk} · W'_m x_k ]

where m indexes the m-th attention head, W_m and W'_m are learnable weights, and the attention weight A_{mqk} ∝ exp( z_q^T U_m^T V_m x_k / √C_v ) is normalized over k so that Σ_{k∈Ω_k} A_{mqk} = 1, with U_m and V_m also learnable weights;
S6: inputting the feature map of S4 into the temporal deformable Transformer encoder and encoding the spatio-temporal feature representation;
S7: inputting the output embeddings of step 5 into the temporal query encoder, and acquiring the spatial object queries of all reference frames to enhance the spatial output query of the current frame;
S8: inputting the outputs of step 6 and step 7 into the temporal deformable Transformer decoder to learn the temporal context of different frames, the temporal deformable Transformer decoder layer comprising a self-attention module, a deformable aggregation attention module and a feed-forward layer, the deformable aggregation attention module implementing

DefAttn(z_q, p̂_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k=1}^{K} A_{mqk} · W'_m x( φ(p̂_q) + Δp_{mqk} ) ]

where x denotes the temporally aggregated feature map;
S9: passing each output embedding of step 8 to the feed-forward network, which performs the final target detection and identification.
2. The safety-helmet-wearing identification method based on the multi-convolution-kernel residual module temporal Transformer model according to claim 1, characterized in that the specific steps of S2 include:
S2.1: initializing the parameters of the multi-convolution-kernel residual neural network structure, using ResNet-18 as the network backbone and three convolution kernels of sizes 3×3, 3×1 and 1×3 in the residual module, with a learning rate of 10^{-5} and a weight decay of 10^{-4};
S2.2: extracting frames (t−i) to t of the power-operation dynamic video and extracting their features, where the initial image has size 3×H_0×W_0 and the multi-convolution-kernel residual neural network generates a new feature map of size C×H×W.
3. The method according to claim 2, characterized in that the specific steps of S3 include:
S3.1: dividing the feature map obtained in S2.2 into three parts, one of which is used directly as the value vector V, while the other two are added directly to the position-coding vector to serve as the key vector K and the query vector Q respectively; according to the position-coding formulas (3-1) and (3-2), the vector PE(pos+k, 2i) can be expressed as a linear function of PE(pos, 2i):

PE(pos+k, 2i) = PE(pos, 2i)·PE(k, 2i+1) + PE(pos, 2i+1)·PE(k, 2i)
PE(pos+k, 2i+1) = PE(pos, 2i+1)·PE(k, 2i+1) − PE(pos, 2i)·PE(k, 2i)
4. The safety-helmet-wearing identification method based on the multi-convolution-kernel residual module temporal Transformer model according to claim 3, characterized in that the specific steps of S4 include:
S4.1: each Transformer encoder layer consists of a multi-head attention module, an Add & Norm module and a feed-forward module, with 6 layers in total; normalization is applied after the multi-head attention layer and after the feed-forward (Feed-Forward) layer, with an initial learning rate of 2×10^{-4} and a weight decay of 10^{-4};
S4.2: inputting the K, V and Q obtained in S3.1 into the multi-head attention module and outputting a new feature map;
S4.3: adding the new feature map obtained in S4.2 to the original feature map;
S4.4: performing linear dimensionality reduction and ReLU activation;
S4.5: after repeating the above for 6 Transformer encoder layers, encoding ends and the result is output.
5. The safety-helmet-wearing identification method based on the multi-convolution-kernel residual module temporal Transformer model according to claim 4, characterized in that the specific steps of S5 include:
S5.1: the inputs of the Transformer decoder comprise the query embedding, the query position and the Transformer encoder output; each layer consists of a multi-head attention module, an Add & Norm module and a feed-forward module, with 6 layers in total; besides the output of the previous layer, the input of each layer includes the query position and the position coding from the Transformer encoder, with an initial learning rate of 2×10^{-4} and a weight decay of 10^{-4};
S5.2: inputting, through the object queries, the query embedding that encodes the anchors, and adding the query embedding to the query position to obtain K and Q, where the number of object queries is set to 300;
S5.3: inputting the K and Q obtained in S5.2 and the output of the object queries into the first multi-head attention module MultiHeadAttn to obtain an output;
S5.4: applying Dropout to the output of S5.3, adding the output of the object queries, and outputting the result;
S5.5: adding the output of the object queries to the query position to obtain Q, adding the output of S4.5 to the position-coding vector to obtain K, and inputting the output of S4.5 as V into the second multi-head attention module;
S5.6: performing linear dimensionality reduction and ReLU activation;
S5.7: after passing through 6 Transformer decoder layers, decoding ends and the result is output.
6. The safety-helmet-wearing identification method based on the multi-convolution-kernel residual module temporal Transformer model according to claim 5, characterized in that the specific steps of S6 include:
S6.1: the TDTE layer comprises a self-attention module (Self-Attention), a multi-head temporal deformable attention module (TempDefAttn) and a feed-forward layer;
S6.2: taking the output of S4.5 as the input of the self-attention module;
S6.3: taking the output of S6.2 as the input of the multi-head temporal deformable attention module;
S6.4: taking the output of S6.3 as the input of the feed-forward layer.
7. The method according to claim 6, characterized in that, in S6.3, the multi-head temporal deformable attention module implements

TempDefAttn(z_q, p̂_q, {x^l}_{l=1}^{L}) = Σ_{m=1}^{M} W_m [ Σ_{l=1}^{L} Σ_{k=1}^{K} A_{mlqk} · W'_m x^l( φ_l(p̂_q) + Δp_{mlqk} ) ]

where m is the m-th attention head, l is the l-th frame of the same video sample, and k is the k-th sampling point; Δp_{mlqk} and A_{mlqk} denote the sampling offset and attention weight of the k-th sampling point of the m-th attention head in the l-th frame; the scalar attention weight A_{mlqk} lies in [0,1] and is normalized so that Σ_{l=1}^{L} Σ_{k=1}^{K} A_{mlqk} = 1; Δp_{mlqk} ∈ R^2 is a two-dimensional real offset with unconstrained range; x^l(φ_l(p̂_q) + Δp_{mlqk}) is computed by bilinear interpolation; Δp_{mlqk} and A_{mlqk} are obtained by linear projection of the query feature z_q; normalized coordinates p̂_q ∈ [0,1]^2 make the formula scale-independent, with (0,0) and (1,1) denoting the top-left and bottom-right image corners respectively; the function φ_l(p̂_q) rescales the normalized coordinates to the input feature map of the l-th frame; multi-frame temporal deformable attention samples LK points from the L feature maps rather than K points from a single-frame feature map.
8. The method according to claim 7, characterized in that the specific steps of S7 include:
S7.1: the TQE layer comprises a self-attention module, a cross-attention module (Cross-Attention) and a feed-forward layer;
S7.2: taking the output of step 5.7 as the input of the self-attention module;
S7.3: taking the output of step 7.2 as the input of the cross-attention module, combined with the spatial object queries of all reference frames, denoted Q_ref; scoring and selection are performed in a coarse-to-fine manner, i.e. an additional feed-forward layer predicts class logits and the sigmoid score p = Sigmoid[FFN(Q_ref)] is computed; all reference points are sorted by their p values, the queries with the highest k scores are input into a shallow network, and those with lower scores are input into a deeper network;
S7.4: the output queries are updated iteratively.
9. The method according to claim 8, characterized in that the specific steps of S8 include:
S8.1: taking the output of S6.4 as the input of the self-attention module;
S8.2: taking the refined temporal object queries of S7.4 and the output of S8.1 as the inputs of the deformable aggregation attention module;
S8.3: taking the output of S8.2 as the input of the feed-forward layer;
S8.4: taking the output of S8.3 as the input of the FFN to realize target detection and identification.
10. The method according to claim 9, characterized in that, in S8.3, the loss function is

L = λ_{cls} · L_{cls} + λ_{L1} · L_{L1} + λ_{GIoU} · L_{GIoU}

where L_{cls} denotes the focal loss used for classification, L_{L1} and L_{GIoU} denote the L1 loss and the generalized IoU loss used for localization, and λ_{cls}, λ_{L1} and λ_{GIoU} are the corresponding coefficients.
CN202211214277.5A 2022-09-28 2022-09-28 Helmet wearing identification method based on multi-convolution kernel residual error module time transformer model Pending CN115471776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211214277.5A CN115471776A (en) 2022-09-28 2022-09-28 Helmet wearing identification method based on multi-convolution kernel residual error module time transformer model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211214277.5A CN115471776A (en) 2022-09-28 2022-09-28 Helmet wearing identification method based on multi-convolution kernel residual error module time transformer model

Publications (1)

Publication Number Publication Date
CN115471776A true CN115471776A (en) 2022-12-13

Family

ID=84334824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211214277.5A Pending CN115471776A (en) 2022-09-28 2022-09-28 Helmet wearing identification method based on multi-convolution kernel residual error module time transformer model

Country Status (1)

Country Link
CN (1) CN115471776A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114782986A (en) * 2022-03-28 2022-07-22 佳源科技股份有限公司 Helmet wearing detection method, device, equipment and medium based on deep learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114782986A (en) * 2022-03-28 2022-07-22 佳源科技股份有限公司 Helmet wearing detection method, device, equipment and medium based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LU HE ET AL.: "End-to-End Video Object Detection with Spatial-Temporal Transformers", 《ARXIV:2105.10920V1》, pages 1 - 10 *

Similar Documents

Publication Publication Date Title
CN112287816B (en) Dangerous work area accident automatic detection and alarm method based on deep learning
CN112016500A (en) Group abnormal behavior identification method and system based on multi-scale time information fusion
CN110472519B (en) Human face in-vivo detection method based on multiple models
CN112560745B (en) Method for discriminating personnel on electric power operation site and related device
CN113516076A (en) Improved lightweight YOLO v4 safety protection detection method based on attention mechanism
CN110349229A (en) A kind of Image Description Methods and device
US20230076017A1 (en) Method for training neural network by using de-identified image and server providing same
CN113312973A (en) Method and system for extracting features of gesture recognition key points
CN115205604A (en) Improved YOLOv 5-based method for detecting wearing of safety protection product in chemical production process
CN114972316A (en) Battery case end surface defect real-time detection method based on improved YOLOv5
CN115471771A (en) Video time sequence action positioning method based on semantic level time sequence correlation modeling
CN115188066A (en) Moving target detection system and method based on cooperative attention and multi-scale fusion
Nayak et al. Video anomaly detection using convolutional spatiotemporal autoencoder
CN117115584A (en) Target detection method, device and server
CN115471776A (en) Helmet wearing identification method based on multi-convolution kernel residual error module time transformer model
CN117076983A (en) Transmission outer line resource identification detection method, device, equipment and storage medium
CN116385962A (en) Personnel monitoring system in corridor based on machine vision and method thereof
CN115937788A (en) Yolov5 industrial area-based safety helmet wearing detection method
Fang et al. Safety Helmet Detection Based on Optimized YOLOv5
CN115439926A (en) Small sample abnormal behavior identification method based on key region and scene depth
CN112487927B (en) Method and system for realizing indoor scene recognition based on object associated attention
CN111144492B (en) Scene map generation method for mobile terminal virtual reality and augmented reality
Li et al. The research of recognition of peep door open state of ethylene cracking furnace based on deep learning
CN113051617A (en) Privacy protection method based on improved generation countermeasure network
CN111767907B (en) Method of multi-source data fire detection system based on GA and VGG network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination