CN115471776A - Helmet wearing identification method based on a multi-convolution-kernel residual module temporal Transformer model

Helmet wearing identification method based on a multi-convolution-kernel residual module temporal Transformer model

Info

Publication number
CN115471776A
Authority
CN
China
Prior art keywords
output
module
attention
query
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211214277.5A
Other languages
Chinese (zh)
Inventor
朱建宝
邓伟超
俞鑫春
陈宇
马青山
张才智
叶超
孙根森
陈鹏
曹雯佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong Power Supply Co Of State Grid Jiangsu Electric Power Co
Original Assignee
Nantong Power Supply Co Of State Grid Jiangsu Electric Power Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong Power Supply Co Of State Grid Jiangsu Electric Power Co filed Critical Nantong Power Supply Co Of State Grid Jiangsu Electric Power Co
Priority to CN202211214277.5A priority Critical patent/CN115471776A/en
Publication of CN115471776A publication Critical patent/CN115471776A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection


Abstract

The invention discloses a safety-helmet-wearing identification method based on a multi-convolution-kernel residual module temporal Transformer model. A temporal Transformer model built on multi-convolution-kernel residual modules is combined with deep learning to provide a method suited to detecting moving individuals in complex electric power operation environments, realizing automatic identification and tracking of whether workers in the electric power industry wear safety helmets during safe operation. The method can effectively improve the adaptability and efficiency of safety-helmet-wearing identification in the power industry, and explores an effective and feasible path for applying deep learning to the automatic identification and tracking of dynamic targets in electric power safety operations.

Description

Helmet wearing identification method based on a multi-convolution-kernel residual module temporal Transformer model
Technical Field
The invention relates to safe operation of power systems, and in particular to a helmet-wearing identification method based on a multi-convolution-kernel residual module temporal Transformer model.
Background
With the continued development of deep learning, video image processing technology has been widely applied across many areas of social life. In the electric power industry, safety is familiar to everyone: stable production can only be guaranteed when operations are safe, and the losses caused by safety accidents are enormous. A safety helmet offers a degree of head protection during power operations, so power workers must wear one throughout construction. In recent years, however, power safety accidents caused by workers violating safety regulations and not wearing helmets as required have still occurred from time to time. To prevent such accidents and protect the personal safety of power workers, it is increasingly important for the power industry to develop a system that can automatically identify abnormal situations such as an operator working without a safety helmet.
To address these problems, Chinese patent publication No. CN114387508A, for example, discloses a Transformer-based safety helmet identification method: features are first extracted from the image and image position information is added; attention feature information is then obtained through a Transformer encoding module; the attention feature information and target query information are input into a Transformer decoding module, which outputs an attention feature map; finally, a feed-forward neural network predicts the object class, center coordinates, and the height and width of the bounding box.
Similarly, Chinese patent publication No. CN114241247A discloses a transformer-substation helmet identification method and system based on a deep residual network, which identifies helmets by constructing a deep residual network with four residual blocks. This effectively improves detection and identification accuracy, makes deep networks easier to train, and avoids the gradient explosion and vanishing-gradient problems that conventional convolutional networks suffer as depth increases. The constructed deep residual network adopts a Dropout layer so that random neurons are disabled with a certain probability, which effectively prevents overfitting. During training, the loss value is calculated by combining a Softmax function with a cross-entropy function, which improves training efficiency and slows the performance degradation of the trained deep residual network model.
At present, existing safety-helmet-wearing identification techniques have the following shortcomings: the two patents above mainly focus on detecting and identifying safety helmets in static images; target detection efficiency is not high and accuracy still needs improvement; and accurate detection of dynamic targets operating in complex power environments remains unsolved. The prior art therefore still needs to be improved.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides a helmet-wearing identification method based on a multi-convolution-kernel residual module temporal Transformer model to solve the above problems.
To achieve this purpose, the invention is realized by the following technical scheme. The design mainly comprises four parts: a multi-convolution-kernel residual neural network (MK-RCN) backbone for extracting feature representations, a Transformer encoder-decoder, a temporal Transformer, and a feed-forward network (FFN). The temporal Transformer consists of three components: a temporal deformable Transformer encoder (TDTE), a temporal query encoder (TQE), and a temporal deformable Transformer decoder (TDTD).
The safety-helmet-wearing identification method based on the multi-convolution-kernel residual module temporal Transformer model mainly comprises the following steps:
S1: acquiring a dynamic video of power operation through on-site monitoring or other camera equipment;
S2: extracting feature representations from the video images of power operators using the MK-RCN structure;
S3: supplementing the features extracted in S2 with position coding to obtain position embeddings, where the position-coding vectors are

PE(pos, 2i) = sin( pos / 10000^{2i/d_model} )      (3-1)
PE(pos, 2i+1) = cos( pos / 10000^{2i/d_model} )    (3-2)

where PE is a two-dimensional matrix, pos denotes the position, and d_model denotes the vector dimension;
S4: passing the feature map obtained in S2 and the position embedding obtained in S3 to the Transformer encoder;
S5: taking the output obtained by object query and the output of the Transformer encoder in S4 as inputs of the Transformer decoder, converting the input embeddings with learned position coding into output embeddings using a multi-head attention module (MultiHeadAttn), and passing each output embedding of the Transformer encoder and decoder to the temporal Transformer, where the multi-head attention module implements

MultiHeadAttn(z_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k∈Ω_k} A_{mqk} · W'_m x_k ]

where m indexes the m-th attention head, W_m and W'_m are learnable weights, and the attention weight A_{mqk} ∝ exp( z_q^T U_m^T V_m x_k / √C_v ) is normalized over k so that Σ_{k∈Ω_k} A_{mqk} = 1, with U_m and V_m also learnable weights;
S6: inputting the feature map of S4 into the TDTE and encoding the spatio-temporal feature representation;
S7: inputting the output embeddings of step 5 into the TQE, and acquiring the spatial object queries of all reference frames to enhance the spatial output query of the current frame;
S8: inputting the outputs of step 6 and step 7 into the TDTD to learn the temporal context of different frames, where a TDTD layer comprises a self-attention module, a deformable aggregation attention module and a feed-forward layer, and the deformable aggregation attention module implements

DefAttn(z_q, p̂_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k=1}^{K} A_{mqk} · W'_m x( φ(p̂_q) + Δp_{mqk} ) ]

where x denotes the temporally aggregated feature map;
S9: passing each output embedding of step 8 to the FFN, which performs the final target detection and identification.
Further, the specific steps of S2 include:
S2.1: initializing the parameters of the MK-RCN structure, using ResNet-18 as the network backbone and three convolution kernels of sizes 3×3, 3×1 and 1×3 in the residual module, with a learning rate of 10^{-5} and a weight decay of 10^{-4};
S2.2: extracting frames (t−i) to t of the power-operation dynamic video and extracting their features, where the initial image has size 3×H_0×W_0 and the MK-RCN generates a new feature map of size C×H×W.
Further, the specific steps of S3 include:
S3.1: dividing the feature map obtained in S2.2 into three parts, one of which is used directly as the value vector V, while the other two are added directly to the position-coding vector to serve as the key vector K and the query vector Q respectively; according to the position-coding formulas (3-1) and (3-2), the vector PE(pos+k, 2i) can be expressed as a linear function of PE(pos, 2i):

PE(pos+k, 2i) = PE(pos, 2i)·PE(k, 2i+1) + PE(pos, 2i+1)·PE(k, 2i)
PE(pos+k, 2i+1) = PE(pos, 2i+1)·PE(k, 2i+1) − PE(pos, 2i)·PE(k, 2i)
further, the specific step of S4 includes:
s4.1: each layer of the transform encoder is composed of a multi-head attention mechanism module, add&The Norm module and the Forward propagation module are composed of 6 layers in total, normalization is respectively carried out after a multi-head attention layer and a Forward feedback layer (Feed-Forward), and the initial learning rate is 2 multiplied by 10 -4 Weight attenuation of 10 -4
S4.2: inputting the KVQ obtained in the S3.1 into a multi-head attention module, and outputting a new characteristic diagram;
s4.3: adding the new characteristic diagram obtained in the step S4.2 with the original characteristic diagram;
s4.4: performing linear reduction dimensionality and ReLU activation;
s4.5: and repeating the 6 transform encoder layers, finishing encoding and outputting.
Further, the specific steps of S5 include:
S5.1: the inputs of the Transformer decoder comprise the query embedding, the query position and the Transformer encoder output; each layer consists of a multi-head attention module, an Add & Norm module and a feed-forward module, with 6 layers in total; besides the output of the previous layer, the input of each layer includes the query position and the position coding from the Transformer encoder, with an initial learning rate of 2×10^{-4} and a weight decay of 10^{-4};
S5.2: inputting, through the object queries, the query embedding that encodes the anchors, and adding the query embedding to the query position to obtain K and Q, where the number of object queries is set to 300;
S5.3: inputting the K and Q obtained in S5.2 and the output of the object queries into the first multi-head attention module MultiHeadAttn to obtain an output;
S5.4: applying Dropout to the output of S5.3, adding the output of the object queries, and outputting the result;
S5.5: adding the output of the object queries to the query position to obtain Q, adding the output of S4.5 to the position-coding vector to obtain K, and inputting the output of S4.5 as V into the second multi-head attention module;
S5.6: performing linear dimensionality reduction and ReLU activation;
S5.7: after passing through 6 Transformer decoder layers, decoding ends and the result is output.
Further, the specific steps of S6 include:
S6.1: a TDTE layer includes a self-attention module (Self-Attention), a multi-head temporal deformable attention module (TempDefAttn) and a feed-forward layer;
S6.2: using the output of S4.5 as the input of the self-attention module;
S6.3: using the output of S6.2 as the input of the multi-head temporal deformable attention module;
S6.4: using the output of S6.3 as the input of the feed-forward layer.
Further, in S6.3, the multi-head temporal deformable attention module implements

TempDefAttn(z_q, p̂_q, {x^l}_{l=1}^{L}) = Σ_{m=1}^{M} W_m [ Σ_{l=1}^{L} Σ_{k=1}^{K} A_{mlqk} · W'_m x^l( φ_l(p̂_q) + Δp_{mlqk} ) ]

where m is the m-th attention head, l is the l-th frame of the same video sample, and k is the k-th sampling point; Δp_{mlqk} and A_{mlqk} denote the sampling offset and attention weight of the k-th sampling point of the m-th attention head in the l-th frame; the scalar attention weight A_{mlqk} lies in [0,1] and is normalized so that Σ_{l=1}^{L} Σ_{k=1}^{K} A_{mlqk} = 1; Δp_{mlqk} ∈ R^2 is a two-dimensional real offset with unconstrained range; x^l(φ_l(p̂_q) + Δp_{mlqk}) is computed by bilinear interpolation; Δp_{mlqk} and A_{mlqk} are obtained by linear projection of the query feature z_q; normalized coordinates p̂_q ∈ [0,1]^2 make the formula scale-independent, with (0,0) and (1,1) denoting the top-left and bottom-right image corners respectively; the function φ_l(p̂_q) rescales the normalized coordinates to the input feature map of the l-th frame. Note that multi-frame temporal deformable attention samples LK points from the L feature maps, rather than K points from a single-frame feature map.
Further, the specific steps of S7 include:
S7.1: a TQE layer comprises a self-attention module, a cross-attention module (Cross-Attention) and a feed-forward layer;
S7.2: using the output of step 5.7 as the input of the self-attention module;
S7.3: using the output of step 7.2 as the input of the cross-attention module, combined with the spatial object queries of all reference frames, denoted Q_ref; scoring and selection are performed in a coarse-to-fine manner, i.e. an additional feed-forward layer predicts class logits and the sigmoid score p = Sigmoid[FFN(Q_ref)] is computed; all reference points are sorted by their p values, the queries with the highest k scores are input into a shallow network, and those with lower scores are input into a deeper network;
S7.4: the output queries are updated iteratively.
Further, the specific steps of S8 include:
S8.1: taking the output of S6.4 as the input of the self-attention module;
S8.2: taking the refined temporal object queries of S7.4 and the output of S8.1 as the inputs of the deformable aggregation attention module;
S8.3: taking the output of S8.2 as the input of the feed-forward layer;
S8.4: taking the output of S8.3 as the input of the FFN to realize target detection and identification.
Further, in S8.3, the loss function is

L = λ_{cls} · L_{cls} + λ_{L1} · L_{L1} + λ_{GIoU} · L_{GIoU}

where L_{cls} denotes the focal loss used for classification, L_{L1} and L_{GIoU} denote the L1 loss and the generalized IoU loss used for localization, and λ_{cls}, λ_{L1} and λ_{GIoU} are the corresponding coefficients.
Compared with the prior art, the invention has the following beneficial effects: the disclosed helmet-wearing identification method selects a temporal Transformer model based on multi-convolution-kernel residual modules and, combined with deep learning, provides a method suited to detecting moving individuals in complex electric power operation environments; it realizes automatic identification and tracking of the helmet-wearing condition of workers in the electric power industry during safe operation, reduces the false alarm rate, and improves detection efficiency.
Drawings
FIG. 1 is a flow chart of the helmet-wearing identification method based on the multi-convolution-kernel residual module temporal Transformer model according to the present invention;
FIG. 2 shows the multi-convolution-kernel residual module temporal Transformer model of the method according to the present invention;
FIG. 3 shows the multi-convolution-kernel residual module of the MK-RCN used in the method according to the present invention;
FIG. 4 is an image frame from a video used by the method according to the present invention;
FIG. 5 shows the three convolution kernels used by the method according to the present invention;
FIG. 6 is a perception map with mosaic obtained during convolution in the method according to the present invention;
FIG. 7 shows the target detection result for a single frame of a video in the method according to the present invention;
FIG. 8 shows the tracking result of the method according to the present invention in state 1;
FIG. 9 shows the tracking result of the method according to the present invention in state 2;
FIG. 10 shows the tracking result of the method according to the present invention in state 3.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
The invention provides the following technical scheme. The safety-helmet-wearing identification method based on the multi-convolution-kernel residual module temporal Transformer model mainly comprises four parts: a multi-convolution-kernel residual neural network (MK-RCN) backbone for extracting feature representations, a Transformer encoder-decoder, a temporal Transformer, and a feed-forward network (FFN) that performs the final detection and identification. The temporal Transformer consists of three components: a temporal deformable Transformer encoder (TDTE), a temporal query encoder (TQE) and a temporal deformable Transformer decoder (TDTD).
The MK-RCN structure aims to improve the model's perception sensitivity to targets of varying shape and scale and to strengthen feature reuse, so as to address the low accuracy of small-target identification. The Transformer encoder-decoder encodes each frame (both reference frames and the current frame) into two compact representations: spatial object queries and a memory encoding. The TDTE encodes the spatio-temporal feature representation to provide location cues for the final decoder output. The TQE models the interaction between objects in the current frame and objects in the reference frames, and is used for fused object queries. The TDTD learns the temporal context of different frames to obtain the detection result of the current frame. The FFN structure realizes target detection and identification.
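The end-to-end data flow described above can be summarized in the following PyTorch-style sketch. It is only a wiring diagram with placeholder layers (standard nn.Transformer* layers and a 1×1 convolution stand in for the MK-RCN, TDTE, TQE and TDTD); the tensor shapes, hyper-parameters and module internals are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class PlaceholderPipeline(nn.Module):
    """Wiring sketch only: every sub-module is a stand-in for the real component."""
    def __init__(self, d_model=256, num_queries=300, num_classes=2):
        super().__init__()
        self.backbone = nn.Conv2d(3, d_model, 1)                                 # stands in for MK-RCN
        self.encoder = nn.TransformerEncoderLayer(d_model, 8, batch_first=True)  # spatial encoder
        self.decoder = nn.TransformerDecoderLayer(d_model, 8, batch_first=True)  # spatial decoder
        self.obj_query = nn.Embedding(num_queries, d_model)
        self.tdte = nn.TransformerEncoderLayer(d_model, 8, batch_first=True)     # TDTE stand-in
        self.tqe = nn.TransformerDecoderLayer(d_model, 8, batch_first=True)      # TQE stand-in
        self.tdtd = nn.TransformerDecoderLayer(d_model, 8, batch_first=True)     # TDTD stand-in
        self.class_head = nn.Linear(d_model, num_classes + 1)                    # FFN heads
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, frames):                       # frames: (L, 3, H, W), last frame = current frame
        memories, queries = [], []
        for f in frames:                             # per-frame spatial encoding / decoding
            feat = self.backbone(f.unsqueeze(0))     # (1, C, H, W)
            tokens = feat.flatten(2).transpose(1, 2)                   # (1, H*W, C)
            mem = self.encoder(tokens)                                 # memory encoding
            q = self.decoder(self.obj_query.weight.unsqueeze(0), mem)  # spatial object queries
            memories.append(mem)
            queries.append(q)
        temporal_mem = self.tdte(torch.cat(memories, dim=1))           # TDTE over all frames
        cur_q, ref_q = queries[-1], torch.cat(queries[:-1], dim=1)
        fused_q = self.tqe(cur_q, ref_q)                               # TQE: enhance current-frame queries
        out = self.tdtd(fused_q, temporal_mem)                         # TDTD: temporal context
        return self.class_head(out), self.box_head(out).sigmoid()

frames = torch.randn(4, 3, 16, 16)                   # tiny frames keep the toy run cheap
logits, boxes = PlaceholderPipeline()(frames)
print(logits.shape, boxes.shape)                     # torch.Size([1, 300, 3]) torch.Size([1, 300, 4])
```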
The safety-helmet-wearing identification method based on the multi-convolution-kernel residual module temporal Transformer model specifically comprises the following steps.
the method comprises the following steps that S1, a dynamic video of the power operation is obtained through field monitoring, a power monitoring system is used for centralized management, scheduling, control and data acquisition of a power supply system, and the field monitoring is that an intelligent system used for monitoring and controlling the power production and supply process obtains the dynamic video of the power operation field through a background workstation of the power monitoring system.
Step S2: feature representations are extracted from the video images of power operators using the MK-RCN structure. The specific steps of S2 include:
S2.1: initializing the parameters of the MK-RCN structure, using ResNet-18 as the network backbone and three convolution kernels of sizes 3×3, 3×1 and 1×3 in the residual module, with a learning rate of 10^{-5} and a weight decay of 10^{-4};
S2.2: extracting frames (t−i) to t of the power-operation dynamic video and extracting their features, where the initial image has size 3×H_0×W_0 and the MK-RCN generates a new feature map of size C×H×W.
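As an illustration of the residual module described in S2.1, the following is a minimal sketch of a multi-convolution-kernel residual block with the three kernel sizes 3×3, 3×1 and 1×3. How the three branches are combined and normalized is not specified above, so the element-wise sum and the BatchNorm/ReLU placement are assumptions.

```python
import torch
import torch.nn as nn

class MultiKernelResidualBlock(nn.Module):
    """Three parallel convolutions (3x3, 3x1, 1x3) summed with the identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv3x1 = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
        self.conv1x3 = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        branches = self.conv3x3(x) + self.conv3x1(x) + self.conv1x3(x)  # assumed combination: sum
        return self.relu(x + self.bn(branches))                        # residual shortcut

# toy usage: a clip of frames of size 3 x H0 x W0 mapped to C x H x W features
frames = torch.randn(5, 3, 224, 224)
stem = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)  # ResNet-18-style stem (assumed)
block = MultiKernelResidualBlock(64)
features = block(stem(frames))
print(features.shape)   # torch.Size([5, 64, 112, 112])
```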
Step S3: the features obtained in S2 are supplemented with position coding to obtain position embeddings. The specific steps of S3 include:
S3.1: dividing the feature map obtained in S2.2 into three parts, one of which is used directly as the value vector V, while the other two are added directly to the position-coding vector to serve as the key vector K and the query vector Q respectively, where the position-coding vectors are

PE(pos, 2i) = sin( pos / 10000^{2i/d_model} )      (3-1)
PE(pos, 2i+1) = cos( pos / 10000^{2i/d_model} )    (3-2)

where PE is a two-dimensional matrix, pos denotes the position, and d_model denotes the vector dimension.
From formulas (3-1) and (3-2), the vector PE(pos+k, 2i) can be expressed as a linear function of PE(pos, 2i):

PE(pos+k, 2i) = PE(pos, 2i)·PE(k, 2i+1) + PE(pos, 2i+1)·PE(k, 2i)
PE(pos+k, 2i+1) = PE(pos, 2i+1)·PE(k, 2i+1) − PE(pos, 2i)·PE(k, 2i)
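The position coding of formulas (3-1)/(3-2) and the linear property of PE(pos+k) can be checked with a short sketch, assuming the standard sinusoidal form with a PE matrix of shape (position, d_model):

```python
import torch

def sinusoidal_position_encoding(max_pos: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(max_pos, dtype=torch.float32).unsqueeze(1)   # (max_pos, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimension indices 2i
    div = torch.pow(10000.0, i / d_model)                           # 10000^(2i/d_model)
    pe = torch.zeros(max_pos, d_model)
    pe[:, 0::2] = torch.sin(pos / div)                              # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(pos / div)                              # PE(pos, 2i+1)
    return pe

pe = sinusoidal_position_encoding(max_pos=50, d_model=256)

# PE(pos+k) is a linear combination of PE(pos) and PE(k) (angle-addition identities),
# which lets attention reason about relative positions:
pos, k = 10, 7
lhs = pe[pos + k, 0::2]
rhs = pe[pos, 0::2] * pe[k, 1::2] + pe[pos, 1::2] * pe[k, 0::2]
print(torch.allclose(lhs, rhs, atol=1e-5))   # True
```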
Step S4: the feature map obtained in S2 and the position embedding obtained in S3 are input into the Transformer encoder. The specific steps of S4 include:
S4.1: each Transformer encoder layer consists of a multi-head attention module, an Add & Norm module and a feed-forward module, with 6 layers in total; normalization is applied after the multi-head attention layer and after the feed-forward (Feed-Forward) layer, with an initial learning rate of 2×10^{-4} and a weight decay of 10^{-4};
S4.2: inputting the K, V and Q obtained in S3.1 into the multi-head attention module and outputting a new feature map;
S4.3: adding the new feature map obtained in S4.2 to the original feature map;
S4.4: performing linear dimensionality reduction and ReLU activation;
S4.5: after repeating the above for 6 Transformer encoder layers, encoding ends and the result is output.
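A hedged sketch of one such encoder layer (S4.1-S4.4) follows; the head count, model width and feed-forward size are assumed values, and nn.MultiheadAttention stands in for the multi-head attention module:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=256, nhead=8, dim_ffn=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, dim_ffn), nn.ReLU(),
                                 nn.Linear(dim_ffn, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, src, pos):
        q = k = src + pos                       # S3.1: K and Q carry the position code, V does not
        attn_out, _ = self.attn(q, k, src)      # new feature map (S4.2)
        src = self.norm1(src + attn_out)        # add to the original features, then normalize (S4.3)
        return self.norm2(src + self.ffn(src))  # feed-forward with ReLU + Add & Norm (S4.4)

# usage: a flattened C x H x W feature map as a token sequence, stacked 6 times (S4.5)
tokens = torch.randn(1, 49, 256)                # e.g. a 7x7 feature map
pos = torch.randn(1, 49, 256)                   # position embedding
for layer in nn.ModuleList(EncoderLayer() for _ in range(6)):
    tokens = layer(tokens, pos)
print(tokens.shape)                             # torch.Size([1, 49, 256])
```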
Step S5: the output obtained by object query and the feature map obtained in S4 are input into the Transformer decoder; the Transformer decoder takes the output obtained by object query and the output of the Transformer encoder as inputs, and the input embeddings with learned position coding are converted into output embeddings using a multi-head attention module. The specific steps of S5 include:
S5.1: the inputs of the Transformer decoder comprise the query embedding, the query position and the Transformer encoder output; each layer consists of a multi-head attention module, an Add & Norm module and a feed-forward module, with 6 layers in total; besides the output of the previous layer, the input of each layer includes the query position and the position coding from the Transformer encoder, with an initial learning rate of 2×10^{-4} and a weight decay of 10^{-4};
S5.2: inputting, through the object queries, the query embedding that encodes the anchors, and adding the query embedding to the query position to obtain K and Q, where the number of object queries is set to 300;
S5.3: inputting the K and Q obtained in S5.2 and the output of the object queries into the first multi-head attention module (MultiHeadAttn) to obtain an output, where the multi-head attention module implements

MultiHeadAttn(z_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k∈Ω_k} A_{mqk} · W'_m x_k ]

where m indexes the m-th attention head, W_m and W'_m are learnable weights, and the attention weight A_{mqk} ∝ exp( z_q^T U_m^T V_m x_k / √C_v ) is normalized over k so that Σ_{k∈Ω_k} A_{mqk} = 1, with U_m and V_m also learnable weights;
S5.4: applying Dropout to the output of S5.3, adding the output of the object queries, and outputting the result;
S5.5: adding the output of the object queries to the query position to obtain Q, adding the output of S4.5 to the position-coding vector to obtain K, and inputting the output of S4.5 as V into the second multi-head attention module;
S5.6: performing linear dimensionality reduction and ReLU activation;
S5.7: after passing through 6 Transformer decoder layers, decoding ends and the result is output.
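The decoder layer of S5.2-S5.6 can be sketched in the same style; the 300 object queries, dropout rate and layer widths are assumptions, and the second attention block takes K from the encoder output plus the position code and V from the encoder output, as described above:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=256, nhead=8, dim_ffn=1024, p_drop=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.dropout = nn.Dropout(p_drop)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(d_model, dim_ffn), nn.ReLU(),
                                 nn.Linear(dim_ffn, d_model))

    def forward(self, tgt, query_pos, memory, mem_pos):
        q = k = tgt + query_pos                              # S5.2: query embedding + query position
        out, _ = self.self_attn(q, k, tgt)                   # first MultiHeadAttn (S5.3)
        tgt = self.norm1(tgt + self.dropout(out))            # Dropout + residual add (S5.4)
        out, _ = self.cross_attn(tgt + query_pos,            # Q: decoder output + query position (S5.5)
                                 memory + mem_pos,           # K: encoder output + position code
                                 memory)                     # V: encoder output
        tgt = self.norm2(tgt + out)
        return self.norm3(tgt + self.ffn(tgt))               # feed-forward with ReLU (S5.6)

query_embed = nn.Embedding(300, 256)                         # 300 object queries (S5.2)
tgt = torch.zeros(1, 300, 256)
memory, mem_pos = torch.randn(1, 49, 256), torch.randn(1, 49, 256)
out = DecoderLayer()(tgt, query_embed.weight.unsqueeze(0), memory, mem_pos)
print(out.shape)                                             # torch.Size([1, 300, 256])
```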
Step S6: the feature map of S4 is input into the TDTE, and the spatio-temporal feature representation is encoded. The specific steps of S6 include:
S6.1: a TDTE layer comprises a self-attention module (Self-Attention), a multi-head temporal deformable attention module (TempDefAttn) and a feed-forward layer;
S6.2: taking the output of S4.5 as the input of the self-attention module;
S6.3: taking the output of S6.2 as the input of the multi-head temporal deformable attention module, which implements

TempDefAttn(z_q, p̂_q, {x^l}_{l=1}^{L}) = Σ_{m=1}^{M} W_m [ Σ_{l=1}^{L} Σ_{k=1}^{K} A_{mlqk} · W'_m x^l( φ_l(p̂_q) + Δp_{mlqk} ) ]

where m is the m-th attention head, l is the l-th frame of the same video sample, and k is the k-th sampling point; Δp_{mlqk} and A_{mlqk} denote the sampling offset and attention weight of the k-th sampling point of the m-th attention head in the l-th frame; the scalar attention weight A_{mlqk} lies in [0,1] and is normalized so that Σ_{l=1}^{L} Σ_{k=1}^{K} A_{mlqk} = 1; Δp_{mlqk} ∈ R^2 is a two-dimensional real offset with unconstrained range; x^l(φ_l(p̂_q) + Δp_{mlqk}) is computed by bilinear interpolation; Δp_{mlqk} and A_{mlqk} are obtained by linear projection of the query feature z_q; normalized coordinates p̂_q ∈ [0,1]^2 make the formula scale-independent, with (0,0) and (1,1) denoting the top-left and bottom-right image corners respectively; the function φ_l(p̂_q) rescales the normalized coordinates to the input feature map of the l-th frame; note that multi-frame temporal deformable attention samples LK points from the L feature maps rather than K points from a single-frame feature map;
S6.4: taking the output of S6.3 as the input of the feed-forward layer.
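Since the temporal deformable attention of S6.3 is the least standard component, a simplified sketch is given below: each query predicts per-head, per-frame, per-point offsets Δp_{mlqk} and weights A_{mlqk} by linear projection, the L frame feature maps are sampled with bilinear interpolation (F.grid_sample), and the L·K samples are combined with softmax-normalized weights. Tensor layouts, the pixel-unit offset scaling and the per-head loop are simplifying assumptions; an optimized implementation (e.g. a fused CUDA kernel) would differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDeformableAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_frames=4, n_points=4):
        super().__init__()
        self.h, self.l, self.k = n_heads, n_frames, n_points
        self.d_head = d_model // n_heads
        self.offsets = nn.Linear(d_model, n_heads * n_frames * n_points * 2)   # predicts Δp_mlqk
        self.weights = nn.Linear(d_model, n_heads * n_frames * n_points)       # predicts A_mlqk (pre-softmax)
        self.value_proj = nn.Linear(d_model, d_model)                          # plays the role of W'_m
        self.out_proj = nn.Linear(d_model, d_model)                            # plays the role of W_m

    def forward(self, queries, ref_points, frame_feats):
        # queries: (B, Q, C); ref_points: (B, Q, 2) normalized to [0, 1] as (x, y);
        # frame_feats: (B, L, C, H, W) feature maps of the L frames
        B, Q, C = queries.shape
        H, W = frame_feats.shape[-2:]
        offs = self.offsets(queries).view(B, Q, self.h, self.l, self.k, 2)
        attn = self.weights(queries).view(B, Q, self.h, self.l * self.k)
        attn = attn.softmax(-1).view(B, Q, self.h, self.l, self.k)             # Σ_{l,k} A_mlqk = 1
        vals = self.value_proj(frame_feats.permute(0, 1, 3, 4, 2))             # (B, L, H, W, C)
        vals = vals.permute(0, 1, 4, 2, 3).reshape(B * self.l, self.h, self.d_head, H, W)

        # sampling locations φ_l(p̂_q) + Δp_mlqk, rescaled to grid_sample's [-1, 1] range
        loc = ref_points[:, :, None, None, None, :] + offs / torch.tensor([W, H], dtype=offs.dtype)
        grid = 2.0 * loc - 1.0                                                 # (B, Q, h, L, K, 2)
        grid = grid.permute(0, 3, 2, 1, 4, 5).reshape(B * self.l, self.h, Q * self.k, 2)

        out = []
        for m in range(self.h):                                                # per attention head
            sampled = F.grid_sample(vals[:, m], grid[:, m].unsqueeze(1),
                                    mode='bilinear', align_corners=False)      # (B*L, d_head, 1, Q*K)
            out.append(sampled.view(B, self.l, self.d_head, Q, self.k))
        sampled = torch.stack(out, dim=2)                                      # (B, L, h, d_head, Q, K)
        weighted = (sampled * attn.permute(0, 3, 2, 1, 4).unsqueeze(3)).sum(dim=(1, 5))
        return self.out_proj(weighted.permute(0, 3, 1, 2).reshape(B, Q, C))

attn = TemporalDeformableAttention()
y = attn(torch.randn(2, 300, 256), torch.rand(2, 300, 2), torch.randn(2, 4, 256, 16, 16))
print(y.shape)   # torch.Size([2, 300, 256])
```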
Step S7: the output of S5 is embedded and input into the TQE, and the spatial object queries of all reference frames are acquired to enhance the spatial output query of the current frame. The specific steps of S7 include:
S7.1: a TQE layer comprises a self-attention module, a cross-attention module (Cross-Attention) and a feed-forward layer;
S7.2: taking the output of S5.7 as the input of the self-attention module;
S7.3: taking the output of S7.2 as the input of the cross-attention module, combined with the spatial object queries of all reference frames, denoted Q_ref; scoring and selection are performed in a coarse-to-fine manner, i.e. an additional feed-forward layer predicts class logits and the sigmoid score p = Sigmoid[FFN(Q_ref)] is computed; all reference points are sorted by their p values, the queries with the highest k scores are input into a shallow network, and those with lower scores are input into a deeper network;
S7.4: the output queries are updated iteratively.
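The coarse-to-fine scoring in S7.3 reduces to a top-k selection over sigmoid scores; a small sketch follows, where the score-head size, the number of reference queries and the value of k are assumptions:

```python
import torch
import torch.nn as nn

d_model, num_ref_queries, top_k = 256, 900, 300
score_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                           nn.Linear(d_model, 1))             # the "additional feed-forward layer"

q_ref = torch.randn(1, num_ref_queries, d_model)              # spatial object queries of all reference frames
p = score_head(q_ref).squeeze(-1).sigmoid()                   # p = Sigmoid[FFN(Q_ref)], shape (1, 900)
order = p.argsort(dim=-1, descending=True)                    # sort all reference points by p

high_idx, low_idx = order[:, :top_k], order[:, top_k:]        # highest k -> shallow net, rest -> deeper net
q_high = torch.gather(q_ref, 1, high_idx.unsqueeze(-1).expand(-1, -1, d_model))
q_low = torch.gather(q_ref, 1, low_idx.unsqueeze(-1).expand(-1, -1, d_model))
print(q_high.shape, q_low.shape)   # torch.Size([1, 300, 256]) torch.Size([1, 600, 256])
```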
Step S8: the outputs of S6 and S7 are input into the TDTD to learn the temporal context of different frames. A TDTD layer comprises a self-attention module, a deformable aggregation attention module and a feed-forward layer, and the deformable aggregation attention module implements

DefAttn(z_q, p̂_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k=1}^{K} A_{mqk} · W'_m x( φ(p̂_q) + Δp_{mqk} ) ]

where x denotes the temporally aggregated feature map.
The specific steps of S8 include:
S8.1: taking the output of S6.4 as the input of the self-attention module;
S8.2: taking the refined temporal object queries of S7.4 and the output of S8.1 as the inputs of the deformable aggregation attention module;
S8.3: taking the output of S8.2 as the input of the feed-forward layer, where the loss function is

L = λ_{cls} · L_{cls} + λ_{L1} · L_{L1} + λ_{GIoU} · L_{GIoU}

with L_{cls} denoting the focal loss used for classification, L_{L1} and L_{GIoU} denoting the L1 loss and the generalized IoU loss used for localization, and λ_{cls}, λ_{L1} and λ_{GIoU} their coefficients;
S8.4: taking the output of S8.3 as the input of the FFN to realize target detection and identification.
Step S9: each output embedding of S8 is passed to the FFN for final target detection and identification.
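For the loss of S8.3, a minimal sketch of the weighted combination of focal, L1 and generalized-IoU terms is given below; the coefficient values, the torchvision helper functions and the omission of the bipartite matching between predictions and ground truth are choices made here for illustration, not details taken from the text above.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou

def detection_loss(pred_logits, pred_boxes, tgt_labels, tgt_boxes,
                   lambda_cls=2.0, lambda_l1=5.0, lambda_giou=2.0):
    # pred_logits: (N, num_classes); pred/tgt boxes: (N, 4) in (x1, y1, x2, y2) format
    cls_loss = sigmoid_focal_loss(pred_logits, tgt_labels, reduction='mean')        # focal loss term
    l1_loss = F.l1_loss(pred_boxes, tgt_boxes)                                       # L1 term
    giou_loss = (1.0 - torch.diag(generalized_box_iou(pred_boxes, tgt_boxes))).mean()  # GIoU term
    return lambda_cls * cls_loss + lambda_l1 * l1_loss + lambda_giou * giou_loss

# toy matched pairs: 2 classes (helmet worn / not worn), one-hot targets
pred_logits = torch.randn(4, 2)
tgt_labels = F.one_hot(torch.tensor([0, 1, 0, 1]), num_classes=2).float()
pred_boxes = torch.tensor([[0.10, 0.10, 0.40, 0.50]] * 4)
tgt_boxes = torch.tensor([[0.12, 0.10, 0.42, 0.55]] * 4)
print(detection_loss(pred_logits, pred_boxes, tgt_labels, tgt_boxes))
```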
FIG. 4 shows a frame extracted from a surveillance video recorded during construction in the power industry. The multi-convolution-kernel residual neural network is first used to extract feature representations; its three convolution kernels are shown in FIG. 5. This design improves the model's sensitivity to targets of varying shape and scale and addresses the low accuracy of small-target identification. The convolution result is shown in FIG. 6.
The surveillance video recorded during construction is manually annotated for helmet wearing, and the video data are then randomly split into training frames and test frames using the randperm function. The training frames are used to train the target detector, and the test frames are used to test target identification and individual tracking; on the test set the accuracy reaches 83.7% and the recall reaches 85.5%. The detection result for a single frame of the video is shown in FIG. 7, where the green boxes and the blue boxes indicate workers wearing the safety helmet correctly as required and workers not wearing it as required, respectively.
Accuracy P = (TP + TN)/(TP + FN + FP + TN); recall R = TP/(TP + FN), where TP is the number of frames in which a correctly worn helmet is predicted as worn, FN is the number of frames in which a correctly worn helmet is wrongly predicted as not worn, FP is the number of frames in which a missing helmet is wrongly predicted as worn, and TN is the number of frames in which a missing helmet is correctly predicted as not worn. This test procedure is a common means of validating machine-learning algorithms. Because deep learning algorithms are data-driven, many factors affect the accuracy and recall of the algorithm, and it is difficult to point precisely to which step of the algorithm should be improved.
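A small sketch of the accuracy and recall computation defined above, using illustrative counts only (not the experimental data reported here):

```python
def accuracy_and_recall(tp, fn, fp, tn):
    p = (tp + tn) / (tp + fn + fp + tn)   # accuracy P
    r = tp / (tp + fn)                    # recall R
    return p, r

# hypothetical frame counts, chosen only to exercise the formulas
p, r = accuracy_and_recall(tp=120, fn=20, fp=15, tn=45)
print(f"accuracy = {p:.3f}, recall = {r:.3f}")
```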
Frames (t−i) to t of the power-operation dynamic video are extracted and their features are extracted; the TDTE module encodes the temporal and spatial features, the TQE module enhances the spatial output query of the current frame, and the TDTD module learns the temporal context of different frames to obtain the detection result for the current frame, thereby identifying the wearing state of construction workers at different times. FIGS. 8 to 10 show the recognition results for construction workers in time states 1, 2 and 3, respectively.
Experiments on surveillance video of safe operation in the power industry verify the usability and efficiency of the scheme of the invention.
The above description covers only preferred embodiments of the present invention, and the protection scope of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical solutions and inventive concept of the present invention, shall fall within the protection scope of the present invention.

Claims (10)

1. A safety-helmet-wearing identification method based on a multi-convolution-kernel residual module temporal Transformer model, characterized in that the design mainly comprises four parts: a multi-convolution-kernel residual neural network backbone for extracting feature representations, a Transformer encoder-decoder, a temporal Transformer and a feed-forward network, wherein the temporal Transformer consists of three components, namely a temporal deformable Transformer encoder, a temporal query encoder and a temporal deformable Transformer decoder, and the method mainly comprises the following steps:
S1: acquiring a dynamic video of power operation through on-site monitoring or other camera equipment;
S2: extracting feature representations from the video images of power operators using the multi-convolution-kernel residual neural network backbone structure;
S3: supplementing the features extracted in S2 with position coding to obtain position embeddings, where the position-coding vectors are

PE(pos, 2i) = sin( pos / 10000^{2i/d_model} )      (3-1)
PE(pos, 2i+1) = cos( pos / 10000^{2i/d_model} )    (3-2)
where PE is a two-dimensional matrix, pos denotes the position, and d_model denotes the vector dimension;
S4: passing the feature map obtained in S2 and the position embedding obtained in S3 to the Transformer encoder;
S5: taking the output obtained by object query and the output of the Transformer encoder in S4 as inputs of the Transformer decoder, converting the input embeddings with learned position coding into output embeddings using a multi-head attention module MultiHeadAttn, and passing each output embedding of the Transformer encoder and decoder to the temporal Transformer, where the multi-head attention module implements

MultiHeadAttn(z_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k∈Ω_k} A_{mqk} · W'_m x_k ]

where m indexes the m-th attention head, W_m and W'_m are learnable weights, and the attention weight A_{mqk} ∝ exp( z_q^T U_m^T V_m x_k / √C_v ) is normalized over k so that Σ_{k∈Ω_k} A_{mqk} = 1, with U_m and V_m also learnable weights;
S6: inputting the feature map of S4 into the temporal deformable Transformer encoder and encoding the spatio-temporal feature representation;
S7: inputting the output embeddings of step 5 into the temporal query encoder, and acquiring the spatial object queries of all reference frames to enhance the spatial output query of the current frame;
S8: inputting the outputs of step 6 and step 7 into the temporal deformable Transformer decoder to learn the temporal context of different frames, the temporal deformable Transformer decoder layer comprising a self-attention module, a deformable aggregation attention module and a feed-forward layer, the deformable aggregation attention module implementing

DefAttn(z_q, p̂_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k=1}^{K} A_{mqk} · W'_m x( φ(p̂_q) + Δp_{mqk} ) ]

where x denotes the temporally aggregated feature map;
S9: passing each output embedding of step 8 to the feed-forward network, which performs the final target detection and identification.
2. The safety-helmet-wearing identification method based on the multi-convolution-kernel residual module temporal Transformer model according to claim 1, characterized in that the specific steps of S2 include:
S2.1: initializing the parameters of the multi-convolution-kernel residual neural network structure, using ResNet-18 as the network backbone and three convolution kernels of sizes 3×3, 3×1 and 1×3 in the residual module, with a learning rate of 10^{-5} and a weight decay of 10^{-4};
S2.2: extracting frames (t−i) to t of the power-operation dynamic video and extracting their features, where the initial image has size 3×H_0×W_0 and the multi-convolution-kernel residual neural network generates a new feature map of size C×H×W.
3. The method according to claim 2, characterized in that the specific steps of S3 include:
S3.1: dividing the feature map obtained in S2.2 into three parts, one of which is used directly as the value vector V, while the other two are added directly to the position-coding vector to serve as the key vector K and the query vector Q respectively; according to the position-coding formulas (3-1) and (3-2), the vector PE(pos+k, 2i) can be expressed as a linear function of PE(pos, 2i):

PE(pos+k, 2i) = PE(pos, 2i)·PE(k, 2i+1) + PE(pos, 2i+1)·PE(k, 2i)
PE(pos+k, 2i+1) = PE(pos, 2i+1)·PE(k, 2i+1) − PE(pos, 2i)·PE(k, 2i)
4. The safety-helmet-wearing identification method based on the multi-convolution-kernel residual module temporal Transformer model according to claim 3, characterized in that the specific steps of S4 include:
S4.1: each Transformer encoder layer consists of a multi-head attention module, an Add & Norm module and a feed-forward module, with 6 layers in total; normalization is applied after the multi-head attention layer and after the feed-forward (Feed-Forward) layer, with an initial learning rate of 2×10^{-4} and a weight decay of 10^{-4};
S4.2: inputting the K, V and Q obtained in S3.1 into the multi-head attention module and outputting a new feature map;
S4.3: adding the new feature map obtained in S4.2 to the original feature map;
S4.4: performing linear dimensionality reduction and ReLU activation;
S4.5: after repeating the above for 6 Transformer encoder layers, encoding ends and the result is output.
5. The safety-helmet-wearing identification method based on the multi-convolution-kernel residual module temporal Transformer model according to claim 4, characterized in that the specific steps of S5 include:
S5.1: the inputs of the Transformer decoder comprise the query embedding, the query position and the Transformer encoder output; each layer consists of a multi-head attention module, an Add & Norm module and a feed-forward module, with 6 layers in total; besides the output of the previous layer, the input of each layer includes the query position and the position coding from the Transformer encoder, with an initial learning rate of 2×10^{-4} and a weight decay of 10^{-4};
S5.2: inputting, through the object queries, the query embedding that encodes the anchors, and adding the query embedding to the query position to obtain K and Q, where the number of object queries is set to 300;
S5.3: inputting the K and Q obtained in S5.2 and the output of the object queries into the first multi-head attention module MultiHeadAttn to obtain an output;
S5.4: applying Dropout to the output of S5.3, adding the output of the object queries, and outputting the result;
S5.5: adding the output of the object queries to the query position to obtain Q, adding the output of S4.5 to the position-coding vector to obtain K, and inputting the output of S4.5 as V into the second multi-head attention module;
S5.6: performing linear dimensionality reduction and ReLU activation;
S5.7: after passing through 6 Transformer decoder layers, decoding ends and the result is output.
6. The safety-helmet-wearing identification method based on the multi-convolution-kernel residual module temporal Transformer model according to claim 5, characterized in that the specific steps of S6 include:
S6.1: the TDTE layer comprises a self-attention module (Self-Attention), a multi-head temporal deformable attention module (TempDefAttn) and a feed-forward layer;
S6.2: taking the output of S4.5 as the input of the self-attention module;
S6.3: taking the output of S6.2 as the input of the multi-head temporal deformable attention module;
S6.4: taking the output of S6.3 as the input of the feed-forward layer.
7. The method according to claim 6, characterized in that, in S6.3, the multi-head temporal deformable attention module implements

TempDefAttn(z_q, p̂_q, {x^l}_{l=1}^{L}) = Σ_{m=1}^{M} W_m [ Σ_{l=1}^{L} Σ_{k=1}^{K} A_{mlqk} · W'_m x^l( φ_l(p̂_q) + Δp_{mlqk} ) ]

where m is the m-th attention head, l is the l-th frame of the same video sample, and k is the k-th sampling point; Δp_{mlqk} and A_{mlqk} denote the sampling offset and attention weight of the k-th sampling point of the m-th attention head in the l-th frame; the scalar attention weight A_{mlqk} lies in [0,1] and is normalized so that Σ_{l=1}^{L} Σ_{k=1}^{K} A_{mlqk} = 1; Δp_{mlqk} ∈ R^2 is a two-dimensional real offset with unconstrained range; x^l(φ_l(p̂_q) + Δp_{mlqk}) is computed by bilinear interpolation; Δp_{mlqk} and A_{mlqk} are obtained by linear projection of the query feature z_q; normalized coordinates p̂_q ∈ [0,1]^2 make the formula scale-independent, with (0,0) and (1,1) denoting the top-left and bottom-right image corners respectively; the function φ_l(p̂_q) rescales the normalized coordinates to the input feature map of the l-th frame; multi-frame temporal deformable attention samples LK points from the L feature maps rather than K points from a single-frame feature map.
8. The method according to claim 7, characterized in that the specific steps of S7 include:
S7.1: the TQE layer comprises a self-attention module, a cross-attention module (Cross-Attention) and a feed-forward layer;
S7.2: taking the output of step 5.7 as the input of the self-attention module;
S7.3: taking the output of step 7.2 as the input of the cross-attention module, combined with the spatial object queries of all reference frames, denoted Q_ref; scoring and selection are performed in a coarse-to-fine manner, i.e. an additional feed-forward layer predicts class logits and the sigmoid score p = Sigmoid[FFN(Q_ref)] is computed; all reference points are sorted by their p values, the queries with the highest k scores are input into a shallow network, and those with lower scores are input into a deeper network;
S7.4: the output queries are updated iteratively.
9. The method according to claim 8, characterized in that the specific steps of S8 include:
S8.1: taking the output of S6.4 as the input of the self-attention module;
S8.2: taking the refined temporal object queries of S7.4 and the output of S8.1 as the inputs of the deformable aggregation attention module;
S8.3: taking the output of S8.2 as the input of the feed-forward layer;
S8.4: taking the output of S8.3 as the input of the FFN to realize target detection and identification.
10. The method according to claim 9, characterized in that, in S8.3, the loss function is

L = λ_{cls} · L_{cls} + λ_{L1} · L_{L1} + λ_{GIoU} · L_{GIoU}

where L_{cls} denotes the focal loss used for classification, L_{L1} and L_{GIoU} denote the L1 loss and the generalized IoU loss used for localization, and λ_{cls}, λ_{L1} and λ_{GIoU} are the corresponding coefficients.
CN202211214277.5A 2022-09-28 2022-09-28 Helmet wearing identification method based on multi-convolution kernel residual error module time transformer model Pending CN115471776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211214277.5A CN115471776A (en) 2022-09-28 2022-09-28 Helmet wearing identification method based on multi-convolution kernel residual error module time transformer model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211214277.5A CN115471776A (en) 2022-09-28 2022-09-28 Helmet wearing identification method based on multi-convolution kernel residual error module time transformer model

Publications (1)

Publication Number Publication Date
CN115471776A true CN115471776A (en) 2022-12-13

Family

ID=84334824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211214277.5A Pending CN115471776A (en) 2022-09-28 2022-09-28 Helmet wearing identification method based on multi-convolution kernel residual error module time transformer model

Country Status (1)

Country Link
CN (1) CN115471776A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114782986A (en) * 2022-03-28 2022-07-22 佳源科技股份有限公司 Helmet wearing detection method, device, equipment and medium based on deep learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114782986A (en) * 2022-03-28 2022-07-22 佳源科技股份有限公司 Helmet wearing detection method, device, equipment and medium based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LU HE ET AL.: "End-to-End Video Object Detection with Spatial-Temporal Transformers", 《ARXIV:2105.10920V1》, pages 1 - 10 *

Similar Documents

Publication Publication Date Title
CN112287816B (en) Dangerous work area accident automatic detection and alarm method based on deep learning
CN112016500A (en) Group abnormal behavior identification method and system based on multi-scale time information fusion
CN110472519B (en) Human face in-vivo detection method based on multiple models
CN112560745B (en) Method for discriminating personnel on electric power operation site and related device
CN113516076A (en) Improved lightweight YOLO v4 safety protection detection method based on attention mechanism
CN110349229A (en) A kind of Image Description Methods and device
US20230076017A1 (en) Method for training neural network by using de-identified image and server providing same
CN113312973A (en) Method and system for extracting features of gesture recognition key points
CN115205604A (en) Improved YOLOv 5-based method for detecting wearing of safety protection product in chemical production process
CN114972316A (en) Battery case end surface defect real-time detection method based on improved YOLOv5
CN115471771A (en) Video time sequence action positioning method based on semantic level time sequence correlation modeling
CN115188066A (en) Moving target detection system and method based on cooperative attention and multi-scale fusion
Nayak et al. Video anomaly detection using convolutional spatiotemporal autoencoder
CN117115584A (en) Target detection method, device and server
CN115471776A (en) Helmet wearing identification method based on multi-convolution kernel residual error module time transformer model
CN117076983A (en) Transmission outer line resource identification detection method, device, equipment and storage medium
CN116385962A (en) Personnel monitoring system in corridor based on machine vision and method thereof
CN115937788A (en) Yolov5 industrial area-based safety helmet wearing detection method
Fang et al. Safety Helmet Detection Based on Optimized YOLOv5
CN115439926A (en) Small sample abnormal behavior identification method based on key region and scene depth
CN112487927B (en) Method and system for realizing indoor scene recognition based on object associated attention
CN111144492B (en) Scene map generation method for mobile terminal virtual reality and augmented reality
Li et al. The research of recognition of peep door open state of ethylene cracking furnace based on deep learning
CN113051617A (en) Privacy protection method based on improved generation countermeasure network
CN111767907B (en) Method of multi-source data fire detection system based on GA and VGG network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination