CN117237993B - Method and device for detecting operation site illegal behaviors, storage medium and electronic equipment - Google Patents
- Publication number: CN117237993B (application CN202311490614.8A)
- Authority: CN (China)
- Legal status: Active (an assumption, not a legal conclusion; Google has not performed a legal analysis)
Abstract
The disclosure provides a method, a device, a storage medium and electronic equipment for detecting operation site violations, wherein the method comprises the following steps: S100: collecting video stream data of an operation site; S200: preprocessing the video stream data; S300: constructing a violation detection model and training it; S400: inputting the preprocessed video stream data into the trained violation detection model to detect violations. By introducing into the model a GBCSP module based on a ghost (phantom) bottleneck mechanism and arranging a coordinate attention module at the output of the GBCSP module, the method and device can accurately identify operation violations in video data.
Description
Technical Field
The invention belongs to the technical field of image detection, and particularly relates to a method and a device for detecting illegal behaviors on a working site, a storage medium and electronic equipment.
Background
In the construction of large-scale gas production field operation areas, safety is the foremost and most fundamental requirement for workers. The operation sites are numerous and widely distributed, the field environment is complex, and the operating personnel are many while the staff responsible for supervision is limited, so emergencies cannot be handled rapidly, accurately and comprehensively; only after-the-fact tracing is possible, which cannot meet the requirement of real-time safety monitoring of large-scale gas production field operation areas, and the resulting personal injury and property loss cannot be recovered. Within these construction work areas there are often violations such as falls, phone calls and smoking, which present potential safety risks to operators and equipment. Violation detection systems based on deep learning technology have therefore gradually become an effective solution, but existing behavior recognition algorithms suffer from excessive network parameter counts, which lowers image feature extraction accuracy, and from difficulty in recognizing small-target information.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a method for detecting operation site violations that can extract small-target information from images, thereby improving the recognition accuracy of operation site violations.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a method for detecting operation site illegal behaviors comprises the following steps:
s100: collecting video stream data of an operation site;
s200: preprocessing the video stream data;
s300: constructing an illegal behavior detection model and training;
s400: inputting the preprocessed video stream data into the trained violation detection model to detect violations.
Preferably, in step S200, preprocessing the video stream data comprises the following steps: translating, rotating and cropping the pictures and adding Gaussian noise.
Preferably, in step S300, the violation detection model further comprises a result-output detection head.
Preferably, in step S300, the violation detection model is trained by the following steps:
s301: collecting video data sets containing violations, preprocessing them, labeling the preprocessed video data sets, and dividing the labeled video data sets into a training set and a test set;
s302: setting the model training parameters and inputting the training set into the model for training; model training is complete when, during training, the model accuracy reaches 90% and the model running time is within 2 seconds;
s303: testing the trained model with the test set, evaluating the model during testing with evaluation indexes including the accuracy, recall and F1 value; the test passes when the accuracy, recall and F1 evaluation values each reach 0.9; otherwise, the model parameters are adjusted and the model is trained again.
The invention also provides a device for detecting operation site violations, which comprises:
the acquisition module is used for acquiring video stream data of the operation site;
the preprocessing module is used for preprocessing the video stream data;
the model construction and training module is used for constructing a violation detection model and training it; the violation detection model comprises a backbone network and a feature fusion network, wherein the backbone network introduces a GBCSP module based on a ghost bottleneck mechanism, and a coordinate attention module is arranged at the output end of the GBCSP module; the feature fusion network adopts bidirectional paths to extract higher-level, abstract feature representations;
the detection module is used for inputting the preprocessed video stream data into a trained illegal action detection model so as to realize illegal action detection.
Preferably, the device further comprises an alarm module for alarming when the detection module detects the illegal action.
The present invention also provides an electronic device including:
a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein,
the processor, when executing the program, implements a method as described in any of the preceding.
The invention also provides a computer storage medium storing computer executable instructions for performing a method as described in any one of the preceding claims.
Compared with the prior art, the invention has the beneficial effects that:
1. the invention is highly suitable for the complex background environment of the large-scale gas production station operation area;
2. the feature extraction module, built on the proposed GBCSP module with a coordinate attention mechanism introduced, has a smaller parameter count and an algorithm precision above 90%, and can identify smaller targets more accurately than other algorithms.
Drawings
FIG. 1 is a flow chart of a method for detecting job site violations according to one embodiment of the present invention;
FIG. 2 is a schematic diagram of a model for detecting violations according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a backbone network of an offence detection model according to another embodiment of the present invention;
FIG. 4 is a schematic diagram of a phantom bottleneck structure;
fig. 5 is a schematic structural diagram of a GBCSP module according to another embodiment of the invention;
FIG. 6 is a schematic diagram of a cooperative attention mechanism module CA according to another embodiment of the present invention;
fig. 7 is a schematic structural diagram of a feature fusion network DWBiPFN according to another embodiment of the invention;
fig. 8 is a schematic diagram of the structure of EIOU Loss according to another embodiment of the present invention.
Detailed Description
Specific embodiments of the present invention will be described in detail below with reference to fig. 1 to 8. While specific embodiments of the invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
It should be noted that certain terms are used throughout the description and claims to refer to particular components. Those of skill in the art will understand that a person may refer to the same component by different names. The specification and claims do not identify differences in terms of components, but rather differences in terms of the functionality of the components. As used throughout the specification and claims, the terms "include" and "comprise" are used in an open-ended fashion, and thus should be interpreted to mean "include, but not limited to. The description hereinafter sets forth a preferred embodiment for practicing the invention, but is not intended to limit the scope of the invention, as the description proceeds with reference to the general principles of the description. The scope of the invention is defined by the appended claims.
For the purpose of facilitating an understanding of the embodiments of the present invention, reference will now be made to the drawings, by way of example, and specific examples of which are illustrated in the accompanying drawings.
In one embodiment, as shown in fig. 1, the present invention provides a method for detecting a violation on a job site, including the following steps:
s100: collecting video stream data of an operation site;
s200: preprocessing the video stream data;
s300: constructing an illegal behavior detection model and training;
s400: inputting the preprocessed video stream data into the trained violation detection model to detect violations.
In another embodiment, in step S200, preprocessing the video stream data comprises the following steps: translating, rotating and cropping the pictures and adding Gaussian noise.
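The four preprocessing operations can be sketched as follows; offsets, angles and the noise level are illustrative assumptions (the patent does not specify parameter values), and only NumPy is used:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, max_shift=20, noise_std=8.0):
    """Illustrative preprocessing for step S200: translate, rotate, crop and
    add Gaussian noise to a square frame. Parameter values are assumptions,
    not from the patent. img: (H, W, 3) uint8 array with H == W."""
    h, w, _ = img.shape
    # translate by a random offset (edges wrap for simplicity)
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(img, (int(dy), int(dx)), axis=(0, 1))
    # rotate by a random multiple of 90 degrees (shape-preserving on squares)
    out = np.rot90(out, k=int(rng.integers(0, 4)))
    # crop a centered region
    out = out[max_shift:h - max_shift, max_shift:w - max_shift]
    # add Gaussian noise and clip back to valid pixel range
    noisy = out.astype(np.float32) + rng.normal(0, noise_std, out.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

frame = np.full((640, 640, 3), 128, dtype=np.uint8)
print(augment(frame).shape)  # (600, 600, 3)
```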
In another embodiment, as shown in fig. 2, the violation detection model includes: a backbone network (Backbone), a feature fusion network (Neck), and a result-output detection head.
In this embodiment, as shown in fig. 3, the backbone network includes, connected in order:
an Inputs layer (640,640,3) (the three values correspond to width, height and number of channels, respectively);
a Focus layer (320,320,12);
Conv2D_BU_SiLU layer (320,320,64);
GBCSP layer (160,160,128);
Conv2D_BU_SiLU layer (80,80,256);
GBCSP layer (80,80,256);
Conv2D_BU_SiLU layer (40,40,512);
GBCSP layer (40,40,512);
Conv2D_BU_SiLU layer (20,20,1024);
SPPBottleneck layer (20,20,1024);
GBCSP layer (20,20,1024).
This embodiment creatively introduces into the backbone network a GBCSP module based on a ghost bottleneck mechanism. The structure of the ghost bottleneck is shown in fig. 4: it consists of two consecutive Ghost modules; the first Ghost module is followed by a batch normalization (BN) layer and a ReLU function, and the second Ghost module is followed by a BN layer. The first Ghost module processes the feature map with channel separation and expansion operations; the processed feature map, after the BN layer and ReLU function, serves as the input of the second Ghost module, which matches and reduces the channels; a shortcut connection superimposes (Add) the input and the output, realizing information transfer between the two Ghost modules.
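The appeal of the Ghost module is its parameter saving. As a hedged illustration (the sizing follows the GhostNet paper's defaults, not figures from this patent), the following compares the parameter count of a standard convolution with that of a Ghost module producing the same number of output channels:

```python
def conv_params(c_in, c_out, k):
    # parameters of a standard k x k convolution (bias omitted)
    return c_in * c_out * k * k

def ghost_params(c_in, c_out, k=1, d=3, s=2):
    """Parameter count of a Ghost module: a primary k x k convolution
    producing c_out / s channels, then cheap d x d depthwise operations
    generating the remaining (s - 1) maps per primary channel. The defaults
    (k=1, d=3, s=2) are the GhostNet paper's, used here as an assumption."""
    primary = c_in * (c_out // s) * k * k
    cheap = (c_out // s) * (s - 1) * d * d
    return primary + cheap

c_in, c_out = 256, 256
std = conv_params(c_in, c_out, 1)
ghost = ghost_params(c_in, c_out)
print(std, ghost, round(std / ghost, 1))  # 65536 33920 1.9
```

With s = 2, roughly half the channels come from cheap depthwise operations, which is where the "smaller parameter quantity" claimed for the model comes from.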
Specifically, as shown in fig. 5, the GBCSP module includes two branches. The first branch comprises a CBS layer (composed of a convolution operation Conv, batch normalization BN and a SiLU activation function), a Ghost Bottleneck layer and a Conv layer, and performs the ghost bottleneck stacking operation. The second branch comprises only a Conv layer; its output is directly concatenated with the result of the first branch through a Concat function without further calculation, after which channel integration is performed through a batch normalization BN layer, a Leaky ReLU function and a CBS layer in sequence. By cross-level concatenation and channel integration of the two branches, the model can benefit from features of different levels and different sources at the same time, which enhances its learning ability. The ghost bottleneck structure allows the model to communicate and interact at different levels, so it can better capture features at different scales and levels of abstraction. This multi-level, multi-scale feature representation improves the learning and representation capacity of the model, yielding better performance on more complex tasks and, in particular, better extraction of image features including color, texture and shape features.
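The two-branch data flow described above can be sketched at the shape level. In the following NumPy sketch, 1×1 convolutions with random weights stand in for all convolution operations and the two Ghost modules are simplified placeholders, so only the branch structure (CBS → Ghost Bottleneck → Conv, concatenated with a plain Conv branch, then BN → Leaky ReLU → CBS) follows the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, c_out):
    # hypothetical 1x1 convolution with random weights; x: (C, H, W)
    w = rng.standard_normal((c_out, x.shape[0])) * 0.01
    return np.einsum('oc,chw->ohw', w, x)

def bn(x):
    # simplified batch normalization over the spatial dimensions
    mu = x.mean(axis=(1, 2), keepdims=True)
    sd = x.std(axis=(1, 2), keepdims=True) + 1e-5
    return (x - mu) / sd

def leaky_relu(x, a=0.1):
    return np.where(x > 0, x, a * x)

def silu(x):
    return x / (1 + np.exp(-x))

def cbs(x, c_out):
    # CBS = Conv + BN + SiLU, as defined in the text
    return silu(bn(conv1x1(x, c_out)))

def ghost_bottleneck(x):
    # placeholder for two Ghost modules with an additive (Add) shortcut
    c = x.shape[0]
    y = leaky_relu(bn(conv1x1(x, c)))   # stands in for Ghost module 1 + BN + ReLU
    y = bn(conv1x1(y, c))               # stands in for Ghost module 2 + BN
    return x + y

def gbcsp(x, c_out):
    half = c_out // 2
    b1 = conv1x1(ghost_bottleneck(cbs(x, half)), half)  # branch 1: CBS -> Ghost Bottleneck -> Conv
    b2 = conv1x1(x, half)                               # branch 2: Conv only
    y = np.concatenate([b1, b2], axis=0)                # Concat across channels
    return cbs(leaky_relu(bn(y)), c_out)                # BN -> Leaky ReLU -> CBS

x = rng.standard_normal((64, 8, 8))
print(gbcsp(x, 128).shape)  # (128, 8, 8)
```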
In addition, in order to better extract small-target information (such as smoking), this embodiment further introduces a coordinate attention module (Coordinate Attention, CA) at the output end of each GBCSP module shown in fig. 3. Compared with the CBAM and SE attention mechanisms, the CA module used in the present invention considers not only the relationship between channels and spatial positions but also the importance of pixel location information, so small-target information can be extracted better.
Specifically, as shown in fig. 6, the coordinate attention module CA includes an input layer after residual connection (Residual), which receives input features of size C×H×W in number of channels, height and width. The input layer is connected to parallel horizontal and vertical global average pooling layers, X Avg Pool and Y Avg Pool, whose channel compression ratio is r. The horizontal and vertical global average pooling results are directly concatenated by a Concat function; features are extracted by a Conv2d layer, batch-normalized by a BN layer and passed through a Non-linear activation layer; finally, the weights of important features (Re-weight) are output through parallel modules each composed of a Conv2d layer and a sigmoid function and fused with the C×H×W input. The CA module decomposes the global pooling so as to take the pixel location information in the image into account, encoding all channels with two one-dimensional pooling kernels, of size (H, 1) in the horizontal direction and (1, W) in the vertical direction, where h takes values from 1 to H and w from 1 to W. For each channel c from 1 to C, the outputs of the c-th channel at height h and width w, $z_c^h(h)$ and $z_c^w(w)$, are expressed as:

$$z_c^h(h) = \frac{1}{W}\sum_{i=1}^{W} x_c(h, i), \qquad z_c^w(w) = \frac{1}{H}\sum_{j=1}^{H} y_c(j, w)$$

where uppercase H and W denote the sizes of the pooling kernels; $z_c^h(h)$ and $z_c^w(w)$ denote the outputs of the c-th channel at height h and width w, respectively; $x_c(h, i)$ denotes the input of the global average pooling layer X Avg Pool at the c-th channel and coordinates (h, i); and $y_c(j, w)$ denotes the input of the global average pooling layer Y Avg Pool at the c-th channel and coordinates (j, w). The CA module then uses a 1×1 convolution transform function $F_1$ to generate an intermediate feature map $f$ that encodes the spatial information in the horizontal and vertical directions:

$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)$$

where $[\cdot,\cdot]$ denotes the spatial concatenation operation and $\delta$ is the activation function. $f$ is then split along the spatial dimension into two tensors $f^h$ and $f^w$, and two 1×1 convolution transform functions $F_h$ and $F_w$ transform $f^h$ and $f^w$ so that their channel numbers match the input, as shown in the following formulas:

$$g^h = \sigma\left(F_h\left(f^h\right)\right), \qquad g^w = \sigma\left(F_w\left(f^w\right)\right)$$

where $g^h$ denotes the weight map obtained by applying the convolution transform function $F_h$ to $f^h$ and then the mapping transformation $\sigma$, and $g^w$ denotes the weight map obtained by applying $F_w$ to $f^w$ and then $\sigma$. Finally, $g^h$ and $g^w$ are expanded and used as the attention weights. For the c-th channel, the output $y_c(i, j)$ of the coordinate attention module CA is expressed as:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$

where $x_c(i, j)$ denotes the input at coordinates (i, j) of the c-th channel, $g_c^h(i)$ denotes the value of $g^h$ for the c-th channel at row i, and $g_c^w(j)$ denotes the value of $g^w$ for the c-th channel at column j.
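The coordinate attention computation can be sketched numerically. The following is a minimal NumPy forward pass, assuming random placeholder weights for the 1×1 convolutions $F_1$, $F_h$, $F_w$ and ReLU for $\delta$ (the trained weights and the exact activation are not given in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def coordinate_attention(x, r=8):
    """Sketch of the CA module forward pass with random 1x1-conv weights.

    x: feature map of shape (C, H, W); r: channel compression ratio.
    Only the shapes and data flow follow the description in the text."""
    C, H, W = x.shape
    mid = max(C // r, 1)
    # horizontal / vertical global average pooling
    zh = x.mean(axis=2)                       # X Avg Pool: (C, H)
    zw = x.mean(axis=1)                       # Y Avg Pool: (C, W)
    # concat along the spatial dimension, then shared 1x1 conv F1 + delta
    f = np.concatenate([zh, zw], axis=1)      # (C, H + W)
    W1 = rng.standard_normal((mid, C)) * 0.01
    f = np.maximum(W1 @ f, 0)                 # delta assumed ReLU, (mid, H + W)
    fh, fw = f[:, :H], f[:, H:]               # split back into f^h, f^w
    # per-direction 1x1 convs F_h, F_w restore C channels; sigma = sigmoid
    Wh = rng.standard_normal((C, mid)) * 0.01
    Ww = rng.standard_normal((C, mid)) * 0.01
    gh = 1 / (1 + np.exp(-(Wh @ fh)))         # (C, H)
    gw = 1 / (1 + np.exp(-(Ww @ fw)))         # (C, W)
    # y_c(i, j) = x_c(i, j) * g^h_c(i) * g^w_c(j)
    return x * gh[:, :, None] * gw[:, None, :]

x = rng.standard_normal((64, 20, 20))
print(coordinate_attention(x).shape)  # (64, 20, 20)
```

Note how the per-row weights $g^h$ and per-column weights $g^w$ are broadcast over the feature map, which is how the module preserves pixel position information.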
In summary, the backbone network, with the GBCSP module and the coordinate attention module CA introduced, takes the pixel position information of the image into account, so small-target information in the image can be extracted better, further enhancing the model's feature extraction accuracy on the input image.
Further, this embodiment proposes the DWBiFPN network (De-Weighted Bidirectional FPN) shown in fig. 7, based on the bidirectional feature pyramid network, as the feature fusion network (Neck).
Specifically, the feature fusion network includes three branches, the first branch including a Conv2D (80,80,255) layer and a Concat-Conv2D (80,80,255) layer; the second branch includes Conv2D (40,40,512) layer, concat-Conv2D (40,40,512) layer, and Concat-Conv2D (40,40,512) layer; the third branch includes a Conv2D (20,20,1024) layer and a Concat-Conv2D (20,20,1024) layer.
The feature fusion network treats each bidirectional path as a feature network layer (the first bidirectional path comprises the up-sampling and down-sampling paths between the first and second branches; the second, those between the second and third branches) and repeats the same layer multiple times to enable higher-level feature fusion. By repeating the same layer and fusing features within each repeated layer, the network can utilize and fuse features at different levels to progressively extract higher-level, more abstract feature representations. This structure fuses deep and shallow semantic features into a feature pyramid with multi-scale information and strengthens the learning ability of the model, so that the model better understands and represents the input data and performs better on complex tasks. The weighting part of the bidirectional feature pyramid network is removed, which reduces the computation and parameter count of the network.
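The de-weighted bidirectional fusion can be illustrated as follows. This sketch replaces BiFPN's learned fusion weights with plain addition, using nearest-neighbour up-sampling and stride-2 down-sampling, and assumes equal channel counts across levels (the patent does not specify these details):

```python
import numpy as np

def upsample2x(x):
    # nearest-neighbour up-sampling of a (C, H, W) map
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(x):
    # stride-2 subsampling
    return x[:, ::2, ::2]

def bidirectional_fuse(p_small, p_mid, p_large):
    """Weight-free bidirectional fusion across three pyramid levels.

    A sketch of the de-weighted idea: the learned fusion weights of BiFPN
    are replaced by plain addition."""
    # top-down (up-sampling) path
    mid_td = p_mid + upsample2x(p_small)
    large_td = p_large + upsample2x(mid_td)
    # bottom-up (down-sampling) path
    mid_out = mid_td + downsample2x(large_td)
    small_out = p_small + downsample2x(mid_out)
    return small_out, mid_out, large_td

p20 = np.ones((8, 20, 20)); p40 = np.ones((8, 40, 40)); p80 = np.ones((8, 80, 80))
outs = bidirectional_fuse(p20, p40, p80)
print([o.shape for o in outs])  # [(8, 20, 20), (8, 40, 40), (8, 80, 80)]
```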
Further, the result-output detection Head adopts coupled outputs and produces three different Heads for multi-scale detection, at 80×80, 40×40 and 20×20. Each Head comprises two branches: one outputs a human-body detection box and the other outputs a pose-estimation result. The human-body detection box uses EIoU Loss as the regression loss function of the bounding box (the calculation principle is shown in fig. 8). In fig. 8, EIoU introduces the center-distance and width-height losses: the Euclidean distance between the center points of the target boxes is minimized relative to the minimum enclosing rectangle, and the width and height of the target box are regressed separately. The EIoU loss function $L_{\alpha EIoU}$ consists of three parts, the overlap loss $L_{IoU}$, the center-distance loss $L_{dis}$ and the width-height loss $L_{asp}$, calculated as follows:

$$L_{EIoU} = L_{IoU} + L_{dis} + L_{asp} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \frac{\rho^2\left(w, w^{gt}\right)}{c_w^2} + \frac{\rho^2\left(h, h^{gt}\right)}{c_h^2}$$

where $c_w$ and $c_h$ denote the width and height of the smallest enclosing rectangle of the prediction box and the ground-truth box, respectively; $c$ denotes the diagonal length of that rectangle; $b$ and $b^{gt}$ denote the center points of the prediction box and the ground-truth box; $\rho(\cdot,\cdot)$ denotes the Euclidean distance; $w$ and $w^{gt}$ denote the widths, and $h$ and $h^{gt}$ the heights, of the prediction box and the ground-truth box; $\alpha$ denotes a given hyper-parameter; and $IoU$ denotes the intersection-over-union.
This calculation not only considers the overlap area of the bounding-box regression but also the distance between the box center points, and splits the aspect-ratio loss into the sum of the width and height losses of the prediction and ground-truth boxes, so the model converges faster.
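The EIoU computation described above can be written out directly. The following is a sketch of the three-part loss for a single pair of axis-aligned boxes (the $\alpha$ hyper-parameter variant is omitted, since its exact form is not given in the text):

```python
def eiou_loss(pred, gt):
    """EIoU loss for axis-aligned boxes given as (x1, y1, x2, y2).

    Follows the three-part formula L_IoU + L_dis + L_asp from the text;
    assumes the boxes overlap or at least have a positive union area."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1
    # overlap loss L_IoU = 1 - IoU
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = pw * ph + gw * gh - inter
    iou = inter / union
    # smallest enclosing rectangle: width c_w, height c_h, squared diagonal c^2
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw * cw + ch * ch
    # center-distance loss L_dis: squared distance between box centers over c^2
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    # width-height loss L_asp: width and height regressed separately
    return (1 - iou) + rho2 / c2 + (pw - gw) ** 2 / cw ** 2 + (ph - gh) ** 2 / ch ** 2

print(eiou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # 0.0 for a perfect match
```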
In another embodiment, in step S300, the violation detection model is trained by the following steps:
s301: collecting video data sets containing violations (falling, phone-calling and smoking are defined as the operation site violations), preprocessing them, labeling the preprocessed video data with target boxes (i.e., labeling the falling, calling and smoking behaviors in the video data), and dividing the labeled video data sets into a training set and a test set;
s302: setting the model training parameters, for example a batch size of 8 and a learning rate of 0.0005, and training the model with the training set; model training is complete when, during training, the model accuracy reaches 90% or more and the model running time is within 2 seconds;
s303: testing the trained model with the test set, evaluating the model during testing with evaluation indexes including the accuracy, recall and F1 value; the test passes when the accuracy, recall and F1 evaluation values each reach 0.9; otherwise, the model parameters are adjusted and the model is trained again, for example adjusting the batch size from 8 to 16, the learning rate from 0.0005 to 0.00005, or the optimizer used by the model, until the model test passes.
In another embodiment, the present invention further provides a device for detecting a violation on a job site, including:
the acquisition module is used for acquiring video stream data of the operation site;
the preprocessing module is used for preprocessing the video stream data;
the model construction and training module is used for constructing a violation detection model and training it; the violation detection model comprises a backbone network and a feature fusion network, wherein the backbone network introduces a GBCSP module based on a ghost bottleneck mechanism, and a coordinate attention module is arranged at the output end of the GBCSP module; the feature fusion network adopts bidirectional paths to extract higher-level, abstract feature representations;
the detection module is used for inputting the preprocessed video stream data into a trained illegal action detection model so as to realize illegal action detection.
In another embodiment, the device further comprises an alarm module for alarming when the detection module detects the illegal action.
In another embodiment, the present invention further provides an electronic device, including:
a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein,
the processor, when executing the program, implements a method as described in any of the preceding.
In another embodiment, the present invention also provides a computer storage medium storing computer-executable instructions for performing a method as described in any one of the preceding claims.
It will be appreciated by persons skilled in the art that the above embodiments are provided for clarity of illustration only and are not intended to limit the scope of the invention. Other variations or modifications will be apparent to persons skilled in the art from the foregoing disclosure, and such variations or modifications are intended to be within the scope of the present invention.
Claims (5)
1. A method for detecting operation site violations, the method comprising the steps of:
s100: collecting video stream data of an operation site;
s200: preprocessing the video stream data;
preprocessing the video stream data comprises the following steps: translating, rotating, shearing and adding Gaussian noise to the picture;
s300: constructing an illegal behavior detection model and training;
the violation detection model comprises a backbone network and a feature fusion network, wherein the backbone network introduces a GBCSP module based on a ghost bottleneck mechanism, and a coordinate attention module is arranged at the output end of the GBCSP module; the feature fusion network adopts bidirectional paths to extract higher-level, abstract feature representations;
specifically, the backbone network includes sequentially connected:
an Inputs layer 640,640,3;
focus layer 320,320,12;
Conv2D_BU_SiLU layer 320,320,64;
GBCSP layer 160,160,128;
Conv2D_BU_SiLU layer 80,80,256;
GBCSP layer 80,80,256;
Conv2D_BU_SiLU layer 40,40,512;
GBCSP layer 40,40,512;
Conv2D_BU_SiLU layer 20,20,1024;
SPPBottleneck layer 20,20,1024;
GBCSP layer 20,20,1024;
the GBCSP layer comprises two branches, wherein the first branch comprises a CBS layer, a Ghost Bottleneck layer and a Conv layer, the CBS layer is composed of a convolution operation Conv, batch normalization BN and a SiLU activation function, and this branch performs a ghost bottleneck stacking operation; the second branch comprises a Conv layer, and this branch is directly spliced with the result calculated by the first branch through a Concat function without calculation, after which channel integration is carried out through a batch normalization BN layer, a Leaky ReLU function and a CBS layer in sequence;
the output end of the GBCSP layer introduces a coordinate attention module; the coordinate attention module comprises an input layer after residual connection, which receives input features of size C×H×W in number of channels, height and width; the input layer is connected with parallel horizontal and vertical global average pooling layers X Avg Pool and Y Avg Pool, the channel compression ratio of the global average pooling layers being r; the horizontal and vertical global average pooling results are directly spliced through a Concat function, features are extracted through a Conv2d layer, and the weights of important features are output through modules each composed of a Conv2d layer and a Non-linear activation layer and fused with the C×H×W input;
the feature fusion network comprises three branches, wherein the first branch comprises a Conv2D layer and a Concat-Conv2D layer; the second branch comprises a Conv2D layer, a Concat-Conv2D layer and a Concat-Conv2D layer; the third branch comprises Conv2D layer and Concat-Conv2D layer;
the feature fusion network regards as a feature network layer the first bidirectional path, formed by the up-sampling and down-sampling paths between the first branch and the second branch, and the second bidirectional path, formed by the up-sampling and down-sampling paths between the second branch and the third branch, and repeats the same layer a plurality of times to enable higher-level feature fusion; by repeating the same layer a plurality of times and performing feature fusion in each repeated layer, the network can utilize and fuse features at different levels to gradually extract higher-level, abstract feature representations;
the illegal behavior detection model further comprises an output detection Head, wherein the output detection Head adopts coupled output to output three different heads for multi-scale detection, namely 80×80, 40×40 and 20×20, each Head comprises two branches, one branch outputs a human body detection frame, and the other branch outputs a gesture estimation result;
the offence detection model is trained by the following steps:
S301: collecting a video data set containing illegal behaviors, preprocessing it, annotating the preprocessed video data set, and dividing the annotated video data set into a training set and a test set;
S302: setting the model training parameters and inputting the training set into the model for training; during training, model training is finished when the model accuracy reaches 90% and the model running time is within 2 seconds;
S303: testing the trained model with the test set, and evaluating the model during testing with evaluation indexes including the accuracy rate, the recall rate and the F1 value; the test is passed when the accuracy evaluation value, the recall evaluation value and the F1 evaluation value each reach 0.9; otherwise, the model parameters are adjusted and the model is trained again;
S400: and inputting the preprocessed video stream data into the trained offence detection model to realize offence detection.
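The acceptance criterion of step S303 can be sketched as below. This is an illustrative reading, assuming the claim's "accuracy rate" is the precision of the detector's predictions (since it is paired with recall and F1); the function names are hypothetical.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute the three evaluation indexes from true positives, false
    positives and false negatives on the test set."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def passes_test(tp: int, fp: int, fn: int, threshold: float = 0.9) -> bool:
    """The test is passed only when all three evaluation values reach 0.9;
    otherwise the model parameters are adjusted and training repeats."""
    p, r, f1 = precision_recall_f1(tp, fp, fn)
    return p >= threshold and r >= threshold and f1 >= threshold
```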
2. A job site violation detection device for implementing the method of claim 1, the device comprising:
the acquisition module is used for acquiring video stream data of the operation site;
the preprocessing module is used for preprocessing the video stream data;
the model construction and training module is used for constructing the offence detection model and training it; the offence detection model comprises a backbone network and a feature fusion network, wherein the backbone network introduces a GBCSP module based on a phantom bottleneck detection mechanism, and a collaborative attention mechanism module is arranged at the output end of the GBCSP module; the feature fusion network adopts bidirectional paths to extract higher-level, abstract feature representations;
the detection module is used for inputting the preprocessed video stream data into a trained illegal action detection model so as to realize illegal action detection.
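The phantom-bottleneck mechanism behind the GBCSP module can be sketched as follows. This is a hedged reconstruction based on the GhostNet idea cited in the non-patent literature ("ghost" features from cheap operations); the class names, the half/half channel split and the depthwise kernel size are assumptions, not details fixed by the claims.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost ("phantom") convolution sketch: half the output channels come
    from an ordinary convolution, the other half from a cheap depthwise
    operation applied to those primary features."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        mid = out_ch // 2
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU())
        self.cheap = nn.Sequential(
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),  # depthwise
            nn.BatchNorm2d(mid), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

class GhostBottleneck(nn.Module):
    """Two ghost convolutions with a residual connection, the building block
    a GBCSP-style module would stack inside its CSP split."""
    def __init__(self, ch: int):
        super().__init__()
        self.g1 = GhostConv(ch, ch)
        self.g2 = GhostConv(ch, ch)

    def forward(self, x):
        return x + self.g2(self.g1(x))
```

The appeal of the ghost design here is compute: the depthwise "cheap" branch generates half the feature maps at a fraction of the cost of a full convolution, which matches the claim's emphasis on running speed.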
3. The apparatus of claim 2, further comprising an alarm module for alerting when the detection module detects a violation.
4. An electronic device, comprising:
a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein,
the processor, when executing the program, implements the method of claim 1.
5. A computer storage medium storing computer executable instructions for performing the method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311490614.8A CN117237993B (en) | 2023-11-10 | 2023-11-10 | Method and device for detecting operation site illegal behaviors, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117237993A CN117237993A (en) | 2023-12-15 |
CN117237993B true CN117237993B (en) | 2024-01-26 |
Family
ID=89098525
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311490614.8A Active CN117237993B (en) | 2023-11-10 | 2023-11-10 | Method and device for detecting operation site illegal behaviors, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117237993B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022037280A1 (en) * | 2020-08-19 | 2022-02-24 | 广西电网有限责任公司贺州供电局 | Multi-channel cnn based method for detecting power transformation field operation violations |
CN114782892A (en) * | 2022-04-20 | 2022-07-22 | 上海东普信息科技有限公司 | Illegal behavior target detection method, device, equipment and storage medium |
CN114913606A (en) * | 2022-06-17 | 2022-08-16 | 中国计量大学 | YOLO-based violation detection method for deep learning industrial field production work area |
CN115015911A (en) * | 2022-08-03 | 2022-09-06 | 深圳安德空间技术有限公司 | Method and system for manufacturing and using navigation map based on radar image |
CN115471721A (en) * | 2022-09-07 | 2022-12-13 | 深圳大学 | Image target detection method, system, electronic device and storage medium |
CN115830392A (en) * | 2022-12-20 | 2023-03-21 | 北方民族大学 | Student behavior identification method based on improved YOLOv5 |
CN116863534A (en) * | 2023-05-29 | 2023-10-10 | 贵州电网有限责任公司 | High-precision illegal behavior detection method and device for transformer substation scene |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB0905317D0 (en) * | 2008-07-14 | 2009-05-13 | Musion Ip Ltd | Video processing and telepresence system and method |
Non-Patent Citations (6)
Title |
---|
Foxtail Millet Ear Detection Method Based on Attention Mechanism and Improved YOLOv5; Shujin Qiu et al.; Sensors; 1-14 *
Ghost shuffle lightweight pose network with effective feature representation and learning for human pose estimation; Senquan Yang et al.; IET Computer Vision; 525-540 *
GhostNet: More Features from Cheap Operations; Kai Han et al.; arXiv; 1-10 *
Research on an Apple Flower Recognition and Detection Method Based on Deep Learning; Chen Guofang; China Master's Theses Full-text Database, Agricultural Science and Technology (No. 1); D048-352, section 4.4.1 *
Implementation of a Distributed Video Violation Behavior Detection System Based on Deep Neural Networks; Li Kejiang; China Master's Theses Full-text Database, Information Science and Technology (No. 7); I136-675 *
Photovoltaic Land Detection Using an Attention Mechanism and Improved YOLOv5; Chen Di et al.; Remote Sensing for Natural Resources; 1-6, section 1.2 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110728209B (en) | Gesture recognition method and device, electronic equipment and storage medium | |
CN109657592B (en) | Face recognition method of intelligent excavator | |
CN109948453B (en) | Multi-person attitude estimation method based on convolutional neural network | |
CN109858367B (en) | Visual automatic detection method and system for worker through supporting unsafe behaviors | |
CN110263603B (en) | Face recognition method and device based on central loss and residual error visual simulation network | |
CN109598211A (en) | A kind of real-time dynamic human face recognition methods and system | |
CN112560745B (en) | Method for discriminating personnel on electric power operation site and related device | |
CN116152863B (en) | Personnel information identification method and device, electronic equipment and storage medium | |
CN111062373A (en) | Hoisting process danger identification method and system based on deep learning | |
CN114998830A (en) | Wearing detection method and system for safety helmet of transformer substation personnel | |
CN111738074B (en) | Pedestrian attribute identification method, system and device based on weak supervision learning | |
CN114565891A (en) | Smoke and fire monitoring method and system based on graph generation technology | |
CN114170686A (en) | Elbow bending behavior detection method based on human body key points | |
CN117237993B (en) | Method and device for detecting operation site illegal behaviors, storage medium and electronic equipment | |
CN117152844A (en) | High-integrity worker construction attitude detection method and system based on computer vision | |
CN115187660A (en) | Knowledge distillation-based multi-person human body posture estimation method and system | |
CN114399729A (en) | Monitoring object movement identification method, system, terminal and storage medium | |
Gong et al. | Human elbow flexion behaviour recognition based on posture estimation in complex scenes | |
CN115937506B (en) | Method, system, equipment and medium for locating bridge side falling point hole site information | |
CN117496131B (en) | Electric power operation site safety behavior identification method and system | |
LU501401B1 (en) | Action risk identification method for power system field operation based on graph convolution | |
Lin et al. | An integrated application platform for safety management and control of smart power plants based on video image technology | |
CN112712126B (en) | Picture identification method | |
Rahman et al. | Outdoor social distancing violation system detection using YOLO algorithm | |
CN115359537A (en) | Face key point detection method, face key point detection device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||