CN115331172A - Workshop dangerous behavior recognition alarm method and system based on monitoring video - Google Patents
- Publication number
- CN115331172A CN115331172A CN202210993747.6A CN202210993747A CN115331172A CN 115331172 A CN115331172 A CN 115331172A CN 202210993747 A CN202210993747 A CN 202210993747A CN 115331172 A CN115331172 A CN 115331172A
- Authority
- CN
- China
- Prior art keywords
- dangerous behavior
- dangerous
- workshop
- behavior recognition
- channel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/762—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
- G06V10/763—Non-hierarchical techniques, e.g. based on statistics of modelling distributions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G08—SIGNALLING
- G08B—SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
- G08B21/00—Alarms responsive to a single specified undesired or abnormal condition and not otherwise provided for
- G08B21/18—Status alarms
-
- G—PHYSICS
- G08—SIGNALLING
- G08B—SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
- G08B21/00—Alarms responsive to a single specified undesired or abnormal condition and not otherwise provided for
- G08B21/18—Status alarms
- G08B21/24—Reminder alarms, e.g. anti-loss alarms
Abstract
The invention provides a workshop dangerous behavior recognition and alarm method and system based on monitoring video, comprising the following steps: collecting a dangerous behavior image dataset; constructing a plurality of dangerous behavior recognition modules using an improved YOLOv4-MobileNetV3 deep learning network architecture; inputting the training set into the constructed recognition modules and training them with a loss function to obtain trained recognition modules; inputting the test set into the trained recognition modules for convolution processing, the classification result output by the model comprising the category of each target and the corresponding confidence; setting a threshold on target confidence and discarding targets whose confidence falls below the threshold; and judging, by the plurality of recognition modules, whether dangerous behavior exists in each image frame of the monitoring video, triggering the alarm module when dangerous behavior is confirmed. The invention can monitor and detect behavior in the workshop in real time.
Description
Technical Field
The invention relates to the technical field of workshop safety behavior recognition, and in particular to a workshop dangerous behavior recognition and alarm method and system based on monitoring video.
Background
With the development of industry, the safety problems in enterprises have not been effectively controlled. Workers' safety awareness has not been strengthened accordingly, and safety accidents in workshops are frequent. At present, most enterprises deploy numerous surveillance cameras but rely on manual, real-time review of the monitoring video. On the one hand, this consumes considerable labor cost; on the other hand, manual monitoring suffers high rates of missed reports and missed identifications. Workshops therefore also need an auxiliary safety recognition and alarm device.
In a workshop, workers' smoking can cause fire accidents, while eating or playing with mobile phones can distract workers and lead to safety accidents. Dangerous behavior detection is therefore a key technical component of the recognition and alarm system. Conventional object detection approaches include traditional machine learning methods and deep learning methods. However, the complex environment, moving personnel, and illumination changes in the workshop easily lead to missed detections and misjudgments of dangerous behaviors.
With machine vision, dangerous behaviors can be detected and recognized without manual effort once the monitoring video and computer hardware facilities are integrated, thereby reducing labor cost and improving detection efficiency.
Most existing dangerous behavior detection and recognition methods are based on deep learning, detecting various dangerous behaviors by simply swapping the training set of a classical object detection algorithm, typically the two-stage Faster R-CNN series or the single-stage YOLO series. However, in real application scenarios, because of the complex workshop environment, varying light levels, and changes in the size and angle of the target to be detected, existing detection algorithms cannot meet the requirements on detection accuracy (mAP) and detection speed (FPS) for real-time detection. Moreover, the hardware facilities in a workshop are usually insufficient for the computational demands of complex models, so the model's training parameters must be reduced while balancing detection accuracy and speed, making full use of the parallel computing capability provided by the GPU.
Disclosure of Invention
To address the defects of the prior art, the invention provides a workshop dangerous behavior recognition and alarm method and system based on monitoring video. Built on an improved YOLOv4-MobileNetV3 detection model and fully accounting for the limited capability of workshop infrastructure hardware, it can monitor and detect behavior in the workshop in real time, improves the speed and accuracy of dangerous behavior detection, and triggers an alarm module to notify safety personnel in time. It greatly reduces workshop equipment cost, reduces the occurrence of dangerous behaviors, improves workshop safety, and effectively maintains good order and production.
The present invention achieves the above-described object by the following technical means.
A workshop dangerous behavior recognition and alarm method based on monitoring video comprises the following steps:
S1: collecting a dangerous behavior image dataset and supplementing the number of images through image augmentation techniques;
S2: preprocessing the dangerous behavior images and determining a training set and a test set;
S3: constructing a plurality of dangerous behavior recognition modules using an improved YOLOv4-MobileNetV3 deep learning network architecture;
S4: inputting the training set into the constructed recognition modules and training them with a loss function to obtain trained recognition modules;
S5: inputting the test set into the trained recognition modules for convolution processing, the classification result output by the model comprising the category of each target and the corresponding confidence; setting a threshold on target confidence and discarding targets whose confidence falls below the threshold;
S6: judging, by the plurality of recognition modules, whether dangerous behavior exists in each image frame of the monitoring video, and triggering the alarm module when dangerous behavior is confirmed.
Further, the dangerous behavior image dataset comprises smoking images, eating images and mobile-phone-playing images.
Further, the image augmentation techniques include flipping, cropping, changing color (brightness, contrast, saturation and hue), and superimposing multiple images.
Further, the dangerous behavior images are processed with CutMix, Mosaic data enhancement and class label smoothing; the sample ratio of the training set to the test set is 10:1.
Further, constructing the plurality of dangerous behavior recognition modules with the improved YOLOv4-MobileNetV3 deep learning network architecture specifically comprises the following steps:
S3.1: the YOLOv4-MobileNetV3 deep learning network architecture comprises a Backbone feature extraction network, a Neck enhanced feature extraction network and a Head prediction network:
in the Backbone feature extraction network, CSPDarknet53 of the original YOLOv4 network is replaced with MobileNetV3, and the channel attention SENet module in MobileNetV3 is replaced with a coordinate attention (CA) module;
in the Neck enhanced feature extraction network, the ordinary convolutions in the PANet module of the YOLOv4 network are replaced with depthwise separable convolutions; the Head prediction network is that of YOLOv4;
S3.2: anchor box dimension clustering is performed with the K-means++ algorithm.
Further, the coordinate attention (CA) module decomposes channel attention into two 1-dimensional feature encodings, aggregated along the two spatial directions respectively, so that long-range dependencies are captured along one spatial direction while precise position information is preserved along the other; the generated feature maps are encoded into direction-aware and position-sensitive attention maps. The specific steps are as follows:

Coordinate information embedding: given an input $x_c$, each channel is encoded along the horizontal and vertical coordinates using pooling kernels of size $(H, 1)$ and $(1, W)$ respectively. The output of the $c$-th channel at height $h$ can be expressed as:

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le p < W} x_c(h, p)$$

The output of the $c$-th channel at width $w$ can be expressed as:

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le q < H} x_c(q, w)$$

wherein $x_c(h, p)$ denotes the $p$-th element along the horizontal direction at height $h$ in the $c$-th channel, and $x_c(q, w)$ denotes the $q$-th element along the vertical direction at width $w$ in the $c$-th channel; $z^h$ and $z^w$ aggregate the features along the two spatial directions respectively, yielding the corresponding direction-aware feature maps; $R^{C \times H \times W}$ denotes the feature set with channel number $C$, height $H$ and width $W$;

Coordinate attention generation: $z^h$ and $z^w$ are concatenated and then transformed with a $1 \times 1$ convolution transformation function $F_1$:

$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)$$

wherein $[\cdot, \cdot]$ denotes the concatenation operation along the spatial dimension; $\delta$ is a nonlinear activation function; $f \in R^{C/r \times (H+W)}$ is the intermediate feature map encoding the spatial information in the horizontal and vertical directions, $r$ being the reduction ratio controlling the block size; $f$ is then decomposed along the spatial dimension into two separate tensors $f^h \in R^{C/r \times H}$ and $f^w \in R^{C/r \times W}$;

Two further $1 \times 1$ convolution transformation functions $F_h$ and $F_w$ transform the feature maps $f^h$ and $f^w$ in the $c$-th channel respectively, and the outputs, used as attention weights, are:

$$g^h = \sigma\left(F_h\left(f^h\right)\right), \qquad g^w = \sigma\left(F_w\left(f^w\right)\right)$$

wherein $\sigma$ is the sigmoid activation function, $i$ is the horizontal coordinate variable in the $c$-th channel, and $j$ is the vertical coordinate variable in the $c$-th channel;

The output of the coordinate attention CA module is:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$

wherein $x_c(i, j)$ is the input feature at position $(i, j)$ in the $c$-th channel.
Further, in the Neck enhanced feature extraction network, the ordinary convolutions in the PANet module of the YOLOv4 network are replaced with depthwise separable convolutions, the parameter count of a depthwise separable convolution being $\frac{1}{N} + \frac{1}{D_K^2}$ times that of an ordinary convolution, where $D_K$ is the convolution kernel size and $N$ is the number of output channels.
Further, the confidence is defined as:

$$\mathrm{Conf}_m^n = P_r(\mathrm{Object}) \times \mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}}$$

wherein $\mathrm{Conf}_m^n$ represents the confidence of the $n$-th bounding box of the $m$-th grid cell; $P_r(\mathrm{Object})$ represents the probability that the current bounding box contains an object; and $\mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}}$ is the intersection-over-union between the predicted bounding box and the ground-truth bounding box when the current bounding box contains an object.
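As a minimal sketch (not part of the patent; the function names and the corner-coordinate box format (x1, y1, x2, y2) are illustrative assumptions), the confidence definition above can be written as:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def box_confidence(p_object, pred_box, truth_box):
    """Conf = Pr(Object) * IOU(pred, truth), as in the definition above."""
    return p_object * iou(pred_box, truth_box)
```

A perfectly localized box with Pr(Object) = 1 thus gets confidence 1, and confidence degrades with either localization error or object uncertainty.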
A system implementing the above workshop dangerous behavior recognition and alarm method based on monitoring video comprises monitoring equipment, computing equipment, control equipment, display equipment and alarm equipment;
the monitoring equipment acquires real-time video from the workshop surveillance cameras and transmits it to the computing equipment; the computing equipment comprises the plurality of trained dangerous behavior recognition modules, which detect and recognize the image frames in the video stream and transmit real-time video annotated with target positions, categories and confidences to the display equipment and the control equipment;
the control equipment judges whether dangerous behavior exists by comparison against the set confidence threshold; when it detects that the confidence of a key image frame exceeds the threshold, it directs the display equipment to show the monitoring picture with a bounding rectangle in real time; when dangerous behavior is judged to exist, the control equipment combines all such image frames into a video stream with the dangerous behavior, uploads it to the alarm equipment, and triggers an alarm.
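The control-device thresholding described above might be sketched as follows (a hypothetical simplification; the frame and detection data structures are assumptions, and a real system would assemble the flagged frames into a video clip for the alarm equipment):

```python
def process_frames(frames, threshold=0.5):
    """frames: list of (frame_id, detections); each detection is (label, conf).
    Collect frame ids whose best detection exceeds the confidence threshold
    and decide whether the alarm should be triggered."""
    dangerous_frames = []
    for frame_id, detections in frames:
        if any(conf > threshold for _, conf in detections):
            dangerous_frames.append(frame_id)
    alarm = len(dangerous_frames) > 0  # trigger if any frame was flagged
    return dangerous_frames, alarm
```

For example, a stream where only frame 0 contains a high-confidence "smoking" detection yields ([0], True).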
The invention has the beneficial effects that:
1. Based on the original YOLOv4 network, the workshop dangerous behavior recognition and alarm method and system perform anchor box dimension clustering on the dataset with the K-means++ algorithm, improving algorithm accuracy. MobileNetV3, a network with stronger feature extraction capability and fewer parameters, is used as the replacement backbone. The SE (channel) attention module is replaced with a CA (coordinate) attention module, giving the feature maps stronger representational capability. The ordinary convolution modules in PANet are replaced with depthwise separable convolutions and CBAM attention is added, improving accuracy while reducing the parameter count and computational load, thereby better matching the available hardware facilities.
2. Based on the YOLOv4-MobileNetV3 detection model and fully accounting for the limited capability of workshop infrastructure hardware, the method and system can monitor and detect behavior in the workshop in real time, improve the speed and accuracy of dangerous behavior detection, and trigger the alarm module to notify safety personnel in time. This greatly reduces workshop equipment cost, reduces the occurrence of dangerous behaviors, improves workshop safety, and effectively maintains good order and production.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings required in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a workshop dangerous behavior identification and alarm method based on a surveillance video according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a YOLOv4 network model according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of an overall network structure of MobileNet V3 according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of bneck in a MobileNet V3 network according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a CA attention mechanism structure according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of an improved PANet according to an embodiment of the present invention.
Fig. 7 is a block diagram of a workshop dangerous behavior recognition alarm system based on a surveillance video according to an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the following figures and specific examples, but the scope of the invention is not limited thereto.
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
In the description of the present invention, it is to be understood that the terms "central", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "axial", "radial", "vertical", "horizontal", "inner", "outer" and the like indicate orientations and positional relationships as shown in the drawings, are used only for convenience and simplicity of description, do not indicate or imply that the referenced devices or elements must have a particular orientation or be constructed and operated in a particular orientation, and are not to be construed as limiting. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated; thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the present invention, unless otherwise expressly specified or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
As shown in fig. 1, the method for identifying and alarming dangerous behaviors of a workshop based on a surveillance video includes the following steps:
S1: collecting a dangerous behavior image dataset and supplementing the number of images through image augmentation techniques, specifically comprising the following steps:
S1.1: selecting surveillance videos from several enterprise workshops at different times, extracting key image frames containing dangerous behavior, and uniformly cropping them to 608 × 608 to obtain pictures of workshop personnel smoking, eating and playing with mobile phones; the dangerous behavior image dataset comprises smoking images, eating images and mobile-phone-playing images.
S1.2: applying a series of image augmentation operations to the extracted key frames, including flipping, cropping, and changing the brightness, contrast, saturation and hue of colors.
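The flipping and brightness operations mentioned above can be illustrated on a toy grayscale image (a minimal pure-Python sketch, not from the patent; a real pipeline would use an image-processing library):

```python
def hflip(image):
    """Horizontal flip: reverse each pixel row."""
    return [row[::-1] for row in image]

def adjust_brightness(image, factor):
    """Scale every pixel value by `factor` and clamp to [0, 255]."""
    return [[min(255, max(0, int(px * factor))) for px in row] for row in image]

img = [[10, 20], [30, 40]]              # 2x2 grayscale "image"
augmented = [img, hflip(img), adjust_brightness(img, 1.5)]
```

Each operation produces a new sample from the original, which is how the augmentation step supplements the dataset.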
S2: preprocessing the dangerous behavior images and determining the training set and test set, specifically comprising the following steps:
S2.1: the data enhancement operations used include:
S2.1.1: CutMix: combining images by cutting a patch from one image and pasting it onto the augmented image;
S2.1.2: Mosaic data enhancement: combining four training images into one at random proportions;
S2.1.3: class label smoothing: class labels are encoded to account, to some extent, for uncertainty, i.e. possible labeling errors, overfitting, and over-confidence in predictions. Typically 0.9 is chosen to represent the correct class.
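The class label smoothing described above can be sketched as follows (illustrative, not the patent's implementation; with eps = 0.1 the correct class receives 0.9, matching the value mentioned above):

```python
def smooth_labels(one_hot, eps=0.1):
    """Replace hard 0/1 targets with smoothed values:
    correct class -> 1 - eps, each other class -> eps / (K - 1)."""
    k = len(one_hot)
    return [(1 - eps) if v == 1 else eps / (k - 1) for v in one_hot]
```

For the three classes here (smoking, eating, playing), a hard label [1, 0, 0] becomes approximately [0.9, 0.05, 0.05].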
S2.2: manually annotating the extracted images with class names and bounding-box positions, defining the labels as smoking, eating and playing respectively, and dividing them into a training set and a test set with a sample ratio of 10:1.
S3: constructing a plurality of dangerous behavior recognition modules using the improved YOLOv4-MobileNetV3 deep learning network architecture, specifically comprising the following steps:
S3.1: as shown in fig. 2, the YOLOv4-MobileNetV3 deep learning network architecture used in the invention comprises a Backbone feature extraction network, a Neck enhanced feature extraction network and a Head prediction network:
in the Backbone feature extraction network, CSPDarknet53 of the original YOLOv4 network is replaced with MobileNetV3, and the channel attention SENet module in MobileNetV3 is replaced with a coordinate attention (CA) module;
in the Neck enhanced feature extraction network, the ordinary convolutions in the PANet module of the YOLOv4 network are replaced with depthwise separable convolutions; the Head prediction network is that of YOLOv4. The architecture integrates four features (depthwise separable convolution, the inverted residual structure with linear bottleneck, the improved CA attention mechanism, and the h-swish activation function), reducing the model's parameter count while maintaining detection accuracy, improving its feature extraction capability, raising the detection rate (FPS), and lowering the demands on infrastructure hardware;
As shown in figs. 3 and 4, the MobileNetV3 backbone network and bneck (bottleneck layer) in YOLOv4-MobileNetV3 are described in more detail:
In the MobileNetV3 network structure, Input denotes the dimension of the feature matrix fed to the current layer; Operator denotes the block operation each feature layer passes through; exp size denotes the dimension of the first dimension-raising 1 × 1 convolution output inside bneck; #out denotes the number of channels of the feature layer at the bneck output; NBN indicates that no batch normalization is used; CA indicates whether the attention mechanism is used; NL denotes the nonlinear activation function currently used, with HS for h-swish and RE for ReLU; s is the stride used by each block structure. A bneck block first applies a 1 × 1 convolution to the input to raise the channel number, then applies a depthwise convolution in the high-dimensional space, then refines the feature map data through the CA attention mechanism, and finally reduces the channel number through a 1 × 1 convolution (with an activation function). When stride = 1 and the input and output feature maps have the same dimensions, the input and output are connected with a residual; when stride = 2 (the down-sampling stage), the dimension-reduced feature map is output directly.
As shown in fig. 5, the improved attention mechanism module in the MobileNet V3 backbone network is explained:
The original attention mechanism, SENet (Squeeze-and-Excitation Network), is a channel attention network: it applies global average pooling to the input feature map and outputs corresponding weights through two fully connected layers and a sigmoid activation function. It mainly models channel relationships and neglects position information, i.e. spatial selection. Coordinate Attention (CA) instead decomposes channel attention into two 1-dimensional feature encoding processes, each aggregated along one of the 2 spatial directions, preserving precise position information along one spatial direction while capturing long-range dependencies along the other. The generated feature maps are encoded into direction-aware and position-sensitive attention maps, which are applied complementarily to the input feature map to enhance the representation of the objects of interest. The specific steps are as follows:

Coordinate information embedding: given an input $x_c$, each channel is encoded along the horizontal and vertical coordinates using pooling kernels of size $(H, 1)$ and $(1, W)$ respectively. The output of the $c$-th channel at height $h$ can be expressed as:

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le p < W} x_c(h, p)$$

The output of the $c$-th channel at width $w$ can be expressed as:

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le q < H} x_c(q, w)$$

wherein $x_c(h, p)$ denotes the $p$-th element along the horizontal direction at height $h$ in the $c$-th channel, and $x_c(q, w)$ denotes the $q$-th element along the vertical direction at width $w$ in the $c$-th channel; $z^h$ and $z^w$ aggregate the features along the two spatial directions respectively, yielding the corresponding direction-aware feature maps; $R^{C \times H \times W}$ denotes the feature set with channel number $C$, height $H$ and width $W$;

Coordinate attention generation: $z^h$ and $z^w$ are concatenated and then transformed with a $1 \times 1$ convolution transformation function $F_1$:

$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)$$

wherein $[\cdot, \cdot]$ denotes the concatenation operation along the spatial dimension; $\delta$ is a nonlinear activation function; $f \in R^{C/r \times (H+W)}$ is the intermediate feature map encoding the spatial information in the horizontal and vertical directions, $r$ being the reduction ratio controlling the block size; $f$ is then decomposed along the spatial dimension into two separate tensors $f^h \in R^{C/r \times H}$ and $f^w \in R^{C/r \times W}$;

Two further $1 \times 1$ convolution transformation functions $F_h$ and $F_w$ transform the feature maps $f^h$ and $f^w$ in the $c$-th channel respectively, and the outputs, used as attention weights, are:

$$g^h = \sigma\left(F_h\left(f^h\right)\right), \qquad g^w = \sigma\left(F_w\left(f^w\right)\right)$$

wherein $\sigma$ is the sigmoid activation function, $i$ is the horizontal coordinate variable in the $c$-th channel, and $j$ is the vertical coordinate variable in the $c$-th channel;

The output of the coordinate attention CA module is:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$

wherein $x_c(i, j)$ is the input feature at position $(i, j)$ in the $c$-th channel.
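The coordinate information embedding step above can be sketched in pure Python (an illustrative simplification, not part of the patent; a single channel is represented as an H × W nested list):

```python
def coordinate_pool(x):
    """x: one channel as an H x W nested list of floats.
    Returns (z_h, z_w): per-row averages (pooling over the width) and
    per-column averages (pooling over the height), i.e. the two 1-D
    direction-aware encodings used by coordinate attention."""
    h, w = len(x), len(x[0])
    z_h = [sum(row) / w for row in x]                             # length H
    z_w = [sum(x[q][j] for q in range(h)) / h for j in range(w)]  # length W
    return z_h, z_w
```

The full module would then concatenate these encodings, apply the 1 × 1 convolutions and sigmoid, and reweight the input; this sketch shows only the pooling that preserves position along each axis.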
As shown in fig. 6, the specific steps of improving the PANet module in the Neck feature extraction layer in the original yollov 4 network are as follows:
the CBAM attention mechanism is used for the output feature map after the up-sampling kernel down-sampling, and the CBAM is a light-weight general-purpose module, so that the cost of the module can be ignored and the module can be seamlessly integrated into a PANET framework. The CBAM module takes the output result of the convolutional layer as an input feature map, obtains a weighting result through a channel attention module, weights the intermediate feature map processed by the channel attention module through a space attention module, and finally multiplies the attention distribution weight by the input feature map.
Replace the original normal Convolution module Conv in PANet with a depth Separable Convolution (i.e. DWConv — Depthwise Separable Convolution in fig. 4):
let the input feature dimension be D F ×D F ×M,D F For feature size, M is the number of channels, D K The convolution kernel size is N, the number of output channels is N;
standard convolution kernel parameter of D K ×D K ×M×N;
A depthwise separable convolution consists of a depthwise convolution followed by a pointwise convolution, computed as follows:
Depthwise convolution: the kernel parameters are $D_K \times D_K \times 1 \times M$, and the output feature dimensions are $D_F \times D_F \times M$. Each channel corresponds to exactly one convolution kernel (kernel depth 1), so the FLOPs (floating-point operations) are $M \times D_F \times D_F \times D_K \times D_K$.
Pointwise convolution: the input is the depthwise output of dimension $D_F \times D_F \times M$; the kernel parameters are $1 \times 1 \times M \times N$ and the output dimension is $D_F \times D_F \times N$. A $1 \times 1$ standard convolution is applied at each feature position, so the FLOPs are $N \times D_F \times D_F \times M$.
Adding the two kernel parameter counts gives $D_K \times D_K \times M + M \times N$. The ratio of depthwise separable convolution parameters to standard convolution parameters is therefore:

$$\frac{D_K \times D_K \times M + M \times N}{D_K \times D_K \times M \times N} = \frac{1}{N} + \frac{1}{D_K^2}$$

Hence, the larger the number of output channels or the larger the convolution kernel, the greater the parameter and computation savings of the depthwise separable convolution.
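The parameter arithmetic above can be checked with a few lines of Python (a sketch; the function names are ours, not from the patent):

```python
def conv_params(dk, m, n):
    """Parameter count of a standard dk x dk convolution: D_K * D_K * M * N."""
    return dk * dk * m * n

def dsc_params(dk, m, n):
    """Depthwise (D_K * D_K * 1 * M) plus pointwise (1 * 1 * M * N) parameters."""
    return dk * dk * m + m * n

def ratio(dk, m, n):
    """Ratio of DSC to standard parameters; equals 1/N + 1/D_K**2."""
    return dsc_params(dk, m, n) / conv_params(dk, m, n)

# e.g. a 3x3 convolution with M = N = 256 channels
print(ratio(3, 256, 256))  # 1/256 + 1/9, roughly 0.115
```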
S3.2: performing anchor frame dimension clustering by using a K-means + + algorithm, wherein the method comprises the following steps:
s3.2.1: randomly select one sample from the training set as the first initial cluster center;
s3.2.2: select the remaining cluster centers: compute the shortest distance $D(x)$ between each sample in the training set and the already-chosen cluster centers; then compute the probability of each sample being selected as the next cluster center, $P(x) = \frac{D(x)^2}{\sum_x D(x)^2}$, and select the next cluster center by roulette-wheel selection;
s3.2.3: the above process is repeated until k cluster centers are determined.
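Steps S3.2.1 to S3.2.3 can be sketched as follows. Plain Euclidean distance on (w, h) anchor samples is assumed here for simplicity; YOLO-style anchor clustering often uses 1 − IOU as the distance instead, and the patent does not specify which metric it uses.

```python
import random

def kmeans_pp_init(samples, k, rng=random.Random(0)):
    """K-means++ seeding as in S3.2: the first center is chosen at random,
    each subsequent center is drawn with probability D(x)^2 / sum D(x)^2
    (roulette-wheel selection). samples: list of (w, h) anchor boxes."""
    centers = [rng.choice(samples)]
    while len(centers) < k:
        # D(x)^2: squared shortest distance to the nearest chosen center
        d2 = [min((w - cw) ** 2 + (h - ch) ** 2 for (cw, ch) in centers)
              for (w, h) in samples]
        total = sum(d2)
        # roulette wheel proportional to D(x)^2
        r = rng.random() * total
        acc = 0.0
        for s, dist in zip(samples, d2):
            acc += dist
            if acc > r:
                centers.append(s)
                break
    return centers
```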
S4: inputting a training set into a plurality of constructed dangerous behavior recognition modules, and training by utilizing a loss function to obtain a plurality of trained dangerous behavior recognition modules, wherein the method specifically comprises the following steps:
s4.1: set the training labels to smoking, eating, and phone-playing respectively, so as to detect smoking, eating, and playing with a mobile phone;
s4.2: inputting a training set into a plurality of constructed dangerous behavior recognition modules, modifying parameters required by training, and specifically:
s4.2.1: the parameters include the number of images per training iteration (batch), the number of mini-batch subdivisions within a batch (subdivisions), the iteration counts at which the learning rate changes (steps), the input image width, the input image height, the number of input image channels, the number of classes, the image rotation angle, and so on, specifically:
s4.2.2: batch = 96; subdivisions = 32; steps = 14000, 16000; width = 608; height = 608; channels = 3; classes = 3; angle = 0;
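The hyperparameters of s4.2.2 can be collected as a plain Python dict for reference. The dict layout and the filters formula shown are illustrative conventions (darknet-based YOLO implementations read these values from a .cfg file, and the filters rule is the usual YOLO head sizing, not something the patent states):

```python
# Training hyperparameters from s4.2.2, collected as a plain dict.
# Key names follow the darknet .cfg convention ("angle", not "angel").
train_cfg = {
    "batch": 96,              # images per training iteration
    "subdivisions": 32,       # mini-batch split: 96 / 32 = 3 images per pass
    "steps": (14000, 16000),  # iterations at which the learning rate decays
    "width": 608,
    "height": 608,
    "channels": 3,
    "classes": 3,             # smoking / eating / playing with a phone
    "angle": 0,               # rotation augmentation disabled
}

# Usual YOLO convention: each detection head needs (classes + 5) * 3 filters
filters = (train_cfg["classes"] + 5) * 3
print(filters)  # 24
```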
s4.3: input the training set carrying the three types of labels into the plurality of dangerous behavior recognition modules in batches to obtain the corresponding outputs; compute the loss value from the loss function and back-propagate, continuously updating the model parameters; stop training once the iteration count exceeds the threshold, and select the parameters with the minimum loss as the final model parameters, thereby obtaining the plurality of trained dangerous behavior recognition modules.
Wherein, the loss function comprises three parts: the classification loss Class_Loss, the confidence loss Conf_Loss, and the bounding-box regression loss CIOU_Loss.
Loss=Class_Loss+Conf_Loss+CIOU_Loss;
In the formula: s 2 The number of grids; n is the number of prediction frames in each grid;containing the target and not containing the target for the prediction frame; lambda [ alpha ] noobj Is a weight coefficient;representing the prediction confidence of the nth bounding box of the mth grid cell;representing its true confidence; p r (Object)、The prediction probability and the real probability of the object in the current box are represented; IOU is the cross-over ratio between the prediction box and the real box; b, b gt Respectively representing the central points of a prediction box and a ground truth box GT box of the prediction box; rho 2 Means the square of the distance between the two center points; l. the 2 The length square of a diagonal line of a minimum frame which just can contain a prediction frame and a real frame is defined; α is a penalty factor; ν is the ratio of length to width similarity of the real and predicted frames. Wherein, the calculation formulas of alpha and nu are as follows:
In the formula: $\omega^{gt}, h^{gt}$ and $\omega, h$ are the width and height of the ground-truth box and the prediction box, respectively;
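The $\alpha$ and $\nu$ terms of the CIoU loss above can be computed directly. This sketch assumes that $IOU$, $\rho^2$, and $l^2$ have already been computed from the boxes; the variable names are ours.

```python
import math

def aspect_term(w_gt, h_gt, w, h, iou):
    """Aspect-ratio consistency term v and its trade-off weight alpha:
    v = (4/pi^2)(arctan(w_gt/h_gt) - arctan(w/h))^2,
    alpha = v / ((1 - IOU) + v)."""
    v = (4.0 / math.pi ** 2) * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2
    alpha = v / ((1.0 - iou) + v) if v > 0 else 0.0
    return alpha, v

def ciou_loss(iou, rho2, l2, w_gt, h_gt, w, h):
    """CIOU_Loss = 1 - IOU + rho^2 / l^2 + alpha * v, where rho2 is the squared
    centre-point distance and l2 the squared diagonal of the enclosing box."""
    alpha, v = aspect_term(w_gt, h_gt, w, h, iou)
    return 1.0 - iou + rho2 / l2 + alpha * v
```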
s5: inputting the test set into a plurality of dangerous behavior recognition modules which are trained to carry out convolution processing, wherein the classification result output by the model comprises the category to which the target belongs and the corresponding confidence coefficient; setting a threshold according to the confidence of the target, and removing the class to which the target with the confidence lower than the threshold belongs;
The confidence is defined as:

$$C_m^n = P_r(Object) \times IOU_{pred}^{truth}$$

where $C_m^n$ represents the confidence of the $n$-th bounding box of the $m$-th grid cell; $P_r(Object)$ represents the probability that the current bounding box contains an object; and $IOU_{pred}^{truth}$ represents the intersection-over-union between the predicted bounding box and the object's ground-truth bounding box when the current bounding box contains an object.
During training, $\hat{C}_m^n$ represents the predicted value and $C_m^n$ the actual value; the value of $C_m^n$ is determined by whether a bounding box of the grid cell is responsible for predicting a certain object: if it is responsible, $C_m^n = 1$; if not, $C_m^n = 0$.
The rectangular box represents the size and exact position of the target; the confidence value represents how reliable the predicted rectangular box is: the larger the value, the more likely a target is present in the box. The prediction boxes containing targets are screened with a non-maximum suppression algorithm to remove duplicate rectangular boxes corresponding to the same target; the index of the maximum probability among the classification probabilities of the screened prediction boxes is then taken as the target's class index, thereby obtaining the target's category.
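The screening step can be sketched as a greedy non-maximum suppression pass. The IOU threshold value of 0.45 is an illustrative assumption; the patent does not give one.

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(dets, iou_thr=0.45):
    """Greedy non-maximum suppression: keep the highest-confidence box,
    drop boxes overlapping it above iou_thr, and repeat.
    dets: list of (box, confidence) pairs."""
    dets = sorted(dets, key=lambda d: d[1], reverse=True)
    kept = []
    while dets:
        best = dets.pop(0)
        kept.append(best)
        dets = [d for d in dets if iou(best[0], d[0]) < iou_thr]
    return kept
```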
S6: and judging whether dangerous behaviors exist in all image frames in the monitoring video by the aid of a plurality of dangerous behavior identification modules based on convolution processing, and triggering the alarm module when the dangerous behaviors are determined to exist.
As shown in fig. 7, a system for the workshop dangerous behavior recognition alarm method based on the monitoring video includes a monitoring device, a computing device, a control device, a display device and an alarm device;
the monitoring equipment is used for acquiring real-time video data shot by the workshop monitoring camera and transmitting the workshop real-time video to the computing equipment; the computing device comprises a plurality of trained dangerous behavior recognition modules which are used for detecting and recognizing image frames in video transmission and transmitting real-time videos with target positions, categories and confidence degrees to a display device and a control device.
The control device judges whether dangerous behavior exists by comparison against a preset confidence threshold: when the confidence of a key image frame is detected to be greater than the set threshold, the control device controls the display device to display the monitoring picture with rectangular boxes in real time; when dangerous behavior is judged to exist, the control device combines all such judged image frames into a video stream containing the dangerous behavior, uploads it to the alarm device, and triggers an alarm.
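The control device's threshold-and-collect logic can be sketched as follows. The labels and the threshold value are illustrative assumptions; in the real system the flagged frames would be merged into a video stream and uploaded to the alarm device.

```python
def frame_alarm(detections, conf_thr=0.5):
    """Return the detections in one frame that pass the confidence threshold.
    detections: list of (label, confidence) pairs from the recognition modules."""
    return [(lbl, c) for (lbl, c) in detections if c > conf_thr]

def monitor(frames, conf_thr=0.5):
    """Collect (frame index, passing detections) for every frame whose
    detections exceed the threshold, i.e. the frames that would be merged
    into the dangerous-behavior video stream."""
    flagged = []
    for idx, dets in enumerate(frames):
        hits = frame_alarm(dets, conf_thr)
        if hits:
            flagged.append((idx, hits))
    return flagged
```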
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution; this manner of description is adopted merely for clarity. Those skilled in the art should take the specification as a whole; the technical solutions in the embodiments may be combined as appropriate to form other implementations understandable to those skilled in the art.
The above-listed detailed description is only a specific description of possible embodiments of the present invention, and they are not intended to limit the scope of the present invention, and equivalent embodiments or modifications made without departing from the technical spirit of the present invention should be included in the scope of the present invention.
Claims (8)
1. A workshop dangerous behavior recognition alarm method based on a monitoring video is characterized by comprising the following steps:
s1: acquiring a dangerous behavior image data set, and performing quantity supplement on the acquired images through an image augmentation technology;
s2: preprocessing dangerous behavior images, and determining a training set and a testing set;
s3: constructing a plurality of dangerous behavior identification modules by utilizing an improved YOLOv4-MobileNet V3 deep learning network architecture;
s4: inputting a training set into a plurality of constructed dangerous behavior recognition modules, and training by using a loss function to obtain a plurality of trained dangerous behavior recognition modules;
s5: inputting the test set into a plurality of trained dangerous behavior recognition modules for convolution processing, wherein classification results output by the models comprise categories to which targets belong and corresponding confidence coefficients; setting a threshold according to the confidence of the target, and removing the class to which the target with the confidence lower than the threshold belongs;
s6: and judging whether dangerous behaviors exist in all image frames in the monitoring video by the aid of a plurality of dangerous behavior identification modules based on convolution processing, and triggering the alarm module when the dangerous behaviors are determined to exist.
2. The monitoring video based workshop dangerous behavior recognition alarm method according to claim 1, wherein the dangerous behavior image dataset comprises smoking images, eating images and playing mobile phone images.
3. The monitoring video based workshop dangerous behavior recognition alarm method according to claim 1, characterized in that the dangerous behavior images are processed by cut-and-mix (CutMix), Mosaic data augmentation, and class label smoothing; the sample ratio of the training set to the test set in the dataset is 10:1.
4. The monitoring video based workshop dangerous behavior recognition alarm method according to claim 1, wherein a plurality of dangerous behavior recognition modules are constructed by using the improved classification network architecture of YOLOv4-MobileNet V3 deep learning, and specifically the method comprises the following steps:
s3.1: the YOLOv4-MobileNet V3 deep learning network architecture comprises a Backbone network for Backbone feature extraction, a Neck enhanced feature extraction network and a Head prediction network:
in a Backbone network for backhaul feature extraction, replacing CSPDarknet53 in an original YOLOv4 network with a MobileNet V3, and replacing a channel attention SENet mechanism module in the MobileNet V3 with a position attention CA mechanism module;
in a Neck reinforced feature extraction network, replacing the common convolution in a PANET module in a YOLOv4 network with a deep separable convolution; the Head prediction network is a Head prediction network in YOLOv 4;
s3.2: and (5) carrying out anchor frame dimension clustering by using a K-means + + algorithm.
5. The monitoring video based workshop dangerous behavior recognition alarm method according to claim 4, wherein the position attention CA mechanism module decomposes the channel attention into two 1-dimensional feature encodings that aggregate features along the 2 spatial directions respectively, so that long-range dependencies are captured along one spatial direction while precise position information is preserved along the other spatial direction; the generated feature maps are then separately encoded into a pair of direction-aware and position-sensitive attention maps, specifically comprising the following steps:
Coordinate information embedding: given an input $x_c$, each channel is encoded along the horizontal and vertical coordinates using pooling kernels of size $(H, 1)$ or $(1, W)$ respectively. The output of the $c$-th channel at height $h$ can be expressed as:

$$z_c^h(h) = \frac{1}{W}\sum_{0 \le p < W} x_c(h, p)$$

The output of the $c$-th channel at width $w$ can be expressed as:

$$z_c^w(w) = \frac{1}{H}\sum_{0 \le q < H} x_c(q, w)$$
wherein: $x_c(h, p)$ denotes the $p$-th element along the horizontal direction at height $h$ in the $c$-th channel; $x_c(q, w)$ denotes the $q$-th element along the vertical direction at width $w$ in the $c$-th channel; $z_c^h$ and $z_c^w$ aggregate the features along the two spatial directions respectively, yielding the corresponding direction-aware feature maps; $\mathbb{R}^{C\times H\times W}$ denotes the feature set with $C$ channels, height $H$, and width $W$;
Coordinate attention generation: a join (concatenation) operation is performed on $z^h$ and $z^w$, followed by a $1\times1$ convolution transform function $F_1$:

$$f = \delta\big(F_1\big([z^h, z^w]\big)\big)$$

In the formula, $[\cdot,\cdot]$ denotes the join operation along the spatial dimension; $\delta$ is a nonlinear activation function; $f \in \mathbb{R}^{C/r\times(H+W)}$ is the intermediate feature map encoding spatial information in the horizontal and vertical directions; $f$ is decomposed along the spatial dimension into two separate tensors $f^h \in \mathbb{R}^{C/r\times H}$ and $f^w \in \mathbb{R}^{C/r\times W}$, where $r$ controls the size reduction ratio of the SE block;
Two further $1\times1$ convolution transform functions $F_h$ and $F_w$ are used to transform the feature maps $f^h$ and $f^w$ in the $c$-th channel respectively; the outputs $g^h$ and $g^w$ serve as the attention weights:

$$g^h = \sigma\big(F_h(f^h)\big), \qquad g^w = \sigma\big(F_w(f^w)\big)$$
wherein sigma is a sigmoid activation function, i is a horizontal coordinate variable in a c channel, and j is a vertical coordinate variable in the c channel;
The output of the position attention CA mechanism module is:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$

where $x_c(i, j)$ is the input feature at position $(i, j)$ in the $c$-th channel.
6. The monitoring video based workshop dangerous behavior recognition alarm method according to claim 4, characterized in that in the Neck enhanced feature extraction network, the ordinary convolution in the PANet module of the YOLOv4 network is replaced by a depthwise separable convolution, wherein the ratio of the depthwise separable convolution parameters $D_K \times D_K \times M + M \times N$ to the ordinary convolution parameters $D_K \times D_K \times M \times N$ is $\frac{1}{N} + \frac{1}{D_K^2}$, where $D_K$ is the convolution kernel size, $M$ is the number of input channels, and $N$ is the number of output channels.
7. The monitoring video based workshop dangerous behavior recognition alarm method according to claim 1, wherein the confidence is defined as:

$$C_m^n = P_r(Object) \times IOU_{pred}^{truth}$$

where $C_m^n$ represents the confidence of the $n$-th bounding box of the $m$-th grid cell; $P_r(Object)$ represents the probability that the current bounding box contains an object; and $IOU_{pred}^{truth}$ represents the intersection-over-union between the predicted bounding box and the object's ground-truth bounding box when the current bounding box contains an object.
8. The system for the workshop dangerous behavior recognition alarm method based on the monitoring video is characterized by comprising monitoring equipment, computing equipment, control equipment, display equipment and alarm equipment, wherein the monitoring equipment is used for monitoring workshop dangerous behaviors;
the monitoring equipment is used for acquiring real-time video data shot by the workshop monitoring camera and transmitting the workshop real-time video to the computing equipment; the computing device comprises a plurality of trained dangerous behavior recognition modules used for detecting and recognizing image frames in video transmission and transmitting real-time video with target positions, categories and confidence degrees to a display device and a control device,
the control equipment judges that dangerous behaviors exist by comparing set confidence threshold values, and when the confidence of the key image frame is detected to be greater than the set threshold values, the control equipment controls the display equipment to display a monitoring picture with a rectangular frame in real time; when dangerous behaviors are judged to exist, the control device combines all the judged image frames into a video stream with the dangerous behaviors, the video stream is uploaded to the alarm device, and an alarm is triggered.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210993747.6A CN115331172A (en) | 2022-08-18 | 2022-08-18 | Workshop dangerous behavior recognition alarm method and system based on monitoring video |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115331172A true CN115331172A (en) | 2022-11-11 |
Family
ID=83926569
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210993747.6A Pending CN115331172A (en) | 2022-08-18 | 2022-08-18 | Workshop dangerous behavior recognition alarm method and system based on monitoring video |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115331172A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116311361A (en) * | 2023-03-02 | 2023-06-23 | 北京化工大学 | Dangerous source indoor staff positioning method based on pixel-level labeling |
CN116311361B (en) * | 2023-03-02 | 2023-09-15 | 北京化工大学 | Dangerous source indoor staff positioning method based on pixel-level labeling |
CN116740821A (en) * | 2023-08-16 | 2023-09-12 | 南京迅集科技有限公司 | Intelligent workshop control method and system based on edge calculation |
CN116740821B (en) * | 2023-08-16 | 2023-10-24 | 南京迅集科技有限公司 | Intelligent workshop control method and system based on edge calculation |
CN117011301A (en) * | 2023-10-07 | 2023-11-07 | 广东三姆森科技股份有限公司 | Defect detection method and device based on YOLO model |
CN117237741A (en) * | 2023-11-08 | 2023-12-15 | 烟台持久钟表有限公司 | Campus dangerous behavior detection method, system, device and storage medium |
CN117237741B (en) * | 2023-11-08 | 2024-02-13 | 烟台持久钟表有限公司 | Campus dangerous behavior detection method, system, device and storage medium |
CN117671594A (en) * | 2023-12-08 | 2024-03-08 | 中化现代农业有限公司 | Security monitoring method, device, electronic equipment and storage medium |
CN117496678A (en) * | 2024-01-02 | 2024-02-02 | 广州市声讯电子科技股份有限公司 | Emergency broadcast image-text alarm method, alarm system and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |