CN114120444A - 3D convolution neural network unsafe behavior detection system based on human skeleton characteristics - Google Patents

3D convolution neural network unsafe behavior detection system based on human skeleton characteristics

Info

Publication number
CN114120444A
Authority
CN
China
Prior art keywords
human body
human
frame
model
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111361515.0A
Other languages
Chinese (zh)
Inventor
姚森均
范昕炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Jiliang University
Original Assignee
China Jiliang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Jiliang University filed Critical China Jiliang University
Priority to CN202111361515.0A priority Critical patent/CN114120444A/en
Publication of CN114120444A publication Critical patent/CN114120444A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a 3D convolutional neural network unsafe behavior detection system based on human skeleton features. The invention relates to the technical fields of human target detection, human pose estimation and human behavior recognition. Taking a real-time monitoring picture of a production work site as input, the system uses a YOLOX-Tiny algorithm model to obtain the coordinates of the minimum rectangular bounding box around each human body, reducing video redundancy; it then generates human skeleton feature vectors with Lite-HRNet, and finally judges whether an operator exhibits unsafe behavior, and classifies that behavior, with a PoseC3D human action recognition network model. The invention adopts lightweight algorithms and filters redundancy from the input video frame by frame, retaining key information and improving the real-time performance of action recognition while ensuring recognition accuracy.

Description

3D convolution neural network unsafe behavior detection system based on human skeleton characteristics
[ technical field ]
The invention relates to the technical fields of human target detection, human pose estimation and human behavior recognition, and in particular to a 3D convolutional neural network unsafe behavior detection system based on human skeleton features.
[ background of the invention ]
In the field of production safety, as technology has developed, the share of accidents caused by unsafe conditions of objects (equipment, premises, and the like) and by unsafe environmental factors has declined year by year. How to regulate operator behavior and reduce the probability of unsafe acts has therefore become an important but challenging problem. Although stronger supervision, improved regulations and the spread of monitoring systems have brought some improvement, the number of production safety accidents and resulting deaths remains considerable. Research into identifying unsafe behavior in specific workplaces is therefore needed to reduce production safety accidents and the casualties and property losses they cause.
Earlier supervision systems relied mostly on manpower. In the early, less technically developed stage, they depended entirely on safety personnel and were limited by the competence and number of supervisors, so lapses were frequent. At present, monitoring systems are widely deployed in production safety settings, but few can intelligently identify personnel or monitor their actions within the camera's range in real time; a manager must watch the monitor continuously, so large blind spots remain.
[ summary of the invention ]
To address these shortcomings of the prior art in the field of production safety, the invention provides a 3D convolutional neural network unsafe behavior detection system based on human skeleton features.
The technical solution adopted by the invention to solve these problems is as follows. The invention discloses a 3D convolutional neural network unsafe behavior detection system based on human skeleton features, characterized by comprising the following steps:
step one: acquiring a real-time video image through video acquisition equipment;
step two: detecting the people in the video image with the YOLOX-Tiny algorithm to obtain the total number of people appearing in the image and the coordinates and confidence of each person's minimum rectangular bounding box, and setting a confidence threshold;
step three: applying the Lite-HRNet algorithm to the output of step two, performing human pose estimation within each minimum rectangular bounding box whose confidence exceeds the set threshold to obtain the coordinates of the body's key skeletal joints, and chaining these joints according to the position of each limb to form a human skeleton feature vector;
step four: performing behavior recognition on the skeleton feature vectors from step three with the PoseC3D algorithm, and judging whether the person exhibits unsafe behavior.
Further, the second step includes:
(1) collecting sample videos that typically contain unsafe behaviors as a positive sample set and videos of human bodies in various other postures as a negative sample set, merging the sample sets into one data set in COCO format, and dividing the uniformly processed and fused data set into training, validation and test sets at a ratio of 8:1:1;
(2) constructing the YOLOX-Tiny algorithm model with CSPDarknet as the backbone, YOLOXPAFPN as the neck and YOLOXHead as the bbox_head;
(3) training YOLOX-Tiny on this data set to establish a highly robust human target detection model;
(4) resizing each input video frame to a resolution of 416 × 416 pixels, retaining the original position information and scale of the picture, and padding the blank regions;
(5) predicting on the input frame with the obtained target detection model to obtain the coordinates and prediction confidence of the rectangular bounding box around each human body.
Further, the third step comprises the following steps:
(1) constructing the Lite-HRNet network model with Lite-HRNet-18 as the backbone and neck and TopDownSimpleHead as the keypoint_head;
(2) training the Lite-HRNet network model on the data set to establish a human skeleton feature vector prediction model;
(3) feeding the frame pictures as a picture stream, together with the obtained bounding box coordinates and prediction confidences, into the trained Lite-HRNet model for prediction;
(4) in the input picture stream, predicting the human skeleton feature vector only within the region of each frame covered by the minimum rectangular bounding box obtained in step two.
Further, the fourth step comprises the following steps:
(1) constructing the PoseC3D algorithm model with ResNet3dSlowOnly as the backbone and neck and I3DHead as the cls_head;
(2) training the PoseC3D network model on the data set to establish a behavior recognition prediction model based on human skeleton feature vectors;
(3) generating a 2D pose map from the established human skeleton feature vectors, and stacking T two-dimensional key point heat maps of shape K × H × W into a 3D heat map of shape K × T × H × W;
(4) stacking the obtained 3D heat maps into the PoseC3D model for recognition, taking a 64-frame time window as the collection length, dividing the window into N equal segments by uniform sampling, selecting one frame from each segment, collecting the N frames for behavior recognition, and judging and classifying unsafe behavior with a pre-trained classifier.
Compared with the prior art, the invention has the beneficial effects that:
(1) the invention combines target recognition, human key point extraction, behavior recognition and related technologies, providing safety supervisors with a means of detecting operators' unsafe behaviors in real time;
(2) the invention requires no depth camera to acquire human key point information: credible key point information can be obtained, and unsafe behavior identified, from conventional RGB video alone by deep learning, so the system can be combined simply and quickly with existing monitoring equipment;
(3) the invention covers a variety of detection categories and maintains high recognition accuracy and speed even in the complex and changeable environments of production sites.
[ description of the drawings ]
FIG. 1 is an overview of the invention, illustrating the processing flow of the unsafe behavior detection system;
FIG. 2 is the processing flow of the YOLOX-Tiny algorithm model;
FIG. 3 is a schematic diagram of a human skeleton feature vector;
FIG. 4 is the processing flow of the Lite-HRNet network model;
FIG. 5 is the processing flow of the PoseC3D network model.
[ detailed description of embodiments ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein are illustrative only and do not limit the scope of the invention. In the following description, well-known structures and techniques are omitted so as not to obscure the concepts of the invention unnecessarily.
With reference to the drawings of the specification: most existing human behavior detection models predict from a spatio-temporal model, which entails heavy computation and poor real-time performance, and existing production work sites lack intelligent real-time monitoring of operators' behavior, so such models are not very practical for large-scale deployment in industrial environments. In addition, in practical application scenarios the operating environment contains numerous complex devices, so the acquired data often contain a large amount of interference.
Addressing the root causes of these technical problems, the method first applies a well-trained YOLOX-Tiny algorithm model to detect human bodies in the collected video images. After the coordinates and confidence of each body's minimum rectangular bounding box are obtained, a pre-trained Lite-HRNet lightweight network model identifies human key points within each bounding box and generates human skeleton feature vectors. A PoseC3D network model based on a 3D-CNN (3D convolutional neural network) then identifies whether an operator in the image exhibits unsafe behavior. Compared with a traditional single model, the system greatly reduces the amount of computation, enabling the monitoring of operators' unsafe behaviors during production while maintaining accuracy.
Based on this idea, the invention provides an unsafe behavior recognition system built on YOLOX-Tiny, Lite-HRNet and PoseC3D. Please refer to FIG. 1, a flowchart of the operator unsafe behavior recognition system provided by the invention. The system specifically comprises the following steps:
the method comprises the following steps: and acquiring a real-time video image through video acquisition equipment.
S1: in the invention, the input real-time video image to be detected can be a monitoring video of a working place, wherein the video data to be detected can be formed by multi-frame image data;
Step two: detecting the people in the video image with the YOLOX-Tiny algorithm to obtain the total number of people appearing in the image and the coordinates and confidence of each person's minimum rectangular bounding box, and setting a confidence threshold;
S21: in the invention, the YOLOX-Tiny algorithm must first be pre-trained to achieve a high human recognition rate and establish a highly robust human target detection model. The pre-training data set is selected as follows:
the data sets used are the COCO data set and the MPII Human Pose data set (both contain large numbers of human images in various poses);
because the annotations of these two data sets cover many categories, the invention extracts only the person bbox annotations, converts them uniformly to COCO format, and merges them into one data set;
the data set obtained by uniformly processing and fusing the two data sets is divided into training, validation and test sets at a ratio of 8:1:1, as sketched below.
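For illustration, the following minimal sketch shows one way to perform the 8:1:1 split on a merged COCO-format annotation file. The file name, JSON layout and fixed seed are assumptions for the example, not details taken from the patent.

```python
import json
import random

def split_dataset(annotation_file, seed=42):
    """Split a merged COCO-format data set into train/val/test at 8:1:1."""
    with open(annotation_file) as f:
        coco = json.load(f)

    image_ids = [img["id"] for img in coco["images"]]
    random.Random(seed).shuffle(image_ids)  # deterministic shuffle

    n = len(image_ids)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return {
        "train": image_ids[:n_train],
        "val": image_ids[n_train:n_train + n_val],
        "test": image_ids[n_train + n_val:],
    }
```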
S22: in the invention, after pre-training is completed, the YOLOX-Tiny algorithm is trained specifically on an unsafe behavior data set to improve the accuracy of detecting human targets with unsafe behavior in operation scenes. The unsafe behavior data set is selected as follows:
sample pictures that typically contain unsafe behaviors are collected as a positive sample set, human pictures in various other postures are collected as a negative sample set, and the sample sets are merged into one data set in COCO format;
the data set obtained by uniformly processing and fusing the two sample sets is divided into training, validation and test sets at a ratio of 8:1:1;
S23: in the invention, the YOLOX-Tiny algorithm preprocesses each input video frame: the picture is resized to a resolution of 416 × 416 pixels, the original position information and scale of the picture are retained, and the blank regions are padded, as sketched below.
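A minimal sketch of this letterbox preprocessing, assuming OpenCV and NumPy; the grey pad value of 114 follows common YOLOX practice and is an assumption, since the patent only says the blank part is filled.

```python
import cv2
import numpy as np

def letterbox(frame, size=416, pad_value=114):
    """Resize a frame to size×size, preserving aspect ratio and padding
    the remainder, while returning the scale so box coordinates can be
    mapped back to the original image."""
    h, w = frame.shape[:2]
    scale = min(size / h, size / w)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(frame, (new_w, new_h))
    canvas = np.full((size, size, 3), pad_value, dtype=frame.dtype)
    canvas[:new_h, :new_w] = resized  # original content kept at top-left
    return canvas, scale
```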
S24: in the invention, after all training is finished, the YOLOX-Tiny algorithm predicts on each processed frame to obtain the coordinates and prediction confidence of the rectangular bounding box around each human body; the filtering of these boxes by the confidence threshold is sketched below.
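Before pose estimation, the detections are filtered by the confidence threshold set in step two. A minimal NumPy sketch, where the 0.5 default is illustrative and not a value from the patent:

```python
import numpy as np

def filter_detections(boxes, scores, threshold=0.5):
    """Keep only person boxes whose confidence exceeds the threshold.
    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidences."""
    keep = scores > threshold
    return boxes[keep], scores[keep]
```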
Step three: applying the Lite-HRNet algorithm to the output of step two. Human pose estimation is performed within each rectangular bounding box whose confidence exceeds the set threshold, yielding the pixel coordinates and prediction confidences of the body's key joints;
S31: in the invention, the Lite-HRNet network model must be pre-trained to obtain a human skeleton feature vector prediction model. The training data set is selected as follows:
the data sets used are the COCO data set and the MPII Human Pose data set (both contain large numbers of human images in various poses);
from the two data sets, the human key point annotations are extracted, converted uniformly to the COCO person_keypoints format, and merged into one data set;
the data set obtained by uniformly processing and fusing the two data sets is divided into training, validation and test sets at a ratio of 8:1:1.
S32: in the invention, after pre-training is completed, the Lite-HRNet algorithm is trained specifically on an unsafe behavior data set to improve the accuracy of skeleton vector recognition for unsafe behaviors in operation scenes; this unsafe behavior data set is the one used in S22;
S33: in the invention, FIG. 4 is a schematic flow chart of the Lite-HRNet network model. As shown in FIG. 4, the model takes as input the frame pictures processed in step two, in the form of a picture stream, together with the bounding box coordinates and prediction confidences obtained by YOLOX-Tiny, and predicts through the Lite-HRNet network model trained in advance;
S34: FIG. 3 is a schematic diagram of a human skeleton feature vector, drawn according to the sample annotation format of the 17 human key point coordinates provided by the COCO data set. The key point positions are: 0 nose, 1 left eye, 2 right eye, 3 left ear, 4 right ear, 5 left shoulder, 6 right shoulder, 7 left elbow, 8 right elbow, 9 left wrist, 10 right wrist, 11 left hip, 12 right hip, 13 left knee, 14 right knee, 15 left ankle, 16 right ankle; the listing and an assumed edge list are sketched below.
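For reference, the 17 key points of S34 and a limb edge list in Python. The edge list follows the standard COCO skeleton convention and is an assumption about FIG. 3, which is not reproduced here.

```python
# The 17 COCO key points enumerated in S34, in index order.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Limb connections used to chain joints into a skeleton feature vector
# (standard COCO skeleton; assumed, not read off FIG. 3).
COCO_SKELETON = [
    (0, 1), (0, 2), (1, 3), (2, 4),            # head
    (5, 7), (7, 9), (6, 8), (8, 10), (5, 6),   # arms and shoulders
    (5, 11), (6, 12), (11, 12),                # torso
    (11, 13), (13, 15), (12, 14), (14, 16),    # legs
]
```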
Step four: performing behavior recognition on the human joint maps from step three with the PoseC3D algorithm, and judging whether the person exhibits unsafe behavior;
s41: in the present invention, based on the human skeletal feature vectors measured by the LiteHRNet described above, a 2D pose graph is generated, and T two-dimensional keypoint heat maps of a shape K × H × W, where K is the number of joints, and H and W are the height and width of the frame picture, respectively, are stacked to generate a 3D heat map of a shape K × T × H × W. Since the human skeletal feature vector is already given, padding with zeros is done elsewhere to match the size of the frame picture. By synthesizing K Gaussian maps centering on each joint, a heat map H can be obtained under the condition of only obtaining human skeleton feature vectors, and the specific calculation method is as follows:
H_kij = exp(-((i - x_k)^2 + (j - y_k)^2) / (2σ^2)) · c_k
where σ controls the variance of the Gaussian, and (x_k, y_k) and c_k are the position and confidence, respectively, of the k-th joint. Meanwhile, a heat map of the limbs of the human pose can likewise be created from Gaussian maps, computed as follows:
v_ij = exp(-d^2 / (2σ^2)) · min(c_i, c_j)
where v _ ij is the value of a point in the gaussian map, i.e. the limb connecting the ith and jth joints, and d is the minimum distance of that point to the limb linking the ith and jth joints. Since there may be a case of multiple persons in the frame picture, the kth gaussian map of all persons will be accumulated without enlarging the heat map. Finally, by stacking all heat maps along the time dimension, one 3D heat map volume is obtained.
S42: in the invention, the obtained 3D heat maps are stacked and input into the PoseC3D model for recognition. A 64-frame time window is taken as the collection length; the window is divided into N equal segments by uniform sampling, one frame is selected from each segment, and the N collected frames are used for behavior recognition, with pose prediction performed by a pre-trained classifier (see the sampling sketch below);
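A minimal sketch of the uniform sampling in S42. The patent does not fix N; N = 8 here is an illustrative choice.

```python
import numpy as np

def uniform_sample(window_len=64, num_segments=8, rng=None):
    """Split a window of frames into N equal segments and draw one
    frame index from each segment."""
    rng = rng or np.random.default_rng()
    seg_len = window_len // num_segments
    starts = np.arange(num_segments) * seg_len
    offsets = rng.integers(0, seg_len, size=num_segments)
    return starts + offsets  # N frame indices into the window
```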
s43: in the present invention, to further reduce redundant information of the 3D heat map, a technique of topic center cropping will be employed to improve efficiency. And the 2D postures of all the frames are enclosed once through the minimum rectangular enclosure frame information of the human body obtained by the YOLOX-Tiny algorithm model, the target size is readjusted, and the size of the 3D heat map body is reduced in the spatial dimension.
S44: in the invention, the PoseC3D network model must be pre-trained to recognize human actions. The training data set is selected as follows:
the data set adopts the NTU RGB+D action recognition data set, available as NTU-60 and NTU-120, which contain 57k videos of 60 human actions and 114k videos of 120 human actions, respectively;
pose extraction is performed on NTU RGB+D with the Lite-HRNet network model above, and the human skeleton feature vectors, together with the pose-extracted videos, are written into the data set as video annotation information;
the data set obtained by uniformly processing and fusing the two data sets is divided into training, validation and test sets at a ratio of 8:1:1.
S45: in the invention, after pre-training is completed, the PoseC3D algorithm is trained specifically on an unsafe behavior data set to improve the recognition accuracy for unsafe human behaviors in operation scenes, where the unsafe behavior data set is the one used in S22;
S46: in the invention, the PoseC3D network model is further trained to identify operators' unsafe behaviors. The training data set is selected as follows:
unsafe behavior videos are collected at typical production work sites, annotated in the NTU RGB+D format, and assembled into an unsafe behavior data set;
taking the weights trained on NTU RGB+D as pre-trained weights, pose extraction is performed on the unsafe behavior data set with the Lite-HRNet network model, and transfer learning training is carried out with the human skeleton feature vectors and pose-extracted videos as input (a sketch of this weight transfer follows);
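A minimal PyTorch sketch of the weight transfer in S46, assuming the checkpoint is a plain state dict and the classification head is a linear layer named cls_head; both are assumptions about the model code, not details from the patent.

```python
import torch

def load_pretrained_for_finetune(model, ckpt_path, num_classes):
    """Initialize PoseC3D from NTU RGB+D weights, then reset the
    classification head for the unsafe-behavior classes."""
    state = torch.load(ckpt_path, map_location="cpu")
    # Drop the old head so its shape cannot clash with the new classes.
    state = {k: v for k, v in state.items() if not k.startswith("cls_head")}
    model.load_state_dict(state, strict=False)
    model.cls_head = torch.nn.Linear(model.cls_head.in_features, num_classes)
    return model
```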
S47: in the invention, the processed 3D heat map volume is taken as input, the trained PoseC3D network model identifies the human actions, and finally the cls_head judges whether the operator in the video image exhibits unsafe behavior and, if so, its category.
The invention is not limited to the details of the foregoing description and may be embodied in various forms without departing from its spirit or essential characteristics.

Claims (4)

1. A 3D convolutional neural network unsafe behavior detection system based on human skeleton features, characterized by comprising the following steps:
step one: acquiring a real-time video image through video acquisition equipment;
step two: detecting the people in the video image with the YOLOX-Tiny algorithm to obtain the total number of people appearing in the image and the coordinates and confidence of each person's minimum rectangular bounding box, and setting a confidence threshold;
step three: applying the Lite-HRNet algorithm to the output of step two, performing human pose estimation within each minimum rectangular bounding box whose confidence exceeds the set threshold to obtain the coordinates of the body's key skeletal joints, and chaining these joints according to the position of each limb to form a human skeleton feature vector;
step four: performing behavior recognition on the skeleton feature vectors from step three with the PoseC3D algorithm, and judging whether the person exhibits unsafe behavior.
2. The 3D convolutional neural network unsafe behavior detection system based on human skeleton features of claim 1, wherein the second step comprises:
(1) collecting sample videos that typically contain unsafe behaviors as a positive sample set and videos of human bodies in various other postures as a negative sample set, merging the sample sets into one data set in COCO format, and dividing the uniformly processed and fused data set into training, validation and test sets at a ratio of 8:1:1;
(2) constructing the YOLOX-Tiny algorithm model with CSPDarknet as the backbone, YOLOXPAFPN as the neck and YOLOXHead as the bbox_head;
(3) training YOLOX-Tiny on this data set to establish a highly robust human target detection model;
(4) resizing each input video frame to a resolution of 416 × 416 pixels, retaining the original position information and scale of the picture, and padding the blank regions;
(5) predicting on the input frame with the obtained target detection model to obtain the coordinates and prediction confidence of the rectangular bounding box around each human body.
3. The system for detecting unsafe behavior of a 3D convolutional neural network based on human skeletal features as claimed in claim 1, wherein the third step comprises the following steps:
(1) constructing the Lite-HRNet network model with Lite-HRNet-18 as the backbone and neck and TopDownSimpleHead as the keypoint_head;
(2) training the Lite-HRNet network model on the data set to establish a human skeleton feature vector prediction model;
(3) feeding the frame pictures as a picture stream, together with the obtained bounding box coordinates and prediction confidences, into the trained Lite-HRNet model for prediction;
(4) in the input picture stream, predicting the human skeleton feature vector only within the region of each frame covered by the minimum rectangular bounding box obtained in step two.
4. The system for detecting unsafe behavior of a 3D convolutional neural network based on human skeletal features as claimed in claim 1, wherein said step four comprises the steps of:
(1) constructing the PoseC3D algorithm model with ResNet3dSlowOnly as the backbone and neck and I3DHead as the cls_head;
(2) training the PoseC3D network model on the data set to establish a behavior recognition prediction model based on human skeleton feature vectors;
(3) generating a 2D pose map from the established human skeleton feature vectors, and stacking T two-dimensional key point heat maps of shape K × H × W into a 3D heat map of shape K × T × H × W;
(4) stacking the obtained 3D heat maps into the PoseC3D model for recognition, taking a 64-frame time window as the collection length, dividing the window into N equal segments by uniform sampling, selecting one frame from each segment, collecting the N frames for behavior recognition, and judging and classifying unsafe behavior with a pre-trained classifier.
CN202111361515.0A 2021-11-17 2021-11-17 3D convolution neural network unsafe behavior detection system based on human skeleton characteristics Pending CN114120444A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111361515.0A CN114120444A (en) 2021-11-17 2021-11-17 3D convolution neural network unsafe behavior detection system based on human skeleton characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111361515.0A CN114120444A (en) 2021-11-17 2021-11-17 3D convolution neural network unsafe behavior detection system based on human skeleton characteristics

Publications (1)

Publication Number Publication Date
CN114120444A 2022-03-01

Family

ID=80396037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111361515.0A Pending CN114120444A (en) 2021-11-17 2021-11-17 3D convolution neural network unsafe behavior detection system based on human skeleton characteristics

Country Status (1)

Country Link
CN (1) CN114120444A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677624A (en) * 2022-03-18 2022-06-28 南京农业大学 Sow parturition intelligent monitoring system based on cloud edge synergy
CN114677624B (en) * 2022-03-18 2023-09-15 南京农业大学 Sow delivery intelligent monitoring system based on Yun Bian cooperation


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination