CN116798117A - Video understanding-based method for identifying abnormal actions under mine - Google Patents

Video understanding-based method for identifying abnormal actions under mine

Info

Publication number
CN116798117A
Authority
CN
China
Prior art keywords
network
feature
size
video
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310387213.3A
Other languages
Chinese (zh)
Inventor
贾兆红
夏浩源
段章领
仰劲涛
彭志
王坤
周行云
慈正航
江一航
金怡蒙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Turing Zhichi Intelligent Technology Co ltd
Anhui University
Original Assignee
Suzhou Turing Zhichi Intelligent Technology Co ltd
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Turing Zhichi Intelligent Technology Co ltd, Anhui University filed Critical Suzhou Turing Zhichi Intelligent Technology Co ltd
Priority to CN202310387213.3A priority Critical patent/CN116798117A/en
Publication of CN116798117A publication Critical patent/CN116798117A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The method for identifying abnormal actions under a mine based on video understanding comprises the following steps: acquiring, through a camera, video data containing the real-time actions of miners underground; preprocessing the video data by cutting and frame extraction, and identifying and marking the persons in the picture frames; binding IDs to the marked person targets and tracking them across preceding and following frames; sending the target tracking result into a 3D convolutional neural network to extract video-frame features; inputting the samples into a SlowFast network to obtain action recognition results; and finding abnormal behaviour and giving a warning according to the specific actions of the tracked targets. The method addresses the low level of intelligence in judging the abnormal actions of miners under a mine.

Description

Video understanding-based method for identifying abnormal actions under mine
Technical Field
The invention relates to a video detection method, in particular to a method for identifying abnormal actions under a mine based on video understanding.
Background
Safe production in mines is the basis for guaranteeing the economic benefit of mining enterprises, and is also the main content and primary link of production and operation. Over the past 10 years, China's ore output has successively risen, fallen, and recovered around the supply-side reform; meanwhile, the mine safety level has continuously improved, and the death rate per million tons has been in gradual decline. This benefits from the contribution of coal-mine mechanization and intelligentization to reduced-manning and unmanned operation, and from the high importance attached to safe production at the policy level. Mineral resources are an important material basis for economic and social development, and their development and utilization are a necessary requirement of modernization. Mines are generally characterized by complex environments, large numbers of miners, and massive machinery; if miners' behaviour cannot be effectively monitored, safety accidents are likely to occur during their work, endangering personnel and equipment. Investigation of underground accidents in recent years shows that most are caused by abnormal actions of operators arising from non-standard behaviour. In domestic industry, the behaviour of underground operators is still monitored by the traditional manual method: monitoring personnel watch the collected surveillance video of underground conditions. However, this manual approach has a series of problems. First, monitoring personnel who watch underground video for long periods easily become fatigued; as time passes they find it hard to stay focused and their reactions slow, so when an underground worker performs an abnormal action, such as crossing the track, they cannot respond in time; manual monitoring therefore carries a considerable safety hazard. Second, the underground terrain is complex and the areas are numerous; monitoring personnel cannot effectively watch videos of multiple areas at the same time, and some areas are easily missed. Moreover, because workers are numerous and their behaviour is complex, the number of people and the range of actions can change greatly within a short time; since the manual capacity for monitoring multiple videos is limited, its working efficiency is low compared with intelligent monitoring. Third, underground images have blurred details and uneven exposure; in places with weak light and heavy dust, the discrimination ability of monitoring personnel is greatly weakened. At the same time, because underground roadways are narrow and obstacles such as ore piles are numerous, blind spots easily form; it is difficult to accurately monitor the details of miners' actions through human observation alone, and wrong judgments may be made, so the monitoring effect is poor.
In summary, conventional identification of abnormal actions in mines relies heavily on manual processing, and suffers from technical problems such as the inability to maintain efficient monitoring, easily missed areas, and weak image discrimination.
Disclosure of Invention
To address the problems in the prior art, an underground abnormal-action recognition system based on video understanding is provided, whose purpose is to prevent abnormal actions by underground workers. It comprises the following steps: acquiring underground video data through a camera; preprocessing the video data by cutting and frame extraction, and identifying and marking the persons in the picture frames; binding IDs to the marked person targets and tracking them across preceding and following frames; sending the video result into a preset 3D-Resnet network to obtain weights; inputting the samples into a SlowFast network to obtain action recognition results; and finding abnormal behaviour and giving a warning according to the specific actions of the tracked targets. The method addresses the low level of intelligence in judging the abnormal actions of miners under a mine.
The invention solves the technical problems by adopting the following technical scheme:
1. the method for identifying the abnormal actions under the mine based on video understanding is used for intelligently identifying the abnormal actions of workers in a mine scene and is characterized by comprising the following steps of:
a. in the sample preparation stage, acquiring video of the underground working environment, cutting and frame-extracting the video, preprocessing the frames in which miners appear into annotated images, dividing the processed annotated images into training samples and test samples at a ratio of 7:3, and removing abnormal annotated-image data to obtain a training data set;
b. training a yolov5s network with the processed training samples, and performing person recognition on videos in which miners appear;
c. binding person IDs to the yolov5s results using the deepsort algorithm for target tracking;
d. detecting with the trained SlowFast network to obtain person-action recognition results, and recognizing possible abnormal actions;
2. the method for identifying abnormal actions under a mine based on video understanding as claimed in claim 1, wherein said step a of obtaining the working environment under the mine in the sample preparation stage comprises the steps of:
(1) Installing a camera on a mine car or in a downhole working area to acquire video stream data of miners;
(2) Extracting key frames in the video according to a certain time interval and storing the key frames as image data;
3. the method for recognizing abnormal actions under a mine based on video understanding as claimed in claim 1, wherein said step a of preprocessing the sample preparation stage comprises the steps of:
(1) Marking the image data by marking software to obtain and store marked data sets;
(2) Integrating and reducing the extracted video frames;
(3) Dividing the marked data set into a training sample and a test sample according to a ratio of 7:3;
4. the method for recognizing abnormal actions under a mine based on video understanding as claimed in claim 1, wherein said step a of removing abnormal data in the sample preparation stage comprises the steps of:
(1) Eliminating data in which no miner target appears;
(2) Eliminating data in which person targets appear but the person information is incomplete;
5. the method for recognizing abnormal actions under a mine based on video understanding as claimed in claim 1, wherein said step b performs person recognition on videos in which miners appear, comprising:
(1) The input end preprocesses the input picture; the whole process comprises Mosaic data enhancement, adaptive anchor-box calculation, and adaptive picture scaling.
(2) The backbone network mainly comprises a Focus layer, convolution blocks (CBL), a cross-stage partial network (Cross Stage Partial Network, CSPNet), and a spatial pyramid pooling (Spatial Pyramid Pooling, SPP) module. The input image is cut and stacked by a slicing operation: the length and width of the image are reduced to half of the original and the number of channels becomes 4 times the original, which reduces the computation of the model without causing information loss. The specific flow is as follows: first, the slicing operation divides the original 640×640×3 input image into 4 slices, each of size 320×320×3. Next, the 4 slices are concatenated along the depth (channel) dimension, and a convolution layer composed of 32 convolution kernels outputs a feature map of size 320×320×32. CBL is formed by a Conv convolution layer, a BatchNorm layer, and a LeakyReLU activation function: the input first passes through the convolution layer (Conv), which extracts input features and finds specific local image features; it is then normalized by the BatchNorm layer, which keeps each gradient distribution near the origin so that the deviation between batches is not too large; finally, the LeakyReLU activation function passes the output to the next convolution layer.
LeakyReLU addresses the zero-gradient problem for negative values by giving negative inputs a very small linear component a·x, where a typically takes a value around 0.01. CSP: there are two CSP structures in YOLOv5s: CSP1_X performs feature extraction in the Backbone network, and CSP2_X is used in the Neck structure for prediction. The CSP1_X module of the backbone network consists of branch 1 and branch 2: branch 1 consists of a convolution layer, a batch-normalization function, and an activation function; branch 2 consists of a convolution layer, a batch-normalization function, an activation function, and X residual units. The CSP2_X module of the Neck network consists of branch 3 and branch 4, each consisting of convolution layers, batch normalization, and activation functions. The two branches halve the number of channels, and their outputs are then spliced by Concat so that the total number of channels remains unchanged. The CSP1_X module alleviates the problem of repeated gradient information during network optimization in the Backbone structure, reduces the parameter count of the YOLOv5s network model, guarantees detection speed and accuracy, and reduces the model size.
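For illustration, the Focus slicing and the CBL block described above can be sketched as follows. This is a minimal PyTorch-style sketch, not the patent's reference implementation; the class name, default channel counts, and the 0.01 LeakyReLU slope are assumptions taken from the text.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice a 640x640x3 image into 4 offset slices (each 320x320x3),
    stack them along the channel dimension (12 channels), then apply a
    CBL block with 32 kernels to output a 320x320x32 feature map."""

    def __init__(self, c_in=3, c_out=32, k=3):
        super().__init__()
        # CBL: Conv -> BatchNorm -> LeakyReLU (slope ~0.01, per the text)
        self.cbl = nn.Sequential(
            nn.Conv2d(c_in * 4, c_out, k, stride=1, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_out),
            nn.LeakyReLU(0.01, inplace=True),
        )

    def forward(self, x):
        # x: (N, 3, 640, 640) -> 4 slices concatenated: (N, 12, 320, 320)
        return self.cbl(torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2],
             x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1))
```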
SPP: conv convolution extraction feature output is firstly carried out, then maximum pooling of four scales of 1 multiplied by 1, 5 multiplied by 5, 9 multiplied by 9 and 13 multiplied by 13 is adopted, and then Concat is used for splicing to realize multi-scale feature fusion, so that the problem of non-uniform size of an input image can be solved.
The Backbone network produces feature maps at three different scales, 80×80×128, 40×40×256, and 20×20×512, which are sent to the Neck. The 80×80×128 feature map contains mostly low-level features, enhancing the model's small-target detection; the 20×20×512 feature map contains mostly high-level features, enhancing large-target detection; in the 40×40×256 feature map, low-level and high-level feature information are present in comparable proportions, for medium-target detection.
(3) The Neck network uses a feature pyramid network (Feature Pyramid Networks, FPN) to deliver deep semantic features to the shallow layers, while a path aggregation network (Path Aggregation Network, PAN) delivers shallow position information to the deeper layers, improving localization capability. The FPN+PAN structure obtains both rich semantic features and strong localization features, enhancing the feature-fusion effect. The specific flow is as follows: the FPN first performs a convolution operation (kernel size 1×1, stride 2) on the feature map processed by the SPP, obtaining a 20×20 feature map after convolutional feature extraction; after 2× upsampling, it is fused with the feature map of the same size extracted by the backbone network, giving a 40×40 feature map. The convolution operation (kernel size 1×1, stride 2) is then continued on the resulting 40×40 feature map; after 2× upsampling, it is fused with the feature map of the same size extracted by the backbone network, giving an 80×80 feature map. The PAN then applies downsampling convolutions (kernel size 3×3, stride 2) and fuses the extracted maps with the FPN feature maps of the same sizes, finally yielding output feature maps of sizes 80×80, 40×40, and 20×20.
(4) The prediction end outputs feature maps at three scales through 8×, 16×, and 32× downsampling, and outputs the prediction-box information with the highest confidence through non-maximum suppression (Non Maximum Suppression, NMS), thereby obtaining the detection result. In object detection, a large number of candidate boxes are generated at the same object position and may overlap with each other, so non-maximum suppression is needed to find the optimal target bounding box and eliminate redundant boxes. The specific flow is as follows: first, sort all prediction boxes by confidence in descending order; then select the prediction box with the highest confidence and confirm it as a correct prediction, and calculate the IOU of the other prediction boxes against it; remove (i.e. delete) the boxes whose IOU indicates high overlap; return the remaining prediction boxes to step 1 until none remain, finally obtaining the optimal target bounding boxes.
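The NMS flow just described might be sketched as follows; a minimal NumPy version assuming boxes in [x1, y1, x2, y2] format and an illustrative IOU threshold:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    # boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,) confidences
    order = scores.argsort()[::-1]          # sort by confidence, descending
    keep = []
    while order.size > 0:
        i = order[0]                        # highest-confidence box is kept
        keep.append(i)
        # IOU of the kept box with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter)
        order = order[1:][iou <= iou_thresh]  # drop highly overlapping boxes
    return keep
```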
(5) The loss functions of YOLOv5s include the confidence loss (Objectness Loss), the classification loss (Classification Loss), and the bounding-box regression loss (Bounding Box Regression Loss).
The total loss formula is defined as:
Loss = a1·L_obj + a2·L_cla + a3·L_bbox
wherein a1, a2 and a3 are weight coefficients.
Objectness Loss and Classification Loss are calculated from the binary cross-entropy loss function (BCE Loss):

L = −[x_a·log p(x_a) + (1 − x_a)·log(1 − p(x_a))]

where x_a is a binary label value of 0 or 1, and p(x_a) is the predicted probability of the label x_a.
The bounding-box regression loss is calculated from the CIoU function (Complete Intersection over Union):

L_bbox = 1 − IOU + d²/y² + β·M

where IOU denotes the intersection-over-union of the two overlapping rectangular boxes; x and x_gt denote the center points of the two overlapping rectangular boxes; d denotes the Euclidean distance between the two center points; y denotes the diagonal distance of the closure region of the two overlapping rectangular boxes; M measures the consistency of the relative proportions (aspect ratios) of the two rectangular boxes; and β is a weight coefficient. This loss function considers the overlap area, the center-point distance, and the aspect ratio of the two boxes, so the predicted box conforms more closely to the real box, achieving faster convergence and higher precision.
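Under the reconstruction above, the CIoU loss might be computed as in the following sketch, assuming PyTorch and [x1, y1, x2, y2] boxes; the aspect-ratio term M and its weight β follow the standard CIoU definition:

```python
import math
import torch

def ciou_loss(box1, box2, eps=1e-7):
    """CIoU loss for box pairs in [x1, y1, x2, y2] format."""
    # Intersection and union -> IOU
    iw = (torch.min(box1[..., 2], box2[..., 2]) -
          torch.max(box1[..., 0], box2[..., 0])).clamp(min=0)
    ih = (torch.min(box1[..., 3], box2[..., 3]) -
          torch.max(box1[..., 1], box2[..., 1])).clamp(min=0)
    inter = iw * ih
    w1, h1 = box1[..., 2] - box1[..., 0], box1[..., 3] - box1[..., 1]
    w2, h2 = box2[..., 2] - box2[..., 0], box2[..., 3] - box2[..., 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)
    # d^2: squared Euclidean distance between center points
    d2 = ((box1[..., 0] + box1[..., 2] - box2[..., 0] - box2[..., 2]) ** 2 +
          (box1[..., 1] + box1[..., 3] - box2[..., 1] - box2[..., 3]) ** 2) / 4
    # y^2: squared diagonal of the smallest enclosing (closure) box
    cw = torch.max(box1[..., 2], box2[..., 2]) - torch.min(box1[..., 0], box2[..., 0])
    ch = torch.max(box1[..., 3], box2[..., 3]) - torch.min(box1[..., 1], box2[..., 1])
    y2 = cw ** 2 + ch ** 2 + eps
    # M: aspect-ratio consistency term; beta: its weight
    M = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) -
                              torch.atan(w1 / (h1 + eps))) ** 2
    with torch.no_grad():
        beta = M / (1 - iou + M + eps)
    return 1 - iou + d2 / y2 + beta * M
```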
Evaluation indexes: to accurately evaluate network model performance, Precision (P), Recall (R), mean average precision (mean Average Precision, mAP), and frames per second (Frames Per Second, FPS) are used as evaluation indexes, with the following formulas:

P = t / (t + f),  R = t / (t + n),  FPS = N / α

where t is the number of correctly detected targets, f is the number of falsely detected targets, and n is the number of missed targets; AP denotes the area under the Precision-Recall curve, and mAP is obtained by averaging the AP of each class; N denotes the number of samples to be tested, and α denotes the time required to test all samples.
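A minimal sketch of these evaluation formulas (function and variable names are assumptions):

```python
def precision(t, f):
    # P = t / (t + f): correct detections over all detections
    return t / (t + f)

def recall(t, n_miss):
    # R = t / (t + n): correct detections over all ground-truth targets
    return t / (t + n_miss)

def mean_ap(ap_per_class):
    # mAP: average of the per-class areas under the Precision-Recall curve
    return sum(ap_per_class) / len(ap_per_class)

def fps(num_samples, total_time_s):
    # FPS = N / alpha: samples processed per second
    return num_samples / total_time_s
```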
6. The method for identifying abnormal actions under a mine based on video understanding as claimed in claim 1, wherein said step c uses the deepsort algorithm to perform target tracking on the detection results of the yolov5s network, comprising the following steps:
(1) Obtaining detection boxes according to the detection steps above, using a CNN network to extract the target's motion-feature predicted-box coordinates and the person-target coordinates, using Kalman filtering for trajectory prediction, using Deep Association Metric to extract appearance features, and adopting a cascade-matching and IOU-matching mechanism;
(2) Setting a network training strategy, comprising: the training batch size, the initial learning rate, the weight decay rate, the optimization method, and the loss function;
(3) Sending the training data into the network model to obtain a new feature-extraction network. The difference between the appearance features extracted by the network and the true results is calculated with a loss function in which, for a feature vector a_i, the softmax output y(a_i) is taken as the prediction result and w(a_i) is the true result.
(4) Obtaining the appearance feature of the current-frame detection box (denoted b_m) through the network trained in step (3), and calculating the minimum distance between the appearance feature of the current-frame detection box and all stored appearance features by the following formula:

d(m, n) = min{ 1 − b_m·b_k : b_k ∈ B }

where B is the set of appearance features of all frames.
(5) The appearance-feature information and the motion-feature information are weighted and fused by the following formula:

t_{m,n} = α·d_1(m, n) + (1 − α)·d(m, n)

where d_1(m, n) is the Mahalanobis distance, d(m, n) is the cosine distance, and α is a weight coefficient.
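Steps (4) and (5) might be sketched as follows, assuming NumPy, L2-normalised appearance features, and an illustrative value of α:

```python
import numpy as np

def appearance_distance(b_m, gallery):
    # d(m, n) = min over stored features b_k of (1 - b_m . b_k)
    # b_m: L2-normalised feature of the current detection, shape (D,)
    # gallery: (K, D) L2-normalised features previously stored for track n
    return float(np.min(1.0 - gallery @ b_m))

def fused_cost(d_maha, d_app, alpha=0.5):
    # t_{m,n} = alpha * d1(m, n) + (1 - alpha) * d(m, n)
    # alpha = 0.5 is an illustrative value, not taken from the patent
    return alpha * d_maha + (1 - alpha) * d_app
```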
7. The method for identifying abnormal actions under a mine based on video understanding as claimed in claim 1, wherein said step d constructs a network model and performs abnormal-action recognition with the SlowFast network on the Deepsort tracking results, comprising the following steps:
(1) Features in the video frames are extracted using a 3D convolutional neural network (CNN). A resnet3d network is set up, with the backbone divided into a slow pathway and a fast pathway. The slow pathway uses a resnet3d network whose first layer uses convolution kernels of size 1×7×7, operating at a low frame rate to capture spatial semantics and obtain environment information; the fast pathway uses a resnet3d network with a base channel count of 8, whose first layer uses convolution kernels of size 5×7×7, running at a high frame rate to capture motion at fine temporal resolution. The head classification network uses a SlowFast head whose feature-concatenation channel count is 2048+256.
(2) Feature fusion: the network feeds data from the Fast pathway into the Slow pathway through lateral connections; that is, information extracted by the Fast pathway is fused into the Slow pathway. The size of a convolution kernel is denoted {T×S², C}, where T, S, and C denote the temporal extent, the spatial size, and the number of channels, respectively. The frame-rate ratio is α = 8 and the channel ratio is β = 1/8. A single data sample of the Fast pathway is {αT, S², βC}, and a single data sample of the Slow pathway is {T, S², αβC}. The invention uses a three-dimensional convolution with 2βC output channels, a 5×1² kernel, and a stride of α to perform the data transformation.
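A sketch of this time-strided lateral connection, assuming PyTorch; the base channel count and the input shape are illustrative, while the 5×1² kernel, the stride of α, and the 2βC output channels follow the text:

```python
import torch
import torch.nn as nn

alpha = 8                 # frame-rate ratio between Fast and Slow pathways
C = 64                    # Slow-pathway channels at this stage (illustrative)
beta_C = C // 8           # Fast-pathway channels, beta = 1/8

# Time-strided 3D conv: {alpha*T, S^2, beta*C} -> {T, S^2, 2*beta*C}
lateral = nn.Conv3d(beta_C, 2 * beta_C, kernel_size=(5, 1, 1),
                    stride=(alpha, 1, 1), padding=(2, 0, 0), bias=False)

T, S = 4, 56                                        # illustrative sizes
fast = torch.randn(1, beta_C, alpha * T, S, S)      # (N, C, T, H, W)
slow = torch.randn(1, C, T, S, S)
fused = torch.cat([slow, lateral(fast)], dim=1)     # channel-wise fusion
```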
(3) Global average pooling is performed at the end of each pathway, after which the results of the fast and slow pathways are combined and fed into a fully connected classification layer, which uses softmax to identify the action being performed by the miner in the image.
As described above, the method for identifying abnormal actions under a mine based on video understanding provided by the invention acquires underground video data through a camera; preprocesses the video data by cutting and frame extraction, and identifies and marks the persons in the picture frames; binds IDs to the marked person targets and tracks them across preceding and following frames; sends the video result into a preset resnet3d network to obtain weights; inputs the samples into a SlowFast network to obtain action recognition results; and finds abnormal behaviour and gives a warning according to the specific actions of the tracked targets. Complicated features do not need to be extracted manually, so the detection efficiency is high. The invention overcomes the high detection error rate caused by heavy manual observation and operation in traditional underground safety detection, improves the accuracy of the system's recognition of underground abnormal actions, and enhances detection capability under severe conditions.
In conclusion, the invention provides a video-understanding-based system for identifying abnormal actions under a mine, which solves the problems of the low intelligence level and low recognition accuracy of judging miners' abnormal actions under a mine.
Drawings
Fig. 1 is a schematic diagram showing steps of a method for identifying abnormal actions under a mine based on video understanding.
Fig. 2 shows a schematic diagram of the yolov5s and deepsort network architecture of the present invention.
Fig. 3 is a specific flowchart of step S1 in fig. 1 in an embodiment.
Fig. 4 is a flowchart showing the step S2 in fig. 1 in an embodiment.
Fig. 5 is a flowchart showing the step S3 in fig. 1 in an embodiment.
Fig. 6 is a flowchart showing the step S4 in fig. 1 in an embodiment.
Fig. 7 is a flowchart showing the step S5 in fig. 1 in an embodiment.
Fig. 8 is a flowchart showing the step S6 in fig. 1 in an embodiment.
Description of step reference numerals
S1-S6 method steps
S11-S12 method steps
S21-S23 method steps
S31 to S32 method steps
S41-S43 method steps
S51-S55 method steps
S61-S64 method steps
Detailed Description
Referring to fig. 1 and fig. 2, which show a schematic step diagram of the video-understanding-based underground abnormal-action recognition system and a schematic diagram of the yolov5s and deepsort architectures, the invention aims to provide an underground abnormal-action recognition system based on video understanding that solves the problems of the low intelligence level, low efficiency, and high false-detection rate of conventional judgment of miners' abnormal actions underground. Traditional image-detection methods also suffer from low accuracy. The video-understanding-based abnormal-action recognition method comprises the following steps:
S1, in the sample preparation stage, acquiring video of the underground working environment, and cutting and frame-extracting the video by taking frames at a certain time interval and converting them into pictures;
S2, preprocessing the video into annotated images, performing person recognition on the videos in which miners appear, and dividing the processed annotated images into training samples and test samples;
s3, data cleaning, namely eliminating abnormal marked image data;
s4, training a yolov5S network by using the processed training samples;
s5, performing target tracking on a result of the yolov5S by using a deepsort algorithm;
S6, detecting the test samples with the trained SlowFast network to obtain person-action recognition results and recognize possible abnormal actions.
Referring to fig. 3, a specific flowchart of step S1 in fig. 1 in one embodiment is shown, as shown in fig. 3, including:
S11, setting camera parameters. Since heavy dust on the industrial site strongly interferes with the images collected by the cameras, the camera parameters are set as follows: the camera adopts a higher resolution to capture more image features; the camera frame rate is set so that, when the unmanned vehicle travels faster, a higher frame rate yields clearer captured images; and parameters such as saturation and contrast are adjusted according to the light characteristics of the industrial site, so as to optimally capture the actions of miners under the mine.
S12, acquiring images of miners from the video frames: a fixed time interval is set, key frames are extracted at the specified interval, and the key frames are converted into images. The video images of miners are the data source for the training samples and test samples.
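A minimal sketch of this key-frame extraction, assuming OpenCV and an illustrative one-second interval; the paths and function name are hypothetical:

```python
import cv2

def extract_frames(video_path, out_dir, interval_s=1.0):
    # Save one key frame every `interval_s` seconds as a JPEG image.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25   # fall back if FPS is unreported
    step = max(int(round(fps * interval_s)), 1)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```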
Referring to fig. 4, a specific flowchart of step S2 in fig. 1 in one embodiment is shown, and as shown in fig. 4, step S2 includes:
S21, primary screening of the images: removing unqualified images that are excessively blurred, heavily occluded, overexposed, or underexposed, and unifying the resolution of the processed images to 1280×720.
S22, labeling the qualified images; optional labeling tools include via and similar tools. Polygonal labeling is adopted, with the labeling box fitted to the human body as closely as possible; if human bodies overlap, the unoccluded part is labeled. The labeled data is stored in xml format under the same name as the original image.
S23, splitting the marked data set into a training set and a testing set according to a certain proportion.
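The split might look like the following sketch, assuming the 7:3 ratio of claim 1 and a fixed random seed (the seed is an assumption added for repeatability):

```python
import random

def split_dataset(samples, train_ratio=0.7, seed=42):
    # Shuffle the annotated images, then split 7:3 into train/test sets.
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]
```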
Referring to fig. 5, a specific flowchart of step S3 in fig. 1 in one embodiment is shown, where, as shown in fig. 5, step S3 includes:
S31, since the detection method detects complete human-body targets, images in which only part of the human trunk appears can be deleted.
S32, eliminating data with obvious errors: according to the position coordinates of each vertex of a labeling box, labeling boxes whose vertex coordinates are reversed are eliminated.
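The vertex check of S32 might look like the following sketch; the function name is hypothetical, and the 1280×720 image size is taken from S21:

```python
def box_is_valid(x1, y1, x2, y2, width=1280, height=720):
    # Reject labeling boxes whose vertex coordinates are reversed
    # (x2 <= x1 or y2 <= y1) or that fall outside the image.
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height
```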
Referring to fig. 6, a specific flowchart of step S4 in fig. 1 in one embodiment is shown, and as shown in fig. 6, step S4 includes:
S41, the input end preprocesses the input picture; the whole process comprises Mosaic data enhancement, adaptive anchor-box calculation, and adaptive picture scaling. The input image is cropped and stacked through a slicing operation.
S42, the input part first passes through a convolution layer (Conv) to extract input features; the features are normalized, and the LeakyReLU activation function passes the output to the next convolution layer.
S43, performing Conv convolution feature extraction and Concat splicing to realize multi-scale feature fusion; the prediction end outputs feature maps at three scales through 8×, 16×, and 32× downsampling, and outputs the prediction-box information with the highest confidence through non-maximum suppression (Non Maximum Suppression, NMS), thereby obtaining the detection result.
Referring to fig. 7, a specific flowchart of step S5 in fig. 1 in one embodiment is shown, and as shown in fig. 7, step S5 includes:
s51, obtaining a detection frame according to the detection steps, extracting a motion feature prediction frame coordinate value of a target and a figure target coordinate value by adopting a CNN network, performing track prediction by using Kalman filtering, extracting appearance features by adopting Deep Association Metric, and adopting a cascade matching and IOU matching mechanism.
S52, setting a network training strategy, comprising: the training batch size, the initial learning rate, the weight decay rate, the optimization method, and the loss function.
S53, sending the training data into the network model to obtain a new feature-extraction network, and calculating the difference between the appearance features extracted by the network and the true results.
S54, obtaining the appearance characteristics of the current frame detection frame through the new characteristic extraction network obtained in the previous step, and calculating the minimum distance between the appearance characteristics of the current frame detection frame and all the appearance characteristics.
S55, weighting and fusing the appearance characteristic and the motion characteristic information.
Referring to fig. 8, a specific flowchart of step S6 in fig. 1 in one embodiment is shown, where, as shown in fig. 8, step S6 includes:
s61, extracting features from the video frames using a 3D Convolutional Neural Network (CNN).
S62, setting up a resnet3d network, with the backbone divided into a slow pathway and a fast pathway. The slow pathway uses a resnet3d network whose first layer uses convolution kernels of size 1×7×7, operating at a low frame rate to capture spatial semantics and obtain environment information; the fast pathway uses a resnet3d network with a base channel count of 8, whose first layer uses convolution kernels of size 5×7×7, running at a high frame rate to capture motion at fine temporal resolution. The head classification network uses a SlowFast head whose feature-concatenation channel count is 2048+256.
S63, feature fusion: the network feeds data from the Fast pathway into the Slow pathway through lateral connections; that is, information extracted by the Fast pathway is fused into the Slow pathway. The size of a convolution kernel is denoted {T×S², C}, where T, S, and C denote the temporal extent, the spatial size, and the number of channels, respectively. The frame-rate ratio is α = 8 and the channel ratio is β = 1/8. A single data sample of the Fast pathway is {αT, S², βC}, and a single data sample of the Slow pathway is {T, S², αβC}. The invention uses a three-dimensional convolution with 2βC output channels, a 5×1² kernel, and a stride of α to perform the data transformation.
S64, global average pooling is performed at the end of each pathway, after which the results of the fast and slow pathways are combined and fed into a fully connected classification layer, which uses softmax to identify the action being performed by the miner in the image.
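A sketch of this classification head, assuming PyTorch; the 2048+256 concatenated channels follow the text, while the class count is left as a parameter:

```python
import torch
import torch.nn as nn

class SlowFastHead(nn.Module):
    # Global average pool each pathway, concatenate (2048 + 256 channels),
    # then a fully connected layer with softmax over the action classes.
    def __init__(self, num_classes, slow_c=2048, fast_c=256):
        super().__init__()
        self.fc = nn.Linear(slow_c + fast_c, num_classes)

    def forward(self, slow, fast):
        # slow: (N, 2048, T, H, W); fast: (N, 256, alpha*T, H, W)
        s = slow.mean(dim=(2, 3, 4))     # global average pooling
        f = fast.mean(dim=(2, 3, 4))
        return torch.softmax(self.fc(torch.cat([s, f], dim=1)), dim=1)
```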

Claims (7)

1. The method for identifying the abnormal actions under the mine based on video understanding is used for intelligently identifying the abnormal actions of workers in a mine scene and is characterized by comprising the following steps of:
a. in the sample preparation stage, acquiring video of the underground working environment, cutting and frame-extracting the video, preprocessing the frames in which miners appear into annotated images, dividing the processed annotated images into training samples and test samples at a ratio of 7:3, and removing abnormal annotated-image data to obtain a training data set;
b. training a yolov5s network with the processed training samples, and performing person recognition on videos in which miners appear;
c. binding person IDs to the yolov5s results using the deepsort algorithm for target tracking;
d. detecting with the trained SlowFast network to obtain person-action recognition results, and recognizing possible abnormal actions.
2. The method for identifying abnormal actions under a mine based on video understanding as claimed in claim 1, wherein said step a of obtaining the working environment under the mine in the sample preparation stage comprises the steps of:
(1) Installing a camera on a mine car or in a downhole working area to acquire video stream data of miners;
(2) And extracting key frames in the video according to a certain time interval and storing the key frames as image data.
3. The method for recognizing abnormal actions under a mine based on video understanding as claimed in claim 1, wherein said step a of preprocessing the sample preparation stage comprises the steps of:
(1) Marking the image data by marking software to obtain and store marked data sets;
(2) Integrating and reducing the extracted video frames;
(3) Dividing the annotated dataset into training samples and test samples at a ratio of 7:3.
4. The method for recognizing abnormal actions under a mine based on video understanding as claimed in claim 1, wherein said step a of removing abnormal data in the sample preparation stage comprises the steps of:
(1) Eliminating data in which no miner target appears;
(2) Eliminating data in which person targets appear but the person information is incomplete.
5. The method for recognizing abnormal actions under a mine based on video understanding as claimed in claim 1, wherein said step b performs person recognition on videos in which miners appear, comprising:
(1) The input end preprocesses the input picture; the whole process comprises Mosaic data enhancement, adaptive anchor-box calculation, and adaptive picture scaling.
(2) The backbone network mainly comprises a Focus layer, convolution blocks (CBL), a cross-stage partial network (Cross Stage Partial Network, CSPNet), and a spatial pyramid pooling (Spatial Pyramid Pooling, SPP) module. Focus: the input image is cut and stacked by a slicing operation; the length and width of the image are reduced to half of the original and the number of channels becomes 4 times the original, which reduces the computation of the model without causing information loss. The specific flow is as follows: first, the slicing operation divides the original 640×640×3 input image into 4 slices, each of size 320×320×3. Next, the 4 slices are concatenated along the depth (channel) dimension, and a convolution layer composed of 32 convolution kernels outputs a feature map of size 320×320×32. CBL: formed jointly by a Conv convolution layer, a BatchNorm layer, and a LeakyReLU activation function; the input first passes through the convolution layer (Conv), which extracts input features and finds specific local image features; it is then normalized by the BatchNorm layer, which keeps each gradient distribution near the origin so that the deviation between batches is not too large; finally, the LeakyReLU activation function passes the output to the next convolution layer.
LeakyReLU addresses the zero-gradient problem for negative values by giving negative inputs a very small linear component a·x, where a typically takes a value around 0.01. CSP: there are two CSP structures in YOLOv5s: CSP1_X performs feature extraction in the Backbone network, and CSP2_X is used in the Neck structure for prediction. The CSP1_X module of the backbone network consists of branch 1 and branch 2: branch 1 consists of a convolution layer, a batch-normalization function, and an activation function; branch 2 consists of a convolution layer, a batch-normalization function, an activation function, and X residual units. The CSP2_X module of the Neck network consists of branch 3 and branch 4, each consisting of convolution layers, batch normalization, and activation functions. The two branches halve the number of channels, and their outputs are then spliced by Concat so that the total number of channels remains unchanged. SPP: Conv convolution first extracts features; max pooling at four scales, 1×1, 5×5, 9×9, and 13×13, is then applied, and the results are spliced by Concat to realize multi-scale feature fusion. The Backbone network produces feature maps at three different scales, 80×80×128, 40×40×256, and 20×20×512, which are sent to the Neck. The 80×80×128 feature map contains mostly low-level features, enhancing the model's small-target detection; the 20×20×512 feature map contains mostly high-level features, enhancing large-target detection; in the 40×40×256 feature map, low-level and high-level feature information are present in comparable proportions, for medium-target detection.
(3) The Neck network uses a feature pyramid network (Feature Pyramid Networks, FPN) to deliver deep semantic features to the shallow layers, while a path aggregation network (Path Aggregation Network, PAN) delivers shallow position information to the deeper layers, improving localization capability. The FPN+PAN structure obtains both rich semantic features and strong localization features, enhancing the feature-fusion effect. The specific flow is as follows: the FPN first performs a convolution operation (kernel size 1×1, stride 2) on the feature map processed by the SPP, obtaining a 20×20 feature map after convolutional feature extraction; after 2× upsampling, it is fused with the feature map of the same size extracted by the backbone network, giving a 40×40 feature map. The convolution operation (kernel size 1×1, stride 2) is then continued on the resulting 40×40 feature map; after 2× upsampling, it is fused with the feature map of the same size extracted by the backbone network, giving an 80×80 feature map. The PAN then applies downsampling convolutions (kernel size 3×3, stride 2) and fuses the extracted maps with the FPN feature maps of the same sizes, finally yielding output feature maps of sizes 80×80, 40×40, and 20×20.
(4) The prediction end outputs feature maps at three scales through 8×, 16×, and 32× downsampling, and outputs the prediction-box information with the highest confidence through non-maximum suppression (Non Maximum Suppression, NMS), thereby obtaining the detection result. In object detection, a large number of candidate boxes are generated at the same object position and may overlap with each other, so the NMS is needed to find the optimal target bounding box and eliminate redundant boxes. The specific flow is as follows: first, sort all prediction boxes by confidence in descending order; then select the prediction box with the highest confidence and confirm it as a correct prediction, and calculate the IOU of the other prediction boxes against it; remove (i.e. delete) the boxes whose IOU indicates high overlap; return the remaining prediction boxes to step 1 until none remain, finally obtaining the optimal target bounding boxes.
(5) The loss functions of YOLOv5s include the confidence loss (Objectness Loss), the classification loss (Classification Loss), and the bounding-box regression loss (Bounding Box Regression Loss).
The total loss formula is defined as:
Loss = a1·L_obj + a2·L_cla + a3·L_bbox
wherein a1, a2 and a3 are weight coefficients.
Objectness Loss and Classification Loss are calculated from the binary cross-entropy loss function (BCE Loss):

L = −[x_a·log p(x_a) + (1 − x_a)·log(1 − p(x_a))]

where x_a is a binary label value of 0 or 1, and p(x_a) is the predicted probability of the label x_a.
The bounding-box regression loss is calculated from the CIoU function (Complete Intersection over Union):

L_bbox = 1 − IOU + d²/y² + β·M

where IOU denotes the intersection-over-union of the two overlapping rectangular boxes; x and x_gt denote the center points of the two overlapping rectangular boxes; d denotes the Euclidean distance between the two center points; y denotes the diagonal distance of the closure region of the two overlapping rectangular boxes; M measures the consistency of the relative proportions (aspect ratios) of the two rectangular boxes; and β is a weight coefficient. This loss function considers the overlap area, the center-point distance, and the aspect ratio of the two boxes, so the predicted box conforms more closely to the real box, achieving faster convergence and higher precision.
Evaluation indexes: to accurately evaluate network model performance, Precision (P), Recall (R), mean average precision (mean Average Precision, mAP), and frames per second (Frames Per Second, FPS) are used as evaluation indexes, with the following formulas:

P = t / (t + f),  R = t / (t + n),  FPS = N / α

where t is the number of correctly detected targets, f is the number of falsely detected targets, and n is the number of missed targets; AP denotes the area under the Precision-Recall curve, and mAP is obtained by averaging the AP of each class; N denotes the number of samples to be tested, and α denotes the time required to test all samples.
6. The method for identifying abnormal actions under a mine based on video understanding as claimed in claim 1, wherein said step c uses the deepsort algorithm to bind person IDs to the yolov5s results for target tracking, comprising the following steps:
(1) Obtaining detection boxes according to the detection steps above, using a CNN network to extract the target's motion-feature predicted-box coordinates and the person-target coordinates, using Kalman filtering for trajectory prediction, using Deep Association Metric to extract appearance features, and adopting a cascade-matching and IOU-matching mechanism;
(2) Setting a network training strategy, comprising: the training batch size, the initial learning rate, the weight decay rate, the optimization method, and the loss function;
(3) Sending the training data into the network model to obtain a new feature-extraction network, and calculating the difference between the appearance features extracted by the network and the true results using the following formulas:

y = F(x, {W_i}) + x    (1)

F = W_2 · σ(W_1 · x)    (2)
7. The method for identifying abnormal actions under a mine based on video understanding as claimed in claim 1, wherein said step d constructs a network model and performs abnormal-action recognition with the SlowFast network on the Deepsort tracking results, comprising the following steps:
(1) Spatio-temporal features are extracted from the video data using a 3D convolutional neural network (CNN), and a resnet3d network is set up with the backbone divided into a slow pathway and a fast pathway. The slow pathway uses a resnet3d network whose first layer uses convolution kernels of size 1×7×7, operating at a low frame rate to capture spatial semantics and obtain environment information; the fast pathway uses a resnet3d network with a base channel count of 8, whose first layer uses convolution kernels of size 5×7×7, running at a high frame rate to capture motion at fine temporal resolution. The head classification network uses a SlowFast head whose feature-concatenation channel count is 2048+256.
(2) Feature fusion: the network feeds data from the Fast pathway into the Slow pathway through lateral connections; that is, information extracted by the Fast pathway is fused into the Slow pathway. The size of a convolution kernel is denoted {T×S², C}, where T, S, and C denote the temporal extent, the spatial size, and the number of channels, respectively. The frame-rate ratio is α = 8 and the channel ratio is β = 1/8. A single data sample of the Fast pathway is {αT, S², βC}, and a single data sample of the Slow pathway is {T, S², αβC}. The invention uses a three-dimensional convolution with 2βC output channels, a 5×1² kernel, and a stride of α to perform the data transformation.
(3) Global average pooling is performed at the end of each pathway, after which the results of the fast and slow pathways are combined and fed into a fully connected classification layer, which uses softmax to identify the action being performed by the miner in the image, ultimately fusing the outputs of the two pathways to produce the final output.
CN202310387213.3A 2023-04-07 2023-04-07 Video understanding-based method for identifying abnormal actions under mine Pending CN116798117A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310387213.3A CN116798117A (en) 2023-04-07 2023-04-07 Video understanding-based method for identifying abnormal actions under mine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310387213.3A CN116798117A (en) 2023-04-07 2023-04-07 Video understanding-based method for identifying abnormal actions under mine

Publications (1)

Publication Number Publication Date
CN116798117A true CN116798117A (en) 2023-09-22

Family

ID=88039291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310387213.3A Pending CN116798117A (en) 2023-04-07 2023-04-07 Video understanding-based method for identifying abnormal actions under mine

Country Status (1)

Country Link
CN (1) CN116798117A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination