CN116030412A - Escalator monitoring video anomaly detection method and system - Google Patents
- Publication number: CN116030412A
- Application number: CN202211703754.4A
- Authority
- CN
- China
- Prior art keywords
- frame
- key point
- model
- yolov5
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02B—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO BUILDINGS, e.g. HOUSING, HOUSE APPLIANCES OR RELATED END-USER APPLICATIONS
- Y02B50/00—Energy efficient technologies in elevators, escalators and moving walkways, e.g. energy saving or recuperation technologies
Landscapes
- Image Analysis (AREA)
Abstract
The invention discloses an escalator monitoring video anomaly detection method and system, relating to the field of escalator detection. A trained YOLOv5 target detection model and an HR-Net key point extraction model are combined to obtain a key point heat map for each frame picture in a data set, and a key point inter-frame change map is derived from each key point heat map. A convolutional neural network is trained on image-label pairs, each containing a key point inter-frame change map and its corresponding label, to obtain a convolutional neural network model. When detection of an escalator monitoring video begins, its picture set is input frame by frame into the YOLOv5 target detection model to obtain the pedestrian target frame positions in each frame picture; the HR-Net key point extraction model predicts the key point heat map corresponding to each pedestrian target frame position; and the resulting key point inter-frame change maps are input into the convolutional neural network model, which predicts the corresponding labels. The behavior state on the escalator is obtained from these labels, realizing intelligent anomaly detection for escalator monitoring video.
Description
Technical Field
The invention relates to the field of escalator detection, in particular to an escalator monitoring video anomaly detection method and system.
Background
With the advance of urban construction and the improvement of living standards, urban infrastructure is continuously being perfected, and people pay ever more attention to the safety of these facilities. The escalator is widely used in public places such as subway stations and shopping malls. However, safety accidents related to escalators are also increasing, due to equipment problems, improper user behavior, and the like. According to news reports, behaviors such as moving against the direction of travel, running, falling, or carrying baby carriages and large pieces of luggage on an escalator easily cause safety accidents and passenger injuries. Monitoring and alarming on abnormal escalator behavior in a timely, accurate and efficient manner helps respond quickly to accidents, avoid casualties, and raise the level of emergency handling.
To raise the level of abnormality detection, researchers have carried out a number of studies. Shao Haibo proposed an escalator safety monitoring system (authorized bulletin number CN205257749U) comprising a first camera group for capturing an overall image of the escalator and the passengers on it, a second camera group for capturing the mechanical parts of the escalator, and a data processing device for image analysis and processing. Liu Zhuo et al. proposed a safety monitoring device for an escalator (authorized bulletin number CN204310668U) in which the monitoring device is relatively separated from the control system, so that it can be conveniently and flexibly applied to different control systems and facilitates modular design of the control system. However, these inventions lack automated algorithm design (intelligent detection design) and remain focused on information acquisition; they do not involve the joint application of a YOLOv5 model, an HR-Net model and a convolutional neural network model to escalator monitoring video anomaly detection. The present invention realizes intelligent detection of escalator monitoring video anomalies through the joint application of these three models, while greatly improving detection accuracy.
Disclosure of Invention
In order to realize intelligent detection of escalator monitoring video anomalies and improve detection accuracy, the invention provides an escalator monitoring video anomaly detection method combining a YOLOv5 model, an HR-Net model and a convolutional neural network model, which comprises the following steps:
acquiring a monitoring video of an escalator and converting it into a picture set with continuous time points; judging frame by frame whether the current frame is a normal frame, marking it as a positive sample if so and as a negative sample if not, thereby obtaining a model data set containing positive samples and negative samples; a positive sample is specifically a picture marked with a normal label, and a negative sample is specifically a picture marked with an abnormal label;
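The sample-splitting step above can be sketched as follows. The function name and the label convention (+1 for normal, −1 for abnormal, matching the Y_k ∈ {+1, −1} labels used later in the description) are illustrative assumptions, not part of the patent.

```python
def build_dataset(frames, frame_labels):
    """Split manually reviewed frames into positive (normal) and negative
    (abnormal) samples, forming the model data set."""
    positives = [(f, +1) for f, lab in zip(frames, frame_labels) if lab == "normal"]
    negatives = [(f, -1) for f, lab in zip(frames, frame_labels) if lab != "normal"]
    return positives, negatives
```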
respectively training a YOLOv5 model and an HR-Net model through a large-scale data set to obtain a corresponding YOLOv5 target detection model and an HR-Net key point extraction model;
inputting pictures in the model data set into a YOLOv5 target detection model frame by frame to obtain pedestrian target frame positions corresponding to the pictures of each frame; predicting a keypoint heat map corresponding to each pedestrian target frame position through an HR-Net keypoint extraction model;
for each frame of the key point heat map, obtaining the corresponding forward difference map and backward difference map, and obtaining the key point inter-frame change map corresponding to the current frame's key point heat map by applying an OR (union) operation to the forward difference map and the backward difference map;
acquiring the image-label pair corresponding to each key point inter-frame change map, and training a convolutional neural network through the image-label pairs to obtain a convolutional neural network model; each image-label pair comprises a key point inter-frame change map and the label corresponding to that change map;
inputting the picture set of the escalator monitoring video to be detected frame by frame into the YOLOv5 target detection model to obtain the pedestrian target frame positions in each frame picture; predicting the key point heat map corresponding to each pedestrian target frame position through the HR-Net key point extraction model; obtaining the key point inter-frame change map corresponding to each frame's key point heat map; and inputting the key point inter-frame change maps into the convolutional neural network model to predict the corresponding labels.
Further, predicting the key point heat map corresponding to each pedestrian target frame position through the HR-Net key point extraction model specifically comprises the following steps: sequentially inputting the pedestrian target frame positions into the HR-Net key point extraction model to obtain a high-resolution feature map containing human key points and the confidences of the rectangular bounding boxes of those key points; performing human pose estimation on the rectangular bounding boxes in the high-resolution feature map whose confidence is higher than a set threshold, obtaining the pixel coordinates of the human key points and their prediction confidences, and thereby obtaining the key point heat map corresponding to each pedestrian target frame position.
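The confidence-threshold filtering in this step can be sketched as below. Treating each prediction as an (x, y, confidence) triple and the 0.5 default threshold are illustrative assumptions; the patent only specifies that boxes above a set threshold are kept.

```python
def select_keypoints(keypoints, threshold=0.5):
    """Keep only key point predictions whose confidence exceeds the set
    threshold, as done before human pose estimation."""
    return [(x, y, c) for (x, y, c) in keypoints if c > threshold]
```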
Further, the large-scale dataset is the MS COCO dataset; training the YOLOv5 model through a large-scale data set specifically comprises the following steps:
training the YOLOv5 model through the MS COCO data set; during training, the regression branch of the YOLOv5 model is trained through the CIoU loss function, and the target and class branches of the YOLOv5 model are trained through the BCE loss function.
wherein:
L_CIoU = 1 − IoU + ρ²(b, b^gt)/c² + αv, with IoU = Intersection(A, B)/Union(A, B) and v = (4/π²)·(arctan(w^gt/h^gt) − arctan(w/h))²;
wherein Intersection(A, B) represents the intersection area of the predicted frame A and the target frame B of the YOLOv5 model, and Union(A, B) represents the union area of the predicted frame A and the target frame B; b and b^gt respectively represent the center points of the predicted frame and the real frame; ρ²(b, b^gt) is the squared Euclidean distance between the center points of the predicted frame and the real frame; c is the diagonal length of the smallest enclosing region that can simultaneously contain the predicted frame and the real frame; α is the loss weight, α = v/((1 − IoU) + v); L_CIoU represents the loss value; w^gt is the width of the real frame, h^gt is the height of the real frame, w is the width of the predicted frame, and h is the height of the predicted frame;
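A minimal numeric sketch of this regression loss, assuming the standard CIoU formulation consistent with the quantities defined above (IoU overlap, squared center distance, enclosure diagonal, aspect-ratio term). Representing boxes as (x1, y1, x2, y2) corner coordinates is an assumption of this sketch, not the patent's specification.

```python
import math

def ciou_loss(pred, target, eps=1e-9):
    """CIoU regression loss: 1 - IoU + rho^2/c^2 + alpha*v."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    # IoU = Intersection(A, B) / Union(A, B)
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)
    # squared distance between the two box centers
    rho2 = ((px1 + px2 - tx1 - tx2) ** 2 + (py1 + py2 - ty1 - ty2) ** 2) / 4
    # diagonal of the smallest enclosure containing both boxes
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps
    # aspect-ratio consistency term v and its weight alpha
    v = (4 / math.pi ** 2) * (math.atan((tx2 - tx1) / (ty2 - ty1))
                              - math.atan((px2 - px1) / (py2 - py1))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```

For identical boxes every term vanishes, so the loss is (numerically) zero; for disjoint boxes the center-distance term keeps pushing the prediction toward the target even though IoU is zero.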
the formula expression of the BCE loss function is as follows:
BCELoss = −[ŷ·log(p) + (1 − ŷ)·log(1 − p)]
wherein p is the probability that the YOLOv5 model predicts the sample to be a positive sample; ŷ is the sample label, taking the value 1 when the sample belongs to the positive samples and 0 otherwise; BCELoss is the loss value.
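The BCE formula above can be checked with a direct translation into code (the function name is illustrative):

```python
import math

def bce_loss(p, y_hat):
    """Binary cross-entropy: p is the predicted positive-class probability,
    y_hat is 1 for a positive sample and 0 otherwise."""
    return -(y_hat * math.log(p) + (1 - y_hat) * math.log(1 - p))
```

As expected, confident correct predictions are penalized less than confident wrong ones.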
Further, when training the YOLOv5 model by a large-scale dataset, it further comprises:
for the picture data in the large-scale data set, randomly flipping the pictures currently input to the YOLOv5 model or the HR-Net model, and applying Mosaic data augmentation, which specifically comprises: splicing any four pictures to obtain a new picture, and adding it to training so as to expand the data set.
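The random flip and four-picture splice can be sketched as below, assuming equal-sized inputs arranged in a 2×2 grid; production Mosaic implementations also rescale and re-clip bounding boxes, which is omitted here.

```python
import numpy as np

def random_hflip(img, rng):
    # randomly mirror the picture left-right (the random flip step)
    return img[:, ::-1] if rng.random() < 0.5 else img

def mosaic4(imgs):
    """Splice any four equal-sized pictures into one new 2x2 picture,
    as in Mosaic data augmentation."""
    a, b, c, d = imgs
    top = np.concatenate([a, b], axis=1)
    bottom = np.concatenate([c, d], axis=1)
    return np.concatenate([top, bottom], axis=0)
```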
Further, the formula for obtaining the forward difference map is: BDI_k = |H_{k−1} − H_k|;
the formula for obtaining the backward difference map is: FDI_k = |H_k − H_{k+1}|;
wherein k ranges from 1 to n−2 (frames being indexed from 0 to n−1); n represents the total number of frames of key point heat maps, H_k represents the k-th frame key point heat map, H_{k−1} represents the (k−1)-th frame key point heat map, and H_{k+1} represents the (k+1)-th frame key point heat map; BDI_k represents the forward difference map corresponding to the k-th frame key point heat map, and FDI_k represents the backward difference map corresponding to the k-th frame key point heat map;
the formula for obtaining the key point inter-frame change map is: CDI_k = BDI_k ∪ FDI_k, wherein CDI_k represents the key point inter-frame change map.
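The difference-map computation can be sketched as follows. Interpreting the union of the two difference maps as a pixel-wise maximum (equivalent to logical OR on binary maps) is an assumption of this sketch; the patent only writes CDI_k = BDI_k ∪ FDI_k.

```python
import numpy as np

def keypoint_change_maps(heatmaps):
    """Compute CDI_k for a stack of per-frame key point heat maps of shape
    (n, H, W); the first frame has no forward difference and the last frame
    has no backward difference, so k runs from 1 to n-2."""
    n = heatmaps.shape[0]
    cdis = []
    for k in range(1, n - 1):
        bdi = np.abs(heatmaps[k - 1] - heatmaps[k])   # forward difference map
        fdi = np.abs(heatmaps[k] - heatmaps[k + 1])   # backward difference map
        cdis.append(np.maximum(bdi, fdi))             # CDI_k = BDI_k U FDI_k
    return np.stack(cdis)
```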
The invention also provides an escalator monitoring video anomaly detection system, which comprises:
the data set acquisition module, used for acquiring the monitoring video of the escalator and converting it into a picture set with continuous time points, judging frame by frame whether the current frame is a normal frame, marking it as a positive sample if so and as a negative sample if not, thereby obtaining a model data set containing positive samples and negative samples; a positive sample is specifically a picture marked with a normal label, and a negative sample is specifically a picture marked with an abnormal label;
the first training module is used for respectively training the YOLOv5 model and the HR-Net model through a large-scale data set to obtain a corresponding YOLOv5 target detection model and an HR-Net key point extraction model;
the key point heat map acquisition module is used for inputting pictures in the model data set into the YOLOv5 target detection model frame by frame to obtain pedestrian target frame positions corresponding to the pictures of each frame; predicting a keypoint heat map corresponding to each pedestrian target frame position through an HR-Net keypoint extraction model;
the inter-frame change map acquisition module, used for acquiring the forward difference map and the backward difference map corresponding to each frame's key point heat map, and obtaining the key point inter-frame change map corresponding to the current frame's key point heat map by applying an OR (union) operation to the forward difference map and the backward difference map;
the second training module, used for acquiring the image-label pair corresponding to each key point inter-frame change map, and training a convolutional neural network through the image-label pairs to obtain a convolutional neural network model; each image-label pair comprises a key point inter-frame change map and the label corresponding to that change map;
the detection module, used for inputting the picture set of the escalator monitoring video to be detected frame by frame into the YOLOv5 target detection model to obtain the pedestrian target frame positions in each frame picture, predicting the key point heat map corresponding to each pedestrian target frame position through the HR-Net key point extraction model, obtaining the key point inter-frame change map corresponding to each frame's key point heat map, and inputting the key point inter-frame change maps into the convolutional neural network model to predict the corresponding labels.
Further, predicting the key point heat map corresponding to each pedestrian target frame position through the HR-Net key point extraction model specifically comprises the following steps: sequentially inputting the pedestrian target frame positions into the HR-Net key point extraction model to obtain a high-resolution feature map containing human key points and the confidences of the rectangular bounding boxes of those key points; performing human pose estimation on the rectangular bounding boxes in the high-resolution feature map whose confidence is higher than a set threshold, obtaining the pixel coordinates of the human key points and their prediction confidences, and thereby obtaining the key point heat map corresponding to each pedestrian target frame position.
Further, the large-scale dataset is the MS COCO dataset; training the YOLOv5 model through a large-scale data set specifically comprises the following steps: training the YOLOv5 model through the MS COCO data set; during training, the regression branch of the YOLOv5 model is trained through the CIoU loss function, and the target and class branches of the YOLOv5 model are trained through the BCE loss function.
Further, when training the YOLOv5 model by a large-scale dataset, it further comprises:
for the picture data in the large-scale data set, randomly flipping the pictures currently input to the YOLOv5 model or the HR-Net model, and applying Mosaic data augmentation, which specifically comprises: splicing any four pictures to obtain a new picture, and adding it to training so as to expand the data set.
Compared with the prior art, the invention at least has the following beneficial effects:
(1) According to the method, a trained YOLOv5 target detection model and an HR-Net key point extraction model are combined to obtain a key point heat map for each frame picture in the data set, and a key point inter-frame change map is derived from each key point heat map. A convolutional neural network is trained on image-label pairs, each comprising a key point inter-frame change map and its corresponding label, to obtain a convolutional neural network model. When detection starts, the picture set of the escalator monitoring video to be detected is input frame by frame into the YOLOv5 target detection model to obtain the pedestrian target frame positions in each frame picture; the key point heat map corresponding to each pedestrian target frame position is predicted through the HR-Net key point extraction model; and the key point inter-frame change maps corresponding to the key point heat maps are input into the convolutional neural network model, which predicts the corresponding labels. The behavior state on the escalator (abnormal or normal) is obtained from these labels, realizing intelligent detection of escalator monitoring video anomalies while greatly improving detection accuracy;
(2) Specifically, pedestrians in each picture are located through the YOLOv5 target detection model, human body key points are extracted through the HR-Net key point extraction model, and the two are combined to obtain the key point heat map; the inter-frame relationship is then used to obtain the key point inter-frame change map, which captures abnormal behavior and reflects inter-frame changes in human posture, greatly improving the accuracy of anomaly detection;
(3) According to the invention, the key point inter-frame change map corresponding to the current frame's key point heat map is obtained by applying an OR (union) operation to the forward difference map and the backward difference map, so that inter-frame changes in human motion are obtained quickly and conveniently; this helps capture abnormal motion and speeds up anomaly detection;
(4) When detection starts, once the key point heat maps are obtained, detection only requires a small convolutional neural network model, greatly improving detection efficiency.
Drawings
FIG. 1 is a flow chart of a method for detecting anomaly of monitoring video of an escalator;
FIG. 2 is a block diagram of an escalator surveillance video anomaly detection system;
FIG. 3 is a network structure diagram of the HR-Net model.
Detailed Description
The following are specific embodiments of the present invention, and the technical solutions of the present invention are further described with reference to the accompanying drawings; however, the present invention is not limited to these embodiments.
Example 1
In order to realize intelligent detection of the monitoring video of the escalator, as shown in fig. 1, the invention provides a method for detecting abnormality of the monitoring video of the escalator, which comprises the following steps:
acquiring a monitoring video of an escalator through a camera or other sensing equipment and converting it into a picture set with continuous time points; manually judging frame by frame whether the current frame is a normal frame, marking it as a positive sample if so and as a negative sample if not, thereby obtaining a model data set containing positive samples and negative samples; a positive sample is specifically a picture marked with a normal label, and a negative sample is specifically a picture marked with an abnormal label;
It should be noted that, in general, rare anomalies such as moving against the direction of travel, falls, baby carriages, and large pieces of luggage are manually marked as negative samples.
Respectively training a YOLOv5 model (specifically YOLOv5s, with three detection heads) and an HR-Net model through a large-scale data set to obtain the corresponding YOLOv5 target detection model and HR-Net key point extraction model;
the large-scale data set is the MS COCO data set; the MS COCO data set is split to obtain a training set and a verification set. Training the YOLOv5 model through the large-scale data set (before training, the images in the data set must be resized to a unified, designated size) specifically comprises the following steps:
training the YOLOv5 model on the training set; during training, the regression branch of the YOLOv5 model is trained through the CIoU loss function, and the target and class branches of the YOLOv5 model are trained through the BCE loss function; the trained model is then verified on the verification set, and the model that performs best on the verification set across all rounds is taken as the YOLOv5 target detection model.
wherein:
L_CIoU = 1 − IoU + ρ²(b, b^gt)/c² + αv, with IoU = Intersection(A, B)/Union(A, B) and v = (4/π²)·(arctan(w^gt/h^gt) − arctan(w/h))²;
wherein Intersection(A, B) represents the intersection area of the predicted frame A and the target frame B of the YOLOv5 model, and Union(A, B) represents the union area of the predicted frame A and the target frame B; b and b^gt respectively represent the center points of the predicted frame and the real frame; ρ²(b, b^gt) is the squared Euclidean distance between the center points of the predicted frame and the real frame; c is the diagonal length of the smallest enclosing region that can simultaneously contain the predicted frame and the real frame; α is the loss weight, α = v/((1 − IoU) + v); L_CIoU represents the loss value; w^gt is the width of the real frame, h^gt is the height of the real frame, w is the width of the predicted frame, and h is the height of the predicted frame;
the formula expression of the BCE loss function is as follows:
BCELoss = −[ŷ·log(p) + (1 − ŷ)·log(1 − p)]
wherein p is the probability that the YOLOv5 model predicts the sample to be a positive sample; ŷ is the sample label, taking the value 1 when the sample belongs to the positive samples and 0 otherwise; BCELoss is the loss value.
It should be noted that the loss function used for training the HR-Net model is:
MSELoss = (1/m)·Σ_{i=1}^{m} (ŷ_i − y_i)²
wherein ŷ_i is the true pixel coordinate value of the sample, y_i is the predicted pixel coordinate value, m is the total number of pixels of the sample, and MSELoss is the loss value.
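The mean-squared-error loss for the HR-Net key point targets translates directly to code (the function name is illustrative):

```python
def hrnet_mse_loss(y_true, y_pred):
    """Mean squared error over the m pixel coordinate values of a sample,
    as used for training the HR-Net model."""
    m = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / m
```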
When training the YOLOv5 model by a large-scale dataset, it further comprises:
for the picture data in the large-scale data set, randomly flipping the pictures currently input to the YOLOv5 model or the HR-Net model, and applying Mosaic data augmentation, which specifically comprises: splicing any four pictures to obtain a new picture, and adding it to training so as to expand the data set.
Inputting pictures in the model data set into a YOLOv5 target detection model frame by frame to obtain pedestrian target frame positions corresponding to the pictures of each frame; predicting a keypoint heat map corresponding to each pedestrian target frame position through an HR-Net keypoint extraction model;
the method for predicting the keypoint heat map corresponding to the target frame position of each pedestrian through the HR-Net keypoint extraction model specifically comprises the following steps: and sequentially inputting the positions of the target frames of the pedestrians into an HR-Net key point extraction model to obtain a high-resolution feature map comprising human key points and human key point rectangular bounding boxes confidence degrees through the HR-Net key point extraction model, and estimating the human body gestures of the human key point rectangular bounding boxes with the confidence degrees higher than a set threshold in the high-resolution feature map to obtain pixel coordinates of the human key points and the prediction confidence degrees thereof, so as to obtain a key point heat map corresponding to the positions of the target frames of the pedestrians.
It should be explained in detail that, as shown in fig. 3, the HR-Net key point extraction model starts from a high-resolution subnetwork, gradually adds subnetworks from high to low resolution one by one, and connects the multi-resolution subnetworks in parallel. Throughout the process, information is repeatedly exchanged across the parallel multi-resolution subnetworks, completing a repeated multi-scale fusion process and yielding a high-resolution feature map; this avoids loss of high-resolution information and makes the predicted key point heat map more accurate.
For each frame of the key point heat map, the corresponding forward difference map and backward difference map are obtained, and the key point inter-frame change map corresponding to the current frame's key point heat map is obtained by applying an OR (union) operation to the forward difference map and the backward difference map;
by obtaining the key point inter-frame change map in this way, the invention captures inter-frame changes in human motion very quickly and conveniently, which helps capture abnormal motion and speeds up anomaly detection.
The formula for obtaining the forward difference map is: BDI_k = |H_{k−1} − H_k|;
the formula for obtaining the backward difference map is: FDI_k = |H_k − H_{k+1}|;
wherein n represents the total number of frames of key point heat maps, H_k represents the k-th frame key point heat map, H_{k−1} represents the (k−1)-th frame key point heat map, and H_{k+1} represents the (k+1)-th frame key point heat map; BDI_k represents the forward difference map corresponding to the k-th frame key point heat map, and FDI_k represents the backward difference map corresponding to the k-th frame key point heat map. Because the first frame's key point heat map has no forward difference map and the last frame's has no backward difference map, k ranges from 1 to n−2 (frames being indexed from 0 to n−1);
the formula for obtaining the key point inter-frame change map is: CDI_k = BDI_k ∪ FDI_k, wherein CDI_k represents the key point inter-frame change map corresponding to the k-th frame key point heat map.
Acquiring the image-label pairs (CDI_k, Y_k), Y_k ∈ {+1, −1}, corresponding to the key point inter-frame change maps, and training a convolutional neural network (in this embodiment, ResNet-10) through the image-label pairs to obtain a convolutional neural network model; each image-label pair comprises a key point inter-frame change map and the label corresponding to that change map, where Y_k represents the label corresponding to the key point inter-frame change map.
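As an illustrative stand-in for the ResNet-10 classifier (which would normally be trained with a deep learning framework), the sketch below fits a plain logistic-regression classifier on flattened change maps under a binary cross-entropy objective. All names, the choice of classifier, and the mapping of +1 to "normal" are assumptions of this sketch, not the patent's implementation.

```python
import numpy as np

def train_change_map_classifier(cdis, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression stand-in for the ResNet-10 classifier on
    flattened key point inter-frame change maps; labels are in {+1, -1}."""
    X = cdis.reshape(len(cdis), -1)
    y = (np.asarray(labels, dtype=float) + 1) / 2      # map {-1,+1} -> {0,1}
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))         # sigmoid
        grad = p - y                                   # gradient of BCE w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict_label(w, b, cdi):
    """Return +1 (normal) or -1 (abnormal) for one change map."""
    p = 1.0 / (1.0 + np.exp(-(cdi.ravel() @ w + b)))
    return 1 if p >= 0.5 else -1
```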
Escalator monitoring video is acquired in real time, and the picture set corresponding to the escalator monitoring video to be detected (i.e., acquired in real time) is input frame by frame into the YOLOv5 target detection model to obtain the pedestrian target frame positions in each frame picture; the key point heat map corresponding to each pedestrian target frame position is predicted through the HR-Net key point extraction model; and the key point inter-frame change maps corresponding to the key point heat maps are obtained and input into the convolutional neural network model, which predicts the corresponding labels, thereby realizing real-time detection of escalator monitoring video anomalies.
After detection starts and the key point heat maps are obtained, the invention only requires a small convolutional neural network model to complete detection, greatly improving detection efficiency.
According to the method, a trained YOLOv5 target detection model and an HR-Net key point extraction model are combined to obtain a key point heat map for each frame picture in the data set, and a key point inter-frame change map is derived from each key point heat map. A convolutional neural network is trained on image-label pairs, each comprising a key point inter-frame change map and its corresponding label, to obtain a convolutional neural network model. When detection starts, the picture set of the escalator monitoring video to be detected is input frame by frame into the YOLOv5 target detection model to obtain the pedestrian target frame positions in each frame picture; the key point heat map corresponding to each pedestrian target frame position is predicted through the HR-Net key point extraction model; and the key point inter-frame change maps corresponding to the key point heat maps are input into the convolutional neural network model, which predicts the corresponding labels. The behavior state on the escalator (abnormal or normal) is obtained from these labels.
Example two
As shown in fig. 2, the invention further provides a system for detecting the abnormality of the monitoring video of the escalator, which comprises:
the data set acquisition module, used for acquiring the monitoring video of the escalator and converting it into a picture set with continuous time points, judging frame by frame whether the current frame is a normal frame, marking it as a positive sample if so and as a negative sample if not, thereby obtaining a model data set containing positive samples and negative samples; a positive sample is specifically a picture marked with a normal label, and a negative sample is specifically a picture marked with an abnormal label;
the first training module is used for respectively training the YOLOv5 model and the HR-Net model through a large-scale data set to obtain a corresponding YOLOv5 target detection model and an HR-Net key point extraction model;
the large-scale data set is the MS COCO data set; training the YOLOv5 model through a large-scale data set specifically comprises the following steps: training the YOLOv5 model through the MS COCO data set; during training, the regression branch of the YOLOv5 model is trained through the CIoU loss function, and the target and class branches of the YOLOv5 model are trained through the BCE loss function.
When training the YOLOv5 model by a large-scale dataset, it further comprises:
for the picture data in the large-scale data set, randomly flipping the pictures currently input to the YOLOv5 model or the HR-Net model, and applying Mosaic data augmentation, which specifically comprises: splicing any four pictures to obtain a new picture, and adding it to training so as to expand the data set.
The key point heat map acquisition module is used for inputting pictures in the model data set into the YOLOv5 target detection model frame by frame to obtain pedestrian target frame positions corresponding to the pictures of each frame; predicting a keypoint heat map corresponding to each pedestrian target frame position through an HR-Net keypoint extraction model;
the method for predicting the keypoint heat map corresponding to the target frame position of each pedestrian through the HR-Net keypoint extraction model specifically comprises the following steps: and sequentially inputting the positions of the target frames of the pedestrians into an HR-Net key point extraction model to obtain a high-resolution feature map comprising human key points and human key point rectangular bounding boxes confidence degrees through the HR-Net key point extraction model, and estimating the human body gestures of the human key point rectangular bounding boxes with the confidence degrees higher than a set threshold in the high-resolution feature map to obtain pixel coordinates of the human key points and the prediction confidence degrees thereof, so as to obtain a key point heat map corresponding to the positions of the target frames of the pedestrians.
The inter-frame change map acquisition module is used for acquiring the forward difference map and the backward difference map corresponding to each frame's key point heat map, and obtaining the key point inter-frame change map corresponding to the current frame's key point heat map by applying an OR (union) operation to the forward difference map and the backward difference map;
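Using the difference-map formulas given later in the claims, the change map for frame k combines the absolute differences against the previous and next heat maps with an elementwise OR; a minimal sketch, assuming the heat maps are 2-D float arrays and a small binarization threshold (the threshold value is an assumption):

```python
import numpy as np

def change_map(h_prev, h_cur, h_next, thresh=0.1):
    """Keypoint inter-frame change map (CDI_k) for the current frame.

    BDI_k = |H_(k-1) - H_k|, FDI_k = |H_k - H_(k+1)|; the change map is
    their elementwise OR after binarizing with a small threshold.
    """
    bdi = np.abs(h_prev - h_cur) > thresh   # difference with previous frame
    fdi = np.abs(h_cur - h_next) > thresh   # difference with next frame
    return np.logical_or(bdi, fdi).astype(np.uint8)
```

Because both neighbours are required, the first and last frames of a sequence have no change map, matching the k range stated in the claims.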
the second training module is used for acquiring image label pairs corresponding to the inter-frame change graphs of the key points, and training the convolutional neural network through the image label pairs to obtain a convolutional neural network model; the image tag pair comprises a key point inter-frame change graph and a tag corresponding to the key point inter-frame change graph;
the detection module is used for inputting the escalator monitoring video picture set to be detected into the YOLOv5 target detection model frame by frame to obtain the pedestrian target frame positions corresponding to each frame picture; predicting the key point heat map corresponding to each pedestrian target frame position through the HR-Net key point extraction model; obtaining the key point inter-frame change map corresponding to each frame's key point heat map; and inputting the key point inter-frame change maps into the convolutional neural network model to predict the corresponding labels.
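The detection pipeline can be summarized in one loop; the three model objects below are hypothetical callables standing in for the trained YOLOv5, HR-Net, and CNN models (the patent fixes no API, so these signatures are assumptions):

```python
import numpy as np

def detect_anomalies(frames, yolo, hrnet, cnn, thresh=0.1):
    """Frame-by-frame pipeline sketch with assumed callable signatures:
      yolo(frame)         -> pedestrian target boxes
      hrnet(frame, boxes) -> keypoint heat map (2-D array) for the frame
      cnn(change_map)     -> predicted label string
    """
    # Stage 1: per-frame keypoint heat maps via detector + pose model
    heatmaps = [hrnet(f, yolo(f)) for f in frames]
    labels = []
    # Stage 2: change maps need both a previous and a next frame
    for k in range(1, len(heatmaps) - 1):
        bdi = np.abs(heatmaps[k - 1] - heatmaps[k]) > thresh
        fdi = np.abs(heatmaps[k] - heatmaps[k + 1]) > thresh
        labels.append(cnn(np.logical_or(bdi, fdi)))
    return labels
```

A label of "abnormal" for a frame then triggers whatever alarm or logging logic the surrounding system defines.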
Specifically, pedestrians in the picture are located through the YOLOv5 target detection model, human key points are extracted through the HR-Net key point extraction model, and the two are combined to obtain the key point heat map; the key point inter-frame change map is then obtained by using the inter-frame relationship, so that abnormal behaviors are captured and the inter-frame change of the human body posture is reflected, which greatly improves the accuracy of anomaly detection.
Example III
The invention also provides an escalator monitoring video anomaly detection device, which comprises a memory and a processor; the memory is used for storing a computer program; the processor is used for implementing the escalator monitoring video anomaly detection method when executing the computer program.
It should be noted that all directional indicators (such as up, down, left, right, front, rear, etc.) in the embodiments of the present invention are merely used to explain the relative positional relationship, movement, etc. between the components in a particular posture (as shown in the drawings), and if the particular posture is changed, the directional indicator is changed accordingly.
Furthermore, descriptions such as those referred to herein as "first," "second," "a," and the like are provided for descriptive purposes only and are not to be construed as indicating or implying a relative importance or an implicit indication of the number of features being indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
In the present invention, unless specifically stated and limited otherwise, the terms "connected," "affixed," and the like are to be construed broadly, and for example, "affixed" may be a fixed connection, a removable connection, or an integral body; can be mechanically or electrically connected; either directly or indirectly, through intermediaries, or both, may be in communication with each other or in interaction with each other, unless expressly defined otherwise. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
In addition, the technical solutions of the embodiments of the present invention may be combined with each other, but it is necessary to be based on the fact that those skilled in the art can implement the technical solutions, and when the technical solutions are contradictory or cannot be implemented, the combination of the technical solutions should be considered as not existing, and not falling within the scope of protection claimed by the present invention.
Claims (10)
1. The method for detecting the abnormality of the monitoring video of the escalator is characterized by comprising the following steps of:
acquiring a monitoring video of an escalator, converting the monitoring video into a picture set with continuous time points, judging whether a current frame is a normal frame or not frame by frame, if so, marking the current frame as a positive sample, and if not, marking the current frame as a negative sample, so as to obtain a model data set containing the positive sample and the negative sample; the positive sample is specifically a picture marked with a normal label, and the negative sample is specifically a picture marked with an abnormal label;
respectively training a YOLOv5 model and an HR-Net model through a large-scale data set to obtain a corresponding YOLOv5 target detection model and an HR-Net key point extraction model;
inputting pictures in the model data set into a YOLOv5 target detection model frame by frame to obtain pedestrian target frame positions corresponding to the pictures of each frame; predicting a keypoint heat map corresponding to each pedestrian target frame position through an HR-Net keypoint extraction model;
for each frame's key point heat map, acquiring the corresponding forward difference map and backward difference map, and obtaining the key point inter-frame change map corresponding to the current frame's key point heat map by applying an OR (union) operation to the forward difference map and the backward difference map;
acquiring image label pairs corresponding to each key point inter-frame change graph, and training a convolutional neural network through the image label pairs to obtain a convolutional neural network model; the image tag pair comprises a key point inter-frame change graph and a tag corresponding to the key point inter-frame change graph;
inputting the escalator monitoring video picture set to be detected into the YOLOv5 target detection model frame by frame to obtain the pedestrian target frame positions corresponding to each frame picture; predicting the key point heat maps corresponding to the pedestrian target frame positions through the HR-Net key point extraction model; obtaining the key point inter-frame change map corresponding to each frame's key point heat map; and inputting the key point inter-frame change maps into the convolutional neural network model to predict the corresponding labels.
2. The escalator monitoring video anomaly detection method according to claim 1, wherein predicting the key point heat map corresponding to each pedestrian target frame position through the HR-Net key point extraction model specifically comprises: sequentially inputting the pedestrian target frame positions into the HR-Net key point extraction model to obtain a high-resolution feature map comprising human key points and the confidence degrees of the human key point rectangular bounding boxes; human body pose estimation is then performed on the rectangular bounding boxes whose confidence degrees are higher than a set threshold in the high-resolution feature map, yielding the pixel coordinates of the human key points and their prediction confidence degrees, so as to obtain the key point heat map corresponding to each pedestrian target frame position.
3. The escalator surveillance video anomaly detection method of claim 1, wherein the large-scale dataset is an MS COCO dataset; the training of the YOLOv5 model by a large-scale data set specifically comprises the following steps: training the YOLOv5 model by the MS COCO data set, wherein during training the regression branch of the YOLOv5 model is trained through the CIoU loss function, and the objectness and class branches of the YOLOv5 model are trained through the BCE loss function.
4. The escalator surveillance video anomaly detection method according to claim 3, wherein the formula expression of the CIoU loss function is:
IoU = Intersection(A, B) / Union(A, B)
L_CIoU = 1 − IoU + ρ²(b, b^gt) / c² + αv, where v = (4/π²)·(arctan(w^gt/h^gt) − arctan(w/h))²
wherein Intersection(A, B) represents the intersection area of the prediction frame A of the YOLOv5 model and the target frame B, and Union(A, B) represents the union area of the prediction frame A and the target frame B; b and b^gt respectively represent the center points of the prediction frame and the real frame; ρ(b, b^gt) is the Euclidean distance between the center points of the prediction frame and the real frame; c is the diagonal distance of the minimum closure area that can simultaneously contain the prediction frame and the real frame; α is the loss weight; L_CIoU represents the loss value; w^gt is the width of the real frame, h^gt is the height of the real frame, w is the width of the prediction frame, and h is the height of the prediction frame;
the formula expression of the BCE loss function is as follows: L_BCE = −[y·log(p) + (1 − y)·log(1 − p)], wherein y is the true label and p is the predicted probability;
5. The escalator surveillance video anomaly detection method of claim 4, further comprising, when training the YOLOv5 model with a large-scale dataset:
for the picture data in the large-scale data set, randomly flipping the pictures currently input to the YOLOv5 model or the HR-Net model, and applying Mosaic data augmentation, which specifically comprises: splicing any four pictures to obtain a new picture, and adding it to training so as to expand the data set.
6. The escalator monitoring video anomaly detection method according to claim 5, wherein
the backward difference map is obtained by the formula: BDI_k = |H_(k−1) − H_k|;
the forward difference map is obtained by the formula: FDI_k = |H_k − H_(k+1)|;
wherein the value range of k is (1, n−2); n represents the total number of frames of key point heat maps, H_k represents the k-th frame key point heat map, H_(k−1) represents the (k−1)-th frame key point heat map, H_(k+1) represents the (k+1)-th frame key point heat map, BDI_k represents the backward difference map corresponding to the k-th frame key point heat map, and FDI_k represents the forward difference map corresponding to the k-th frame key point heat map;
the key point inter-frame change map is obtained by the formula: CDI_k = BDI_k ∪ FDI_k, wherein CDI_k represents the key point inter-frame change map.
7. An escalator surveillance video anomaly detection system, comprising:
the data set acquisition module is used for acquiring the monitoring video of the escalator, converting the monitoring video into a picture set with continuous time points, judging whether the current frame is a normal frame or not frame by frame, if yes, marking the current frame as a positive sample, and if not, marking the current frame as a negative sample, so as to obtain a model data set containing the positive sample and the negative sample; the positive sample is specifically a picture marked with a normal label, and the negative sample is specifically a picture marked with an abnormal label;
the first training module is used for respectively training the YOLOv5 model and the HR-Net model through a large-scale data set to obtain a corresponding YOLOv5 target detection model and an HR-Net key point extraction model;
the key point heat map acquisition module is used for inputting pictures in the model data set into the YOLOv5 target detection model frame by frame to obtain pedestrian target frame positions corresponding to the pictures of each frame; predicting a keypoint heat map corresponding to each pedestrian target frame position through an HR-Net keypoint extraction model;
the inter-frame change map acquisition module is used for acquiring the forward difference map and the backward difference map corresponding to each frame's key point heat map, and obtaining the key point inter-frame change map corresponding to the current frame's key point heat map by applying an OR (union) operation to the forward difference map and the backward difference map;
the second training module is used for acquiring image label pairs corresponding to the inter-frame change graphs of the key points, and training the convolutional neural network through the image label pairs to obtain a convolutional neural network model; the image tag pair comprises a key point inter-frame change graph and a tag corresponding to the key point inter-frame change graph;
the detection module is used for inputting the escalator monitoring video picture set to be detected into the YOLOv5 target detection model frame by frame to obtain the pedestrian target frame positions corresponding to each frame picture; predicting the key point heat map corresponding to each pedestrian target frame position through the HR-Net key point extraction model; obtaining the key point inter-frame change map corresponding to each frame's key point heat map; and inputting the key point inter-frame change maps into the convolutional neural network model to predict the corresponding labels.
8. The escalator surveillance video anomaly detection system according to claim 7, wherein predicting the key point heat map corresponding to each pedestrian target frame position through the HR-Net key point extraction model specifically comprises: sequentially inputting the pedestrian target frame positions into the HR-Net key point extraction model to obtain a high-resolution feature map comprising human key points and the confidence degrees of the human key point rectangular bounding boxes; human body pose estimation is then performed on the rectangular bounding boxes whose confidence degrees are higher than a set threshold in the high-resolution feature map, yielding the pixel coordinates of the human key points and their prediction confidence degrees, so as to obtain the key point heat map corresponding to each pedestrian target frame position.
9. The escalator surveillance video anomaly detection system of claim 8, wherein the large-scale dataset is an MS COCO dataset; the training of the YOLOv5 model by a large-scale data set specifically comprises the following steps: training the YOLOv5 model by the MS COCO data set, wherein during training the regression branch of the YOLOv5 model is trained through the CIoU loss function, and the objectness and class branches of the YOLOv5 model are trained through the BCE loss function.
10. The escalator surveillance video anomaly detection system of claim 9, further comprising, when training the YOLOv5 model with a large-scale dataset:
for the picture data in the large-scale data set, randomly flipping the pictures currently input to the YOLOv5 model or the HR-Net model, and applying Mosaic data augmentation, which specifically comprises: splicing any four pictures to obtain a new picture, and adding it to training so as to expand the data set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211703754.4A CN116030412A (en) | 2022-12-29 | 2022-12-29 | Escalator monitoring video anomaly detection method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211703754.4A CN116030412A (en) | 2022-12-29 | 2022-12-29 | Escalator monitoring video anomaly detection method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116030412A true CN116030412A (en) | 2023-04-28 |
Family
ID=86073410
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211703754.4A Pending CN116030412A (en) | 2022-12-29 | 2022-12-29 | Escalator monitoring video anomaly detection method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116030412A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117292329A (en) * | 2023-11-24 | 2023-12-26 | 烟台大学 | Method, system, medium and equipment for monitoring abnormal work of building robot |
CN117292329B (en) * | 2023-11-24 | 2024-03-08 | 烟台大学 | Method, system, medium and equipment for monitoring abnormal work of building robot |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102153591B1 (en) | Method and apparatus for detecting garbage dumping action in real time on video surveillance system | |
CN107292240B (en) | Person finding method and system based on face and body recognition | |
WO2019179024A1 (en) | Method for intelligent monitoring of airport runway, application server and computer storage medium | |
US11288887B2 (en) | Object tracking method and apparatus | |
CN111860352B (en) | Multi-lens vehicle track full tracking system and method | |
CN111738240A (en) | Region monitoring method, device, equipment and storage medium | |
CN104966304A (en) | Kalman filtering and nonparametric background model-based multi-target detection tracking method | |
CN111325048B (en) | Personnel gathering detection method and device | |
CN111079621A (en) | Method and device for detecting object, electronic equipment and storage medium | |
CN112836683A (en) | License plate recognition method, device, equipment and medium for portable camera equipment | |
CN116030412A (en) | Escalator monitoring video anomaly detection method and system | |
CN111666821A (en) | Personnel gathering detection method, device and equipment | |
CN111723656B (en) | Smog detection method and device based on YOLO v3 and self-optimization | |
CN113505704B (en) | Personnel safety detection method, system, equipment and storage medium for image recognition | |
CN111079722A (en) | Hoisting process personnel safety monitoring method and system | |
Purohit et al. | Multi-sensor surveillance system based on integrated video analytics | |
CN113920585A (en) | Behavior recognition method and device, equipment and storage medium | |
CN116403162B (en) | Airport scene target behavior recognition method and system and electronic equipment | |
CN112528903A (en) | Face image acquisition method and device, electronic equipment and medium | |
US10783365B2 (en) | Image processing device and image processing system | |
CN116311166A (en) | Traffic obstacle recognition method and device and electronic equipment | |
CN111627224A (en) | Vehicle speed abnormality detection method, device, equipment and storage medium | |
US20230267779A1 (en) | Method and system for collecting and monitoring vehicle status information | |
CN111368726B (en) | Construction site operation face personnel number statistics method, system, storage medium and device | |
CN114639084A (en) | Road side end vehicle sensing method based on SSD (solid State disk) improved algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||