CN111582095B - Light-weight rapid detection method for abnormal behaviors of pedestrians - Google Patents

Light-weight rapid detection method for abnormal behaviors of pedestrians Download PDF

Info

Publication number
CN111582095B
CN111582095B (application CN202010346229.6A)
Authority
CN
China
Prior art keywords
convolution
network
module
depth
skeleton information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010346229.6A
Other languages
Chinese (zh)
Other versions
CN111582095A (en
Inventor
吴晓军
袁佳兴
原盛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202010346229.6A priority Critical patent/CN111582095B/en
Publication of CN111582095A publication Critical patent/CN111582095A/en
Application granted granted Critical
Publication of CN111582095B publication Critical patent/CN111582095B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING › G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data › G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F18/00 Pattern recognition › G06F18/20 Analysing › G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation › G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F18/00 Pattern recognition › G06F18/20 Analysing › G06F18/24 Classification techniques › G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches › G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N3/00 Computing arrangements based on biological models › G06N3/02 Neural networks › G06N3/04 Architecture, e.g. interconnection topology › G06N3/045 Combinations of networks
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N3/00 Computing arrangements based on biological models › G06N3/02 Neural networks › G06N3/04 Architecture, e.g. interconnection topology › G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N3/00 Computing arrangements based on biological models › G06N3/02 Neural networks › G06N3/08 Learning methods
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING › G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data › G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands


Abstract

The invention provides a lightweight method for the rapid detection of pedestrian abnormal behavior, comprising the following steps: step 1, perform pedestrian detection on an image and frame each pedestrian with a detection box to obtain a pedestrian detection frame; step 2, extract human skeleton information from the pedestrian detection frame obtained in step 1 to obtain a human skeleton information picture; step 3, perform background-removal preprocessing on the picture obtained in step 2; step 4, rapidly detect pedestrian abnormal behavior on the preprocessed human skeleton information picture from step 3 using a lightweight multi-scale information fusion detection network based on depthwise separable convolution, obtaining a four-dimensional vector whose components correspond to the four classes of abnormal human behavior. By combining background-removed human skeleton information, multi-scale information and a lightweight network based on depthwise separable convolution, the invention enhances the robustness and real-time performance of the algorithm, markedly reduces the computation of the network model, lowers the algorithm's hardware requirements and reduces cost.

Description

Light-weight rapid detection method for abnormal behaviors of pedestrians
Technical Field
The invention belongs to the field of intelligent video surveillance and particularly relates to a lightweight method for the rapid detection of pedestrian abnormal behavior.
Background
Intelligent video surveillance is a computer vision application covering target detection, image classification, motion detection, deep learning and other technologies; what distinguishes it from a traditional surveillance system is its intelligence. At present, the judgment and monitoring of pedestrian abnormal behavior in video largely remains at the stage of manual identification. Unlike computers and monitoring equipment, which run continuously, a human monitor cannot accurately process a large volume of surveillance video in real time, and quickly extracting useful information from it is essentially an impossible task. A pedestrian abnormal-behavior detection algorithm can ignore the large amount of video data that is useless for security, overcomes the missed detections and difficulties of investigation and evidence collection that easily arise from manual monitoring, saves manpower, material and financial resources, yields economic benefits, and provides a stable guarantee for people's lives. With the progress of science and technology, how to apply modern technology to improve the safety of public areas is a valuable topic. When abnormal behaviors such as fighting, running or crowd gathering occur in public areas, detecting and identifying these behaviors in real time, so that they can be discovered promptly and stopped by raising an alarm, greatly reduces the possibility of injury and is a very effective safety measure.
At present, most human behavior recognition algorithms adopt the two-stream method or combine it with an LSTM. The motion vectors of all pixels in each frame of an image sequence are used to detect object motion: once an object moves, the optical flow of the corresponding pixels changes, enabling detection and recognition of the moving object's behavior. Such an algorithm processes spatial information from a single input frame and temporal information from a multi-frame dense optical-flow field, combines the two behavior-classification streams through multi-task training, and reduces overfitting to improve recognition accuracy. Because optical flow is computed for every pixel in the image to separate foreground from background, and the optical-flow field also involves fusing multi-frame information, the computation is clearly too large and too slow, and timeliness cannot be guaranteed. In real life, it is impractical to equip every surveillance camera with an expensive high-performance graphics card for pedestrian abnormal-behavior detection. Moreover, the algorithm places extremely high demands on the scene and is strongly affected by changes in brightness and scenery.
The 3D convolutional neural network extends the traditional 2D convolutional neural network to video behavior recognition: compared with an ordinary 2D network, its convolution kernel adds a temporal dimension to the convolution computation. Traditional 2D convolution takes a single RGB frame as input and outputs a two-dimensional feature map. 3D convolution takes consecutive RGB frames stacked into a cube; the 3D kernel extracts spatial and temporal information simultaneously, and the output is a cube of two-dimensional feature maps, each produced by convolving the input frames. The number of parameters per 3D convolution is large, as is the amount of computation.
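As a rough illustration of why 3D kernels are costly, the weight counts of a 2D and a 3D convolution layer can be compared directly. The kernel and channel sizes below are hypothetical, not figures from the patent:

```python
# Illustrative parameter counts for 2D vs. 3D convolution layers.
def conv2d_params(k, c_in, c_out):
    # one k x k spatial kernel per (input, output) channel pair, plus a bias per filter
    return k * k * c_in * c_out + c_out

def conv3d_params(k, t, c_in, c_out):
    # a 3D kernel adds a temporal extent t to every filter
    return k * k * t * c_in * c_out + c_out

p2 = conv2d_params(3, 64, 64)      # 36,928 parameters
p3 = conv3d_params(3, 3, 64, 64)   # 110,656 parameters, roughly 3x the 2D layer
```

With a temporal extent of 3, the 3D layer carries about three times the weights of the 2D layer, before even counting the extra multiply-accumulates over the frame stack.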
In practical applications, the pedestrian-behavior detection scene is extremely complex and may be affected by much noise; in addition, cost constraints make the timeliness requirements of practical applications hard to meet, which greatly limits the deployment of such algorithms. There are two ways to detect pedestrian abnormal behavior in public areas: first, connect the camera and the processing hardware, such as a GPU, into a single integrated detection device; second, upload the camera's video stream over a network to the cloud, where a server performs recognition and detection. Either way, timeliness is a crucial indicator: abnormal-behavior detection loses its meaning if it cannot detect in time. With today's mainstream detection algorithms, both implementations consume large computing and storage resources, and the second may place even higher demands on network transmission. Higher hardware and network requirements raise equipment cost, yet even then real-time operation is hard to achieve, seriously affecting the model's practical deployment.
Disclosure of Invention
The invention aims to provide a lightweight method for the rapid detection of pedestrian abnormal behavior that overcomes the poor timeliness and high cost of the prior art.
In order to achieve the purpose, the invention adopts the technical scheme that:
the invention provides a method for quickly detecting abnormal behaviors of lightweight pedestrians, which comprises the following steps:
step 1, carrying out pedestrian detection on an image, and framing by using a detection frame to obtain a pedestrian detection frame;
step 2, extracting human skeleton information from the pedestrian detection frame obtained in the step 1 to obtain a human skeleton information picture;
step 3, carrying out background removal pretreatment on the human body skeleton information picture obtained in the step 2;
and 4, detecting pedestrian abnormal behavior on the preprocessed human skeleton information picture from step 3 using a lightweight multi-scale information fusion detection network based on depthwise separable convolution, to obtain a four-dimensional vector whose components correspond to the four classes of abnormal human behavior.
Preferably, in step 1, pedestrian detection is performed on the image using the YOLOv3 object detection algorithm to obtain a pedestrian detection frame.
Preferably, in step 2, the RMPE framework is used to extract human skeleton information from the pedestrian detection frame obtained in step 1, so as to obtain a human skeleton information picture.
Preferably, in step 4, the lightweight multi-scale information fusion detection network based on depth separable convolution comprises a trunk residual error network module and two branch network modules, wherein the trunk residual error network module comprises an input layer, and an input end of the input layer is used for receiving the preprocessed human skeleton information picture; the output end of the input layer is sequentially connected with a first convolution module and a second convolution module, and the output end of the second convolution module is respectively connected with a branch network module and a third convolution module; the output end of the third convolution module is respectively connected with the other branch network module and the fourth convolution module; the fourth convolution module is connected with the fifth convolution module, the fifth convolution module is combined with the output ends of the two branch network modules, and multi-scale information is fused and transmitted to the full connection layer; the output layer is a softmax classifier.
Preferably, the first convolution module includes one convolution layer and one pooling layer; the second convolution module comprises three depth separable convolution sub-residual network units, and each depth separable convolution sub-residual network unit comprises four depth separable convolution layers; the third convolution module comprises four depth separable convolution sub-residual network elements, and each depth separable convolution sub-residual network element comprises four depth separable convolution layers; the fourth convolution module comprises six depth separable convolution sub-residual network elements, and each depth separable convolution sub-residual network element comprises four depth separable convolution layers; the fifth convolution module includes three depth-separable convolutional sub-residual network elements and one pooling layer, each depth-separable convolutional sub-residual network element including four depth-separable convolutional layers.
Preferably, one branch network module connected to the output end of the second convolution module includes three convolution layers and one pooling layer; the other branch network module connected with the output end of the third convolution module comprises three convolution layers and a pooling layer.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a method for rapidly detecting abnormal behaviors of lightweight pedestrians, which provides a detection mode different from the existing behavior recognition algorithm by innovating the thought of the detection mode and designing the lightweight detection network, and specifically comprises the following steps:
the method has the advantages that the method is used for identifying the abnormal behaviors of the pedestrian based on the human body skeleton information and the background removal mode, so that the use of time sequence information is avoided, the calculated amount is effectively reduced, and the real-time performance of the algorithm is improved;
secondly, the invention enhances the robustness of the algorithm by removing the interference of background information, so that the algorithm is not restricted by the application scene;
thirdly, the invention designs a detection network fusing multi-scale information, effectively utilizes the multi-scale information and improves the performance of the algorithm;
and fourthly, the invention replaces the traditional convolution kernel with the deep separable convolution kernel, redesigns the lightweight network, and solves the problem that the detection timeliness is too poor due to the fact that the deployed equipment has poor computing and storing capabilities in the actual application deployment process and is influenced by cost and environmental factors.
Experimental results show that the parameter count and computation of the network model are markedly reduced: the parameters are only 1/8, and the computation only 1/6, of the baseline algorithm's model, lowering the algorithm's hardware requirements.
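The reported reduction is consistent with the standard cost accounting for depthwise separable convolution, sketched below for one illustrative layer. The layer sizes are assumptions for illustration, not the patent's measured figures:

```python
# Multiply-accumulate cost of a standard vs. a depthwise-separable convolution layer.
def standard_conv_cost(k, c_in, c_out, h, w):
    # every output pixel mixes all input channels through a k x k kernel
    return k * k * c_in * c_out * h * w

def separable_conv_cost(k, c_in, c_out, h, w):
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1 x 1 convolution mixes the channels
    return depthwise + pointwise

std = standard_conv_cost(3, 64, 64, 56, 56)
sep = separable_conv_cost(3, 64, 64, 56, 56)
ratio = sep / std   # equals 1/c_out + 1/k^2, about 0.127 for this layer
```

The ratio 1/c_out + 1/k² shows why the savings grow with the channel count: for 3 × 3 kernels the separable layer costs a bit more than 1/9 of the standard one, in line with the order-of-magnitude reduction the patent reports.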
Drawings
FIG. 1 is a flow chart of lightweight pedestrian abnormal-behavior recognition;
FIG. 2 is a diagram of the lightweight multi-scale information fusion detection network architecture based on depthwise separable convolution;
FIG. 3 is a flow chart of the knowledge distillation algorithm;
FIG. 4 is a flow chart of human skeleton information extraction;
FIG. 5 is a running example based on human skeleton information;
FIG. 6 is a falling example based on human skeleton information;
FIG. 7 is a fighting example based on human skeleton information;
FIG. 8 is a walking example based on human skeleton information.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The invention provides a lightweight method for the rapid detection of pedestrian abnormal behavior that requires little computation, meets the timeliness requirements of actual deployment, resists interference from scene factors, and is highly robust. The recognized behaviors comprise four classes: running, fighting, falling and walking.
Specifically, as shown in fig. 1, the algorithm for rapidly detecting the abnormal behavior of the lightweight pedestrian provided by the invention comprises the following steps:
Step 1, first, the YOLOv3 object detection algorithm is used to detect the pedestrians in an image, and each pedestrian is framed with a detection box to obtain a pedestrian detection frame.
Step 2, the RMPE framework is used to extract human skeleton information from the pedestrian detection frame of step 1, obtaining a human skeleton information picture.
Step 3, background-removal preprocessing is performed on the picture obtained in step 2. Specifically, the original input image is multiplied by 0, setting it to black, and only the extracted human skeleton information is kept, thereby removing the background.
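A minimal sketch of this masking step with NumPy, assuming a hypothetical boolean `skeleton_mask` marking the skeleton pixels produced by the pose extractor:

```python
import numpy as np

def remove_background(frame, skeleton_mask):
    # frame * 0 -> all-black image; then redraw only the skeleton pixels
    out = np.zeros_like(frame)
    out[skeleton_mask] = frame[skeleton_mask]
    return out

# Tiny synthetic example: a 4x4 RGB frame with one skeleton pixel at (1, 1).
frame = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1, 1] = True
result = remove_background(frame, mask)
```

In practice the mask would come from rasterizing the extracted skeleton joints and limbs; everything outside it is zeroed, which is exactly the "multiply by 0, keep the skeleton" operation described above.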
The background-removed human skeleton information pictures are saved to build a pedestrian abnormal-behavior data set based on human skeleton information, covering the four behaviors of running, fighting, falling and walking. The data set is divided into a training set, used to train the subsequent recognition network, and a test set, used to verify the recognition network's accuracy.
Step 4, the lightweight multi-scale information fusion detection network (DSN) based on depthwise separable convolution designed by the invention is used to rapidly detect pedestrian abnormal behavior. The network consists of 3 parts: a backbone residual network module based on depthwise separable convolution and two depthwise-separable branch network modules responsible for multi-scale information; finally, the outputs of the three modules are concatenated for the final prediction. The network structure is shown in fig. 2:
1) The network input is an RGB image from the pedestrian abnormal-behavior data set. Before training, the images in the data set are preprocessed: each is uniformly resized to 224 × 224 × 3 pixels and randomly flipped for data augmentation.
2) The first convolution module conv1 applies 64 convolution kernels of size 7 × 7 with stride 2, followed by a pooling operation with stride 2; the output feature-map dimension is (56, 56, 64).
3) The second convolution module conv2 comprises 3 depthwise-separable-convolution sub-residual network units, each containing 4 convolution layers: two pairs of a 3 × 3 depthwise convolution and a 1 × 1 pointwise convolution, each with 64 kernels. The module has 12 convolution layers in total, and the output feature-map dimension is (56, 56, 64).
4) The left branch network conv_left consists of three convolution layers and one pooling layer. The convolution kernels are a 1 × 1 ordinary convolution, a 3 × 3 depthwise convolution and a 1 × 1 pointwise convolution, numbering 32, 32 and 16 respectively. The first and third 1 × 1 kernels change the numbers of input and output channels: the first 1 × 1 ordinary convolution reduces the dimensionality of the input to cut the computation of the depthwise convolution, while the third 1 × 1 pointwise convolution integrates the depthwise convolution's output features into a 16-channel feature map. Because small-scale information occupies a small proportion of the image, its share in the final classification decision is correspondingly reduced. The pooling layer uses average pooling with a 14 × 14 filter and stride 14, and the branch's output feature-map dimension is (4, 4, 16).
5) The third convolution module conv3 contains 4 depthwise-separable-convolution sub-residual network units, each containing 4 convolution layers: two pairs of a 3 × 3 depthwise convolution and a 1 × 1 pointwise convolution, with 64 depthwise kernels at first and 128 thereafter. The module has 16 convolution layers in total, and the output feature-map dimension is (28, 28, 128).
6) The right branch network conv_right consists of three convolution layers and one pooling layer: a 1 × 1 ordinary convolution, a 3 × 3 depthwise convolution and a 1 × 1 pointwise convolution, numbering 64, 64 and 32 respectively. The pooling layer uses average pooling with a 14 × 14 filter and stride 14, and the branch's output feature-map dimension is (2, 2, 32).
7) The fourth convolution module conv4 contains 6 depthwise-separable-convolution sub-residual network units, each containing 4 convolution layers: two pairs of a 3 × 3 depthwise convolution and a 1 × 1 pointwise convolution, with 128 depthwise kernels at first and 256 thereafter. The module has 24 convolution layers in total, and the output feature-map dimension is (14, 14, 256).
8) The fifth convolution module conv5 contains 3 depthwise-separable-convolution sub-residual network units, each containing 4 convolution layers: two pairs of a 3 × 3 depthwise convolution and a 1 × 1 pointwise convolution, with 256 depthwise kernels at first and 512 thereafter. The module has 12 convolution layers in total, and the output feature-map dimension is (7, 7, 512). Global average pooling then reduces the parameter count and computation of the fully connected layer, giving an output feature-map dimension of (1, 1, 512).
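The spatial sizes quoted for conv2 through conv5 — 56, 28, 14 and 7 — follow from repeatedly halving the 224 × 224 input at the stride-2 stages, as this small arithmetic check shows:

```python
# Arithmetic check of the feature-map sizes quoted above: each stride-2
# stage halves the spatial resolution of the 224x224 input.
def downsample(size, halvings):
    for _ in range(halvings):
        size //= 2
    return size

# conv2 output: 224 -> 112 (conv1, stride 2) -> 56 (pooling, stride 2);
# each later module halves once more: conv3 -> 28, conv4 -> 14, conv5 -> 7.
sizes = [downsample(224, n) for n in range(2, 6)]
print(sizes)  # [56, 28, 14, 7]
```

This also explains the 14 × 14, stride-14 average pooling in the branches: it collapses the 56 × 56 and 28 × 28 branch inputs to 4 × 4 and 2 × 2 respectively.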
9) The outputs of the backbone network and the two branches are then combined, and the fused multi-scale information is fed into a fully connected layer.
10) The output layer uses a softmax classifier whose output is a four-dimensional vector, each component corresponding to one of the four action classes in abnormal-behavior detection. The formula is:
$$p^{(i)} = \frac{e^{z_i}}{\sum_{j=1}^{4} e^{z_j}}$$
where p^{(i)} is the probability of the i-th action class (a scalar) and z is the 4-dimensional vector input to softmax. The loss function used is the cross-entropy loss, whose expression is:
$$L = -\sum_{i=1}^{4} y_i \log p^{(i)}$$
where y_i equals 0 or 1: it is 1 if the i-th action class is the correct one and 0 otherwise. The more accurate the prediction, the smaller the value of the loss function. The ReLU function is the activation function used in the network.
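The two formulas above can be written out as a small pure-Python sketch; the logit values are hypothetical:

```python
import math

def softmax(z):
    # p^(i) = exp(z_i) / sum_j exp(z_j)
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, p):
    # L = -sum_i y_i * log(p^(i)), with y a one-hot label over the 4 classes
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))

z = [2.0, 0.5, 0.1, -1.0]        # hypothetical logits for the four action classes
p = softmax(z)                    # probabilities summing to 1
loss = cross_entropy([1, 0, 0, 0], p)
```

A correct, confident prediction drives p^{(i)} toward 1 for the true class and the loss toward 0, which is exactly the behavior described in the text.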
Because the number of network layers and parameters is reduced, the network's ability to extract features and fit data distributions is weakened, and its accuracy is inferior to that of conventional, deeper convolutional networks. Therefore, a knowledge distillation optimization algorithm is used to improve the performance of the lightweight network; the flow of the knowledge distillation algorithm is shown in fig. 3:
the algorithm process is as follows:
1) First, a large complex teacher network is trained on the pedestrian abnormal-behavior data set based on human skeleton information, and the trained network model is saved.
2) The input image is fed simultaneously into the teacher network and the student network. The teacher network generates the soft target: its logits are divided by a parameter T and then passed through a softmax classifier to output the soft target.
$$p_i = \frac{\exp(t_i / T)}{\sum_j \exp(t_j / T)}$$
where p_i is the soft target output by the teacher network, t_i is the teacher network's value before the softmax operation, and T is the set temperature parameter.
3) The student network is trained with the teacher network's soft target. The student network's output is processed in two ways to obtain two losses. One is the loss L_soft, obtained using the soft target as the target value; the student network's output is likewise divided by T before its softmax classification. The expression is:
$$q_i = \frac{\exp(s_i / T)}{\sum_j \exp(s_j / T)}$$
where s_i is the student network's value before the softmax operation and T is the set temperature parameter.
L_soft is calculated as:
$$L_{\text{soft}} = -\sum_i p_i \log q_i$$
where q_i is the output of the student network and p_i is the teacher network's soft target. The other loss, denoted L_hard, is obtained using the hard target as the target value; this term uses the ordinary softmax classification loss, whose expression is:
$$L_{\text{hard}} = -\sum_i y_i \log q_i$$
where y_i is the true label value of the data and q_i is the output of the student network.
4) The two loss terms are weighted and summed to obtain the total loss, denoted L_total, whose expression is:
$$L_{\text{total}} = \alpha L_{\text{soft}} + (1 - \alpha) L_{\text{hard}}$$
T and α are hyper-parameters. T ranges from 1 to 20; a larger T is set early in training and then gradually reduced, so that the student network's output information is fully used and the ability to extract feature information is better learned. α is initially set to 0.95, increasing the weight of the teacher network so that the student network learns from it preferentially; the weight is then gradually reduced to balance the real labels against the soft targets, so that the student network performs better on the specific task.
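Putting steps 2) through 4) together, a pure-Python sketch of the distillation loss might look like the following. The exact weighting convention and the logit values are assumptions for illustration, not taken verbatim from the patent:

```python
import math

def softmax_T(logits, T):
    # tempered softmax: divide logits by temperature T before normalizing
    exps = [math.exp(v / T) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, y_true, T=10.0, alpha=0.95):
    p = softmax_T(teacher_logits, T)          # teacher soft target
    q = softmax_T(student_logits, T)          # student soft prediction
    l_soft = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
    q_hard = softmax_T(student_logits, 1.0)   # ordinary softmax for the hard loss
    l_hard = -sum(yi * math.log(qi) for yi, qi in zip(y_true, q_hard))
    return alpha * l_soft + (1 - alpha) * l_hard

# Hypothetical teacher/student logits and a one-hot true label.
loss = distill_loss([4.0, 1.0, 0.5, 0.2], [3.0, 1.5, 0.3, 0.1], [1, 0, 0, 0])
```

A high T flattens both distributions so the student can absorb the teacher's inter-class similarity information; lowering T and α over training, as described above, shifts weight back to the real labels.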
Step 5, output the detection result.
Examples
The method for rapidly detecting pedestrian abnormal behavior using human-skeleton-based information specifically comprises the following steps:
step 1: and carrying out human body frame calibration of the pedestrian by using a YOLOv3 target detection algorithm on the input image.
Step 2: extract human skeleton information from the obtained human body frame using the RMPE framework to obtain a human skeleton information picture.
Step 3: perform background-removal preprocessing on the picture obtained in step 2. The human skeleton information extraction flow is shown in fig. 4, and examples of running, falling, fighting and walking are shown in figs. 5, 6, 7 and 8.
Step 4: rapidly detect pedestrian abnormal behavior on the background-removed image containing human skeleton information, using the lightweight multi-scale information fusion detection network based on depthwise separable convolution designed by the invention.
Step 5: output the detection result.
Experiments show that the amount of computation is markedly reduced while real-time performance and robustness are improved, achieving rapid detection.

Claims (4)

1. A method for rapidly detecting abnormal behaviors of lightweight pedestrians is characterized by comprising the following steps:
step 1, carrying out pedestrian detection on an image, and framing by using a detection frame to obtain a pedestrian detection frame;
step 2, extracting human skeleton information from the pedestrian detection frame obtained in the step 1 to obtain a human skeleton information picture;
step 3, carrying out background removal pretreatment on the human body skeleton information picture obtained in the step 2;
step 4, carrying out pedestrian abnormal behavior detection on the preprocessed human body skeleton information picture in the step 3 by using a lightweight multi-scale information fusion detection network based on depth separable convolution to obtain a four-dimensional vector which respectively corresponds to four types of actions of human body abnormal behaviors;
in step 4, the lightweight multi-scale information fusion detection network based on depth separable convolution comprises a trunk residual network module and two branch network modules; the trunk residual network module comprises an input layer whose input end receives the preprocessed human skeleton information picture; the output end of the input layer is connected in sequence to a first convolution module and a second convolution module; the output end of the second convolution module is connected both to one branch network module and to a third convolution module; the output end of the third convolution module is connected both to the other branch network module and to a fourth convolution module; the fourth convolution module is connected to a fifth convolution module; the output of the fifth convolution module is combined with the outputs of the two branch network modules, and the fused multi-scale information is fed to a fully connected layer; the output layer is a softmax classifier;
the first convolution module comprises a convolution layer and a pooling layer; the second, third, and fourth convolution modules comprise three, four, and six depth separable convolution sub-residual network units, respectively; the fifth convolution module comprises three depth separable convolution sub-residual network units and a pooling layer; each depth separable convolution sub-residual network unit comprises four depth separable convolution layers.
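The lightweight property claimed above comes from factoring each convolution into a depthwise stage (one spatial kernel per channel) and a pointwise 1×1 stage that mixes channels. A minimal NumPy sketch of one such layer, together with the parameter saving it buys; the channel counts and the 3×3 kernel size below are illustrative assumptions, not values fixed by the claim:

```python
import numpy as np

def depthwise_separable_conv(x, depthwise_k, pointwise_w):
    """Depthwise separable convolution: per-channel spatial filtering
    followed by a 1x1 pointwise mix across channels.
    x: (C, H, W); depthwise_k: (C, k, k); pointwise_w: (C_out, C)."""
    c, h, w = x.shape
    k = depthwise_k.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    # Depthwise stage: each input channel convolved with its own k x k kernel.
    dw = np.zeros_like(x, dtype=float)
    for ch in range(c):
        for i in range(h):
            for j in range(w):
                dw[ch, i, j] = np.sum(xp[ch, i:i + k, j:j + k] * depthwise_k[ch])
    # Pointwise stage: 1x1 convolution mixing channels into C_out outputs.
    return np.tensordot(pointwise_w, dw, axes=([1], [0]))

def param_counts(c_in, c_out, k):
    """Parameter count of a standard conv vs. its depthwise separable split."""
    standard = c_in * c_out * k * k
    separable = c_in * k * k + c_in * c_out
    return standard, separable

# e.g. 64 -> 128 channels with 3x3 kernels
std, sep = param_counts(64, 128, 3)
print(std, sep)  # → 73728 8768
```

For 64→128 channels with 3×3 kernels the separable form needs 8,768 parameters versus 73,728 for a standard convolution, roughly an 8× reduction, which is the kind of saving that makes the detection network above "lightweight".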
2. The method according to claim 1, wherein in step 1 the YOLOv3 object detection algorithm is used to perform pedestrian detection on the image to obtain the pedestrian detection frame.
3. The method according to claim 1, wherein in step 2 the RMPE framework is used to extract human skeleton information from the pedestrian detection frame obtained in step 1, so as to obtain the human skeleton information picture.
4. The lightweight method for rapidly detecting abnormal pedestrian behaviors according to claim 1, wherein the branch network module connected to the output end of the second convolution module comprises three convolution layers and a pooling layer, and the other branch network module connected to the output end of the third convolution module likewise comprises three convolution layers and a pooling layer.
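The two branch modules in claims 1 and 4 tap the trunk at different depths, so the features combined before the fully connected layer describe the skeleton picture at several spatial scales. A hedged NumPy sketch of that fusion step follows; the feature-map shapes and channel widths are made-up placeholders, and only the four-way softmax output follows the claim:

```python
import numpy as np

def global_avg_pool(feat):
    """Collapse a (C, H, W) feature map to a length-C vector."""
    return feat.mean(axis=(1, 2))

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def fuse_and_classify(trunk_feat, branch_feats, fc_w, fc_b):
    """Pool trunk and branch features, concatenate them (multi-scale
    fusion), then apply a fully connected layer and a softmax over the
    four abnormal-behavior classes."""
    pooled = [global_avg_pool(trunk_feat)]
    pooled += [global_avg_pool(f) for f in branch_feats]
    fused = np.concatenate(pooled)
    return softmax(fc_w @ fused + fc_b)  # four-dimensional output

rng = np.random.default_rng(0)
trunk = rng.standard_normal((8, 2, 2))       # deepest features, smallest map
branches = [rng.standard_normal((4, 8, 8)),  # shallow branch, larger map
            rng.standard_normal((6, 4, 4))]  # mid-depth branch
fc_w = rng.standard_normal((4, 8 + 4 + 6))   # 4 behavior classes
fc_b = np.zeros(4)
probs = fuse_and_classify(trunk, branches, fc_w, fc_b)
print(probs)  # four class probabilities summing to 1
```

Global pooling before concatenation lets maps of different spatial sizes be fused into one fixed-length vector, which is why the branch and trunk outputs can be combined despite coming from different network depths.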
CN202010346229.6A 2020-04-27 2020-04-27 Light-weight rapid detection method for abnormal behaviors of pedestrians Active CN111582095B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010346229.6A CN111582095B (en) 2020-04-27 2020-04-27 Light-weight rapid detection method for abnormal behaviors of pedestrians

Publications (2)

Publication Number Publication Date
CN111582095A CN111582095A (en) 2020-08-25
CN111582095B (en) 2022-02-01

Family

ID=72111802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010346229.6A Active CN111582095B (en) 2020-04-27 2020-04-27 Light-weight rapid detection method for abnormal behaviors of pedestrians

Country Status (1)

Country Link
CN (1) CN111582095B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364804B (en) * 2020-11-20 2023-08-25 大连大学 Pedestrian detection method based on depth separable convolution and standard convolution
CN112998697B (en) * 2021-02-22 2022-06-14 电子科技大学 Tumble injury degree prediction method and system based on skeleton data and terminal
CN113011322B (en) * 2021-03-17 2023-09-05 贵州安防工程技术研究中心有限公司 Detection model training method and detection method for monitoring specific abnormal behavior of video
CN113486706B (en) * 2021-05-21 2022-11-15 天津大学 Online action recognition method based on human body posture estimation and historical information
CN113361370B (en) * 2021-06-02 2023-06-23 南京工业大学 Abnormal behavior detection method based on deep learning
CN116204830B (en) * 2023-04-28 2023-07-11 苏芯物联技术(南京)有限公司 Welding abnormality real-time detection method based on path aggregation network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635790A (en) * 2019-01-28 2019-04-16 Hangzhou Dianzi University Pedestrian abnormal behavior recognition method based on 3D convolution
CN109828251A (en) * 2019-03-07 2019-05-31 Naval Aviation University of the PLA Radar target recognition method based on a feature-pyramid lightweight convolutional neural network
CN109886209A (en) * 2019-02-25 2019-06-14 Chengdu Kuangshi Jinzhi Technology Co., Ltd. Abnormal behavior detection method and device, and vehicle-mounted equipment
CN110472500A (en) * 2019-07-09 2019-11-19 Beijing Institute of Technology Fast detection algorithm for water-surface visual targets based on a high-speed unmanned boat
CN110633624A (en) * 2019-07-26 2019-12-31 Beijing University of Technology Machine-vision human abnormal behavior recognition method based on multi-feature fusion
CN110660046A (en) * 2019-08-30 2020-01-07 Taiyuan University of Science and Technology Industrial product defect image classification method based on a lightweight deep neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11734545B2 (en) * 2017-11-14 2023-08-22 Google Llc Highly efficient convolutional neural networks


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Gaihua Wang, "An multi-scale learning network with depthwise separable convolutions," IPSJ Transactions on Computer Vision and Applications, Jul. 2018, pp. 1-8 *
Jeff Donahue, "Long-Term Recurrent Convolutional Networks for Visual Recognition and Description," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, Sep. 2016, pp. 677-691 *
Jiliang Yan, "Multi-Scale Depthwise Separable Convolutional Neural Network for Hyperspectral Image Classification," International Forum on Digital TV and Wireless Multimedia Communications, Feb. 2020, pp. 171-185 *
Francois Chollet, "Xception: Deep Learning with Depthwise Separable Convolutions," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nov. 2017, pp. 1800-1807 *
Ji Xunsheng, "Detection of Abnormal Behaviors on an Escalator Based on Deep Neural Networks," Laser & Optoelectronics Progress, vol. 57, no. 6, Mar. 2020, pp. 1-10 *


Similar Documents

Publication Publication Date Title
CN111582095B (en) Light-weight rapid detection method for abnormal behaviors of pedestrians
Chakma et al. Image-based air quality analysis using deep convolutional neural network
Wang et al. Research on face recognition based on deep learning
CN106919920B (en) Scene recognition method based on convolution characteristics and space vision bag-of-words model
Pan et al. Deepfake detection through deep learning
Rahmon et al. Motion U-Net: Multi-cue encoder-decoder network for motion segmentation
CN111582092B (en) Pedestrian abnormal behavior detection method based on human skeleton
CN109948709B (en) Multitask attribute identification system of target object
CN110929593A (en) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing
WO2021073311A1 (en) Image recognition method and apparatus, computer-readable storage medium and chip
Zhang et al. CNN cloud detection algorithm based on channel and spatial attention and probabilistic upsampling for remote sensing image
CN110008793A (en) Face identification method, device and equipment
CN110765960B (en) Pedestrian re-identification method for adaptive multi-task deep learning
Yang et al. Anomaly detection in moving crowds through spatiotemporal autoencoding and additional attention
CN116343330A (en) Abnormal behavior identification method for infrared-visible light image fusion
CN114821014A (en) Multi-mode and counterstudy-based multi-task target detection and identification method and device
CN116052212A (en) Semi-supervised cross-mode pedestrian re-recognition method based on dual self-supervised learning
CN113255602A (en) Dynamic gesture recognition method based on multi-modal data
CN114332911A (en) Head posture detection method and device and computer equipment
Sun et al. IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes
Tao et al. An adaptive frame selection network with enhanced dilated convolution for video smoke recognition
Yandouzi et al. Investigation of combining deep learning object recognition with drones for forest fire detection and monitoring
CN114360073A (en) Image identification method and related device
CN112800979B (en) Dynamic expression recognition method and system based on characterization flow embedded network
Ahmad et al. Embedded deep vision in smart cameras for multi-view objects representation and retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant