CN113536885A - Human behavior recognition method and system based on YOLOv3-SPP - Google Patents

Human behavior recognition method and system based on YOLOv3-SPP

Info

Publication number
CN113536885A
CN113536885A (application CN202110364743.7A)
Authority
CN
China
Prior art keywords
yolov3
spp
box
anchor
detection model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110364743.7A
Other languages
Chinese (zh)
Inventor
贠卫国
南星辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Architecture and Technology
Original Assignee
Xian University of Architecture and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Architecture and Technology filed Critical Xian University of Architecture and Technology
Priority to CN202110364743.7A priority Critical patent/CN113536885A/en
Publication of CN113536885A publication Critical patent/CN113536885A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

A human behavior recognition method and system based on YOLOv3-SPP. The method introduces an SPP module into the YOLOv3 network, adjusts the network resolution according to the size of the training-set images, re-clusters the initial Anchor Boxes (anchor frames), and adjusts the number of detection categories, converting the multi-category detection and classification problem into the detection and classification of five types of human behavior targets in a given scene: general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects, and body actions interacting with other human bodies. By fusing features of different scales, the method achieves a better detection effect on human behaviors at high density and fine granularity, with fewer missed detections. The method improves the detection effect and the detection speed and reduces missed detections.

Description

Human behavior recognition method and system based on YOLOv3-SPP
Technical Field
The invention belongs to the field of behavior detection in deep learning, and particularly relates to a human behavior identification method and system based on YOLOv 3-SPP.
Background
Traditional video analysis technology relies on manually selected features, which leads to low accuracy, and shallow learning cannot handle large-scale data. Deep learning overcomes these problems well, giving video analysis higher recognition accuracy, better robustness and richer recognition categories.
Most current video analysis classifies abnormal behavior by comparing frames with one another. In this design, human targets are instead extracted and fed into a neural network that directly performs end-to-end abnormal behavior classification, realizing abnormal behavior detection for specific application scenes.
In intelligent video analysis, the temporal difference method and the optical flow method are generally used to extract moving targets from images. The temporal difference method adapts well to dynamic environments, but it cannot completely extract all relevant feature pixels, its recognition accuracy is relatively low, and it easily produces holes in the extracted regions. Most optical flow methods are computationally complex and have poor noise immunity; without special hardware they cannot process full-frame video streams in real time, which makes them costly to operate.
Disclosure of Invention
The invention aims to provide a method and a system for recognizing human body behaviors based on YOLOv3-SPP, so as to solve the problems.
In order to achieve the purpose, the invention adopts the following technical scheme:
a human behavior recognition method based on YOLOv3-SPP comprises the following steps:
step 1, introducing a spatial pyramid pooling SPP module in a YOLOv3 network, and constructing a target detection model based on YOLOv 3-SPP;
step 2, preprocessing Stanford40 (the Stanford human behavior data set): extracting from the Stanford40 label files the labeling information of five types of human behavior targets, namely general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects and body actions interacting with other human bodies, and converting the five types of labeling information into the format supported by the Darknet (YOLO feature extraction network) framework;
step 3, according to the image resolution of the Stanford40 (Stanford human behavior data set) human activity data set, re-clustering the label boxes converted in step 2 to the Darknet (YOLO feature extraction network)-supported format by using the k-means algorithm to obtain new initial Anchor Boxes (anchor frames), and allocating a corresponding number of Anchor Boxes to each detection scale in the YOLOv3-SPP target detection model according to the Anchor Box allocation rule set by Alexey Bochkovskiy (author of the YOLO series);
step 4, respectively inputting a training set and a verification set in Stanford40 (Stanford human behavior data set) into a YOLOv3-SPP target detection model for training and evaluating the detection model;
and 5, detecting the test video by using the YOLOv3-SPP target detection model trained in the step 4, identifying the action in each frame of the video, and finally splicing the detection result into the video again.
Further, step 1 specifically includes the following steps:
step 1.1, the SPP module consists of four parallel pooling layers with kernel sizes of 1 × 1, 5 × 5, 9 × 9 and 13 × 13 respectively, and is inserted between the 5th convolution and the 6th convolution of the first detection scale in the YOLOv3 network;
and step 1.2, completing construction of a target detection model based on YOLOv3-SPP, and realizing fusion of features with different scales.
Further, the step 2 specifically comprises the following steps:
step 2.1, extracting from the Stanford40 (Stanford human behavior data set) label files the labeling information of the five types of human behavior targets, namely general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects and body actions interacting with other human bodies;
Step 2.2, writing the five types of behavior marking information in the step 2.1 into an XML file named by pictures;
step 2.3, the Stanford40 (Stanford human behavior data set) data set file directory structure is converted into the file directory structure shaped like a PASCAL VOC data set file.
Further, step 2.2 specifically includes the following steps:
step 2.21, writing the five types of behavior marking information in the step 2.1 into an XML file named by pictures;
step 2.22, the conversion is designed as follows:
x_center = (box_xmin + box_xmax) / (2 × picture_width)
y_center = (box_ymin + box_ymax) / (2 × picture_height)
width = (box_xmax − box_xmin) / picture_width
height = (box_ymax − box_ymin) / picture_height
wherein: x_center is the x-axis center coordinate of the anchor frame and y_center is its y-axis center coordinate; box_xmin and box_xmax are the minimum and maximum x-axis coordinates of the anchor frame (box_ymin and box_ymax likewise for the y-axis); picture_width and picture_height are the width and height of the original picture; width and height are the width and height of the anchor frame;
the labeling information is thereby converted into the format used under the Darknet (YOLO feature extraction network) framework;
step 2.23, checking that the TXT label line of each converted picture has the format:
<object-class> <x_center> <y_center> <width> <height>
wherein: object-class is the category, x_center and y_center are the center coordinates of the anchor frame, and width and height are its width and height (a minimal sketch of this conversion follows below).
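By way of illustration, the following is a minimal Python sketch of the conversion in steps 2.22 and 2.23, assuming the annotation provides pixel corner coordinates and the image size; the function name and the example values are illustrative and not taken from the original filing.

```python
# Sketch of step 2.22: corner-format box (xmin, ymin, xmax, ymax) in pixels
# -> Darknet/YOLO label line "<object-class> <x_center> <y_center> <width> <height>"
# with all geometry normalized to [0, 1] by the picture size.
def box_to_yolo(box_xmin, box_ymin, box_xmax, box_ymax,
                picture_width, picture_height, object_class):
    x_center = (box_xmin + box_xmax) / (2 * picture_width)
    y_center = (box_ymin + box_ymax) / (2 * picture_height)
    width = (box_xmax - box_xmin) / picture_width
    height = (box_ymax - box_ymin) / picture_height
    return f"{object_class} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Example: a 200x400-pixel box with top-left corner (100, 50) in a
# 1280x720 picture, labelled with class index 2.
print(box_to_yolo(100, 50, 300, 450, 1280, 720, 2))
```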
Further, step 3 specifically includes the following steps:
step 3.1, observing the coordinate distribution of the label boxes of the training set in Stanford40 (Stanford human behavior data set), and randomly selecting k cluster centers (w_i, h_i), i ∈ {1, 2, …, k}, where w_i and h_i are the width and height of the box;
step 3.2, respectively calculating the distance d between each label box and each cluster center, using:
d(box, centroid) = 1 − IOU(box, centroid)
(a minimal clustering sketch is given after step 3.6.2 below);
step 3.3, recalculating the average value of the width and height of the labeling frames to which the k cluster centers belong as a new cluster center;
step 3.4, repeating the steps 3.2 and 3.3, and outputting a clustering result when the clustering center is not changed any more;
step 3.5, outputting the final clustering result;
step 3.6, respectively allocating 2, 1 and 6 Anchor Box (Anchor frames) for three detection scales in the YOLOv3-SPP target detection model;
step 3.6 specifically comprises the following steps:
step 3.6.1, adjusting the number of filters of all YOLO layers in a YOLOv3-SPP network structure;
and 3.6.2, changing the corresponding MASK in the configuration file.
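The sketch referenced in step 3.2 is given here: a minimal NumPy implementation of IOU-based k-means over normalized (width, height) pairs; the function names and the random stand-in data are illustrative, not the actual Stanford40 label boxes.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) boxes and (w, h) centroids, both treated as if
    anchored at the same corner (the usual trick for anchor clustering)."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, None, 0] * boxes[:, None, 1] +
             centroids[None, :, 0] * centroids[None, :, 1] - inter)
    return inter / union

def kmeans_anchors(boxes, k, iters=300, seed=0):
    """k-means with distance d = 1 - IOU(box, centroid), as in steps 3.1-3.5."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # assign each box to its closest centroid under d = 1 - IOU
        assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.argsort(centroids.prod(axis=1))]  # sorted by area

# Example with random (w, h) pairs standing in for the Stanford40 label boxes.
boxes = np.random.default_rng(1).uniform(0.05, 0.9, size=(500, 2))
print(np.round(kmeans_anchors(boxes, k=9), 3))
```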
Further, step 4 specifically includes the following steps:
step 4.1, taking a model parameter Darknet53.conv.74 trained in advance on the ImageNet data set as an initialization weight to reduce training time;
step 4.2, setting a training hyper-parameter of the network model to obtain a behavior target detection model based on YOLOv 3-SPP;
and 4.3, inputting the human behavior images of the Stanford40 (Stanford human behavior data set) validation set into the YOLOv3-SPP-based behavior target detection model to obtain the evaluation indexes of the YOLOv3-SPP-based behavior target detection model.
Further, step 4.2 specifically includes the following steps:
step 4.21, setting a training hyper-parameter of the network model;
step 4.22, using pictures in Stanford40 (Stanford human behavior data set) data set as training input;
and 4.23, further performing network training by using a Darknet-53 deep learning framework, and obtaining a behavior target detection model based on YOLOv3-SPP when the training average loss reaches a stable value and is not reduced any more.
Further, step 5 specifically includes the following steps:
step 5.1, adjusting the resolution of the test data set pictures to 1280×720, inputting them into the YOLOv3-SPP target detection model trained in step 4, performing feature extraction with up to 32× down-sampling, and finally outputting feature maps at three scales from the network;
step 5.2, distributing different Anchor Box (Anchor frame) for each grid of each scale to detect;
step 5.3, aiming at the overlapped detection frames, inhibiting the detection frames with lower confidence coefficient and higher overlap rate than a set threshold value through an NMS algorithm to obtain an optimal detection frame;
and 5.4, framing the target position by using a rectangular frame in the behavior picture to be detected and marking the category of the behavior picture.
Further, step 5.2 specifically includes the following steps:
step 5.21, allocating 2, 1 and 6 different Anchor Boxes (anchor frames) to each grid cell of the three scales for detection, wherein each Anchor Box prediction comprises 4 bounding-box offsets t_x, t_y, t_w, t_h, 1 confidence t_0 and C detection target classes;
where the confidence is defined as:
confidence = Pr(object) × IOU(pred, truth)
Pr(object) represents the probability that a target exists in the Anchor Box (its value is 0 if no target exists), and IOU(pred, truth) represents the intersection-over-union of the predicted bounding box and the real bounding box (Ground Truth Box):
IOU(pred, truth) = area(box_pred ∩ box_truth) / area(box_pred ∪ box_truth)
each grid cell also predicts C class probabilities, where Pr(class_i | object) represents the probability that the cell belongs to class i given that it contains a target, so the probability that a predicted Bounding Box belongs to class i is expressed as:
Pr(class_i | object) × Pr(object) × IOU(pred, truth) = Pr(class_i) × IOU(pred, truth)
step 5.22, obtaining the predicted bounding-box position from the predicted offsets of the Anchor Box relative to the label box, calculated as:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w × e^(t_w)
b_h = p_h × e^(t_h)
σ(t_0) = Pr(object) × IOU(b, object)
wherein σ(t_0) is the confidence corresponding to the predicted box, σ(·) is the Sigmoid normalization applied to t_x and t_y, (c_x, c_y) are the coordinates of the grid cell relative to the top-left corner of the feature map, (p_w, p_h) are the width and height of the anchor prior, and b_x, b_y, b_w, b_h form the final output bounding box.
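For clarity, the following is a minimal Python sketch of the decoding in step 5.22, assuming t_x, t_y, t_w, t_h are raw network outputs, (c_x, c_y) the grid-cell offset and (p_w, p_h) the anchor prior, all in grid-cell units; the values in the example are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """YOLOv3-style decoding (step 5.22), all in grid-cell units:
    b_x = sigma(t_x) + c_x, b_y = sigma(t_y) + c_y,
    b_w = p_w * exp(t_w),   b_h = p_h * exp(t_h)."""
    return (sigmoid(t_x) + c_x,
            sigmoid(t_y) + c_y,
            p_w * math.exp(t_w),
            p_h * math.exp(t_h))

# Example: grid cell (7, 4) with an anchor prior of 3.6 x 2.8 grid units.
print(decode_box(0.2, -0.1, 0.3, 0.05, 7, 4, 3.6, 2.8))
```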
Further, a human behavior recognition system based on Yolov3-SPP comprises:
the target detection model building module is used for introducing a spatial pyramid pooling SPP module in a YOLOv3 network and building a target detection model based on YOLOv 3-SPP;
stanford40 (Stanford human behavior dataset) preprocessing module is used to preprocess Stanford40 (Stanford human behavior dataset): labeling information of five types of human behavior targets, namely facial action, facial action through object manipulation, whole body action, body action interacting with an object and body action interacting with a human body in a Stanford40 (Stanford human behavior data set) labeling file, and converting the five types of labeling information into a format supported under a Darknet (Yolo feature extraction network) framework;
the detection scale allocation module is used for re-clustering the label boxes converted in step 2 to the format supported by the Darknet (YOLO feature extraction network) framework by using the k-means algorithm according to the image resolution of the training set in the Stanford40 (Stanford human behavior data set) human activity data set to obtain new initial Anchor Boxes (anchor frames), and allocating a corresponding number of Anchor Boxes to each detection scale in the YOLOv3-SPP target detection model according to the Anchor Box allocation rule set by Alexey Bochkovskiy (author of the YOLO series);
the detection model training and evaluation module is used for respectively inputting the training set and the validation set of Stanford40 (Stanford human behavior data set) into the YOLOv3-SPP target detection model for training and for evaluating the detection model;
and the detection module is used for detecting the test video with the trained YOLOv3-SPP target detection model, identifying the action in each frame of the video, and finally splicing the detection results back into a video.
Compared with the prior art, the invention has the following technical effects:
In the method, an SPP module is introduced into the YOLOv3 network, the network resolution is adjusted according to the size of the training-set images, the initial Anchor Boxes (anchor frames) are re-clustered, and the number of detection categories is adjusted, converting the multi-category detection and classification problem into the detection and classification of five types of human behavior targets in a given scene: general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects, and body actions interacting with other human bodies. By fusing features of different scales, human behaviors at high density and fine granularity are detected better, with fewer missed detections.
The method improves the detection effect and the detection speed and reduces the missed detection behavior. Compared with the prior art, the invention has the following beneficial technical effects:
the invention first proposes a method for detecting, locating and identifying an action of interest in real time. Frames obtained from a continuous video data stream captured by a surveillance camera are accepted after a specified period of time and an action tag is given based on a single frame. Secondly, experiments prove that the YOLOv3 is an effective method, the speed of identification and positioning in a human activity data set is high, only a small group of frames or even one frame in a video is required in the model for accurate identification, and the YOLOv3 algorithm adopted in the optimization process is low in complexity and high in portability, which is very important in practical use. And further carrying out clustering analysis on the data set by adopting k-means clustering before training to obtain the prior condition size aiming at the data set, so that the training detection precision speed is improved. Furthermore, the invention adopts a freezing layer training method during training and iterates the learning rate to achieve the optimal training effect.
This design provides a human posture recognition method based on YOLOv3. Instead of the traditional frame-to-frame comparison used for abnormal behavior classification in videos, human targets are extracted and fed into a neural network that directly performs end-to-end abnormal behavior classification, which improves recognition accuracy and speed and reduces the complexity of the posture recognition algorithm.
Drawings
FIG. 1 is a flow chart of an embodiment;
FIG. 2 is a block diagram of a YOLOv3 network in an embodiment;
FIG. 3 is a schematic diagram of an embodiment of an SPP module;
FIG. 4 is a graph comparing loss value versus iteration number curves for training of the embodiment and the prior model;
FIG. 5 is a graph comparing accuracy-recall or PR curves for the example and the prior art model;
FIG. 6 is a diagram illustrating the detection results of the embodiment.
Detailed Description
The present invention will now be described in further detail with reference to specific examples, which are intended to be illustrative, but not limiting, of the invention.
This embodiment is implemented in the PyTorch deep learning framework. Hardware configuration: Intel(R) Core(TM) i7-7800X CPU @ 3.50 GHz (8 cores), 16 GB RAM, and an NVIDIA GeForce RTX 2080 Ti GPU with 10 GB of video memory. Software configuration: Linux, Python 3.6.
The evaluation index is mAP (mean Average Precision), i.e. the AP value averaged over the individual categories of the validation set.
The basic flow diagram of the system of the invention is shown in fig. 1, and the human body posture identification method based on YOLOv3-SPP comprises the following steps:
step 1, introducing a Spatial Pyramid Pooling (SPP for short) module into a YOLOv3 network, and constructing a target detection model based on YOLOv3-SPP, specifically comprising the following steps:
step 1.1, wherein the SPP module consists of four parallel pooling layers with Kernel Size of 1 × 1, 5 × 5, 9 × 9, 13 × 13, respectively, and is integrated between the 5 th and 6 th convolutions of the first detection scale in the YOLOv3 network.
And step 1.2, completing construction of a target detection model based on YOLOv3-SPP, and being used for realizing fusion of features with different scales, enriching the expression capability of a final feature map, and improving the detection effect when the scale difference of the behavior target is large in the environment.
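A minimal PyTorch sketch of the SPP block described in step 1.1 is given below, assuming the 1 × 1 branch is an identity pass-through and the 5 × 5, 9 × 9 and 13 × 13 branches are stride-1 max-pooling with same-size padding; the class and variable names are illustrative and not taken from the original filing.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling: concatenate the input with three stride-1
    max-pooled copies (5x5, 9x9, 13x13 kernels), preserving spatial size."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x):
        # The 1x1 "pooling" branch is the identity; channels grow 4x after concat.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

# Example: a 512-channel feature map becomes 2048 channels with the same H x W.
feat = torch.randn(1, 512, 19, 19)
print(SPP()(feat).shape)  # torch.Size([1, 2048, 19, 19])
```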
Step 2, preprocessing Stanford40 (the Stanford human behavior data set): extracting from the Stanford40 label files the labeling information of five types of human behavior targets, namely general facial actions (smiling, laughing, chewing, talking, etc.), facial actions performed through object manipulation (smoking, eating, drinking, etc.), whole-body actions (clapping, climbing stairs, diving, etc.), body actions interacting with objects (brushing teeth, mopping, dribbling, playing golf, etc.) and body actions interacting with other human bodies (fencing, hugging, kicking, kissing, boxing, shaking hands, etc.), and converting the five types of labeling information into the format supported by the Darknet (YOLO feature extraction network) framework, with the following specific steps:
step 2.1, extracting from the Stanford40 (Stanford human behavior data set) label files the labeling information of the five types of human behavior targets, namely general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects and body actions interacting with other human bodies;
Step 2.2, writing the five types of behavior marking information in the step 2.1 into an XML file named by pictures,
wherein the target location in the Stanford40 (Stanford human behavior data set) JSON file is given by the top-left corner coordinates (box_xmin, box_ymin) and the bottom-right corner coordinates (box_xmax, box_ymax) of the box; the conversion code maps the labeling information into the format used under the Darknet (YOLO feature extraction network) framework according to:
x_center = (box_xmin + box_xmax) / (2 × picture_width)
y_center = (box_ymin + box_ymax) / (2 × picture_height)
width = (box_xmax − box_xmin) / picture_width
height = (box_ymax − box_ymin) / picture_height
further, (x_center, y_center) are the center coordinates of the label box, width and height are its normalized width and height, and the TXT label line of each converted picture becomes:
<object-class> <x_center> <y_center> <width> <height>,
wherein: object-class is the category, x_center and y_center are the center coordinates of the anchor frame, and width and height are its width and height.
Step 2.3 converts the Stanford40 (Stanford human behavior dataset) dataset file directory structure into a file directory structure shaped like a PASCAL VOC dataset,
further, a TXT file with labeling information is placed in a Labels folder, a generated XML file is placed in an Annotation folder, pictures in a Stanford40 (Stanford human behavior data set) data set are placed in a JPEGImages folder, and names for model training and verifying pictures are written in train.txt and val.txt in a Main folder under an ImageSets directory respectively.
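As an illustration only, the following Python sketch recreates the directory layout just described; the root directory name is illustrative, and the XML folder is given the standard VOC spelling "Annotations".

```python
from pathlib import Path

# PASCAL VOC-style layout of step 2.3: Labels/ for the Darknet TXT files,
# Annotations/ for the XML files, JPEGImages/ for the Stanford40 pictures,
# and ImageSets/Main/{train,val}.txt for the split lists.
root = Path("Stanford40_VOC")  # illustrative root directory name
for sub in ("Labels", "Annotations", "JPEGImages", "ImageSets/Main"):
    (root / sub).mkdir(parents=True, exist_ok=True)
for split in ("train.txt", "val.txt"):
    (root / "ImageSets" / "Main" / split).touch()
print(sorted(str(p.relative_to(root)) for p in root.rglob("*")))
```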
Step 3, re-clustering and allocation: according to the 1280×720 image resolution of the training set in the Stanford40 (Stanford human behavior data set) data set, the label boxes converted in step 2 to the Darknet (YOLO feature extraction network)-supported format are re-clustered with the k-means algorithm to obtain new initial Anchor Boxes (anchor frames), and a corresponding number of Anchor Boxes is allocated to each detection scale in the YOLOv3-SPP target detection model according to the Anchor Box allocation rule set by Alexey Bochkovskiy (author of the YOLO series), with the following specific steps:
step 3.1, observing the coordinate distribution of the label boxes of the training set in Stanford40 (Stanford human behavior data set), and randomly selecting k cluster centers (w_i, h_i), i ∈ {1, 2, …, k}, where w_i and h_i are the width and height of the box;
step 3.2, respectively calculating the distance d between each label box and each cluster center:
d(box, centroid) = 1 − IOU(box, centroid)
where the numerator of the IOU is the intersection area of the anchor frame and the label box and its denominator is their union area; when the IOU value is largest, i.e. the label box and the anchor frame match best, d is smallest, and each label box is assigned to the cluster whose center is closest, i.e. for which d is smallest;
step 3.3, recalculating the average value of the width and height of the labeling frames to which the k cluster centers belong as a new cluster center;
step 3.4, repeating the steps 3.2 and 3.3, and outputting a clustering result when the clustering center is not changed any more;
step 3.5, finally outputting the clustering result: (10,13), (16,30), (33,23), (30,61), (62,45), (59,119), (116,90), (156,198), (373,326);
step 3.6, respectively allocating 2, 1 and 6 Anchor Boxes to the three detection scales in the YOLOv3-SPP target detection model, i.e. adjusting the number of filters of all YOLO layers in the YOLOv3-SPP network structure to (N + 5) × 3 = 135, where N is the number of allocated Anchor Boxes, and further changing the corresponding MASK entries in the configuration file to 7, 8; 6; and 0, 1, 2, 3, 5, 6.
Step 4, training and evaluating the example model, respectively inputting a training set and a verification set in a Stanford40 (Stanford human behavior data set) data set into a YOLOv3-SPP target detection model for training and evaluating the detection model, and specifically comprising the following steps:
step 4.1, adopting the model parameters Darknet53.conv.74 (weights of the Darknet-53 feature extraction network of YOLO pre-trained on the ImageNet data set) as the initialization weights to reduce training time;
step 4.2, setting the training hyper-parameters of the embodiment, namely network resolution, momentum, weight decay, base learning rate (base_lr), batch size, maximum number of iterations and the learning-rate adjustment strategy, taking the pictures in the Stanford40 (Stanford human behavior data set) data set as training input, performing network training with the Darknet-53 deep learning framework, and obtaining the YOLOv3-SPP-based behavior target detection model when the average training loss reaches a stable value and no longer decreases; the training hyper-parameters are set as shown in Table 1:
TABLE 1 Network training hyper-parameter settings (the table is reproduced as an image in the original publication)
The learning-rate adjustment strategy (policy) is set to epoch-based: when the iteration count reaches 100 and 120 epochs, the learning rate lr is reduced by a factor of 10; score_thresh is set to 0.25 and iou_thresh to 0.2. After the training parameters are configured, the loss-versus-iteration curves of the three network structures, YOLOv3-SPP, YOLOv3 and YOLOv3-tiny, trained on the NVIDIA GeForce RTX 2080 Ti (10 GB), are compared in FIG. 4;
when iteration is carried out to 60 epochs, the Loss value of the YOLOv3-SPP network converges to about 0.5, the YOLOv3 network converges to about 0.8, the fluctuation range of the Loss value of the training of the Tiny YOLOv3 network is large, and the network is unstable, so that the YOLOv3-SPP network can converge faster relative to the YOLOv3 and the Tiny YOLOv3, has better characteristic learning capability and has a lower Loss value under the same learning rate;
step 4.3, inputting the human behavior pictures in the Stanford40 (Stanford human behavior data set) verification set into a behavior target detection model based on YOLOv 3-SPP;
step 4.3.1, recording the network prediction results in a TXT file after layer-by-layer network computation, and obtaining by code the accuracy, recall, F1 value, detection rate (FPS) and P-R curve evaluation indexes of the YOLOv3-SPP-based behavior target detection model.
Step 4.3.2, in order to analyze the model detection performance more comprehensively, the trained three models, namely YOLOv3-SPP, YOLOv3 and YOLOv3-Tiny, are subjected to performance evaluation on a verification set picture, the GPU adopts RTX2080Ti, and specific indexes are shown in table 2:
TABLE 2 Comparison of evaluation indexes of different models (the table is reproduced as an image in the original publication)
The YOLOv3-SPP network model achieves the best detection performance: its accuracy, recall and F1 value reach 78.90%, 92.20% and 0.853, improvements of 14.7%, 11.4% and 0.16 respectively over the YOLOv3 network. The YOLOv3-Tiny network has few layers, a simple structure and low evaluation indexes, and can hardly meet the detection requirements of complex backgrounds and large target-scale differences in crowded environments. The YOLOv3-SPP network involves a large amount of convolution operations, so its detection rate is relatively slow, but it basically meets the real-time requirement;
step 4.3.3, in order to comprehensively measure the detection performance of the model, drawing a precision-recall ratio (PR) curve chart as shown in FIG. 5, wherein the area under the curve is the average precision ratio AP, and the higher the AP is, the better the detection performance of the model is;
wherein red represents the YOLOv3-SPP network PR curve, green represents the YOLOv3 network PR curve, and blue represents the YOLOv3-tiny network PR curve, as can be seen from FIG. 5,
the average precision of YOLOv3-SPP reaches 78.90%, clearly better than that of the YOLOv3 network; the mAP of the body actions involving human-to-human interaction is relatively low because such actions differ greatly between individuals, but it is still better than that of the other YOLO models.
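For reference, the following is a minimal sketch of how an average-precision value such as those in Table 2 is read off a precision-recall curve (the area under the curve mentioned in step 4.3.3), assuming PR points sorted by ascending recall; the example points are made up and are not the measured curves of FIG. 5.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the PR curve with the 'all points' interpolation:
    make precision monotonically non-increasing, then integrate over recall."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):   # precision envelope from the right
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])

# Example with illustrative PR points.
recall = np.array([0.1, 0.4, 0.7, 0.9])
precision = np.array([0.95, 0.9, 0.8, 0.6])
print(round(average_precision(recall, precision), 4))
```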
Step 5, carrying out target detection on the crowd behaviors in Stanford40 (Stanford human behavior data set) by using the YOLOv3-SPP target detection model trained in the step 4, wherein the specific detection process comprises the following steps:
step 5.1, adjusting the picture resolution of the intercepted video frames in the test video data set to 1280x720, inputting the picture resolution into the YOLOv3-SPP target detection model trained in the step 4, and finally outputting a feature map with three scales through a network after 32 times of downsampling feature extraction;
step 5.2, allocating 2, 1 and 6 different Anchor Boxes (anchor frames) to each grid cell of each scale for detection, wherein each Anchor Box prediction comprises 4 bounding-box offsets t_x, t_y, t_w, t_h, 1 confidence t_0 and C detection target classes; the confidence is defined as:
confidence = Pr(object) × IOU(pred, truth)
where Pr(object) represents the probability that a target exists in the Anchor Box (its value is 0 if no target exists), and IOU(pred, truth) represents the intersection-over-union of the predicted bounding box and the real bounding box (Ground Truth Box):
IOU(pred, truth) = area(box_pred ∩ box_truth) / area(box_pred ∪ box_truth)
further, each grid cell predicts C class probabilities, where Pr(class_i | object) represents the probability that the cell belongs to class i given that it contains a target, so the probability that a predicted Bounding Box belongs to class i is expressed as:
Pr(class_i | object) × Pr(object) × IOU(pred, truth) = Pr(class_i) × IOU(pred, truth)
the predicted bounding-box position is obtained from the predicted offsets of the Anchor Box (anchor frame) relative to the label box, calculated as:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w × e^(t_w)
b_h = p_h × e^(t_h)
σ(t_0) = Pr(object) × IOU(b, object)
wherein σ(t_0) is the confidence corresponding to the predicted box, σ(·) is the Sigmoid normalization applied to t_x and t_y, (c_x, c_y) are the coordinates of the grid cell relative to the top-left corner of the feature map, (p_w, p_h) are the width and height of the anchor prior, and b_x, b_y, b_w, b_h form the final output bounding box;
step 5.3, for overlapping detection boxes, suppressing through the NMS algorithm the boxes whose confidence is lower and whose overlap with a higher-confidence box exceeds the set threshold, so as to obtain the optimal detection boxes (a minimal sketch of this step is given after step 5.4 below);
and 5.4, framing the target position by using a rectangular frame in the human behavior picture and marking the category of the target position.
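The sketch referenced in step 5.3: a minimal greedy NMS, assuming detections are given as (x1, y1, x2, y2) boxes with confidence scores; the threshold and example boxes are illustrative.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, suppress boxes whose IOU
    with it exceeds iou_thresh, then repeat with the remaining boxes."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

# Example: two heavily overlapping detections and one separate detection.
boxes = np.array([[10, 10, 110, 210], [12, 14, 108, 200], [300, 40, 380, 160]], float)
scores = np.array([0.92, 0.85, 0.70])
print(nms(boxes, scores))  # [0, 2]
```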
The detection results are shown in FIG. 6. From the detection results it can be seen that the method of this embodiment, by introducing an SPP module into the YOLOv3 network, fusing features of different scales, adjusting the network resolution according to the size of the training-set images, re-clustering the initial Anchor Boxes and adjusting the number of detection categories, converts the multi-category detection and classification problem into the detection and classification of five types of human behavior targets in a given scene, namely general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects and body actions interacting with other human bodies, and achieves a better detection effect on human behaviors at high density and fine granularity, with fewer missed detections.
This design provides a human posture recognition method based on YOLOv3. Instead of the traditional frame-to-frame comparison used for abnormal behavior classification in videos, human targets are extracted and fed into a neural network that directly performs end-to-end abnormal behavior classification. Under the same conditions, performance is greatly improved compared with existing algorithms, so recognition accuracy and speed are improved and the complexity of the posture recognition algorithm is reduced.

Claims (10)

1. A human behavior recognition method based on YOLOv3-SPP is characterized by comprising the following steps:
step 1, introducing a spatial pyramid pooling SPP module in a YOLOv3 network, and constructing a target detection model based on YOLOv 3-SPP;
step 2, preprocessing Stanford40, the Stanford human behavior data set: extracting from the Stanford40 label files the labeling information of five types of human behavior targets, namely general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects and body actions interacting with other human bodies, and converting the five types of labeling information into the format supported by the YOLO feature extraction network Darknet framework;
step 3, according to the image resolution of the training set in Stanford40, re-clustering the label boxes converted in step 2 to the Darknet-supported format by using the k-means clustering algorithm to obtain new initial Anchor Boxes, and allocating a corresponding number of Anchor Boxes to each detection scale in the YOLOv3-SPP target detection model according to the set Anchor Box allocation rule;
step 4, respectively inputting the training set and the verification set in the Stanford40 into a YOLOv3-SPP target detection model for training and evaluating the detection model;
and 5, detecting the test video by using the YOLOv3-SPP target detection model trained in the step 4, identifying the action in each frame of the video, and finally splicing the detection result into the video again.
2. The method for recognizing human body behaviors based on YOLOv3-SPP according to claim 1, wherein the step 1 specifically comprises the following steps:
step 1.1, the SPP module consists of four parallel pooling layers with kernel sizes (Kernel Size) of 1 × 1, 5 × 5, 9 × 9 and 13 × 13 respectively, and the SPP module is inserted between the 5th convolution and the 6th convolution of the first detection scale in the YOLOv3 network;
and step 1.2, completing construction of a target detection model based on YOLOv3-SPP, and realizing fusion of features with different scales.
3. The method for recognizing human body behaviors based on YOLOv3-SPP according to claim 1, wherein the step 2 specifically comprises the following steps:
step 2.1, extracting from the Stanford40 label files the labeling information of the five types of human behavior targets, namely general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects and body actions interacting with other human bodies;
Step 2.2, writing the five types of behavior marking information in the step 2.1 into an XML file named by pictures;
and 2.3, converting the Stanford40 data set file directory structure into a file directory structure shaped like a PASCAL VOC data set.
4. The method for recognizing human body behaviors based on YOLOv3-SPP according to claim 3, wherein the step 2.2 specifically comprises the following steps:
step 2.21, writing the five types of behavior marking information in the step 2.1 into an XML file named by pictures;
step 2.22, the conversion is designed as follows:
x_center = (box_xmin + box_xmax) / (2 × picture_width)
y_center = (box_ymin + box_ymax) / (2 × picture_height)
width = (box_xmax − box_xmin) / picture_width
height = (box_ymax − box_ymin) / picture_height
wherein: x_center is the x-axis center coordinate of the anchor frame and y_center is its y-axis center coordinate; box_xmin and box_xmax are the minimum and maximum x-axis coordinates of the anchor frame (box_ymin and box_ymax likewise for the y-axis); picture_width and picture_height are the width and height of the original picture; width and height are the width and height of the anchor frame;
the labeling information is thereby converted into the format used under the Darknet framework;
step 2.23, checking that the TXT label line of each converted picture has the format:
<object-class> <x_center> <y_center> <width> <height>;
wherein: object-class is the category, x_center and y_center are the center coordinates of the anchor frame, and width and height are its width and height.
5. The method for recognizing human body behaviors based on YOLOv3-SPP according to claim 1, wherein the step 3 specifically comprises the following steps:
step 3.1, observing the coordinate distribution of the label boxes of the Stanford40 training set, and randomly selecting k cluster centers (w_i, h_i), i ∈ {1, 2, …, k}, where w_i and h_i are the width and height of the box;
step 3.2, respectively calculating the distance d between each label box and each cluster center, using:
d(box, centroid) = 1 − IOU(box, centroid);
step 3.3, recalculating the average value of the width and height of the labeling frames to which the k cluster centers belong as a new cluster center;
step 3.4, repeating the steps 3.2 and 3.3, and outputting a clustering result when the clustering center is not changed any more;
step 3.5, outputting the final clustering result;
step 3.6, respectively allocating 2, 1 and 6 Anchor Box for three detection scales in the YOLOv3-SPP target detection model;
step 3.6 specifically comprises the following steps:
step 3.6.1, adjusting the number of filters of all YOLO layers in a YOLOv3-SPP network structure;
and 3.6.2, changing the corresponding MASK in the configuration file.
6. The method for recognizing human body behaviors based on YOLOv3-SPP according to claim 1, wherein the step 4 specifically comprises the following steps:
step 4.1, taking a model parameter Darknet53.conv.74 trained in advance on the ImageNet data set as an initialization weight to reduce training time;
step 4.2, setting a training hyper-parameter of the network model to obtain a behavior target detection model based on YOLOv 3-SPP;
and 4.3, inputting the human behavior pictures of the Stanford40 validation set into the YOLOv3-SPP-based behavior target detection model to obtain the evaluation indexes of the behavior target detection model based on the YOLOv3-SPP network.
7. The method for recognizing human body behaviors based on YOLOv3-SPP according to claim 6, wherein the step 4.2 specifically comprises the following steps:
step 4.21, setting a training hyper-parameter of the network model;
step 4.22, using the pictures in the Stanford40 data set as training input;
and 4.23, further performing network training by using a Darknet-53 deep learning framework, and obtaining a behavior target detection model based on YOLOv3-SPP when the training average loss reaches a stable value and is not reduced any more.
8. The method for recognizing human body behaviors based on YOLOv3-SPP according to claim 1, wherein the step 5 specifically comprises the following steps:
step 5.1, adjusting the resolution of the test data set picture to 1280x720, inputting the test data set picture into the Yolov3-SPP target detection model trained in the step 4, further extracting the down-sampling features by 32 times, and finally outputting the feature pictures with three scales through a network;
step 5.2, distributing different Anchor Box for each grid of each scale for detection;
step 5.3, aiming at the overlapped detection frames, inhibiting the detection frames with lower confidence coefficient and higher overlap rate than a set threshold value through an NMS algorithm to obtain an optimal detection frame;
and 5.4, framing the target position by using a rectangular frame in the behavior picture to be detected and marking the category of the behavior picture.
9. The method for recognizing human body behaviors based on YOLOv3-SPP according to claim 8, wherein the step 5.2 specifically comprises the following steps:
step 5.21, allocating 2, 1 and 6 different Anchor Boxes to each grid cell of each scale for detection, wherein each Anchor Box prediction comprises 4 bounding-box offsets t_x, t_y, t_w, t_h, 1 confidence t_0 and C detection target classes;
where the confidence is defined as:
confidence = Pr(object) × IOU(pred, truth)
Pr(object) represents the probability that a target exists in the Anchor Box (its value is 0 if no target exists), and IOU(pred, truth) represents the intersection-over-union of the predicted bounding box and the real bounding box (Ground Truth Box):
IOU(pred, truth) = area(box_pred ∩ box_truth) / area(box_pred ∪ box_truth)
each grid cell predicts C class probabilities, where Pr(class_i | object) represents the probability that the cell belongs to class i given that it contains a target, so the probability that a predicted Bounding Box belongs to class i is expressed as:
Pr(class_i | object) × Pr(object) × IOU(pred, truth) = Pr(class_i) × IOU(pred, truth)
step 5.22, obtaining the predicted bounding-box position from the predicted offsets of the Anchor Box relative to the label box, calculated as:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w × e^(t_w)
b_h = p_h × e^(t_h)
σ(t_0) = Pr(object) × IOU(b, object)
wherein σ(t_0) is the confidence corresponding to the predicted box, σ(·) is the Sigmoid normalization applied to t_x and t_y, (c_x, c_y) are the coordinates of the grid cell relative to the top-left corner of the feature map, (p_w, p_h) are the width and height of the anchor prior, and b_x, b_y, b_w, b_h form the final output bounding box.
10. A human behavior recognition system based on YOLOv3-SPP, comprising:
the target detection model building module is used for introducing a spatial pyramid pooling SPP module in a YOLOv3 network and building a target detection model based on YOLOv 3-SPP;
the Stanford40 preprocessing module is used to preprocess Stanford40: extracting from the Stanford40 label files the labeling information of five types of human behavior targets, namely general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects and body actions interacting with other human bodies, and converting the five types of labeling information into the format supported under the Darknet framework;
the detection scale allocation module is used for re-clustering the label boxes converted in step 2 to the format supported by the Darknet framework by using the k-means algorithm according to the image resolution of the training set in Stanford40 to obtain new initial Anchor Boxes, and allocating a corresponding number of Anchor Boxes to each detection scale in the YOLOv3-SPP target detection model according to the set Anchor Box allocation rule;
the detection model training and evaluation module is used for respectively inputting the training set and the validation set of Stanford40 into the YOLOv3-SPP target detection model for training and for evaluating the detection model;
and the detection module is used for detecting the test video with the trained YOLOv3-SPP target detection model, identifying the action in each frame of the video, and finally splicing the detection results back into a video.
CN202110364743.7A 2021-04-02 2021-04-02 Human behavior recognition method and system based on YOLOv3-SPP Pending CN113536885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110364743.7A CN113536885A (en) 2021-04-02 2021-04-02 Human behavior recognition method and system based on YOLOv3-SPP

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110364743.7A CN113536885A (en) 2021-04-02 2021-04-02 Human behavior recognition method and system based on YOLOv3-SPP

Publications (1)

Publication Number Publication Date
CN113536885A true CN113536885A (en) 2021-10-22

Family

ID=78094520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110364743.7A Pending CN113536885A (en) 2021-04-02 2021-04-02 Human behavior recognition method and system based on YOLOv3-SPP

Country Status (1)

Country Link
CN (1) CN113536885A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913438A (en) * 2022-03-28 2022-08-16 南京邮电大学 Yolov5 garden abnormal target identification method based on anchor frame optimal clustering

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334607A (en) * 2019-06-12 2019-10-15 武汉大学 A kind of video human interbehavior recognition methods and system
CN110414421A (en) * 2019-07-25 2019-11-05 电子科技大学 A kind of Activity recognition method based on sequential frame image
CN110807429A (en) * 2019-10-23 2020-02-18 西安科技大学 Construction safety detection method and system based on tiny-YOLOv3
CN111709381A (en) * 2020-06-19 2020-09-25 桂林电子科技大学 Road environment target detection method based on YOLOv3-SPP
CN111814621A (en) * 2020-06-29 2020-10-23 中国科学院合肥物质科学研究院 Multi-scale vehicle and pedestrian detection method and device based on attention mechanism
WO2020215961A1 (en) * 2019-04-25 2020-10-29 北京工业大学 Personnel information detection method and system for indoor climate control

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020215961A1 (en) * 2019-04-25 2020-10-29 北京工业大学 Personnel information detection method and system for indoor climate control
CN110334607A (en) * 2019-06-12 2019-10-15 武汉大学 A kind of video human interbehavior recognition methods and system
CN110414421A (en) * 2019-07-25 2019-11-05 电子科技大学 A kind of Activity recognition method based on sequential frame image
CN110807429A (en) * 2019-10-23 2020-02-18 西安科技大学 Construction safety detection method and system based on tiny-YOLOv3
CN111709381A (en) * 2020-06-19 2020-09-25 桂林电子科技大学 Road environment target detection method based on YOLOv3-SPP
CN111814621A (en) * 2020-06-29 2020-10-23 中国科学院合肥物质科学研究院 Multi-scale vehicle and pedestrian detection method and device based on attention mechanism

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913438A (en) * 2022-03-28 2022-08-16 南京邮电大学 Yolov5 garden abnormal target identification method based on anchor frame optimal clustering

Similar Documents

Publication Publication Date Title
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN109543695B (en) Population-density population counting method based on multi-scale deep learning
CN106897670B (en) Express violence sorting identification method based on computer vision
CN112418117B (en) Small target detection method based on unmanned aerial vehicle image
CN111709310B (en) Gesture tracking and recognition method based on deep learning
CN111476827B (en) Target tracking method, system, electronic device and storage medium
CN110781838A (en) Multi-modal trajectory prediction method for pedestrian in complex scene
CN111832489A (en) Subway crowd density estimation method and system based on target detection
Khan et al. Advances and trends in real time visual crowd analysis
CN110728252B (en) Face detection method applied to regional personnel motion trail monitoring
Zheng et al. Cross-line pedestrian counting based on spatially-consistent two-stage local crowd density estimation and accumulation
CN112001278A (en) Crowd counting model based on structured knowledge distillation and method thereof
CN108256462A (en) A kind of demographic method in market monitor video
CN106951834B (en) Fall-down action detection method based on old-age robot platform
CN113158983A (en) Airport scene activity behavior recognition method based on infrared video sequence image
CN114372503A (en) Cluster vehicle motion trail prediction method
CN113808166A (en) Single-target tracking method based on clustering difference and depth twin convolutional neural network
CN111738164A (en) Pedestrian detection method based on deep learning
CN113536885A (en) Human behavior recognition method and system based on YOLOv3-SPP
CN117765258A (en) Large-scale point cloud semantic segmentation method based on density self-adaption and attention mechanism
CN116453048A (en) Crowd counting method combined with learning attention mechanism
CN114246767B (en) Blind person intelligent navigation glasses system and device based on cloud computing
CN114663835A (en) Pedestrian tracking method, system, equipment and storage medium
Xu et al. Crowd density estimation of scenic spots based on multifeature ensemble learning
CN112597842A (en) Movement detection facial paralysis degree evaluation system based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination