CN113536885A - Human behavior recognition method and system based on YOLOv3-SPP - Google Patents

Human behavior recognition method and system based on YOLOv3-SPP

Info

Publication number
CN113536885A
CN113536885A (application CN202110364743.7A)
Authority
CN
China
Prior art keywords
yolov3
spp
box
anchor
detection model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110364743.7A
Other languages
Chinese (zh)
Inventor
贠卫国
南星辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Architecture and Technology
Original Assignee
Xian University of Architecture and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Architecture and Technology filed Critical Xian University of Architecture and Technology
Priority to CN202110364743.7A priority Critical patent/CN113536885A/en
Publication of CN113536885A publication Critical patent/CN113536885A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

A human behavior recognition method and system based on YOLOv3-SPP. The method introduces an SPP module into the YOLOv3 network, adjusts the network resolution according to the size of the training-set images, re-clusters the initial Anchor Boxes (anchor frames), and adjusts the number of detection categories, converting the multi-category detection and classification problem into the detection and classification of five types of human behavior targets in a given scene: general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects, and body actions interacting with other human bodies. By fusing features of different scales, the method achieves a better detection effect on human behaviors at high density and fine granularity, with fewer missed detections. The method improves the detection effect and the detection speed and reduces missed detections.

Description

Human behavior recognition method and system based on YOLOv3-SPP
Technical Field
The invention belongs to the field of behavior detection in deep learning, and particularly relates to a human behavior identification method and system based on YOLOv 3-SPP.
Background
Traditional video analysis technology relies on manually selected features, which leads to low accuracy, and shallow learning cannot handle large-scale data. Deep learning overcomes these problems well, giving video analysis higher recognition accuracy, better robustness and richer recognition categories.
Most current video analysis classifies abnormal behavior by comparing frames with one another. In this design, human targets are instead extracted and fed into a neural network that directly performs end-to-end abnormal behavior classification, realizing abnormal behavior detection for specific application scenes.
In intelligent video analysis, the temporal difference method and the optical flow method are generally used to extract moving targets from images. The temporal difference method adapts well to dynamic environments, but it cannot completely extract all relevant feature pixels, its recognition accuracy is relatively low, and it easily produces holes in the extracted regions. Most optical flow methods are computationally complex and have poor noise immunity; without special hardware they cannot process full-frame video streams in real time, which makes them costly to operate.
Disclosure of Invention
The invention aims to provide a method and a system for recognizing human body behaviors based on YOLOv3-SPP, so as to solve the problems.
In order to achieve the purpose, the invention adopts the following technical scheme:
a human behavior recognition method based on YOLOv3-SPP comprises the following steps:
step 1, introducing a spatial pyramid pooling SPP module in a YOLOv3 network, and constructing a target detection model based on YOLOv 3-SPP;
step 2, preprocessing Stanford40 (the Stanford human behavior data set): extracting from the Stanford40 label files the labeling information of five types of human behavior targets, namely general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects and body actions interacting with other human bodies, and converting the five types of labeling information into the format supported by the Darknet (YOLO feature extraction network) framework;
step 3, according to the image resolution of the Stanford40 (Stanford human behavior data set) human activity data set, re-clustering the label boxes converted in step 2 to the Darknet (YOLO feature extraction network)-supported format by using the k-means algorithm to obtain new initial Anchor Boxes (anchor frames), and allocating a corresponding number of Anchor Boxes to each detection scale in the YOLOv3-SPP target detection model according to the Anchor Box allocation rule set by Alexey Bochkovskiy (author of the YOLO series);
step 4, respectively inputting a training set and a verification set in Stanford40 (Stanford human behavior data set) into a YOLOv3-SPP target detection model for training and evaluating the detection model;
and 5, detecting the test video by using the YOLOv3-SPP target detection model trained in the step 4, identifying the action in each frame of the video, and finally splicing the detection result into the video again.
Further, step 1 specifically includes the following steps:
step 1.1, the SPP module consists of four parallel pooling layers with kernel sizes of 1 × 1, 5 × 5, 9 × 9 and 13 × 13 respectively, and is inserted between the 5th convolution and the 6th convolution of the first detection scale in the YOLOv3 network;
and step 1.2, completing construction of a target detection model based on YOLOv3-SPP, and realizing fusion of features with different scales.
Further, the step 2 specifically comprises the following steps:
step 2.1, extracting from the Stanford40 (Stanford human behavior data set) label files the labeling information of the five types of human behavior targets, namely general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects and body actions interacting with other human bodies;
Step 2.2, writing the five types of behavior marking information in the step 2.1 into an XML file named by pictures;
step 2.3, the Stanford40 (Stanford human behavior data set) data set file directory structure is converted into the file directory structure shaped like a PASCAL VOC data set file.
Further, step 2.2 specifically includes the following steps:
step 2.21, writing the five types of behavior marking information in the step 2.1 into an XML file named by pictures;
step 2.22, the conversion is designed as follows:
x_center = (box_xmin + box_xmax) / (2 × picture_width)
y_center = (box_ymin + box_ymax) / (2 × picture_height)
width = (box_xmax − box_xmin) / picture_width
height = (box_ymax − box_ymin) / picture_height
wherein: x_center is the x-axis center coordinate of the anchor frame and y_center is its y-axis center coordinate; box_xmin and box_xmax are the minimum and maximum x-axis coordinates of the anchor frame (box_ymin and box_ymax likewise for the y-axis); picture_width and picture_height are the width and height of the original picture; width and height are the width and height of the anchor frame;
the labeling information is thereby converted into the format used under the Darknet (YOLO feature extraction network) framework;
step 2.23, checking that the TXT label line of each converted picture has the format:
<object-class> <x_center> <y_center> <width> <height>
wherein: object-class is the category, x_center and y_center are the center coordinates of the anchor frame, and width and height are its width and height (a minimal sketch of this conversion follows below).
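By way of illustration, the following is a minimal Python sketch of the conversion in steps 2.22 and 2.23, assuming the annotation provides pixel corner coordinates and the image size; the function name and the example values are illustrative and not taken from the original filing.

```python
# Sketch of step 2.22: corner-format box (xmin, ymin, xmax, ymax) in pixels
# -> Darknet/YOLO label line "<object-class> <x_center> <y_center> <width> <height>"
# with all geometry normalized to [0, 1] by the picture size.
def box_to_yolo(box_xmin, box_ymin, box_xmax, box_ymax,
                picture_width, picture_height, object_class):
    x_center = (box_xmin + box_xmax) / (2 * picture_width)
    y_center = (box_ymin + box_ymax) / (2 * picture_height)
    width = (box_xmax - box_xmin) / picture_width
    height = (box_ymax - box_ymin) / picture_height
    return f"{object_class} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Example: a 200x400-pixel box with top-left corner (100, 50) in a
# 1280x720 picture, labelled with class index 2.
print(box_to_yolo(100, 50, 300, 450, 1280, 720, 2))
```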
Further, step 3 specifically includes the following steps:
step 3.1, observing the coordinate distribution of the label boxes of the training set in Stanford40 (Stanford human behavior data set), and randomly selecting k cluster centers (w_i, h_i), i ∈ {1, 2, …, k}, where w_i and h_i are the width and height of the box;
step 3.2, respectively calculating the distance d between each label box and each cluster center, using:
d(box, centroid) = 1 − IOU(box, centroid)
(a minimal clustering sketch is given after step 3.6.2 below);
step 3.3, recalculating the average value of the width and height of the labeling frames to which the k cluster centers belong as a new cluster center;
step 3.4, repeating the steps 3.2 and 3.3, and outputting a clustering result when the clustering center is not changed any more;
step 3.5, outputting the final clustering result;
step 3.6, respectively allocating 2, 1 and 6 Anchor Box (Anchor frames) for three detection scales in the YOLOv3-SPP target detection model;
step 3.6 specifically comprises the following steps:
step 3.6.1, adjusting the number of filters of all YOLO layers in a YOLOv3-SPP network structure;
and 3.6.2, changing the corresponding MASK in the configuration file.
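The sketch referenced in step 3.2 is given here: a minimal NumPy implementation of IOU-based k-means over normalized (width, height) pairs; the function names and the random stand-in data are illustrative, not the actual Stanford40 label boxes.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) boxes and (w, h) centroids, both treated as if
    anchored at the same corner (the usual trick for anchor clustering)."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, None, 0] * boxes[:, None, 1] +
             centroids[None, :, 0] * centroids[None, :, 1] - inter)
    return inter / union

def kmeans_anchors(boxes, k, iters=300, seed=0):
    """k-means with distance d = 1 - IOU(box, centroid), as in steps 3.1-3.5."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # assign each box to its closest centroid under d = 1 - IOU
        assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.argsort(centroids.prod(axis=1))]  # sorted by area

# Example with random (w, h) pairs standing in for the Stanford40 label boxes.
boxes = np.random.default_rng(1).uniform(0.05, 0.9, size=(500, 2))
print(np.round(kmeans_anchors(boxes, k=9), 3))
```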
Further, step 4 specifically includes the following steps:
step 4.1, taking a model parameter Darknet53.conv.74 trained in advance on the ImageNet data set as an initialization weight to reduce training time;
step 4.2, setting a training hyper-parameter of the network model to obtain a behavior target detection model based on YOLOv 3-SPP;
and 4.3, inputting the human behavior images of the Stanford40 (Stanford human behavior data set) validation set into the YOLOv3-SPP-based behavior target detection model to obtain the evaluation indexes of the YOLOv3-SPP-based behavior target detection model.
Further, step 4.2 specifically includes the following steps:
step 4.21, setting a training hyper-parameter of the network model;
step 4.22, using pictures in Stanford40 (Stanford human behavior data set) data set as training input;
and 4.23, further performing network training by using a Darknet-53 deep learning framework, and obtaining a behavior target detection model based on YOLOv3-SPP when the training average loss reaches a stable value and is not reduced any more.
Further, step 5 specifically includes the following steps:
step 5.1, adjusting the resolution of the test data set pictures to 1280×720, inputting them into the YOLOv3-SPP target detection model trained in step 4, performing feature extraction with up to 32× down-sampling, and finally outputting feature maps at three scales from the network;
step 5.2, distributing different Anchor Box (Anchor frame) for each grid of each scale to detect;
step 5.3, aiming at the overlapped detection frames, inhibiting the detection frames with lower confidence coefficient and higher overlap rate than a set threshold value through an NMS algorithm to obtain an optimal detection frame;
and 5.4, framing the target position by using a rectangular frame in the behavior picture to be detected and marking the category of the behavior picture.
Further, step 5.2 specifically includes the following steps:
step 5.21, allocating 2, 1 and 6 different Anchor Boxes (anchor frames) to each grid cell of the three scales for detection, wherein each Anchor Box prediction comprises 4 bounding-box offsets t_x, t_y, t_w, t_h, 1 confidence t_0 and C detection target classes;
where the confidence is defined as:
confidence = Pr(object) × IOU(pred, truth)
Pr(object) represents the probability that a target exists in the Anchor Box (its value is 0 if no target exists), and IOU(pred, truth) represents the intersection-over-union of the predicted bounding box and the real bounding box (Ground Truth Box):
IOU(pred, truth) = area(box_pred ∩ box_truth) / area(box_pred ∪ box_truth)
each grid cell also predicts C class probabilities, where Pr(class_i | object) represents the probability that the cell belongs to class i given that it contains a target, so the probability that a predicted Bounding Box belongs to class i is expressed as:
Pr(class_i | object) × Pr(object) × IOU(pred, truth) = Pr(class_i) × IOU(pred, truth)
step 5.22, obtaining the predicted bounding-box position from the predicted offsets of the Anchor Box relative to the label box, calculated as:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w × e^(t_w)
b_h = p_h × e^(t_h)
σ(t_0) = Pr(object) × IOU(b, object)
wherein σ(t_0) is the confidence corresponding to the predicted box, σ(·) is the Sigmoid normalization applied to t_x and t_y, (c_x, c_y) are the coordinates of the grid cell relative to the top-left corner of the feature map, (p_w, p_h) are the width and height of the anchor prior, and b_x, b_y, b_w, b_h form the final output bounding box.
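For clarity, the following is a minimal Python sketch of the decoding in step 5.22, assuming t_x, t_y, t_w, t_h are raw network outputs, (c_x, c_y) the grid-cell offset and (p_w, p_h) the anchor prior, all in grid-cell units; the values in the example are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """YOLOv3-style decoding (step 5.22), all in grid-cell units:
    b_x = sigma(t_x) + c_x, b_y = sigma(t_y) + c_y,
    b_w = p_w * exp(t_w),   b_h = p_h * exp(t_h)."""
    return (sigmoid(t_x) + c_x,
            sigmoid(t_y) + c_y,
            p_w * math.exp(t_w),
            p_h * math.exp(t_h))

# Example: grid cell (7, 4) with an anchor prior of 3.6 x 2.8 grid units.
print(decode_box(0.2, -0.1, 0.3, 0.05, 7, 4, 3.6, 2.8))
```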
Further, a human behavior recognition system based on Yolov3-SPP comprises:
the target detection model building module is used for introducing a spatial pyramid pooling SPP module in a YOLOv3 network and building a target detection model based on YOLOv 3-SPP;
stanford40 (Stanford human behavior dataset) preprocessing module is used to preprocess Stanford40 (Stanford human behavior dataset): labeling information of five types of human behavior targets, namely facial action, facial action through object manipulation, whole body action, body action interacting with an object and body action interacting with a human body in a Stanford40 (Stanford human behavior data set) labeling file, and converting the five types of labeling information into a format supported under a Darknet (Yolo feature extraction network) framework;
the detection scale allocation module is used for re-clustering the label boxes converted in step 2 to the format supported by the Darknet (YOLO feature extraction network) framework by using the k-means algorithm according to the image resolution of the training set in the Stanford40 (Stanford human behavior data set) human activity data set to obtain new initial Anchor Boxes (anchor frames), and allocating a corresponding number of Anchor Boxes to each detection scale in the YOLOv3-SPP target detection model according to the Anchor Box allocation rule set by Alexey Bochkovskiy (author of the YOLO series);
the detection model training and evaluation module is used for respectively inputting the training set and the validation set of Stanford40 (Stanford human behavior data set) into the YOLOv3-SPP target detection model for training and for evaluating the detection model;
and the detection module is used for detecting the test video with the trained YOLOv3-SPP target detection model, identifying the action in each frame of the video, and finally splicing the detection results back into a video.
Compared with the prior art, the invention has the following technical effects:
In the method, an SPP module is introduced into the YOLOv3 network, the network resolution is adjusted according to the size of the training-set images, the initial Anchor Boxes (anchor frames) are re-clustered, and the number of detection categories is adjusted, converting the multi-category detection and classification problem into the detection and classification of five types of human behavior targets in a given scene: general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects, and body actions interacting with other human bodies. By fusing features of different scales, human behaviors at high density and fine granularity are detected better, with fewer missed detections.
The method improves the detection effect and the detection speed and reduces the missed detection behavior. Compared with the prior art, the invention has the following beneficial technical effects:
the invention first proposes a method for detecting, locating and identifying an action of interest in real time. Frames obtained from a continuous video data stream captured by a surveillance camera are accepted after a specified period of time and an action tag is given based on a single frame. Secondly, experiments prove that the YOLOv3 is an effective method, the speed of identification and positioning in a human activity data set is high, only a small group of frames or even one frame in a video is required in the model for accurate identification, and the YOLOv3 algorithm adopted in the optimization process is low in complexity and high in portability, which is very important in practical use. And further carrying out clustering analysis on the data set by adopting k-means clustering before training to obtain the prior condition size aiming at the data set, so that the training detection precision speed is improved. Furthermore, the invention adopts a freezing layer training method during training and iterates the learning rate to achieve the optimal training effect.
This design provides a human posture recognition method based on YOLOv3. Instead of the traditional frame-to-frame comparison used for abnormal behavior classification in videos, human targets are extracted and fed into a neural network that directly performs end-to-end abnormal behavior classification, which improves recognition accuracy and speed and reduces the complexity of the posture recognition algorithm.
Drawings
FIG. 1 is a flow chart of an embodiment;
FIG. 2 is a block diagram of a YOLOv3 network in an embodiment;
FIG. 3 is a schematic diagram of an embodiment of an SPP module;
FIG. 4 is a graph comparing loss value versus iteration number curves for training of the embodiment and the prior model;
FIG. 5 is a graph comparing accuracy-recall or PR curves for the example and the prior art model;
FIG. 6 is a diagram illustrating the detection results of the embodiment.
Detailed Description
The present invention will now be described in further detail with reference to specific examples, which are intended to be illustrative, but not limiting, of the invention.
This embodiment is implemented in the PyTorch deep learning framework. Hardware configuration: Intel(R) Core(TM) i7-7800X CPU @ 3.50 GHz (8 cores), 16 GB RAM, and an NVIDIA GeForce RTX 2080 Ti GPU with 10 GB of video memory. Software configuration: Linux, Python 3.6.
The evaluation index is mAP (mean Average Precision), i.e. the AP value averaged over the individual categories of the validation set.
The basic flow diagram of the system of the invention is shown in fig. 1, and the human body posture identification method based on YOLOv3-SPP comprises the following steps:
step 1, introducing a Spatial Pyramid Pooling (SPP for short) module into a YOLOv3 network, and constructing a target detection model based on YOLOv3-SPP, specifically comprising the following steps:
step 1.1, wherein the SPP module consists of four parallel pooling layers with Kernel Size of 1 × 1, 5 × 5, 9 × 9, 13 × 13, respectively, and is integrated between the 5 th and 6 th convolutions of the first detection scale in the YOLOv3 network.
And step 1.2, completing construction of a target detection model based on YOLOv3-SPP, and being used for realizing fusion of features with different scales, enriching the expression capability of a final feature map, and improving the detection effect when the scale difference of the behavior target is large in the environment.
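A minimal PyTorch sketch of the SPP block described in step 1.1 is given below, assuming the 1 × 1 branch is an identity pass-through and the 5 × 5, 9 × 9 and 13 × 13 branches are stride-1 max-pooling with same-size padding; the class and variable names are illustrative and not taken from the original filing.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling: concatenate the input with three stride-1
    max-pooled copies (5x5, 9x9, 13x13 kernels), preserving spatial size."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x):
        # The 1x1 "pooling" branch is the identity; channels grow 4x after concat.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

# Example: a 512-channel feature map becomes 2048 channels with the same H x W.
feat = torch.randn(1, 512, 19, 19)
print(SPP()(feat).shape)  # torch.Size([1, 2048, 19, 19])
```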
Step 2, preprocessing Stanford40 (the Stanford human behavior data set): extracting from the Stanford40 label files the labeling information of five types of human behavior targets, namely general facial actions (smiling, laughing, chewing, talking, etc.), facial actions performed through object manipulation (smoking, eating, drinking, etc.), whole-body actions (clapping, climbing stairs, diving, etc.), body actions interacting with objects (brushing teeth, mopping, dribbling, playing golf, etc.) and body actions interacting with other human bodies (fencing, hugging, kicking, kissing, boxing, shaking hands, etc.), and converting the five types of labeling information into the format supported by the Darknet (YOLO feature extraction network) framework, with the following specific steps:
step 2.1, extracting from the Stanford40 (Stanford human behavior data set) label files the labeling information of the five types of human behavior targets, namely general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects and body actions interacting with other human bodies;
Step 2.2, writing the five types of behavior marking information in the step 2.1 into an XML file named by pictures,
wherein the target location in the Stanford40 (Stanford human behavior data set) JSON file is given by the top-left corner coordinates (box_xmin, box_ymin) and the bottom-right corner coordinates (box_xmax, box_ymax) of the box; the conversion code maps the labeling information into the format used under the Darknet (YOLO feature extraction network) framework according to:
x_center = (box_xmin + box_xmax) / (2 × picture_width)
y_center = (box_ymin + box_ymax) / (2 × picture_height)
width = (box_xmax − box_xmin) / picture_width
height = (box_ymax − box_ymin) / picture_height
further, (x_center, y_center) are the center coordinates of the label box, width and height are its normalized width and height, and the TXT label line of each converted picture becomes:
<object-class> <x_center> <y_center> <width> <height>,
wherein: object-class is the category, x_center and y_center are the center coordinates of the anchor frame, and width and height are its width and height.
Step 2.3 converts the Stanford40 (Stanford human behavior dataset) dataset file directory structure into a file directory structure shaped like a PASCAL VOC dataset,
further, a TXT file with labeling information is placed in a Labels folder, a generated XML file is placed in an Annotation folder, pictures in a Stanford40 (Stanford human behavior data set) data set are placed in a JPEGImages folder, and names for model training and verifying pictures are written in train.txt and val.txt in a Main folder under an ImageSets directory respectively.
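As an illustration only, the following Python sketch recreates the directory layout just described; the root directory name is illustrative, and the XML folder is given the standard VOC spelling "Annotations".

```python
from pathlib import Path

# PASCAL VOC-style layout of step 2.3: Labels/ for the Darknet TXT files,
# Annotations/ for the XML files, JPEGImages/ for the Stanford40 pictures,
# and ImageSets/Main/{train,val}.txt for the split lists.
root = Path("Stanford40_VOC")  # illustrative root directory name
for sub in ("Labels", "Annotations", "JPEGImages", "ImageSets/Main"):
    (root / sub).mkdir(parents=True, exist_ok=True)
for split in ("train.txt", "val.txt"):
    (root / "ImageSets" / "Main" / split).touch()
print(sorted(str(p.relative_to(root)) for p in root.rglob("*")))
```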
Step 3, re-clustering and allocation: according to the 1280×720 image resolution of the training set in the Stanford40 (Stanford human behavior data set) data set, the label boxes converted in step 2 to the Darknet (YOLO feature extraction network)-supported format are re-clustered with the k-means algorithm to obtain new initial Anchor Boxes (anchor frames), and a corresponding number of Anchor Boxes is allocated to each detection scale in the YOLOv3-SPP target detection model according to the Anchor Box allocation rule set by Alexey Bochkovskiy (author of the YOLO series), with the following specific steps:
step 3.1, observing the coordinate distribution of the label boxes of the training set in Stanford40 (Stanford human behavior data set), and randomly selecting k cluster centers (w_i, h_i), i ∈ {1, 2, …, k}, where w_i and h_i are the width and height of the box;
step 3.2, respectively calculating the distance d between each label box and each cluster center:
d(box, centroid) = 1 − IOU(box, centroid)
where the numerator of the IOU is the intersection area of the anchor frame and the label box and its denominator is their union area; when the IOU value is largest, i.e. the label box and the anchor frame match best, d is smallest, and each label box is assigned to the cluster whose center is closest, i.e. for which d is smallest;
step 3.3, recalculating the average value of the width and height of the labeling frames to which the k cluster centers belong as a new cluster center;
step 3.4, repeating the steps 3.2 and 3.3, and outputting a clustering result when the clustering center is not changed any more;
step 3.5, finally outputting the clustering result: (10,13), (16,30), (33,23), (30,61), (62,45), (59,119), (116,90), (156,198), (373,326);
step 3.6, respectively allocating 2, 1 and 6 Anchor Boxes to the three detection scales in the YOLOv3-SPP target detection model, i.e. adjusting the number of filters of all YOLO layers in the YOLOv3-SPP network structure to (N + 5) × 3 = 135, where N is the number of allocated Anchor Boxes, and further changing the corresponding MASK entries in the configuration file to 7, 8; 6; and 0, 1, 2, 3, 5, 6.
Step 4, training and evaluating the example model, respectively inputting a training set and a verification set in a Stanford40 (Stanford human behavior data set) data set into a YOLOv3-SPP target detection model for training and evaluating the detection model, and specifically comprising the following steps:
step 4.1, adopting the model parameters Darknet53.conv.74 (weights of the Darknet-53 feature extraction network of YOLO pre-trained on the ImageNet data set) as the initialization weights to reduce training time;
step 4.2, setting the training hyper-parameters of the embodiment, namely network resolution, momentum, weight decay, base learning rate (base_lr), batch size, maximum number of iterations and the learning-rate adjustment strategy, taking the pictures in the Stanford40 (Stanford human behavior data set) data set as training input, performing network training with the Darknet-53 deep learning framework, and obtaining the YOLOv3-SPP-based behavior target detection model when the average training loss reaches a stable value and no longer decreases; the training hyper-parameters are set as shown in Table 1:
TABLE 1 Network training hyper-parameter settings (the table is reproduced as an image in the original publication)
The learning-rate adjustment strategy (policy) is set to epoch-based: when the iteration count reaches 100 and 120 epochs, the learning rate lr is reduced by a factor of 10; score_thresh is set to 0.25 and iou_thresh to 0.2. After the training parameters are configured, the loss-versus-iteration curves of the three network structures, YOLOv3-SPP, YOLOv3 and YOLOv3-tiny, trained on the NVIDIA GeForce RTX 2080 Ti (10 GB), are compared in FIG. 4;
when iteration is carried out to 60 epochs, the Loss value of the YOLOv3-SPP network converges to about 0.5, the YOLOv3 network converges to about 0.8, the fluctuation range of the Loss value of the training of the Tiny YOLOv3 network is large, and the network is unstable, so that the YOLOv3-SPP network can converge faster relative to the YOLOv3 and the Tiny YOLOv3, has better characteristic learning capability and has a lower Loss value under the same learning rate;
step 4.3, inputting the human behavior pictures in the Stanford40 (Stanford human behavior data set) verification set into a behavior target detection model based on YOLOv 3-SPP;
step 4.3.1, recording the network prediction results in a TXT file after layer-by-layer network computation, and obtaining by code the accuracy, recall, F1 value, detection rate (FPS) and P-R curve evaluation indexes of the YOLOv3-SPP-based behavior target detection model.
Step 4.3.2, in order to analyze the model detection performance more comprehensively, the trained three models, namely YOLOv3-SPP, YOLOv3 and YOLOv3-Tiny, are subjected to performance evaluation on a verification set picture, the GPU adopts RTX2080Ti, and specific indexes are shown in table 2:
TABLE 2 Comparison of evaluation indexes of different models (the table is reproduced as an image in the original publication)
The YOLOv3-SPP network model achieves the best detection performance: its accuracy, recall and F1 value reach 78.90%, 92.20% and 0.853, improvements of 14.7%, 11.4% and 0.16 respectively over the YOLOv3 network. The YOLOv3-Tiny network has few layers, a simple structure and low evaluation indexes, and can hardly meet the detection requirements of complex backgrounds and large target-scale differences in crowded environments. The YOLOv3-SPP network involves a large amount of convolution operations, so its detection rate is relatively slow, but it basically meets the real-time requirement;
step 4.3.3, in order to comprehensively measure the detection performance of the model, drawing a precision-recall ratio (PR) curve chart as shown in FIG. 5, wherein the area under the curve is the average precision ratio AP, and the higher the AP is, the better the detection performance of the model is;
wherein red represents the YOLOv3-SPP network PR curve, green represents the YOLOv3 network PR curve, and blue represents the YOLOv3-tiny network PR curve, as can be seen from FIG. 5,
the average precision of YOLOv3-SPP reaches 78.90%, clearly better than that of the YOLOv3 network; the mAP of the body actions involving human-to-human interaction is relatively low because such actions differ greatly between individuals, but it is still better than that of the other YOLO models.
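For reference, the following is a minimal sketch of how an average-precision value such as those in Table 2 is read off a precision-recall curve (the area under the curve mentioned in step 4.3.3), assuming PR points sorted by ascending recall; the example points are made up and are not the measured curves of FIG. 5.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the PR curve with the 'all points' interpolation:
    make precision monotonically non-increasing, then integrate over recall."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):   # precision envelope from the right
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])

# Example with illustrative PR points.
recall = np.array([0.1, 0.4, 0.7, 0.9])
precision = np.array([0.95, 0.9, 0.8, 0.6])
print(round(average_precision(recall, precision), 4))
```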
Step 5, carrying out target detection on the crowd behaviors in Stanford40 (Stanford human behavior data set) by using the YOLOv3-SPP target detection model trained in the step 4, wherein the specific detection process comprises the following steps:
step 5.1, adjusting the picture resolution of the intercepted video frames in the test video data set to 1280x720, inputting the picture resolution into the YOLOv3-SPP target detection model trained in the step 4, and finally outputting a feature map with three scales through a network after 32 times of downsampling feature extraction;
step 5.2, allocating 2, 1 and 6 different Anchor Boxes (anchor frames) to each grid cell of each scale for detection, wherein each Anchor Box prediction comprises 4 bounding-box offsets t_x, t_y, t_w, t_h, 1 confidence t_0 and C detection target classes; the confidence is defined as:
confidence = Pr(object) × IOU(pred, truth)
where Pr(object) represents the probability that a target exists in the Anchor Box (its value is 0 if no target exists), and IOU(pred, truth) represents the intersection-over-union of the predicted bounding box and the real bounding box (Ground Truth Box):
IOU(pred, truth) = area(box_pred ∩ box_truth) / area(box_pred ∪ box_truth)
further, each grid cell predicts C class probabilities, where Pr(class_i | object) represents the probability that the cell belongs to class i given that it contains a target, so the probability that a predicted Bounding Box belongs to class i is expressed as:
Pr(class_i | object) × Pr(object) × IOU(pred, truth) = Pr(class_i) × IOU(pred, truth)
the predicted bounding-box position is obtained from the predicted offsets of the Anchor Box (anchor frame) relative to the label box, calculated as:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w × e^(t_w)
b_h = p_h × e^(t_h)
σ(t_0) = Pr(object) × IOU(b, object)
wherein σ(t_0) is the confidence corresponding to the predicted box, σ(·) is the Sigmoid normalization applied to t_x and t_y, (c_x, c_y) are the coordinates of the grid cell relative to the top-left corner of the feature map, (p_w, p_h) are the width and height of the anchor prior, and b_x, b_y, b_w, b_h form the final output bounding box;
step 5.3, for overlapping detection boxes, suppressing through the NMS algorithm the boxes whose confidence is lower and whose overlap with a higher-confidence box exceeds the set threshold, so as to obtain the optimal detection boxes (a minimal sketch of this step is given after step 5.4 below);
and 5.4, framing the target position by using a rectangular frame in the human behavior picture and marking the category of the target position.
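The sketch referenced in step 5.3: a minimal greedy NMS, assuming detections are given as (x1, y1, x2, y2) boxes with confidence scores; the threshold and example boxes are illustrative.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, suppress boxes whose IOU
    with it exceeds iou_thresh, then repeat with the remaining boxes."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

# Example: two heavily overlapping detections and one separate detection.
boxes = np.array([[10, 10, 110, 210], [12, 14, 108, 200], [300, 40, 380, 160]], float)
scores = np.array([0.92, 0.85, 0.70])
print(nms(boxes, scores))  # [0, 2]
```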
The detection results are shown in FIG. 6. From the detection results it can be seen that the method of this embodiment, by introducing an SPP module into the YOLOv3 network, fusing features of different scales, adjusting the network resolution according to the size of the training-set images, re-clustering the initial Anchor Boxes and adjusting the number of detection categories, converts the multi-category detection and classification problem into the detection and classification of five types of human behavior targets in a given scene, namely general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects and body actions interacting with other human bodies, and achieves a better detection effect on human behaviors at high density and fine granularity, with fewer missed detections.
This design provides a human posture recognition method based on YOLOv3. Instead of the traditional frame-to-frame comparison used for abnormal behavior classification in videos, human targets are extracted and fed into a neural network that directly performs end-to-end abnormal behavior classification. Under the same conditions, performance is greatly improved compared with existing algorithms, so recognition accuracy and speed are improved and the complexity of the posture recognition algorithm is reduced.

Claims (10)

1. A human behavior recognition method based on YOLOv3-SPP is characterized by comprising the following steps:
step 1, introducing a spatial pyramid pooling SPP module in a YOLOv3 network, and constructing a target detection model based on YOLOv 3-SPP;
step 2, preprocessing Stanford40, the Stanford human behavior data set: extracting from the Stanford40 label files the labeling information of five types of human behavior targets, namely general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects and body actions interacting with other human bodies, and converting the five types of labeling information into the format supported by the YOLO feature extraction network Darknet framework;
step 3, according to the image resolution of the training set in Stanford40, re-clustering the label boxes converted in step 2 to the Darknet-supported format by using the k-means clustering algorithm to obtain new initial Anchor Boxes, and allocating a corresponding number of Anchor Boxes to each detection scale in the YOLOv3-SPP target detection model according to the set Anchor Box allocation rule;
step 4, respectively inputting the training set and the verification set in the Stanford40 into a YOLOv3-SPP target detection model for training and evaluating the detection model;
and 5, detecting the test video by using the YOLOv3-SPP target detection model trained in the step 4, identifying the action in each frame of the video, and finally splicing the detection result into the video again.
2. The method for recognizing human body behaviors based on YOLOv3-SPP according to claim 1, wherein the step 1 specifically comprises the following steps:
step 1.1, the SPP module consists of four parallel pooling layers with kernel sizes (Kernel Size) of 1 × 1, 5 × 5, 9 × 9 and 13 × 13 respectively, and the SPP module is inserted between the 5th convolution and the 6th convolution of the first detection scale in the YOLOv3 network;
and step 1.2, completing construction of a target detection model based on YOLOv3-SPP, and realizing fusion of features with different scales.
3. The method for recognizing human body behaviors based on YOLOv3-SPP according to claim 1, wherein the step 2 specifically comprises the following steps:
step 2.1, extracting from the Stanford40 label files the labeling information of the five types of human behavior targets, namely general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects and body actions interacting with other human bodies;
Step 2.2, writing the five types of behavior marking information in the step 2.1 into an XML file named by pictures;
and 2.3, converting the Stanford40 data set file directory structure into a file directory structure shaped like a PASCAL VOC data set.
4. The method for recognizing human body behaviors based on YOLOv3-SPP according to claim 3, wherein the step 2.2 specifically comprises the following steps:
step 2.21, writing the five types of behavior marking information in the step 2.1 into an XML file named by pictures;
step 2.22, the conversion is designed as follows:
x_center = (box_xmin + box_xmax) / (2 × picture_width)
y_center = (box_ymin + box_ymax) / (2 × picture_height)
width = (box_xmax − box_xmin) / picture_width
height = (box_ymax − box_ymin) / picture_height
wherein: x_center is the x-axis center coordinate of the anchor frame and y_center is its y-axis center coordinate; box_xmin and box_xmax are the minimum and maximum x-axis coordinates of the anchor frame (box_ymin and box_ymax likewise for the y-axis); picture_width and picture_height are the width and height of the original picture; width and height are the width and height of the anchor frame;
the labeling information is thereby converted into the format used under the Darknet framework;
step 2.23, checking that the TXT label line of each converted picture has the format:
<object-class> <x_center> <y_center> <width> <height>;
wherein: object-class is the category, x_center and y_center are the center coordinates of the anchor frame, and width and height are its width and height.
5. The method for recognizing human body behaviors based on YOLOv3-SPP according to claim 1, wherein the step 3 specifically comprises the following steps:
step 3.1, observing the coordinate distribution of the label boxes of the Stanford40 training set, and randomly selecting k cluster centers (w_i, h_i), i ∈ {1, 2, …, k}, where w_i and h_i are the width and height of the box;
step 3.2, respectively calculating the distance d between each label box and each cluster center, using:
d(box, centroid) = 1 − IOU(box, centroid);
step 3.3, recalculating the average value of the width and height of the labeling frames to which the k cluster centers belong as a new cluster center;
step 3.4, repeating the steps 3.2 and 3.3, and outputting a clustering result when the clustering center is not changed any more;
step 3.5, outputting the final clustering result;
step 3.6, respectively allocating 2, 1 and 6 Anchor Box for three detection scales in the YOLOv3-SPP target detection model;
step 3.6 specifically comprises the following steps:
step 3.6.1, adjusting the number of filters of all YOLO layers in a YOLOv3-SPP network structure;
and 3.6.2, changing the corresponding MASK in the configuration file.
6. The method for recognizing human body behaviors based on YOLOv3-SPP according to claim 1, wherein the step 4 specifically comprises the following steps:
step 4.1, taking a model parameter Darknet53.conv.74 trained in advance on the ImageNet data set as an initialization weight to reduce training time;
step 4.2, setting a training hyper-parameter of the network model to obtain a behavior target detection model based on YOLOv 3-SPP;
and 4.3, inputting the human behavior pictures of the Stanford40 validation set into the YOLOv3-SPP-based behavior target detection model to obtain the evaluation indexes of the behavior target detection model based on the YOLOv3-SPP network.
7. The method for recognizing human body behaviors based on YOLOv3-SPP according to claim 6, wherein the step 4.2 specifically comprises the following steps:
step 4.21, setting a training hyper-parameter of the network model;
step 4.22, using the pictures in the Stanford40 data set as training input;
and 4.23, further performing network training by using a Darknet-53 deep learning framework, and obtaining a behavior target detection model based on YOLOv3-SPP when the training average loss reaches a stable value and is not reduced any more.
8. The method for recognizing human body behaviors based on YOLOv3-SPP according to claim 1, wherein the step 5 specifically comprises the following steps:
step 5.1, adjusting the resolution of the test data set picture to 1280x720, inputting the test data set picture into the Yolov3-SPP target detection model trained in the step 4, further extracting the down-sampling features by 32 times, and finally outputting the feature pictures with three scales through a network;
step 5.2, distributing different Anchor Box for each grid of each scale for detection;
step 5.3, aiming at the overlapped detection frames, inhibiting the detection frames with lower confidence coefficient and higher overlap rate than a set threshold value through an NMS algorithm to obtain an optimal detection frame;
and 5.4, framing the target position by using a rectangular frame in the behavior picture to be detected and marking the category of the behavior picture.
9. The method for recognizing human body behaviors based on YOLOv3-SPP according to claim 8, wherein the step 5.2 specifically comprises the following steps:
step 5.21, allocating 2, 1 and 6 different Anchor Boxes to each grid cell of each scale for detection, wherein each Anchor Box prediction comprises 4 bounding-box offsets t_x, t_y, t_w, t_h, 1 confidence t_0 and C detection target classes;
where the confidence is defined as:
confidence = Pr(object) × IOU(pred, truth)
Pr(object) represents the probability that a target exists in the Anchor Box (its value is 0 if no target exists), and IOU(pred, truth) represents the intersection-over-union of the predicted bounding box and the real bounding box (Ground Truth Box):
IOU(pred, truth) = area(box_pred ∩ box_truth) / area(box_pred ∪ box_truth)
each grid cell predicts C class probabilities, where Pr(class_i | object) represents the probability that the cell belongs to class i given that it contains a target, so the probability that a predicted Bounding Box belongs to class i is expressed as:
Pr(class_i | object) × Pr(object) × IOU(pred, truth) = Pr(class_i) × IOU(pred, truth)
step 5.22, obtaining the predicted bounding-box position from the predicted offsets of the Anchor Box relative to the label box, calculated as:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w × e^(t_w)
b_h = p_h × e^(t_h)
σ(t_0) = Pr(object) × IOU(b, object)
wherein σ(t_0) is the confidence corresponding to the predicted box, σ(·) is the Sigmoid normalization applied to t_x and t_y, (c_x, c_y) are the coordinates of the grid cell relative to the top-left corner of the feature map, (p_w, p_h) are the width and height of the anchor prior, and b_x, b_y, b_w, b_h form the final output bounding box.
10. A human behavior recognition system based on YOLOv3-SPP, comprising:
the target detection model building module is used for introducing a spatial pyramid pooling SPP module in a YOLOv3 network and building a target detection model based on YOLOv 3-SPP;
the Stanford40 preprocessing module is used to preprocess Stanford40: extracting from the Stanford40 label files the labeling information of five types of human behavior targets, namely general facial actions, facial actions performed through object manipulation, whole-body actions, body actions interacting with objects and body actions interacting with other human bodies, and converting the five types of labeling information into the format supported under the Darknet framework;
the detection scale allocation module is used for re-clustering the label boxes converted in step 2 to the format supported by the Darknet framework by using the k-means algorithm according to the image resolution of the training set in Stanford40 to obtain new initial Anchor Boxes, and allocating a corresponding number of Anchor Boxes to each detection scale in the YOLOv3-SPP target detection model according to the set Anchor Box allocation rule;
the detection model training and evaluation module is used for respectively inputting the training set and the validation set of Stanford40 into the YOLOv3-SPP target detection model for training and for evaluating the detection model;
and the detection module is used for detecting the test video with the trained YOLOv3-SPP target detection model, identifying the action in each frame of the video, and finally splicing the detection results back into a video.
CN202110364743.7A 2021-04-02 2021-04-02 Human behavior recognition method and system based on YOLOv3-SPP Pending CN113536885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110364743.7A CN113536885A (en) 2021-04-02 2021-04-02 Human behavior recognition method and system based on YOLOv3-SPP

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110364743.7A CN113536885A (en) 2021-04-02 2021-04-02 Human behavior recognition method and system based on YOLOv3-SPP

Publications (1)

Publication Number Publication Date
CN113536885A true CN113536885A (en) 2021-10-22

Family

ID=78094520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110364743.7A Pending CN113536885A (en) 2021-04-02 2021-04-02 Human behavior recognition method and system based on YOLOv3-SPP

Country Status (1)

Country Link
CN (1) CN113536885A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913438A (en) * 2022-03-28 2022-08-16 南京邮电大学 Yolov5 garden abnormal target identification method based on anchor frame optimal clustering

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334607A (en) * 2019-06-12 2019-10-15 武汉大学 A kind of video human interbehavior recognition methods and system
CN110414421A (en) * 2019-07-25 2019-11-05 电子科技大学 A kind of Activity recognition method based on sequential frame image
CN110807429A (en) * 2019-10-23 2020-02-18 西安科技大学 Construction safety detection method and system based on tiny-YOLOv3
CN111709381A (en) * 2020-06-19 2020-09-25 桂林电子科技大学 Road environment target detection method based on YOLOv3-SPP
CN111814621A (en) * 2020-06-29 2020-10-23 中国科学院合肥物质科学研究院 Multi-scale vehicle and pedestrian detection method and device based on attention mechanism
WO2020215961A1 (en) * 2019-04-25 2020-10-29 北京工业大学 Personnel information detection method and system for indoor climate control

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020215961A1 (en) * 2019-04-25 2020-10-29 北京工业大学 Personnel information detection method and system for indoor climate control
CN110334607A (en) * 2019-06-12 2019-10-15 武汉大学 A kind of video human interbehavior recognition methods and system
CN110414421A (en) * 2019-07-25 2019-11-05 电子科技大学 A kind of Activity recognition method based on sequential frame image
CN110807429A (en) * 2019-10-23 2020-02-18 西安科技大学 Construction safety detection method and system based on tiny-YOLOv3
CN111709381A (en) * 2020-06-19 2020-09-25 桂林电子科技大学 Road environment target detection method based on YOLOv3-SPP
CN111814621A (en) * 2020-06-29 2020-10-23 中国科学院合肥物质科学研究院 Multi-scale vehicle and pedestrian detection method and device based on attention mechanism

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913438A (en) * 2022-03-28 2022-08-16 南京邮电大学 Yolov5 garden abnormal target identification method based on anchor frame optimal clustering

Similar Documents

Publication Publication Date Title
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN109543695B (en) Population-density population counting method based on multi-scale deep learning
CN106897670B (en) Express violence sorting identification method based on computer vision
CN112418117B (en) Small target detection method based on unmanned aerial vehicle image
CN111709310B (en) Gesture tracking and recognition method based on deep learning
CN111476827B (en) Target tracking method, system, electronic device and storage medium
CN110781838A (en) Multi-modal trajectory prediction method for pedestrian in complex scene
CN111832489A (en) Subway crowd density estimation method and system based on target detection
Khan et al. Advances and trends in real time visual crowd analysis
CN110728252B (en) Face detection method applied to regional personnel motion trail monitoring
Zheng et al. Cross-line pedestrian counting based on spatially-consistent two-stage local crowd density estimation and accumulation
CN112001278A (en) Crowd counting model based on structured knowledge distillation and method thereof
CN108256462A (en) A kind of demographic method in market monitor video
CN106951834B (en) Fall-down action detection method based on old-age robot platform
CN113158983A (en) Airport scene activity behavior recognition method based on infrared video sequence image
CN114372503A (en) Cluster vehicle motion trail prediction method
CN113808166A (en) Single-target tracking method based on clustering difference and depth twin convolutional neural network
CN111738164A (en) Pedestrian detection method based on deep learning
CN113536885A (en) Human behavior recognition method and system based on YOLOv3-SPP
CN117765258A (en) Large-scale point cloud semantic segmentation method based on density self-adaption and attention mechanism
CN116453048A (en) Crowd counting method combined with learning attention mechanism
CN114246767B (en) Blind person intelligent navigation glasses system and device based on cloud computing
CN114663835A (en) Pedestrian tracking method, system, equipment and storage medium
Xu et al. Crowd density estimation of scenic spots based on multifeature ensemble learning
CN112597842A (en) Movement detection facial paralysis degree evaluation system based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination