CN113065431B

CN113065431B - Human body violation prediction method based on hidden Markov model and recurrent neural network

Info

Publication number: CN113065431B
Application number: CN202110302219.7A
Authority: CN
Inventors: 包梓群; 张娜; 邵一鸣; 许铭洋; 马云龙; 马铉钧; 包晓安
Original assignee: Zhejiang Sci Tech University ZSTU
Current assignee: Zhejiang Sci Tech University ZSTU
Priority date: 2021-03-22
Filing date: 2021-03-22
Publication date: 2022-06-17
Anticipated expiration: 2041-03-22
Also published as: CN113065431A

Abstract

The invention discloses a human body violation behavior prediction method based on a hidden Markov model and a recurrent neural network, and belongs to the field of computer vision image processing. The method comprises the following steps: 1) collecting a data set; 2) preprocessing the images; 3) performing target detection on the preprocessed images to obtain target detection frames; 4) taking the images with target detection frames as the input of a CPN, extracting the human skeleton in each target detection frame and marking the joint points to obtain images carrying both the target detection frame and the joint point marking information, then converting these images in each group into pixel matrices; 5) using an LSTM model to obtain the probability that a sample belongs to each violation behavior, forming a probability matrix; 6) correcting the probability matrix with a hidden Markov model, and taking the violation behavior corresponding to the maximum probability value in the corrected probability matrix as the final prediction result.

Description

Human body violation prediction method based on hidden Markov model and recurrent neural network

Technical Field

The invention relates to the field of computer vision image processing, in particular to a human body violation behavior prediction method based on a hidden Markov model and a recurrent neural network.

Background

Object detection and human behavior recognition are active research topics in computer vision. The goal of human behavior recognition is to automatically analyze the ongoing behavior in an unknown video or image sequence. It draws on many emerging technologies and can serve as a predictor of violations: a system built on it can stop a violation in time and can also raise an alarm when one occurs. Human behavior recognition is therefore indispensable for predicting human violations. However, it is a very challenging task because of uncontrollable factors such as varying illumination, diverse viewing angles, complex backgrounds, large intra-class variation and crowding. Researchers have proposed various approaches to these problems. An RNN can process time-series data, but it suffers from the vanishing gradient problem and therefore cannot handle long time series.

LSTM is a recurrent neural network that improves on the RNN. On top of the RNN structure it adds a memory cell and three gates that control which information is remembered over time: the forget gate, the input gate and the output gate. At every time step the information passes through the three gates, so the cell can selectively memorize and control how much it forgets. This compensates for the shortcomings of the RNN and makes the model effective on medium- and long-term sequence data.
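The gating described above can be illustrated with a minimal single-unit LSTM step. This is only a sketch: the scalar weights in the hypothetical `params` dictionary and the function name `lstm_step` are illustrative assumptions and are not taken from the invention.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell (scalar input and state).

    params maps each of "f", "i", "o", "g" to a hypothetical
    (input weight, recurrent weight, bias) triple.
    """
    def pre(name):
        w, u, b = params[name]
        return w * x + u * h_prev + b

    f = sigmoid(pre("f"))      # forget gate: how much old memory to keep
    i = sigmoid(pre("i"))      # input gate: how much new content to store
    o = sigmoid(pre("o"))      # output gate: how much memory to expose
    g = math.tanh(pre("g"))    # candidate content
    c = f * c_prev + i * g     # updated cell memory
    h = o * math.tanh(c)       # updated hidden state
    return h, c


# With all weights and biases at zero, every gate is 0.5 and the candidate is 0,
# so the cell simply keeps half of its previous memory.
zero_params = {name: (0.0, 0.0, 0.0) for name in "fiog"}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
```

Because the cell memory `c` is updated additively and scaled by the forget gate, information can persist over many time steps, which is the property that a plain RNN lacks.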

A hidden Markov model (HMM) is the structurally simplest dynamic Bayesian network and a well-known directed graphical model. It is mainly used for modeling time-series data and is widely applied in fields such as speech recognition and natural language processing. Modeling time-series data that has the Markov property with an HMM greatly reduces the computation required. The present invention applies a hidden Markov model to correct the prediction of violation behavior.

Disclosure of Invention

The invention aims to better predict human body violation behaviors, and provides a human body violation behavior prediction method based on a hidden Markov model and a recurrent neural network.

The technical scheme of the invention is as follows:

a human body violation behavior prediction method based on a hidden Markov model and a recurrent neural network comprises the following steps:

1) data acquisition: acquiring video data of different illegal behaviors, slicing the video data, converting continuous video data into continuous images, and marking the illegal behaviors of each group of continuous images;

2) preprocessing the image;

3) carrying out target detection on the preprocessed image to obtain a target detection frame;

4) taking the image with the target detection frame as the input of the CPN, extracting the human skeleton in the target detection frame, marking the joint points, and obtaining the image with the target detection frame and the joint point marking information; converting the images with the target detection frame and the joint point mark information in each group into pixel matrixes;

5) taking each group of pixel matrixes obtained in the step 4) as a sample to form a sample training set; training the LSTM model by using a sample training set to obtain the probability that the sample belongs to each violation behavior, and forming a probability matrix;

6) correcting the probability matrix by using a hidden Markov model, taking the violation behavior corresponding to the maximum probability value in the corrected probability matrix as a final prediction result, and training the hidden Markov model according to the prediction result and a real result;

7) obtaining video data to be predicted, converting the video data to be predicted into a pixel matrix to be processed through steps 1) to 4), obtaining an initial probability matrix by using a trained LSTM model, correcting the initial probability matrix by using a trained hidden Markov model, and taking violation behaviors corresponding to maximum probability values in the corrected probability matrix as final prediction results.

Further, the preprocessing method in step 2) is a filtering method: a square region is taken centered on a pixel of the picture, the gray values of all pixels in the region are sorted, the middle value of the sorted sequence is taken as the new gray value of the center pixel, and the image is traversed in a sliding-window manner.

Further, the target detection process in step 3) includes:

sequentially resizing the continuous images in a group and extracting features to obtain a feature map;

applying one convolution to the feature map to concentrate the feature information, and then splitting into two branches: in the first branch, person and background are distinguished by the rpn_data layer, and candidate frames marked as person are output; in the second branch, the offsets of the candidate frames are calculated and output;

performing out-of-bounds elimination and NMS (non-maximum suppression) on the candidate frames to remove overlapping frames; inputting the remaining candidate frames and the feature map into an ROI Pooling layer, mapping the candidate frames onto the feature map, and outputting the result after a fully connected layer.

Further, the data input into the LSTM model is a time-ordered sequence of pixel matrices $a = (a_1, a_2, a_3, a_4, \ldots, a_n)$, where $a_i$ denotes the pixel matrix corresponding to the i-th image in the group.

Further, the hidden markov model is established by the following steps:

6.1) determining the set of hidden states as $S = \{s_1, s_2, \ldots, s_N\}$ and the set of observation states as $O = \{o_1, o_2, \ldots, o_N\}$, where N is the number of violation behavior types;

6.2) determining the state transition probability matrix $A = [a_{ij}]_{N \times N}$, where the state at the current moment depends only on the state at the previous moment, namely:

$a_{ij} = p(y_{t+1} = s_j \mid y_t = s_i)$

wherein $y_t$ denotes the state at time t; $y_t = s_i$ means that the state at time t is $s_i$; $y_{t+1} = s_j$ means that the state at time t+1 is $s_j$; $p(y_{t+1} = s_j \mid y_t = s_i)$ is the probability that the state at time t+1 is $s_j$ given that the state at time t is $s_i$; $a_{ij}$ is the element in row i, column j of the state transition probability matrix A, i.e. the probability that the model, being in state $s_i$ at time t, transfers to state $s_j$ at time t+1;

6.3) determining the observation probability matrix $B = [b_{ij}]_{N \times N}$:

$b_{ij} = p(x_t = o_j \mid y_t = s_i)$

wherein $o_j$ denotes the j-th observed value; $x_t$ denotes the observed value at time t; $x_t = o_j$ means that the observed value at time t is $o_j$; $p(x_t = o_j \mid y_t = s_i)$ is the probability that the observed value at time t is $o_j$ given that the state at time t is $s_i$; $b_{ij}$ is the element in row i, column j of the observation probability matrix B, i.e. the probability that the observed value $o_j$ occurs in state $s_i$;

6.4) taking the probability matrix output by the LSTM model as the initial state probability distribution $\Pi = (\Pi_1, \Pi_2, \ldots, \Pi_N)$:

$\Pi_i = p(y = s_i)$

wherein $\Pi_i$ denotes the probability of belonging to state $s_i$.

Further, a step of performing feature extraction and dimension reduction on the pixel matrix is further included between step 4) and step 5), specifically: inputting the pixel matrix obtained in the step 4) into a convolution layer for feature extraction, and then entering a pooling layer for dimensionality reduction; the reduced pixel matrix is used as the input of the LSTM model.

The invention has the beneficial effects that:

The invention provides a human body violation behavior prediction method based on a hidden Markov model and a recurrent neural network. Different violation behaviors are defined for different scenes, and a data set is collected in the specific scene; the network model is then trained to predict violation behaviors, and the hidden Markov model is combined with it to correct the errors of the network model. Violation behaviors can thus be judged correctly and prevented in time, and timely warnings can be issued.

Drawings

FIG. 1 is a basic flow diagram of a prediction method;

FIG. 2 is a simplified flow of CPN estimation of a human joint;

FIG. 3 is a diagram illustrating the setting and adjustment of various parameters of the LSTM model;

FIG. 4 is a flow chart of LSTM model training;

FIG. 5 is a diagram structure of a hidden Markov model;

Detailed Description

The human body violation behavior prediction method based on a hidden Markov model and a recurrent neural network according to the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.

As shown in fig. 1, the human body violation behavior prediction method based on a hidden Markov model and a recurrent neural network of the present invention includes the following steps:

S1: data acquisition: acquiring video data of different violation behaviors, slicing the video data to convert continuous video into continuous images, and labeling each group of continuous images with its violation behavior. In this embodiment, the time-series data are collected from a monitoring system; during model training, part of the data comes from the INRIA XMAS multi-view video library, from which data with distinct features are selected according to the specific requirements.

To facilitate subsequent operations, frames are captured from the video data at certain intervals during slicing, turning the video data into continuous images.

S2: in order to improve the training efficiency and the accuracy of the model after the training is completed, the intercepted image data needs to be preprocessed.

Median filtering, a non-linear method, is used. The median filter is effective against impulse noise because it replaces the value of a contaminated pixel with a suitable value from its neighborhood. First, a region centered on a given pixel is determined (here a square region); the gray values of the pixels in the region are then sorted, and the middle value is taken as the new gray value of the center pixel. The square region is moved in a sliding-window manner, and once the whole image has been traversed a clearer picture with little loss is obtained.
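The sliding-window median filter described above can be sketched in a few lines. The window side `k = 3` and the replication of edge pixels at the image border are assumptions made for this illustration; the embodiment does not specify either.

```python
from statistics import median


def median_filter(gray, k=3):
    """Replace every pixel by the median of the k x k window centered on it.

    gray is a list of rows of gray values. Border pixels reuse the nearest
    valid pixel (edge replication), which is an assumption of this sketch.
    """
    if k % 2 == 0:
        raise ValueError("window side k must be odd")
    r = k // 2
    height, width = len(gray), len(gray[0])
    out = [[0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            window = [
                gray[min(max(i + di, 0), height - 1)][min(max(j + dj, 0), width - 1)]
                for di in range(-r, r + 1)
                for dj in range(-r, r + 1)
            ]
            out[i][j] = median(window)  # middle value of the sorted window
    return out


# A flat gray image with one impulse-noise pixel in the middle.
img = [[10] * 5 for _ in range(5)]
img[2][2] = 255
clean = median_filter(img)
```

In the example the isolated bright pixel is outvoted by its eight neighbors, so it is replaced by the surrounding gray value while the rest of the image is left unchanged.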

S3: carrying out target detection on the preprocessed image to obtain a target detection frame:

Target detection is carried out with a network based on Faster R-CNN, which mainly comprises four parts. The first is the Conv layers, which extract a feature map by convolution and pooling. The second is the RPN (Region Proposal Network), which produces accurate candidate frames. The third is ROI Pooling, which extracts the feature map of each candidate frame and sends it to the fully connected layers that judge the target. The fourth is the classification layer, which uses the candidate frame feature map to compute the class of each candidate frame and, at the same time, its precise position.

In implementation, the continuous images in a group are sequentially resized and their features are extracted to obtain a feature map;

one convolution is applied to the feature map to concentrate the feature information, and the data flow then splits into two branches: in the first branch, person and background are distinguished by the rpn_data layer, and candidate frames marked as person are output; in the second branch, the offsets of the candidate frames are calculated and output;

This embodiment is illustrated by way of example in fig. 5:

S3.1: features are extracted from the pictures by the Conv layers. This part contains 13 conv layers with kernel_size 3, pad 1 and stride 1. According to the picture size formula

$\text{output\_size} = \dfrac{\text{input\_size} - \text{kernel\_size} + 2 \times \text{pad}}{\text{stride}} + 1$

a convolution with these settings does not change the size of the image. The conv layers are followed by 13 relu layers, which also leave the picture size unchanged, and by 4 pooling layers with kernel_size 2 and stride 2, each of which reduces the picture size to 1/2. The pictures pass repeatedly through convolution, activation, convolution, activation and pooling, and a feature map of size (M/16) × (N/16) × 512 is obtained.
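The size arithmetic of S3.1 can be checked with a short calculation. The sketch below uses the standard output-size relation for convolution and pooling layers, (input - kernel_size + 2 × pad) / stride + 1 with integer division; the function names are illustrative assumptions.

```python
def conv_output_size(size, kernel_size, pad, stride):
    """Output length along one dimension of a convolution or pooling layer."""
    return (size - kernel_size + 2 * pad) // stride + 1


def feature_map_size(m, n, num_pools=4):
    """Spatial size of the feature map for an M x N input.

    The 3x3 convolutions (pad 1, stride 1) keep the size, so only the
    2x2 pooling layers with stride 2 are applied here.
    """
    for _ in range(num_pools):
        m = conv_output_size(m, kernel_size=2, pad=0, stride=2)
        n = conv_output_size(n, kernel_size=2, pad=0, stride=2)
    return m, n


same = conv_output_size(224, kernel_size=3, pad=1, stride=1)  # size is preserved
fm = feature_map_size(800, 608)                               # four halvings: 1/16
```

Four pooling layers that each halve the size give the overall factor of 16 that appears in the feature map size (M/16) × (N/16).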

S3.2: after the feature map enters the RPN, it first passes through a 3 × 3 convolution; the output again has feature map size M × N and 512 channels. This step further concentrates the feature information. It is followed by two 1 × 1 convolutions, i.e., kernel_size 1, pad 0, stride 1.

S3.3: after the convolution, the data flow splits into two branches. The upper branch first passes through the rpn_data layer, where the 9 anchor boxes at each pixel of the feature map are classified into two classes, person or background.

An anchor is a box cut out, with a given aspect ratio, around each point of the convolved feature map. When generating anchors, a base box of size 16 × 16 is defined; 16 × 16 is chosen because one point on the feature map corresponds to a 16 × 16 region of the original image. Starting from the 16 × 16 base box, three squares with side lengths 8, 16 and 32 are marked off, and from each square three anchors with aspect ratios 0.5, 1 and 2 are cut, so that M × N × 9 anchor boxes are generated in total.
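The anchor generation can be sketched as follows. Several details are assumptions of this illustration rather than statements of the embodiment: the aspect ratio is taken as height / width, each anchor keeps the area of its square, anchors are centered on the 16 × 16 cell of the original image that corresponds to a feature map point, and the side lengths are passed in as a parameter (common Faster R-CNN implementations treat 8, 16 and 32 as multipliers of the 16-pixel base instead).

```python
import math


def base_anchor_shapes(sides=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return 9 (width, height) pairs: 3 squares x 3 aspect ratios."""
    shapes = []
    for side in sides:
        area = side * side
        for ratio in ratios:            # ratio = height / width (assumed)
            w = math.sqrt(area / ratio)
            h = w * ratio               # keeps w * h equal to the square's area
            shapes.append((w, h))
    return shapes


def all_anchors(fm_h, fm_w, stride=16):
    """Place the 9 anchor shapes at every feature map point.

    Returns boxes as (x1, y1, x2, y2) in original-image coordinates.
    """
    shapes = base_anchor_shapes()
    anchors = []
    for row in range(fm_h):
        for col in range(fm_w):
            cx = col * stride + stride / 2   # center of the 16 x 16 cell
            cy = row * stride + stride / 2
            for w, h in shapes:
                anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


shapes = base_anchor_shapes()
anchors = all_anchors(fm_h=3, fm_w=4)   # 3 * 4 * 9 anchors
```

Anchors near the border can extend beyond the original image, which is why an out-of-bounds filtering step follows anchor generation.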

The anchor boxes are then filtered and labeled in this layer as follows:

① anchor boxes extending beyond the original image are filtered out;

② the anchor box with the largest IoU with the ground truth is marked as a positive sample, label = 1;

③ an anchor box whose IoU with the ground truth is greater than 0.7 is marked as a positive sample, label = 1;

④ an anchor box whose IoU with the ground truth is less than 0.3 is marked as a negative sample, label = 0.

IoU is calculated as follows:

$\mathrm{IoU} = \dfrac{|A \cap B|}{|A \cup B|}$

where A and B are the two boxes being compared. A positive sample indicates that a target is present and a negative sample that no target is present; the remaining anchor boxes are neither positive nor negative, do not participate in training, and have label = -1.
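The IoU computation and the labeling rules can be sketched as follows, assuming boxes in (x1, y1, x2, y2) form and a single ground truth box. The function names and the example boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, ground_truth, pos_thresh=0.7, neg_thresh=0.3):
    """Label each anchor 1 (positive), 0 (negative) or -1 (ignored)."""
    scores = [iou(anchor, ground_truth) for anchor in anchors]
    best = max(range(len(anchors)), key=lambda k: scores[k])
    labels = []
    for k, score in enumerate(scores):
        if k == best or score > pos_thresh:
            labels.append(1)      # largest IoU, or IoU above 0.7
        elif score < neg_thresh:
            labels.append(0)      # background
        else:
            labels.append(-1)     # neither; does not participate in training
    return labels


gt = (0, 0, 10, 10)
anchors = [(0, 0, 10, 10), (0, 0, 10, 5), (20, 20, 30, 30), (0, 0, 10, 8)]
labels = label_anchors(anchors, gt)
```

In the example the four anchors have IoU 1.0, 0.5, 0.0 and 0.8 with the ground truth, so they are labeled positive, ignored, negative and positive respectively.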

In addition to labeling the anchor boxes, the offset between each anchor box and the ground truth is calculated. Let the ground truth, i.e. the calibrated (real) frame, have center coordinates $x^*$, $y^*$ and width and height $w^*$, $h^*$, and let the anchor box have center coordinates $x_a$, $y_a$ and width and height $w_a$, $h_a$.

Then:

$\Delta x = (x^* - x_a) / w_a$

$\Delta y = (y^* - y_a) / h_a$

$\Delta w = \log(w^* / w_a)$

$\Delta h = \log(h^* / h_a)$

Learning from the difference between the ground truth box and the predicted anchor box gives the weights of the RPN the ability to predict boxes.
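The four offsets and their inverse can be written out directly. In this sketch boxes are (center x, center y, width, height) tuples; `encode_offsets` follows the four formulas for Δx, Δy, Δw and Δh, and `decode_offsets` is the inverse that applies predicted offsets back onto an anchor. Both function names are illustrative assumptions.

```python
import math


def encode_offsets(gt, anchor):
    """Offsets (dx, dy, dw, dh) that move an anchor onto the ground truth box."""
    x, y, w, h = gt
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode_offsets(offsets, anchor):
    """Apply offsets to an anchor and return the corrected box."""
    dx, dy, dw, dh = offsets
    xa, ya, wa, ha = anchor
    return (xa + dx * wa, ya + dy * ha, wa * math.exp(dw), ha * math.exp(dh))


anchor = (50.0, 40.0, 16.0, 32.0)
gt = (54.0, 36.0, 20.0, 30.0)
offsets = encode_offsets(gt, anchor)
restored = decode_offsets(offsets, anchor)   # recovers gt up to rounding
```

Dividing the center shift by the anchor size and taking the logarithm of the size ratio makes the regression targets independent of the absolute scale of the box.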

S3.4: in the following layers, rpn_loss_cls calculates the classification loss with a binary cross-entropy function; rpn_loss_bbox calculates the regression loss with the Smooth L1 loss function; and rpn_cls_prob calculates the probability values with the softmax function. The anchor box labels produced in rpn_data serve as the input to the upper branch of the RPN, and the calculated offsets between the anchor boxes and gt_boxes serve as the input to the lower branch. The RPN is then trained. The two rpn_cls_score_restore layers are needed because softmax expects a fixed input/output shape: the data is changed to that shape before softmax, and the output after classification is changed back to the desired shape. Finally, the data classified as target is put into the candidate boxes (proposals).

S3.5: rpn_bbox_pred records the four trained regression offsets Δx, Δy, Δw, Δh; these four predicted offsets are then used to correct the position information of the anchors, giving more accurate prediction frames. Out-of-bounds elimination and NMS (non-maximum suppression) are then applied to the prediction frames to remove overlapping frames. The IoU threshold of the NMS is first set to 0.7, i.e., a box is retained only if its overlap with the locally highest-scoring box does not exceed 0.7. About M anchors remain after this step; they are sorted by rpn_cls_prob in descending order and the first N are taken, so that only about N region proposals enter the subsequent ROI Pooling layer.
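The proposal filtering of S3.5 can be sketched as three passes: out-of-bounds elimination, greedy NMS with an IoU threshold of 0.7, and selection of the top-scoring boxes. This is a minimal illustration under assumptions: boxes are (x1, y1, x2, y2) tuples, out-of-bounds boxes are dropped rather than clipped, and the function name `filter_proposals` and the example data are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_proposals(boxes, scores, img_w, img_h, iou_threshold=0.7, top_n=None):
    """Return indices of the boxes kept as region proposals, best score first."""
    # 1) Out-of-bounds elimination: keep only boxes fully inside the image.
    candidates = [
        k for k, (x1, y1, x2, y2) in enumerate(boxes)
        if x1 >= 0 and y1 >= 0 and x2 <= img_w and y2 <= img_h
    ]
    # 2) Greedy NMS: visit boxes by descending score and drop any box that
    #    overlaps an already kept box by more than the threshold.
    candidates.sort(key=lambda k: scores[k], reverse=True)
    keep = []
    for k in candidates:
        if all(iou(boxes[k], boxes[j]) <= iou_threshold for j in keep):
            keep.append(k)
    # 3) Keep the first top_n boxes by score.
    return keep if top_n is None else keep[:top_n]


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30), (-5, 0, 5, 10)]
scores = [0.9, 0.8, 0.7, 0.95]
kept = filter_proposals(boxes, scores, img_w=40, img_h=40)
```

In the example the last box is removed because it leaves the image, and the second box is suppressed because its IoU with the higher-scoring first box is about 0.82.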

S3.6: the region proposals generated by the RPN and the feature map generated earlier are input into the ROI Pooling layer. Each region proposal is traversed and its coordinate values are divided by 16, so that a region proposal defined on the original image is mapped onto the M × N feature map. The region determined on the feature map in this way is the feature map corresponding to the region proposal and serves as the input of the following fully connected layer.

S3.7: the specific category of each region proposal is calculated by the fully connected layer and softmax, and the cls_prob probability vector is output; at the same time, bounding box regression is applied again to obtain the position offset bbox_pred of each region proposal, and a more accurate target detection frame is obtained by regression.

And after finishing the target detection, taking the picture with the target detection frame as the input of the CPN network to carry out joint point estimation.

S4: taking the image with the target detection frame as the input of the CPN, extracting the human skeleton in the target detection frame, marking the joint points, and obtaining the image with the target detection frame and the joint point marking information; and converting the images with the target detection frame and the joint point mark information in each group into a pixel matrix.

As shown in fig. 2, the GlobalNet stage is responsible for detecting all joint points in the image, and it predicts well the joint points that are easier to detect, such as those of the eyes and arms. It also provides sufficient contextual information, which is important for inferring occluded and invisible joint points. As the figure shows, joint points such as the ears, left elbow and right elbow can be predicted straightforwardly; this stage mainly detects the joint points that are easy to observe, while joint points that are difficult to observe are predicted by processing the pixel information around them.

In the RefineNet stage, the results predicted by GlobalNet are corrected. GlobalNet makes large prediction errors for joints whose body part is occluded, invisible, or set against a complex background, and RefineNet specifically corrects these points.

The results of the two stages are combined and placed on the original image to obtain the picture with estimated joint points.

S5: taking each group of pixel matrixes obtained in the step S4 as a sample to form a sample training set; training the LSTM model by using a sample training set to obtain the probability that the sample belongs to each violation behavior, and forming a probability matrix;

To better avoid the vanishing gradient problem, LSTM is selected as the network model. The LSTM network is trained with the back-propagation algorithm to learn the relationship between human body states and time in the training set, yielding the weights and biases of the LSTM network;

with reference to fig. 3 and 4, the specific process of the LSTM model training is as follows:

First, the functions and parameters of the LSTM model are determined. The activation functions are the sigmoid and tanh functions. To prevent over-fitting a dropout value is set, here 0.2, and the loss function is the squared error between the predicted value and the true value. With 1000 data items and a batch size of 200, one epoch consists of 5 iterations.

During training, the data first enters a convolutional layer for feature extraction and then a pooling layer for dimension reduction. Depending on the specific case there may be several convolutional and pooling layers, and the size of the sliding window may vary. After one dropout, the data enters the LSTM network. After processing by a number of neurons, dropout is applied again and regression is performed; the result is a probability matrix giving the probability of each behavior. While the number of iterations has not reached its limit, the loss function is calculated and the weights and biases are updated. This process is repeated until the iteration limit is reached.

S6: the probability matrix is corrected with the hidden Markov model, the violation behavior corresponding to the maximum probability value in the corrected probability matrix is taken as the final prediction result, and the hidden Markov model is trained according to the prediction result and the real result.

The method specifically comprises the following steps: determining the set of hidden states as $S = \{s_1, s_2, \ldots, s_N\}$ and the set of observation states as $O = \{o_1, o_2, \ldots, o_N\}$, where N is the number of violation behavior types;

determining the state transition probability matrix $A = [a_{ij}]_{N \times N}$, where the state at the current moment depends only on the state at the previous moment, namely:

$a_{ij} = p(y_{t+1} = s_j \mid y_t = s_i)$

wherein $y_t$ denotes the state at time t; $y_t = s_i$ means that the state at time t is $s_i$; $y_{t+1} = s_j$ means that the state at time t+1 is $s_j$; $p(y_{t+1} = s_j \mid y_t = s_i)$ is the probability that the state at time t+1 is $s_j$ given that the state at time t is $s_i$; $a_{ij}$ is the element in row i, column j of the state transition probability matrix A, i.e. the probability that the model, being in state $s_i$ at time t, transfers to state $s_j$ at time t+1;

determining the observation probability matrix $B = [b_{ij}]_{N \times N}$:

$b_{ij} = p(x_t = o_j \mid y_t = s_i)$

taking the probability matrix output by the LSTM model as the initial state probability distribution $\Pi = (\Pi_1, \Pi_2, \ldots, \Pi_N)$:

$\Pi_i = p(y = s_i)$

wherein $\Pi_i$ denotes the probability of belonging to state $s_i$.
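One way to read the correction step is as forward filtering in the hidden Markov model, with the LSTM output used as the initial distribution Π. The sketch below is an illustration under assumptions, since the exact correction formula is not given here: the observation at each step is assumed to be the index of the behavior predicted by the LSTM, the observation matrix B is assumed to be estimated by counting (true behavior, predicted behavior) pairs with add-one smoothing, and all function names and numbers are hypothetical.

```python
def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def estimate_emission(true_states, observed, n, smoothing=1.0):
    """Estimate B by counting how often state s_i produced observation o_j."""
    counts = [[smoothing] * n for _ in range(n)]
    for s, o in zip(true_states, observed):
        counts[s][o] += 1
    return [normalize(row) for row in counts]


def hmm_correct(pi, A, B, observations):
    """Forward filtering: P(current state | observations so far).

    pi: initial state distribution (here the LSTM output probabilities)
    A:  A[i][j] = p(y_{t+1} = s_j | y_t = s_i)
    B:  B[i][j] = p(x_t = o_j | y_t = s_i)
    """
    n = len(pi)
    belief = normalize([pi[i] * B[i][observations[0]] for i in range(n)])
    for obs in observations[1:]:
        predicted = [sum(belief[i] * A[i][j] for i in range(n)) for j in range(n)]
        belief = normalize([predicted[j] * B[j][obs] for j in range(n)])
    return belief


# Two behaviors; the LSTM slightly prefers behavior 1.
pi = [0.45, 0.55]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.9, 0.1], [0.2, 0.8]]
corrected = hmm_correct(pi, A, B, observations=[0])
final_prediction = corrected.index(max(corrected))

# B estimated from hypothetical (true, predicted) label pairs.
B_est = estimate_emission([0, 0, 1], [0, 1, 1], n=2)
```

In the example the observation is much more likely under behavior 0 than under behavior 1, so the corrected distribution moves its maximum from behavior 1 to behavior 0, and the behavior with the maximum corrected probability is taken as the final prediction.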

In practical application, video data to be predicted is obtained, the video data to be predicted is converted into a pixel matrix to be processed through S1-S4, a trained LSTM model is used for obtaining an initial probability matrix, the initial probability matrix is corrected through the trained hidden Markov model, and violation behaviors corresponding to the maximum probability values in the corrected probability matrix are taken as final prediction results.

Claims

1. A human violation behavior prediction method based on a hidden Markov model and a recurrent neural network is characterized by comprising the following steps:

2) preprocessing the image;

5) taking each group of pixel matrices obtained in step 4) as a sample to form a sample training set; training the LSTM model with the sample training set, wherein the data input into the LSTM model is a time-ordered sequence of pixel matrices $a = (a_1, a_2, a_3, a_4, \ldots, a_n)$, where $a_i$ denotes the pixel matrix corresponding to the i-th image in the group; obtaining the probability that the sample belongs to each violation behavior, and forming a probability matrix;

the hidden Markov model establishing process comprises the following steps:

$a_{ij} = p(y_{t+1} = s_j \mid y_t = s_i)$

6.3) determining the observation probability matrix $B = [b_{ij}]_{N \times N}$:

$b_{ij} = p(x_t = o_j \mid y_t = s_i)$

6.4) taking the probability matrix output by the LSTM model as the initial state probability distribution $\Pi = (\Pi_1, \Pi_2, \ldots, \Pi_N)$:

$\Pi_i = p(y = s_i)$

wherein $\Pi_i$ denotes the probability of belonging to state $s_i$;

2. The human body violation behavior prediction method based on a hidden Markov model and a recurrent neural network as claimed in claim 1, wherein the preprocessing method in step 2) is a filtering method: a square region is taken centered on a pixel of the picture, the gray values of all pixels in the region are sorted, the middle value of the sorted sequence is taken as the new gray value of the center pixel, and the image is traversed in a sliding-window manner.

3. The human body violation behavior prediction method based on a hidden Markov model and a recurrent neural network as claimed in claim 1, wherein the target detection process in step 3) comprises:

performing out-of-bounds elimination and NMS (non-maximum suppression) on the candidate frames to remove overlapping frames; and inputting the remaining candidate frames and the feature map into an ROI Pooling layer, mapping the candidate frames onto the feature map, and outputting the result after a fully connected layer.

4. The human violation prediction method according to claim 1, further comprising a step of feature extraction and dimension reduction of the pixel matrix between step 4) and step 5), and specifically comprising: inputting the pixel matrix obtained in the step 4) into a convolution layer for feature extraction, and then entering a pooling layer for dimension reduction; the reduced pixel matrix is used as the input of the LSTM model.