CN113743306A - Method for analyzing abnormal behaviors of real-time intelligent video monitoring based on slowfast double-frame rate - Google Patents
Method for analyzing abnormal behaviors of real-time intelligent video monitoring based on slowfast double-frame rate
- Publication number
- CN113743306A (application number CN202111037913.7A)
- Authority
- CN
- China
- Prior art keywords
- branch
- slow
- frame rate
- output
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Abstract
The invention discloses a method for analyzing abnormal behaviors in real-time intelligent video monitoring based on a slowfast double-frame rate network, and belongs to the field of video image recognition. To enable the slowfast network model to capture more spatial semantic information in the slow branch, the invention builds a multi-feature-fusion slowfast double-frame rate network model that performs top-down feature fusion on the feature layers of the slow branch, thereby improving the slow branch's ability to extract category-level spatial semantic information. To make the built slowfast network model converge better and faster, the method optimizes the design of its loss function, adopting a soft-label-based loss function to improve the classification ability of the network.
Description
Technical Field
The invention relates to the field of video image recognition, in particular to a method for analyzing abnormal behaviors of real-time intelligent video monitoring based on a slowfast double-frame rate.
Background
Digital industrial innovation integrating a new generation of internet, internet-of-things and AI technologies is becoming a new engine for public safety, covering monitoring, processing, analysis and information output, driving the rapid development of intelligent monitoring and the upgrade of traditional Chinese security to digital security. Relying on the rapid development of information technology and increasingly complete intelligent information facilities, intelligent equipment is becoming a reliable support for the new generation of public safety products. In recent years, video surveillance has begun to play a major role in a variety of scenes, effectively improving the efficiency of public safety management. However, as monitoring cameras spread to every corner of a city, video inspection relying on human observation can no longer meet the requirements of current social development.
"Eyes and brain" intelligent monitoring for public safety can identify and interpret abnormal states within the monitored range, issue corresponding early warnings according to regulations, and promptly remind supervisors to take measures. However, abnormal behavior analysis in intelligent video monitoring systems faces dimensional disparities in the target, range and accuracy of analysis and detection, and uncertainty in quality, effect and behavior. Moreover, as an extension of perception-type analysis and behavior judgment, such analysis itself suffers from single-node analysis failure, dynamic change of analysis constraints, unequal information in space and uncertainty in the time dimension. Conventional video surveillance does not adequately account for these problems of disparity and uncertainty.
Disclosure of Invention
The invention aims to analyze abnormal behaviors in intelligent video monitoring in real time using a slowfast double-frame rate model, to enable the network model to capture more spatial semantic information in the slow branch, to improve the slow branch's ability to extract category-level spatial semantics, and to raise the classification precision of the trained network. The invention provides a method for analyzing abnormal behaviors in slowfast double-frame rate real-time intelligent video monitoring based on multi-feature fusion and a soft-label cross-entropy loss function.
The technical scheme adopted by the invention is as follows:
a method for analyzing abnormal behaviors of real-time intelligent video monitoring based on slowfast double-frame rate comprises the following steps:
A. acquiring a character video clip with a specific behavior as a sample data set in an application scene, labeling a pedestrian category label, and preprocessing the sample data set;
B. building a multi-feature-fused slowfast double-frame rate network model, wherein the model comprises a slow branch and a fast branch, the slow branch operates at a low frame rate, and the fast branch operates at a high frame rate;
the slow branch comprises three first convolution blocks which are sequentially connected, the input of the 1 st convolution block is a video frame image obtained by low frame rate sampling, the output of the 1 st convolution block is simultaneously used as the input of the 2 nd and 3 rd convolution blocks, the output of the 2 nd convolution block is also used as the input of the 3 rd convolution block, and multi-feature fusion is realized in the 3 rd convolution block;
the fast branch comprises three second convolution blocks which are connected in sequence, the input of the 1st convolution block is a video frame image obtained by high frame rate sampling, and the output of the previous convolution block is used as the input of the next convolution block; and the outputs of the second convolution blocks in the fast branch are laterally connected with the outputs of the first convolution blocks of the slow branch;
after the output results of the last convolution block of the slow branch and the last convolution block of the fast branch are connected, the behavior category is predicted through a softmax function;
C. b, training the multi-feature fusion slowfast double-frame rate network model established in the step B by using a loss function based on a soft label and the sample data set in the step A;
D. and acquiring a monitoring video in real time, and detecting abnormal behaviors by using a trained slowfast double-frame rate network model with multi-feature fusion.
Preferably, the category labels include fighting, climbing and falling.
Preferably, the category label adopts one-hot coding, the position of the category is 1, and the rest is 0.
Preferably, the lateral connection is specifically: the output of the 1st convolution block of the fast branch is fused with the output of the 1st convolution block of the slow branch as the input of the 2nd convolution block of the slow branch, and the output of the 2nd convolution block of the fast branch is fused with the output of the 2nd convolution block of the slow branch as the input of the 3rd convolution block of the slow branch.
Preferably, when the lateral connection is performed, the output of the fast branch is sampled every α frames, converted into the same number of video frames as in the slow branch, and then connected in the channel direction.
Preferably, the first convolution block and the second convolution block adopt a network structure with multi-layer feature-fusion output and are composed of k+1 convolutional layers; the output of the ith convolutional layer is spliced with the input of the ith convolutional layer and then used as the input of the (i+1)th convolutional layer, and the input of the 1st convolutional layer is spliced with the output of the (k+1)th convolutional layer and then used as the final output of the convolution block.
Preferably, the soft tag-based loss function is:
wherein L_CE is the loss value, N is the batch size, m is the number of classes, p_ji(k) is the probability that the jth sample is predicted as class i at the kth network iteration, and p_ji(k-1) is the probability that the jth sample was predicted as class i at the previous network iteration; y_ji(k) denotes the soft-label vector, of length m; k denotes the number of network iterations, and N_epoch denotes the preset maximum number of network iterations.
Preferably, the training of step C is performed according to a cross-validation method on the video data set in step a.
Preferably, step D specifically comprises: acquiring a monitoring video in real time; in the trained multi-feature-fused slowfast double-frame rate network model, the fast branch extracts 15 high-frame-rate-sampled video frame images per second and the slow branch extracts 2 low-frame-rate-sampled video frame images per second; after the output results of the two branches are connected, the behavior type is predicted through the softmax function, and a warning is issued when abnormal behavior is detected.
Compared with the prior art, the invention has the beneficial effects that:
the method is used for analyzing the abnormal behaviors of the monitored video based on a slow double-frame rate network model, wherein the slow double-frame rate network model comprises a slow branch and a fast branch, the slow branch runs at a low frame rate, a large time sequence span (namely the number of frames skipped per second) is used, for example, 2 frames are extracted within 1 second, and the purpose is to capture semantic information provided by images or a plurality of sparse frames; the fast branch operates at a high frame rate, has a high temporal resolution, and takes 15 frames using a very small time span, e.g., 1 second, with the goal of capturing rapidly changing motion. In addition, the two branches adopt a structure of multi-layer feature fusion output, and feature fusion from top to bottom is carried out on the feature layers by utilizing the characteristic that features of different layers have different semantics, so that the capability of the Slow branch for extracting category space semantics and the capability of the fast branch for extracting time semantic information to weaken the space semantic information are improved.
In the slowfast network model, to train the classification model better, the cross-entropy loss function is improved: the one-hot class encoding y_ji (a vector of 0s and 1s) is converted into soft-label form, and the label is updated with the probability predictions of each training round.
The present invention is a technical advance over methods based on analyzing image slices, which lack the time dimension, or on video data that is not differentiated in the time dimension.
Drawings
FIG. 1 is a diagram of the abnormal behavior analysis steps of the present invention;
FIG. 2 is a video sequence sample data for three types of behaviors shown in an embodiment of the present invention;
fig. 3 is a schematic diagram of a slowfast network structure with multi-feature fusion proposed by the present invention;
fig. 4 is a schematic diagram of the structure of each volume block in the slowfast network of fig. 3.
Detailed Description
The invention is described in detail below with reference to the drawings and specific embodiments, but the invention is not limited thereto.
The abnormal behavior analysis steps of the invention are shown in fig. 1.
More specifically, the implementation steps of the invention are as follows:
A. In the practical application scenario, a pedestrian video sample data set is collected and labeled (e.g., fighting, climbing, falling, etc.), and the sample data set is preprocessed.
In this embodiment, in an actual application scenario, a monitoring camera is used to capture video samples of 30 people, 500 video segments of about 10 seconds are obtained, the 500 videos are divided into 10 types of pedestrian behaviors (e.g., fighting, climbing, falling, etc.), each type of behavior includes 50 video segments, a data sample set of a part of the video segments is shown in fig. 2, where case 1 is a fighting video sequence, case 2 is a climbing video sequence, and case 3 is a falling video sequence.
B. Building a multi-feature fused slowfast double-frame rate network model
As shown in fig. 3, the slowfast two-frame rate network model with multi-feature fusion includes two branches: a slow branch and a fast branch, wherein the slow branch operates at a low frame rate, and extracts 2 frames in 1 second using a large time span (i.e., the number of frames skipped per second), aiming at capturing semantic information provided by an image or a few sparse frames; the fast branch operates at a high frame rate, has high temporal resolution, and takes 15 frames in 1 second using a very small time span, with the aim of capturing rapidly changing motion.
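The dual-rate sampling of the two branches can be sketched as follows; this is an illustrative sketch only (the function name and the uniform `linspace` index selection are assumptions, not the patent's exact sampling procedure):

```python
import torch

def sample_clips(video, fps, slow_rate=2, fast_rate=15):
    """video: tensor [T, C, H, W] of frames decoded at `fps` frames per second.
    Returns the sparse slow clip and the dense fast clip."""
    t = video.shape[0]
    n_slow = max(1, round(t / fps * slow_rate))   # e.g. 2 frames per second
    n_fast = max(1, round(t / fps * fast_rate))   # e.g. 15 frames per second
    slow_idx = torch.linspace(0, t - 1, steps=n_slow).long()
    fast_idx = torch.linspace(0, t - 1, steps=n_fast).long()
    return video[slow_idx], video[fast_idx]
```

For a 1-second clip at 30 fps this yields 2 slow frames and 15 fast frames, matching the frame counts given in the embodiment.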
The slow branch comprises three first convolution blocks connected in sequence; the input of the 1st convolution block is a video frame image obtained by low frame rate sampling, the output of the 1st convolution block is used as the input of both the 2nd and 3rd convolution blocks, the output of the 2nd convolution block is also used as the input of the 3rd convolution block, and multi-feature fusion is realized in the 3rd convolution block. The label C in the slow branch indicates the number of channels and T the number of sampled frames.
The fast branch comprises three second convolution blocks connected in sequence; the input of its 1st convolution block is a video frame image obtained by high frame rate sampling, and the output of the previous convolution block is used as the input of the next. In addition, the feature information extracted by the fast branch is added to the trunk of the slow branch through lateral connections: the output of the 1st convolution block of the fast branch is fused with the output of the 1st convolution block of the slow branch as the input of the 2nd convolution block of the slow branch, and the output of the 2nd convolution block of the fast branch is fused with the output of the 2nd convolution block of the slow branch as the input of the 3rd convolution block of the slow branch. This enables the slow branch to extract spatial semantic information while also obtaining the temporal semantic information of the fast branch. The label βC in the fast branch indicates the number of channels and αT the number of sampled frames. Because the fast branch focuses on the temporal sequence and weakens spatial semantics, its channel count is 1/8 of the slow branch's, making the whole network light and efficient enough for real-time monitoring.
In the lateral connection, the fast-branch output of shape {αT, S², βC} must be deformed to match the temporal structure of the slow branch; that is, every α frames need to be compressed into one frame. In this embodiment, time sampling is adopted: simply sampling every α frames transforms {αT, S², βC} into {T, S², βC}. The transformed {T, S², βC} is then concatenated along the channel dimension with the output of the slow branch.
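A minimal sketch of this time-sampled lateral connection (the tensor layout [N, C, T, H, W] and the simple stride-α slice are assumptions):

```python
import torch

def lateral_connect(slow_feat, fast_feat, alpha):
    """slow_feat: [N, C, T, H, W]; fast_feat: [N, beta_C, alpha*T, H, W].
    Sample the fast feature every alpha frames so its temporal length matches
    the slow branch, then concatenate along the channel dimension."""
    sampled = fast_feat[:, :, ::alpha]            # {alpha*T, S^2, beta*C} -> {T, S^2, beta*C}
    return torch.cat([slow_feat, sampled], dim=1)
```

For example, with α = 8 a fast feature of 16 frames is reduced to the slow branch's 2 frames before the channel-wise concatenation.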
The convolution blocks in the slow branch and the fast branch both adopt the multi-layer feature-fusion output network structure shown in fig. 4 and are composed of k+1 convolutional layers; the output of the ith convolutional layer is spliced with the input of the ith convolutional layer and then used as the input of the (i+1)th convolutional layer, and the input of the 1st convolutional layer is spliced with the output of the (k+1)th convolutional layer and then used as the final output of the convolution block. In fig. 4, the input of the 1st convolutional layer is denoted X_0, the output of the 1st convolutional layer X_1, the input of the kth convolutional layer X_{k-1}, the output of the kth convolutional layer X_k, and the final output X_U.
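The splicing rule above can be sketched as a small module; channel widths, the 3x3x3 kernels, and the class name are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FusionConvBlock(nn.Module):
    """Sketch of the multi-layer feature-fusion convolution block: each
    convolutional layer's output is concatenated with its input to feed the
    next layer, and the block input X_0 is concatenated with the last layer's
    output to form the final output X_U."""
    def __init__(self, in_ch, growth, k):
        super().__init__()
        self.convs = nn.ModuleList()
        ch = in_ch
        for _ in range(k + 1):                      # k+1 convolutional layers
            self.convs.append(nn.Conv3d(ch, growth, kernel_size=3, padding=1))
            ch += growth                            # next input = concat(input, output)

    def forward(self, x0):
        x = x0
        for conv in self.convs:
            out = conv(x)
            x = torch.cat([x, out], dim=1)          # splice layer input with its output
        return torch.cat([x0, out], dim=1)          # block input + last layer output
```

The growing channel count mirrors the figure: each layer sees every earlier feature map, which is what gives the block its top-down fusion behavior.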
The network structure shown in fig. 3 thus includes multi-layer feature outputs; exploiting the fact that features at different layers carry different semantics, top-down feature fusion improves the slow branch's ability to extract category-level spatial semantics and the fast branch's ability to extract temporal semantics.
C. Loss function for designing slowfast network model
The slowfast network model is ultimately used to classify human behaviors in video: the last feature layer of the network outputs class probabilities through softmax. In the training stage, the network model is optimized with a cross-entropy loss function so that softmax assigns a higher probability to the correct class. The cross-entropy loss function is:

L_CE = -(1/N) · Σ_j Σ_i y_ji · log(p_ji)

where N is the batch size, m is the number of classes, p_ji is the probability that the network predicts sample j as class i, and y_ji is the label: if the true class of sample j is i, the position of i in the one-hot vector is 1, otherwise 0.
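As a sanity check, this batch-averaged hard-label cross entropy can be computed directly from softmax probabilities (a sketch; the `clamp_min` guard against log(0) is my addition):

```python
import torch

def one_hot_cross_entropy(probs, targets):
    """Batch-averaged cross entropy over softmax probabilities: with one-hot
    labels this reduces to -log p of the true class for each sample."""
    n = probs.shape[0]
    true_class_p = probs[torch.arange(n), targets]
    return -torch.log(true_class_p.clamp_min(1e-8)).mean()
```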
For better training of the classification model, this embodiment improves the cross-entropy loss function above. y_ji is the one-hot class encoding (a vector of 0s and 1s): the position of the true class is 1 and all other positions are 0, i.e., a hard label. The improved cross-entropy loss changes the real label from a hard label to a soft label, expressed as follows:
where N is the batch size, m is the number of classes, p_ji(k) is the probability predicted for the class at the kth network iteration, and the soft label y_ji(k) takes the value p_ji(k-1) at the position of the true class and 0 elsewhere, where p_ji(k-1) is the probability predicted for that class by the network at the previous iteration, expressed as follows:

where k denotes the number of network iterations and N_epoch denotes the preset maximum number of network iterations.
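A hedged sketch of this soft-label cross entropy follows. It places the previous iteration's true-class probability at the true-class position of the label and 0 elsewhere; the exact blending in the patent's elided formula, which involves k and N_epoch, may differ from this simplification:

```python
import torch

def soft_label_ce(probs_k, probs_prev, targets):
    """probs_k: current-iteration softmax outputs [N, m];
    probs_prev: previous-iteration outputs [N, m]; targets: true class indices."""
    n = probs_k.shape[0]
    idx = torch.arange(n)
    soft = torch.zeros_like(probs_k)
    soft[idx, targets] = probs_prev[idx, targets]   # y_ji(k) = p_ji(k-1) at true class
    return -(soft * torch.log(probs_k.clamp_min(1e-8))).sum(dim=1).mean()
```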
D. Training the slowfast network model by using the sample data set constructed in the step A
Dividing the video segment data set in the step A into 10 mutually exclusive subsets with equal size according to a cross validation method, using a union set of 9 subsets as a training set each time, and using the rest subset as a test set, thus obtaining 10 groups of training/test sets;
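The 10-fold split described above can be sketched as follows (the seeded shuffle before partitioning is an assumption):

```python
import random

def ten_fold_splits(clips, k=10, seed=0):
    """Partition the video clips into k equal, mutually exclusive subsets;
    each fold uses one subset as the test set and the union of the other
    k-1 subsets as the training set."""
    clips = list(clips)
    random.Random(seed).shuffle(clips)
    folds = [clips[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield train, test
```

With the embodiment's 500 clips, each fold yields a 450-clip training set and a 50-clip test set.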
The slowfast network model is trained jointly on the divided video data sets, which are expanded and augmented by whole-video flipping and random erasing. For the network structure shown in fig. 3, the network is initialized with weights pre-trained on the ImageNet dataset so that it converges faster. During slowfast network training, the initial learning rate is set to 0.01 and decays exponentially with the number of training iterations, the batch size is set to 8, training stops after 400 epochs, and finally the trained model is saved as a .pt file.
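The stated schedule (initial learning rate 0.01, exponential decay, batch size 8, 400 epochs, save to .pt) can be sketched as below; the optimizer choice (SGD with momentum) and the decay factor `gamma` are assumptions, since the text does not name them:

```python
import torch

def make_training_setup(model, lr=0.01, gamma=0.99):
    """Optimizer plus an exponentially decaying learning-rate schedule."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
    return optimizer, scheduler

# After the final epoch, the trained weights would be saved as a .pt file, e.g.:
# torch.save(model.state_dict(), "slowfast_fusion.pt")
```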
E. Testing the Slowfast network model
The slowfast network model is loaded, the trained parameter file (the .pt file), including the weight values of all network layers, is read and imported into the slowfast network, and the effect of the model is tested with the test set divided in step D.
In practical application, video monitoring images are collected as the input of the trained slowfast network model, so abnormal behaviors can be monitored in real time; when an abnormal behavior occurs, its type is output and an alarm is raised.
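The real-time monitoring loop can be sketched as follows. This is a hedged illustration: the buffer size, sampling counts (2 slow / 15 fast per second, per the embodiment), and the assumption that the model takes the two clips and returns class logits are mine:

```python
import torch

def monitor_stream(frames, model, abnormal_classes, fps=30):
    """frames: iterable of [C, H, W] tensors from the camera stream.
    Buffers one second of frames, samples the slow and fast clips,
    classifies, and yields an alert for abnormal classes."""
    buf = []
    for frame in frames:
        buf.append(frame)
        if len(buf) == fps:
            clip = torch.stack(buf)                           # [T, C, H, W]
            slow = clip[torch.linspace(0, fps - 1, 2).long()]
            fast = clip[torch.linspace(0, fps - 1, 15).long()]
            with torch.no_grad():
                probs = torch.softmax(model(slow, fast), dim=-1)
            cls = int(probs.argmax())
            if cls in abnormal_classes:
                yield ("alert", cls)
            buf.clear()
```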
Claims (9)
1. A method for analyzing abnormal behaviors of real-time intelligent video monitoring based on slowfast double-frame rate, characterized by comprising the following steps:
A. acquiring a character video clip with a specific behavior as a sample data set in an application scene, labeling a pedestrian category label, and preprocessing the sample data set;
B. building a multi-feature-fused slowfast double-frame rate network model, wherein the model comprises a slow branch and a fast branch, the slow branch operates at a low frame rate, and the fast branch operates at a high frame rate;
the slow branch comprises three first convolution blocks which are sequentially connected, the input of the 1 st convolution block is a video frame image obtained by low frame rate sampling, the output of the 1 st convolution block is simultaneously used as the input of the 2 nd and 3 rd convolution blocks, the output of the 2 nd convolution block is also used as the input of the 3 rd convolution block, and multi-feature fusion is realized in the 3 rd convolution block;
the fast branch comprises three second convolution blocks which are connected in sequence, the input of the 1st convolution block is a video frame image obtained by high frame rate sampling, and the output of the previous convolution block is used as the input of the next convolution block; and the outputs of the second convolution blocks in the fast branch are laterally connected with the outputs of the first convolution blocks of the slow branch;
after the output results of the last convolution block of the slow branch and the last convolution block of the fast branch are connected, the behavior category is predicted through a softmax function;
C. b, training the multi-feature fusion slowfast double-frame rate network model established in the step B by using a loss function based on a soft label and the sample data set in the step A;
D. and acquiring a monitoring video in real time, and detecting abnormal behaviors by using a trained slowfast double-frame rate network model with multi-feature fusion.
2. The method according to claim 1, wherein the category labels include fighting, climbing, and falling over.
3. The method according to claim 1, wherein the class label adopts one-hot encoding, the position of the true class being 1 and the rest 0.
4. The method for analyzing the abnormal behavior of the real-time intelligent video monitoring based on the slowfast double-frame rate as claimed in claim 1, wherein the lateral connection is specifically: the output of the 1st convolution block of the fast branch is fused with the output of the 1st convolution block of the slow branch as the input of the 2nd convolution block of the slow branch, and the output of the 2nd convolution block of the fast branch is fused with the output of the 2nd convolution block of the slow branch as the input of the 3rd convolution block of the slow branch.
5. The method according to claim 4, wherein during the lateral connection, the result output in the fast branch is sampled every α frames, converted into the same number of video frames as in the slow branch, and then connected in the channel direction.
6. The method according to claim 1, wherein the first convolution block and the second convolution block adopt a network structure with multi-layer feature-fusion output and are composed of k+1 convolutional layers; the output of the ith convolutional layer is spliced with the input of the ith convolutional layer and then used as the input of the (i+1)th convolutional layer, and the input of the 1st convolutional layer is spliced with the output of the (k+1)th convolutional layer and then used as the final output of the convolution block.
7. The method according to claim 1, wherein the soft-tag-based loss function is:
wherein L_CE is the loss value, N is the batch size, m is the number of classes, p_ji(k) is the probability that the jth sample is predicted as class i at the kth network iteration, and p_ji(k-1) is the probability that the jth sample was predicted as class i at the previous network iteration; y_ji(k) denotes the soft-label vector, of length m; k denotes the number of network iterations, and N_epoch denotes the preset maximum number of network iterations.
8. The method for analyzing the abnormal behavior of the real-time intelligent video monitoring based on the slowfast double-frame rate as claimed in claim 1, wherein the training of the step C is performed on the video data set in the step A according to a cross-validation method.
9. The method for analyzing the abnormal behavior of the real-time intelligent video monitoring based on the slowfast double-frame rate according to claim 1, wherein step D specifically comprises: acquiring a monitoring video in real time; in the trained multi-feature-fused slowfast double-frame rate network model, the fast branch extracts 15 high-frame-rate-sampled video frame images per second and the slow branch extracts 2 low-frame-rate-sampled video frame images per second; after the output results of the two branches are connected, the behavior type is predicted through the softmax function, and a warning is issued when abnormal behavior is detected.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111037913.7A CN113743306A (en) | 2021-09-06 | 2021-09-06 | Method for analyzing abnormal behaviors of real-time intelligent video monitoring based on slowfast double-frame rate |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113743306A true CN113743306A (en) | 2021-12-03 |
Family
ID=78735889
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111037913.7A Pending CN113743306A (en) | 2021-09-06 | 2021-09-06 | Method for analyzing abnormal behaviors of real-time intelligent video monitoring based on slowfast double-frame rate |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113743306A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114550047A (en) * | 2022-02-22 | 2022-05-27 | 西安交通大学 | Behavior rate guided video behavior identification method |
CN114550047B (en) * | 2022-02-22 | 2024-04-05 | 西安交通大学 | Behavior rate guided video behavior recognition method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111178432A (en) * | 2019-12-30 | 2020-05-19 | 武汉科技大学 | Weak supervision fine-grained image classification method of multi-branch neural network model |
CN112183313A (en) * | 2020-09-27 | 2021-01-05 | 武汉大学 | SlowFast-based power operation field action identification method |
CN112232355A (en) * | 2020-12-11 | 2021-01-15 | 腾讯科技(深圳)有限公司 | Image segmentation network processing method, image segmentation device and computer equipment |
CN112597824A (en) * | 2020-12-07 | 2021-04-02 | 深延科技(北京)有限公司 | Behavior recognition method and device, electronic equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
冯超峰: "基于SlowFast与时域分割策略的行为识别研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
曹占涛等: "基于修正标签分布的乳腺超声图像分类", 《电子科技大学学报》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113936339B (en) | Fighting identification method and device based on double-channel cross attention mechanism | |
CN110175580B (en) | Video behavior identification method based on time sequence causal convolutional network | |
Zhao et al. | Spatio-temporal autoencoder for video anomaly detection | |
WO2020173226A1 (en) | Spatial-temporal behavior detection method | |
CN111259786B (en) | Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video | |
CN109711463B (en) | Attention-based important object detection method | |
CN108537119B (en) | Small sample video identification method | |
CN112580523A (en) | Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium | |
CN112016500A (en) | Group abnormal behavior identification method and system based on multi-scale time information fusion | |
CN112468888B (en) | Video abstract generation method and system based on GRU network | |
CN111738218B (en) | Human body abnormal behavior recognition system and method | |
Wang et al. | Spatial–temporal pooling for action recognition in videos | |
CN111914731B (en) | Multi-mode LSTM video motion prediction method based on self-attention mechanism | |
CN112200096B (en) | Method, device and storage medium for realizing real-time abnormal behavior identification based on compressed video | |
CN109446897B (en) | Scene recognition method and device based on image context information | |
CN109583334B (en) | Action recognition method and system based on space-time correlation neural network | |
CN111046728A (en) | Straw combustion detection method based on characteristic pyramid network | |
CN109614896A (en) | A method of the video content semantic understanding based on recursive convolution neural network | |
CN116721458A (en) | Cross-modal time sequence contrast learning-based self-supervision action recognition method | |
CN116580453A (en) | Human body behavior recognition method based on space and time sequence double-channel fusion model | |
CN113743306A (en) | Method for analyzing abnormal behaviors of real-time intelligent video monitoring based on slowfast double-frame rate | |
Li et al. | Fire flame image detection based on transfer learning | |
CN115690658B (en) | Priori knowledge-fused semi-supervised video abnormal behavior detection method | |
CN110738129B (en) | End-to-end video time sequence behavior detection method based on R-C3D network | |
CN114120076B (en) | Cross-view video gait recognition method based on gait motion estimation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |

Application publication date: 20211203 |