CN114202803A - Multi-stage human body abnormal action detection method based on residual error network - Google Patents

Info

Publication number
CN114202803A
CN114202803A (application CN202111553555.5A)
Authority
CN
China
Prior art keywords
network
video
human body
abnormal
layer
Prior art date
Legal status
Pending
Application number
CN202111553555.5A
Other languages
Chinese (zh)
Inventor
张凤全
程健
周锋
王桂玲
Current Assignee
North China University of Technology
Original Assignee
North China University of Technology
Priority date
Filing date
Publication date
Application filed by North China University of Technology
Priority to CN202111553555.5A
Publication of CN114202803A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The invention relates to a multi-stage human body abnormal action detection method based on a residual network, comprising the following steps: segmenting the video clip to be detected into video instances of equal length; obtaining the position and size of the human body target bounding box in each surveillance video instance with a target detection network model; calculating the category and confidence of the human action inside the bounding box in each surveillance video instance with an action recognition network model, based on the bounding box information; and assigning an anomaly score to each surveillance video instance with an anomaly-score learning model and performing weighted fusion. The invention can quickly obtain abnormal action information in surveillance video, and designs the target detection network model required for detecting the human bounding box, the action recognition network model for analyzing the human action, and the anomaly-score learning model for predicting anomaly scores. The method detects abnormal human actions in common surveillance video scenes, is simple, has a low false alarm rate, and has practical value.

Description

Multi-stage human body abnormal action detection method based on residual error network
Technical Field
The invention relates to a multi-stage human body abnormal action detection method based on a residual network, and in particular to a human action recognition and classification method based on a residual network. It draws on techniques from the fields of machine learning, deep learning, computer vision, and anomaly detection, and is mainly applied to abnormal action detection in real surveillance video scenes, particularly in fields such as smart home and intelligent security.
Background
With the rapid development of society, public safety awareness is rising year by year, and frequent public safety incidents draw increasing public attention. Security engineering built on video recording and monitoring systems has developed rapidly: large numbers of surveillance video products are deployed on major roads, at key buildings, and in crowded public places such as campuses in every city, while home security monitoring has likewise become an indispensable part of modern life. Whether for intelligent life assistance in the home or intelligent safety monitoring in public places, much key information can be extracted from surveillance video. Finding that information manually, however, costs workers a great deal of time and energy; faced with massive volumes of monitoring data, the efficiency and accuracy of manually analyzing abnormal actions in video cannot reach an acceptable level. An intelligent video monitoring system that lets a computer detect abnormal actions automatically and quickly extract and forward the key information to staff can therefore improve the detection efficiency and accuracy of the video monitoring system, achieve real-time abnormal action detection, effectively safeguard public safety and normal activities, and reduce the incidence of public safety problems.
Traditional video monitoring systems are limited to recording; complex tasks such as real-time detection and analysis of abnormal events still require manual operation. Handling abnormal behavior events purely by manual methods can no longer meet the requirements of current monitoring systems, and intelligent video monitoring technology has become one of the best ways to handle abnormal behavior events quickly and in real time.
In the abnormal action recognition task, abnormal events occur far less frequently than normal events, so sufficient fully-labeled training samples are lacking; moreover, abnormal actions are so varied that all categories of abnormal events cannot be enumerated in advance.
The abnormal action recognition task can therefore be learned in a weakly supervised manner: the model is trained with coarse-grained, video-level labels rather than fine-grained, frame-level labels. The present method adopts a multi-stage human body abnormal action detection approach based on a residual network, using several network models to compute, respectively, the position and size of the human target bounding box, the category and confidence of the human action, and the anomaly score of each surveillance video instance, and then performs weighted fusion.
Disclosure of Invention
The invention overcomes some shortcomings of the prior art by providing a multi-stage human body abnormal action detection method based on a residual network. Using a target detection network model, a human action recognition model, and an anomaly-score learning model, it can quickly obtain abnormal action information in surveillance video. The invention designs the target detection network model required for detecting the human bounding box, the action recognition network model for analyzing human actions, and the anomaly-score learning model for predicting anomaly scores; it classifies and recognizes abnormal human actions such as falls and fights in surveillance video, with good real-time detection speed and accuracy.
The technical solution of the invention is as follows. A multi-stage human body abnormal action detection method based on a residual network, suitable for classifying and recognizing abnormal human actions (such as falls and fights) in surveillance video, comprises the following steps:
(1) segmenting a video segment to be detected into m video instances with equal length;
(2) according to the video example segmented in the step (1), adopting a target detection network model based on a residual error network to obtain the position and the size of a human body target boundary frame appearing in each monitoring video example;
(3) based on the position and the size of the human body target boundary box in the step (2), calculating the category and the confidence coefficient of the human body action in the human body target boundary box in each monitoring video instance by using a human body action recognition model based on a double-flow space-time residual error network;
(4) according to the category and confidence of the human action in step (3), assign an anomaly score to each surveillance video instance with an anomaly-score learning model and perform weighted fusion, so that abnormal human actions such as falls and fights can be classified and recognized.
The step (1) is specifically realized as follows:
(11) if the applicable data set is a training and verification data set, divide the video segments containing abnormal events into abnormal video instance packets, denoted B_a = {a_1, a_2, …, a_m}, according to the abnormal-event labels in the data set, and divide the video segments containing no abnormal event into normal video instance packets, denoted B_n = {n_1, n_2, …, n_m};
(12) If the applicable data set is a test data set, dividing the video clip to be detected into m video instances with equal length;
(13) crop or scale the resolution of the video instances obtained in step (11) or (12) to 228 × 228.
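The segmentation and resizing of steps (12) and (13) can be sketched as follows. This is an illustrative sketch, not the patented implementation: the drop-remainder policy for frames that do not divide evenly, and the nearest-neighbour resizing, are assumptions not specified in the text.

```python
import numpy as np

def split_into_instances(num_frames: int, m: int) -> list[range]:
    """Split a video of num_frames frames into m equal-length instances.
    Leftover frames after integer division are dropped from the tail,
    so every instance has exactly num_frames // m frames (assumption)."""
    length = num_frames // m
    return [range(i * length, (i + 1) * length) for i in range(m)]

def resize_frame(frame: np.ndarray, size: int = 228) -> np.ndarray:
    """Nearest-neighbour resize of an H x W x C frame to size x size."""
    h, w = frame.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return frame[rows][:, cols]
```

A 300-frame clip split into m = 32 instances yields 32 instances of 9 frames each.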
The step (2) is specifically realized as follows:
(21) use Faster R-CNN as the target detection network model, with the video instances obtained in step (13) as input and a residual network as the backbone. The stages conv2_x, conv3_x, conv4_x and conv5_x, which form the main body of the feature-extraction part of the residual network, each consist of several residual modules; 3 convolutional layers plus a residual connection form one residual module, with conv2_x containing 3 residual modules, conv3_x containing 4, conv4_x containing 23, and conv5_x containing 3. An average pooling layer, a fully-connected layer and a Softmax layer are appended after the conv5_x stage;
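The stage composition given in step (21) is that of ResNet-101, which can be checked by counting weighted layers: the initial convolution, three convolutions per residual module, and the final fully-connected layer.

```python
# Sanity check on the residual-network composition of step (21):
# 3, 4, 23 and 3 residual modules of 3 convolutional layers each,
# plus the initial conv1 layer and the final fully-connected layer,
# give the 101 weighted layers of ResNet-101.
blocks_per_stage = {"conv2_x": 3, "conv3_x": 4, "conv4_x": 23, "conv5_x": 3}
convs_per_block = 3
total = 1 + convs_per_block * sum(blocks_per_stage.values()) + 1
print(total)  # 101
```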
(22) combine a feature pyramid network as the candidate-region generator: reduce the dimensionality of the upper-layer feature map with a convolutional layer of kernel size 1 × 1, up-sample the feature map of the lateral convolutional layer to the same size, add the two processed feature maps element by element, and use the result as the input of the next lateral layer; the feature maps of conv2_x, conv3_x and conv4_x are added element by element to the conv5_x feature map after convolution, dimensionality reduction and sampling, and each pyramid level outputs its own predictions from its feature map;
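One top-down merge step of the feature pyramid described in step (22) can be sketched with NumPy. The 1 × 1 convolution is modeled as a per-pixel matrix multiply and the up-sampling as nearest-neighbour repetition; the tensor shapes and random weights are illustrative assumptions.

```python
import numpy as np

def lateral_merge(top_down: np.ndarray, lateral: np.ndarray,
                  w_1x1: np.ndarray) -> np.ndarray:
    """One top-down merge step of a feature pyramid.

    top_down : (C, H, W)      feature map from the coarser pyramid level
    lateral  : (C_in, 2H, 2W) backbone feature map of the finer level
    w_1x1    : (C, C_in)      weights of a 1x1 conv reducing channels to C
    """
    # 1x1 convolution == per-pixel channel mixing
    reduced = np.einsum("oc,chw->ohw", w_1x1, lateral)
    # 2x nearest-neighbour up-sampling of the coarser map
    upsampled = top_down.repeat(2, axis=1).repeat(2, axis=2)
    # element-by-element superposition
    return reduced + upsampled
```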
(23) use approximate joint training for the region proposal network and the Fast R-CNN detector within Faster R-CNN: in each stochastic gradient descent iteration, the generated candidate regions are held fixed while Fast R-CNN is trained, and the training loss of the region proposal network and that of Fast R-CNN are combined in the shared layers through backpropagation;
(24) the human target bounding box adopts a parameterized representation:
t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a),
t*_x = (x* − x_a)/w_a, t*_y = (y* − y_a)/h_a, t*_w = log(w*/w_a), t*_h = log(h*/h_a),
where x, y, w, h denote the center coordinates (abscissa and ordinate) and the width and height of the predicted bounding box, x_a, y_a, w_a, h_a those of the anchor box, and x*, y*, w*, h* those of the ground-truth bounding box;
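The bounding-box parameterization of step (24) and its inverse can be written out directly; the round trip encode-then-decode recovers the original box.

```python
import numpy as np

def encode(box, anchor):
    """Parameterize a box (x, y, w, h: center coordinates, width, height)
    relative to an anchor box, as in step (24)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    """Invert the parameterization to recover the box from (tx, ty, tw, th)."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = t
    return np.array([tx * wa + xa, ty * ha + ya,
                     wa * np.exp(tw), ha * np.exp(th)])
```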
(25) the optimization goal of the target detection network model is to reduce the error of the predicted human target's bounding box and category by minimizing the multi-task training objective
L({p_i}, {t_i}) = (1/N_c) Σ_i L_c(p_i, p*_i) + λ (1/N_r) Σ_i p*_i L_r(t_i, t*_i),
where L_c(p, u) = −log p_u, L_r(t, t*) = R(t − t*), and
R(x) = 0.5 x² if |x| < 1, |x| − 0.5 otherwise
is the smooth L1 loss function; i denotes the anchor-box index, p_i the probability that the i-th anchor box contains the target object, p*_i the sample label used for learning, t_i the vector of parameterized coordinates of the human target bounding box, t*_i the parameterized coordinates of an anchor box with a positive sample label, L_c the log classification loss predicting the presence of a human target, and λ the hyper-parameter balancing the weight of the classification and regression terms. The output of the target detection network model is the position and size of the human target bounding box in the video instance.
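The smooth L1 loss R(x) and the two-term objective of step (25) can be sketched numerically. The binary cross-entropy form of the classification term and the mean normalization are assumptions made for this illustration.

```python
import numpy as np

def smooth_l1(x: np.ndarray) -> np.ndarray:
    """R(x): the smooth L1 loss used for the regression term."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def detection_loss(p, p_star, t, t_star, lam=1.0):
    """Multi-task objective: log classification loss on anchor scores plus
    smooth-L1 regression loss on the parameterized coordinates, the latter
    counted only for positive anchors (p_star == 1). Normalization constants
    are folded into the means here for brevity (assumption)."""
    cls = -np.mean(p_star * np.log(p) + (1 - p_star) * np.log(1 - p))
    reg = np.mean(p_star[:, None] * smooth_l1(t - t_star))
    return cls + lam * reg
```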
The step (3) is specifically realized as follows:
(31) use a dual-stream spatio-temporal network as the human action recognition model, with the position and size of the human target bounding box as input; the model uses two residual networks, a temporal stream and a spatial stream, as backbone. For a batch of input data samples {x_1, x_2, x_3, …, x_m}, with β and γ as shift and scale parameters, batch normalization computes the transformed y_i as follows: the per-channel batch mean
μ = (1/m) Σ_{i=1}^{m} x_i,
the per-channel batch variance
σ² = (1/m) Σ_{i=1}^{m} (x_i − μ)²,
the normalized value
x̂_i = (x_i − μ)/√(σ² + ε),
and the scale-and-shift transformation
y_i = γ x̂_i + β;
(32) the first-layer filters of the temporal stream, originally three RGB channels, are adjusted to run on stacks of horizontal and vertical optical flow, each stack containing 10 video frames; cross-layer residual connections
Y_l = f(X_l) + g(X_l, W_l)
are used in the temporal-stream and spatial-stream residual networks respectively, and a cross-stream residual connection
x̂^s_l = x^s_l + x^t_l
is deployed from the temporal stream to the spatial stream before its residual module, where f is the ReLU activation function, X_l is the input matrix of layer l, Y_l is the output matrix of layer l, g(·,·) is a non-linear residual mapping function weighted by the convolution filters W_l, x^s_l is the input of the layer-l spatial stream, x^t_l is the input of the layer-l temporal stream, W^s_l is the connection weight matrix of the layer-l residual neurons of the spatial stream, and y^s_l is the output of the layer-l spatial stream;
(33) a residual network with frame rate α and channel size C is used as the spatial stream network to determine static regions in the video, and a residual network with frame rate 8α and channel size C/8 is used as the temporal stream network to determine dynamic regions. The spatial stream network is called the slow channel and the temporal stream network the fast channel. The slow channel uses the temporal stride as a hyper-parameter to control the number of video frames crossed per second: on a data set with a video frame rate of 30, about 2 frames of features per second are extracted, while the fast channel extracts 15 frames of features per second. Data in the fast channel is passed to the slow channel through lateral connections; the network model converts the dimensionality of the data fed into each lateral connection by the fast channel before merging it into the slow channel, performing a three-dimensional convolution at fixed frame intervals. On the last layer of the fast and slow channels, the human action recognition network model applies global average pooling to reduce the dimensionality of the data samples, fuses the results, feeds them into a fully-connected classification layer, and outputs a confidence score for each action through Softmax;
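The frame-sampling ratio of the two channels in step (33) can be sketched with a fixed temporal stride (the "temporal stride" hyper-parameter); the stride values below are chosen to reproduce the 2-versus-15 frames-per-second figures in the text.

```python
def sample_pathway_frames(num_frames: int, stride: int) -> list[int]:
    """Indices of the frames a pathway reads from a clip, using a fixed
    temporal stride (the hyper-parameter controlling frames kept per second)."""
    return list(range(0, num_frames, stride))

# For a 1-second clip at 30 fps: a slow-channel stride of 15 keeps 2 frames,
# a fast-channel stride of 2 keeps 15 frames, matching the 8x ratio described.
slow = sample_pathway_frames(30, 15)
fast = sample_pathway_frames(30, 2)
print(len(slow), len(fast))  # 2 15
```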
(34) the optimization goal of the human action recognition model is to reduce the error of human action category prediction by minimizing the cross-entropy loss of the Softmax classifier,
L = −Σ_i y_i log( e^{x_i} / Σ_j e^{x_j} ),
where x_i is the action confidence score of the i-th human action category in the output vector of the Softmax classifier, and y_i is the i-th element of the corresponding ground-truth action label vector.
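The Softmax cross-entropy loss of step (34) can be computed in a numerically stable form by subtracting the maximum score before exponentiating.

```python
import numpy as np

def softmax_cross_entropy(x: np.ndarray, y: np.ndarray) -> float:
    """Cross-entropy between a one-hot label vector y and softmax(x),
    where x is the classifier's raw score vector."""
    z = x - x.max()                      # numerical stability
    log_p = z - np.log(np.exp(z).sum())  # log-softmax
    return float(-(y * log_p).sum())
```

With scores equal to the logs of probabilities (0.7, 0.2, 0.1) and the true class being the first, the loss equals −log 0.7.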
The step (4) is specifically realized as follows:
(41) the method uses a multi-layer perceptron as the anomaly-score learning model, with the category and confidence score of the human action as input, and uses multi-instance learning to train and predict the anomaly score f(V) of each video instance in the instance packets, where B_a is an abnormal video instance packet, B_n is a normal video instance packet, V_a and V_n are the corresponding abnormal and normal video instances, and f(V_a), f(V_n) are their predicted anomaly scores. h is the multi-layer perceptron model, consisting of three fully-connected layers: an input layer of 512 neurons, a hidden layer of 32 neurons, and a final output layer of 1 neuron that predicts the anomaly score of a given video instance, with an output value ranging between 0 and 1;
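A forward pass of the 512-32-1 perceptron described in step (41) can be sketched as follows. The ReLU hidden activation and sigmoid output are assumptions (the text specifies only the layer sizes and the 0-to-1 output range), and the weights here are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights for the 512 -> 32 -> 1 architecture (randomly initialized here;
# a real model would be trained with the multi-instance ranking loss).
W1, b1 = rng.normal(0, 0.05, (512, 32)), np.zeros(32)
W2, b2 = rng.normal(0, 0.05, (32, 1)), np.zeros(1)

def anomaly_score(features: np.ndarray) -> float:
    """Three-layer perceptron forward pass: ReLU hidden layer, then a
    sigmoid squashing the final score into [0, 1]."""
    h = np.maximum(features @ W1 + b1, 0.0)
    return float(sigmoid(h @ W2 + b2)[0])

score = anomaly_score(rng.normal(size=512))
```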
(42) the optimization goal of the anomaly-score learning model is to maximize the distance between the two classes of samples, the abnormal video instance packet and the normal video instance packet, so that the multi-layer perceptron model can separate them, using a hinge-based ranking loss
L_h = max(0, 1 − max_{V_a ∈ B_a} f(V_a) + max_{V_n ∈ B_n} f(V_n)).
Considering that in reality the probability and frequency of normal events far exceed those of abnormal events, a sparsity constraint is added to the loss function,
L_af = L_h + ε Σ_{V_a ∈ B_a} f(V_a),
so that abnormal human actions that occur less frequently than normal events, such as falls, fights and violence, receive higher anomaly scores; ε is a hyper-parameter adjusting the relative weight of the hinge ranking loss and the sparsity constraint;
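The packet-level ranking objective of step (42) can be sketched directly: push the highest score in the abnormal packet above the highest score in the normal packet, with a sparsity term over the abnormal packet. The margin of 1 and the example value of the hyper-parameter ε are illustrative assumptions.

```python
import numpy as np

def mil_ranking_loss(scores_abnormal, scores_normal, eps=8e-5):
    """Hinge-based ranking loss over instance packets: the maximum score in
    the abnormal packet should exceed the maximum score in the normal packet
    by a margin of 1, plus a sparsity term (weighted by the hyper-parameter
    eps) encouraging only a few high scores in the abnormal packet."""
    hinge = max(0.0, 1.0 - np.max(scores_abnormal) + np.max(scores_normal))
    sparsity = eps * np.sum(scores_abnormal)
    return hinge + sparsity
```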
(43) the target detection network model, the human action recognition model and the anomaly-score learning model are jointly trained and optimized by accumulating the loss functions and performing weight balancing,
L = λ_od L_od + λ_af L_af + λ_fil L_fil,
where λ_od, λ_af and λ_fil are regularization hyper-parameters controlling each loss term.
Compared with the prior art, the invention has the advantages that:
(1) the model extracts rich human target bounding-box features and human action features, can effectively learn abnormal human action characteristics from training samples, and, beyond prior-art weakly supervised multi-instance learning models that only detect whether an abnormal event occurs, can also give the category and confidence score of the abnormal human action;
(2) through the segmented video instances and the features extracted by the model, abnormal patterns can be learned quickly and the spatio-temporal position of abnormal actions in the surveillance video can be detected; beyond prior-art weakly supervised multi-instance learning models that only detect the temporal extent of abnormal actions, the model can also quickly localize their spatial position;
(3) the model can be applied to unclear surveillance video with a resolution below 228 × 228. In experiments on a surveillance video data set, the AUC (area under the ROC curve) for abnormal actions, used as the benchmark metric of classification performance, reaches 77.43%; compared with prior-art models such as a time-series regularity learning model (50.60%), a high-frame-rate abnormal event detection model (65.51%), and a weakly supervised multi-instance learning model (75.41%), the model achieves better classification performance.
Drawings
FIG. 1 is a flow chart of a method implementation of the present invention;
FIG. 2 is a schematic diagram of an internal architecture of a dual-stream residual error network;
FIG. 3 is a schematic diagram of the fast R-CNN architecture;
FIG. 4 is a schematic diagram of a composite network model for target detection;
FIG. 5 is a schematic diagram of a dual-flow network model architecture;
FIG. 6 is a schematic diagram of video example partitioning;
FIG. 7 is a diagram of a multi-level perceptron architecture.
Detailed Description
The present invention will be described in detail with reference to the following embodiments.
As shown in fig. 1, the method of the present invention specifically includes the following steps:
(1) segmenting a video segment to be detected into m video instances with equal length;
(2) according to the video example segmented in the step (1), adopting a target detection network model based on a residual error network to obtain the position and the size of a human body target boundary frame appearing in each monitoring video example;
(3) based on the position and the size of the human body target boundary box in the step (2), calculating the category and the confidence coefficient of the human body action in the human body target boundary box in each monitoring video instance by using a human body action recognition model based on a double-flow space-time residual error network;
(4) according to the category and confidence of the human action in step (3), assign an anomaly score to each surveillance video instance with an anomaly-score learning model and perform weighted fusion, so that abnormal human actions such as falls and fights can be classified and recognized.
The basic implementation process is as follows:
(1) using the UCF-Crime data set, divide the video segments containing abnormal events into abnormal video instance packets and the video segments containing no abnormal event into normal video instance packets, according to the abnormal-event labels of the video segments in the input data set, as shown in fig. 6;
(2) crop or scale the video instance resolution to 228 × 228 using the FFmpeg audio/video library;
(3) use Faster R-CNN as the target detection network model, with a 101-layer residual network as the backbone. The stages conv2_x, conv3_x, conv4_x and conv5_x, which form the main body of the feature-extraction part of the residual network, each consist of several residual modules; 3 convolutional layers plus a residual connection form one residual module, with conv2_x containing 3 residual modules, conv3_x containing 4, conv4_x containing 23, and conv5_x containing 3. An average pooling layer, a fully-connected layer and a Softmax layer are appended after the conv5_x stage; the architecture is shown in fig. 3;
(4) combine a feature pyramid network as the candidate-region generator; the structure of Faster R-CNN combined with the feature pyramid network is shown in fig. 4. The feature pyramid network adds sampling operations layer by layer and is connected laterally with the network layers of Faster R-CNN. The dimensionality of the upper-layer feature map is reduced with a convolutional layer of kernel size 1 × 1, the feature map of the lateral convolutional layer in the feature pyramid network is up-sampled to the same size, the two processed feature maps are added element by element, and the result is used as the input of the next lateral layer; the feature maps of conv2_x, conv3_x and conv4_x are added element by element to the conv5_x feature map after convolution, dimensionality reduction and sampling, and the feature pyramid network outputs each layer's own predictions from its feature map;
(5) use approximate joint training for the region proposal network and the Fast R-CNN detector within Faster R-CNN: in each stochastic gradient descent iteration, the generated candidate regions are held fixed while Fast R-CNN is trained, and the training loss of the region proposal network and that of Fast R-CNN are combined in the shared layers through backpropagation;
(6) the human target bounding box adopts a parameterized representation:
t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a),
t*_x = (x* − x_a)/w_a, t*_y = (y* − y_a)/h_a, t*_w = log(w*/w_a), t*_h = log(h*/h_a),
where x, y, w, h denote the center coordinates (abscissa and ordinate) and the width and height of the predicted bounding box, x_a, y_a, w_a, h_a those of the anchor box, and x*, y*, w*, h* those of the ground-truth bounding box;
(7) the optimization goal of the target detection network model is to reduce the error of the predicted human target's bounding box and category by minimizing the multi-task training objective
L({p_i}, {t_i}) = (1/N_c) Σ_i L_c(p_i, p*_i) + λ (1/N_r) Σ_i p*_i L_r(t_i, t*_i),
where L_c(p, u) = −log p_u, L_r(t, t*) = R(t − t*), and
R(x) = 0.5 x² if |x| < 1, |x| − 0.5 otherwise
is the smooth L1 loss function; i denotes the anchor-box index, p_i the probability that the i-th anchor box contains the target object, p*_i the sample label used for learning, t_i the vector of parameterized coordinates of the human target bounding box, t*_i the parameterized coordinates of an anchor box with a positive sample label, L_c the log classification loss predicting the presence of a human target, and λ the hyper-parameter balancing the weight of the classification and regression terms. The output of the target detection network model is the position and size of the human target bounding box in the video instance;
(8) use a dual-stream spatio-temporal network as the human action recognition model, with the position and size of the human target bounding box as input; the model uses two residual networks, a temporal stream and a spatial stream, as backbone, whose internal architecture is shown in fig. 2. The input-layer stride of the temporal stream network is 2 and the channel number of its conv1 convolutional layer is 8; the input-layer stride of the spatial stream network is 16 and its channel number is 64; both streams add a sampling layer after the conv1 convolutional layer, and the feature-map size of the conv2, conv3, conv4 and conv5 convolutional layers of the temporal stream network is 1/8 of the corresponding convolutional layers of the spatial stream network. In the dual-stream spatio-temporal network, for a batch of input data samples {x_1, x_2, x_3, …, x_m}, with β and γ as shift and scale parameters, batch normalization computes the transformed y_i as follows: the per-channel batch mean
μ = (1/m) Σ_{i=1}^{m} x_i,
the per-channel batch variance
σ² = (1/m) Σ_{i=1}^{m} (x_i − μ)²,
the normalized value
x̂_i = (x_i − μ)/√(σ² + ε),
and the scale-and-shift transformation
y_i = γ x̂_i + β;
(9) the first-layer filters of the temporal stream, originally three RGB channels, are adjusted to run on stacks of horizontal and vertical optical flow, each stack containing 10 video frames; cross-layer residual connections
Y_l = f(X_l) + g(X_l, W_l)
are used in the temporal-stream and spatial-stream residual networks respectively, and a cross-stream residual connection
x̂^s_l = x^s_l + x^t_l
is deployed from the temporal stream to the spatial stream before its residual module, where f is the ReLU activation function, X_l is the input matrix of layer l, Y_l is the output matrix of layer l, g(·,·) is a non-linear residual mapping function weighted by the convolution filters W_l, x^s_l is the input of the layer-l spatial stream, x^t_l is the input of the layer-l temporal stream, W^s_l is the connection weight matrix of the layer-l residual neurons of the spatial stream, and y^s_l is the output of the layer-l spatial stream;
(10) a residual network with frame rate α and channel size C is used as the spatial stream network to determine static regions in the video, and a residual network with frame rate 8α and channel size C/8 is used as the temporal stream network to determine dynamic regions. The spatial stream network is called the slow channel and the temporal stream network the fast channel. The slow channel uses the temporal stride as a hyper-parameter to control the number of video frames crossed per second: on a data set with a video frame rate of 30, about 2 frames of features per second are extracted, while the fast channel extracts 15 frames of features per second. Data in the fast channel is passed to the slow channel through lateral connections; the network model converts the dimensionality of the data fed into each lateral connection by the fast channel before merging it into the slow channel, performing a three-dimensional convolution at fixed frame intervals. On the last layer of the fast and slow channels, the human action recognition network model applies global average pooling to reduce the dimensionality of the data samples, fuses the results, feeds them into a fully-connected classification layer, and outputs a confidence score for each action through Softmax; the dual-stream spatio-temporal residual network architecture is shown in fig. 5;
(11) in the learning process of the human action recognition model, the goal is to reduce the error of human action class prediction and minimize the cross-entropy loss function of the Softmax classifier
L_cls = -Σ_i y_i log( e^{x_i} / Σ_j e^{x_j} )
wherein x_i is the action confidence score for the i-th human action class in the output vector of the Softmax classifier, and y_i is the i-th element of the corresponding ground-truth action label vector;
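The Softmax cross-entropy above, sketched in NumPy; the max-subtraction is a standard numerical-stability detail, not something stated in the patent:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())       # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(scores, one_hot):
    # L = -sum_i y_i * log(softmax(x)_i), y the one-hot ground-truth vector
    p = softmax(scores)
    return -float(np.sum(one_hot * np.log(p + 1e-12)))

scores = np.array([2.0, 0.5, -1.0])   # per-class action confidence scores x_i
label = np.array([1.0, 0.0, 0.0])     # ground-truth label vector y
loss = cross_entropy(scores, label)
print(round(loss, 4))
```

Raising the score of the correct class lowers the loss, which is what the minimization objective drives.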
(12) the method uses a multilayer perceptron as the anomaly score learning model, takes the class and confidence score of the human action as input, and uses a multi-instance learning method on instance packages B_a = {a_1, a_2, …, a_m} and B_n = {n_1, n_2, …, n_m} to train and predict the anomaly score of each video instance in a package, wherein B_a is an abnormal video instance package, B_n is a normal video instance package, a_i and n_i are the corresponding abnormal and normal video instances, and h is the multilayer perceptron model, whose framework is shown in FIG. 7: it consists of three fully-connected layers, with 512 neurons in the input layer, 32 neurons in the hidden layer and 1 neuron in the final output layer, predicting the anomaly score of a given video instance with an output value ranging from 0 to 1;
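A forward pass through the described 512-32-1 perceptron h(·) can be sketched as follows; the random initialization and the sigmoid squashing that keeps scores in (0, 1) are assumptions, since the patent only fixes the layer sizes and the output range:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ScoreMLP:
    """Untrained sketch of the 512-32-1 perceptron h(.) described above."""
    def __init__(self, d_in=512, d_hidden=32):
        self.w1 = rng.standard_normal((d_in, d_hidden)) * 0.05
        self.b1 = np.zeros(d_hidden)
        self.w2 = rng.standard_normal((d_hidden, 1)) * 0.05
        self.b2 = np.zeros(1)

    def __call__(self, x):
        h = np.maximum(x @ self.w1 + self.b1, 0.0)   # hidden layer, ReLU
        return sigmoid(h @ self.w2 + self.b2)        # anomaly score in (0, 1)

h = ScoreMLP()
instance_feature = rng.standard_normal(512)  # stand-in for action class + confidence features
score = float(h(instance_feature)[0])
print(0.0 < score < 1.0)  # True
```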
(13) the optimization goal of the anomaly score learning model is to maximize the distance between the two classes of samples, the abnormal video instance package and the normal video instance package, to facilitate classification by the multilayer perceptron model, using the hinge-based ranking loss
L_rank = max(0, 1 - max_{i∈B_a} h(a_i) + max_{i∈B_n} h(n_i))
considering that in reality the probability and frequency of normal events are far greater than those of abnormal events, a sparsity constraint is added to the loss function
L_mil = L_rank + ε Σ_{i∈B_a} h(a_i)
so that abnormal human actions such as falls and blows, which occur less frequently than other normal events, receive higher anomaly scores, where ε is the hyper-parameter adjusting the relative weight of the hinge ranking loss and the sparsity constraint;
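A sketch of the ranking loss with sparsity term; the max-over-instances form of the package distance is an assumption borrowed from common multi-instance anomaly ranking formulations, and the ε value below is arbitrary:

```python
import numpy as np

def mil_ranking_loss(scores_abnormal, scores_normal, eps=8e-5):
    """Hinge ranking loss between instance packages plus a sparsity term.

    scores_*: anomaly scores h(a_i), h(n_i) of the instances in the
    abnormal and normal packages; eps weights the sparsity constraint.
    """
    hinge = max(0.0, 1.0 - np.max(scores_abnormal) + np.max(scores_normal))
    # Most instances even of an abnormal bag should score near 0.
    sparsity = np.sum(scores_abnormal)
    return hinge + eps * sparsity

sa = np.array([0.1, 0.9, 0.2])    # abnormal package: one high-scoring instance
sn = np.array([0.1, 0.2, 0.15])   # normal package: uniformly low scores
print(round(float(mil_ranking_loss(sa, sn)), 6))
```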
(14) joint training and optimization are performed on the target detection network model, the human action recognition model and the anomaly score learning model by accumulating the loss functions with weight balancing
L = λ_od L_od + λ_af L_af + λ_fil L_fil
where λ_od, λ_af, λ_fil are regularization hyper-parameters controlling each loss function.
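The accumulated, weight-balanced objective of step (14) is a plain weighted sum; the parameter names below mirror λ_od, λ_af, λ_fil, and the loss values are placeholders:

```python
def joint_loss(l_od, l_af, l_fil, lam_od=1.0, lam_af=1.0, lam_fil=1.0):
    # Accumulate the three stage losses (detection, action recognition,
    # anomaly scoring) with per-term regularization weights lambda_*.
    return lam_od * l_od + lam_af * l_af + lam_fil * l_fil

print(round(joint_loss(0.5, 0.3, 0.2), 6))
```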
The above description is only a part of the embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (5)

1. A multi-stage human body abnormal motion detection method based on a residual error network is characterized by comprising the following steps:
(1) segmenting a video segment to be detected into m video instances with equal length to obtain segmented video instances;
(2) according to the video example segmented in the step (1), adopting a target detection network model based on a residual error network to obtain the position and the size of a human body target boundary frame appearing in each monitoring video example;
(3) based on the position and the size of the human body target boundary box in the step (2), calculating the category and the confidence coefficient of the human body action in the human body target boundary box in each monitoring video instance by using a human body action recognition model based on a double-flow space-time residual error network;
(4) according to the category and confidence of the human body action in the step (3), adopting an abnormal score learning model to give an abnormal score for each monitoring video instance and performing weighted fusion, so that the abnormal human body action can be classified and identified.
2. The method of claim 1, wherein the method comprises: the step (1) is specifically realized as follows:
(11) if the applicable data set is a training or validation data set, dividing the video segments containing abnormal events into abnormal video instance packages according to the abnormal event labels in the data set, represented as B_a = {a_1, a_2, …, a_m}, and dividing the video segments containing no abnormal events into normal video instance packages, represented as B_n = {n_1, n_2, …, n_m};
(12) If the applicable data set is a test data set, dividing the video clip to be detected into m video instances with equal length;
(13) cropping or scaling the resolution of the video instances obtained in step (11) or (12) to 228 × 228.
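The equal-length segmentation of step (12) can be sketched as follows; the policy of dropping trailing frames that do not fill a complete instance is an assumption not specified in the claim:

```python
def split_into_instances(frames, m):
    """Split a list of video frames into m equal-length instances;
    trailing frames that don't fill a full instance are dropped."""
    seg_len = len(frames) // m
    return [frames[i * seg_len:(i + 1) * seg_len] for i in range(m)]

frames = list(range(100))          # stand-in for decoded video frames
instances = split_into_instances(frames, m=4)
print([len(seg) for seg in instances])  # [25, 25, 25, 25]
```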
3. The method of claim 1, wherein the method comprises: the step (2) is specifically realized as follows:
(21) using Faster R-CNN as the target detection network model, taking a video instance as input; the target detection network model uses a residual network as its backbone network. The main bodies conv2_x, conv3_x, conv4_x and conv5_x in the residual network, which complete the feature extraction part, are composed of groups of residual modules: 3 convolutional layers with a residual connection form one residual module, conv2_x consists of 3 residual modules, conv3_x of 4 residual modules, conv4_x of 23 residual modules and conv5_x of 3 residual modules, and an average pooling layer, a fully-connected layer and a Softmax layer are added after the conv5_x network layer;
(22) combining a feature pyramid network as the candidate region generator: the dimensionality of the upper-layer feature map is reduced by adding convolutional layers with 1 × 1 kernels, while the feature map of the lateral convolutional layer is up-sampled to the same size; the two processed feature maps are superimposed element-wise, and the result is used as the input of the next lateral layer. The feature maps of conv2_x, conv3_x and conv4_x are superimposed element-wise with the feature map of conv5_x after the convolution, dimension reduction and sampling operations, and each level outputs its own prediction according to its feature map;
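The lateral-connection merge described in step (22) can be sketched in NumPy, treating the 1 × 1 convolution as a per-pixel channel mixing and using nearest-neighbour up-sampling; the channel widths (1024-d conv4_x map, 256-d merged maps) are illustrative assumptions:

```python
import numpy as np

def conv1x1(feat, w):
    # feat: (C_in, H, W); w: (C_out, C_in). A 1x1 convolution is a
    # per-pixel channel mixing, i.e. a matrix product over channels.
    c, h, wd = feat.shape
    return (w @ feat.reshape(c, -1)).reshape(w.shape[0], h, wd)

def upsample2x(feat):
    # Nearest-neighbour 2x up-sampling of the coarser pyramid level.
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def fpn_merge(top, lateral, w_lat):
    # Element-wise sum of the up-sampled top-down map and the
    # 1x1-reduced lateral map, as in an FPN lateral connection.
    return upsample2x(top) + conv1x1(lateral, w_lat)

rng = np.random.default_rng(1)
top = rng.standard_normal((256, 7, 7))         # coarser level, already 256-d
lateral = rng.standard_normal((1024, 14, 14))  # e.g. a conv4_x-style map
w_lat = rng.standard_normal((256, 1024)) * 0.01
merged = fpn_merge(top, lateral, w_lat)
print(merged.shape)  # (256, 14, 14)
```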
(23) approximate joint training is used for the candidate region network and Fast R-CNN within Faster R-CNN: in each stochastic gradient descent iteration, the generated candidate regions are kept fixed while Fast R-CNN is trained, and the pre-computed training loss from the candidate region network and the training loss of Fast R-CNN are combined and back-propagated through the shared layers;
(24) the human target bounding box adopts a parameterized calculation, represented as t_x = (x - x_a)/w_a, t_y = (y - y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), t_x^* = (x^* - x_a)/w_a, t_y^* = (y^* - y_a)/h_a, t_w^* = log(w^*/w_a), t_h^* = log(h^*/h_a), wherein x, y, w, h respectively denote the center abscissa, center ordinate, width and height of the predicted bounding box, x_a, y_a, w_a, h_a denote the center abscissa, center ordinate, width and height of the anchor box, and x^*, y^*, w^*, h^* denote the center abscissa, center ordinate, width and height of the ground-truth bounding box;
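The parameterization in step (24) and its inverse can be checked with a short round-trip sketch:

```python
import math

def encode(box, anchor):
    """Parameterize a (cx, cy, w, h) box relative to an anchor, per
    t_x=(x-x_a)/w_a, t_y=(y-y_a)/h_a, t_w=log(w/w_a), t_h=log(h/h_a)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    # Inverse mapping: recover the box from its parameterized coordinates.
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (tx * wa + xa, ty * ha + ya, wa * math.exp(tw), ha * math.exp(th))

anchor = (50.0, 60.0, 32.0, 64.0)
box = (58.0, 52.0, 40.0, 80.0)
t = encode(box, anchor)
# Encoding then decoding recovers the original box.
assert all(abs(a - b) < 1e-9 for a, b in zip(decode(t, anchor), box))
```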
(25) the optimization goal of the target detection network model is to reduce the error of the predicted human target bounding box and class and to minimize the multi-task training objective function
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_c(p_i, p_i^*) + λ (1/N_reg) Σ_i p_i^* L_r(t_i, t_i^*)
wherein L_c(p, u) = -log p_u, L_r(p, u) = R(p - u), R(x) is the smooth L1 loss function, R(x) = 0.5x² if |x| < 1 and |x| - 0.5 otherwise, i denotes the index of the anchor box, p_i denotes the probability that the i-th anchor box contains the target object, p_i^* is the sample label used for learning, t_i is a vector representing the set of parameterized coordinates of the human target bounding box, t_i^* is the set of parameterized coordinates of an anchor box with a positive sample label, L_c is the log classification loss predicting the presence of a human target, and λ is the hyper-parameter in the regularization loss function that balances the weight of the two terms normalized by N_cls and N_reg; the output of the target detection network model is the position and size of the human target bounding box in the video instance.
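A minimal sketch of the two-term objective for a single anchor, under the assumptions that p_i^* simply gates the regression term and that N_cls = N_reg = 1:

```python
import numpy as np

def smooth_l1(x):
    # R(x) = 0.5 x^2 if |x| < 1, else |x| - 0.5, applied element-wise.
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1.0, 0.5 * x * x, np.abs(x) - 0.5)

def multitask_loss(p, p_star, t, t_star, lam=1.0, n_cls=1, n_reg=1):
    """Log classification loss on the anchor's object probability plus
    smooth-L1 regression loss, gated by the positive-sample label p_star."""
    l_cls = -np.log(p if p_star == 1 else 1.0 - p)
    l_reg = p_star * smooth_l1(np.asarray(t) - np.asarray(t_star)).sum()
    return l_cls / n_cls + lam * l_reg / n_reg

loss = multitask_loss(p=0.9, p_star=1,
                      t=[0.1, 0.0, 0.2, -0.1], t_star=[0.0, 0.0, 0.0, 0.0])
print(round(float(loss), 4))
```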
4. The method of claim 1, wherein the method comprises: the step (3) is specifically realized as follows:
(31) using a two-stream spatio-temporal network as the human action recognition model, taking the position and size of the human target bounding box as input; the human action recognition model uses two residual networks, a temporal stream and a spatial stream, as its backbone networks. For a group of input data samples {x_1, x_2, x_3, …, x_m}, with β as the shift parameter and γ as the scale parameter, batch normalization computes the transformed y_i as follows: the mean of each batch is computed along the channel, μ = (1/m) Σ_{i=1}^{m} x_i; the variance of each batch is computed along the channel, σ² = (1/m) Σ_{i=1}^{m} (x_i - μ)²; the normalization x̂_i = (x_i - μ)/√(σ² + ε) is computed, where ε is a small constant for numerical stability; and the scaling and shifting transformation y_i = γ x̂_i + β is applied;
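The four batch-normalization steps of (31) map directly onto array operations; this NumPy sketch normalizes a batch per channel and then applies the γ/β transform:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over a batch x (shape: m x channels):
    per-channel mean and variance, normalize, then scale by gamma
    and shift by beta, as in mu -> sigma^2 -> x_hat -> y."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(3)
x = rng.standard_normal((8, 4)) * 5.0 + 2.0   # batch of 8 samples, 4 channels
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(np.allclose(y.mean(axis=0), 0.0, atol=1e-7))  # True: ~zero mean per channel
```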
(32) adjusting the first-layer filters of the temporal stream, which have three RGB channels, to run on stacks of horizontal and vertical optical flow, each stack containing 10 video frames; cross-layer residual connections Y_l = f(X_l) + g(X_l; W_l) are used in the temporal and spatial stream residual networks respectively, and cross-stream residual connections y_l^S = f(x_l^S) + g(x_l^S + x_l^T; W_l^S) are deployed before the residual modules from the temporal stream to the spatial stream, where f is the ReLU activation function, X_l is the input matrix of the l-th layer, Y_l is the output matrix of the l-th layer, g(·,·) is a non-linear residual mapping function weighted by the convolution filters W_l, x_l^S is the input of the l-th layer of the spatial stream, x_l^T is the input of the l-th layer of the temporal stream, W_l^S is the connection weight matrix of the l-th layer residual neurons of the spatial stream, and y_l^S is the output of the l-th layer of the spatial stream;
(33) using a residual network with frame rate α and channel width C as the spatial stream network to capture the static regions in the video, and a residual network with frame rate 8α and channel width C/8 as the temporal stream network to capture the dynamic regions in the video; the spatial stream network is called the slow channel and the temporal stream network the fast channel. The slow channel uses the temporal stride as a hyper-parameter to control the number of video frames skipped per second: on a data set with a video frame rate of 30, it extracts about 2 frames of features per second, while the fast channel extracts 15 frames of features per second. Data in the fast channel is passed to the slow channel through lateral connections; the network model applies a dimension conversion to the data the fast channel feeds into each lateral connection before merging it into the slow channel, performing a three-dimensional convolution at intervals of a fixed number of frames. On the last layer of the fast and slow channels, the human action recognition network model performs global average pooling to reduce the dimensionality of the data samples, fuses the two results, and passes them into a fully-connected classification layer, and a confidence score for each action is given through Softmax;
(34) the optimization goal of the human action recognition model is to reduce the error of human action class prediction and minimize the cross-entropy loss function of the Softmax classifier
L_cls = -Σ_i y_i log( e^{x_i} / Σ_j e^{x_j} )
wherein x_i is the action confidence score for the i-th human action class in the output vector of the Softmax classifier, and y_i is the i-th element of the corresponding ground-truth action label vector.
5. The method of claim 1, wherein the method comprises: the step (4) is specifically realized as follows:
(41) the method uses a multilayer perceptron as the anomaly score learning model, takes the class and confidence score of the human action as input, and uses a multi-instance learning method on instance packages B_a = {a_1, a_2, …, a_m} and B_n = {n_1, n_2, …, n_m} to train and predict the anomaly score of each video instance in a package, wherein B_a is an abnormal video instance package, B_n is a normal video instance package, a_i and n_i are the corresponding abnormal and normal video instances, and h is the multilayer perceptron model, which consists of three fully-connected layers, with 512 neurons in the input layer, 32 neurons in the hidden layer and 1 neuron in the final output layer, predicting the anomaly score of a given video instance with an output value between 0 and 1;
(42) the optimization goal of the anomaly score learning model is to maximize the distance between the two classes of samples, the abnormal video instance package and the normal video instance package, to facilitate classification by the multilayer perceptron model, using the hinge-based ranking loss
L_rank = max(0, 1 - max_{i∈B_a} h(a_i) + max_{i∈B_n} h(n_i))
considering that in reality the probability and frequency of normal events are far greater than those of abnormal events, a sparsity constraint is added to the loss function
L_mil = L_rank + ε Σ_{i∈B_a} h(a_i)
so that abnormal human actions occurring less frequently than other normal events receive higher anomaly scores, where ε is the hyper-parameter adjusting the relative weight of the hinge ranking loss and the sparsity constraint;
(43) performing joint training and optimization on the target detection network model, the human action recognition model and the anomaly score learning model by accumulating the loss functions with weight balancing
L = λ_od L_od + λ_af L_af + λ_fil L_fil
where λ_od, λ_af, λ_fil are regularization hyper-parameters controlling each loss function.
CN202111553555.5A 2021-12-17 2021-12-17 Multi-stage human body abnormal action detection method based on residual error network Pending CN114202803A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111553555.5A CN114202803A (en) 2021-12-17 2021-12-17 Multi-stage human body abnormal action detection method based on residual error network


Publications (1)

Publication Number Publication Date
CN114202803A true CN114202803A (en) 2022-03-18

Family

ID=80654998



Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419528A (en) * 2022-04-01 2022-04-29 浙江口碑网络技术有限公司 Anomaly identification method and device, computer equipment and computer readable storage medium
CN115147921A (en) * 2022-06-08 2022-10-04 南京信息技术研究院 Key area target abnormal behavior detection and positioning method based on multi-domain information fusion
CN115147921B (en) * 2022-06-08 2024-04-30 南京信息技术研究院 Multi-domain information fusion-based key region target abnormal behavior detection and positioning method
CN116564460A (en) * 2023-07-06 2023-08-08 四川省医学科学院·四川省人民医院 Health behavior monitoring method and system for leukemia child patient
CN116564460B (en) * 2023-07-06 2023-09-12 四川省医学科学院·四川省人民医院 Health behavior monitoring method and system for leukemia child patient


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination