CN114202803A - Multi-stage human body abnormal action detection method based on residual error network - Google Patents

Info

Publication number
CN114202803A
CN114202803A (application CN202111553555.5A)
Authority
CN
China
Prior art keywords
network
video
human body
abnormal
layer
Prior art date
Legal status
Pending
Application number
CN202111553555.5A
Other languages
Chinese (zh)
Inventor
张凤全
程健
周锋
王桂玲
Current Assignee
North China University of Technology
Original Assignee
North China University of Technology
Priority date
Filing date
Publication date
Application filed by North China University of Technology
Priority to CN202111553555.5A
Publication of CN114202803A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The invention relates to a multi-stage human body abnormal action detection method based on a residual network, comprising the following steps: segmenting the video clip to be detected into video instances of equal length; obtaining the position and size of the human body target bounding box in each surveillance video instance with a target detection network model; calculating the category and confidence of the human action inside the bounding box in each surveillance video instance with an action recognition network model, based on the bounding box information; and assigning an anomaly score to each surveillance video instance with an anomaly-score learning model and performing weighted fusion. The invention can quickly obtain abnormal action information in surveillance video, and designs the target detection network model required for detecting the human bounding box, the action recognition network model for analyzing the human action, and the anomaly-score learning model for predicting anomaly scores. The method detects abnormal human actions in common surveillance video scenes, is simple, has a low false alarm rate, and has practical value.

Description

Multi-stage human body abnormal action detection method based on residual error network
Technical Field
The invention relates to a multi-stage human body abnormal action detection method based on a residual network, and in particular to a human action recognition and classification method based on a residual network. It draws on techniques from the fields of machine learning, deep learning, computer vision, and anomaly detection, and is mainly applied to abnormal action detection in real surveillance video scenes, particularly in fields such as smart home and intelligent security.
Background
With the rapid development of society, public safety awareness is rising year by year, and frequent public safety incidents draw increasing public attention. Security engineering built on video recording and monitoring systems has developed rapidly: large numbers of surveillance video products are deployed on major roads, at key buildings, and in crowded public places such as campuses in every city, while home security monitoring has likewise become an indispensable part of modern life. Whether for intelligent life assistance in the home or intelligent safety monitoring in public places, much key information can be extracted from surveillance video. Finding that information manually, however, costs workers a great deal of time and energy; faced with massive volumes of monitoring data, the efficiency and accuracy of manually analyzing abnormal actions in video cannot reach an acceptable level. An intelligent video monitoring system that lets a computer detect abnormal actions automatically and quickly extract and forward the key information to staff can therefore improve the detection efficiency and accuracy of the video monitoring system, achieve real-time abnormal action detection, effectively safeguard public safety and normal activities, and reduce the incidence of public safety problems.
Traditional video monitoring systems are limited to recording; complex tasks such as real-time detection and analysis of abnormal events still require manual operation. Handling abnormal behavior events purely by manual methods can no longer meet the requirements of current monitoring systems, and intelligent video monitoring technology has become one of the best ways to handle abnormal behavior events quickly and in real time.
In the abnormal action recognition task, abnormal events occur far less frequently than normal events, so sufficient fully-labeled training samples are lacking; moreover, abnormal actions are so varied that all categories of abnormal events cannot be enumerated in advance.
The abnormal action recognition task can therefore be learned in a weakly supervised manner: the model is trained with coarse-grained, video-level labels rather than fine-grained, frame-level labels. The present method adopts a multi-stage human body abnormal action detection approach based on a residual network, using several network models to compute, respectively, the position and size of the human target bounding box, the category and confidence of the human action, and the anomaly score of each surveillance video instance, and then performs weighted fusion.
Disclosure of Invention
The invention overcomes some shortcomings of the prior art by providing a multi-stage human body abnormal action detection method based on a residual network. Using a target detection network model, a human action recognition model, and an anomaly-score learning model, it can quickly obtain abnormal action information in surveillance video. The invention designs the target detection network model required for detecting the human bounding box, the action recognition network model for analyzing human actions, and the anomaly-score learning model for predicting anomaly scores; it classifies and recognizes abnormal human actions such as falls and fights in surveillance video, with good real-time detection speed and accuracy.
The technical solution of the invention is as follows. A multi-stage human body abnormal action detection method based on a residual network, suitable for classifying and recognizing abnormal human actions (such as falls and fights) in surveillance video, comprises the following steps:
(1) segmenting a video segment to be detected into m video instances with equal length;
(2) according to the video example segmented in the step (1), adopting a target detection network model based on a residual error network to obtain the position and the size of a human body target boundary frame appearing in each monitoring video example;
(3) based on the position and the size of the human body target boundary box in the step (2), calculating the category and the confidence coefficient of the human body action in the human body target boundary box in each monitoring video instance by using a human body action recognition model based on a double-flow space-time residual error network;
(4) according to the category and confidence of the human action in step (3), assign an anomaly score to each surveillance video instance with an anomaly-score learning model and perform weighted fusion, so that abnormal human actions such as falls and fights can be classified and recognized.
The step (1) is specifically realized as follows:
(11) if the applicable data set is a training and verification data set, divide the video segments containing abnormal events into abnormal video instance packets, denoted B_a = {a_1, a_2, …, a_m}, according to the abnormal-event labels in the data set, and divide the video segments containing no abnormal event into normal video instance packets, denoted B_n = {n_1, n_2, …, n_m};
(12) If the applicable data set is a test data set, dividing the video clip to be detected into m video instances with equal length;
(13) crop or scale the resolution of the video instances obtained in step (11) or (12) to 228 × 228.
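The segmentation and resizing of steps (12) and (13) can be sketched as follows. This is an illustrative sketch, not the patented implementation: the drop-remainder policy for frames that do not divide evenly, and the nearest-neighbour resizing, are assumptions not specified in the text.

```python
import numpy as np

def split_into_instances(num_frames: int, m: int) -> list[range]:
    """Split a video of num_frames frames into m equal-length instances.
    Leftover frames after integer division are dropped from the tail,
    so every instance has exactly num_frames // m frames (assumption)."""
    length = num_frames // m
    return [range(i * length, (i + 1) * length) for i in range(m)]

def resize_frame(frame: np.ndarray, size: int = 228) -> np.ndarray:
    """Nearest-neighbour resize of an H x W x C frame to size x size."""
    h, w = frame.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return frame[rows][:, cols]
```

A 300-frame clip split into m = 32 instances yields 32 instances of 9 frames each.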
The step (2) is specifically realized as follows:
(21) use Faster R-CNN as the target detection network model, with the video instances obtained in step (13) as input and a residual network as the backbone. The stages conv2_x, conv3_x, conv4_x and conv5_x, which form the main body of the feature-extraction part of the residual network, each consist of several residual modules; 3 convolutional layers plus a residual connection form one residual module, with conv2_x containing 3 residual modules, conv3_x containing 4, conv4_x containing 23, and conv5_x containing 3. An average pooling layer, a fully-connected layer and a Softmax layer are appended after the conv5_x stage;
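The stage composition given in step (21) is that of ResNet-101, which can be checked by counting weighted layers: the initial convolution, three convolutions per residual module, and the final fully-connected layer.

```python
# Sanity check on the residual-network composition of step (21):
# 3, 4, 23 and 3 residual modules of 3 convolutional layers each,
# plus the initial conv1 layer and the final fully-connected layer,
# give the 101 weighted layers of ResNet-101.
blocks_per_stage = {"conv2_x": 3, "conv3_x": 4, "conv4_x": 23, "conv5_x": 3}
convs_per_block = 3
total = 1 + convs_per_block * sum(blocks_per_stage.values()) + 1
print(total)  # 101
```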
(22) combine a feature pyramid network as the candidate-region generator: reduce the dimensionality of the upper-layer feature map with a convolutional layer of kernel size 1 × 1, up-sample the feature map of the lateral convolutional layer to the same size, add the two processed feature maps element by element, and use the result as the input of the next lateral layer; the feature maps of conv2_x, conv3_x and conv4_x are added element by element to the conv5_x feature map after convolution, dimensionality reduction and sampling, and each pyramid level outputs its own predictions from its feature map;
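One top-down merge step of the feature pyramid described in step (22) can be sketched with NumPy. The 1 × 1 convolution is modeled as a per-pixel matrix multiply and the up-sampling as nearest-neighbour repetition; the tensor shapes and random weights are illustrative assumptions.

```python
import numpy as np

def lateral_merge(top_down: np.ndarray, lateral: np.ndarray,
                  w_1x1: np.ndarray) -> np.ndarray:
    """One top-down merge step of a feature pyramid.

    top_down : (C, H, W)      feature map from the coarser pyramid level
    lateral  : (C_in, 2H, 2W) backbone feature map of the finer level
    w_1x1    : (C, C_in)      weights of a 1x1 conv reducing channels to C
    """
    # 1x1 convolution == per-pixel channel mixing
    reduced = np.einsum("oc,chw->ohw", w_1x1, lateral)
    # 2x nearest-neighbour up-sampling of the coarser map
    upsampled = top_down.repeat(2, axis=1).repeat(2, axis=2)
    # element-by-element superposition
    return reduced + upsampled
```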
(23) use approximate joint training for the region proposal network and the Fast R-CNN detector within Faster R-CNN: in each stochastic gradient descent iteration, the generated candidate regions are held fixed while Fast R-CNN is trained, and the training loss of the region proposal network and that of Fast R-CNN are combined in the shared layers through backpropagation;
(24) the human target bounding box adopts a parameterized representation:
t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a),
t*_x = (x* − x_a)/w_a, t*_y = (y* − y_a)/h_a, t*_w = log(w*/w_a), t*_h = log(h*/h_a),
where x, y, w, h denote the center coordinates (abscissa and ordinate) and the width and height of the predicted bounding box, x_a, y_a, w_a, h_a those of the anchor box, and x*, y*, w*, h* those of the ground-truth bounding box;
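The bounding-box parameterization of step (24) and its inverse can be written out directly; the round trip encode-then-decode recovers the original box.

```python
import numpy as np

def encode(box, anchor):
    """Parameterize a box (x, y, w, h: center coordinates, width, height)
    relative to an anchor box, as in step (24)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    """Invert the parameterization to recover the box from (tx, ty, tw, th)."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = t
    return np.array([tx * wa + xa, ty * ha + ya,
                     wa * np.exp(tw), ha * np.exp(th)])
```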
(25) the optimization goal of the target detection network model is to reduce the error of the predicted human target's bounding box and category by minimizing the multi-task training objective
L({p_i}, {t_i}) = (1/N_c) Σ_i L_c(p_i, p*_i) + λ (1/N_r) Σ_i p*_i L_r(t_i, t*_i),
where L_c(p, u) = −log p_u, L_r(t, t*) = R(t − t*), and
R(x) = 0.5 x² if |x| < 1, |x| − 0.5 otherwise
is the smooth L1 loss function; i denotes the anchor-box index, p_i the probability that the i-th anchor box contains the target object, p*_i the sample label used for learning, t_i the vector of parameterized coordinates of the human target bounding box, t*_i the parameterized coordinates of an anchor box with a positive sample label, L_c the log classification loss predicting the presence of a human target, and λ the hyper-parameter balancing the weight of the classification and regression terms. The output of the target detection network model is the position and size of the human target bounding box in the video instance.
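The smooth L1 loss R(x) and the two-term objective of step (25) can be sketched numerically. The binary cross-entropy form of the classification term and the mean normalization are assumptions made for this illustration.

```python
import numpy as np

def smooth_l1(x: np.ndarray) -> np.ndarray:
    """R(x): the smooth L1 loss used for the regression term."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def detection_loss(p, p_star, t, t_star, lam=1.0):
    """Multi-task objective: log classification loss on anchor scores plus
    smooth-L1 regression loss on the parameterized coordinates, the latter
    counted only for positive anchors (p_star == 1). Normalization constants
    are folded into the means here for brevity (assumption)."""
    cls = -np.mean(p_star * np.log(p) + (1 - p_star) * np.log(1 - p))
    reg = np.mean(p_star[:, None] * smooth_l1(t - t_star))
    return cls + lam * reg
```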
The step (3) is specifically realized as follows:
(31) use a dual-stream spatio-temporal network as the human action recognition model, with the position and size of the human target bounding box as input; the model uses two residual networks, a temporal stream and a spatial stream, as backbone. For a batch of input data samples {x_1, x_2, x_3, …, x_m}, with β and γ as shift and scale parameters, batch normalization computes the transformed y_i as follows: the per-channel batch mean
μ = (1/m) Σ_{i=1}^{m} x_i,
the per-channel batch variance
σ² = (1/m) Σ_{i=1}^{m} (x_i − μ)²,
the normalized value
x̂_i = (x_i − μ)/√(σ² + ε),
and the scale-and-shift transformation
y_i = γ x̂_i + β;
(32) the first-layer filters of the temporal stream, originally three RGB channels, are adjusted to run on stacks of horizontal and vertical optical flow, each stack containing 10 video frames; cross-layer residual connections
Y_l = f(X_l) + g(X_l, W_l)
are used in the temporal-stream and spatial-stream residual networks respectively, and a cross-stream residual connection
x̂^s_l = x^s_l + x^t_l
is deployed from the temporal stream to the spatial stream before its residual module, where f is the ReLU activation function, X_l is the input matrix of layer l, Y_l is the output matrix of layer l, g(·,·) is a non-linear residual mapping function weighted by the convolution filters W_l, x^s_l is the input of the layer-l spatial stream, x^t_l is the input of the layer-l temporal stream, W^s_l is the connection weight matrix of the layer-l residual neurons of the spatial stream, and y^s_l is the output of the layer-l spatial stream;
(33) a residual network with frame rate α and channel size C is used as the spatial stream network to determine static regions in the video, and a residual network with frame rate 8α and channel size C/8 is used as the temporal stream network to determine dynamic regions. The spatial stream network is called the slow channel and the temporal stream network the fast channel. The slow channel uses the temporal stride as a hyper-parameter to control the number of video frames crossed per second: on a data set with a video frame rate of 30, about 2 frames of features per second are extracted, while the fast channel extracts 15 frames of features per second. Data in the fast channel is passed to the slow channel through lateral connections; the network model converts the dimensionality of the data fed into each lateral connection by the fast channel before merging it into the slow channel, performing a three-dimensional convolution at fixed frame intervals. On the last layer of the fast and slow channels, the human action recognition network model applies global average pooling to reduce the dimensionality of the data samples, fuses the results, feeds them into a fully-connected classification layer, and outputs a confidence score for each action through Softmax;
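The frame-sampling ratio of the two channels in step (33) can be sketched with a fixed temporal stride (the "temporal stride" hyper-parameter); the stride values below are chosen to reproduce the 2-versus-15 frames-per-second figures in the text.

```python
def sample_pathway_frames(num_frames: int, stride: int) -> list[int]:
    """Indices of the frames a pathway reads from a clip, using a fixed
    temporal stride (the hyper-parameter controlling frames kept per second)."""
    return list(range(0, num_frames, stride))

# For a 1-second clip at 30 fps: a slow-channel stride of 15 keeps 2 frames,
# a fast-channel stride of 2 keeps 15 frames, matching the 8x ratio described.
slow = sample_pathway_frames(30, 15)
fast = sample_pathway_frames(30, 2)
print(len(slow), len(fast))  # 2 15
```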
(34) the optimization goal of the human action recognition model is to reduce the error of human action category prediction by minimizing the cross-entropy loss of the Softmax classifier,
L = −Σ_i y_i log( e^{x_i} / Σ_j e^{x_j} ),
where x_i is the action confidence score of the i-th human action category in the output vector of the Softmax classifier, and y_i is the i-th element of the corresponding ground-truth action label vector.
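The Softmax cross-entropy loss of step (34) can be computed in a numerically stable form by subtracting the maximum score before exponentiating.

```python
import numpy as np

def softmax_cross_entropy(x: np.ndarray, y: np.ndarray) -> float:
    """Cross-entropy between a one-hot label vector y and softmax(x),
    where x is the classifier's raw score vector."""
    z = x - x.max()                      # numerical stability
    log_p = z - np.log(np.exp(z).sum())  # log-softmax
    return float(-(y * log_p).sum())
```

With scores equal to the logs of probabilities (0.7, 0.2, 0.1) and the true class being the first, the loss equals −log 0.7.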
The step (4) is specifically realized as follows:
(41) the method uses a multi-layer perceptron as the anomaly-score learning model, with the category and confidence score of the human action as input, and uses multi-instance learning to train and predict the anomaly score f(V) of each video instance in the instance packets, where B_a is an abnormal video instance packet, B_n is a normal video instance packet, V_a and V_n are the corresponding abnormal and normal video instances, and f(V_a), f(V_n) are their predicted anomaly scores. h is the multi-layer perceptron model, consisting of three fully-connected layers: an input layer of 512 neurons, a hidden layer of 32 neurons, and a final output layer of 1 neuron that predicts the anomaly score of a given video instance, with an output value ranging between 0 and 1;
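A forward pass of the 512-32-1 perceptron described in step (41) can be sketched as follows. The ReLU hidden activation and sigmoid output are assumptions (the text specifies only the layer sizes and the 0-to-1 output range), and the weights here are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights for the 512 -> 32 -> 1 architecture (randomly initialized here;
# a real model would be trained with the multi-instance ranking loss).
W1, b1 = rng.normal(0, 0.05, (512, 32)), np.zeros(32)
W2, b2 = rng.normal(0, 0.05, (32, 1)), np.zeros(1)

def anomaly_score(features: np.ndarray) -> float:
    """Three-layer perceptron forward pass: ReLU hidden layer, then a
    sigmoid squashing the final score into [0, 1]."""
    h = np.maximum(features @ W1 + b1, 0.0)
    return float(sigmoid(h @ W2 + b2)[0])

score = anomaly_score(rng.normal(size=512))
```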
(42) the optimization goal of the anomaly-score learning model is to maximize the distance between the two classes of samples, the abnormal video instance packet and the normal video instance packet, so that the multi-layer perceptron model can separate them, using a hinge-based ranking loss
L_h = max(0, 1 − max_{V_a ∈ B_a} f(V_a) + max_{V_n ∈ B_n} f(V_n)).
Considering that in reality the probability and frequency of normal events far exceed those of abnormal events, a sparsity constraint is added to the loss function,
L_af = L_h + ε Σ_{V_a ∈ B_a} f(V_a),
so that abnormal human actions that occur less frequently than normal events, such as falls, fights and violence, receive higher anomaly scores; ε is a hyper-parameter adjusting the relative weight of the hinge ranking loss and the sparsity constraint;
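The packet-level ranking objective of step (42) can be sketched directly: push the highest score in the abnormal packet above the highest score in the normal packet, with a sparsity term over the abnormal packet. The margin of 1 and the example value of the hyper-parameter ε are illustrative assumptions.

```python
import numpy as np

def mil_ranking_loss(scores_abnormal, scores_normal, eps=8e-5):
    """Hinge-based ranking loss over instance packets: the maximum score in
    the abnormal packet should exceed the maximum score in the normal packet
    by a margin of 1, plus a sparsity term (weighted by the hyper-parameter
    eps) encouraging only a few high scores in the abnormal packet."""
    hinge = max(0.0, 1.0 - np.max(scores_abnormal) + np.max(scores_normal))
    sparsity = eps * np.sum(scores_abnormal)
    return hinge + sparsity
```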
(43) the target detection network model, the human action recognition model and the anomaly-score learning model are jointly trained and optimized by accumulating the loss functions and performing weight balancing,
L = λ_od L_od + λ_af L_af + λ_fil L_fil,
where λ_od, λ_af and λ_fil are regularization hyper-parameters controlling each loss term.
Compared with the prior art, the invention has the advantages that:
(1) the model extracts rich human target bounding-box features and human action features, can effectively learn abnormal human action characteristics from training samples, and, beyond prior-art weakly supervised multi-instance learning models that only detect whether an abnormal event occurs, can also give the category and confidence score of the abnormal human action;
(2) through the segmented video instances and the features extracted by the model, abnormal patterns can be learned quickly and the spatio-temporal position of abnormal actions in the surveillance video can be detected; beyond prior-art weakly supervised multi-instance learning models that only detect the temporal extent of abnormal actions, the model can also quickly localize their spatial position;
(3) the model can be applied to unclear surveillance video with a resolution below 228 × 228. In experiments on a surveillance video data set, the AUC (area under the ROC curve) for abnormal actions, used as the benchmark metric of classification performance, reaches 77.43%; compared with prior-art models such as a time-series regularity learning model (50.60%), a high-frame-rate abnormal event detection model (65.51%), and a weakly supervised multi-instance learning model (75.41%), the model achieves better classification performance.
Drawings
FIG. 1 is a flow chart of a method implementation of the present invention;
FIG. 2 is a schematic diagram of an internal architecture of a dual-stream residual error network;
FIG. 3 is a schematic diagram of the fast R-CNN architecture;
FIG. 4 is a schematic diagram of a composite network model for target detection;
FIG. 5 is a schematic diagram of a dual-flow network model architecture;
FIG. 6 is a schematic diagram of video example partitioning;
FIG. 7 is a diagram of a multi-level perceptron architecture.
Detailed Description
The present invention will be described in detail with reference to the following embodiments.
As shown in fig. 1, the method of the present invention specifically includes the following steps:
(1) segmenting a video segment to be detected into m video instances with equal length;
(2) according to the video example segmented in the step (1), adopting a target detection network model based on a residual error network to obtain the position and the size of a human body target boundary frame appearing in each monitoring video example;
(3) based on the position and the size of the human body target boundary box in the step (2), calculating the category and the confidence coefficient of the human body action in the human body target boundary box in each monitoring video instance by using a human body action recognition model based on a double-flow space-time residual error network;
(4) according to the category and confidence of the human action in step (3), assign an anomaly score to each surveillance video instance with an anomaly-score learning model and perform weighted fusion, so that abnormal human actions such as falls and fights can be classified and recognized.
The basic implementation process is as follows:
(1) using the UCF-Crime data set, divide the video segments containing abnormal events into abnormal video instance packets and the video segments containing no abnormal event into normal video instance packets, according to the abnormal-event labels of the video segments in the input data set, as shown in fig. 6;
(2) crop or scale the video instance resolution to 228 × 228 using the FFmpeg audio/video library;
(3) use Faster R-CNN as the target detection network model, with a 101-layer residual network as the backbone. The stages conv2_x, conv3_x, conv4_x and conv5_x, which form the main body of the feature-extraction part of the residual network, each consist of several residual modules; 3 convolutional layers plus a residual connection form one residual module, with conv2_x containing 3 residual modules, conv3_x containing 4, conv4_x containing 23, and conv5_x containing 3. An average pooling layer, a fully-connected layer and a Softmax layer are appended after the conv5_x stage; the architecture is shown in fig. 3;
(4) combine a feature pyramid network as the candidate-region generator; the structure of Faster R-CNN combined with the feature pyramid network is shown in fig. 4. The feature pyramid network adds sampling operations layer by layer and is connected laterally with the network layers of Faster R-CNN. The dimensionality of the upper-layer feature map is reduced with a convolutional layer of kernel size 1 × 1, the feature map of the lateral convolutional layer in the feature pyramid network is up-sampled to the same size, the two processed feature maps are added element by element, and the result is used as the input of the next lateral layer; the feature maps of conv2_x, conv3_x and conv4_x are added element by element to the conv5_x feature map after convolution, dimensionality reduction and sampling, and the feature pyramid network outputs each layer's own predictions from its feature map;
(5) use approximate joint training for the region proposal network and the Fast R-CNN detector within Faster R-CNN: in each stochastic gradient descent iteration, the generated candidate regions are held fixed while Fast R-CNN is trained, and the training loss of the region proposal network and that of Fast R-CNN are combined in the shared layers through backpropagation;
(6) the human target bounding box adopts a parameterized representation:
t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a),
t*_x = (x* − x_a)/w_a, t*_y = (y* − y_a)/h_a, t*_w = log(w*/w_a), t*_h = log(h*/h_a),
where x, y, w, h denote the center coordinates (abscissa and ordinate) and the width and height of the predicted bounding box, x_a, y_a, w_a, h_a those of the anchor box, and x*, y*, w*, h* those of the ground-truth bounding box;
(7) the optimization goal of the target detection network model is to reduce the error of the predicted human target's bounding box and category by minimizing the multi-task training objective
L({p_i}, {t_i}) = (1/N_c) Σ_i L_c(p_i, p*_i) + λ (1/N_r) Σ_i p*_i L_r(t_i, t*_i),
where L_c(p, u) = −log p_u, L_r(t, t*) = R(t − t*), and
R(x) = 0.5 x² if |x| < 1, |x| − 0.5 otherwise
is the smooth L1 loss function; i denotes the anchor-box index, p_i the probability that the i-th anchor box contains the target object, p*_i the sample label used for learning, t_i the vector of parameterized coordinates of the human target bounding box, t*_i the parameterized coordinates of an anchor box with a positive sample label, L_c the log classification loss predicting the presence of a human target, and λ the hyper-parameter balancing the weight of the classification and regression terms. The output of the target detection network model is the position and size of the human target bounding box in the video instance;
(8) use a dual-stream spatio-temporal network as the human action recognition model, with the position and size of the human target bounding box as input; the model uses two residual networks, a temporal stream and a spatial stream, as backbone, whose internal architecture is shown in fig. 2. The input-layer stride of the temporal stream network is 2 and the channel number of its conv1 convolutional layer is 8; the input-layer stride of the spatial stream network is 16 and its channel number is 64; both streams add a sampling layer after the conv1 convolutional layer, and the feature-map size of the conv2, conv3, conv4 and conv5 convolutional layers of the temporal stream network is 1/8 of the corresponding convolutional layers of the spatial stream network. In the dual-stream spatio-temporal network, for a batch of input data samples {x_1, x_2, x_3, …, x_m}, with β and γ as shift and scale parameters, batch normalization computes the transformed y_i as follows: the per-channel batch mean
μ = (1/m) Σ_{i=1}^{m} x_i,
the per-channel batch variance
σ² = (1/m) Σ_{i=1}^{m} (x_i − μ)²,
the normalized value
x̂_i = (x_i − μ)/√(σ² + ε),
and the scale-and-shift transformation
y_i = γ x̂_i + β;
(9) the first-layer filters of the temporal stream, originally three RGB channels, are adjusted to run on stacks of horizontal and vertical optical flow, each stack containing 10 video frames; cross-layer residual connections
Y_l = f(X_l) + g(X_l, W_l)
are used in the temporal-stream and spatial-stream residual networks respectively, and a cross-stream residual connection
x̂^s_l = x^s_l + x^t_l
is deployed from the temporal stream to the spatial stream before its residual module, where f is the ReLU activation function, X_l is the input matrix of layer l, Y_l is the output matrix of layer l, g(·,·) is a non-linear residual mapping function weighted by the convolution filters W_l, x^s_l is the input of the layer-l spatial stream, x^t_l is the input of the layer-l temporal stream, W^s_l is the connection weight matrix of the layer-l residual neurons of the spatial stream, and y^s_l is the output of the layer-l spatial stream;
(10) a residual network with frame rate α and channel size C is used as the spatial stream network to determine static regions in the video, and a residual network with frame rate 8α and channel size C/8 is used as the temporal stream network to determine dynamic regions. The spatial stream network is called the slow channel and the temporal stream network the fast channel. The slow channel uses the temporal stride as a hyper-parameter to control the number of video frames crossed per second: on a data set with a video frame rate of 30, about 2 frames of features per second are extracted, while the fast channel extracts 15 frames of features per second. Data in the fast channel is passed to the slow channel through lateral connections; the network model converts the dimensionality of the data fed into each lateral connection by the fast channel before merging it into the slow channel, performing a three-dimensional convolution at fixed frame intervals. On the last layer of the fast and slow channels, the human action recognition network model applies global average pooling to reduce the dimensionality of the data samples, fuses the results, feeds them into a fully-connected classification layer, and outputs a confidence score for each action through Softmax; the dual-stream spatio-temporal residual network architecture is shown in fig. 5;
(11) in the learning process of the human action recognition model, the goal is to reduce the error of human action class prediction and minimize the cross-entropy loss function of the Softmax classifier
L_cls = -Σ_i y_i log( e^{x_i} / Σ_j e^{x_j} )
wherein x_i is the action confidence score for the i-th human action class in the output vector of the Softmax classifier, and y_i is the i-th element of the corresponding ground-truth action label vector;
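The Softmax cross-entropy above, sketched in NumPy; the max-subtraction is a standard numerical-stability detail, not something stated in the patent:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())       # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(scores, one_hot):
    # L = -sum_i y_i * log(softmax(x)_i), y the one-hot ground-truth vector
    p = softmax(scores)
    return -float(np.sum(one_hot * np.log(p + 1e-12)))

scores = np.array([2.0, 0.5, -1.0])   # per-class action confidence scores x_i
label = np.array([1.0, 0.0, 0.0])     # ground-truth label vector y
loss = cross_entropy(scores, label)
print(round(loss, 4))
```

Raising the score of the correct class lowers the loss, which is what the minimization objective drives.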
(12) the method uses a multilayer perceptron as the anomaly score learning model, takes the class and confidence score of the human action as input, and uses a multi-instance learning method on instance packages B_a = {a_1, a_2, …, a_m} and B_n = {n_1, n_2, …, n_m} to train and predict the anomaly score of each video instance in a package, wherein B_a is an abnormal video instance package, B_n is a normal video instance package, a_i and n_i are the corresponding abnormal and normal video instances, and h is the multilayer perceptron model, whose framework is shown in FIG. 7: it consists of three fully-connected layers, with 512 neurons in the input layer, 32 neurons in the hidden layer and 1 neuron in the final output layer, predicting the anomaly score of a given video instance with an output value ranging from 0 to 1;
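A forward pass through the described 512-32-1 perceptron h(·) can be sketched as follows; the random initialization and the sigmoid squashing that keeps scores in (0, 1) are assumptions, since the patent only fixes the layer sizes and the output range:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ScoreMLP:
    """Untrained sketch of the 512-32-1 perceptron h(.) described above."""
    def __init__(self, d_in=512, d_hidden=32):
        self.w1 = rng.standard_normal((d_in, d_hidden)) * 0.05
        self.b1 = np.zeros(d_hidden)
        self.w2 = rng.standard_normal((d_hidden, 1)) * 0.05
        self.b2 = np.zeros(1)

    def __call__(self, x):
        h = np.maximum(x @ self.w1 + self.b1, 0.0)   # hidden layer, ReLU
        return sigmoid(h @ self.w2 + self.b2)        # anomaly score in (0, 1)

h = ScoreMLP()
instance_feature = rng.standard_normal(512)  # stand-in for action class + confidence features
score = float(h(instance_feature)[0])
print(0.0 < score < 1.0)  # True
```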
(13) the optimization goal of the anomaly score learning model is to maximize the distance between the two classes of samples, the abnormal video instance package and the normal video instance package, to facilitate classification by the multilayer perceptron model, using the hinge-based ranking loss
L_rank = max(0, 1 - max_{i∈B_a} h(a_i) + max_{i∈B_n} h(n_i))
considering that in reality the probability and frequency of normal events are far greater than those of abnormal events, a sparsity constraint is added to the loss function
L_mil = L_rank + ε Σ_{i∈B_a} h(a_i)
so that abnormal human actions such as falls and blows, which occur less frequently than other normal events, receive higher anomaly scores, where ε is the hyper-parameter adjusting the relative weight of the hinge ranking loss and the sparsity constraint;
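A sketch of the ranking loss with sparsity term; the max-over-instances form of the package distance is an assumption borrowed from common multi-instance anomaly ranking formulations, and the ε value below is arbitrary:

```python
import numpy as np

def mil_ranking_loss(scores_abnormal, scores_normal, eps=8e-5):
    """Hinge ranking loss between instance packages plus a sparsity term.

    scores_*: anomaly scores h(a_i), h(n_i) of the instances in the
    abnormal and normal packages; eps weights the sparsity constraint.
    """
    hinge = max(0.0, 1.0 - np.max(scores_abnormal) + np.max(scores_normal))
    # Most instances even of an abnormal bag should score near 0.
    sparsity = np.sum(scores_abnormal)
    return hinge + eps * sparsity

sa = np.array([0.1, 0.9, 0.2])    # abnormal package: one high-scoring instance
sn = np.array([0.1, 0.2, 0.15])   # normal package: uniformly low scores
print(round(float(mil_ranking_loss(sa, sn)), 6))
```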
(14) joint training and optimization are performed on the target detection network model, the human action recognition model and the anomaly score learning model by accumulating the loss functions with weight balancing
L = λ_od L_od + λ_af L_af + λ_fil L_fil
where λ_od, λ_af, λ_fil are regularization hyper-parameters controlling each loss function.
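The accumulated, weight-balanced objective of step (14) is a plain weighted sum; the parameter names below mirror λ_od, λ_af, λ_fil, and the loss values are placeholders:

```python
def joint_loss(l_od, l_af, l_fil, lam_od=1.0, lam_af=1.0, lam_fil=1.0):
    # Accumulate the three stage losses (detection, action recognition,
    # anomaly scoring) with per-term regularization weights lambda_*.
    return lam_od * l_od + lam_af * l_af + lam_fil * l_fil

print(round(joint_loss(0.5, 0.3, 0.2), 6))
```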
The above description is only a part of the embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (5)

1. A multi-stage human body abnormal motion detection method based on a residual error network is characterized by comprising the following steps:
(1) segmenting a video segment to be detected into m video instances with equal length to obtain segmented video instances;
(2) according to the video example segmented in the step (1), adopting a target detection network model based on a residual error network to obtain the position and the size of a human body target boundary frame appearing in each monitoring video example;
(3) based on the position and the size of the human body target boundary box in the step (2), calculating the category and the confidence coefficient of the human body action in the human body target boundary box in each monitoring video instance by using a human body action recognition model based on a double-flow space-time residual error network;
(4) according to the category and confidence of the human body action in the step (3), adopting an abnormal score learning model to give an abnormal score for each monitoring video instance and performing weighted fusion, so that the abnormal human body action can be classified and identified.
2. The method of claim 1, wherein the method comprises: the step (1) is specifically realized as follows:
(11) if the applicable data set is a training or validation data set, dividing the video segments containing abnormal events into abnormal video instance packages according to the abnormal event labels in the data set, represented as B_a = {a_1, a_2, …, a_m}, and dividing the video segments containing no abnormal events into normal video instance packages, represented as B_n = {n_1, n_2, …, n_m};
(12) If the applicable data set is a test data set, dividing the video clip to be detected into m video instances with equal length;
(13) cropping or scaling the resolution of the video instances obtained in step (11) or (12) to 228 × 228.
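The equal-length segmentation of step (12) can be sketched as follows; the policy of dropping trailing frames that do not fill a complete instance is an assumption not specified in the claim:

```python
def split_into_instances(frames, m):
    """Split a list of video frames into m equal-length instances;
    trailing frames that don't fill a full instance are dropped."""
    seg_len = len(frames) // m
    return [frames[i * seg_len:(i + 1) * seg_len] for i in range(m)]

frames = list(range(100))          # stand-in for decoded video frames
instances = split_into_instances(frames, m=4)
print([len(seg) for seg in instances])  # [25, 25, 25, 25]
```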
3. The method of claim 1, wherein the method comprises: the step (2) is specifically realized as follows:
(21) using Faster R-CNN as the target detection network model, taking a video instance as input; the target detection network model uses a residual network as its backbone network. The main bodies conv2_x, conv3_x, conv4_x and conv5_x in the residual network, which complete the feature extraction part, are composed of groups of residual modules: 3 convolutional layers with a residual connection form one residual module, conv2_x consists of 3 residual modules, conv3_x of 4 residual modules, conv4_x of 23 residual modules and conv5_x of 3 residual modules, and an average pooling layer, a fully-connected layer and a Softmax layer are added after the conv5_x network layer;
(22) combining a feature pyramid network as the candidate region generator: the dimensionality of the upper-layer feature map is reduced by adding convolutional layers with 1 × 1 kernels, while the feature map of the lateral convolutional layer is up-sampled to the same size; the two processed feature maps are superimposed element-wise, and the result is used as the input of the next lateral layer. The feature maps of conv2_x, conv3_x and conv4_x are superimposed element-wise with the feature map of conv5_x after the convolution, dimension reduction and sampling operations, and each level outputs its own prediction according to its feature map;
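The lateral-connection merge described in step (22) can be sketched in NumPy, treating the 1 × 1 convolution as a per-pixel channel mixing and using nearest-neighbour up-sampling; the channel widths (1024-d conv4_x map, 256-d merged maps) are illustrative assumptions:

```python
import numpy as np

def conv1x1(feat, w):
    # feat: (C_in, H, W); w: (C_out, C_in). A 1x1 convolution is a
    # per-pixel channel mixing, i.e. a matrix product over channels.
    c, h, wd = feat.shape
    return (w @ feat.reshape(c, -1)).reshape(w.shape[0], h, wd)

def upsample2x(feat):
    # Nearest-neighbour 2x up-sampling of the coarser pyramid level.
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def fpn_merge(top, lateral, w_lat):
    # Element-wise sum of the up-sampled top-down map and the
    # 1x1-reduced lateral map, as in an FPN lateral connection.
    return upsample2x(top) + conv1x1(lateral, w_lat)

rng = np.random.default_rng(1)
top = rng.standard_normal((256, 7, 7))         # coarser level, already 256-d
lateral = rng.standard_normal((1024, 14, 14))  # e.g. a conv4_x-style map
w_lat = rng.standard_normal((256, 1024)) * 0.01
merged = fpn_merge(top, lateral, w_lat)
print(merged.shape)  # (256, 14, 14)
```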
(23) approximate joint training is used for the candidate region network and Fast R-CNN within Faster R-CNN: in each stochastic gradient descent iteration, the generated candidate regions are kept fixed while Fast R-CNN is trained, and the pre-computed training loss from the candidate region network and the training loss of Fast R-CNN are combined and back-propagated through the shared layers;
(24) the human target bounding box adopts a parameterized calculation, represented as t_x = (x - x_a)/w_a, t_y = (y - y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), t_x^* = (x^* - x_a)/w_a, t_y^* = (y^* - y_a)/h_a, t_w^* = log(w^*/w_a), t_h^* = log(h^*/h_a), wherein x, y, w, h respectively denote the center abscissa, center ordinate, width and height of the predicted bounding box, x_a, y_a, w_a, h_a denote the center abscissa, center ordinate, width and height of the anchor box, and x^*, y^*, w^*, h^* denote the center abscissa, center ordinate, width and height of the ground-truth bounding box;
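The parameterization in step (24) and its inverse can be checked with a short round-trip sketch:

```python
import math

def encode(box, anchor):
    """Parameterize a (cx, cy, w, h) box relative to an anchor, per
    t_x=(x-x_a)/w_a, t_y=(y-y_a)/h_a, t_w=log(w/w_a), t_h=log(h/h_a)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    # Inverse mapping: recover the box from its parameterized coordinates.
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (tx * wa + xa, ty * ha + ya, wa * math.exp(tw), ha * math.exp(th))

anchor = (50.0, 60.0, 32.0, 64.0)
box = (58.0, 52.0, 40.0, 80.0)
t = encode(box, anchor)
# Encoding then decoding recovers the original box.
assert all(abs(a - b) < 1e-9 for a, b in zip(decode(t, anchor), box))
```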
(25) the optimization goal of the target detection network model is to reduce the error of the predicted human target bounding box and class and to minimize the multi-task training objective function
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_c(p_i, p_i^*) + λ (1/N_reg) Σ_i p_i^* L_r(t_i, t_i^*)
wherein L_c(p, u) = -log p_u, L_r(p, u) = R(p - u), R(x) is the smooth L1 loss function, R(x) = 0.5x² if |x| < 1 and |x| - 0.5 otherwise, i denotes the index of the anchor box, p_i denotes the probability that the i-th anchor box contains the target object, p_i^* is the sample label used for learning, t_i is a vector representing the set of parameterized coordinates of the human target bounding box, t_i^* is the set of parameterized coordinates of an anchor box with a positive sample label, L_c is the log classification loss predicting the presence of a human target, and λ is the hyper-parameter in the regularization loss function that balances the weight of the two terms normalized by N_cls and N_reg; the output of the target detection network model is the position and size of the human target bounding box in the video instance.
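A minimal sketch of the two-term objective for a single anchor, under the assumptions that p_i^* simply gates the regression term and that N_cls = N_reg = 1:

```python
import numpy as np

def smooth_l1(x):
    # R(x) = 0.5 x^2 if |x| < 1, else |x| - 0.5, applied element-wise.
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1.0, 0.5 * x * x, np.abs(x) - 0.5)

def multitask_loss(p, p_star, t, t_star, lam=1.0, n_cls=1, n_reg=1):
    """Log classification loss on the anchor's object probability plus
    smooth-L1 regression loss, gated by the positive-sample label p_star."""
    l_cls = -np.log(p if p_star == 1 else 1.0 - p)
    l_reg = p_star * smooth_l1(np.asarray(t) - np.asarray(t_star)).sum()
    return l_cls / n_cls + lam * l_reg / n_reg

loss = multitask_loss(p=0.9, p_star=1,
                      t=[0.1, 0.0, 0.2, -0.1], t_star=[0.0, 0.0, 0.0, 0.0])
print(round(float(loss), 4))
```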
4. The method of claim 1, wherein the method comprises: the step (3) is specifically realized as follows:
(31) using a two-stream spatio-temporal network as the human action recognition model, taking the position and size of the human target bounding box as input; the human action recognition model uses two residual networks, a temporal stream and a spatial stream, as its backbone networks. For a group of input data samples {x_1, x_2, x_3, …, x_m}, with β as the shift parameter and γ as the scale parameter, batch normalization computes the transformed y_i as follows: the mean of each batch is computed along the channel, μ = (1/m) Σ_{i=1}^{m} x_i; the variance of each batch is computed along the channel, σ² = (1/m) Σ_{i=1}^{m} (x_i - μ)²; the normalization x̂_i = (x_i - μ)/√(σ² + ε) is computed, where ε is a small constant for numerical stability; and the scaling and shifting transformation y_i = γ x̂_i + β is applied;
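The four batch-normalization steps of (31) map directly onto array operations; this NumPy sketch normalizes a batch per channel and then applies the γ/β transform:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over a batch x (shape: m x channels):
    per-channel mean and variance, normalize, then scale by gamma
    and shift by beta, as in mu -> sigma^2 -> x_hat -> y."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(3)
x = rng.standard_normal((8, 4)) * 5.0 + 2.0   # batch of 8 samples, 4 channels
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(np.allclose(y.mean(axis=0), 0.0, atol=1e-7))  # True: ~zero mean per channel
```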
(32) adjusting the first-layer filters of the temporal stream, which have three RGB channels, to run on stacks of horizontal and vertical optical flow, each stack containing 10 video frames; cross-layer residual connections Y_l = f(X_l) + g(X_l; W_l) are used in the temporal and spatial stream residual networks respectively, and cross-stream residual connections y_l^S = f(x_l^S) + g(x_l^S + x_l^T; W_l^S) are deployed before the residual modules from the temporal stream to the spatial stream, where f is the ReLU activation function, X_l is the input matrix of the l-th layer, Y_l is the output matrix of the l-th layer, g(·,·) is a non-linear residual mapping function weighted by the convolution filters W_l, x_l^S is the input of the l-th layer of the spatial stream, x_l^T is the input of the l-th layer of the temporal stream, W_l^S is the connection weight matrix of the l-th layer residual neurons of the spatial stream, and y_l^S is the output of the l-th layer of the spatial stream;
(33) using a residual network with frame rate α and channel width C as the spatial stream network to capture the static regions in the video, and a residual network with frame rate 8α and channel width C/8 as the temporal stream network to capture the dynamic regions in the video; the spatial stream network is called the slow channel and the temporal stream network the fast channel. The slow channel uses the temporal stride as a hyper-parameter to control the number of video frames skipped per second: on a data set with a video frame rate of 30, it extracts about 2 frames of features per second, while the fast channel extracts 15 frames of features per second. Data in the fast channel is passed to the slow channel through lateral connections; the network model applies a dimension conversion to the data the fast channel feeds into each lateral connection before merging it into the slow channel, performing a three-dimensional convolution at intervals of a fixed number of frames. On the last layer of the fast and slow channels, the human action recognition network model performs global average pooling to reduce the dimensionality of the data samples, fuses the two results, and passes them into a fully-connected classification layer, and a confidence score for each action is given through Softmax;
(34) the optimization goal of the human action recognition model is to reduce the error of human action class prediction and minimize the cross-entropy loss function of the Softmax classifier
L_cls = -Σ_i y_i log( e^{x_i} / Σ_j e^{x_j} )
wherein x_i is the action confidence score for the i-th human action class in the output vector of the Softmax classifier, and y_i is the i-th element of the corresponding ground-truth action label vector.
5. The method of claim 1, wherein the method comprises: the step (4) is specifically realized as follows:
(41) the method uses a multilayer perceptron as the anomaly score learning model, takes the class and confidence score of the human action as input, and uses a multi-instance learning method on instance packages B_a = {a_1, a_2, …, a_m} and B_n = {n_1, n_2, …, n_m} to train and predict the anomaly score of each video instance in a package, wherein B_a is an abnormal video instance package, B_n is a normal video instance package, a_i and n_i are the corresponding abnormal and normal video instances, and h is the multilayer perceptron model, which consists of three fully-connected layers, with 512 neurons in the input layer, 32 neurons in the hidden layer and 1 neuron in the final output layer, predicting the anomaly score of a given video instance with an output value between 0 and 1;
(42) the optimization goal of the anomaly score learning model is to maximize the distance between the two classes of samples, the abnormal video instance package and the normal video instance package, to facilitate classification by the multilayer perceptron model, using the hinge-based ranking loss
L_rank = max(0, 1 - max_{i∈B_a} h(a_i) + max_{i∈B_n} h(n_i))
considering that in reality the probability and frequency of normal events are far greater than those of abnormal events, a sparsity constraint is added to the loss function
L_mil = L_rank + ε Σ_{i∈B_a} h(a_i)
so that abnormal human actions occurring less frequently than other normal events receive higher anomaly scores, where ε is the hyper-parameter adjusting the relative weight of the hinge ranking loss and the sparsity constraint;
(43) performing joint training and optimization on the target detection network model, the human action recognition model and the anomaly score learning model by accumulating the loss functions with weight balancing
L = λ_od L_od + λ_af L_af + λ_fil L_fil
where λ_od, λ_af, λ_fil are regularization hyper-parameters controlling each loss function.
CN202111553555.5A 2021-12-17 2021-12-17 Multi-stage human body abnormal action detection method based on residual error network Pending CN114202803A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111553555.5A CN114202803A (en) 2021-12-17 2021-12-17 Multi-stage human body abnormal action detection method based on residual error network


Publications (1)

Publication Number Publication Date
CN114202803A true CN114202803A (en) 2022-03-18

Family

ID=80654998



Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419528A (en) * 2022-04-01 2022-04-29 浙江口碑网络技术有限公司 Anomaly identification method and device, computer equipment and computer readable storage medium
CN115147921A (en) * 2022-06-08 2022-10-04 南京信息技术研究院 Key area target abnormal behavior detection and positioning method based on multi-domain information fusion
CN115147921B (en) * 2022-06-08 2024-04-30 南京信息技术研究院 Multi-domain information fusion-based key region target abnormal behavior detection and positioning method
CN116564460A (en) * 2023-07-06 2023-08-08 四川省医学科学院·四川省人民医院 Health behavior monitoring method and system for leukemia child patient
CN116564460B (en) * 2023-07-06 2023-09-12 四川省医学科学院·四川省人民医院 Health behavior monitoring method and system for leukemia child patient


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination