CN117541994A - Abnormal behavior detection model and detection method in dense multi-person scene - Google Patents

Abnormal behavior detection model and detection method in dense multi-person scene

Info

Publication number
CN117541994A
Authority
CN
China
Prior art keywords
skeleton
layer
lstm
module
joint
Prior art date
Legal status
Pending
Application number
CN202311572461.1A
Other languages
Chinese (zh)
Inventor
王西超
董祥庆
孙伯潜
赵淑阳
李保江
王海燕
陈国初
Current Assignee
Shanghai Dianji University
Original Assignee
Shanghai Dianji University
Priority date
Filing date
Publication date
Application filed by Shanghai Dianji University filed Critical Shanghai Dianji University
Priority to CN202311572461.1A
Publication of CN117541994A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames


Abstract

The invention provides an abnormal behavior detection model and detection method for dense multi-person scenes. The skeleton pose extraction module extracts the skeleton joint information of each person under video surveillance. The behavior classification module models long-term temporal dynamics within and between actions and predicts action labels from the video sequence; a fully connected layer is introduced in the action recognition module to classify normal and abnormal behaviors. To improve the accuracy and precision of the target detection module, the invention replaces the ordinary convolution layers of YOLOv5 with serpentine convolution layers and introduces an ASFF feature fusion module that effectively fuses the features extracted by the network at multiple scales. The invention further adopts a separated feature training strategy and a joint-to-root Euclidean distance computation to alleviate the scarcity of abnormal behavior samples and the resulting under-fitting of the deep neural network.

Description

Abnormal behavior detection model and detection method in dense multi-person scene
Technical Field
The invention belongs to the technical field of computer vision, and more particularly to a detection model and method that combines image processing, object detection, behavior recognition, and pattern recognition.
Background
Behavior recognition aims to locate a behavior of interest in time or space within long videos and is one of the most basic video understanding tasks. In current behavior recognition research, detecting abnormal behavior in video sequences is especially important: abnormal behaviors of multiple people in a crowd under public surveillance video are a major latent threat, and if they are not discovered in time, people's lives and property can be seriously endangered. The diversity and rapid changes of human behavior make abnormal behavior detection difficult. Although some classical algorithms achieve good results in single-person behavior recognition on public datasets, their detection accuracy for multi-person abnormal behavior is low. Meanwhile, the better-performing mainstream algorithms have a large computational load and high model complexity, making them difficult to deploy and industrialize.
Early studies of abnormal behavior recognition mostly used hand-crafted feature descriptors to represent the appearance of pedestrians and the motion features extracted from the corresponding feature information, applying conventional machine learning algorithms to human behavior recognition. Hand-crafted feature descriptors include trajectories, histograms of oriented gradients (HOG), histograms of optical flow (HOF), mixtures of dynamic textures, and other low-level visual features. Traditional behavior detection methods rely on a shallow understanding of image features. With the rapid development of deep learning, however, researchers began exploring abnormal behavior recognition based on deep learning and achieved a series of results; deep learning methods can extract high-level features of human behavior in video and thus distinguish normal from abnormal behavior more effectively.
Human skeleton pose extraction uses deep learning to detect and track the coordinate positions of the key human joints (skeleton joints) in images or videos. Unlike behavior recognition based on image features, behavior recognition based on skeleton pose extraction classifies actions from human pose information: behavior features are extracted and captured from skeleton poses, and actions are classified with machine learning or deep learning methods. Skeleton pose extraction enables fine-grained analysis of human posture and joint motion, so behavior recognition can capture minute motion changes and detailed information, providing more accurate and finer behavior analysis results.
Traditional algorithms are unstable in complex scenes and are easily affected by challenging factors such as illumination changes, viewpoint differences, deformation and occlusion. Normal behavior is typically far more common than abnormal behavior, which leads to class imbalance in the dataset: a model biased towards the easier-to-identify normal behavior is less accurate on abnormal behavior. Moreover, the number of people may change across video frames, which reduces the adaptability of most models.
Current mainstream methods generally do not directly model temporal or sequential information, so they may not fully exploit time-series action relationships in abnormal behavior recognition tasks that involve time series. Abnormal behavior is typically associated with a particular pattern or trend in the time series; if an algorithm ignores temporal information, it cannot accurately capture the evolution and ordering of abnormal behavior, leading to false alarms or missed alarms. In dense multi-person scenes, complex timing relationships also exist between the actions of different people; ignoring them oversimplifies the judgment of abnormal behavior and prevents accurate discrimination between normal and abnormal behavior.
Problems of the prior art: (1) behavior class imbalance in the dataset: normal behavior is typically far more common than abnormal behavior, so current models are biased towards the easier-to-identify normal behavior and are therefore less accurate in identifying abnormal behavior.
(2) Reduced adaptability to changes in the number of people: in dense scenes, the number of people in a video frame may change, which reduces the adaptability of most models. The model of the invention aims to solve this problem and adapt to changes in the number of people across frames, improving the robustness and accuracy of the model.
(3) Joint misconnection: when recognizing behavior in dense scenes, the joints of multiple people are easily connected incorrectly.
Disclosure of Invention
The invention provides an abnormal behavior detection model and detection method for dense multi-person scenes, aiming to remedy the shortcomings of current mainstream algorithms in such scenes and to improve the accuracy of abnormal behavior detection in dense scenes.
The invention solves the technical problems as follows: an abnormal behavior detection model for dense multi-person scenes comprises a skeleton pose extraction module YH-Pose and a behavior classification module BR-LSTM; the YH-Pose module first extracts behavior feature information from the raw video data, which is then input into the BR-LSTM module to complete action classification. The YH-Pose module: the human body detector is implemented based on YOLOv5; a bounding box is first added to each person YOLOv5 detects in the scene, the bounding box containing the position of the person in the image; the YH-Pose module integrates the high-resolution skeleton pose extraction network HRNet, predicts the coordinates of the k skeleton joints of each person in the image, and then connects the skeleton joints in order according to the structure of the human skeleton to form a human pose skeleton model; from the input RGB video stream, the YH-Pose module combines the human bounding box with the pose skeleton network, and the combined output is two-dimensional human pose information, comprising the two-dimensional coordinates of the k joints of each person in each frame together with the coordinate position and confidence score of the human target box. The BR-LSTM module: predicts action labels from the video sequence to complete the action classification task. The model incorporates a separated feature training strategy: in the data preprocessing stage, BR-LSTM splits the two-dimensional joint coordinate data into an x-coordinate sequence and a y-coordinate sequence, and computes the Euclidean distance from each joint to the root node to expand the data samples. The module comprises a data preprocessing module and a bidirectional behavior classification network of six LSTM units; the data preprocessing module maps motion features into feature vectors through a linear layer and feeds them into a forward LSTM layer and a reverse LSTM layer; at each time step, each LSTM computes the hidden state and output of the current time step from the current input and the hidden state of the adjacent time step; finally, a fully connected layer further classifies normal and abnormal behaviors.
The human body detector extracts detailed skeleton data, including skeleton joint positions and the temporal information of inter-frame joints, from the captured RGB video data; to first detect the human body and then extract its skeleton pose, the YH-Pose module combines the high-resolution skeleton pose extraction network HRNet with the YOLOv5 target detection framework; the prediction head of YOLOv5 is improved by introducing several ASFF modules, which perform multi-scale weighted fusion of the feature information; the input image passes through a feature pyramid network with feature maps at different levels, and the ASFF module fully fuses the feature information between the levels, combining it into a new feature representation by weighted averaging. In a conventional convolution operation, the convolution kernel scans the input feature map in a fixed order, left to right and top to bottom; serpentine convolution adopts a nonlinear scanning mode in which the scanning path of the kernel is designed as a serpentine shape or curve, changing the scanning order. Serpentine convolution can also reduce the parameter count of a model: in conventional convolution, adjacent kernels typically have similar weights, so parameters can be reduced by sharing them, whereas serpentine convolution cannot directly share parameters because its nonlinear scanning paths give adjacent kernels different weights; each kernel therefore needs independent parameters, which in turn increases the expressive power of the model. Accordingly, in the YOLOv5 backbone, serpentine convolution layers replace the ordinary convolution layers so that each kernel sees a larger range of input information, improving the model's perception of global features.
α3, β3 and γ3 are the weighting scalars of the third layer, and x1→3, x2→3, x3→3 are the feature tensors resized from each layer to the third level; as in equation (1), the weighted new feature of the third-layer ASFF module is ASFF3:
ASFF3 = α3·x1→3 + β3·x2→3 + γ3·x3→3 (1).
When the network outputs keypoints of N classes, it first outputs an N-dimensional feature map; a Gaussian kernel is then constructed around the keypoints on the N-dimensional labeled feature map to generate the human skeleton keypoint heat map encoding, and the (x, y) coordinates and confidence scores of the N skeleton keypoints are obtained from the keypoint heat map.
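The decoding step can be sketched as follows (a minimal example; the tensor layout, function name and argmax-based peak picking are assumptions, not taken from the patent):

```python
import torch

def decode_heatmaps(heatmaps: torch.Tensor):
    """Decode (x, y) coordinates and confidence scores from keypoint heat maps.

    heatmaps: tensor of shape (N, H, W), one channel per keypoint class.
    Returns (coords, scores): coords is (N, 2) holding (x, y), scores is (N,).
    """
    n, h, w = heatmaps.shape
    flat = heatmaps.view(n, -1)
    scores, idx = flat.max(dim=1)                      # peak value = confidence
    ys = torch.div(idx, w, rounding_mode="floor").float()  # row of the peak
    xs = (idx % w).float()                             # column of the peak
    coords = torch.stack([xs, ys], dim=1)              # (x, y) per keypoint
    return coords, scores
```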
In the separated feature training, the Focal Loss function is adopted, as in equation (2); p is the probability that the model predicts the foreground, with values from 0 to 1; y takes values 1 and -1; α_t plays a modulating role, reducing the attention paid to unimportant sample features and increasing the attention paid to challenging sample features; it is a fixed parameter in the range 0 to 5;
FL(p_t) = -α_t (1 - p_t)^γ log(p_t) (2)
The separated feature training adopts a separated feature encoding strategy to extract skeleton features. First, a vector representing the human body pose is generated by concatenating three vectors: the normalized x-coordinate vector of the joint positions, the normalized y-coordinate vector of the joint positions, and the joint-to-root distance vector. Because the distance between the camera and a person differs across regions of each frame, the scale of the joint positions in the image also differs; each joint in the image is described by its abscissa and ordinate, and the original position of the i-th joint is defined as (x_i, y_i). The joint positions of each person detected in each frame are normalized with equation (4), where (x̂_i, ŷ_i) denotes the normalized coordinate position of a joint. Since each joint is described by its abscissa and ordinate, the normalized joint position vector contains 2k features corresponding to the k joints.
After the coordinate positions of the human skeleton joints are determined, a second component vector is obtained by computing the distance from each of the k joints to the human root joint point O (the centroid, at (x_0, y_0)). The Euclidean distance from each joint to the root joint is computed with equation (5); the resulting joint distance vector contains k features, corresponding to the k distances d_1 to d_k:
d_i = √((x_i - x_0)² + (y_i - y_0)²) (5);
The BR-LSTM module performs feature extraction and behavior classification on the skeleton information with a bidirectional LSTM. First, the k coordinates are split into x-coordinate values (x_1, x_2, …, x_k) and y-coordinate values (y_1, y_2, …, y_k), and the distance from each joint to the root node (d_1, d_2, …, d_k) is computed as a third feature component. As consecutive image frames are detected, time-varying x-coordinate, y-coordinate and distance sequences are obtained. These data are then processed into a length and size suitable for LSTM training and fed into three LSTM networks for temporal feature extraction; thereafter, each time a new frame of image data is detected, the new coordinate values are appended to the sequence and the oldest coordinate values are deleted. Finally, the classification information of the behavior actions is merged and fed into a fully connected layer, which classifies normal and abnormal behaviors to determine whether a behavior is abnormal.
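As a sketch of the three-branch arrangement just described, the following PyTorch module feeds the x, y and distance sequences into three LSTMs and merges their final hidden states in a fully connected classifier (the layer sizes and the use of final hidden states are assumptions):

```python
import torch
import torch.nn as nn

class ThreeBranchLSTM(nn.Module):
    """One LSTM per feature sequence (x, y, distance), merged by an FC layer."""
    def __init__(self, joints: int = 17, hidden: int = 64, classes: int = 2):
        super().__init__()
        self.lstm_x = nn.LSTM(joints, hidden, batch_first=True)
        self.lstm_y = nn.LSTM(joints, hidden, batch_first=True)
        self.lstm_d = nn.LSTM(joints, hidden, batch_first=True)
        self.fc = nn.Linear(3 * hidden, classes)   # normal vs. abnormal

    def forward(self, x_seq, y_seq, d_seq):
        # each *_seq: (batch, time, joints)
        _, (hx, _) = self.lstm_x(x_seq)            # final hidden states
        _, (hy, _) = self.lstm_y(y_seq)
        _, (hd, _) = self.lstm_d(d_seq)
        merged = torch.cat([hx[-1], hy[-1], hd[-1]], dim=1)
        return self.fc(merged)
```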
The LSTM neural unit comprises an input gate i_t, a forget gate f_t, a cell state C_t and an output gate O_t; long- and short-term memory is controlled by the gates and the cell state, and the computation is expressed by equations (6) to (11). In equation (6), the input gate at time t combines the hidden output of the previous time step with the input at time t. In equation (7), the candidate cell state C̃_t at time t is computed from h_{t-1} and x_t, which denote the hidden output of the previous time step and the input at time t. In equation (8), the forget gate controls which information in the previous memory state should be forgotten or retained. In equation (9), the outputs of the input gate and the forget gate are combined to update the cell state at time t. The double-layer LSTM network is then connected end to end, the cells of each LSTM layer are connected in sequence, and the action feature sequences are predicted by forward and backward learning;
i_t = σ(W_i · [h_{t-1}, x_t] + b_i) (6);
C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c) (7);
f_t = σ(W_f · [h_{t-1}, x_t] + b_f) (8);
C_t = f_t * C_{t-1} + i_t * C̃_t (9);
O_t = σ(W_o · [h_{t-1}, x_t] + b_o) (10);
h_t = O_t * tanh(C_t) (11);
In the double-layer bidirectional LSTM structure, the forward layer and the backward layer are jointly connected to the output layer, which contains six shared weights w_1-w_6. Forward propagation of the behavior features is computed in the forward layer from time 1 to time t; in the backward layer, computation runs backwards from time t to time 1 to obtain and save the output of the backward hidden layer at each time step. The corresponding outputs of the forward and backward layers at each time step are combined to give the final output, as in equations (12) to (15), where b is the bias and o′_t, o″_t are the motion feature vectors output by the two LSTM layers at the corresponding time step;
The abnormal behavior detection method in a dense multi-person scene comprises the following steps:
Step 1, video acquisition: record or acquire video data of the dense crowd scenes requiring abnormal behavior analysis.
Step 2, human body detection: based on the YOLOv5 detection framework, add a bounding box to each person in the video and mark the position of the person in the image.
Step 3, skeleton pose extraction: use the YH-Pose module, which integrates the high-resolution skeleton pose extraction network HRNet, to compute and determine the k key skeleton node positions of each person in the video.
Step 4, pose skeleton model generation: connect the key skeleton nodes confirmed in the previous step in order according to the human skeleton structure to generate the human pose skeleton model.
Step 5, feature fusion: from the input RGB video frames, the YH-Pose network fuses the human bounding box with the pose skeleton model to generate fused human pose information containing the two-dimensional coordinates of the k joints in each frame, the bounding box position and the confidence.
Step 6, data preprocessing: in the behavior classification stage, the BR-LSTM module preprocesses the generated human pose information, splitting the two-dimensional joint coordinates into separate x and y coordinate sequences and computing the Euclidean distance from each joint to the root node.
Step 7, behavior feature extraction: the feature extraction part of the BR-LSTM module receives the preprocessed data and extracts the spatio-temporal features of the action through the long short-term memory network.
Step 8, classification and prediction: after data processing and behavior feature extraction, a fully connected layer performs the final classification, and the trained model identifies and predicts abnormal and normal behaviors.
Step 9, result output: mark the abnormal behaviors in the video according to the classification and prediction results for further analysis and processing.
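A minimal end-to-end sketch of these steps (all function names, shapes and the window length are illustrative assumptions; the three stubs stand in for YOLOv5, HRNet and the BR-LSTM classifier):

```python
from collections import deque
import numpy as np

K = 17          # skeleton joints per person
WINDOW = 50     # frames per classified clip (about 2 s of video)

def detect_people(frame):           # stand-in for the YOLOv5 detector
    raise NotImplementedError

def extract_skeleton(frame, box):   # stand-in for the HRNet pose network
    raise NotImplementedError       # returns (K, 2) joint coordinates

def classify_window(features):      # stand-in for the BR-LSTM classifier
    raise NotImplementedError       # returns "normal" or "abnormal"

def run_pipeline(video_frames):
    window = deque(maxlen=WINDOW)   # sliding window: old frames drop out
    for frame in video_frames:
        per_person = []
        for box in detect_people(frame):              # steps 2-3
            joints = extract_skeleton(frame, box)     # (K, 2)
            root = joints.mean(axis=0)                # centroid as root joint
            dists = np.linalg.norm(joints - root, axis=1)
            # step 6: separated features - x sequence, y sequence, distances
            per_person.append(np.concatenate([joints[:, 0], joints[:, 1], dists]))
        window.append(per_person)
        if len(window) == WINDOW:                     # steps 7-9
            yield classify_window(list(window))
```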
The invention has the beneficial effects that the networks in the framework are extended through several improved modules and strategies, so that abnormal behaviors of people in dense scenes can be detected accurately.
(1) The key of the target detection module is to combine HRNet with YOLOv5: HRNet extracts high-resolution pose features while the improved YOLOv5 framework performs target detection. By introducing the ASFF feature fusion module, the features of the two networks are fused effectively, improving the accuracy and precision of target detection.
(2) During training, it was noted that the lack of sufficient abnormal behavior samples may cause the model to under-fit. To address this, a separated feature training strategy is employed: the fused two-dimensional coordinate data is split into an x-coordinate sequence and a y-coordinate sequence, which are input into the network for training separately. In this way, the existing normal behavior data can be used more fully to train the model, improving its performance and generalization ability. To enhance the model's sensitivity to abnormal behavior, the Euclidean distance from each two-dimensional coordinate to the root node is computed, introducing more information about the spatial relationships between joints and making the model better able to detect abnormal behavior. Together, the separated feature training strategy and the Euclidean distance computation effectively mitigate the scarcity of abnormal behavior samples and the under-fitting of the deep neural network, improving the performance and accuracy of the model in the abnormal behavior detection task.
Drawings
FIG. 1 is a general frame diagram of an abnormal behavior recognition model technical scheme;
FIG. 2 is a visual comparison of the articulation of two model skeletons;
FIG. 3 is a flow chart of a multi-scale spatial feature fusion strategy;
FIG. 4 is a diagram illustrating the behavior characteristics of an node-to-root node;
FIG. 5 is a block diagram of the internal network of the LSTM unit;
fig. 6 is a detailed structural diagram of the double layer bi-directional LSTM.
Detailed Description
The invention studies an abnormal behavior detection model for dense multi-person scenes, composed of two modular networks that respectively perform skeleton pose extraction and behavior classification. The skeleton pose extraction network is a top-down skeleton pose extraction module (YH-Pose) that extracts the 17 skeleton joints (K = 17) of each person under video surveillance. The module describes human motion information by analyzing the changes of the skeleton joints over time. It combines a high-resolution skeleton pose extraction network (HRNet) with an improved YOLOv5 target detection framework and introduces the feature fusion module ASFF to improve the detection accuracy of YOLOv5. At the same time, it resolves the problem of disconnected skeleton joints when the HRNet network extracts skeleton poses.
The behavior classification network is the BR-LSTM module, which models long-term temporal dynamics within and between actions and predicts action labels from video sequences of about 2 seconds. It comprises a data preprocessing module and a bidirectional behavior classification network with six LSTM units. Because abnormal behavior samples are scarce, a deep neural network easily under-fits during training; a separated feature training strategy is therefore proposed, which expands the data samples in the BR-LSTM network by splitting the fused two-dimensional coordinate data into an x-coordinate sequence and a y-coordinate sequence and computing the Euclidean distance from each two-dimensional coordinate to the root node. The data preprocessing module encodes the two-dimensional human joint coordinates containing the motion features in a separated manner: it computes the Euclidean distance from all joints to the root joint, generating feature vectors for the different actions. The module maps the motion detail features into motion feature vectors through a linear layer and feeds them into the forward and reverse LSTM layers. Within each time step, the forward LSTM computes the hidden state and output of the current time step from the current input and the hidden state of the previous time step; the reverse LSTM computes the hidden state of the current time step from the current input and the hidden state of the subsequent time step. The hidden states of the forward and reverse LSTMs are concatenated at each time step to form a richer representation that retains both past and future context. Finally, a fully connected layer is added after the bidirectional LSTM to further classify normal and abnormal behavior.
Specifically, the general block diagram of the intelligent abnormal behavior recognition framework provided by the invention is shown in FIG. 1. The framework integrates a skeleton pose extraction module (YH-Pose) and a behavior classification module (BR-LSTM); YH-Pose comprises a human body detection module and a skeleton pose extraction module. The invention is further described below with reference to the drawings and examples.
I. The overall technical scheme is as follows:
Our overall framework architecture integrates a skeleton pose extraction module (YH-Pose) and a behavior classification module (BR-LSTM). The object detector in the skeleton pose extraction module is designed based on YOLOv5, and a bounding box is added for each detected person in the scene; the human bounding box contains information about the person's position in the image. The high-resolution skeleton pose extraction network (HRNet) is improved by adopting a Focal Loss function instead of the mean square error (MSE) loss function, so that the model converges faster. The joint detection module integrates the improved HRNet and is responsible for estimating the coordinates of the 17 skeleton joints of each person in the image. A human pose skeleton model is a structured model for representing and capturing human posture; it mimics the structure of the human skeletal system and defines the connection and constraint relationships between joints. The module connects the joint points in order according to the structure of the human skeleton to form a dedicated human pose skeleton model. The YH-Pose module processes the input RGB video frames and outputs the human bounding boxes; the two-dimensional human pose with bounding box comprises the two-dimensional coordinates of the 17 joints of the human body in each frame together with the coordinate position and confidence score of the human target box.
The BR-LSTM is a bidirectional long short-term memory network capable of extracting spatio-temporal feature information; it integrates a data preprocessing module that takes the pose information as input, an action feature extraction module composed of six bidirectionally linked LSTM units, and a fully connected layer (FC layer). The preprocessing module of the BR-LSTM network splits the two-dimensional joint coordinates of the 17 skeleton joints into independent x_i and y_i coordinate sequences and computes the Euclidean distance d_i (i = 1, 2, …, 17) from each skeleton joint to the root node. The feature extraction module extracts the spatio-temporal features of the human pose. A Dropout strategy is introduced after feature fusion to avoid overfitting.

II. Human body detector module:
Detailed skeleton data, including skeleton joint positions and the temporal information of inter-frame joints, is extracted from the captured RGB video of human behavior. The YOLOv5 algorithm detects the human bodies in the target area. YOLOv5 offers four variants: YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x. According to the YOLOv5 paper, the YOLOv5s model has the smallest depth, the smallest feature map width, the lowest model complexity and the fastest detection speed. However, the YOLOv5s network has low detection accuracy when applied to target detection and cannot extract the multi-scale characteristics of targets in images, so it is improved by introducing several ASFF modules before the YOLOv5s prediction head and replacing the ordinary convolution layers with serpentine convolution layers, raising accuracy without reducing detection speed. Convolutional neural networks (CNNs) require input images of fixed size; because image sizes differ, conventional cropping may lose effective information and cause unnecessary loss of accuracy, and repeated convolution over candidate regions leads to computational redundancy. To solve the problem of fusing features of inconsistent scales, several ASFF modules are introduced before the YOLOv5 prediction head to perform weighted fusion of feature information. As shown in FIG. 3, the input image passes through a feature pyramid network with feature maps at different levels. Without the ASFF module, each level can only output its own feature prediction; the ASFF module fully fuses the feature information between the levels, combining the features of the different levels into a new representation by weighted averaging. α3, β3 and γ3 are the weighting scalars of the third level, and x1→3, x2→3, x3→3 are the feature tensors resized from each level. In a conventional convolution operation, the convolution kernel scans the input feature map in a fixed order, left to right and top to bottom. Serpentine convolution adopts a nonlinear scanning mode in which the scanning path of the kernel is designed as a serpentine shape or curve, changing the scanning order. Serpentine convolution can also reduce the parameter count of a model: in conventional convolution, adjacent kernels typically have similar weights, so parameters can be reduced by sharing them, whereas serpentine convolution cannot directly share parameters because its nonlinear scanning paths give adjacent kernels different weights. Each kernel therefore needs independent parameters, which in turn increases the expressive power of the model. Accordingly, in the YOLOv5 backbone, serpentine convolution layers replace the ordinary convolution layers so that each kernel sees a larger range of input information, improving the model's perception of global features.
ASFF3 = α3·x1→3 + β3·x2→3 + γ3·x3→3 (1)
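A minimal sketch of this weighted fusion for the third level (the layer names and shapes are assumptions; the per-pixel weights are normalized with a softmax so that α3 + β3 + γ3 = 1, as in the ASFF formulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFF3(nn.Module):
    """Adaptively fuse three pyramid levels, all resized to the level-3 shape."""
    def __init__(self, channels: int):
        super().__init__()
        # a 1x1 conv predicts one spatial weight map per source level
        self.weight_conv = nn.Conv2d(3 * channels, 3, kernel_size=1)

    def forward(self, x1_to_3, x2_to_3, x3_to_3):
        # x*_to_3: (B, C, H3, W3), features already resized to level 3
        stacked = torch.cat([x1_to_3, x2_to_3, x3_to_3], dim=1)
        w = F.softmax(self.weight_conv(stacked), dim=1)   # alpha, beta, gamma
        a, b, g = w[:, 0:1], w[:, 1:2], w[:, 2:3]
        return a * x1_to_3 + b * x2_to_3 + g * x3_to_3    # equation (1)
```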
The mainstream two-dimensional human skeleton pose extraction method is keypoint detection, which rests on two tasks: joint position regression and joint heat map estimation. Among current mainstream skeleton pose extraction networks, HRNet has a strong detection effect: the whole network maintains high resolution throughout rather than recovering resolution through a low-to-high process, so its heat map predictions have greater spatial accuracy. The YH-Pose scheme therefore takes HRNet as the framework for skeleton pose extraction. When the network is to output keypoints of N classes, it outputs an N-dimensional feature map; at the same time, a Gaussian kernel is constructed around the keypoints on the N-dimensional labeled feature map and the human skeleton keypoints are detected, yielding the (x, y) coordinates and confidence scores of the 17 skeleton joints in the image. The Focal Loss function can dynamically reduce the weight of nodes with little influence on behavior during training, focusing attention more quickly on the nodes with greater influence; this effectively addresses unbalanced data samples and uneven node weight distribution and improves the model's performance in abnormal behavior recognition tasks. The mean square error (MSE) loss function is very sensitive to outliers: because of the square term, MSE amplifies their effect, causing the model to pay excessive attention to outliers and ignore the importance of other data points, which can make the model unstable in their presence. The Focal Loss function is therefore used instead of the original MSE loss. It is shown in equation (2): p_t is the probability that the model predicts the foreground, with values from 0 to 1; y takes values 1 and -1; α_t is an introduced weighting factor, a modulation factor that reduces the loss contribution of unimportant samples and thereby increases the loss share of challenging samples; it is a fixed parameter in the range 0 to 5.
FL(p_t) = -α_t (1 - p_t)^γ log(p_t) (2)
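A minimal sketch of this loss under the stated parameterization, with labels y ∈ {1, -1} (the default values of α and γ are assumptions):

```python
import torch

def focal_loss(p: torch.Tensor, y: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss, equation (2).

    p: predicted foreground probability in (0, 1); y: labels in {1, -1}.
    """
    p_t = torch.where(y == 1, p, 1.0 - p)            # prob. of the true class
    alpha_t = torch.where(y == 1, torch.full_like(p, alpha),
                          torch.full_like(p, 1.0 - alpha))
    eps = 1e-7                                       # numerical stability
    loss = -alpha_t * (1.0 - p_t).pow(gamma) * torch.log(p_t.clamp(min=eps))
    return loss.mean()
```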
III. Separated feature encoding strategy:
Skeleton feature extraction is the most critical task in the whole process. In this task, a vector representing the human body pose is generated by concatenating three vectors: the normalized x-coordinate vector of the joint positions, the normalized y-coordinate vector of the joint positions, and the joint-to-root distance vector. The distance between a person and the camera differs across regions of each frame, so the scale of the joint positions in the image also differs. Each joint in the image is described by its abscissa and ordinate, and the original position of the i-th joint is defined as (x_i, y_i). Equation (4) is used to normalize the joint positions of each person detected in each frame; (x̂_i, ŷ_i) denotes the normalized coordinate position of a joint. Since each joint is described by its abscissa and ordinate, the normalized joint position vector contains 34 features corresponding to the 17 joints.
As shown in FIG. 4, after the coordinate positions of the human skeleton joints are determined, a second component vector is obtained by computing the distance from each of the 17 joints to the human root joint point O (the centroid, at (x_0, y_0)). The Euclidean distance from each joint to the root joint is computed with equation (5); the joint distance vector contains 17 features, corresponding to the 17 distances d_1 to d_17:
d_i = √((x_i - x_0)² + (y_i - y_0)²) (5)
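A minimal sketch of the separated encoding for one frame (normalizing by the image width and height is an assumption, since equation (4) itself is given only in the drawings):

```python
import numpy as np

def encode_pose(joints: np.ndarray, img_w: int, img_h: int) -> np.ndarray:
    """Build the 3*17-dimensional pose vector from (17, 2) joint coordinates.

    Concatenates normalized x coordinates, normalized y coordinates and
    joint-to-root Euclidean distances (the root is the joint centroid).
    """
    xs = joints[:, 0] / img_w            # normalized x (assumed form of eq. 4)
    ys = joints[:, 1] / img_h            # normalized y
    root = joints.mean(axis=0)           # root joint O = centroid (x0, y0)
    dists = np.linalg.norm(joints - root, axis=1)   # equation (5)
    return np.concatenate([xs, ys, dists])          # 51 features
```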
IV. Double-layer bidirectional BR-LSTM module:
The BR-LSTM module uses a bidirectional LSTM for feature extraction and behavior classification of the skeleton information. First, the 17 coordinates are split into x-coordinate values (x_1, x_2, …, x_17) and y-coordinate values (y_1, y_2, …, y_17), and the distance from each joint to the root node (d_1, d_2, …, d_17) is computed as a third feature component. As successive image frames are detected, time-varying x-coordinate, y-coordinate and distance sequences are obtained. The data is then processed to a length and size suitable for LSTM training and fed into three LSTM networks for temporal feature extraction; thereafter, each time a new frame of image data is detected, the new coordinate values are appended to the sequence and the old coordinate values are deleted. Finally, the classification information of the behavior actions is merged and fed into a fully connected layer, which classifies normal and abnormal behaviors to determine whether a behavior is abnormal. FIG. 5 shows the structure of an LSTM neuron. The LSTM neural unit comprises an input gate i_t, a forget gate f_t, a cell state C_t and an output gate O_t. Long- and short-term memory is controlled by the gates and the cell state, and the computation is represented by equations (6) to (11). In equation (6), the input gate at time t combines the hidden output of the previous time step with the input at time t. In equation (7), the candidate cell state C̃_t at time t is computed from h_{t-1} and x_t, the hidden output of the previous time step and the input at time t. In equation (8), the forget gate controls which information in the previous memory state should be forgotten or retained. In equation (9), the outputs of the input gate and the forget gate are combined to update the cell state at time t. The double-layer LSTM network is then connected end to end, the cells of each LSTM layer are connected in sequence, and the action feature sequences are predicted by forward and backward learning.
i_t = σ(W_i · [h_{t-1}, x_t] + b_i) (6)
C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c) (7)
f_t = σ(W_f · [h_{t-1}, x_t] + b_f) (8)
C_t = f_t * C_{t-1} + i_t * C̃_t (9)
O_t = σ(W_o · [h_{t-1}, x_t] + b_o) (10)
h_t = O_t * tanh(C_t) (11)
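A minimal NumPy sketch of one LSTM cell step implementing equations (6) to (11) (the weight layout is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W and b hold the four gate parameters.

    W[k]: (hidden, hidden + input) matrices, b[k]: (hidden,) biases,
    for k in {'i', 'c', 'f', 'o'}.
    """
    z = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]
    i_t = sigmoid(W['i'] @ z + b['i'])            # input gate, eq. (6)
    c_hat = np.tanh(W['c'] @ z + b['c'])          # candidate state, eq. (7)
    f_t = sigmoid(W['f'] @ z + b['f'])            # forget gate, eq. (8)
    c_t = f_t * c_prev + i_t * c_hat              # cell update, eq. (9)
    o_t = sigmoid(W['o'] @ z + b['o'])            # output gate, eq. (10)
    h_t = o_t * np.tanh(c_t)                      # hidden output, eq. (11)
    return h_t, c_t
```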
The specific structure of the double-layer bidirectional LSTM is shown in FIG. 6: the forward layer and the backward layer are jointly connected to the output layer, which contains six shared weights w_1-w_6. Forward propagation of the behavior features is computed in the forward layer from time 1 to time t; in the backward layer, computation runs backwards from time t to time 1 to obtain and store the output of the backward hidden layer at each time step. The corresponding outputs of the forward and backward layers at each time step are combined to obtain the final output, as in equations (12) to (15), where b is the bias and o′_t, o″_t are the motion feature vectors output by the two LSTM layers at the corresponding time step.
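A minimal sketch of such a double-layer bidirectional LSTM with a fully connected output head (all hyper-parameters are assumptions):

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Two stacked bidirectional LSTM layers followed by an FC classifier."""
    def __init__(self, features: int = 51, hidden: int = 128, classes: int = 2):
        super().__init__()
        self.bilstm = nn.LSTM(features, hidden, num_layers=2,
                              batch_first=True, bidirectional=True,
                              dropout=0.5)          # Dropout against overfitting
        self.fc = nn.Linear(2 * hidden, classes)    # forward + backward states

    def forward(self, seq):
        # seq: (batch, time, features); output concatenates both directions
        out, _ = self.bilstm(seq)
        return self.fc(out[:, -1, :])               # classify from last step
```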
V. Dataset acquisition:
The public datasets used for training are HMDB51 and NTU_RGB+D, together with a self-built Shanghai traffic road dataset. The HMDB51 dataset contains 51 action classes with 6,849 videos in total; each class contains at least 101 videos, at a resolution of 320x320. Its 51 motion categories, such as "brushing hair", "clapping", "running" and "waving", represent human actions common in everyday life and cover different viewing angles, movement speeds, lighting conditions and background disturbances, so accurately classifying and identifying the actions is a very challenging task. NTU_RGB+D is a widely used skeleton-based human motion recognition dataset containing 56,880 skeleton action sequences, with two evaluation benchmarks: the cross-subject (X-Sub) and cross-view (X-View) settings. For X-Sub, the training and test sets come from two disjoint groups of 20 subjects each. For X-View, the training set contains 37,920 samples and the test set contains 18,960 sequences, split by capturing camera. The Shanghai traffic road dataset consists of 100 two-minute videos legally captured by static elevated high-definition cameras over 32 traffic roads in Shanghai, China. The dataset is divided into three parts: 60 training videos, 13 validation videos and 27 test videos, annotated with fine-grained action start and end times for 10 action categories. In the study, five normal behaviors that frequently occur on traffic zebra crossings were selected: walking, jumping, running, turning around and smoking; and five abnormal behaviors: falling, kicking, hitting a person, vomiting and looking down at a phone.
VI. Model training:
The device used for training was a server configured with a 12th-generation Intel(R) Core(TM) i9-12900K CPU at 3.19 GHz, 64 GB of memory, a 64-bit Windows 10 operating system and an Nvidia 3090 GPU. Anaconda 3.2.0 was used as the integrated development environment, and the programming language was Python 3.6.5. The self-designed modular networks were constructed under the PyTorch deep learning framework.
VII. Loss function:
A cross-entropy loss function is used to calculate the difference between the predicted output of the behavior recognition model and the true label. For each sample, the cross-entropy loss is computed by equation (16), where y_true is the true label vector and y_pred is the model's predicted output vector.
L = -∑(y_true · log(y_pred)) (16)
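A minimal sketch of equation (16) for a single sample (one-hot labels assumed):

```python
import numpy as np

def cross_entropy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Equation (16): L = -sum(y_true * log(y_pred)) for one sample."""
    eps = 1e-12                        # avoid log(0)
    return float(-np.sum(y_true * np.log(y_pred + eps)))

# usage: two classes (normal, abnormal), true class is "abnormal"
loss = cross_entropy(np.array([0.0, 1.0]), np.array([0.2, 0.8]))
```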
VIII. Model training parameter settings:
Each sample in the three datasets is uniformly adjusted to 50 frames, and data preprocessing is then performed with deletion, replacement or smoothing, following the statistics- and domain-knowledge-based outlier identification and handling methods proposed by Zhang et al. For the HMDB51 and NTU_RGB+D datasets, the model was trained for 110 rounds with a batch size of 200. The initial learning rate was set to 0.1 and reduced by a factor of 10 at epochs 80 and 120; the weight decay was set to 5e-4, and a loss of 0.1 was used as the iteration termination condition. For the self-built traffic road dataset, the initial learning rate was set to 0.001, the number of training rounds was 10,000, optimization used the Adam optimizer, and the dropout rate of the last FC layer was set to 0.5.
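A minimal sketch of this training configuration for the HMDB51/NTU_RGB+D setting (the model, data loader and momentum value are placeholders or assumptions; the MultiStepLR schedule mirrors the stated decay at epochs 80 and 120):

```python
import torch

def train(model, loader, epochs: int = 110):
    criterion = torch.nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=5e-4)   # momentum assumed
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[80, 120],
                                                 gamma=0.1)  # lr /= 10
    for epoch in range(epochs):
        for seq, label in loader:
            opt.zero_grad()
            loss = criterion(model(seq), label)
            loss.backward()
            opt.step()
        sched.step()
        if loss.item() < 0.1:          # stated termination condition
            break
```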
IX. Evaluation indices:
The action classification performance of the framework is measured by two indicators: mean average precision (mAP) and accuracy (Acc). Two further metrics, giga floating-point operations (GFLOPs) and frames per second (FPS), are used to analyze the computational complexity and detection speed of the model. Object keypoint similarity (OKS) is the evaluation index for joint detection and is used to compare the performance of the skeleton pose extraction model with other models.
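For reference, a minimal sketch of the COCO-style OKS computation (the formula and per-keypoint constants follow the COCO convention and are assumptions, since the patent does not restate them):

```python
import numpy as np

def oks(pred: np.ndarray, gt: np.ndarray, visible: np.ndarray,
        area: float, k: np.ndarray) -> float:
    """Object keypoint similarity between predicted and ground-truth joints.

    pred, gt: (N, 2) keypoint coordinates; visible: (N,) 0/1 flags;
    area: object segment area; k: (N,) per-keypoint falloff constants.
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)           # squared distances
    e = d2 / (2.0 * area * k ** 2 + 1e-12)          # normalized error
    sim = np.exp(-e)
    mask = visible > 0
    return float(sim[mask].mean()) if mask.any() else 0.0
```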
The above detailed description of the present invention is merely illustrative or explanatory of the principles of the invention and is not necessarily intended to limit the invention. Accordingly, any modification, equivalent replacement, improvement, etc. made without departing from the spirit and scope of the present invention should be included in the scope of the present invention.

Claims (9)

1. An abnormal behavior detection model in a dense multi-person scene, characterized by comprising a skeleton pose extraction module YH-Pose and a behavior classification module BR-LSTM, wherein the YH-Pose module first extracts behavior feature information from the raw video data, which is then input into the BR-LSTM module to complete action classification; the YH-Pose module: the human body detector is implemented based on YOLOv5; a bounding box is first added to each person YOLOv5 detects in the scene, the bounding box containing the position of the person in the image; the YH-Pose module integrates the high-resolution skeleton pose extraction network HRNet, predicts the coordinates of the k skeleton joints of each person in the image, and then connects the skeleton joints in order according to the structure of the human skeleton to form a human pose skeleton model; from the input RGB video stream, the YH-Pose module combines the human bounding box with the pose skeleton network, and the combined output is two-dimensional human pose information comprising the two-dimensional coordinates of the k joints of each person in each frame together with the coordinate position and confidence score of the human target box; the BR-LSTM module: predicts action labels from the video sequence to complete the action classification task; the model incorporates a separated feature training strategy: in the data preprocessing stage, BR-LSTM splits the two-dimensional joint coordinate data into an x-coordinate sequence and a y-coordinate sequence and computes the Euclidean distance from each joint to the root node to expand the data samples; the module comprises a data preprocessing module and a bidirectional behavior classification network of six LSTM units; the data preprocessing module maps motion features into feature vectors through a linear layer and feeds them into a forward LSTM layer and a reverse LSTM layer; at each time step, each LSTM computes the hidden state and output of the current time step from the current input and the hidden state of the adjacent time step; finally, a fully connected layer further classifies normal and abnormal behaviors.
2. The abnormal behavior detection model in a dense multi-person scene according to claim 1, wherein the human body detector extracts detailed skeleton data, including skeleton joint positions and the temporal information of inter-frame joints, from the captured RGB video data; to first detect the human body and then extract its skeleton pose, the YH-Pose module combines the high-resolution skeleton pose extraction network HRNet with the YOLOv5 target detection framework; the prediction head of YOLOv5 is improved by introducing several ASFF modules, which perform multi-scale weighted fusion of the feature information; the input image passes through a feature pyramid network with feature maps at different levels, and the ASFF module fully fuses the feature information between the levels, combining it into a new feature representation by weighted averaging; serpentine convolution adopts a nonlinear scanning mode in which the scanning path of the convolution kernel is designed as a serpentine shape or curve, changing the scanning order; serpentine convolution can also reduce the parameter count of the model; owing to the nonlinear scanning paths, adjacent serpentine convolution kernels often have different weights, so parameters cannot be shared directly; each kernel therefore needs independent parameters, which increases the expressive power of the model; accordingly, in the YOLOv5 backbone, serpentine convolution layers replace the ordinary convolution layers so that each kernel sees a larger range of input information, improving the model's perception of global features.
3. The abnormal behavior detection model in a dense multi-person scene according to claim 2, wherein α3, β3 and γ3 are the weighting scalars of the third layer and x1→3, x2→3, x3→3 are the feature tensors resized from each layer; as in equation (1), the weighted new feature of the third-layer ASFF module is ASFF3:
ASFF3 = α3·x1→3 + β3·x2→3 + γ3·x3→3 (1).
4. The abnormal behavior detection model in a dense multi-person scene according to claim 2, wherein HRNet is used as the model for predicting the human skeleton pose; when the network outputs keypoints of N classes, it first outputs an N-dimensional feature map, then constructs a Gaussian kernel around the keypoints on the N-dimensional labeled feature map to generate the human skeleton keypoint heat map encoding, and then obtains the (x, y) coordinates and confidence scores of the N skeleton keypoints from the keypoint heat map.
5. The abnormal behavior detection model in a dense multi-person scene according to claim 1, wherein the Focal Loss function is adopted in the separated feature training, as in equation (2); p is the probability that the model predicts the foreground, with values from 0 to 1; y takes values 1 and -1; α_t plays a modulating role, reducing the attention paid to unimportant sample features and increasing the attention paid to challenging sample features; it is a fixed parameter in the range 0 to 5;
FL(p_t) = -α_t (1 - p_t)^γ log(p_t) (2)
6. The abnormal behavior detection model in a dense multi-person scene according to claim 1, wherein the separated feature training adopts a separated feature encoding strategy to extract skeleton features: first, a vector representing the human body pose is generated by concatenating three vectors: the normalized x-coordinate vector of the joint positions, the normalized y-coordinate vector of the joint positions, and the joint-to-root distance vector; because the distance between the camera and a person differs across regions of each frame, the scale of the joint positions in the image also differs; each joint in the image is described by its abscissa and ordinate, and the original position of the i-th joint is defined as (x_i, y_i); the joint positions of each person detected in each frame are normalized with equation (4), where (x̂_i, ŷ_i) denotes the normalized coordinate position of a joint; since each joint is described by its abscissa and ordinate, the normalized joint position vector contains 2k features corresponding to the k joints; after the coordinate positions of the human skeleton joints are determined, a second component vector is obtained by computing the distance from each of the k joints to the human root joint point O (the centroid, at (x_0, y_0)) with the Euclidean distance equation (5); the joint distance vector contains k features corresponding to the k distances d_1 to d_k:
d_i = √((x_i - x_0)² + (y_i - y_0)²) (5).
7. The abnormal behavior detection model in a dense multi-person scene according to claim 1, wherein the BR-LSTM module performs feature extraction and behavior classification on the skeleton information with a bidirectional LSTM; first, the k coordinates are split into x-coordinate values (x_1, x_2, …, x_k) and y-coordinate values (y_1, y_2, …, y_k), and the distance from each joint to the root node (d_1, d_2, …, d_k) is computed as a third feature component; as consecutive image frames are detected, time-varying x-coordinate, y-coordinate and distance sequences are obtained; the data is then processed into a length and size suitable for LSTM training and fed into three LSTM networks for temporal feature extraction; thereafter, each time a new frame of image data is detected, the new coordinate values are appended to the sequence and the oldest coordinate values are deleted; finally, the classification information of the behavior actions is merged and fed into a fully connected layer, which classifies normal and abnormal behaviors to determine whether a behavior is abnormal.
8. The abnormal behavior detection model in a dense multi-person scene according to claim 7, wherein the LSTM neural unit comprises an input gate i_t, a forget gate f_t, a cell state C_t and an output gate O_t; long- and short-term memory is controlled by the gates and the cell state, and the computation is expressed by equations (6) to (11); in equation (6), the input gate at time t combines the hidden output of the previous time step with the input at time t; in equation (7), the candidate cell state C̃_t at time t is computed from h_{t-1} and x_t, which denote the hidden output of the previous time step and the input at time t; in equation (8), the forget gate controls which information in the previous memory state should be forgotten or retained; in equation (9), the outputs of the input gate and the forget gate are combined to update the cell state at time t; the double-layer LSTM network is then connected end to end, the cells of each LSTM layer are connected in sequence, and the action feature sequences are predicted by forward and backward learning;
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ (6);
$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$ (7);
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ (8);
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ (9);
$O_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (10);
$h_t = O_t * \tanh(C_t)$ (11);
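Formulas (6)-(11) can be transcribed directly; the following NumPy sketch is a minimal single-step check, where weight shapes and initialization are illustrative, not the patent's trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step following formulas (6)-(11).

    W: dict of (hidden, hidden+input) matrices; b: dict of (hidden,) biases.
    """
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ z + b["i"])         # (6) input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])       # (7) candidate cell state
    f_t = sigmoid(W["f"] @ z + b["f"])         # (8) forget gate
    c_t = f_t * c_prev + i_t * c_hat           # (9) cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])         # (10) output gate
    h_t = o_t * np.tanh(c_t)                   # (11) hidden output
    return h_t, c_t

hidden, inp = 64, 17
rng = np.random.default_rng(0)
W = {g: rng.standard_normal((hidden, hidden + inp)) * 0.1 for g in "icfo"}
b = {g: np.zeros(hidden) for g in "icfo"}
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.standard_normal(inp), h, c, W, b)
```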
In the double-layer bidirectional LSTM structure, the forward layer and the backward layer are jointly connected to the output layer, which involves six shared weights $w_1$-$w_6$. Forward propagation of the behavior features is computed in the forward layer along time from time 1 to time t; in the backward layer, the computation runs backward from time t to time 1, and the output of the backward hidden layer at each time step is obtained and saved. The corresponding outputs of the forward and backward layers at each time step are then combined to obtain the final output, as expressed by formulas (12)-(15), where $o'_t$ and $o''_t$ denote the results of the two LSTM layers processing the action feature vector output at the corresponding time, and the remaining symbols are the bias terms;
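In PyTorch terms, a double-layer bidirectional LSTM feeding a shared fully connected output layer can be sketched as follows; the concatenation-based merge of forward and backward outputs stands in for the patent's formulas (12)-(15) and shared weights $w_1$-$w_6$, which are not reproduced here, and the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class BiLSTMHead(nn.Module):
    """Two-layer bidirectional LSTM over one feature stream, with a
    fully connected layer combining forward/backward results."""
    def __init__(self, k=17, hidden=64, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=k, hidden_size=hidden,
                            num_layers=2, bidirectional=True,
                            batch_first=True)
        # Forward output o'_t and backward output o''_t arrive
        # concatenated; the patent instead combines them via w_1-w_6.
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, seq):              # seq: (batch, T, k)
        out, _ = self.lstm(seq)          # (batch, T, 2*hidden)
        return self.fc(out[:, -1])       # classify from the last step

logits = BiLSTMHead()(torch.randn(1, 32, 17))  # (1, num_classes)
```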
9. An abnormal behavior detection method in a dense multi-person scene, characterized by comprising the following steps:
Step 1, video acquisition: record or acquire video data of the dense crowd scene to be analyzed for abnormal behavior;
Step 2, human body detection: using a YOLOv5-based detection framework, add a bounding box to each person in the video and mark the position information of that person in the image;
Step 3, skeleton pose extraction: using the YH-Pose module, which integrates the high-resolution pose estimation network HRNet, calculate and determine the k key skeleton node positions of each person in the video;
Step 4, pose skeleton model generation: according to the skeleton structure of the human body, connect the key skeleton nodes confirmed in the previous step in order to generate a pose skeleton model of the human body;
Step 5, feature fusion: using the input RGB video frame data, the YH-Pose network fuses the human body bounding box with the pose skeleton model, generating fused human body pose information for each frame that contains the two-dimensional coordinates of the k joints, the bounding box position, and the confidence;
Step 6, data preprocessing: in the behavior classification stage, the BR-LSTM module preprocesses the generated human body pose information, splitting the two-dimensional joint coordinates into separate x and y coordinate sequences and calculating the Euclidean distance from each joint to the root node;
Step 7, behavior feature extraction: the feature extraction part of the BR-LSTM module receives the preprocessed data and extracts the spatio-temporal features of the action through the long short-term memory network;
Step 8, classification and prediction: after data processing and behavior feature extraction, a final classification is computed with a fully connected layer, and abnormal and normal behaviors are identified and predicted by the trained model;
Step 9, result output: mark the abnormal behaviors in the video according to the classification and prediction results for further analysis and processing (an end-to-end sketch of steps 2-8 follows below).
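The end-to-end sketch referenced in step 9 follows (Python, single tracked person for brevity). It reuses the hypothetical helpers `encode_skeleton`, `push_frame`, and `ready_batches` from the earlier sketches, and `detect_persons`, `extract_pose`, and `classify` are assumed callables standing in for the YOLOv5, YH-Pose, and trained BR-LSTM components, which are not reproduced here:

```python
def analyze_stream(frames, detect_persons, extract_pose, classify):
    """Hedged pipeline sketch for claim 9, steps 2-9.

    detect_persons(frame) -> list of bounding boxes      (step 2)
    extract_pose(frame, bbox) -> (k, 2) joint array      (steps 3-5)
    classify(batches) -> "normal" or "abnormal"          (steps 7-8)
    """
    alerts = []
    for t, frame in enumerate(frames):
        boxes = detect_persons(frame)                 # step 2
        if not boxes:
            continue
        joints = extract_pose(frame, boxes[0])        # steps 3-5
        feats = encode_skeleton(joints, boxes[0])     # earlier sketch
        k = joints.shape[0]
        norm = feats[: 2 * k].reshape(k, 2)
        push_frame(norm, feats[2 * k:])               # step 6 preprocessing
        batches = ready_batches()
        if batches is not None and classify(batches) == "abnormal":
            alerts.append(t)                          # step 9: mark frame
    return alerts
```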
CN202311572461.1A 2023-11-23 2023-11-23 Abnormal behavior detection model and detection method in dense multi-person scene Pending CN117541994A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311572461.1A CN117541994A (en) 2023-11-23 2023-11-23 Abnormal behavior detection model and detection method in dense multi-person scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311572461.1A CN117541994A (en) 2023-11-23 2023-11-23 Abnormal behavior detection model and detection method in dense multi-person scene

Publications (1)

Publication Number Publication Date
CN117541994A true CN117541994A (en) 2024-02-09

Family

ID=89785729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311572461.1A Pending CN117541994A (en) 2023-11-23 2023-11-23 Abnormal behavior detection model and detection method in dense multi-person scene

Country Status (1)

Country Link
CN (1) CN117541994A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117934522A (en) * 2024-03-25 2024-04-26 江西师范大学 Two-stage coronary artery image segmentation method, system and computer equipment
CN118298514A (en) * 2024-06-06 2024-07-05 华东交通大学 Deep learning-based worker dangerous action recognition method and system
CN118506443A (en) * 2024-05-06 2024-08-16 山东千人考试服务有限公司 Examinee abnormal behavior recognition method based on human body posture assessment

Similar Documents

Publication Publication Date Title
US11195051B2 (en) Method for person re-identification based on deep model with multi-loss fusion training strategy
CN106897670B (en) Express violence sorting identification method based on computer vision
Hu Design and implementation of abnormal behavior detection based on deep intelligent analysis algorithms in massive video surveillance
Chen et al. End-to-end learning of object motion estimation from retinal events for event-based object tracking
CN117541994A (en) Abnormal behavior detection model and detection method in dense multi-person scene
Rahman et al. Fast action recognition using negative space features
Li et al. Gait recognition via GEI subspace projections and collaborative representation classification
CN112818175B (en) Factory staff searching method and training method of staff identification model
Xia et al. Face occlusion detection using deep convolutional neural networks
JP2022082493A (en) Pedestrian re-identification method for random shielding recovery based on noise channel
Chen et al. A multi-scale fusion convolutional neural network for face detection
CN112541403A (en) Indoor personnel falling detection method utilizing infrared camera
Lu et al. Online visual tracking
Fu et al. [Retracted] Sports Action Recognition Based on Deep Learning and Clustering Extraction Algorithm
Pang et al. Dance video motion recognition based on computer vision and image processing
Li et al. Spatial and temporal information fusion for human action recognition via Center Boundary Balancing Multimodal Classifier
Tang et al. Using a multilearner to fuse multimodal features for human action recognition
Ti et al. GenReGait: Gender Recognition using Gait Features
Liu et al. Weighted sequence loss based spatial-temporal deep learning framework for human body orientation estimation
Hao et al. Human behavior analysis based on attention mechanism and LSTM neural network
Zhao et al. Abnormal human behavior recognition based on image processing technology
Zhu et al. Crowd tracking by group structure evolution
Pan A method of key posture detection and motion recognition in sports based on Deep Learning
Alkahla et al. Face identification in a video file based on hybrid intelligence technique-review
Chao et al. Key spatio-temporal energy information mapping for action recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination