CN117541994A - Abnormal behavior detection model and detection method in dense multi-person scene - Google Patents

Abnormal behavior detection model and detection method in dense multi-person scene

Info

Publication number
CN117541994A
Authority
CN
China
Prior art keywords
skeleton
joint
feature
layer
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311572461.1A
Other languages
Chinese (zh)
Inventor
王西超
董祥庆
孙伯潜
赵淑阳
李保江
王海燕
陈国初
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Dianji University
Original Assignee
Shanghai Dianji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Dianji University
Priority to CN202311572461.1A
Publication of CN117541994A
Legal status: Pending


Classifications

    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G06N 3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06V 10/764 Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an abnormal behavior detection model and detection method for dense multi-person scenes. The skeleton pose extraction module extracts the skeleton joint information of every person under video surveillance. The behavior classification module models long-term temporal dynamics within and between actions and predicts action labels from video sequences; a fully connected layer in this module classifies normal and abnormal behaviors. To improve the accuracy and precision of the object detection module, the invention replaces the ordinary convolution layers of YOLOv5 with serpentine convolution layers and introduces the ASFF feature fusion module, which effectively fuses the features extracted by the network across multiple scales. The invention also adopts a separated feature training strategy and a Euclidean distance computation to address the scarcity of abnormal behavior data samples and the underfitting of the deep neural network.

Description

An abnormal behavior detection model and detection method in dense multi-person scenes

Technical field

The invention belongs to the technical field of computer vision. Specifically, it relates to a detection model and method that combine image processing, object detection, behavior recognition and pattern recognition.

Background art

Behavior recognition aims to locate, in time or space, the behaviors of interest within long videos, and is one of the most basic video understanding tasks. In current behavior recognition research, detecting abnormal behavior in video sequences is crucial. In particular, abnormal behavior by multiple people in a crowd under public surveillance video is a major potential threat; if it is not discovered in time, it can seriously endanger people's lives and property. The diversity and rapid variation of human behavior make abnormal behavior difficult to detect. Although some classic algorithms have achieved good results in single-person behavior recognition on public datasets, their accuracy in detecting multi-person abnormal behavior is very low. At the same time, the current mainstream algorithms with better performance require heavy computation and have high model complexity, which makes them difficult to deploy in practice and to industrialize.

Most early research on abnormal behavior recognition used hand-crafted feature descriptors to represent pedestrian appearance features and the motion features extracted from the corresponding feature information. These methods use traditional machine learning algorithms for human action recognition. Hand-crafted feature descriptors include trajectories, histograms of oriented gradients (HOG), histograms of optical flow (HOF), mixtures of dynamic textures and other low-level visual features. Traditional behavior detection methods rely on a simple understanding of image features. However, with the rapid development of deep learning, researchers have begun to explore abnormal behavior recognition based on deep learning and have achieved a series of results. Deep learning methods can extract high-level features of human behavior in videos and thereby distinguish normal from abnormal behavior more effectively.

Human skeleton pose extraction uses deep learning methods to detect and track the coordinate positions of the key joints of the human body (skeleton joints) in images or videos. Unlike behavior recognition methods based on image features, behavior recognition based on skeleton pose extraction uses human pose information for action classification. It captures behavioral features through skeleton pose extraction and classifies actions with machine learning or deep learning methods. Human skeleton pose extraction allows a fine-grained analysis of body posture and joint motion. This enables behavior recognition to capture subtle motion changes and detailed information, providing more accurate and fine-grained behavior analysis results.

Traditional algorithms are not stable in complex scenes and are easily affected by challenging factors such as lighting changes, viewpoint differences, deformation and occlusion. Normal behavior is usually more common than abnormal behavior, which leads to class imbalance in the dataset. This biases the model towards the normal behaviors that are easier to recognise and lowers its accuracy on abnormal behaviors. At the same time, the number of people may change across video frames, and this variation degrades the adaptability of most models.

Current mainstream methods usually do not directly consider temporal or timing information, so for abnormal behavior recognition tasks that involve time series they may fail to fully exploit the temporal relationships between actions. Abnormal behavior is often related to specific patterns or trends in the time series. If an algorithm ignores temporal information, it may be unable to accurately capture the evolution and temporal relationships of abnormal behaviors, resulting in false positives or false negatives. In dense multi-person scenes, complex temporal relationships also exist between the actions of different people. Ignoring these relationships can lead to oversimplified judgements of abnormal behavior and an inability to accurately distinguish normal from abnormal behavior.

Problems with the existing technology: (1) Class imbalance in the dataset: normal behaviors are usually more common than abnormal behaviors, so current models are biased towards the normal behaviors that are easier to recognise, resulting in lower accuracy on abnormal behaviors.

(2) Reduced adaptability as the number of people changes: in dense scenes, the number of people in a video frame may change, which degrades the adaptability of most models. The model of the present invention aims to solve this problem and adapts to changes in the number of people across frames, thereby improving robustness and accuracy.

(3) Incorrect connection of joints across people: when performing behavior recognition in dense scenes, the joints of different people can be connected incorrectly.

Summary of the invention

The present invention provides an abnormal behavior detection model and detection method for dense multi-person scenes, and aims to propose an abnormal behavior detection model for such scenes that remedies the shortcomings of current mainstream algorithms in this setting. The model aims to improve the accuracy of abnormal behavior detection in dense scenes.

The solution of the present invention to its technical problem is an abnormal behavior detection model for dense multi-person scenes, comprising a skeleton pose extraction module YH-Pose and a behavior classification module BR-LSTM. The YH-Pose module first extracts behavioral feature information from the raw video data, and this behavioral feature information is then fed into the BR-LSTM module to complete action classification. The YH-Pose module is a top-down skeleton pose extraction module that contains a human detector and a skeleton pose extractor. The human detector is implemented on the basis of YOLOv5: first, a bounding box is added for every person detected by YOLOv5 in the scene, and the bounding box contains the position of the person in the image. The YH-Pose module incorporates the high-resolution skeleton pose extraction network HRNet, which predicts the coordinates of the n skeleton joints of each person in the image; these skeleton joints are then connected in order, according to the structure of the human skeleton, to form a human pose skeleton model. Using the input RGB video stream, the YH-Pose module combines the human bounding boxes with the pose skeleton network; the combined output is two-dimensional human pose information, which includes the two-dimensional coordinates of the k joints of each person in each frame together with the coordinate position and confidence score of the person's bounding box. The BR-LSTM module predicts action labels from the video sequence and completes the behavior classification task. The model introduces a separated feature training strategy: in the data preprocessing stage, BR-LSTM splits the two-dimensional joint coordinate data into an x-coordinate sequence and a y-coordinate sequence and computes the Euclidean distance from each joint to the root joint to augment the data samples. The module comprises a data preprocessing module and a bidirectional behavior classification network composed of six LSTM units. The data preprocessing module maps the action features into feature vectors through a linear layer and feeds these feature vectors into a forward LSTM layer and a backward LSTM layer. Within each time step, the LSTM computes the hidden state and output of the current time step from the current input and the hidden state of the previous time step. Finally, a fully connected layer is used to further classify normal and abnormal behaviors.

The human detector extracts detailed skeleton data from the captured RGB video data, including the skeleton joint positions and the temporal information of the joints across frames. To first detect the people in the image and then detect their skeleton poses, the YH-Pose module combines the high-resolution skeleton pose extraction network HRNet with the YOLOv5 object detection framework. The prediction head of YOLOv5 is improved by introducing several ASFF modules, which perform multi-scale weighted fusion of the feature information. The input image passes through a feature pyramid network whose layers hold feature maps at different scales; the ASFF modules fully fuse the feature information across layers and merge it into a new feature representation by weighted averaging. In a traditional convolution operation, the kernel scans the input feature map in a fixed order, from left to right and from top to bottom; serpentine convolution instead uses a non-linear scanning pattern in which the scanning path of the kernel is designed as a snake-like or curved trajectory, changing the scanning order. Serpentine convolution can also reduce the number of model parameters. In traditional convolution, adjacent kernels usually have similar weights, so the parameter count can be reduced by sharing parameters; in serpentine convolution, because of the non-linear scanning path, the weights of adjacent kernels often differ, so parameters cannot be shared directly. Each kernel therefore needs its own parameters, but this also increases the expressive power of the model. Accordingly, in the YOLOv5 backbone the ordinary convolution layers are replaced with serpentine convolution layers, so that each kernel sees a larger range of the input and the model's perception of global features is improved.

α3, β3 and γ3 are the weighting factors of the third layer, and x1→3, x2→3, x3→3 are the feature tensors of each layer rescaled to the third level. As in formula (1), the fused feature produced by the third-layer ASFF module is ASFF3:

ASFF3 = α3·x1→3 + β3·x2→3 + γ3·x3→3  (1)

HRNet is used as the model for human skeleton pose prediction. When the network outputs keypoints for N classes, it first outputs an N-dimensional feature map; Gaussian kernels are then constructed from the keypoints on the N-dimensional annotated feature map to generate the heatmap encoding of the human skeleton keypoints. The (x, y) coordinates and confidence scores of the n skeleton joints are then obtained from the skeleton keypoint heatmaps.
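As an informal illustration only, the sketch below shows how (x, y) joint coordinates and confidence scores might be read off HRNet-style keypoint heatmaps by taking the per-channel peak; the function name, tensor shapes and PyTorch usage are assumptions of this sketch and are not taken from the patent.

```python
import torch

def decode_heatmaps(heatmaps: torch.Tensor):
    """Read joint coordinates and confidence scores from keypoint heatmaps.

    heatmaps: (N, H, W) tensor, one channel per skeleton joint.
    Returns (coords, scores): coords is (N, 2) with (x, y) pixel positions,
    scores is (N,) with the peak value of each heatmap channel.
    """
    n, h, w = heatmaps.shape
    flat = heatmaps.view(n, -1)
    scores, idx = flat.max(dim=1)          # peak value and flat index per joint
    ys = (idx // w).float()
    xs = (idx % w).float()
    coords = torch.stack([xs, ys], dim=1)  # (x, y) per joint
    return coords, scores
```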

The Focal Loss function is used in the separated feature training, as in formula (2). p is the probability that the model predicts the sample belongs to the foreground, taking values between 0 and 1; y takes the values 1 and -1; αt plays a modulating role, reducing the attention paid to unimportant sample features and increasing the attention paid to challenging sample features, and is a fixed parameter in the range 0 to 5;

FL(pt) = -αt(1 - pt)^γ · log(pt)  (2)

The separated feature training adopts a separated feature encoding strategy to extract skeleton features. First, a vector representing the human pose is generated; this vector is a concatenation of three vectors: the normalised x-coordinate vector of the joint positions, the normalised y-coordinate vector of the joint positions, and the vector of joint-to-root-joint distances. In each frame, people at different positions are at different distances from the camera, so the scale of the joint positions in the image also differs. Each joint in the image is described by its abscissa and ordinate, and the original position of the i-th joint is defined as (xi, yi). Formula (4) is used to normalise the joint positions of every person detected in each frame and gives the normalised coordinate position of each joint. Since every joint is described by its abscissa and ordinate, the normalised joint position vector contains 2k features corresponding to the k joints.

After the coordinate positions of the skeleton joints are determined, the second component vector is obtained by computing the distance from each of the k joints to the root joint O of the body (the centre of mass). The Euclidean distance from each joint to the root joint (x0, y0) is given by formula (5), di = √((xi - x0)² + (yi - y0)²), and the joint distance vector contains k features corresponding to the k distances d1 to dk.

The BR-LSTM module uses a bidirectional LSTM to extract features from the skeleton information and classify behavior. First, the k coordinates are split into x-coordinate values (x1, x2, ..., xk) and y-coordinate values (y1, y2, ..., yk), and the distance from each joint to the root joint (d1, d2, ..., dk) is computed as the third feature component. When consecutive image frames are detected, time-varying x-coordinate, y-coordinate and distance sequences are obtained. Next, the data are processed into a length and size suitable for LSTM training and fed into three LSTM networks for temporal feature extraction; whenever a new frame of image data is detected, the new coordinate values are appended to the sequences and the oldest ones are removed. Finally, the classification information of the actions is merged and fed into the fully connected layer, which classifies normal and abnormal behavior to determine whether a behavior is abnormal.

An LSTM unit comprises an input gate it, a forget gate ft, a cell state Ct and an output gate Ot. The long- and short-term memory is controlled through the gates and the cell state, and the computation is expressed by formulas (6) to (11). In formula (6), the input-gate information at time t is a combination of the hidden output of the previous time step and the input at time t. In formula (7), the candidate cell state at time t is computed from ht-1 and xt, which denote the hidden output of the previous time step and the input at time t respectively. In formula (8), the forget gate controls which information from the memory state of the previous time step should be forgotten or retained. In formula (9), the outputs of the input gate and the forget gate are combined to update the cell state at time t. The two LSTM layers are then connected end to end, and the cells of each LSTM layer are connected in turn, to predict the forward-learned action feature sequence and the backward-learned action feature sequence:

it = σ(Wi·[ht-1, xt] + bi)  (6);

C̃t = tanh(Wc·[ht-1, xt] + bc)  (7);

ft = σ(Wf·[ht-1, xt] + bf)  (8);

Ct = ft * Ct-1 + it * C̃t  (9);

Ot = σ(Wo·[ht-1, xt] + bo)  (10);

ht = Ot * tanh(Ct)  (11);

In the structure of the two-layer bidirectional LSTM, the forward layer and the backward layer are jointly connected to the output layer, which contains six shared weights w1 to w6. The forward propagation of the behavioral features is computed in the forward layer from time 1 to time t; in the backward layer, the computation runs backwards from time t to time 1 to obtain and store the output of the backward hidden layer at every time step. The corresponding outputs of the forward and backward layers at each time step are merged to obtain the final output, as given by formulas (12) to (15), in which the remaining symbols are the bias terms and o′t, o″t are the results of the two LSTM layers processing the action feature vectors output at the corresponding time steps.

An abnormal behavior detection method for dense multi-person scenes, characterised by comprising the following steps. Step 1, video acquisition: for the dense crowd scene to be analysed for abnormal behavior, record or obtain the relevant video data. Step 2, human detection: based on the YOLOv5 detection framework, add a bounding box for every person in the video and mark the position of the person in the image. Step 3, skeleton pose extraction: use the YH-Pose module, which incorporates the high-resolution skeleton pose extraction network HRNet, to compute and determine the positions of the k key skeleton nodes of each person in the video. Step 4, pose skeleton model generation: according to the skeleton structure of the human body, connect the key skeleton nodes determined in the previous step in order to generate the human pose skeleton model. Step 5, feature fusion: the YH-Pose network uses the input RGB video frames to fuse the human bounding boxes with the pose skeleton model, generating human pose information that contains the fused two-dimensional coordinates of the k joints of each person in each frame together with the bounding box position and confidence. Step 6, data preprocessing: in the behavior classification stage, the BR-LSTM module preprocesses the generated human pose information, including splitting the two-dimensional joint coordinates into independent x- and y-coordinate sequences and computing the Euclidean distance from each joint to the root joint. Step 7, behavioral feature extraction: the feature extraction part of the BR-LSTM module takes the preprocessed data and extracts the spatio-temporal features of the actions through the long short-term memory network. Step 8, classification and prediction: after data processing and behavioral feature extraction, a fully connected layer performs the final classification, and abnormal and normal behaviors are recognised and predicted by the trained model. Step 9, output: based on the classification and prediction results, the abnormal behaviors in the video are flagged for further analysis and processing.
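The following skeleton is a non-authoritative sketch of how the nine steps above might be chained together; the detector, pose extractor, preprocessing and classifier callables are hypothetical placeholders for the YOLOv5-based human detector, the YH-Pose extractor and the BR-LSTM classifier, person association by detection index is a simplification, and the default window length merely mirrors the 50-frame samples mentioned in the training settings.

```python
def detect_abnormal_behaviors(video_frames, detector, pose_extractor,
                              preprocess, classifier, window_size=50):
    """Hypothetical orchestration of the nine detection steps."""
    results = []
    sequences = {}                                # per-person sliding windows of pose features
    for t, frame in enumerate(video_frames):      # step 1: video frames as input
        boxes = detector(frame)                   # step 2: bounding box per detected person
        for pid, box in enumerate(boxes):         # NOTE: no tracking, index stands in for identity
            joints = pose_extractor(frame, box)   # steps 3-5: k joint coordinates per person
            features = preprocess(joints)         # step 6: x/y split and joint-to-root distances
            window = sequences.setdefault(pid, [])
            window.append(features)
            if len(window) > window_size:
                window.pop(0)                     # keep a fixed-length temporal window
            if len(window) == window_size:
                label = classifier(window)        # steps 7-8: temporal features + classification
                results.append((t, pid, box, label))   # step 9: flag abnormal behavior
    return results
```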

Beneficial effects of the invention: by extending the networks in the framework with a variety of improved modules and strategies, abnormal human behavior in dense scenes can be detected accurately.

(1) The key to the object detection module is the combination of HRNet with YOLOv5: HRNet extracts high-resolution pose features, while the improved YOLOv5 framework performs object detection. By introducing the ASFF feature fusion module, the features of the two networks can be fused effectively, improving the accuracy and precision of object detection.

(2) During training it was noted that the lack of sufficient abnormal behavior data samples may lead to underfitting of the model. To solve this problem, a separated feature training strategy is adopted. Specifically, the fused two-dimensional coordinate data are split into an x-coordinate sequence and a y-coordinate sequence, which are then fed into the network separately for training. In this way the existing normal behavior data can be exploited more fully to train the model, improving its performance and generalisation ability. To enhance the model's sensitivity to abnormal behavior, the Euclidean distance from each two-dimensional coordinate to the root joint is computed. The purpose of this is to introduce more information about the spatial relationships between joints, making the model better able to detect abnormal behavior. By introducing the separated feature training strategy and the Euclidean distance computation, the problems of scarce abnormal behavior samples and of deep neural network underfitting can be addressed effectively. These improvements raise the performance and accuracy of the model in the abnormal behavior detection task.

Brief description of the drawings

Figure 1 is the overall framework of the technical solution of the abnormal behavior recognition model;

Figure 2 is a visual comparison of the skeleton joint connections of the two models;

Figure 3 is a flow chart of the multi-scale spatial feature fusion strategy;

Figure 4 shows the representation of action features from each joint to the root joint;

Figure 5 shows the internal network structure of an LSTM unit;

Figure 6 is a detailed structural diagram of the two-layer bidirectional LSTM.

Detailed description of the embodiments

The present invention studies an abnormal behavior detection model for dense multi-person scenes, composed of two modular networks that respectively implement skeleton pose extraction and behavior classification. The skeleton pose extraction network is a top-down skeleton pose extraction module (YH-Pose) that extracts 17 skeleton joints (when k is 17) for every person under video surveillance. The module describes a person's motion by analysing the changes of the skeleton joints over the time series. The method combines the high-resolution skeleton pose extraction network (HRNet) with an improved YOLOv5 object detection framework and introduces the feature fusion module ASFF to improve the detection accuracy of YOLOv5. At the same time, it solves the problem of disconnected skeleton joints that occurs when the HRNet network performs skeleton pose extraction.

The behavior classification network is the BR-LSTM module, which models long-term temporal dynamics within and between actions and uses video sequences of roughly two seconds to predict action labels. It comprises a data preprocessing module and a bidirectional behavior classification network with six LSTM units. Because abnormal behavior samples are scarce and deep neural networks are prone to underfitting during training, a separated feature training strategy is proposed: the BR-LSTM network augments the data samples by splitting the fused two-dimensional coordinate data into an x-coordinate sequence and a y-coordinate sequence and by computing the Euclidean distance from each two-dimensional coordinate to the root joint. The data preprocessing module encodes the two-dimensional human joint coordinates containing the action features in a separated manner. It computes the Euclidean distance from every joint to the root joint, generating feature vectors for the different actions. The data preprocessing module maps the detailed action features into action feature vectors through linear layers and feeds these vectors into the forward LSTM layer and the backward LSTM layer. Within each time step, the forward LSTM computes the hidden state and output of the current time step from the current input and the hidden state of the previous time step. The backward LSTM computes the hidden state and output of the current time step from the current input and the hidden state of the subsequent time step. The hidden states of the forward and backward LSTMs are concatenated at each time step to form a richer representation that preserves both past and future context. Finally, a fully connected layer is added after the bidirectional LSTM to further classify normal and abnormal behaviors.

Specifically, the overall block diagram of the intelligent abnormal behavior recognition framework proposed by the present invention is shown in Figure 1. The framework integrates the skeleton pose extraction module (YH-Pose) and the behavior classification module (BR-LSTM). YH-Pose contains a human detection module and a skeleton pose extraction module. The present invention is further described below with reference to the accompanying drawings and embodiments.

I. Overall technical solution:

Our overall framework architecture integrates the skeleton pose extraction module (YH-Pose) and the behavior classification module (BR-LSTM). The object detector in the skeleton pose extraction module is designed on the basis of YOLOv5 and adds a bounding box for every detected person in the scene; the bounding box contains the position of the person in the image. The high-resolution skeleton pose extraction network (HRNet) is improved by replacing the mean squared error (MSE) loss with the Focal Loss function, which makes the model converge faster. The joint detection module incorporates the improved high-resolution skeleton pose extraction network (HRNet) and is responsible for estimating the coordinates of the 17 skeleton joints of each person in the image. The human pose skeleton model is a structured model used to represent and capture human poses; it models the structure of the human skeletal system and defines the connections and constraints between joints. The module connects the joints in order, according to the structure of the human skeleton, to form a distinctive human pose skeleton model. The YH-Pose module processes the input RGB video frames and outputs the human bounding boxes. The two-dimensional human pose with bounding box includes the two-dimensional coordinates of the 17 joints of each person in each frame, together with the coordinate position and confidence score of the person's bounding box.

BR-LSTM is a bidirectional extended long short-term memory network that can extract spatio-temporal feature information. It integrates a data preprocessing module that takes the pose information as input, an action feature extraction module composed of six bidirectionally linked LSTM units, and a fully connected (FC) layer. The preprocessing module of the BR-LSTM network splits the two-dimensional coordinates of the 17 skeleton joints into independent sequences of xi and yi coordinates and computes the Euclidean distance di (i = 1, 2, ..., 17) from each skeleton joint to the root joint. The feature extraction module extracts the spatio-temporal features of the human pose. A dropout strategy is introduced after feature fusion to avoid overfitting.

II. Human body detector module:

Detailed skeleton data, including the skeleton joint positions and the temporal information of the joints across frames, are extracted from the captured RGB video of human behavior. The YOLOv5 algorithm detects the human bodies in the target area. Four models are available in YOLOv5: YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x. According to the YOLOv5 documentation, the YOLOv5s model has the smallest depth, the narrowest feature maps, the lowest model complexity and the fastest detection speed. When applied to object detection, the YOLOv5s network model has relatively low detection accuracy and cannot extract multi-scale features of the targets in the image, so we improve the YOLOv5s network model by introducing several ASFF modules before the YOLOv5s prediction head and replacing the ordinary convolution layers with serpentine convolution layers, which raises the accuracy without reducing the detection speed. Convolutional neural networks (CNNs) require input images of fixed size; because image sizes differ, traditional cropping methods may lose useful information and cause unnecessary accuracy loss, and repeated convolution over candidate regions leads to redundant computation. To solve the problem of inconsistent fusion of scale features, several ASFF modules are introduced before the YOLOv5 prediction head to perform weighted fusion of the feature information. As shown in Figure 3, the input image passes through a feature pyramid network with feature maps at different levels. Without the ASFF module, each level can only output its own feature prediction results; the ASFF module fully fuses the feature information across levels by taking a weighted average of the features of the different levels and merging them into a new feature representation. α3, β3 and γ3 are the weighting factors of the third level, and x1→3, x2→3, x3→3 are the feature tensors of each level. In a traditional convolution operation, the kernel scans the input feature map in a fixed order, from left to right and from top to bottom; serpentine convolution instead uses a non-linear scanning pattern in which the scanning path of the kernel is designed as a snake-like or curved trajectory, changing the scanning order. Serpentine convolution can also reduce the number of model parameters. In traditional convolution, adjacent kernels usually have similar weights, so the parameter count can be reduced by sharing parameters; in serpentine convolution, because of the non-linear scanning path, the weights of adjacent kernels often differ, so parameters cannot be shared directly. Each kernel therefore needs its own parameters, but this also increases the expressive power of the model. Accordingly, in the YOLOv5 backbone the ordinary convolution layers are replaced with serpentine convolution layers, so that each kernel sees a larger range of the input and the model's perception of global features is improved.

ASFF3 = α3·x1→3 + β3·x2→3 + γ3·x3→3  (1)
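A minimal PyTorch sketch of the weighted fusion in formula (1) is given below; the use of a 1x1 convolution followed by a softmax to produce the spatial weight maps α3, β3 and γ3 is an assumption of this sketch, since the patent only states that the three rescaled feature maps are merged by weighted averaging.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFLevel3(nn.Module):
    """Adaptively weighted fusion of three feature maps already resized to the
    level-3 resolution, following Eq. (1): alpha*x1->3 + beta*x2->3 + gamma*x3->3."""

    def __init__(self, channels: int):
        super().__init__()
        # one scalar weight map per input level, normalised with softmax (assumed)
        self.weight_conv = nn.Conv2d(channels * 3, 3, kernel_size=1)

    def forward(self, x1_to_3, x2_to_3, x3_to_3):
        w = self.weight_conv(torch.cat([x1_to_3, x2_to_3, x3_to_3], dim=1))
        w = F.softmax(w, dim=1)                      # alpha, beta, gamma sum to 1 per pixel
        alpha, beta, gamma = w[:, 0:1], w[:, 1:2], w[:, 2:3]
        return alpha * x1_to_3 + beta * x2_to_3 + gamma * x3_to_3
```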

The current mainstream method of two-dimensional human skeleton pose extraction is keypoint detection. It is mainly based on two tasks: joint position regression and joint heatmap estimation. Among the current mainstream skeleton pose extraction networks, HRNet achieves good detection results: the entire network maintains a high resolution throughout, instead of recovering resolution through a low-to-high process, and therefore yields heatmap predictions with higher spatial accuracy. The YH-Pose scheme therefore uses HRNet as the framework for skeleton pose extraction. When the network outputs keypoints for N classes, it outputs N-dimensional feature maps; Gaussian kernels are constructed from the keypoints on the N-dimensional annotated feature maps, and the human skeleton keypoints are detected, giving the (x, y) coordinates and confidence scores of the 17 skeleton joints in the image. During training, the Focal Loss function can dynamically reduce the weight of the joints that have little influence on the behavior, so that attention is focused more quickly on the joints that influence the behavior strongly. This approach effectively handles imbalanced data samples and unevenly distributed joint weights, and improves the performance of the model in the abnormal behavior recognition task. The mean squared error (MSE) loss function is very sensitive to outliers: because of the squared term, MSE amplifies the influence of outliers, causing the model to focus excessively on them and to neglect the importance of other data points, which can make the model unstable in the presence of outliers. The Focal Loss function is therefore used in place of the original mean squared error (MSE) loss. The Focal Loss function is given in formula (2), where p ranges from 0 to 1 and pt is the probability that the model predicts the sample belongs to the foreground; y takes the values 1 and -1; αt is an introduced weighting factor that acts as a modulating factor, reducing the loss contribution of unimportant samples and thereby increasing the loss proportion of challenging samples; it is a fixed parameter whose range is 0 to 5.

FL(pt) = -αt(1 - pt)^γ · log(pt)  (2)
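A small sketch of formula (2) follows; the default values of αt and γ used here are conventional choices and are not specified by the patent.

```python
import torch

def focal_loss(p: torch.Tensor, y: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss following Eq. (2). p is the predicted foreground
    probability in (0, 1); y is +1 for foreground and -1 for background."""
    p_t = torch.where(y == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = torch.where(y == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1.0 - alpha))
    # clamp avoids log(0) for numerically saturated predictions
    return (-alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))).mean()
```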

III. Separated feature encoding strategy:

Skeleton feature extraction is the most critical task in the whole process. In this task, a vector representing the human pose is generated; the vector is a concatenation of three vectors: the normalised x-coordinate vector of the joint positions, the normalised y-coordinate vector of the joint positions, and the vector of joint-to-root-joint distances. In each frame, people at different positions are at different distances from the camera, so the scale of the joint positions in the image also differs. Each joint in the image is described by its abscissa and ordinate, and the original position of the i-th joint is defined as (xi, yi). This method, given in formula (4), normalises the joint positions of every person detected in each frame and yields the normalised coordinate position of each joint. Since every joint is described by its abscissa and ordinate, the normalised joint position vector contains 34 features corresponding to the 17 joints.

As shown in Figure 4, after the coordinate positions of the skeleton joints are determined, the second component vector is obtained by computing the distance from each of the 17 joints to the root joint O of the body (the centre of mass). The Euclidean distance from each joint to the root joint (x0, y0) is computed with formula (5). The joint distance vector contains 17 features, corresponding to the 17 distances d1 to d17.
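The construction of the separated pose feature vector (normalised x coordinates, normalised y coordinates, and the joint-to-root distances of formula (5)) can be sketched as follows; which joint serves as the root joint O and the exact normalisation of formula (4) are assumptions of this sketch.

```python
import numpy as np

def pose_feature_vector(joints: np.ndarray, img_w: int, img_h: int) -> np.ndarray:
    """Build the separated pose feature vector from k joint positions.

    joints: (k, 2) array of (x, y) pixel coordinates; joints[0] is taken as the
    root joint O here, and image-size normalisation stands in for Eq. (4).
    Returns a vector of length 3k: normalised x, normalised y, and the
    Euclidean distance of every joint to the root joint (Eq. (5)).
    """
    xs = joints[:, 0] / img_w                        # scale-normalised x coordinates
    ys = joints[:, 1] / img_h                        # scale-normalised y coordinates
    root = joints[0]                                 # root joint (x0, y0)
    dists = np.linalg.norm(joints - root, axis=1)    # d_i = sqrt((x_i-x0)^2 + (y_i-y0)^2)
    return np.concatenate([xs, ys, dists])
```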

IV. Two-layer bidirectional BR-LSTM module:

The BR-LSTM module uses a bidirectional LSTM to extract features from the skeleton information and classify behavior. First, the 17 coordinates are split into x-coordinate values (x1, x2, ..., x17) and y-coordinate values (y1, y2, ..., y17), and the distance from each joint to the root joint (d1, d2, ..., d17) is computed as the third feature component. When consecutive image frames are detected, time-varying x-coordinate, y-coordinate and distance sequences are obtained. Next, the data are processed into a length and size suitable for LSTM training and fed into three LSTM networks for temporal feature extraction; whenever a new frame of image data is detected, the new coordinate values are appended to the sequences and the oldest ones are removed. Finally, the classification information of the actions is merged and fed into the fully connected layer, which classifies normal and abnormal behavior to determine whether a behavior is abnormal. Figure 5 shows the structure of an LSTM neuron. An LSTM unit comprises an input gate it, a forget gate ft, a cell state Ct and an output gate Ot. The long- and short-term memory is controlled through the gates and the cell state, and the computation can be expressed by formulas (6) to (11) below. In formula (6), the input-gate information at time t is a combination of the hidden output of the previous time step and the input at time t. In formula (7), the candidate cell state at time t is computed from ht-1 and xt, which denote the hidden output of the previous time step and the input at time t respectively. In formula (8), the forget gate controls which information from the memory state of the previous time step should be forgotten or retained. In formula (9), the outputs of the input gate and the forget gate are combined to update the cell state at time t. The two LSTM layers are then connected end to end, and the cells of each LSTM layer are connected in turn, to predict the forward-learned action feature sequence and the backward-learned action feature sequence.

it = σ(Wi·[ht-1, xt] + bi)  (6)

C̃t = tanh(Wc·[ht-1, xt] + bc)  (7)

ft = σ(Wf·[ht-1, xt] + bf)  (8)

Ct = ft * Ct-1 + it * C̃t  (9)

Ot = σ(Wo·[ht-1, xt] + bo)  (10)

ht = Ot * tanh(Ct)  (11)
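For illustration, the cell computation of formulas (6) to (11) can be written directly as below; in practice torch.nn.LSTMCell implements the same recurrence and would normally be used instead.

```python
import torch
import torch.nn as nn

class LSTMCellEq6to11(nn.Module):
    """A plain LSTM cell written to mirror Eqs. (6)-(11)."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.W_i = nn.Linear(input_size + hidden_size, hidden_size)  # input gate
        self.W_c = nn.Linear(input_size + hidden_size, hidden_size)  # candidate state
        self.W_f = nn.Linear(input_size + hidden_size, hidden_size)  # forget gate
        self.W_o = nn.Linear(input_size + hidden_size, hidden_size)  # output gate

    def forward(self, x_t, h_prev, c_prev):
        z = torch.cat([h_prev, x_t], dim=-1)          # [h_{t-1}, x_t]
        i_t = torch.sigmoid(self.W_i(z))              # Eq. (6)
        c_tilde = torch.tanh(self.W_c(z))             # Eq. (7)
        f_t = torch.sigmoid(self.W_f(z))              # Eq. (8)
        c_t = f_t * c_prev + i_t * c_tilde            # Eq. (9)
        o_t = torch.sigmoid(self.W_o(z))              # Eq. (10)
        h_t = o_t * torch.tanh(c_t)                   # Eq. (11)
        return h_t, c_t
```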

The specific structure of the two-layer bidirectional LSTM is shown in Figure 6, in which the forward layer and the backward layer are jointly connected to the output layer, and the output layer contains six shared weights w1 to w6. The forward propagation of the behavioral features is computed in the forward layer from time 1 to time t. In the backward layer, the computation runs backwards from time t to time 1 to obtain and store the output of the backward hidden layer at every time step. The corresponding outputs of the forward and backward layers at each time step are merged to obtain the final output; the mathematical formulas are (12) to (15), in which the remaining symbols are the bias terms, and o′t, o″t are the results of the two LSTM layers processing the action feature vectors output at the corresponding time steps.
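A rough, non-authoritative sketch of such a two-layer bidirectional LSTM followed by a fully connected classification layer is shown below; the hidden size, the feature dimension and the use of the last time step for classification are assumptions of this sketch rather than values given in the patent.

```python
import torch
import torch.nn as nn

class BRLSTMClassifier(nn.Module):
    """Two-layer bidirectional LSTM with a fully connected classifier head."""

    def __init__(self, feature_dim: int = 51, hidden_size: int = 128,
                 num_classes: int = 2, dropout: float = 0.5):
        super().__init__()
        self.embed = nn.Linear(feature_dim, hidden_size)       # map pose features to vectors
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * hidden_size, num_classes)      # forward + backward states

    def forward(self, x):                   # x: (batch, time, feature_dim)
        h = torch.relu(self.embed(x))
        out, _ = self.lstm(h)               # (batch, time, 2*hidden_size)
        out = self.dropout(out[:, -1, :])   # last time step, both directions
        return self.fc(out)                 # logits for normal vs abnormal
```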

V. Dataset collection:

The public datasets used for training are HMDB51 and NTU RGB+D, together with a self-built Shanghai traffic road dataset. The HMDB51 dataset contains 51 types of actions and 6,849 videos in total, with at least 101 clips per action category at a resolution of 320x320. HMDB51 covers 51 different action categories, such as "brushing hair", "clapping", "running" and "waving", which represent human actions common in daily life. These action categories span different viewpoints, movement speeds, lighting conditions and background distractions, so accurately classifying and recognising the actions is a very challenging task. NTU RGB+D is a widely used skeleton-based human action recognition dataset containing 56,880 skeleton action sequences. It has two evaluation benchmarks, the cross-subject (X-Sub) and cross-view (X-View) settings. For X-Sub, the training and test sets come from two disjoint groups of 20 subjects each. For X-View, the training set contains 37,920 samples and the test set contains 18,960 sequences captured by the cameras. The Shanghai traffic road dataset consists of 100 two-minute videos captured legally by static overhead high-definition cameras on 32 traffic roads in Shanghai, China. The dataset is divided into three parts: 60 training videos, 13 validation videos and 27 test videos. The videos are annotated with fine-grained start and end times for 10 action categories. In this study, five normal behaviors that often occur at traffic zebra crossings were selected: walking, jumping, running, turning and smoking, together with five abnormal behaviors: falling, kicking, hitting, vomiting and looking down at a mobile phone.

VI. Model training:

The equipment used for training was a server with a 12th-generation Intel(R) Core(TM) i9-12900K 3.19 GHz CPU, 64 GB of memory, a 64-bit Windows 10 operating system and an Nvidia 3090 GPU. The program uses Anaconda3 version 5.2.0 as the integrated development environment, and the programming language is Python 3.6.5. The various modular networks designed in this work were built under the PyTorch deep learning framework.

VII. Loss function:

The cross-entropy loss function is used to measure the difference between the action recognition model's predicted output and the ground-truth label. For each sample, the cross-entropy loss is computed by formula (16), where y_true is the ground-truth label vector and y_pred is the model's predicted output vector.

L = −∑(y_true · log(y_pred))  (16).
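A minimal numeric sketch of formula (16), assuming y_true is a one-hot vector and y_pred is a probability vector (for example a softmax output); the example values are illustrative only.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -sum(y_true * log(y_pred)), as in formula (16)."""
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0, 0, 1, 0, 0])                  # one-hot ground-truth label
y_pred = np.array([0.05, 0.05, 0.8, 0.05, 0.05])    # predicted class probabilities
print(cross_entropy(y_true, y_pred))                # ~0.223, i.e. -log(0.8)
```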

VIII. Model training parameter settings:

Each sample in the three datasets is uniformly resampled to 50 frames, and the data are then preprocessed with the statistics- and domain-knowledge-based outlier handling proposed by Zhang et al., in which outliers are identified and then removed, replaced, or smoothed. For the HMDB51 and NTU_RGB+D datasets, the model was trained for 110 epochs with a batch size of 200. The initial learning rate is set to 0.1 and is reduced by a factor of 10 at epochs 80 and 120; the weight decay is set to 5e-4, and a loss value of 0.1 is used as the iteration termination condition. For the self-built traffic road dataset, the initial learning rate is set to 0.001, the number of training iterations is 10,000, the Adam optimizer is used, and the dropout rate of the last FC layer is set to 0.5.
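The learning-rate schedule above can be set up with a standard PyTorch optimizer and scheduler; the sketch below uses a placeholder model, and the choice of SGD with momentum for the public-dataset setting is an assumption, since the patent specifies the schedule but not the optimizer for those datasets.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.LSTM(34, 128, num_layers=2, bidirectional=True)  # placeholder model

# HMDB51 / NTU_RGB+D setting: lr 0.1, reduced 10x at epochs 80 and 120, weight decay 5e-4.
# SGD with momentum is an assumption; only the schedule is stated in the text.
opt_public = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
sched = optim.lr_scheduler.MultiStepLR(opt_public, milestones=[80, 120], gamma=0.1)

# Self-built traffic road dataset setting: Adam with lr 0.001.
opt_traffic = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(110):
    # ... one training pass over batches of size 200 would go here ...
    sched.step()  # update the learning rate once per epoch
```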

IX. Evaluation metrics:

The action classification performance of the framework is measured by two metrics: mean average precision (mAP) and accuracy (Acc). Giga floating-point operations (GFLOPs) and frames per second (FPS) are used to analyze the model's computational complexity and detection speed. Object keypoint similarity (OKS) is the evaluation metric for joint detection and is used to compare the performance of the proposed skeleton pose extraction model against other models.
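For reference, a sketch of object keypoint similarity using the COCO-style definition, OKS = Σ_i exp(−d_i² / (2·s²·k_i²))·δ(v_i>0) / Σ_i δ(v_i>0); the patent names OKS but does not spell out its formula, so this exact form and the per-keypoint constants are assumptions.

```python
import numpy as np

def oks(pred, gt, visible, area, k=None):
    """COCO-style object keypoint similarity between predicted and ground-truth joints.

    pred, gt: (n, 2) arrays of keypoint coordinates
    visible:  (n,) boolean mask of labelled keypoints
    area:     object scale s^2 (e.g., bounding-box area)
    k:        (n,) per-keypoint falloff constants (a uniform value is assumed if omitted)
    """
    n = len(gt)
    k = np.full(n, 0.05) if k is None else k            # assumed uniform constant
    d2 = np.sum((pred - gt) ** 2, axis=1)               # squared distances d_i^2
    e = d2 / (2 * area * k ** 2 + np.finfo(float).eps)
    return np.exp(-e)[visible].mean() if visible.any() else 0.0

gt = np.array([[10., 10.], [20., 15.], [30., 40.]])
pred = gt + np.array([[1., 0.], [0., 2.], [3., 1.]])
print(oks(pred, gt, visible=np.array([True, True, True]), area=900.0))
```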

The above specific embodiments of the present invention are only intended to illustrate or explain the principles of the invention and do not limit it. Therefore, any modifications, equivalent substitutions, improvements, and the like made without departing from the spirit and scope of the invention shall fall within its scope of protection.

Claims (9)

1. An abnormal behavior detection model for dense multi-person scenes, characterized in that it comprises a skeleton pose extraction module YH-Pose and a behavior classification module BR-LSTM. The YH-Pose module first extracts behavioral feature information from the raw video data and then feeds this information into the BR-LSTM module to complete action classification. The YH-Pose module is a top-down skeleton pose extraction module that contains a human detector and a skeleton pose extractor. The human detector is implemented on the basis of YOLOv5: first, a bounding box is added for every person detected by YOLOv5 in the scene, and the bounding box contains the person's position in the image. The YH-Pose module integrates the high-resolution skeleton pose extraction network HRNet, which predicts the coordinates of n skeleton joint points for every person in the image; these joint points are then connected in order according to the structure of the human skeleton to form a human pose skeleton model. Using the input RGB video stream, the YH-Pose module combines the human bounding boxes with the pose skeleton network; the combined output is two-dimensional human pose information, which includes the two-dimensional coordinates of the k joints of each person in every frame as well as the coordinate position and confidence score of the human target box. The BR-LSTM module predicts action labels from the video sequence to complete the behavior classification task. The model introduces a separated feature training strategy: in the data preprocessing stage, BR-LSTM splits the two-dimensional joint coordinate data into an x-coordinate sequence and a y-coordinate sequence and computes the Euclidean distance from each joint to the root joint to expand the data samples. The module comprises a data preprocessing module and a bidirectional behavior classification network composed of six LSTM units. The data preprocessing module maps the action features into feature vectors through a linear layer and feeds these feature vectors into a forward LSTM layer and a backward LSTM layer; at each time step, the LSTM computes the hidden state and output of the current time step from the current input and the hidden state of the previous time step. Finally, a fully connected layer is used to further classify normal and abnormal behaviors.
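A schematic sketch of the data flow described in this claim, with the detector and pose network replaced by placeholder stubs (the real YH-Pose module wraps YOLOv5 and HRNet); the array shapes, joint count, and function names are illustrative assumptions.

```python
import numpy as np

K = 17  # assumed number of skeleton joints per person

def detect_people(frame):
    """Placeholder for the YOLOv5-based human detector: returns person bounding boxes."""
    return np.array([[50, 60, 200, 400]])           # (num_people, 4) boxes, dummy value

def extract_pose(frame, boxes):
    """Placeholder for the HRNet-based pose extractor: (num_people, K, 3) = x, y, score."""
    return np.random.rand(len(boxes), K, 3)

def preprocess(poses):
    """Separated feature strategy: x sequence, y sequence, joint-to-root distances."""
    xy = poses[..., :2]
    root = xy[:, 0:1, :]                             # root joint assumed to be index 0
    dist = np.linalg.norm(xy - root, axis=-1)        # Euclidean distance to the root
    return np.concatenate([xy[..., 0], xy[..., 1], dist], axis=-1)  # (num_people, 3K)

frame = np.zeros((480, 640, 3), dtype=np.uint8)      # one dummy RGB frame
features = preprocess(extract_pose(frame, detect_people(frame)))
print(features.shape)   # (1, 51): per-person feature vector fed to the BR-LSTM classifier
```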
2. The abnormal behavior detection model for dense multi-person scenes according to claim 1, characterized in that the human detector extracts detailed skeleton data, including skeleton joint point positions and the temporal information of joint points between frames, from the captured RGB video data. In order to first detect the persons in the image and then detect the human skeleton pose, the YH-Pose module combines the high-resolution skeleton pose extraction network HRNet with the YOLOv5 object detection framework. The prediction head of YOLOv5 is improved by introducing multiple ASFF modules, which perform multi-scale weighted fusion of the feature information: the input image passes through a feature pyramid network with feature maps at different levels, and the ASFF module fully fuses the feature information across the levels and merges it into a new feature representation through a weighted average. Serpentine convolution adopts a non-linear scanning scheme in which the scanning path of the convolution kernel is designed as a snake or curve shape, thereby changing the scanning order; serpentine convolution can also reduce the number of model parameters. Because of its non-linear scanning path, the weights of adjacent convolution kernels usually differ and cannot be shared directly, so each convolution kernel requires independent parameters, which in turn increases the expressive power of the model. Therefore, in the YOLOv5 backbone, ordinary convolution layers are replaced with serpentine convolution layers so that each convolution kernel can see a wider range of the input, improving the model's ability to perceive global features.

3. The abnormal behavior detection model for dense multi-person scenes according to claim 2, characterized in that α3, β3, γ3 are the weighting scale factors of the third level and x1→3, x2→3, x3→3 are the feature tensors of the respective levels; as in formula (1), the new feature produced by the third-level ASFF module is ASFF3:

ASFF3 = x1→3·α3 + x2→3·β3 + x3→3·γ3  (1).

4. The abnormal behavior detection model for dense multi-person scenes according to claim 2, characterized in that HRNet is used as the model for human skeleton pose prediction. When the network is required to output N classes of keypoints, it first outputs an N-dimensional feature map; Gaussian kernels are then constructed around the keypoints on the N-dimensional annotated feature map to generate the human skeleton keypoint heatmap encoding, and finally the (x, y) coordinates and confidence scores of the n skeleton joint points are obtained from the skeleton keypoint heatmaps.
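A minimal sketch of the level-3 ASFF fusion in formula (1). It assumes the three input feature maps have already been resized to the level-3 resolution, and that the scalar weights α3, β3, γ3 are learned and softmax-normalized, which follows the common ASFF formulation rather than anything stated explicitly in the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFF3(nn.Module):
    """Weighted fusion ASFF3 = a3*x1->3 + b3*x2->3 + g3*x3->3, as in formula (1)."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(3))   # learnable weights for the 3 levels

    def forward(self, x1_to_3, x2_to_3, x3_to_3):
        a3, b3, g3 = F.softmax(self.logits, dim=0)   # assumed normalization: weights sum to 1
        return a3 * x1_to_3 + b3 * x2_to_3 + g3 * x3_to_3

# Three feature maps already rescaled to the same (level-3) spatial size and channel count
x1, x2, x3 = (torch.randn(1, 256, 20, 20) for _ in range(3))
print(ASFF3()(x1, x2, x3).shape)   # torch.Size([1, 256, 20, 20])
```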
5. The abnormal behavior detection model for dense multi-person scenes according to claim 1, characterized in that the Focal Loss function is used in the separated feature training, as in formula (2), where p is the probability predicted by the model that a sample belongs to the foreground, with a value range of 0 to 1; y takes the values 1 and −1; αt plays a modulating role that reduces the attention paid to unimportant sample features and increases the attention paid to challenging sample features, and it is a fixed parameter with a range of 0 to 5;

FL(pt) = −αt(1 − pt)^γ · log(pt)  (2).

6. The abnormal behavior detection model for dense multi-person scenes according to claim 1, characterized in that the separated feature training adopts a separated feature encoding strategy to extract skeleton features. First, a vector representing the human pose is generated, which is the concatenation of three vectors: the normalized joint-position x-coordinate vector, the normalized joint-position y-coordinate vector, and the joint-to-root-joint distance vector. The distance between the camera and the persons at different positions in each frame differs, so the scale of the joint positions in the image also differs. Each joint in the image is described by its abscissa and ordinate; the original position of the i-th joint is defined as (xi, yi), and formula (4) is used to normalize the joint positions of every person detected in each frame. Each normalized joint position is again described by its abscissa and ordinate, so the normalized joint-position vector contains 2k features corresponding to the k joints. After the coordinate positions of the human skeleton joints are determined, the second component vector is obtained by computing the distance from each of the p joints to the human root joint point O (the centroid); the Euclidean distance from each joint to the root joint (x0, y0) is computed according to formula (5), and the joint distance vector contains k features corresponding to the k distances (d1–dp).

7. The abnormal behavior detection model for dense multi-person scenes according to claim 1, characterized in that the BR-LSTM module uses a bidirectional LSTM to perform feature extraction and behavior classification on the skeleton information. First, the k coordinates are split into x-coordinate values (x1, x2, ..., xk) and y-coordinate values (y1, y2, ..., yk), and the distances (d1, d2, ..., dk) from each joint point to the root joint are computed as the third feature component. When consecutive image frames are detected, x-coordinate, y-coordinate, and distance sequences that vary over time are obtained. Next, the data are processed to a length and size suitable for LSTM training and fed separately into three LSTM networks for temporal feature extraction. Afterwards, every time a new frame of image data is detected, the new coordinate values are appended to the sequences and the oldest coordinates are removed. Finally, the classification information of the behavioral actions is merged and fed into a fully connected layer, which classifies normal and abnormal behaviors to determine whether a behavior is abnormal.
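A sketch of the separated feature encoding in claim 6: normalized x coordinates, normalized y coordinates, and joint-to-root distances concatenated into one pose vector. The normalization formula (4) is not reproduced in the text above, so per-person min-max normalization is used here purely as an assumption; the Euclidean distance follows the description of formula (5).

```python
import numpy as np

def encode_pose(joints, root_index=0, eps=1e-6):
    """joints: (k, 2) array of raw joint positions (x_i, y_i) for one person."""
    mins, maxs = joints.min(axis=0), joints.max(axis=0)
    norm = (joints - mins) / (maxs - mins + eps)       # assumed normalization for formula (4)
    x0, y0 = joints[root_index]                        # root joint O, e.g. the centroid
    d = np.sqrt((joints[:, 0] - x0) ** 2 + (joints[:, 1] - y0) ** 2)  # distance per formula (5)
    return np.concatenate([norm[:, 0], norm[:, 1], d])  # 2k + k = 3k features

joints = np.array([[120., 80.], [130., 120.], [110., 125.], [125., 200.]])
print(encode_pose(joints).shape)   # (12,) for k = 4 joints
```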
8. The abnormal behavior detection model for dense multi-person scenes according to claim 7, characterized in that the LSTM neural unit comprises an input gate it, a forget gate ft, a cell state Ct, and an output gate Ot; the long- and short-term memory is controlled through the gates and the cell state, and the computation is expressed by formulas (6) to (11). In formula (6), the information of the input gate at time t is a combination of the hidden output of the previous time step and the input at time t. In formula (7), the candidate cell state at time t is computed from ht−1 and xt, which denote the hidden output of the previous time step and the input at time t, respectively. In formula (8), the forget gate controls which information in the memory state of the previous time step should be forgotten or retained. In formula (9), the outputs of the input gate and the forget gate are combined to update the cell state at time t. The two LSTM layers are then connected end to end, and the cells of each LSTM layer are connected in sequence to predict the forward-learned action feature sequence and the backward-learned action feature sequence;

it = σ(Wi·[ht−1, xt] + bi)  (6);

C̃t = tanh(Wc·[ht−1, xt] + bc)  (7);

ft = σ(Wf·[ht−1, xt] + bf)  (8);

Ct = ft*Ct−1 + it*C̃t  (9);

Ot = σ(Wo·[ht−1, xt] + bo)  (10);

ht = Ot*tanh(Ct)  (11);

In the structure of the two-layer bidirectional LSTM, the forward layer and the backward layer are jointly connected to the output layer, which contains six shared weights w1–w6. The forward propagation of the behavior features is computed in the forward layer from time 1 to time t; in the backward layer, the computation runs in reverse from time t to time 1 to obtain and store the output of the backward hidden layer at every time step. The corresponding outputs of the forward layer and the backward layer at each time step are merged to obtain the final output, as given by formulas (12)–(15), in which the remaining additive terms are biases, and o′t, o″t are the results of the two LSTM layers processing the action feature vectors output at the corresponding time steps.
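A plain NumPy sketch of one LSTM time step following equations (6)–(11) as written in claim 8 (input gate, candidate state, forget gate, cell update, output gate, hidden output); the dimensions and random weights are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step per equations (6)-(11); W[g], b[g] hold the weights/biases per gate."""
    z = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ z + b["i"])                # (6) input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])              # (7) candidate cell state
    f_t = sigmoid(W["f"] @ z + b["f"])                # (8) forget gate
    c_t = f_t * c_prev + i_t * c_hat                  # (9) cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])                # (10) output gate
    h_t = o_t * np.tanh(c_t)                          # (11) hidden output
    return h_t, c_t

d_in, d_hid = 51, 8
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(d_hid, d_hid + d_in)) * 0.1 for g in "icfo"}
b = {g: np.zeros(d_hid) for g in "icfo"}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), W, b)
print(h.shape, c.shape)   # (8,) (8,)
```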
9. An abnormal behavior detection method for dense multi-person scenes, characterized in that it comprises the following steps:

Step 1, video acquisition: for the dense crowd scene in which abnormal behavior is to be analyzed, record or obtain the relevant video data;

Step 2, human detection: based on the YOLOv5 detection framework, add a bounding box for every person in the video and mark the person's position in the image;

Step 3, skeleton pose extraction: use the YH-Pose module, which integrates the high-resolution skeleton pose extraction network HRNet, to compute and determine the positions of the k key skeleton nodes of every person in the video;

Step 4, pose skeleton model generation: according to the skeleton structure of the human body, connect the key skeleton nodes confirmed in the previous step in an orderly manner to generate the pose skeleton model of the human body;

Step 5, feature fusion: the YH-Pose network uses the input RGB video frame data to fuse the human bounding boxes with the pose skeleton model, generating human pose information that contains the fused two-dimensional coordinates of the k joints in every frame, the bounding box positions, and the confidence scores;

Step 6, data preprocessing: in the behavior classification stage, the BR-LSTM module preprocesses the generated human pose information, including splitting the two-dimensional joint coordinates into independent x- and y-coordinate sequences and computing the Euclidean distance from each joint to the root joint;

Step 7, behavior feature extraction: the feature extraction part of the BR-LSTM module takes the preprocessed data and extracts the spatio-temporal features of the actions through the long short-term memory network;

Step 8, classification and prediction: after data processing and behavior feature extraction, the fully connected layer performs the final classification computation, and abnormal and normal behaviors are identified and predicted by the trained model;

Step 9, result output: according to the classification and prediction results, mark the abnormal behaviors in the video for further analysis and processing.
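A sketch of the rolling sequence update described in claim 7 and steps 6–7 of the method: as each new frame arrives, the newest per-joint x, y, and distance values are appended and the oldest are dropped, so the three sequences stay at a fixed length for the LSTMs. The window length and joint count are assumed values.

```python
from collections import deque
import numpy as np

SEQ_LEN, K = 50, 17                      # assumed window length and joint count
x_seq = deque(maxlen=SEQ_LEN)            # oldest entries are discarded automatically
y_seq = deque(maxlen=SEQ_LEN)
d_seq = deque(maxlen=SEQ_LEN)

def on_new_frame(joints, root_index=0):
    """joints: (K, 2) joint coordinates detected in the newest frame."""
    x0, y0 = joints[root_index]
    x_seq.append(joints[:, 0].copy())
    y_seq.append(joints[:, 1].copy())
    d_seq.append(np.hypot(joints[:, 0] - x0, joints[:, 1] - y0))

for _ in range(120):                      # simulate a stream of 120 frames
    on_new_frame(np.random.rand(K, 2))

# Each sequence now holds exactly the 50 most recent frames, ready for the three LSTMs.
print(np.stack(x_seq).shape, np.stack(y_seq).shape, np.stack(d_seq).shape)  # (50, 17) each
```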
CN202311572461.1A 2023-11-23 2023-11-23 Abnormal behavior detection model and detection method in dense multi-person scene Pending CN117541994A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311572461.1A CN117541994A (en) 2023-11-23 2023-11-23 Abnormal behavior detection model and detection method in dense multi-person scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311572461.1A CN117541994A (en) 2023-11-23 2023-11-23 Abnormal behavior detection model and detection method in dense multi-person scene

Publications (1)

Publication Number Publication Date
CN117541994A true CN117541994A (en) 2024-02-09

Family

ID=89785729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311572461.1A Pending CN117541994A (en) 2023-11-23 2023-11-23 Abnormal behavior detection model and detection method in dense multi-person scene

Country Status (1)

Country Link
CN (1) CN117541994A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117934522A (en) * 2024-03-25 2024-04-26 江西师范大学 Two-stage coronary artery image segmentation method, system and computer equipment
CN118298514A (en) * 2024-06-06 2024-07-05 华东交通大学 Deep learning-based worker dangerous action recognition method and system
CN118506443A (en) * 2024-05-06 2024-08-16 山东千人考试服务有限公司 Examinee abnormal behavior recognition method based on human body posture assessment
CN118965236A (en) * 2024-10-10 2024-11-15 河北华通科技股份有限公司 A method and system for monitoring energy equipment data based on edge intelligent control
CN119251792A (en) * 2024-12-06 2025-01-03 中国第一汽车股份有限公司 Lane line detection method, device, electronic device and storage medium

Similar Documents

Publication Publication Date Title
Singh et al. A deeply coupled ConvNet for human activity recognition using dynamic and RGB images
Shami et al. People counting in dense crowd images using sparse head detections
CN117541994A (en) Abnormal behavior detection model and detection method in dense multi-person scene
Hu Design and implementation of abnormal behavior detection based on deep intelligent analysis algorithms in massive video surveillance
Babaee et al. A dual cnn–rnn for multiple people tracking
Hu et al. Semantic SLAM based on improved DeepLabv3⁺ in dynamic scenarios
Parashar et al. Deep learning pipelines for recognition of gait biometrics with covariates: a comprehensive review
CN112085765B (en) Video object tracking method combining particle filtering and metric learning
Li et al. Development and challenges of object detection: A survey
Liu et al. Towards interpretable and robust hand detection via pixel-wise prediction
Wen et al. Multi-view gait recognition based on generative adversarial network
Fu et al. [Retracted] Sports Action Recognition Based on Deep Learning and Clustering Extraction Algorithm
Zhong et al. Person reidentification based on pose-invariant feature and B-KNN reranking
CN112541403A (en) Indoor personnel falling detection method utilizing infrared camera
Xia et al. An abnormal event detection method based on the Riemannian manifold and LSTM network
Xia et al. A multilevel fusion network for 3D object detection
CN115063724A (en) A kind of identification method and electronic device of fruit tree field ridge
Hao et al. Human behavior analysis based on attention mechanism and LSTM neural network
Zhang [Retracted] Sports Action Recognition Based on Particle Swarm Optimization Neural Networks
Narayan et al. Learning deep features for online person tracking using non-overlapping cameras: A survey
Pan A method of key posture detection and motion recognition in sports based on Deep Learning
Tang et al. Using a multilearner to fuse multimodal features for human action recognition
CN112183422A (en) Human face living body detection method and device based on space-time characteristics, electronic equipment and storage medium
Cheng et al. Action prediction based on partial video observation via context and temporal sequential network with deformable convolution
Nasiri et al. Masked face detection using artificial intelligent techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination