CN117789255A - Pedestrian abnormal behavior video identification method based on pose estimation - Google Patents

Pedestrian abnormal behavior video identification method based on pose estimation

Info

Publication number
CN117789255A
Authority
CN
China
Prior art keywords
image
frame
pedestrian
key points
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410211689.6A
Other languages
Chinese (zh)
Other versions
CN117789255B (en)
Inventor
张鹏
李爱华
董克
王泽灏
赵威
李志超
翟月
肖景洋
吴敏思
李刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Elysan Electronic Technology Co ltd
Original Assignee
Shenyang Elysan Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Elysan Electronic Technology Co ltd filed Critical Shenyang Elysan Electronic Technology Co ltd
Priority to CN202410211689.6A priority Critical patent/CN117789255B/en
Publication of CN117789255A publication Critical patent/CN117789255A/en
Application granted granted Critical
Publication of CN117789255B publication Critical patent/CN117789255B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian abnormal behavior video identification method based on pose estimation, comprising the following steps: extracting continuous image frames from the video stream and inputting them into a target detection model to obtain image blocks containing pedestrian targets; performing frame-skipping pose estimation on each image block with an improved HRNet network to obtain a plurality of human skeletal key points; filling in missing skeletal key points by predictive supplement calculation from the skeletal key points of the j-th image block of the (i-2)-th frame and the j-th image block of the (i+2)-th frame, thereby obtaining complete skeletal key points; and constructing a space-time graph model from the complete skeletal key points and inputting it into an existing behavior recognition module to recognize abnormal behaviors. By extracting key points from multiple features, the method accurately identifies the body posture and actions of pedestrians, effectively improving subway safety, and it maintains key-point detection accuracy under occlusion and in complex background environments.

Description

Pedestrian abnormal behavior video identification method based on pose estimation
Technical Field
The invention relates to the technical field of pedestrian detection and tracking, and in particular to a pedestrian abnormal behavior video identification method based on pose estimation.
Background
Pedestrian abnormal behavior identification is an important research topic in the field of computer vision, and the technology plays a key role in safety systems such as video monitoring. With the growth of public places and rising safety requirements, demand for this technology is also steadily increasing. Particularly in crowded social venues, business districts and traffic nodes such as subway stations, timely detection and early warning of abnormal behaviors is critical and can effectively avoid or reduce safety accidents.
The goal of pedestrian abnormal behavior recognition is to distinguish and give early warning of abnormal behaviors such as falling, fighting and attacking by analyzing and understanding pedestrian behavior. These behaviors may injure others or damage facilities. The main challenge in this field is defining what counts as an "anomaly": people's understanding of "anomalies" differs across cultures and contexts, and environmental factors add further variability, which makes abnormal behavior recognition a complex task.
Pedestrian abnormal behavior recognition based on pose estimation is a newer research direction that can analyze and understand pedestrian behavior more accurately and deeply. Moreover, with the rapid development of deep learning in recent years, the multilayer structure of neural networks allows pedestrian behavior patterns to be resolved in greater depth, and even the pedestrian's next action to be predicted. Pose estimation here refers to inferring the three-dimensional posture of the human body from images or video, including joint positions and rotations. Compared with traditional image analysis methods, this technique provides richer and more accurate behavioral features, which is critical for abnormal behavior identification in complex environments.
Research on behavior recognition originated in 1975, when Johansson showed experimentally that human motion can be effectively described by observing and analyzing the movement of the primary joints. This principle provided the basis for many subsequent studies. He then devised a human model consisting of 12 points; by tracking the combination and movement of these key points, recognition of a person's behavior could be achieved. The model also laid the foundation for human body tracking in the field of computer vision.
Sukthanker et al. studied further on this basis and proposed a hierarchical mean-shift algorithm that combines the 3D spatio-temporal and optical flow information of silhouettes to model human actions, enabling recognition of simple behaviors. For more complex behavior recognition, Krizhevsky et al. proposed a Markov logic network that combines a Markov network with first-order logic. This network model can effectively describe the spatio-temporal relationships between sub-behaviors and thus recognize complex behaviors.
In actual scenarios, however, behavior recognition must cope with various challenges such as occlusion and illumination changes. These factors can interfere with the images in the video and make recognition difficult. To improve the expressiveness, discriminability and robustness of behavior recognition, Laptev et al. proposed a behavior recognition method based on the trajectories of local interest points: they combined local feature detection at spatio-temporal interest points with a KLT tracker to obtain the motion trajectories of the interest points. Later, Wang et al. proposed a dense-trajectory behavior recognition method, which collects a large number of feature points in each frame and tracks them with the optical flow field, then uses the appearance information of the trajectories and the spatio-temporal information between them to express and identify behavior.
Conventional pose-estimation behavior recognition methods are generally slow because of their high computational complexity, and when a pedestrian is occluded, some key points cannot be recognized and the accuracy of pose estimation drops, so these methods perform poorly in practical applications.
Disclosure of Invention
Aiming at the defects of prior-art pose estimation behavior recognition methods, namely slow recognition caused by high computational complexity and reduced pose estimation accuracy when some key points of an occluded pedestrian cannot be recognized, the invention provides a pose-estimation-based pedestrian abnormal behavior video recognition method that can accurately identify the body posture and actions of pedestrians.
In order to solve the technical problems, the invention adopts the following technical scheme:
the invention provides a pedestrian abnormal behavior video identification method based on gesture estimation, which comprises the following steps:
S1, extracting continuous image frames F_i from the video stream and inputting them into a pre-trained target detection model to obtain image blocks H_ij containing pedestrian targets; where i is the index of the image frame and j is the index of an image block within the i-th frame; i = 1, 2, 3, …, j = 1, 2, 3, …;
S2, for each image block H_ij, performing frame-skipping pose estimation with an improved HRNet network to obtain a plurality of human skeletal key points G_ij;
S3, for the skeletal key points G_ij, filling in missing skeletal key points by predictive supplement calculation from the skeletal key points of the j-th image block H_(i-2)j of the (i-2)-th frame and the j-th image block H_(i+2)j of the (i+2)-th frame, thereby obtaining complete skeletal key points G'_ij;
S4, constructing a space-time graph model from the complete skeletal key points G'_ij and inputting it into an existing behavior recognition module to recognize abnormal behaviors.
In step S1, extracting continuous image frames F_i from the video stream and inputting them into a pre-trained target detection model to obtain image blocks H_ij containing pedestrian targets specifically comprises the following steps:
S101, extracting continuous image frames F_i from the video stream; the video stream is received by the server from the camera and processed by an encoder and decoder, converting the raw video data into continuous digital image frames, which are preprocessed by resizing, normalization, brightness and contrast adjustment and enhancement algorithms to ensure image quality, so that subsequent feature extraction is more effective and accurate;
S102, performing feature extraction with the pre-trained target detection model: the image tensor of each video frame is input into the model to obtain a feature representation of the frame; this representation is a high-dimensional data structure that supplies abundant useful information to the region proposal network for subsequent target detection;
S103, passing the feature representation to the region proposal network, which generates candidate target regions by sliding a window over the feature map; candidate boxes of different sizes and aspect ratios are generated for each position;
S104, sending the candidate boxes into another network module, which performs target classification and bounding-box regression on each candidate box; target classification uses a convolutional neural network structure to classify the features in each candidate box, and a classifier judges whether each candidate box contains a target of interest;
S105, fine-tuning the candidate boxes classified as targets of interest with a regressor to ensure that they accurately capture the target positions, finally obtaining the image blocks H_ij containing pedestrian targets.
In step S2, the 3×3 convolution kernel of the Bottleneck structure in the initial stage of the HRNet network is replaced by a residual unit; the improvement of the Bottleneck module is specifically as follows:
the Bottleneck module includes three processing stages: dimension reduction with a 1×1 convolution kernel, feature extraction with a 3×3 convolution kernel, and dimension increase with a 1×1 convolution kernel;
after the 1×1 convolution of the input data, the original 3×3 convolution is replaced by a residual unit: the data is divided into k sub-features, the input sub-features are defined as R_k, and each input feature map has the same size;
the output C_k (k = 1, 2, 3, …) is expressed as follows:
C_k = W_k = Conv3×3(R_k + W_(k-1)), with W_0 = 0,
where W_k represents the output of a sub-feature after the 3×3 convolution kernel; the input sub-feature R_k is added to W_(k-1) before being fed to the convolution that produces W_k, which reduces parameters as the number of input sub-features grows; C_(k-1) is the previous output, and W_(k-1) is the output of the previous input sub-feature R_(k-1) after its 3×3 convolution kernel;
the residual blocks of this hierarchy are used to extract hybrid features as input to subsequent layers, in order to analyze and extract the information contained in the hybrid features more deeply.
In step S3, filling in missing skeletal key points of G_ij by predictive supplement calculation from the skeletal key points of the j-th image block H_(i-2)j of the (i-2)-th frame and the j-th image block H_(i+2)j of the (i+2)-th frame, thereby obtaining complete skeletal key points G'_ij, specifically comprises the following steps:
S301, using (x_(i-2), y_(i-2)) to denote the coordinates of the skeletal key point G_(i-2)j in the image block H_(i-2)j, i.e. the corresponding key point in the preceding interval frame used for prediction and supplementation, and (x_(i+2), y_(i+2)) to denote the coordinates of the skeletal key point G_(i+2)j in the image block H_(i+2)j, i.e. the corresponding key point in the following interval frame;
S302, completing the coordinates (x_i, y_i) of a missing skeletal key point in the current frame's image block with the prediction supplement module, using the information of the preceding and following frames; the prediction supplement formula is:
(x_i, y_i) = ((x_(i-2) + x_(i+2))/2, (y_(i-2) + y_(i+2))/2).
In step S4, a space-time graph model is constructed from the skeletal key point coordinates of the body parts selected in step S3, according to the connectivity and temporal relationships of the human skeletal key points; these key points are selected according to the structure of the human skeleton and represent the basic units of human action;
there are two types of edges in the space-time graph model: spatial edges and temporal edges; a spatial edge connects different skeletal key points G_ij within each frame's image block H_ij in a prescribed order to capture the posture information at each moment; a temporal edge connects the same skeletal key point G_ij across the image blocks of the interval frames, linking the temporal and spatial information of the behavior sequence to capture the structure and evolution of human actions.
The number of skeletal key points is 15.
Compared with the prior art, the invention has the following beneficial technical effects and advantages:
1. The invention provides a pedestrian abnormal behavior video identification method based on pose estimation that uses a server to acquire and process surveillance video in real time. Unlike traditional processing approaches, it adopts a Faster RCNN model to detect targets in images quickly and accurately; as each video frame is decoded, it is continuously fed into the model to generate accurate image blocks containing pedestrian targets. By adopting an improved HRNet network, the method not only detects pedestrian key points in each frame but also handles the occlusion and complex-background problems common in practical applications.
2. To further enhance robustness, the invention combines a predictive supplement calculation over adjacent interval frames to fill in missing skeletal key points, which is particularly important for complex public scenes such as traffic congestion and large-scale events. The resulting key points provide the basis for constructing a space-time graph model, a graph model capable of capturing changes in human motion; convolution and classification operations then judge in real time whether abnormal behavior is present in the video. By extracting key points from multiple features, the body posture and actions of pedestrians can be accurately identified.
3. The invention improves the detection strategy of pose estimation, changing single-frame detection to frame-skipping detection, which greatly increases the computation speed of pose estimation. The system also has a real-time alarm function: once abnormal behavior is detected, an alarm is raised immediately so that relevant personnel can respond in time, effectively improving subway safety. Compared with traditional behavior recognition methods, the method is therefore faster.
4. The invention comprehensively applies deep learning and pose estimation technology, improves the speed of behavior recognition through the frame-skipping detection strategy, and maintains key-point detection accuracy under occlusion and in complex backgrounds through the improved HRNet network and the predictive supplement algorithm over adjacent-frame key points.
Drawings
FIG. 1 is a flow chart of the pedestrian abnormal behavior video identification method based on pose estimation;
FIG. 2 is a structural diagram of the improved residual unit in the method of the invention;
FIG. 3 is a diagram of the space-time graph model in the method of the invention;
FIG. 4 is a flow chart of the behavior recognition module in the method of the invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings, but the scope of the present invention is not limited by the accompanying drawings.
In the field of abnormal behavior recognition, conventional methods are often limited by complex backgrounds and occlusion, so their performance in practical applications is not ideal. To solve these problems, the invention provides a pedestrian abnormal behavior video identification method based on pose estimation which, as shown in FIG. 1, comprises the following steps:
S1, extracting continuous image frames F_i from the video stream and inputting them into a pre-trained target detection model to obtain image blocks H_ij containing pedestrian targets; where i is the index of the image frame and j is the index of an image block within the i-th frame; i = 1, 2, 3, …, j = 1, 2, 3, …;
S2, for each image block H_ij, performing frame-skipping pose estimation with an improved HRNet network to obtain a plurality of human skeletal key points G_ij;
S3, for the skeletal key points G_ij, filling in missing skeletal key points by predictive supplement calculation from the skeletal key points of the j-th image block H_(i-2)j of the (i-2)-th frame and the j-th image block H_(i+2)j of the (i+2)-th frame, thereby obtaining complete skeletal key points G'_ij;
S4, constructing a space-time graph model from the complete skeletal key points G'_ij and inputting it into an existing behavior recognition module to recognize abnormal behaviors.
First, a server in the subway scene decodes the video stream acquired from the deployed cameras into continuous image frames, and then passes them into the target detection module for feature extraction to obtain image blocks H_ij containing pedestrian targets; pose estimation is then performed on the image blocks output by the target detection module. In this process, the coordinates of 15 human key points are first extracted from multiple features, and missing key points are then predicted and completed using the preceding and following interval frames. The image blocks with pose estimates are input into the existing behavior recognition module, which constructs the space-time graph model from the key points and performs convolution and classification operations to judge whether the pedestrian exhibits abnormal behavior. If abnormal behavior exists, a local host connected to the server in the subway scene issues an alarm.
In step S1, extracting continuous image frames F_i from the video stream and inputting them into a pre-trained target detection model to obtain image blocks H_ij containing pedestrian targets specifically comprises the following steps:
S101, extracting continuous image frames F_i from the video stream; the video stream is received by the server from the camera and processed by an encoder and decoder, converting the raw video data into continuous digital image frames, which are preprocessed by resizing, normalization, brightness and contrast adjustment and enhancement algorithms to ensure image quality, so that subsequent feature extraction is more effective and accurate;
S102, performing feature extraction with the pre-trained target detection model (this embodiment adopts the Faster RCNN model): the image tensor of each video frame is input into the model to obtain a feature representation of the frame; this representation is a high-dimensional data structure that supplies abundant useful information to the region proposal network for subsequent target detection;
S103, passing the feature representation to the region proposal network, which generates candidate target regions by sliding a window over the feature map; candidate boxes of different sizes and aspect ratios are generated for each position;
S104, sending the candidate boxes into another network module, which performs target classification and bounding-box regression on each candidate box; target classification uses a convolutional neural network structure to classify the features in each candidate box, and a classifier judges whether each candidate box contains a target of interest;
S105, fine-tuning the candidate boxes classified as targets of interest with a regressor to ensure that they accurately capture the target positions, finally obtaining the image blocks H_ij containing pedestrian targets.
The server in the subway scene first receives the video stream from the camera; the data captured by the camera is processed by the codec, converting the raw video data into continuous digital image frames in preparation for further analysis. These image frames are first subjected to a series of preprocessing steps, including resizing, normalization, brightness and contrast adjustment and other enhancement algorithms, to ensure image quality so that subsequent feature extraction is more effective and accurate. The preprocessed image frames are then input into the pre-trained Faster RCNN model, which in this embodiment mainly uses the VGG16 network for feature extraction. VGG16 is a 16-layer deep neural network that has demonstrated strong performance in image recognition. After the image tensor of a video frame is input into the VGG16 network, a feature representation of the frame is obtained. These feature representations are a high-dimensional data structure that provides abundant useful information for subsequent target detection. The features are passed into the Region Proposal Network (RPN), one of the core components of Faster RCNN, which is responsible for generating candidate target regions: sliding a window over the feature map, the RPN generates a number of candidate boxes of different sizes and aspect ratios for each location.
These candidate boxes are then sent to another network module responsible for target classification and bounding-box regression. Target classification uses a specific convolutional neural network structure to classify the features in each candidate box, and the classifier determines whether each box contains a target of interest. Candidate boxes classified as targets are then fine-tuned with a regressor to ensure that they capture the target's location as accurately as possible.
Through all the above stages, the image blocks H_ij containing pedestrian targets in the subway scene are finally obtained.
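As an illustration of steps S101 to S105, the sketch below runs a pre-trained Faster RCNN detector over one decoded frame and crops the pedestrian image blocks H_ij. It is a minimal sketch, not the patented implementation: it assumes torchvision's ResNet50-FPN variant of Faster RCNN (the embodiment above uses a VGG16 backbone), the COCO "person" class index, and the 0.1 score threshold reported in the experiments.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Pre-trained Faster R-CNN; torchvision ships a ResNet50-FPN backbone,
# standing in here for the VGG16 backbone described in the embodiment.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

PERSON_CLASS = 1        # COCO label index for "person"
SCORE_THRESHOLD = 0.1   # detection threshold used in the experiments

def detect_pedestrians(frame):
    """Return the pedestrian image blocks H_ij cropped from one frame F_i.

    `frame` is an HxWx3 uint8 RGB array; the result is a list of CHW tensors.
    """
    image = to_tensor(frame)                # CHW float tensor in [0, 1]
    with torch.no_grad():
        out = model([image])[0]             # dict with boxes, labels, scores
    blocks = []
    for box, label, score in zip(out["boxes"], out["labels"], out["scores"]):
        if label == PERSON_CLASS and score >= SCORE_THRESHOLD:
            x1, y1, x2, y2 = box.int().tolist()
            blocks.append(image[:, y1:y2, x1:x2])   # image block H_ij
    return blocks
```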
In step S2, the 3×3 convolution kernel of the Bottleneck structure in the initial stage of the HRNet network is replaced by a residual unit (whose structure is shown in FIG. 2); the Bottleneck module is modified accordingly.
Step S2 first uses the human detection network Faster RCNN from step S1 to detect the image blocks H_ij containing pedestrian targets in the subway scene, and then performs frame-skipping pose estimation on the detected image blocks H_ij with the pose estimation network. The improved HRNet pose estimation network is a top-down algorithm; its structure comprises an initial stage, a multi-scale parallel stage and a final stage.
In the initial phase, the network is based on a smaller high-resolution network, usually borrowed from the initial stage of ResNet. In the proposed scheme, the convolution kernel of the original Bottleneck structure is replaced by a residual unit with multiple receptive fields, which enables finer multi-scale characterization and captures richer image information.
Followed by a multi-scale parallel stage. At this stage, the network is split into several sub-networks running in parallel, each operating at a different resolution, exchanging information with each other, forming a network system working in concert. Such an arrangement allows the network to capture image features at different levels, thereby obtaining more comprehensive and diversified feature information. These sub-networks enter a multi-stage interaction process, and they interact and cooperate with each other in multiple stages, so that the feature characterization is more complex and complete, and the recognition and analysis capability of the network is enhanced.
Finally, the fusion phase. At this stage, all sub-networks aggregate the respective outputs to form a high quality composite output. This output is not simply a combination, but rather a deep fusion process, by integrating feature information of multiple scales together, provides a rich and high quality feature input for the final pose estimation task.
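To picture the multi-scale parallel and fusion stages, the following sketch shows a two-branch exchange unit: each branch receives the other branch's features, resampled to its own resolution, and fuses them by addition. The branch widths (32 and 64 channels), the strided convolution downward and the 1×1 convolution plus bilinear upsampling upward are illustrative assumptions, not the patent's exact HRNet configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExchangeUnit(nn.Module):
    """Minimal HRNet-style exchange between a high- and a low-resolution branch."""
    def __init__(self, c_high=32, c_low=64):
        super().__init__()
        # high -> low: strided 3x3 convolution halves the spatial size
        self.down = nn.Conv2d(c_high, c_low, kernel_size=3, stride=2, padding=1)
        # low -> high: 1x1 convolution, then bilinear upsampling
        self.up = nn.Conv2d(c_low, c_high, kernel_size=1)

    def forward(self, x_high, x_low):
        fused_high = x_high + F.interpolate(self.up(x_low),
                                            size=x_high.shape[-2:],
                                            mode="bilinear", align_corners=False)
        fused_low = x_low + self.down(x_high)
        return fused_high, fused_low

# example: 1/4- and 1/8-resolution feature maps of a 256x256 input
h, l = torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32)
h2, l2 = ExchangeUnit()(h, l)
```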
The 3×3 convolution kernel in the initial-stage Bottleneck structure is replaced by a residual unit, and multi-scale characterization at the granularity level is obtained using multiple receptive fields; this serves as the improvement to the HRNet network. The HRNet-based pose estimation method comprises the following steps:
S201, in the initial stage, to extract low-level features, HRNet processes the input image with convolution operations: two convolution layers and four improved Bottleneck modules are applied in sequence, and downsampling finally reduces the spatial size of the feature map.
The improvement to the Bottleneck module involves three processing stages: dimension reduction with a 1×1 convolution kernel, feature extraction with a 3×3 convolution kernel, and dimension increase with a 1×1 convolution kernel, specifically as follows:
the Bottleneck module is a classical residual module consisting of three main parts: 1×1 convolution for dimension reduction, 3×3 convolution for feature extraction, and 1×1 convolution for dimension increase. In the improved version, after the 1×1 convolution of the input data, the original 3×3 convolution is replaced by a residual unit: the data is divided into k sub-features, defined as R_k, and each input feature map has the same size. The output C_k (k = 1, 2, 3, …) is expressed as follows:
C_k = W_k = Conv3×3(R_k + W_(k-1)), with W_0 = 0,
where W_k represents the output of a sub-feature after the 3×3 convolution kernel; the input sub-feature R_k is added to W_(k-1) before being fed to the convolution that produces W_k, which reduces parameters as the number of input sub-features grows; C_(k-1) is the previous output, and W_(k-1) is the output of the previous input sub-feature R_(k-1) after its 3×3 convolution kernel.
the residual blocks of this hierarchy are used to extract hybrid features as input to subsequent layers in order to more deeply analyze and extract the information contained in these hybrid features. The output of each residual unit contains a combination of information of different scales, which makes more efficient use of global information and local information.
S202, in the multi-scale parallel stage, HRNet maintains feature maps at multiple resolutions and processes them in parallel; each resolution has its own branch, features are further extracted and fused through sub-modules, and information is exchanged and fused between feature maps of different resolutions.
S203, in the final stage, after the multi-scale parallel stage, HRNet upsamples the feature maps to restore the resolution of the original image, extracts a high-resolution feature representation through convolution, and finally feeds these features to the output layer, as sketched below.
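The final stage can be illustrated with a small heatmap head: the fused high-resolution features are projected to one heatmap per skeletal key point, upsampled to the input resolution, and each key point is read off as the argmax of its heatmap. The layer sizes are assumptions; only the count of 15 key points comes from the method itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_KEYPOINTS = 15  # the method uses 15 skeletal key points

class HeatmapHead(nn.Module):
    """Project features to one heatmap per key point and upsample."""
    def __init__(self, in_channels=32):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, NUM_KEYPOINTS, kernel_size=1)

    def forward(self, feats, out_size):
        heatmaps = self.conv(feats)                    # B x 15 x h x w
        return F.interpolate(heatmaps, size=out_size,
                             mode="bilinear", align_corners=False)

def heatmaps_to_keypoints(heatmaps):
    """Take the argmax of each heatmap as the (x, y) key-point location."""
    b, k, h, w = heatmaps.shape
    idx = heatmaps.flatten(2).argmax(dim=2)            # B x 15 flat indices
    return torch.stack((idx % w, idx // w), dim=2)     # B x 15 x 2, as (x, y)
```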
In step S3, filling in missing skeletal key points of G_ij by predictive supplement calculation from the skeletal key points of the j-th image block H_(i-2)j of the (i-2)-th frame and the j-th image block H_(i+2)j of the (i+2)-th frame, thereby obtaining complete skeletal key points G'_ij, specifically comprises the following steps:
In complex scenes such as subway stations, existing top-down multi-person pose estimation methods can only estimate and identify individual poses, which leads to some key points being missed or falsely detected and greatly reduces the accuracy of the pose estimation result. The method for predicting and supplementing missing key points comprises the following steps:
S301, using (x_(i-2), y_(i-2)) to denote the coordinates of the skeletal key point G_(i-2)j in the image block H_(i-2)j, i.e. the corresponding key point in the preceding interval frame used for prediction and supplementation, and (x_(i+2), y_(i+2)) to denote the coordinates of the skeletal key point G_(i+2)j in the image block H_(i+2)j, i.e. the corresponding key point in the following interval frame;
S302, completing the coordinates (x_i, y_i) of a missing skeletal key point in the current frame's image block with the prediction supplement module, using the information of the preceding and following frames; the prediction supplement formula is:
(x_i, y_i) = ((x_(i-2) + x_(i+2))/2, (y_(i-2) + y_(i+2))/2).
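The prediction supplement formula translates directly into code. The sketch below assumes key points are stored as (x, y) tuples, with None marking a point the pose network failed to detect in the current frame:

```python
def fill_missing_keypoint(kp_prev, kp_next):
    """Predictive supplement for a key point missing in frame i: the temporal
    midpoint of the same key point in frames i-2 and i+2.

    kp_prev = (x_(i-2), y_(i-2)), kp_next = (x_(i+2), y_(i+2));
    returns (x_i, y_i)."""
    (x_prev, y_prev), (x_next, y_next) = kp_prev, kp_next
    return ((x_prev + x_next) / 2.0, (y_prev + y_next) / 2.0)

def complete_skeleton(skeleton_prev, skeleton_cur, skeleton_next):
    """Fill every undetected key point (None) in the current frame's skeleton
    from the surrounding interval frames; detected points are kept as-is."""
    return [cur if cur is not None else fill_missing_keypoint(prev, nxt)
            for prev, cur, nxt in zip(skeleton_prev, skeleton_cur, skeleton_next)]
```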
Step S4 constructs a space-time graph model (shown in FIG. 3) from the complete skeletal key points G'_ij and inputs it into the behavior recognition module (whose flow chart is shown in FIG. 4) to recognize abnormal behaviors.
According to the connectivity and temporal relationships of the human skeletal key points, the coordinates of the 15 body key points from step S3 are selected to construct the space-time graph model. These skeletal key points are chosen according to the structure of the human skeleton and represent the basic units of human motion. The space-time graph model is distinctive here because it not only considers spatial information but also fuses temporal information. Specifically, there are two types of edges in the model: spatial edges and temporal edges. A spatial edge connects the different skeletal key points G_ij within each frame's image block H_ij in a specific order to capture the posture information at each moment, while a temporal edge connects the same skeletal key point G_ij across the image blocks H_ij of the interval frames, integrating the temporal and spatial information of the behavior sequence; the space-time graph model aims to capture the structure and evolution of human actions.
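Constructing the space-time graph then amounts to building an adjacency matrix over frames × joints nodes with the two edge types. The patent does not enumerate its joint pairs, so the 15-joint edge list below is a hypothetical head-torso-limbs layout, and the 30-frame window is likewise an assumption for illustration:

```python
import numpy as np

# Hypothetical edge list for a 15-key-point skeleton (head-torso-limbs layout).
SPATIAL_EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (2, 5), (5, 6), (6, 7),
                 (2, 8), (8, 9), (9, 10), (10, 11), (9, 12), (12, 13), (13, 14)]

def build_spatiotemporal_adjacency(num_joints=15, num_frames=30):
    """Adjacency over num_frames x num_joints nodes: spatial edges connect
    different joints inside one frame; temporal edges connect the same joint
    across consecutive (interval) frames."""
    n = num_frames * num_joints
    adj = np.zeros((n, n), dtype=np.float32)
    for t in range(num_frames):
        base = t * num_joints
        for a, b in SPATIAL_EDGES:                    # spatial edges
            adj[base + a, base + b] = adj[base + b, base + a] = 1.0
        if t + 1 < num_frames:                        # temporal edges
            for j in range(num_joints):
                u, v = base + j, base + num_joints + j
                adj[u, v] = adj[v, u] = 1.0
    return adj
```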
After the space-time graph model is completed, features meaningful for behavior recognition are extracted from it; this is the task of the behavior recognition module. Feature extraction in this module uses the 3D convolutional neural network C3D. Unlike a conventional 2D convolutional neural network, 3D convolution captures the spatial and temporal information of images simultaneously, which makes C3D well suited to processing the space-time graph model. C3D consists of an input layer, convolution layers, pooling layers and fully connected layers. In the convolution layers, convolution operations capture spatio-temporal information, identifying subtle changes of behavior over time as well as spatially complex structures. To further enhance the network's recognition capability, the invention also cascades multiple C3D networks, which yields more spatio-temporal features, enlarges the network's receptive field and captures longer temporal information.
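A reduced C3D-style network illustrates how stacked 3D convolutions pool over space and time before a linear layer maps the features to behavior classes. This is a toy stand-in for the cascaded C3D networks described above; the layer widths, clip size and two-class output are assumptions:

```python
import torch
import torch.nn as nn

class MiniC3D(nn.Module):
    """Reduced C3D-style network: 3D convolutions capture spatio-temporal
    patterns, pooling shrinks space and time, a linear head classifies."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),    # pool space only at first
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),            # pool time and space
            nn.AdaptiveAvgPool3d(1))
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clips):                       # clips: B x 3 x T x H x W
        feats = self.features(clips).flatten(1)
        return self.classifier(feats)

# a batch of two 16-frame clips rendered from the space-time graph
logits = MiniC3D()(torch.randn(2, 3, 16, 112, 112))
```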
After feature extraction, the features are input into a classifier whose function is to map complex spatio-temporal features onto specific behavior classes. Receiving the output of C3D, it finally generates a prediction of abnormal behavior through a series of transformations and mappings, realizing the complete conversion from raw video data to specific behavior categories in subway monitoring.
The method of the invention uses the server in the subway scene to acquire and process the surveillance video in real time. Unlike traditional processing approaches, the method adopts a Faster RCNN model dedicated to detecting targets in images quickly and accurately. As each video frame is decoded, it is continuously fed into this model, generating accurate image blocks containing pedestrian targets.
But merely detecting pedestrians is not sufficient; the next key step is pose estimation. An improved HRNet network is employed here. The network not only detects pedestrian key points in each frame but also handles the occlusion and complex-background problems common in subway scenes. To further enhance robustness, a predictive supplement calculation over adjacent interval frames fills in missing skeletal key points, which is particularly important for complex public scenes such as subway stations.
The key points obtained in this way provide the basis for constructing the space-time graph model, a graph model capable of capturing changes in human motion. Convolution and classification operations then judge in real time whether abnormal behavior is present in the subway surveillance video.
The experiments used Ubuntu 20.04.1 on a 64-bit operating system with an NVIDIA RTX 3090 Ti graphics card; the environment was Python 3.9 with the PyTorch 1.11.0 deep learning framework as the experimental platform.
Target detection bounding boxes are generated using a COCO pre-trained model with a Faster RCNN threshold of 0.1. Pose estimation uses the HRNet model, trained for 300 rounds with a learning rate of 0.001 and a batch size of 12, starting from the MPII pre-trained model; the average key-point detection accuracy PCKh (an index measuring key-point accuracy) over all body parts improves by 2.09%. Behavior recognition uses the STGCN model trained on the JHMDB dataset, starting from a Kinetics-400 pre-trained model, with video frames uniformly resized to 256×256 pixels; training uses a learning rate of 0.01 and a batch size of 16 for 150 iterations in total, and the overall accuracy improves by about 2%.
In summary, the invention comprehensively applies deep learning and pose estimation technology and improves the speed of behavior recognition through the frame-skipping detection strategy. Under occlusion and in complex backgrounds, key-point detection accuracy is maintained through the improved HRNet network and the predictive supplement algorithm over adjacent-frame key points.

Claims (6)

1. A pedestrian abnormal behavior video identification method based on pose estimation, characterized by comprising the following steps:
S1, extracting continuous image frames F_i from the video stream and inputting them into a pre-trained target detection model to obtain image blocks H_ij containing pedestrian targets; where i is the index of the image frame and j is the index of an image block within the i-th frame; i = 1, 2, 3, …, j = 1, 2, 3, …;
S2, for each image block H_ij, performing frame-skipping pose estimation with an improved HRNet network to obtain a plurality of human skeletal key points G_ij;
S3, for the skeletal key points G_ij, filling in missing skeletal key points by predictive supplement calculation from the skeletal key points of the j-th image block H_(i-2)j of the (i-2)-th frame and the j-th image block H_(i+2)j of the (i+2)-th frame, thereby obtaining complete skeletal key points G'_ij;
S4, constructing a space-time graph model from the complete skeletal key points G'_ij and inputting it into an existing behavior recognition module to recognize abnormal behaviors.
2. The pedestrian abnormal behavior video identification method based on pose estimation according to claim 1, wherein:
in step S1, extracting continuous image frames F_i from the video stream and inputting them into a pre-trained target detection model to obtain image blocks H_ij containing pedestrian targets specifically comprises the following steps:
S101, extracting continuous image frames F_i from the video stream; the video stream is received by the server from the camera and processed by an encoder and decoder, converting the raw video data into continuous digital image frames, which are preprocessed by resizing, normalization, brightness and contrast adjustment and enhancement algorithms to ensure image quality, so that subsequent feature extraction is more effective and accurate;
S102, performing feature extraction with the pre-trained target detection model: the image tensor of each video frame is input into the model to obtain a feature representation of the frame; this representation is a high-dimensional data structure that supplies abundant useful information to the region proposal network for subsequent target detection;
S103, passing the feature representation to the region proposal network, which generates candidate target regions by sliding a window over the feature map; candidate boxes of different sizes and aspect ratios are generated for each position;
S104, sending the candidate boxes into another network module, which performs target classification and bounding-box regression on each candidate box; target classification uses a convolutional neural network structure to classify the features in each candidate box, and a classifier judges whether each candidate box contains a target of interest;
S105, fine-tuning the candidate boxes classified as targets of interest with a regressor to ensure that they accurately capture the target positions, finally obtaining the image blocks H_ij containing pedestrian targets.
3. The pedestrian abnormal behavior video identification method based on pose estimation according to claim 1, wherein: in step S2, the 3×3 convolution kernel of the Bottleneck structure in the initial stage of the HRNet network is replaced by a residual unit; the improvement of the Bottleneck module is specifically as follows:
the Bottleneck module includes three processing stages: dimension reduction with a 1×1 convolution kernel, feature extraction with a 3×3 convolution kernel, and dimension increase with a 1×1 convolution kernel;
after the 1×1 convolution of the input data, the original 3×3 convolution is replaced by a residual unit: the data is divided into k sub-features, the input sub-features are defined as R_k, and each input feature map has the same size;
the output C_k (k = 1, 2, 3, …) is expressed as follows:
C_k = W_k = Conv3×3(R_k + W_(k-1)), with W_0 = 0,
where W_k represents the output of a sub-feature after the 3×3 convolution kernel; the input sub-feature R_k is added to W_(k-1) before being fed to the convolution that produces W_k, which reduces parameters as the number of input sub-features grows; C_(k-1) is the previous output, and W_(k-1) is the output of the previous input sub-feature R_(k-1) after its 3×3 convolution kernel;
the residual blocks of this hierarchy are used to extract hybrid features as input to subsequent layers, in order to analyze and extract the information contained in the hybrid features more deeply.
4. The pedestrian abnormal behavior video identification method based on pose estimation according to claim 1, wherein: in step S3, filling in missing skeletal key points of G_ij by predictive supplement calculation from the skeletal key points of the j-th image block H_(i-2)j of the (i-2)-th frame and the j-th image block H_(i+2)j of the (i+2)-th frame, thereby obtaining complete skeletal key points G'_ij, specifically comprises the following steps:
S301, using (x_(i-2), y_(i-2)) to denote the coordinates of the skeletal key point G_(i-2)j in the image block H_(i-2)j, i.e. the corresponding key point in the preceding interval frame used for prediction and supplementation, and (x_(i+2), y_(i+2)) to denote the coordinates of the skeletal key point G_(i+2)j in the image block H_(i+2)j, i.e. the corresponding key point in the following interval frame;
S302, completing the coordinates (x_i, y_i) of a missing skeletal key point in the current frame's image block with the prediction supplement module, using the information of the preceding and following frames; the prediction supplement formula is:
(x_i, y_i) = ((x_(i-2) + x_(i+2))/2, (y_(i-2) + y_(i+2))/2).
5. The pedestrian abnormal behavior video identification method based on pose estimation according to claim 1, wherein: in step S4, a space-time graph model is constructed from the skeletal key point coordinates of the body parts selected in step S3, according to the connectivity and temporal relationships of the human skeletal key points; these key points are selected according to the structure of the human skeleton and represent the basic units of human action;
there are two types of edges in the space-time graph model: spatial edges and temporal edges; a spatial edge connects different skeletal key points G_ij within each frame's image block H_ij in a prescribed order to capture the posture information at each moment; a temporal edge connects the same skeletal key point G_ij across the image blocks of the interval frames, linking the temporal and spatial information of the behavior sequence to capture the structure and evolution of human actions.
6. The pedestrian abnormal behavior video identification method based on pose estimation according to claim 5, wherein: the number of skeletal key points is 15.
CN202410211689.6A 2024-02-27 2024-02-27 Pedestrian abnormal behavior video identification method based on pose estimation Active CN117789255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410211689.6A CN117789255B (en) 2024-02-27 2024-02-27 Pedestrian abnormal behavior video identification method based on pose estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410211689.6A CN117789255B (en) 2024-02-27 2024-02-27 Pedestrian abnormal behavior video identification method based on pose estimation

Publications (2)

Publication Number Publication Date
CN117789255A (en) 2024-03-29
CN117789255B CN117789255B (en) 2024-06-11

Family

ID=90389530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410211689.6A Active CN117789255B (en) 2024-02-27 2024-02-27 Pedestrian abnormal behavior video identification method based on pose estimation

Country Status (1)

Country Link
CN (1) CN117789255B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114155610A (en) * 2021-12-09 2022-03-08 中国矿业大学 Panel assembly key action identification method based on upper half body posture estimation
CN114399838A (en) * 2022-01-18 2022-04-26 深圳市广联智通科技有限公司 Multi-person behavior recognition method and system based on attitude estimation and double classification
CN116152747A (en) * 2023-04-19 2023-05-23 南京源心教育科技有限公司 Human behavior intention recognition method based on appearance recognition and action modeling
CN116645721A (en) * 2023-04-26 2023-08-25 贵州大学 Sitting posture identification method and system based on deep learning
CN117173792A (en) * 2023-10-24 2023-12-05 长讯通信服务有限公司 Multi-person gait recognition system based on three-dimensional human skeleton
CN117392093A (en) * 2023-10-25 2024-01-12 重庆理工大学 Breast ultrasound medical image segmentation algorithm based on global multi-scale residual U-HRNet network
CN117437691A (en) * 2023-10-31 2024-01-23 上海大学 Real-time multi-person abnormal behavior identification method and system based on lightweight network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LUO Mengshi et al.: "Human Pose Estimation with a High-Resolution Network Incorporating Dual Attention", Computer Engineering, 28 February 2022 (2022-02-28) *

Also Published As

Publication number Publication date
CN117789255B (en) 2024-06-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant