CN117789255A - Pedestrian abnormal behavior video identification method based on pose estimation - Google Patents
- Publication number: CN117789255A
- Application number: CN202410211689.6A
- Authority: CN (China)
- Prior art keywords: image, frame, pedestrian, key points, feature
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Landscapes
- Image Analysis (AREA)
Abstract
The invention discloses a pedestrian abnormal behavior video identification method based on pose estimation, comprising the following steps: extract continuous image frames from the video stream and input them into a target detection model to obtain image blocks containing pedestrian targets; for each image block, perform frame-skipping pose estimation with an improved HRNet network to obtain a set of human skeleton key points; fill in the missing skeleton key points by a prediction-and-supplement calculation using the skeleton key points of the j-th image block of the (i-2)-th frame and the j-th image block of the (i+2)-th frame, obtaining complete skeleton key points; construct a spatio-temporal graph model from the complete skeleton key points and input it into an existing behavior recognition module to recognize abnormal behaviors. By extracting key points through multiple features, the method can accurately identify the body posture and actions of pedestrians, effectively improving subway safety, and it maintains key-point detection accuracy under occlusion and in complex background environments.
Description
Technical Field
The invention relates to the technical field of pedestrian detection and tracking, and in particular to a pedestrian abnormal behavior video identification method based on pose estimation.
Background
Pedestrian abnormal behavior identification is an important topic in the field of computer vision, and the technology plays a key role particularly in safety systems such as video surveillance. With the growth of public places and rising safety requirements, demand for technology in this field is also steadily increasing. Especially in social venues, business districts, and traffic nodes such as subway stations, timely detection and early warning of abnormal behaviors is critical and can effectively avoid or reduce safety accidents.
The goal of pedestrian abnormal behavior recognition is to distinguish and warn of abnormal behaviors such as falling, fighting, and attacking by analyzing and understanding pedestrian behavior. These activities may injure others or damage facilities. The main challenge in this field is defining what counts as an "anomaly", because people's understanding and definition of "anomaly" differ across cultures and contexts; combined with the influence of environmental factors, this makes the recognition of anomalous behavior a complex task.
Pedestrian abnormal behavior recognition based on pose estimation is a relatively new research direction that uses pose estimation technology to analyze and understand pedestrian behavior more accurately and deeply. Moreover, with the rapid development of deep learning in recent years, the multilayer structure of neural networks makes it possible to analyze a pedestrian's behavior pattern more deeply and even predict the pedestrian's next action. Pose estimation here refers to inferring the three-dimensional posture of the human body from an image or video, including information such as joint positions and rotations. Compared with traditional image analysis methods, this technique provides richer and more accurate behavioral features, which is critical for abnormal behavior identification in complex environments.
Research on behavior recognition dates back to 1975, when Johansson showed experimentally that human motion can be effectively described by observing and analyzing the movement of the main joints. This principle provided the basis for many subsequent studies. He then devised a human model consisting of 12 points; by tracking the combination and movement of these key points, recognition of a person's behavior could be achieved. This model also laid the foundation for human body tracking in the field of computer vision.
Sukthanker et al. built further on this basis and proposed a hierarchical mean-shift algorithm that combines the 3D spatio-temporal information of silhouettes with optical flow to model human actions, enabling recognition of simple behaviors. For more complex behavior recognition, Krizhevsky et al. proposed a Markov logic network that combines a Markov network with first-order logic. This network model can effectively describe the spatio-temporal relationships between sub-behaviors and thus recognize complex behaviors.
In real scenarios, however, behavior recognition must cope with various challenges, such as occlusion and illumination changes. These factors can interfere with the images in the video and make recognition difficult. To improve the expressiveness, discriminability, and robustness of behavior recognition, Laptev et al. proposed a behavior recognition method based on local interest-point trajectories. They combined the local feature detection of spatio-temporal interest points with a KLT tracker to obtain the motion trajectories of the interest points. Later, Wang et al. proposed a dense-trajectory behavior recognition method, which collects a large number of feature points in each frame and tracks them using the optical flow field; the appearance information of the trajectories and the spatio-temporal information between them are then used to express and identify behavior.
Conventional pose-estimation behavior recognition methods are generally slow because of their high computational complexity, and when a pedestrian is occluded, some key points cannot be recognized and the accuracy of the pose estimation drops, so the effect of such methods in practical applications is not ideal.
Disclosure of Invention
Aiming at the defects of prior-art pose-estimation behavior recognition methods, namely the slow recognition speed caused by high computational complexity and the reduced pose estimation accuracy when some key points of an occluded pedestrian cannot be recognized, the invention provides a pedestrian abnormal behavior video identification method based on pose estimation that can accurately identify the body posture and actions of a pedestrian.
In order to solve the technical problems, the invention adopts the following technical scheme:
The invention provides a pedestrian abnormal behavior video identification method based on pose estimation, which comprises the following steps:
S1. Extract continuous image frames F_i from the video stream and input them into a pre-trained target detection model to obtain image blocks H_ij containing pedestrian targets, where i is the index of the image frame and j is the index of an image block within the i-th frame; i = 1, 2, 3, …, j = 1, 2, 3, ….
S2. For each image block H_ij, perform frame-skipping pose estimation with an improved HRNet network to obtain several skeleton key points G_ij of the human body.
S3. For the skeleton key points G_ij, fill in the missing skeleton key points by the prediction-and-supplement calculation method, using the skeleton key points of the j-th image block H_(i-2)j of the (i-2)-th frame and the j-th image block H_(i+2)j of the (i+2)-th frame, to obtain the complete skeleton key points G'_ij.
S4. Construct a spatio-temporal graph model from the complete skeleton key points G'_ij and input it into an existing behavior recognition module to recognize abnormal behaviors.
In step S1, extracting continuous image frames F_i from the video stream and inputting them into a pre-trained target detection model to obtain image blocks H_ij containing pedestrian targets specifically comprises:
S101. Extract continuous image frames F_i from the video stream. The video stream is received by the server from the camera and processed by the codec, which converts the raw video data into continuous digital image frames. The frames are preprocessed by resizing, normalization, brightness and contrast adjustment, and enhancement algorithms to ensure image quality, so that subsequent feature extraction is more effective and accurate.
S102. Perform feature extraction with the pre-trained target detection model: input the image tensor of each video frame into the model to obtain a feature representation of the frame. The feature representation is a high-dimensional data structure that provides a large amount of useful information to the region proposal network for subsequent target detection.
S103. Pass the feature representation to the region proposal network, which generates candidate target regions by sliding a window over the feature map; candidate boxes of different sizes and aspect ratios are generated for each position.
S104. Send the candidate boxes into a second network module that performs target classification and bounding-box regression on each candidate box. Target classification uses a convolutional neural network to classify the features in each candidate box, and a classifier judges whether each candidate box contains a target of interest.
S105. Fine-tune the candidate boxes classified as targets of interest with a regressor to ensure they accurately capture the target positions, finally obtaining the image blocks H_ij containing pedestrian targets.
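The sliding-window candidate-box generation in S103 can be sketched as follows; the concrete scales, aspect ratios, and stride are illustrative assumptions, since the patent does not state numeric values:

```python
# Sketch of sliding-window anchor generation (step S103): for every position on
# the feature map, candidate boxes of several sizes and aspect ratios are
# emitted. The scales/ratios/stride below are assumed, not from the patent.

def generate_anchors(feat_w, feat_h, stride=16,
                     scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # center of this feature-map cell in input-image coordinates
            cx, cy = x * stride + stride / 2, y * stride + stride / 2
            for s in scales:
                for r in ratios:
                    w, h = s * (r ** 0.5), s / (r ** 0.5)  # area s^2, aspect r
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors

anchors = generate_anchors(feat_w=4, feat_h=3)
```

Each of the 4×3 positions yields 3×3 = 9 candidate boxes, which are then passed to the classification and regression heads of S104–S105.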
In step S2, the 3×3 convolution kernel of the Bottleneck structure in the initial stage of the HRNet network is replaced by a residual unit. The improvement of the Bottleneck module is specifically as follows:
The Bottleneck module includes three processing stages: dimension reduction with a 1×1 convolution kernel, feature extraction with a 3×3 convolution kernel, and dimension increase with a 1×1 convolution kernel.
After the 1×1 convolution is applied to the input data, the original 3×3 convolution is replaced by a residual unit: the data is split into k sub-features, the input sub-features are denoted R_k, and each input feature map has the same size.
The output C_k (k = 1, 2, 3, …) is expressed as follows:
C_k = R_k, k = 1
C_k = W_k(R_k + C_{k-1}), k > 1
where W_k(·) denotes the 3×3 convolution of the k-th branch: the input sub-feature R_k is added to the previous output C_{k-1} and the sum is fed into W_k, which reduces parameters as the number of input sub-features increases; C_{k-1} is the output of the previous branch, i.e., the result of the previous input sub-feature R_{k-1} after its 3×3 convolution.
The residual blocks of this hierarchy are used to extract hybrid features as input to subsequent layers, so that the information contained in the hybrid features can be analyzed and extracted more deeply.
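A toy sketch of this hierarchical residual split (a Res2Net-style cascade, which is a consistent reading of the description): the first split passes through unchanged, and each later split is added to the previous branch's output before its convolution. The per-branch 3×3 convolution is replaced here by a hypothetical stand-in function, since the trained kernels are not part of the text:

```python
# Toy sketch of the hierarchical residual split: C_1 = R_1,
# C_k = conv(R_k + C_{k-1}) for k > 1. conv() is a stand-in for the
# per-branch 3x3 convolution of the improved Bottleneck module.

def hierarchical_residual(sub_features, conv):
    """sub_features: list of k input splits R_1..R_k (1-D lists here).
    conv: stand-in for the per-branch 3x3 convolution."""
    outputs = []
    prev = None
    for k, r in enumerate(sub_features):
        if k == 0:
            out = r                                           # C_1 = R_1
        else:
            out = conv([a + b for a, b in zip(r, prev)])      # conv(R_k + C_{k-1})
        outputs.append(out)
        prev = out
    return outputs                                            # C_1..C_k

# Example: three 1-D "sub-feature maps" and a toy conv that halves values.
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outs = hierarchical_residual(feats, lambda v: [0.5 * x for x in v])
```

Each output C_k mixes information from all earlier splits, which is what gives the unit its multiple effective receptive fields.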
In step S3, filling in the missing skeleton key points G_ij by the prediction-and-supplement calculation method, using the skeleton key points of the j-th image block H_(i-2)j of the (i-2)-th frame and the j-th image block H_(i+2)j of the (i+2)-th frame, to obtain the complete skeleton key points G'_ij specifically comprises:
S301. Use (x_{i-2}, y_{i-2}) to denote the coordinates of the skeleton key point G_(i-2)j in image block H_(i-2)j, i.e., the corresponding key point in the preceding skipped frame used for prediction and supplementation; use (x_{i+2}, y_{i+2}) to denote the coordinates of the skeleton key point G_(i+2)j in image block H_(i+2)j, i.e., the corresponding key point in the following skipped frame.
S302. Using the information of the preceding and following frames, the coordinates (x_i, y_i) of an occluded skeleton key point in the current frame's image block are completed by the prediction-supplement module, whose calculation formula is:
(x_i, y_i) = ((x_{i-2} + x_{i+2})/2, (y_{i-2} + y_{i+2})/2).
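Reading the supplement step as midpoint interpolation of the same key point between frames i−2 and i+2, a minimal sketch (the None-for-missing convention is an assumption for illustration):

```python
# Sketch of prediction-and-supplement: a key point missing in frame i is
# filled with the midpoint of the same key point in frames i-2 and i+2.

def complete_keypoints(kps_prev, kps_cur, kps_next):
    """Each argument: list of (x, y) tuples, or None for an undetected point.
    kps_prev/kps_next come from frames i-2 and i+2; kps_cur from frame i."""
    completed = []
    for prev, cur, nxt in zip(kps_prev, kps_cur, kps_next):
        if cur is None and prev is not None and nxt is not None:
            cur = ((prev[0] + nxt[0]) / 2, (prev[1] + nxt[1]) / 2)
        completed.append(cur)
    return completed

prev_f = [(10.0, 20.0), (30.0, 40.0)]
cur_f  = [(11.0, 21.0), None]          # second key point occluded in frame i
next_f = [(12.0, 22.0), (34.0, 48.0)]
filled = complete_keypoints(prev_f, cur_f, next_f)
```

A key point still missing in both neighbouring frames stays missing, which matches the text's reliance on the preceding and following frames being available.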
In step S4, a spatio-temporal graph model is constructed from the coordinates of several skeleton key points of the body obtained in step S3, according to the connectivity and temporal ordering of human skeleton key points. These key points are selected according to the structure of the human skeleton and represent the basic units of human motion.
There are two types of edges in the spatio-temporal graph model: spatial edges and temporal edges. A spatial edge connects the different skeleton key points G_ij within each frame's image block H_ij in a prescribed order, capturing the posture information at each moment. A temporal edge connects the same skeleton key point G_ij across the image blocks of successive frames, linking the temporal and spatial information of the behavior sequence and capturing the structure and evolution of human actions.
The number of skeleton key points is 15.
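The two edge types can be sketched as follows; since the text does not spell out the 15-joint connection order, a small 5-joint toy skeleton stands in for the real topology:

```python
# Sketch of the spatio-temporal graph: spatial edges link key points within a
# frame along a skeleton topology; temporal edges link the same key point
# across consecutive sampled frames. SKELETON is a hypothetical toy topology,
# not the patent's 15-joint layout.

SKELETON = [(0, 1), (1, 2), (1, 3), (1, 4)]   # assumed toy joint connections
NUM_JOINTS = 5

def build_st_graph(num_frames):
    # spatial edges: within each frame t, along the skeleton
    spatial = [(t, a, t, b) for t in range(num_frames) for a, b in SKELETON]
    # temporal edges: same joint j between frame t and frame t+1
    temporal = [(t, j, t + 1, j)
                for t in range(num_frames - 1) for j in range(NUM_JOINTS)]
    return spatial, temporal              # edges as (frame, joint, frame, joint)

spatial, temporal = build_st_graph(num_frames=3)
```

With the patent's 15 key points the same construction applies, only with a larger SKELETON list and NUM_JOINTS = 15.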
Compared with the prior art, the invention has the following beneficial technical effects and advantages:
1. The invention provides a pedestrian abnormal behavior video identification method based on pose estimation. Unlike traditional processing, the server acquires and processes the surveillance video in real time, and a Faster RCNN model rapidly and accurately detects targets in the images; as each video frame is decoded, it is continuously fed into the model to generate accurate image blocks containing pedestrian targets. With the improved HRNet network, key points of pedestrians can be detected in each frame, and the occlusion and complex-background problems common in practical applications can be handled.
2. To further enhance robustness, the invention fills in missing skeleton key points with a prediction-and-supplement calculation over adjacent frame images, which is particularly important in complex public scenes such as traffic congestion and large-scale events. The obtained key points provide the basis for constructing a spatio-temporal graph model, a graph model capable of capturing changes in human motion, on which convolution and classification operations judge in real time whether abnormal behaviors exist in the video. By extracting key points through multiple features, the body posture and actions of pedestrians can be accurately identified.
3. The invention improves the detection strategy of pose estimation, changing single-frame detection into frame-skipping detection, which greatly increases the calculation speed of pose estimation. The system also provides a real-time alarm function: once an abnormal behavior is detected, an alarm is raised immediately so that relevant personnel can handle it in time, effectively improving subway safety. Compared with traditional behavior recognition methods, the method is therefore faster.
4. The invention comprehensively applies deep learning and pose estimation, improves the speed of behavior recognition through frame-skipping detection, and maintains key-point detection accuracy under occlusion and complex backgrounds through the improved HRNet network and the adjacent-frame key-point prediction-and-supplement algorithm.
Drawings
FIG. 1 is a flow chart of the pedestrian abnormal behavior video identification method based on pose estimation;
FIG. 2 is a block diagram of a residual unit modified in the method of the present invention;
FIG. 3 is a space-time diagram model diagram in the method of the invention;
FIG. 4 is a flow chart of a behavior recognition module in the method of the present invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings, but the scope of the present invention is not limited by the accompanying drawings.
In the field of abnormal behavior recognition, conventional methods are often limited by complex backgrounds and occlusion, so their effect in practical applications is not ideal. To solve these problems, the invention provides a pedestrian abnormal behavior video identification method based on pose estimation which, as shown in FIG. 1, comprises the following steps:
S1. Extract continuous image frames F_i from the video stream and input them into a pre-trained target detection model to obtain image blocks H_ij containing pedestrian targets, where i is the index of the image frame and j is the index of an image block within the i-th frame; i = 1, 2, 3, …, j = 1, 2, 3, ….
S2. For each image block H_ij, perform frame-skipping pose estimation with an improved HRNet network to obtain several skeleton key points G_ij of the human body.
S3. For the skeleton key points G_ij, fill in the missing skeleton key points by the prediction-and-supplement calculation method, using the skeleton key points of the j-th image block H_(i-2)j of the (i-2)-th frame and the j-th image block H_(i+2)j of the (i+2)-th frame, to obtain the complete skeleton key points G'_ij.
S4. Construct a spatio-temporal graph model from the complete skeleton key points G'_ij and input it into an existing behavior recognition module to recognize abnormal behaviors.
Firstly, the server in the subway scene decodes the video stream acquired from the deployed cameras into continuous image frames, then passes them to the target detection module for feature extraction to obtain image blocks H_ij containing pedestrian targets, and then performs pose estimation on the image blocks output by the target detection module. In this process, the coordinates of 15 human key points are first extracted through multiple features, and missing key points are then completed by prediction using the preceding and following skipped frames. The image blocks with pose estimates are input into an existing behavior recognition module, which constructs the spatio-temporal graph model over the key points and performs convolution and classification to judge whether the pedestrian exhibits abnormal behavior. If abnormal behavior exists, a local host connected to the server in the subway scene raises an alarm.
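One consistent reading of the frame-skipping strategy (pose estimated on alternate frames, so that for any estimated frame i the frames i−2 and i+2 are also estimated and available for the prediction-and-supplement step) can be sketched as a simple scheduler; the skip interval of 2 is inferred from the H_(i-2)j / H_(i+2)j indices:

```python
# Sketch of the frame-skipping schedule: pose estimation runs on every other
# frame. A missing key point in an estimated frame i can then draw on frames
# i-2 and i+2, which share its parity and are therefore also estimated.

def estimation_schedule(num_frames, skip=2):
    """Return the 1-based frame indices on which pose estimation runs."""
    return list(range(1, num_frames + 1, skip))

frames = estimation_schedule(num_frames=8)
# For any estimated frame except the first, frame i-2 is also estimated,
# so the prediction-supplement step has its preceding neighbour available.
```

Halving the number of frames that go through HRNet is what yields the speedup over single-frame detection described in the advantages.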
In step S1, extracting continuous image frames F_i from the video stream and inputting them into a pre-trained target detection model to obtain image blocks H_ij containing pedestrian targets specifically comprises:
S101. Extract continuous image frames F_i from the video stream. The video stream is received by the server from the camera and processed by the codec, which converts the raw video data into continuous digital image frames. The frames are preprocessed by resizing, normalization, brightness and contrast adjustment, and enhancement algorithms to ensure image quality, so that subsequent feature extraction is more effective and accurate.
S102. Perform feature extraction with the pre-trained target detection model (a Faster RCNN model in this embodiment): input the image tensor of each video frame into the model to obtain a feature representation of the frame. The feature representation is a high-dimensional data structure that provides a large amount of useful information to the region proposal network for subsequent target detection.
S103. Pass the feature representation to the region proposal network, which generates candidate target regions by sliding a window over the feature map; candidate boxes of different sizes and aspect ratios are generated for each position.
S104. Send the candidate boxes into a second network module that performs target classification and bounding-box regression on each candidate box. Target classification uses a convolutional neural network to classify the features in each candidate box, and a classifier judges whether each candidate box contains a target of interest.
S105. Fine-tune the candidate boxes classified as targets of interest with a regressor to ensure they accurately capture the target positions, finally obtaining the image blocks H_ij containing pedestrian targets.
The server side in the subway scene first begins receiving the video stream from the camera; the data captured by the camera is processed by the codec, which converts the raw video data into continuous digital image frames in preparation for further analysis and processing. The image frames input to the server first undergo a series of preprocessing steps, including resizing, normalization, brightness and contrast adjustment, and other enhancement algorithms, to ensure image quality so that subsequent feature extraction is more efficient and accurate. The preprocessed image frames are then input into the pre-trained Faster RCNN model, which in this embodiment mainly uses the VGG16 network for feature extraction. VGG16 is a 16-layer deep neural network that has demonstrated strong performance in the field of image recognition. After the image tensor of a video frame is input into the VGG16 network, a feature representation of the frame is obtained. These feature representations form a high-dimensional data structure that provides a large amount of useful information for subsequent target detection. The features are passed into the Region Proposal Network (RPN), one of the core components of Faster RCNN, which is responsible for generating candidate target regions. Sliding a window over the feature map, the RPN generates a number of candidate boxes of different sizes and aspect ratios for each position.
These candidate boxes are then sent to another network module responsible for target classification and bounding-box regression. Target classification uses a specific convolutional neural network structure to classify the features in each candidate box, and this classifier determines whether each candidate box contains a target of interest. Candidate boxes classified as targets are then fine-tuned with a regressor to ensure the boxes capture the target's position as accurately as possible.
Through all these stages, the image blocks H_ij containing pedestrian targets in the subway scene are finally obtained.
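How well a refined candidate box captures a pedestrian is commonly measured by intersection-over-union; a minimal helper (a standard metric, shown only to make the box-regression goal concrete — it is not part of the patent text):

```python
# Intersection-over-union between two boxes (x1, y1, x2, y2): the standard
# measure of how well a refined candidate box overlaps a target region.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

score = iou((0, 0, 10, 10), (5, 0, 15, 10))       # half-overlapping boxes
```

The regressor in S105 effectively pushes this overlap toward 1 for boxes that contain a pedestrian.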
In step S2, the 3×3 convolution kernel of the Bottleneck structure in the initial stage of the HRNet network is replaced with a residual unit (whose structure is shown in FIG. 2); the Bottleneck module is modified accordingly.
Step S2 first uses the human detection network Faster RCNN from step S1 to detect the image blocks H_ij containing pedestrian targets in the subway scene, and then performs frame-skipping pose estimation on the detected image blocks H_ij with the pose estimation network. The improved HRNet pose estimation network is a top-down algorithm, and the HRNet network structure comprises an initial stage, a multi-scale parallel stage, and a final stage.
In the initial stage, the network is based on a smaller high-resolution subnetwork, typically borrowed from the initial stage of ResNet. In this scheme, the convolution kernel of the original Bottleneck structure is replaced by a residual unit with multiple receptive fields, which enables finer multi-scale representation and captures richer image information.
Next is the multi-scale parallel stage. Here the network splits into several sub-networks running in parallel, each operating at a different resolution and exchanging information with the others, forming a network system that works in concert. This arrangement allows the network to capture image features at different levels and thus obtain more comprehensive and diversified feature information. These sub-networks enter a multi-stage interaction process; by interacting and cooperating across multiple stages, the feature representation becomes more complete, and the recognition and analysis capability of the network is enhanced.
Finally comes the fusion stage. Here all sub-networks aggregate their outputs to form a high-quality composite output. This output is not a simple combination but a deep fusion process: by integrating feature information at multiple scales, it provides a rich, high-quality feature input for the final pose estimation task.
Replacing the 3×3 convolution kernel in the initial-stage Bottleneck structure with a residual unit and obtaining granularity-level multi-scale representations with multiple receptive fields constitutes the improvement to the HRNet network. Pose estimation based on the improved HRNet network comprises the following steps:
s201, in the initial stage, in order to extract low-level features, the HRNet extracts features of an input image through convolution operation, specifically, two convolution layers and four improved Bottleneck modules are sequentially used for the convolution operation, and finally downsampling is performed to reduce the space size of a feature map.
The improvement to the Bottleneck module includes three processing stages: the dimension reduction of the 1×1 convolution kernel, the extraction of the features of the 3×3 convolution kernel and the dimension increase of the 1×1 convolution kernel are specifically as follows:
the Bottleneck module is a classical residual module consisting of three main parts: the 1 x 1 convolution kernel reduces the dimension, the 3 x3 convolution kernel extracts the features, and the 1 x 1 convolution kernel increases the dimension. In the improved process, after 1×1 convolution is performed on the input data, the original 3×3 convolution is replaced by a residual unit, and the data is divided intokSub-features, defined asR k Each of the outputsThe in-profile has the same dimensions. Output ofC k (k=1, 2,3 …) is expressed as follows:
C_k = R_k for k = 1; C_k = W_k(R_k + C_{k-1}) for k = 2, 3, …
where W_k denotes the 3×3 convolution kernel of the k-th branch and its output: the input sub-feature R_k is added to the previous output C_{k-1} and fed through W_k, which reduces the number of parameters as the number of input sub-features increases; C_{k-1} is the previous output, and W_{k-1} is the output of the previous input sub-feature R_{k-1} after its 3×3 convolution kernel;
the residual blocks of this hierarchy are used to extract hybrid features as input to subsequent layers in order to more deeply analyze and extract the information contained in these hybrid features. The output of each residual unit contains a combination of information of different scales, which makes more efficient use of global information and local information.
S202: in the multi-scale parallel stage, HRNet maintains feature maps at multiple resolutions and processes them in parallel; each resolution has its own branch, sub-modules further extract and fuse features, and information is exchanged and fused between feature maps of different resolutions.
S203: in the final stage, after the multi-scale parallel stage, HRNet upsamples the feature maps to restore the resolution of the original image, extracts a high-resolution feature representation through convolution, and finally feeds these features to the output layer.
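The final-stage recovery of a high-resolution representation can be illustrated with a simplified fusion step; this is a sketch under the assumption of plain bilinear upsampling and channel concatenation, not HRNet's exact exchange units:

```python
import torch
import torch.nn.functional as F

def fuse_multiscale(feats):
    """Upsample every branch's feature map to the highest resolution and
    concatenate along channels, yielding one high-resolution representation
    for the pose head."""
    h, w = feats[0].shape[-2:]               # highest-resolution branch
    up = [F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
          for f in feats]
    return torch.cat(up, dim=1)              # channels from all branches
```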
In step S3, for the skeletal keypoints G_ij, missing keypoints are filled in from the skeletal keypoints of the j-th image block H_(i-2)j of frame i-2 and the j-th image block H_(i+2)j of frame i+2, using a predictive supplement calculation, to obtain the complete skeletal keypoints G'_ij. Specifically:
In complex scenes such as subway stations, existing top-down multi-person pose estimation methods can only estimate and recognize one pose at a time, which causes some keypoints to be missed or falsely detected and greatly reduces the accuracy of the pose estimation result. The method for predicting and supplementing the missing keypoints comprises the following steps:
S301: let (x_{i-2}, y_{i-2}) denote the coordinates of the skeletal keypoint G_(i-2)j in image block H_(i-2)j, i.e., the corresponding keypoint in the preceding frame-separated image used for prediction and supplement; let (x_{i+2}, y_{i+2}) denote the coordinates of the skeletal keypoint G_(i+2)j in image block H_(i+2)j, i.e., the corresponding keypoint in the following frame-separated image;
S302: using the information of the preceding and following frames, the coordinates (x_i, y_i) of the skeletal keypoint at the occluded position in the current frame are completed by a prediction supplement module, whose calculation formula is:
(x_i, y_i) = ((x_{i-2} + x_{i+2})/2, (y_{i-2} + y_{i+2})/2).
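A minimal sketch of this prediction supplement: the missing keypoint in frame i is taken as the midpoint of the matching keypoint in frames i-2 and i+2, averaging the x and y coordinates separately:

```python
def fill_missing_keypoint(kp_prev, kp_next):
    """Predictive supplement for a missed skeletal keypoint in frame i:
    midpoint of the corresponding keypoint in frames i-2 and i+2."""
    (x_prev, y_prev), (x_next, y_next) = kp_prev, kp_next
    return ((x_prev + x_next) / 2, (y_prev + y_next) / 2)
```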
In step S4, a space-time graph model (shown in fig. 3) is constructed from the complete skeletal keypoints G'_ij and input into a behavior recognition module (whose flow chart is shown in fig. 4) to recognize abnormal behaviors.
According to the connectivity and temporal relations of the human skeletal keypoints, the 15 body keypoint coordinates from step S3 are selected to construct the space-time graph model. These skeletal keypoints are chosen according to the structure of the human skeleton and represent the basic units of human motion. The space-time graph model is distinctive in that it not only considers spatial information but also fuses temporal information. Specifically, there are two types of edges in the model: spatial edges and temporal edges. A spatial edge connects the different skeletal keypoints G_ij within each frame's image block H_ij in a specific order, capturing the pose information at each moment; a temporal edge connects the same skeletal keypoint G_ij across the frame-separated image blocks H_ij, integrating the temporal and spatial information of the behavior sequence. The space-time graph model aims to capture the structure and evolution of human actions.
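The space-time graph can be sketched as an adjacency matrix over joints × frames; the 15-joint edge list below is a hypothetical skeleton topology for illustration, since the patent does not enumerate the connections:

```python
import numpy as np

# Hypothetical 15-keypoint skeleton edge list (an assumption, not from the patent).
SKELETON_EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
                  (1, 8), (8, 9), (9, 10), (8, 11), (11, 12), (8, 13), (13, 14)]

def build_st_graph(num_joints=15, num_frames=3, edges=SKELETON_EDGES):
    """Adjacency matrix of a space-time graph: spatial edges connect
    different joints within a frame; temporal edges connect the same
    joint in consecutive (frame-separated) image blocks."""
    n = num_joints * num_frames
    A = np.zeros((n, n), dtype=np.int8)
    for t in range(num_frames):
        base = t * num_joints
        for i, j in edges:                       # spatial edges within frame t
            A[base + i, base + j] = A[base + j, base + i] = 1
        if t + 1 < num_frames:                   # temporal edges to frame t+1
            for v in range(num_joints):
                A[base + v, base + num_joints + v] = 1
                A[base + num_joints + v, base + v] = 1
    return A
```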
After the space-time graph model is built, features significant for behavior recognition are extracted from it; this is the task of the behavior recognition module. Feature extraction in this module uses the 3D convolutional neural network C3D. Unlike a conventional 2D convolutional neural network, 3D convolution captures spatial and temporal information simultaneously, which makes C3D well suited to processing the space-time graph model. C3D consists of an input layer, convolution layers, pooling layers, and fully connected layers. In the convolution layers, convolution operations capture spatio-temporal information, identifying subtle changes of behavior over time as well as spatially complex structures. To further enhance recognition capability, the invention also cascades multiple C3D networks, obtaining more spatio-temporal features, enlarging the receptive field of the network, and capturing longer temporal information.
After feature extraction, these features are input into a classifier, whose function is to map complex spatio-temporal features onto specific behavior categories. After receiving the output of C3D, it applies a series of transformations and mappings to produce the abnormal-behavior prediction, completing the conversion from raw video data to specific behavior categories in subway monitoring.
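A minimal C3D-style sketch (all layer sizes and the two-class output are illustrative assumptions): 3D convolutions pool over time as well as space, and a fully connected layer then maps the features to behavior categories:

```python
import torch
import torch.nn as nn

class TinyC3D(nn.Module):
    """Minimal C3D-style network: 3D convolutions capture spatial and
    temporal information jointly; a linear classifier produces per-class
    logits for behavior recognition."""
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_ch, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),          # pool space only at first
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                  # pool time and space
            nn.AdaptiveAvgPool3d(1),          # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clip):                  # clip: (N, C, T, H, W)
        f = self.features(clip).flatten(1)
        return self.classifier(f)             # logits per behavior class
```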
The method of the invention uses a server in the subway scene to acquire and process the surveillance video in real time. Unlike conventional processing pipelines, it adopts a Faster R-CNN model dedicated to fast and accurate target detection in images. As each video frame is decoded, it is continuously fed into this model, producing accurate image blocks containing pedestrian targets.
Merely detecting pedestrians is not sufficient, however; the next key step is pose estimation. An improved HRNet network is employed here. It not only detects the keypoints of pedestrians in each frame but also handles the occlusion and complex backgrounds common in subway scenes. To further enhance robustness, a predictive supplement calculation over adjacent frame-separated images fills in missing skeletal keypoints, which is particularly important for complex public scenes such as subway stations.
The keypoints obtained in this way provide the basis for constructing the space-time graph model, a graph model that captures changes in human motion. Convolution and classification are then performed to judge in real time whether abnormal behavior is present in the subway surveillance video.
The experiments used a 64-bit Ubuntu 20.04.1 operating system, an NVIDIA GTX3090Ti graphics card, Python 3.9, and the PyTorch 1.11.0 deep learning framework as the experimental platform.
Target detection bounding boxes were generated using a COCO pre-trained model with a Faster R-CNN threshold of 0.1. Pose estimation used the HRNet model, trained for 300 rounds with a learning rate of 0.001 and a batch size of 12, aided by an MPII pre-trained model; the average keypoint detection accuracy PCKh (an index measuring keypoint accuracy) for each body part improved by 2.09%. Behavior recognition used the ST-GCN model trained on the JHMDB dataset with a Kinetics-400 pre-trained model, with video frames uniformly resized to 256×256 pixels; training used a learning rate of 0.01 and a batch size of 16 for 150 rounds in total, and overall accuracy improved by about 2%.
In summary, the invention improves the speed of behavior recognition by comprehensively applying deep learning and pose estimation techniques together with a frame-separated detection strategy. Under occlusion and complex backgrounds, keypoint detection accuracy is maintained through the improved HRNet network and the predictive supplement algorithm over adjacent-frame keypoints.
Claims (6)
1. The pedestrian abnormal behavior video identification method based on the gesture estimation is characterized by comprising the following steps of:
S1: extract continuous image frames F_i from the video stream and input them into a pre-trained target detection model to obtain image blocks H_ij containing pedestrian targets; where i is the index of the image frame and j is the index of the image block within the i-th frame image; i = 1, 2, 3, …, j = 1, 2, 3, …;
S2: perform frame-separated pose estimation on the image blocks H_ij using an improved HRNet network to obtain a plurality of skeletal keypoints G_ij of the human body;
S3: for the skeletal keypoints G_ij, fill in missing keypoints from the skeletal keypoints of the j-th image block H_(i-2)j of frame i-2 and the j-th image block H_(i+2)j of frame i+2, using a predictive supplement calculation, to obtain the complete skeletal keypoints G'_ij;
S4: construct a space-time graph model from the complete skeletal keypoints G'_ij and input it into an existing behavior recognition module to recognize abnormal behaviors.
2. The pedestrian abnormal behavior video identification method based on gesture estimation according to claim 1, wherein:
in step S1, continuous image frames F_i are extracted from the video stream and input into a pre-trained target detection model to obtain image blocks H_ij containing pedestrian targets, specifically comprising the following steps:
S101: extract continuous image frames F_i from the video stream; the video stream is received from the camera and processed by the server, the raw video data is converted into continuous digital image frames by the encoder and decoder, and image quality is ensured through resizing, normalization, brightness and contrast adjustment, and enhancement-algorithm preprocessing, making subsequent feature extraction more effective and accurate;
S102: perform feature extraction with the pre-trained target detection model; the image tensor of each video frame is input into the model to obtain a feature representation of the frame, a high-dimensional data structure that carries a large amount of useful information into the region proposal network for subsequent target detection;
S103: transmit the feature representation to the region proposal network, which slides a window over the feature map to generate candidate target regions; candidate boxes of different sizes and aspect ratios are generated at each position;
S104: send the candidate boxes into the subsequent detection network for target classification and bounding-box regression; target classification uses a convolutional neural network structure to classify the features in each candidate box, and a classifier judges whether each candidate box contains a target of interest;
S105: precisely adjust the candidate boxes classified as targets of interest using a regressor so that they accurately capture the target positions, finally obtaining the image blocks H_ij containing pedestrian targets.
3. The pedestrian abnormal behavior video identification method based on gesture estimation according to claim 1, wherein: in step S2, the 3×3 convolution kernel of the Bottleneck structure in the initial stage of the HRNet network is replaced by a residual unit; the improvement of the Bottleneck module is specifically as follows:
the Bottleneck module includes three processing stages: dimension reduction with a 1×1 convolution kernel, feature extraction with a 3×3 convolution kernel, and dimension increase with a 1×1 convolution kernel;
after the 1×1 convolution is applied to the input data, the original 3×3 convolution is replaced by a residual unit, and the data is divided into k sub-features, with the input sub-features defined as R_k; each input feature map has the same size;
the output C_k (k = 1, 2, 3, …) is expressed as follows:
C_k = R_k for k = 1; C_k = W_k(R_k + C_{k-1}) for k = 2, 3, …;
where W_k denotes the 3×3 convolution kernel of the k-th branch and its output: the input sub-feature R_k is added to the previous output C_{k-1} and fed through W_k, which reduces the number of parameters as the number of input sub-features increases; C_{k-1} is the previous output, and W_{k-1} is the output of the previous input sub-feature R_{k-1} after its 3×3 convolution kernel;
the residual blocks of the hierarchy are used to extract the hybrid features as input to subsequent layers in order to more deeply analyze and extract the information contained in the hybrid features.
4. The pedestrian abnormal behavior video identification method based on gesture estimation according to claim 1, wherein: in step S3, for the skeletal keypoints G_ij, missing keypoints are filled in from the skeletal keypoints of the j-th image block H_(i-2)j of frame i-2 and the j-th image block H_(i+2)j of frame i+2, using a predictive supplement calculation, to obtain the complete skeletal keypoints G'_ij, specifically comprising the following steps:
S301: let (x_{i-2}, y_{i-2}) denote the coordinates of the skeletal keypoint G_(i-2)j in image block H_(i-2)j, i.e., the corresponding keypoint in the preceding frame-separated image used for prediction and supplement; let (x_{i+2}, y_{i+2}) denote the coordinates of the skeletal keypoint G_(i+2)j in image block H_(i+2)j, i.e., the corresponding keypoint in the following frame-separated image;
S302: using the information of the preceding and following frames, the coordinates (x_i, y_i) of the skeletal keypoint at the occluded position in the current frame are completed by a prediction supplement module, whose calculation formula is:
(x_i, y_i) = ((x_{i-2} + x_{i+2})/2, (y_{i-2} + y_{i+2})/2).
5. The pedestrian abnormal behavior video identification method based on gesture estimation according to claim 1, wherein: in step S4, according to the connectivity and temporal relations of the human skeletal keypoints, a plurality of skeletal keypoint coordinates of body parts from step S3 are selected to construct the space-time graph model; these keypoints are selected according to the structure of the human skeleton and represent the basic units of human motion;
there are two types of edges in the space-time graph model: spatial edges and temporal edges; a spatial edge connects the different skeletal keypoints G_ij within each frame's image block H_ij in a prescribed order to capture the pose information at each moment; a temporal edge connects the same skeletal keypoint G_ij across the frame-separated image blocks H_ij, linking the temporal and spatial information of the behavior sequence together and capturing the structure and evolution of human actions.
6. The pedestrian abnormal behavior video identification method based on gesture estimation according to claim 5, wherein: the number of the skeletal key points is 15.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410211689.6A CN117789255B (en) | 2024-02-27 | 2024-02-27 | Pedestrian abnormal behavior video identification method based on attitude estimation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117789255A true CN117789255A (en) | 2024-03-29 |
CN117789255B CN117789255B (en) | 2024-06-11 |
Family
ID=90389530
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410211689.6A Active CN117789255B (en) | 2024-02-27 | 2024-02-27 | Pedestrian abnormal behavior video identification method based on attitude estimation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117789255B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114155610A (en) * | 2021-12-09 | 2022-03-08 | 中国矿业大学 | Panel assembly key action identification method based on upper half body posture estimation |
CN114399838A (en) * | 2022-01-18 | 2022-04-26 | 深圳市广联智通科技有限公司 | Multi-person behavior recognition method and system based on attitude estimation and double classification |
CN116152747A (en) * | 2023-04-19 | 2023-05-23 | 南京源心教育科技有限公司 | Human behavior intention recognition method based on appearance recognition and action modeling |
CN116645721A (en) * | 2023-04-26 | 2023-08-25 | 贵州大学 | Sitting posture identification method and system based on deep learning |
CN117173792A (en) * | 2023-10-24 | 2023-12-05 | 长讯通信服务有限公司 | Multi-person gait recognition system based on three-dimensional human skeleton |
CN117392093A (en) * | 2023-10-25 | 2024-01-12 | 重庆理工大学 | Breast ultrasound medical image segmentation algorithm based on global multi-scale residual U-HRNet network |
CN117437691A (en) * | 2023-10-31 | 2024-01-23 | 上海大学 | Real-time multi-person abnormal behavior identification method and system based on lightweight network |
Non-Patent Citations (1)
Title |
---|
Luo Mengshi et al., "High-Resolution Network Human Pose Estimation Incorporating Dual Attention", Computer Engineering, 28 February 2022 (2022-02-28) *
Also Published As
Publication number | Publication date |
---|---|
CN117789255B (en) | 2024-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108447078B (en) | Interference perception tracking algorithm based on visual saliency | |
CN111881853B (en) | Method and device for identifying abnormal behaviors in oversized bridge and tunnel | |
CN109657581A (en) | Urban track traffic gate passing control method based on binocular camera behavioral value | |
Li et al. | Real-world railway traffic detection based on faster better network | |
Zhuang et al. | Illumination and temperature-aware multispectral networks for edge-computing-enabled pedestrian detection | |
Cho et al. | Semantic segmentation with low light images by modified CycleGAN-based image enhancement | |
Tomar et al. | Crowd analysis in video surveillance: A review | |
Mittal et al. | Review of different techniques for object detection using deep learning | |
Wu et al. | Real‐time running detection system for UAV imagery based on optical flow and deep convolutional networks | |
Cao et al. | Learning spatial-temporal representation for smoke vehicle detection | |
Liang et al. | Methods of moving target detection and behavior recognition in intelligent vision monitoring. | |
CN114241379A (en) | Passenger abnormal behavior identification method, device and equipment and passenger monitoring system | |
Wang et al. | Mpanet: Multi-patch attention for infrared small target object detection | |
Ding et al. | Individual surveillance around parked aircraft at nighttime: Thermal infrared vision-based human action recognition | |
Lin | Automatic recognition of image of abnormal situation in scenic spots based on Internet of things | |
CN114067273A (en) | Night airport terminal thermal imaging remarkable human body segmentation detection method | |
Kheder et al. | Transfer learning based traffic light detection and recognition using CNN inception-V3 model | |
CN112487926A (en) | Scenic spot feeding behavior identification method based on space-time diagram convolutional network | |
Konstantinidis et al. | Skeleton-based action recognition based on deep learning and Grassmannian pyramids | |
CN116824541A (en) | Pedestrian crossing intention prediction method, model and device based on double channels | |
CN117789255B (en) | Pedestrian abnormal behavior video identification method based on attitude estimation | |
CN110929632A (en) | Complex scene-oriented vehicle target detection method and device | |
CN114419729A (en) | Behavior identification method based on light-weight double-flow network | |
Yang et al. | Locator slope calculation via deep representations based on monocular vision | |
Qu et al. | An intelligent vehicle image segmentation and quality assessment model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||