WO2023082882A1 - Pose estimation-based pedestrian fall action recognition method and device - Google Patents

Pose estimation-based pedestrian fall action recognition method and device

Info

Publication number
WO2023082882A1
WO2023082882A1 (PCT application PCT/CN2022/121935)
Authority
WO
WIPO (PCT)
Prior art keywords
graph
temporal
spatiotemporal
pose
Prior art date
Application number
PCT/CN2022/121935
Other languages
English (en)
French (fr)
Inventor
张富凯
贺天成
张海燕
Original Assignee
河南理工大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 河南理工大学 filed Critical 河南理工大学
Priority to GB2302960.6A (published as GB2616733A)
Publication of WO2023082882A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments

Definitions

  • The invention relates to the field of computers, and in particular to a pose estimation-based pedestrian fall action recognition method and device.
  • The data modalities commonly used in the behavior recognition field are mainly divided into original RGB video and series of human pose key points.
  • Original RGB video contains not only the behavior and motion information of the human body but also a great deal of background information that affects recognition accuracy, such as lighting and a cluttered surrounding environment.
  • Today's intelligent hardware is improving rapidly, and the pose estimation algorithms for obtaining human key points are becoming more and more capable in terms of real-time performance.
  • The pose information of each person in the video can be extracted with the aid of a highly robust pedestrian detection network, and the pose output results are finally packaged into the required data form.
  • The human pose coordinates extracted from the video need to be assembled into graph data, and a graph convolutional network is used for feature learning.
  • Some researchers proposed the skeleton-based spatio-temporal graph convolutional network ST-GCN for feature extraction: it performs graph convolution on the natural-connection graph of human key points within one frame of image (the spatial dimension) and, along the time dimension, performs temporal convolution or feature fusion with an LSTM network. It makes good use of the natural connectivity of the human structure and of the linkage between the joints involved in an action event, and considers joints adjacent in space and time, but it considers only local joint connectivity.
  • One purpose of this application is to provide a pose estimation-based pedestrian fall action recognition method and device, so as to solve the prior-art problems of how to reduce the influence of the background on the recognition result during fall recognition, improve the accuracy rate, and reduce the amount of computation at the same time.
  • A pose estimation-based pedestrian fall action recognition method, including:
  • acquiring multiple frames of images in an original video stream, performing pedestrian detection and tracking on each frame of the images, obtaining human tracking numbers and key point information through pose estimation, and aggregating, with a multi-scale adjacency matrix, the key point information of the multiple frames before and after each key point to obtain pose graph data;
  • inputting the pose graph data into a graph convolutional neural network, introducing residual connections between multiple spatio-temporal merged graph convolution modules in the graph convolutional neural network, and performing feature extraction sequentially through the multiple spatio-temporal merged graph convolution modules to obtain the pose spatio-temporal joint feature;
  • performing action recognition on the pose spatio-temporal joint feature to obtain the action recognition classification result.
  • The graph convolutional neural network includes a first spatio-temporal merged graph convolution module, a second spatio-temporal merged graph convolution module, and a third spatio-temporal merged graph convolution module;
  • each spatio-temporal merged graph convolution module includes a multi-window multi-scale 3D graph convolution layer and a serialized component layer, the serialized components comprising a multi-scale graph convolution and two consecutive multi-scale temporal convolutions.
  • Inputting the pose graph data into the graph convolutional neural network, introducing residual connections between the multiple spatio-temporal merged graph convolution modules in the network, and performing feature extraction sequentially through them to obtain the pose spatio-temporal joint feature includes:
  • inputting the pose graph data into the graph convolutional neural network, and normalizing the pose graph data to adjust its array shape;
  • inputting the adjusted pose graph data into the first spatio-temporal merged graph convolution module for feature extraction to obtain a first pose spatio-temporal feature;
  • inputting the first pose spatio-temporal feature into the second spatio-temporal merged graph convolution module for feature extraction to obtain a second pose spatio-temporal feature;
  • after the first pose spatio-temporal feature is residual-connected to the second pose spatio-temporal feature, inputting the result into the third spatio-temporal merged graph convolution module for feature extraction to obtain the pose spatio-temporal joint feature.
  • Inputting the adjusted pose graph data into the first spatio-temporal merged graph convolution module for feature extraction to obtain the first pose spatio-temporal feature includes:
  • inputting the adjusted pose graph data into the multi-window multi-scale 3D graph convolution layer and into the serialized component layer of the first module, respectively;
  • passing the adjusted pose graph data sequentially through the multi-scale graph convolution and the two consecutive multi-scale temporal convolutions in the serialized component layer for feature extraction;
  • passing the adjusted pose graph data through the multi-window multi-scale 3D graph convolution layer for feature extraction;
  • summing the features output by the two layers, feeding the sum into an activation function, and performing one more multi-scale temporal convolution feature extraction to obtain the first pose spatio-temporal feature;
  • in the second and third spatio-temporal merged graph convolution modules, performing feature extraction in the same way as in the first module to obtain the second pose spatio-temporal feature and the pose spatio-temporal joint feature, respectively.
  • The pose graph data includes a human key point set and a bone edge set, and the k-adjacency matrix of the pose graph data is expressed as follows:

$$[\tilde{A}_k]_{i,j} = \begin{cases} 1, & \text{if } d(v_i, v_j) = k \\ 1, & \text{if } i = j \\ 0, & \text{otherwise} \end{cases}$$

  • where k denotes the neighbor order of a key point, (i, j) are the i-th and j-th key points, and d(v_i, v_j) denotes the distance between key points i and j.
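  • As an illustration only (not part of the patent), a minimal NumPy sketch of this k-adjacency construction could look as follows; the hop_distance and build_k_adjacency helpers and the skeleton edge list are hypothetical:

```python
import numpy as np

def hop_distance(num_nodes, edges):
    """All-pairs hop distance d(v_i, v_j) on the skeleton graph via adjacency powers."""
    A = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    dist = np.full((num_nodes, num_nodes), np.inf)
    np.fill_diagonal(dist, 0.0)
    reach = np.eye(num_nodes)
    for d in range(1, num_nodes):
        reach = reach @ A                       # walks of length d
        newly = (reach > 0) & np.isinf(dist)    # first time a pair becomes reachable
        dist[newly] = d
    return dist

def build_k_adjacency(num_nodes, edges, k):
    """[A_k]_{ij} = 1 if d(v_i, v_j) == k or i == j, else 0."""
    dist = hop_distance(num_nodes, edges)
    A_k = (dist == k).astype(np.float32)
    np.fill_diagonal(A_k, 1.0)                  # self-loops
    return A_k

# Hypothetical usage: stack scales k = 1..3 for an 18-key-point skeleton
# multi_scale = [build_k_adjacency(18, skeleton_edges, k) for k in (1, 2, 3)]
```

  • Stacking the matrices over k then yields the multi-scale adjacency that lets near and distant neighbors contribute with equal influence during aggregation.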
  • Acquiring multiple frames of images in the original video stream, performing pedestrian detection and tracking on each frame of the images, and obtaining human tracking numbers and key point information through pose estimation includes:
  • acquiring the multiple frames of the images in the original video stream and determining the targets to be tracked;
  • matching, by the DeepSort-based pedestrian tracking algorithm, through computing the similarity of the pedestrian bounding-box feature information of the targets to be tracked between two consecutive frames of the images, and assigning an ID to each target to be tracked to obtain the tracking results;
  • based on the tracking results, extracting key point coordinates for each target to be tracked with the regional multi-person pose estimation algorithm, and outputting the key point information and the human tracking numbers.
  • Performing action recognition on the pose spatio-temporal joint feature in combination with the change characteristics of the fall action to obtain the action recognition classification result includes:
  • performing global average pooling on the pose spatio-temporal joint feature and feeding the obtained pooled result into a fully connected linear layer;
  • outputting, through the classifier, the highest-scoring category corresponding to the pose spatio-temporal joint feature to obtain the classification result.
  • A computer-readable medium on which computer-readable instructions are stored, the computer-readable instructions, when executed by a processor, causing the processor to implement the fall action recognition method described above.
  • A pose estimation-based pedestrian fall action recognition device, comprising:
  • one or more processors; and
  • a computer-readable medium storing one or more computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to implement the fall action recognition method described above.
  • Compared with the prior art, this application acquires multiple frames of images in the original video stream, performs pedestrian detection and tracking on each frame of the images, obtains human tracking numbers and key point information through pose estimation, and aggregates, with a multi-scale adjacency matrix, the key point information of the multiple frames of images before and after each key point to obtain pose graph data; inputs the pose graph data into a graph convolutional neural network in which residual connections are introduced between multiple spatio-temporal merged graph convolution modules, and performs feature extraction sequentially through these modules to obtain the pose spatio-temporal joint feature; and performs action recognition on the pose spatio-temporal joint feature in combination with the change characteristics of the fall action to obtain the action recognition classification result. That is, this application uses a multi-scale adjacency matrix to aggregate information, introduces residual connections between identically structured upper and lower spatio-temporal joint modules, extracts the pose spatio-temporal joint features on two streams (a key point stream and a bone edge stream) separately, and finally merges the two-stream results to make the fall action judgment, reducing the influence of the background on the recognition result, improving action recognition accuracy, and reducing the amount of computation.
  • Fig. 1 shows a schematic flowchart of a pose estimation-based pedestrian fall action recognition method according to one aspect of the present application;
  • Fig. 2 shows a schematic diagram of the spatial information of key points in consecutive frames in a pose estimation-based pedestrian fall action recognition method according to one aspect of the present application;
  • Fig. 3 shows a schematic diagram of the position changes of key points during a human fall in a pose estimation-based pedestrian fall action recognition method according to one aspect of the present application;
  • Fig. 4(a) shows a schematic diagram of the embedded representation after key point feature extraction in a pose estimation-based pedestrian fall action recognition method according to one aspect of the present application;
  • Fig. 4(b) shows a schematic diagram of an embodiment of aggregating neighbor information to update key point No. 6 in a pose estimation-based pedestrian fall action recognition method according to one aspect of the present application;
  • Fig. 5 shows a schematic diagram of the data computation process of graph convolutional neural network feature extraction in a pose estimation-based pedestrian fall action recognition method according to one aspect of the present application;
  • Fig. 6 shows a schematic diagram of the pose estimation workflow of a pose estimation-based pedestrian fall action recognition method according to one aspect of the present application.
  • In a typical configuration of the present application, the terminal, the devices of the service network, and the trusted party each include one or more processors (for example, a central processing unit (CPU) or a graphics processing unit (GPU)), input/output interfaces, network interfaces, and memory.
  • Memory may include non-persistent memory in computer-readable media, in the form of random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology.
  • Information may be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device.
  • As defined herein, computer-readable media exclude transitory media, such as modulated data signals and carrier waves.
  • Figure 1 shows a schematic flowchart of a pose estimation-based pedestrian fall action recognition method according to one aspect of the present application. The method is applicable to various scenarios of daily human life, including but not limited to offices, homes, coffee rooms, and lecture rooms, and can be used to recognize various human actions such as walking, standing, sitting down, and standing up. The method includes step S11, step S12, and step S13, specifically:
  • Step S11: acquire multiple frames of images in the original video stream, perform pedestrian detection and tracking on each frame of the images, obtain human tracking numbers and key point information through pose estimation, and aggregate, with a multi-scale adjacency matrix, the key point information of the frames before and after each key point to obtain pose graph data. In real life, a fall is always preceded by other behaviors such as walking or standing, and followed by behaviors such as lying flat, so a long-term association between key points across multiple frames must be established, aggregating the information of this frame's key point across the frames before and after it, as shown in Figure 2.
  • Figure 3 shows how the position of each key point changes when a standing person falls. The neighbor order of a key point refers to the number of hops between two points; for example, for the k-order neighbors of joint 11, k ∈ {1, 2, 3}, the 1st-order neighbors [1, 12] and the 2nd-order neighbors [0, 2, 5, 13] have the same influence on judging the behavior.
  • Constructing pose graph data from human pose information for fall recognition not only greatly reduces the influence of the background on the recognition result but also reduces the amount of computation.
  • Step S12: input the pose graph data into the graph convolutional neural network, introduce residual connections between the multiple spatio-temporal merged graph convolution modules in the network, and perform feature extraction sequentially through them to obtain the pose spatio-temporal joint feature.
  • In essence, a graph convolutional network (GCN) generates key point embeddings from nearby network neighbors through convolutional aggregation, as shown in Figure 4(a).
  • A key point has an embedded representation at each layer. In Figure 4(b), the layer-0 embedding is the input key point feature X, and the layer-K embedding of a key point is computed by aggregating the information of its neighbors at layer K-1.
  • The directed line from a neighbor key point to the key point denotes message passing; the aggregation in the middle is performed by a neural network, the neighbor messages being averaged before aggregation.
  • The complete aggregation process of a key point is expressed by the following formulas:

$$h_v^{(0)} = x_v$$

$$h_v^{(k)} = \sigma\left(W_k \sum_{u \in N(v)} \frac{h_u^{(k-1)}}{|N(v)|} + B_k h_v^{(k-1)}\right), \quad k \in \{1, \dots, K\}$$

$$z_v = h_v^{(K)}$$

  • Here $h_v^{(0)}$ initializes the layer-0 embedding, $h_v^{(k-1)}$ denotes the embedding of key point v at the previous layer, σ is a nonlinear activation function (sigmoid or relu), $z_v$ is the result of the neighbor feature aggregation after the K-layer neural network, and the sum over $N(v)$ divided by $|N(v)|$ averages the previous-layer embeddings of the neighbors.
  • Another general vector representation of a GCN is as follows:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

  • where $\tilde{A} = A + E$ is the sum of the adjacency matrix A and the identity matrix E, and $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ denotes the normalization operation on $\tilde{A}$.
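  • As a rough PyTorch sketch of this normalized propagation rule (an illustration under the formula above, not the patent's implementation; layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One layer of H' = sigma(D^-1/2 (A + E) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0), device=A.device)            # A~ = A + E (self-loops)
        d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
        A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]   # symmetric normalization
        return torch.relu(A_norm @ self.weight(H))                   # aggregate, transform, activate
```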
  • Step S13: perform action recognition on the pose spatio-temporal joint feature, in combination with the change characteristics of the fall action, to obtain the action recognition classification result.
  • Earlier use of the ST-GCN graph convolution model to obtain skeleton action features failed to consider the spatio-temporal representations jointly and could not recognize well the actions that depend heavily on combined spatio-temporal information; this application extracts pose spatio-temporal features on the basis of the G3D convolution operator and, combined with the change characteristics of the fall action, fuses multiple algorithms to solve the fall detection problem in real life.
  • In steps S11 to S13 above, firstly, multiple frames of images in the original video stream are acquired, pedestrian detection and tracking are performed on each frame of the images, and pose estimation yields the human tracking numbers and key point information, a multi-scale adjacency matrix aggregating the key point information of the multiple frames before and after each key point to obtain the pose graph data; then, the pose graph data is input into a graph convolutional neural network, residual connections are introduced between the multiple spatio-temporal merged graph convolution modules in the network, and feature extraction is performed sequentially through them to obtain the pose spatio-temporal joint feature; finally, action recognition is performed on the pose spatio-temporal joint feature in combination with the change characteristics of the fall action to obtain the action recognition classification result. That is, this application uses a multi-scale adjacency matrix to aggregate information, introduces residual connections between identically structured upper and lower spatio-temporal joint modules, extracts the pose spatio-temporal joint features on two streams (a key point stream and a bone edge stream) separately, and finally merges the two-stream results to make the fall action judgment.
  • For example, experiments are conducted on two fall detection datasets, Le2i Fall Detection (LFD) and UR Fall Detection (URFD). LFD includes 191 human activity videos divided into four scenes (office, home, coffee room, and lecture room); the videos contain feigned falls and frames without people, and their format is 320x240 at 25 frames per second. URFD contains 70 sequences (30 falls + 40 activities of daily living), with fall events recorded using two Microsoft Kinect cameras and corresponding accelerometer data.
  • During training, the original LFD videos are preprocessed with opencv and video editing tools at a resolution of 640x480 and a frame rate of 30 FPS, and video samples of 3 to 9 seconds are selected according to the sliding-window size.
  • Since the original videos contain some actions other than falling, the fall action videos are taken as one group and the others (walking, standing, sitting down, standing up) are relabeled as another group; 26100 frames in total are selected. Because the 40 daily-activity videos in URFD differ considerably from one another, they are relabeled as walking, sitting down, and bending over, and uniformly assigned to the non-fall group.
  • Pedestrian detection and tracking are performed on each frame of images P1, P2, P3, ..., Pn, pose estimation yields the human tracking number ID and the key point information X1, X2, X3, ..., Xn, and a multi-scale adjacency matrix aggregates the key point information of the multiple frames before and after each key point to obtain the pose graph data, as sketched below.
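  • The per-frame pipeline just described can be sketched as follows; detector, tracker, and pose_estimator are hypothetical stand-ins for the yolov4 detector, the DeepSort tracker, and the RMPE pose estimator:

```python
def extract_pose_sequences(frames, detector, tracker, pose_estimator):
    """Detect pedestrians per frame, track IDs across frames, estimate 18 key points each."""
    sequences = {}  # track ID -> list of per-frame key point arrays (18 x (x, y, conf))
    for frame in frames:
        boxes = detector.detect(frame)           # pedestrian bounding boxes
        tracks = tracker.update(boxes, frame)    # (track_id, box) pairs for confirmed tracks
        for track_id, box in tracks:
            keypoints = pose_estimator.estimate(frame, box)
            sequences.setdefault(track_id, []).append(keypoints)
    return sequences   # per-person sequences, ready for multi-scale graph aggregation
```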
  • To avoid overfitting when training on the datasets, the weight decay value is 0.0005, model learning uses the stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.05, training runs for 80 epochs with a batch size of 8, and the learning rate is decayed by a factor of 0.1 at the 35th and 45th epochs.
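  • In PyTorch terms, this training configuration corresponds roughly to the sketch below; model and loader are assumed to be defined elsewhere:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(model.parameters(), lr=0.05, weight_decay=0.0005)
# 0.1x learning-rate decay at epochs 35 and 45, over 80 epochs in total
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[35, 45], gamma=0.1)

for epoch in range(80):
    for poses, labels in loader:                 # batches of 8 pose-graph samples
        optimizer.zero_grad()
        loss = F.cross_entropy(model(poses), labels)
        loss.backward()                          # backpropagation
        optimizer.step()
    scheduler.step()
```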
  • Finally, action recognition is performed on the pose spatio-temporal joint feature to obtain the classification result. That is, the algorithm uses a multi-scale adjacency matrix to aggregate information, introduces residual connections between identically structured modules, extracts the pose spatio-temporal joint features on the two streams (key point stream and bone edge stream) separately, and finally merges the two-stream results to make the fall action judgment, reducing the influence of the background on the recognition result, improving action recognition accuracy, and reducing the amount of computation.
  • The graph convolutional neural network includes a first spatio-temporal merged graph convolution module B1, a second spatio-temporal merged graph convolution module B2, and a third spatio-temporal merged graph convolution module B3.
  • Each spatio-temporal merged graph convolution module includes a multi-window multi-scale 3D graph convolution layer and a serialized component layer, the serialized components comprising a multi-scale graph convolution and two consecutive multi-scale temporal convolutions.
  • The multi-window multi-scale 3D graph convolution layer performs 3D convolution jointly over the temporal and spatial dimensions under different window sizes, so as to express the intrinsic relationship of an action in both dimensions.
  • The serialized components are, in order, a multi-scale graph convolution, which can model the skeleton using the maximum distance between joint points, and two consecutive multi-scale temporal convolutions, which capture long-term or extended temporal frame context information.
  • Inputting the pose graph data into the graph convolutional neural network, introducing residual connections between the multiple spatio-temporal merged graph convolution modules, and performing feature extraction sequentially through them to obtain the pose spatio-temporal joint feature includes:
  • the input is a 5-dimensional array (N, C, T, V, M), where N is the number of videos in a forward batch; C is the number of feature information channels of a node, namely 3: (x, y, acc); T is the number of video key frames; V is the number of joints; and M is the number of highest-confidence persons in a frame.
  • After the batch normalization layer, the array shape is adjusted to 3 dimensions (N, C×V×M, T).
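  • A minimal sketch of this shape adjustment (C = 3 channels and V = 18 joints follow the text; the other sizes are illustrative):

```python
import torch
import torch.nn as nn

N, C, T, V, M = 8, 3, 150, 18, 1                        # batch, channels, frames, joints, persons
x = torch.randn(N, C, T, V, M)

bn = nn.BatchNorm1d(C * V * M)
x = x.permute(0, 1, 3, 4, 2).reshape(N, C * V * M, T)   # (N, C, T, V, M) -> (N, CxVxM, T)
x = bn(x)                                               # normalize each joint channel over time
x = x.reshape(N, C, V, M, T).permute(0, 1, 4, 2, 3)     # back to (N, C, T, V, M)
```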
  • The first pose spatio-temporal feature is input into the second spatio-temporal merged graph convolution module for feature extraction to obtain the second pose spatio-temporal feature.
  • After the first pose spatio-temporal feature is residual-connected to the second pose spatio-temporal feature, the result is input into the third spatio-temporal merged graph convolution module for feature extraction to obtain the pose spatio-temporal joint feature.
  • The output of the spatio-temporal merged graph convolution module B1 is residual-connected to module B2 after a convolutional transformation; the numbers in brackets in each sub-block are the input and output channel counts before and after the computation.
  • Inputting the adjusted pose graph data into the first spatio-temporal merged graph convolution module for feature extraction to obtain the first pose spatio-temporal feature includes:
  • passing the adjusted pose graph data sequentially through the multi-scale graph convolution and the two consecutive multi-scale temporal convolutions in the serialized component layer for feature extraction;
  • passing the adjusted pose graph data through the multi-window multi-scale 3D graph convolution layer for feature extraction;
  • summing the outputs of the two layers, feeding the sum into an activation function, and performing one more multi-scale temporal convolution feature extraction to obtain the first pose spatio-temporal feature;
  • in the second and third spatio-temporal merged graph convolution modules, performing feature extraction in the same way as in the first module to obtain the second pose spatio-temporal feature and the pose spatio-temporal joint feature, respectively.
  • That is, after the multi-window multi-scale 3D graph convolution layer and the serialized component layer, the output features are summed and sent into the relu() activation function, one more multi-scale temporal convolution feature extraction is performed, and the result is input to the next spatio-temporal merged graph convolution module with the same logical processing structure; finally the features are classified and output.
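  • Structurally, one such module can be sketched as below; the two branches are abstracted as submodules, so this is an interpretation of the described data flow rather than the patent's code:

```python
import torch
import torch.nn as nn

class STMergedBlock(nn.Module):
    """Parallel 3D-graph branch and serialized branch, summed, relu, then one more temporal conv."""
    def __init__(self, g3d_branch, graph_conv, tconv1, tconv2, tconv_out):
        super().__init__()
        self.g3d = g3d_branch                     # multi-window multi-scale 3D graph convolution
        self.serial = nn.Sequential(graph_conv,   # multi-scale graph convolution
                                    tconv1,       # two consecutive multi-scale
                                    tconv2)       # temporal convolutions
        self.out = tconv_out                      # final multi-scale temporal convolution

    def forward(self, x):
        y = self.g3d(x) + self.serial(x)          # sum the two branches' features
        return self.out(torch.relu(y))            # activation, then temporal conv
```

  • Between blocks of this structure, the residual connection described above adds the (convolution-transformed) output of B1 to that of B2 before the result enters B3.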
  • This application equalizes the weights of higher-order neighbor nodes during layer-by-layer information aggregation, which helps improve action recognition accuracy.
  • The pose graph data includes a set of human key points and a set of bone edges; to reflect the importance of joint points in the preceding and following frames and of distant neighbors, the k-adjacency matrix of the pose graph data is expressed as follows:

$$[\tilde{A}_k]_{i,j} = \begin{cases} 1, & \text{if } d(v_i, v_j) = k \\ 1, & \text{if } i = j \\ 0, & \text{otherwise} \end{cases}$$

  • where k denotes the neighbor order of a key point, (i, j) are the i-th and j-th key points, and d(v_i, v_j) denotes the distance between key points i and j.
  • A bone edge refers to the line connecting two key points. Using pose graph data gives this application an advantage over raw video, largely reducing the influence of the background on recognition accuracy.
  • Multiple frames of images in the original video stream are acquired, pedestrian detection and tracking are performed on each frame of images, and human tracking numbers and key point information are obtained through pose estimation, including:
  • matching, by the DeepSort-based pedestrian tracking algorithm, through computing the similarity of the pedestrian bounding-box feature information of the targets to be tracked between two consecutive frames of the images, and assigning an ID to each target to be tracked to obtain the tracking results;
  • based on the tracking results, extracting key point coordinates for each target to be tracked with the regional multi-person pose estimation algorithm, and outputting the key point information and the human tracking numbers.
  • Object detection technology mainly locates and classifies the objects appearing in an image.
  • Among mainstream object detection methods, the yolo network is widely used in real-life scenarios thanks to its high real-time performance.
  • This application uses a yolov4-based pedestrian detector: a feature extraction network first divides the input image into S×S grids at multiple scales, a series of anchor boxes centered on the center of each grid is then generated, these anchor boxes are classified and their boundaries fine-tuned, and the pedestrian bounding box positions in the image are finally predicted.
  • The DeepSort-based pedestrian tracking algorithm matches by computing the similarity of pedestrian bounding-box feature information between two consecutive frames, assigning an ID to each target to be tracked (a number is given to each "person" whose trajectory is in the confirmed state in each frame).
  • The Kalman filter and the Hungarian algorithm are mainly used to ensure tracking accuracy.
  • The Kalman filter algorithm predicts and updates the pedestrian trajectories, and the Hungarian algorithm performs optimal IOU assignment between the pedestrian detector outputs and the tracking predictions.
  • Cascade matching and a new-trajectory confirmation mechanism are also introduced.
  • Cascade matching refers to data association combining pedestrian motion information and appearance information.
  • The matching degree between the Kalman filter predictions and the motion information in the pedestrian detector results is evaluated with the Mahalanobis distance, computed as follows:

$$d^{(1)}(i,j) = (d_j - y_i)^{\top} S_i^{-1} (d_j - y_i)$$

  • where d^(1)(i, j) denotes the matching degree,
  • d_j denotes the position of the j-th detection box,
  • y_i denotes the position of the target predicted by the i-th tracker, and
  • S_i denotes the covariance matrix between the detection position and the mean tracking position, which mainly takes the state uncertainty into account. Pedestrian appearance information is matched with a cosine distance metric. If the Mahalanobis distance of an association is not larger than a specified threshold t^(1), the association of the motion state is set as successful, using the following function:

$$b_{i,j}^{(1)} = \mathbb{1}\left[d^{(1)}(i,j) \leq t^{(1)}\right]$$

  • where b_{i,j}^{(1)} serves as the flag of whether the association succeeds.
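  • A NumPy sketch of this gated association follows; the default threshold is an assumption, taken from the common DeepSort choice of the 95% chi-square quantile for 4 degrees of freedom:

```python
import numpy as np

def mahalanobis_gate(d_j, y_i, S_i, t1=9.4877):
    """d1 = (d_j - y_i)^T S_i^{-1} (d_j - y_i); the association succeeds if d1 <= t1.

    t1 = 9.4877 (chi-square 95% quantile, 4 dof) is an assumed default, not from the patent.
    """
    diff = d_j - y_i
    d1 = float(diff @ np.linalg.inv(S_i) @ diff)
    return d1, d1 <= t1   # (matching degree, b_ij association flag)
```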
  • Confirmation of new trajectories refers to dividing trajectories into a confirmed state and an unconfirmed state.
  • A newly generated trajectory is unconfirmed by default and must match the pedestrian detector's detection results a certain number of consecutive times before it can be converted to the confirmed state.
  • A confirmed trajectory is deleted only after it has mismatched the detection results a certain number of times.
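  • This confirmation logic amounts to a small per-track state machine, sketched below; the parameter names n_init and max_age are illustrative:

```python
class Track:
    """Tentative tracks need n_init consecutive matches to be confirmed;
    confirmed tracks are deleted after max_age consecutive mismatches."""
    def __init__(self, n_init=3, max_age=30):
        self.state = "tentative"
        self.hits = 0
        self.misses = 0
        self.n_init, self.max_age = n_init, max_age

    def mark_matched(self):
        self.hits += 1
        self.misses = 0
        if self.state == "tentative" and self.hits >= self.n_init:
            self.state = "confirmed"

    def mark_missed(self):
        self.misses += 1
        if self.state == "tentative" or self.misses > self.max_age:
            self.state = "deleted"
```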
  • Pedestrians within the fixed time-frame window are accurately tracked and numbered, and key point coordinates are extracted for each individual pedestrian according to the tracking results.
  • The idea of the RMPE (Regional Multi-Person Pose Estimation) algorithm is to detect every human detection box in the scene and then estimate the pose of each human region independently.
  • The output includes the human tracking number and the 3D information (x, y, c) of 18 human key points, where (x, y) denotes the coordinates and c denotes the confidence.
  • The spatial transformer network receives the human proposal boxes and can turn inaccurate proposal boxes in the pedestrian detection results into high-quality human proposal boxes, making the proposals more precise.
  • The pedestrian bounding box is cropped and input into the single-person pose estimation algorithm to obtain the human pose key points.
  • The spatial inverse-transformer network outputs the candidate human poses in the original image, and the redundant pose information of each pedestrian is filtered out by the pose non-maximum suppression method.
  • The parallel single-person pose estimation algorithm is used only in the training phase, and its output is compared directly with the ground-truth values of the human pose labels; the purpose is to back-propagate the error produced after pose localization to the spatial transformer network and help it generate high-quality region locations.
  • Performing action recognition on the pose spatio-temporal joint feature to obtain the action recognition classification result includes:
  • performing global average pooling on the pose spatio-temporal joint feature, and feeding the obtained pooled result into a fully connected linear layer;
  • outputting, through the softmax classifier, the highest-scoring category corresponding to the pose spatio-temporal joint feature to obtain the classification result.
  • For example, the graph convolutional neural network outputs 384 feature channels; global average pooling is then applied to the output features over the spatio-temporal dimensions and the individual pedestrians, the pooled result is fed into the fully connected linear layer (input channels 384, output channels equal to the number of categories), and the softmax classifier finally outputs the highest-scoring category.
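  • The classification head thus reduces to global pooling plus one linear layer, as in this sketch (the class count is illustrative):

```python
import torch
import torch.nn as nn

num_classes = 2                                  # e.g. fall vs. non-fall (illustrative)
fc = nn.Linear(384, num_classes)                 # input channels 384, output = categories

def classify(features):
    """features: (N, 384, T, V, M) tensor output by the graph convolutional network."""
    pooled = features.mean(dim=[2, 3, 4])        # global average pool: time, joints, persons
    scores = torch.softmax(fc(pooled), dim=1)    # softmax over action classes
    return scores.argmax(dim=1)                  # index of the highest-scoring category
```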
  • A computer-readable medium on which computer-readable instructions are stored, the computer-readable instructions, when executed by a processor, causing the processor to implement the pedestrian fall action recognition method described above.
  • A pose estimation-based pedestrian fall action recognition device, characterized in that the device includes:
  • one or more processors, including CPUs and GPUs; and
  • a computer-readable medium storing one or more computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to implement the pedestrian fall action recognition method described above.
  • In summary, this application acquires multiple frames of images in the original video stream, performs pedestrian detection and tracking on each frame of the images, obtains human tracking numbers and key point information through pose estimation, and aggregates, with a multi-scale adjacency matrix, the key point information of the multiple frames of images before and after each key point to obtain pose graph data; inputs the pose graph data into a graph convolutional neural network, introduces residual connections between the multiple spatio-temporal merged graph convolution modules in the network, and performs feature extraction sequentially through them to obtain the pose spatio-temporal joint feature; and performs action recognition on the pose spatio-temporal joint feature in combination with the change characteristics of the fall action to obtain the action recognition classification result. That is, this application uses a multi-scale adjacency matrix to aggregate information, introduces residual connections between identically structured upper and lower spatio-temporal joint modules, extracts the pose spatio-temporal joint features on two streams (a key point stream and a bone edge stream) separately, and finally merges the two-stream results to make the fall action judgment.
  • The present application may be implemented in software and/or in a combination of software and hardware, for example using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device.
  • the software program of the present application can be executed by a processor to realize the steps or functions described above.
  • the software program (including associated data structures) of the present application can be stored in a computer-readable recording medium such as RAM memory, magnetic or optical drive or floppy disk and the like.
  • some steps or functions of the present application may be implemented by hardware, for example, as a circuit that cooperates with a processor to execute each step or function.
  • A part of the present application may be applied as a computer program product, for example computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the present application through the operation of the computer.
  • The program instructions that invoke the method of the present application may be stored in a fixed or removable recording medium, and/or transmitted through a data stream in a broadcast or other signal-carrying medium, and/or stored in the working memory of a computer device that runs according to the program instructions.
  • Here, an embodiment according to the present application includes an apparatus that includes a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the apparatus is triggered to run the methods and/or technical solutions based on the foregoing multiple embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present application provides a pose estimation-based pedestrian fall action recognition method and device. The application uses a multi-scale adjacency matrix to aggregate information, introduces residual connections between identically structured upper and lower spatio-temporal joint modules, extracts the spatio-temporal joint features of the pose on two streams (a key point stream and a bone edge stream) separately, and finally merges the two-stream results to make the fall action judgment, which reduces the influence of the background on the recognition result, thereby improving action recognition accuracy, and reduces the amount of computation.

Description

Pose estimation-based pedestrian fall action recognition method and device
This application claims priority to the Chinese patent application filed with the China Patent Office on November 15, 2021, with application number 202111345550.3 and the invention title "Pose estimation-based pedestrian fall action recognition method and device", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of computers, and in particular to a pose estimation-based pedestrian fall action recognition method and device.
Background Art
In the prior art, the data modalities commonly used in the behavior recognition field are mainly divided into original RGB video and series of human pose key points. Original RGB video contains not only the behavior and motion information of the human body but also a great deal of background information that affects recognition accuracy, such as lighting and a cluttered surrounding environment. However, intelligent hardware is improving rapidly today, and pose estimation algorithms for obtaining human key points are becoming more and more capable in terms of real-time performance; the pose information of each person in a video can be extracted with the aid of a highly robust pedestrian detection network, and the pose output results are finally packaged into the required data form.
In fall action recognition methods, the human pose coordinates extracted from the video need to be assembled into graph data, and a graph convolutional network is used for feature learning. Early on, some researchers proposed the skeleton-based spatio-temporal graph convolutional network ST-GCN for feature extraction, performing graph convolution on the natural-connection graph of human key points within one frame of image (the spatial dimension) and, along the time dimension, performing temporal convolution or feature fusion with an LSTM network. It makes good use of the natural connectivity of the human structure and of the linkage between the joints involved in an action event, and considers joints adjacent in space and time, but it considers only local joint connectivity, does not treat nearby and distant key points as equally influential, does not take into account a key point's distant key points in the global graph or the related key points in the preceding and following frames, and its interleaving of time and space is not robust enough to capture complex spatio-temporal joint relationships. In 2020, researchers proposed the G3D spatio-temporal graph convolution operator, which links spatio-temporal information together for three-dimensional convolution and considers the importance of distant neighbors; it can stably and accurately extract the high-level semantic characteristics of the action itself in stereoscopic space, greatly improving action classification accuracy. However, complex backgrounds and insufficient extraction of action features still strongly affect action recognition accuracy.
Therefore, addressing the impact of complex backgrounds and insufficient action feature extraction in RGB video on fall action recognition accuracy remains a direction requiring research in this field.
Summary of the Invention
One purpose of this application is to provide a pose estimation-based pedestrian fall action recognition method and device, so as to solve the prior-art problems of how to reduce the influence of the background on the recognition result during fall recognition, improve accuracy, and reduce the amount of computation at the same time.
According to one aspect of the present application, there is provided a pose estimation-based pedestrian fall action recognition method, including:
acquiring multiple frames of images in an original video stream, performing pedestrian detection and tracking on each frame of the images, obtaining human tracking numbers and key point information through pose estimation, and aggregating, with a multi-scale adjacency matrix, the key point information of the multiple frames of images before and after each key point to obtain pose graph data;
inputting the pose graph data into a graph convolutional neural network, introducing residual connections between multiple spatio-temporal merged graph convolution modules in the graph convolutional neural network, and performing feature extraction sequentially through the multiple spatio-temporal merged graph convolution modules to obtain a pose spatio-temporal joint feature;
performing action recognition on the pose spatio-temporal joint feature in combination with the change characteristics of the fall action to obtain an action recognition classification result.
Further, in the above pose estimation-based pedestrian fall action recognition method, the graph convolutional neural network includes a first spatio-temporal merged graph convolution module, a second spatio-temporal merged graph convolution module, and a third spatio-temporal merged graph convolution module;
each spatio-temporal merged graph convolution module includes a multi-window multi-scale 3D graph convolution layer and a serialized component layer, the serialized components comprising a multi-scale graph convolution and two consecutive multi-scale temporal convolutions.
Further, in the above method, inputting the pose graph data into the graph convolutional neural network, introducing residual connections between the multiple spatio-temporal merged graph convolution modules, and performing feature extraction sequentially through them to obtain the pose spatio-temporal joint feature includes:
inputting the pose graph data into the graph convolutional neural network, and normalizing the pose graph data to adjust its array shape;
inputting the adjusted pose graph data into the first spatio-temporal merged graph convolution module for feature extraction to obtain a first pose spatio-temporal feature;
inputting the first pose spatio-temporal feature into the second spatio-temporal merged graph convolution module for feature extraction to obtain a second pose spatio-temporal feature;
after the first pose spatio-temporal feature is residual-connected to the second pose spatio-temporal feature, inputting the result into the third spatio-temporal merged graph convolution module for feature extraction to obtain the pose spatio-temporal joint feature.
Further, in the above method, inputting the adjusted pose graph data into the first spatio-temporal merged graph convolution module for feature extraction to obtain the first pose spatio-temporal feature includes:
inputting the adjusted pose graph data into the multi-window multi-scale 3D graph convolution layer and into the serialized component layer of the first spatio-temporal merged graph convolution module, respectively;
passing the adjusted pose graph data sequentially through the multi-scale graph convolution and the two consecutive multi-scale temporal convolutions in the serialized component layer for feature extraction;
passing the adjusted pose graph data through the multi-window multi-scale 3D graph convolution layer for feature extraction;
summing the features output by the multi-window multi-scale 3D graph convolution layer and the serialized component layer, feeding the sum into an activation function, and performing one more multi-scale temporal convolution feature extraction to obtain the first pose spatio-temporal feature;
in the second and third spatio-temporal merged graph convolution modules, performing feature extraction in the same way as in the first spatio-temporal merged graph convolution module to obtain the second pose spatio-temporal feature and the pose spatio-temporal joint feature, respectively.
Further, in the above method, the pose graph data includes a human key point set and a bone edge set, and the k-adjacency matrix of the pose graph data is expressed as follows:

$$[\tilde{A}_k]_{i,j} = \begin{cases} 1, & \text{if } d(v_i, v_j) = k \\ 1, & \text{if } i = j \\ 0, & \text{otherwise} \end{cases}$$

where k denotes the neighbor order of a key point, (i, j) are the i-th and j-th key points, and d(v_i, v_j) denotes the distance between key points i and j.
Further, in the above method, acquiring multiple frames of images in the original video stream, performing pedestrian detection and tracking on each frame of images, and obtaining human tracking numbers and key point information through pose estimation includes:
acquiring the multiple frames of the images in the original video stream and determining the targets to be tracked;
matching, by the DeepSort-based pedestrian tracking algorithm, through computing the similarity of the pedestrian bounding-box feature information of the targets to be tracked between two consecutive frames of the images, and assigning an ID to each target to be tracked to obtain the tracking results;
based on the tracking results, extracting key point coordinates for each target to be tracked with the regional multi-person pose estimation algorithm, and outputting the key point information and the human tracking numbers.
Further, in the above method, performing action recognition on the pose spatio-temporal joint feature in combination with the change characteristics of the fall action to obtain the action recognition classification result includes:
performing global average pooling on the pose spatio-temporal joint feature, and feeding the obtained pooled result into a fully connected linear layer;
outputting, through the classifier and in combination with the change characteristics of the fall action, the highest-scoring category corresponding to the pose spatio-temporal joint feature to obtain the classification result.
According to another aspect of the present application, there is also provided a computer-readable medium on which computer-readable instructions are stored, the computer-readable instructions, when executed by a processor, causing the processor to implement the fall action recognition method described above.
According to another aspect of the present application, there is also provided a pose estimation-based pedestrian fall action recognition device, the device including:
one or more processors;
a computer-readable medium for storing one or more computer-readable instructions,
wherein, when the one or more computer-readable instructions are executed by the one or more processors, the one or more processors implement the fall action recognition method described above.
Compared with the prior art, this application acquires multiple frames of images in the original video stream, performs pedestrian detection and tracking on each frame of the images, obtains human tracking numbers and key point information through pose estimation, and aggregates, with a multi-scale adjacency matrix, the key point information of the multiple frames of images before and after each key point to obtain pose graph data; inputs the pose graph data into a graph convolutional neural network, introduces residual connections between the multiple spatio-temporal merged graph convolution modules in the network, and performs feature extraction sequentially through them to obtain the pose spatio-temporal joint feature; and performs action recognition on the pose spatio-temporal joint feature in combination with the change characteristics of the fall action to obtain the action recognition classification result. That is, this application uses a multi-scale adjacency matrix to aggregate information, introduces residual connections between identically structured upper and lower spatio-temporal joint modules, extracts the pose spatio-temporal joint features on two streams (a key point stream and a bone edge stream) separately, and finally merges the two-stream results to make the fall action judgment, reducing the influence of the background on the recognition result, improving action recognition accuracy, and reducing the amount of computation.
Brief Description of the Drawings
Other features, objects, and advantages of the present application will become more apparent by reading the detailed description of the non-limiting embodiments made with reference to the following drawings:
Fig. 1 shows a schematic flowchart of a pose estimation-based pedestrian fall action recognition method according to one aspect of the present application;
Fig. 2 shows a schematic diagram of the spatial information of key points in consecutive frames in a pose estimation-based pedestrian fall action recognition method according to one aspect of the present application;
Fig. 3 shows a schematic diagram of the position changes of key points during a human fall in a pose estimation-based pedestrian fall action recognition method according to one aspect of the present application;
Fig. 4(a) shows a schematic diagram of the embedded representation after key point feature extraction in a pose estimation-based pedestrian fall action recognition method according to one aspect of the present application;
Fig. 4(b) shows a schematic diagram of an embodiment of aggregating neighbor information to update key point No. 6 in a pose estimation-based pedestrian fall action recognition method according to one aspect of the present application;
Fig. 5 shows a schematic diagram of the data computation process of graph convolutional neural network feature extraction in a pose estimation-based pedestrian fall action recognition method according to one aspect of the present application;
Fig. 6 shows a schematic diagram of the pose estimation workflow of a pose estimation-based pedestrian fall action recognition method according to one aspect of the present application.
The same or similar reference signs in the drawings denote the same or similar components.
Detailed Description
The present application is described in further detail below with reference to the drawings.
In a typical configuration of the present application, the terminal, the devices of the service network, and the trusted party each include one or more processors (for example, a central processing unit (CPU) or a graphics processing unit (GPU)), input/output interfaces, network interfaces, and memory.
Memory may include non-persistent memory in computer-readable media, in the form of random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. Information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
Fig. 1 shows a schematic flowchart of a pose estimation-based pedestrian fall action recognition method according to one aspect of the present application. The method is applicable to various scenarios of daily human life, including but not limited to offices, homes, coffee rooms, and lecture rooms, and can be used to recognize various human actions such as walking, standing, sitting down, and standing up. The method includes step S11, step S12, and step S13, specifically:
Step S11: acquire multiple frames of images in the original video stream, perform pedestrian detection and tracking on each frame of the images, obtain human tracking numbers and key point information through pose estimation, and aggregate, with a multi-scale adjacency matrix, the key point information of the multiple frames of images before and after each key point to obtain pose graph data. Here, in real life, a fall is always preceded by some other behaviors such as walking or standing, and followed by behaviors such as lying flat, so a long-term association between key points across multiple frames must be established, aggregating the information of this frame's key point across the frames before and after it, as shown in Fig. 2. As shown in Fig. 3, the position of each key point changes when a standing person falls. The neighbor order of a key point refers to the number of hops between two points; for example, for the k-order neighbors of joint 11, k ∈ {1, 2, 3}, the 1st-order neighbors [1, 12] and the 2nd-order neighbors [0, 2, 5, 13] have the same influence on judging the behavior. Constructing pose graph data from human pose information for fall recognition not only greatly reduces the influence of the background on the recognition result but also reduces the amount of computation.
Step S12: input the pose graph data into the graph convolutional neural network, introduce residual connections between the multiple spatio-temporal merged graph convolution modules in the graph convolutional neural network, and perform feature extraction sequentially through the multiple spatio-temporal merged graph convolution modules to obtain the pose spatio-temporal joint feature. Here, the essence of the graph convolutional network (GCN) is to generate key point embeddings from nearby network neighbors through convolutional aggregation, as shown in Fig. 4(a). A key point has an embedded representation at each layer; in Fig. 4(b), the layer-0 embedding is the input key point feature X, and the layer-K embedding of a key point is computed by aggregating the information of its layer-(K-1) neighbors. The directed line from a neighbor key point to the key point denotes message passing, and the aggregation in the middle is performed by a neural network, the neighbor messages being averaged before aggregation. The complete aggregation process of a key point is expressed by the following formulas:

$$h_v^{(0)} = x_v$$

$$h_v^{(k)} = \sigma\left(W_k \sum_{u \in N(v)} \frac{h_u^{(k-1)}}{|N(v)|} + B_k h_v^{(k-1)}\right), \quad k \in \{1, \dots, K\}$$

$$z_v = h_v^{(K)}$$

First, the layer-0 embedding is initialized as $h_v^{(0)} = x_v$; $h_v^{(k-1)}$ denotes the embedding of key point v at the previous layer; σ is a nonlinear activation function, sigmoid or relu; $z_v = h_v^{(K)}$ is the result of the neighbor feature aggregation after the K-layer neural network; and the sum over $N(v)$ divided by $|N(v)|$ averages the previous-layer embeddings of the neighbors.
Another general vector representation of a GCN is as follows:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $\tilde{A} = A + E$ is the sum of the adjacency matrix A and the identity matrix E, and $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ denotes the normalization operation on $\tilde{A}$.
Step S13: perform action recognition on the pose spatio-temporal joint feature in combination with the change characteristics of the fall action to obtain the action recognition classification result. Here, earlier use of the graph convolution model (ST-GCN) to obtain skeleton action features failed to consider the spatio-temporal representations jointly, and could not recognize well the actions that depend heavily on combined spatio-temporal information. The present application extracts pose spatio-temporal features on the basis of the G3D convolution operator and, combining the change characteristics of the fall action, fuses multiple algorithms to solve the fall detection problem in real life.
In steps S11 to S13 above, firstly, multiple frames of images in the original video stream are acquired, pedestrian detection and tracking are performed on each frame of the images, pose estimation yields the human tracking numbers and key point information, and a multi-scale adjacency matrix aggregates the key point information of the multiple frames of images before and after each key point to obtain the pose graph data; then, the pose graph data is input into a graph convolutional neural network, residual connections are introduced between the multiple spatio-temporal merged graph convolution modules in the network, and feature extraction is performed sequentially through them to obtain the pose spatio-temporal joint feature; finally, action recognition is performed on the pose spatio-temporal joint feature in combination with the change characteristics of the fall action to obtain the action recognition classification result. That is, this application uses a multi-scale adjacency matrix to aggregate information, introduces residual connections between identically structured upper and lower spatio-temporal joint modules, extracts the pose spatio-temporal joint features on two streams (a key point stream and a bone edge stream) separately, and finally merges the two-stream results to make the fall action judgment, reducing the influence of the background on the recognition result, improving action recognition accuracy, and reducing the amount of computation.
For example, experiments are conducted on two fall detection datasets, Le2i Fall Detection (LFD) and UR Fall Detection (URFD). LFD includes 191 human activity videos divided into four scenes: office, home, coffee room, and lecture room; the videos contain feigned falls and frames without people, and their format is 320x240 at 25 frames per second. URFD contains 70 sequences (30 falls + 40 activities of daily living), with fall events recorded using two Microsoft Kinect cameras and corresponding accelerometer data.
During training, the original LFD videos are preprocessed with opencv and video editing tools, with a resolution of 640x480 and a frame rate of 30 FPS; video samples of 3 to 9 seconds are selected according to the sliding-window size. Since the original videos contain some actions other than falling, the fall action videos need to be taken as one group and the others (walking, standing, sitting down, standing up) relabeled as another group; 26100 frames in total are selected. Because the 40 daily-activity videos in URFD differ considerably from one another, they need to be relabeled as three actions (walking, sitting down, bending over) and uniformly assigned to the non-fall group.
Pedestrian detection is performed on each frame of images P1, P2, P3, ..., Pn using the pretrained fine-tuned model yolov4, while the output bounding boxes are suitably enlarged; the Kalman filter algorithm predicts the next frame's trajectory tracking result from the current frame's output, the detection and prediction results are then merged, the RMPE algorithm performs pose estimation, the results are stored in an object list, and the trajectory states are finally updated. The matching threshold of the DeepSort algorithm is set to 30, and the backbone network of RMPE is resnet50. That is, pedestrian detection and tracking are performed on each frame of images P1, P2, P3, ..., Pn, pose estimation yields the human tracking number ID and the key point information X1, X2, X3, ..., Xn, and a multi-scale adjacency matrix aggregates the key point information of the multiple frames of images before and after each key point to obtain the pose graph data.
Next, the pose graph data is input into the graph convolutional neural network, residual connections are introduced between the multiple spatio-temporal merged graph convolution modules B1, B2, B3, ..., Bn in the network, and feature extraction is performed sequentially through the modules B1, B2, B3, ..., Bn to obtain the pose spatio-temporal joint feature. To avoid overfitting when training on the datasets, the weight decay value is 0.0005, model learning uses the stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.05, training runs for 80 epochs with a batch size of 8, and the learning rate is decayed by a factor of 0.1 at the 35th and 45th epochs.
Finally, action recognition is performed on the pose spatio-temporal joint feature in combination with the change characteristics of the fall action to obtain the action recognition classification result. That is, the algorithm here uses a multi-scale adjacency matrix to aggregate information, introduces residual connections between identically structured upper and lower spatio-temporal joint modules, extracts the pose spatio-temporal joint features on the two streams (key point stream and bone edge stream) separately, and finally merges the two-stream results to make the fall action judgment, reducing the influence of the background on the recognition result, improving action recognition accuracy, and reducing the amount of computation.
In addition, to demonstrate the advantages of pose data and the strong generalization ability of spatio-temporal 3D graph convolution feature extraction, the popular video action classification algorithm SlowFast, the spatio-temporal graph convolution model 2s-AGCN, and other algorithms that perform well on the UR Fall Detection dataset are compared with the algorithm used in this application. Compared with pedestrian fall action recognition using the SlowFast, 2s-AGCN, and Harrou et al. methods, the method of this application largely reduces the influence of the background on recognition accuracy, and the action recognition results are more accurate.
Following the above embodiment, in the method, the graph convolutional neural network includes a first spatio-temporal merged graph convolution module B1, a second spatio-temporal merged graph convolution module B2, and a third spatio-temporal merged graph convolution module B3.
Each spatio-temporal merged graph convolution module includes a multi-window multi-scale 3D graph convolution layer and a serialized component layer, the serialized components comprising a multi-scale graph convolution and two consecutive multi-scale temporal convolutions. Here, the multi-window multi-scale 3D graph convolution layer performs 3D convolution jointly over the temporal and spatial dimensions under different window sizes, with the aim of expressing the intrinsic relationship of an action in both dimensions. The serialized components are, in order, a multi-scale graph convolution, which can model the skeleton using the maximum distance between joint points, and two consecutive multi-scale temporal convolutions, which capture long-term or extended temporal frame context information.
Following the above embodiment, in the method, inputting the pose graph data into the graph convolutional neural network, introducing residual connections between the multiple spatio-temporal merged graph convolution modules, and performing feature extraction sequentially through them to obtain the pose spatio-temporal joint feature includes:
inputting the pose graph data into the graph convolutional neural network, and normalizing the pose graph data to adjust its array shape. For example, the input is a 5-dimensional array (N, C, T, V, M), where N is the number of videos in a forward batch; C is the number of feature information channels of a node, namely 3: (x, y, acc); T is the number of video key frames; V is the number of joints; and M is the number of highest-confidence persons in a frame. After the batch normalization layer, the array shape is adjusted to 3 dimensions (N, C×V×M, T).
The adjusted pose graph data is input into the first spatio-temporal merged graph convolution module for feature extraction to obtain the first pose spatio-temporal feature.
The first pose spatio-temporal feature is input into the second spatio-temporal merged graph convolution module for feature extraction to obtain the second pose spatio-temporal feature.
After the first pose spatio-temporal feature is residual-connected to the second pose spatio-temporal feature, the result is input into the third spatio-temporal merged graph convolution module for feature extraction to obtain the pose spatio-temporal joint feature. As shown in Fig. 5, to prevent the feature loss caused by increasing the number of layers, the output of the spatio-temporal merged graph convolution module B1 is residual-connected to module B2 after a convolutional transformation, where the numbers in brackets in each sub-block are the input and output channel counts before and after the computation.
Following the above embodiment, in the method, inputting the adjusted pose graph data into the first spatio-temporal merged graph convolution module for feature extraction to obtain the first pose spatio-temporal feature includes:
inputting the adjusted pose graph data into the multi-window multi-scale 3D graph convolution layer and into the serialized component layer of the first spatio-temporal merged graph convolution module, respectively;
passing the adjusted pose graph data sequentially through the multi-scale graph convolution and the two consecutive multi-scale temporal convolutions in the serialized component layer for feature extraction;
passing the adjusted pose graph data through the multi-window multi-scale 3D graph convolution layer for feature extraction;
summing the features output by the multi-window multi-scale 3D graph convolution layer and the serialized component layer, feeding the sum into an activation function, and performing one more multi-scale temporal convolution feature extraction to obtain the first pose spatio-temporal feature.
In the second and third spatio-temporal merged graph convolution modules, feature extraction is performed in the same way as in the first spatio-temporal merged graph convolution module to obtain the second pose spatio-temporal feature and the pose spatio-temporal joint feature, respectively. Here, after the multi-window multi-scale 3D graph convolution layer and the serialized component layer, the output features are summed and sent into the relu() activation function, one more multi-scale temporal convolution feature extraction is performed, and the result is input to the next spatio-temporal merged graph convolution module with the same logical processing structure; finally the features are classified and output. The present application equalizes the weights of higher-order neighbor nodes during layer-by-layer information aggregation, which helps improve action recognition accuracy.
In yet another embodiment of the present application, in the method, the pose graph data includes a human key point set and a bone edge set; to reflect the importance of joint points in the preceding and following frames and of distant neighbors, the k-adjacency matrix of the pose graph data is expressed as follows:

$$[\tilde{A}_k]_{i,j} = \begin{cases} 1, & \text{if } d(v_i, v_j) = k \\ 1, & \text{if } i = j \\ 0, & \text{otherwise} \end{cases}$$

where k denotes the neighbor order of a key point, (i, j) are the i-th and j-th key points, and d(v_i, v_j) denotes the distance between key points i and j. Here, a bone edge refers to the line connecting two key points. Using pose graph data gives this application an advantage over raw video, largely reducing the influence of the background on recognition accuracy; that is, constructing pose graph data from human pose information for fall recognition not only greatly reduces the influence of the background on the recognition result but also reduces the amount of computation.
In yet another embodiment of the present application, in the method, acquiring multiple frames of images in the original video stream, performing pedestrian detection and tracking on each frame of images, and obtaining human tracking numbers and key point information through pose estimation includes:
acquiring the multiple frames of the images in the original video stream and determining the targets to be tracked;
matching, by the DeepSort-based pedestrian tracking algorithm, through computing the similarity of the pedestrian bounding-box feature information of the targets to be tracked between two consecutive frames of the images, and assigning an ID to each target to be tracked to obtain the tracking results;
based on the tracking results, extracting key point coordinates for each target to be tracked with the regional multi-person pose estimation algorithm, and outputting the key point information and the human tracking numbers.
For example, object detection technology mainly locates and classifies the objects appearing in an image; among current mainstream object detection methods, the yolo network is widely used in real-life scenarios thanks to its high real-time performance. This application uses a yolov4-based pedestrian detector: a feature extraction network first divides the input image into S×S grids at multiple scales, a series of anchor boxes centered on the center of each grid is then generated, these anchor boxes are classified and their boundaries fine-tuned, and the pedestrian bounding box positions in the image are finally predicted.
Fall action videos generally contain many pedestrians and carry temporal information; detecting pedestrian positions and tracking them frame by frame is the prerequisite for assembling the pose graph data. The DeepSort-based pedestrian tracking algorithm matches by computing the similarity of pedestrian bounding-box feature information between two consecutive frames, while assigning an ID to each target to be tracked (a number is given to the "person" whose trajectory is in the confirmed state in each frame). DeepSort mainly uses the Kalman filter and the Hungarian algorithm to ensure tracking accuracy, where the Kalman filter algorithm predicts and updates the pedestrian trajectories and the Hungarian algorithm performs optimal IOU assignment between the pedestrian detector outputs and the tracking predictions. Cascade matching and a new-trajectory confirmation mechanism are also introduced; cascade matching refers to data association combining pedestrian motion information and appearance information, and the matching degree between the Kalman filter predictions and the motion information in the pedestrian detector results is evaluated with the Mahalanobis distance, computed as follows:

$$d^{(1)}(i,j) = (d_j - y_i)^{\top} S_i^{-1} (d_j - y_i)$$

where $d^{(1)}(i,j)$ denotes the matching degree, $d_j$ denotes the position of the j-th detection box, $y_i$ denotes the position of the target predicted by the i-th tracker, and $S_i$ denotes the covariance matrix between the detection position and the mean tracking position, which mainly takes the state uncertainty into account. Pedestrian appearance information is matched with a cosine distance metric. If the Mahalanobis distance of an association is not larger than a specified threshold $t^{(1)}$, the association of the motion state is set as successful, using the following function:

$$b_{i,j}^{(1)} = \mathbb{1}\left[d^{(1)}(i,j) \leq t^{(1)}\right]$$

where $b_{i,j}^{(1)}$ serves as the flag of whether the association succeeds. New-trajectory confirmation refers to dividing trajectories into a confirmed state and an unconfirmed state: a newly generated trajectory is unconfirmed by default and must match the pedestrian detector's detection results a certain number of consecutive times before it can be converted to the confirmed state, and a confirmed trajectory is deleted only after mismatching the detection results a certain number of times.
Pedestrians within the fixed time-frame window are accurately tracked and numbered, and key point coordinates are extracted for each individual pedestrian according to the tracking results. The idea of the RMPE (Regional Multi-Person Pose Estimation) algorithm is to detect every human detection box in the scene and then estimate the pose of each human region independently; the output includes the human tracking number and the three-dimensional information (x, y, c) of 18 human key points, where (x, y) denotes the coordinates and c denotes the confidence.
The pose estimation workflow is shown in Fig. 6. First, the spatial transformer network receives the human proposal boxes and can turn the inaccurate proposal boxes in the pedestrian detection results into high-quality human proposal boxes, making the proposals more precise. The pedestrian bounding box is cropped and input into the single-person pose estimation algorithm to obtain the human pose key points; the spatial inverse-transformer network outputs the candidate human poses in the original image, and the redundant pose information of each pedestrian is filtered out by the pose non-maximum suppression method. The parallel single-person pose estimation algorithm is used only in the training phase, and its output is compared directly with the ground-truth values of the human pose labels; the purpose is to back-propagate the error produced after pose localization to the spatial transformer network and help the spatial transformer network generate high-quality region locations.
Following the above embodiments of the present application, performing action recognition on the pose spatio-temporal joint feature in combination with the change characteristics of the fall action to obtain the action recognition classification result includes:
performing global average pooling on the pose spatio-temporal joint feature, and feeding the obtained pooled result into a fully connected linear layer;
outputting, through the softmax classifier and in combination with the change characteristics of the fall action, the highest-scoring category corresponding to the pose spatio-temporal joint feature to obtain the classification result.
For example, the graph convolutional neural network outputs 384 feature channels; global average pooling is then applied to the output features over the spatio-temporal dimensions and the individual pedestrians in turn, the pooled result is input into the fully connected linear layer (input channels 384, output channels equal to the number of categories), and the softmax classifier finally outputs the highest-scoring category.
According to another aspect of the present application, there is also provided a computer-readable medium on which computer-readable instructions are stored, the computer-readable instructions, when executed by a processor, causing the processor to implement the pedestrian fall action recognition method described above.
According to another aspect of the present application, there is also provided a pose estimation-based pedestrian fall action recognition device, characterized in that the device includes:
one or more processors, including CPUs and GPUs;
a computer-readable medium for storing one or more computer-readable instructions,
wherein, when the one or more computer-readable instructions are executed by the one or more processors, the one or more processors implement the pedestrian fall action recognition method described above.
Here, for the details of each embodiment of the device, reference may be made to the corresponding parts of the foregoing method embodiments, which are not repeated here.
In summary, this application acquires multiple frames of images in the original video stream, performs pedestrian detection and tracking on each frame of the images, obtains human tracking numbers and key point information through pose estimation, and aggregates, with a multi-scale adjacency matrix, the key point information of the multiple frames of images before and after each key point to obtain pose graph data; inputs the pose graph data into a graph convolutional neural network, introduces residual connections between the multiple spatio-temporal merged graph convolution modules in the network, and performs feature extraction sequentially through them to obtain the pose spatio-temporal joint feature; and performs action recognition on the pose spatio-temporal joint feature in combination with the change characteristics of the fall action to obtain the action recognition classification result. That is, this application uses a multi-scale adjacency matrix to aggregate information, introduces residual connections between identically structured upper and lower spatio-temporal joint modules, extracts the pose spatio-temporal joint features on two streams (a key point stream and a bone edge stream) separately, and finally merges the two-stream results to make the fall action judgment, reducing the influence of the background on the recognition result, improving action recognition accuracy, and reducing the amount of computation.
It should be noted that the present application may be implemented in software and/or in a combination of software and hardware, for example using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, the software program of the present application may be executed by a processor to realize the steps or functions described above. Likewise, the software program of the present application (including associated data structures) may be stored in a computer-readable recording medium, for example a RAM, a magnetic or optical drive, a floppy disk, or similar devices. In addition, some steps or functions of the present application may be implemented in hardware, for example as a circuit that cooperates with a processor to execute each step or function.
In addition, a part of the present application may be applied as a computer program product, for example computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the present application through the operation of the computer. The program instructions that invoke the method of the present application may be stored in a fixed or removable recording medium, and/or transmitted through a data stream in a broadcast or other signal-carrying medium, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, an embodiment according to the present application includes an apparatus that includes a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the apparatus is triggered to run the methods and/or technical solutions based on the foregoing multiple embodiments of the present application.
It is obvious to those skilled in the art that the present application is not limited to the details of the above exemplary embodiments and can be implemented in other specific forms without departing from the spirit or basic features of the present application. Therefore, the embodiments should be regarded in all respects as exemplary and non-limiting; the scope of the present application is defined by the appended claims rather than by the above description, and it is therefore intended that all changes falling within the meaning and scope of equivalent elements of the claims be embraced in the present application. No reference sign in the claims shall be construed as limiting the claim concerned. Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a device claim may also be implemented by one unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.

Claims (9)

  1. A pose estimation-based pedestrian fall action recognition method, characterized in that the method includes:
    acquiring multiple frames of images in an original video stream, performing pedestrian detection and tracking on each frame of the images, obtaining human tracking numbers and key point information through pose estimation, and aggregating, with a multi-scale adjacency matrix, the key point information of the multiple frames of images before and after each key point to obtain pose graph data;
    inputting the pose graph data into a graph convolutional neural network, introducing residual connections between multiple spatio-temporal merged graph convolution modules in the graph convolutional neural network, and performing feature extraction sequentially through the multiple spatio-temporal merged graph convolution modules to obtain a pose spatio-temporal joint feature;
    performing action recognition on the pose spatio-temporal joint feature in combination with the change characteristics of the fall action to obtain an action recognition classification result.
  2. The method according to claim 1, characterized in that the graph convolutional neural network includes a first spatio-temporal merged graph convolution module, a second spatio-temporal merged graph convolution module, and a third spatio-temporal merged graph convolution module;
    each spatio-temporal merged graph convolution module includes a multi-window multi-scale 3D graph convolution layer and a serialized component layer, the serialized components comprising a multi-scale graph convolution and two consecutive multi-scale temporal convolutions.
  3. The method according to claim 2, characterized in that inputting the pose graph data into the graph convolutional neural network, introducing residual connections between the multiple spatio-temporal merged graph convolution modules, and performing feature extraction sequentially through them to obtain the pose spatio-temporal joint feature includes:
    inputting the pose graph data into the graph convolutional neural network, and normalizing the pose graph data to adjust its array shape;
    inputting the adjusted pose graph data into the first spatio-temporal merged graph convolution module for feature extraction to obtain a first pose spatio-temporal feature;
    inputting the first pose spatio-temporal feature into the second spatio-temporal merged graph convolution module for feature extraction to obtain a second pose spatio-temporal feature;
    after the first pose spatio-temporal feature is residual-connected to the second pose spatio-temporal feature, inputting the result into the third spatio-temporal merged graph convolution module for feature extraction to obtain the pose spatio-temporal joint feature.
  4. The method according to claim 3, characterized in that inputting the adjusted pose graph data into the first spatio-temporal merged graph convolution module for feature extraction to obtain the first pose spatio-temporal feature includes:
    inputting the adjusted pose graph data into the multi-window multi-scale 3D graph convolution layer and into the serialized component layer of the first spatio-temporal merged graph convolution module, respectively;
    passing the adjusted pose graph data sequentially through the multi-scale graph convolution and the two consecutive multi-scale temporal convolutions in the serialized component layer for feature extraction;
    passing the adjusted pose graph data through the multi-window multi-scale 3D graph convolution layer for feature extraction;
    summing the features output by the multi-window multi-scale 3D graph convolution layer and the serialized component layer, feeding the sum into an activation function, and performing one more multi-scale temporal convolution feature extraction to obtain the first pose spatio-temporal feature;
    in the second and third spatio-temporal merged graph convolution modules, performing feature extraction in the same way as in the first spatio-temporal merged graph convolution module to obtain the second pose spatio-temporal feature and the pose spatio-temporal joint feature, respectively.
  5. The method according to claim 4, characterized in that the pose graph data includes a human key point set and a bone edge set, and the k-adjacency matrix of the pose graph data is expressed as follows:

$$[\tilde{A}_k]_{i,j} = \begin{cases} 1, & \text{if } d(v_i, v_j) = k \\ 1, & \text{if } i = j \\ 0, & \text{otherwise} \end{cases}$$

    where k denotes the neighbor order of a key point, (i, j) are the i-th and j-th key points, and d(v_i, v_j) denotes the distance between key points i and j.
  6. The method according to any one of claims 1 to 5, characterized in that, in step A, acquiring multiple frames of images in the original video stream, performing pedestrian detection and tracking on each frame of images, and obtaining human tracking numbers and key point information through pose estimation includes:
    acquiring the multiple frames of the images in the original video stream and determining the targets to be tracked;
    matching, by the DeepSort-based pedestrian tracking algorithm, through computing the similarity of the pedestrian bounding-box feature information of the targets to be tracked between two consecutive frames of the images, and assigning an ID to each target to be tracked to obtain tracking results;
    based on the tracking results, extracting key point coordinates for each target to be tracked with the regional multi-person pose estimation algorithm, and outputting key point information and human tracking numbers.
  7. The method according to claim 6, characterized in that performing action recognition on the pose spatio-temporal joint feature in combination with the change characteristics of the fall action to obtain the action recognition classification result includes:
    performing global average pooling on the pose spatio-temporal joint feature, and feeding the obtained pooled result into a fully connected linear layer;
    outputting, through the classifier and in combination with the change characteristics of the fall action, the highest-scoring category corresponding to the pose spatio-temporal joint feature to obtain the classification result.
  8. A computer-readable medium on which computer-readable instructions are stored, the computer-readable instructions, when executed by a processor, causing the processor to implement the method according to any one of claims 1 to 7.
  9. A pose estimation-based pedestrian fall action recognition device, characterized in that the device includes:
    one or more processors;
    a computer-readable medium for storing one or more computer-readable instructions,
    wherein, when the one or more computer-readable instructions are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 7.
PCT/CN2022/121935 2021-11-15 2022-09-28 Pose estimation-based pedestrian fall action recognition method and device WO2023082882A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB2302960.6A (published as GB2616733A) Pose estimation-based pedestrian fall action recognition method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111345550.3A 2021-11-15 2021-11-15 Pose estimation-based pedestrian fall action recognition method and device
CN202111345550.3 2021-11-15

Publications (1)

Publication Number Publication Date
WO2023082882A1 true WO2023082882A1 (zh) 2023-05-19

Family

ID=79470470

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/121935 2021-11-15 2022-09-28 Pose estimation-based pedestrian fall action recognition method and device WO2023082882A1 (zh)

Country Status (2)

Country Link
CN (1) CN113963445B (zh)
WO (1) WO2023082882A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116311542A (zh) * 2023-05-23 2023-06-23 广州英码信息科技有限公司 Human body fall detection method and system compatible with crowded and non-crowded scenes
CN116469132A (zh) * 2023-06-20 2023-07-21 济南瑞泉电子有限公司 Fall detection method, system, device and medium based on two-stream feature extraction
CN116524601A (zh) * 2023-06-21 2023-08-01 深圳市金大智能创新科技有限公司 Adaptive multi-stage human behavior recognition model for assisting eldercare robot monitoring
CN116665309A (zh) * 2023-07-26 2023-08-29 山东睿芯半导体科技有限公司 Gait feature recognition method, apparatus, chip and terminal
CN117475518A (zh) * 2023-12-27 2024-01-30 华东交通大学 Synchronous human motion recognition and prediction method and system

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113963445B (zh) * 2021-11-15 2024-06-18 河南理工大学 Pose estimation-based pedestrian fall action recognition method and device
GB2616733A (en) * 2021-11-15 2023-09-20 Univ Henan Polytechnic Pose estimation-based pedestrian fall action recognition method and device
CN115116133A (zh) * 2022-06-14 2022-09-27 鹏城实验室 Abnormal behavior detection system and method for monitoring elderly people living alone
CN115858725B (zh) * 2022-11-22 2023-07-04 广西壮族自治区通信产业服务有限公司技术服务分公司 Text noise screening method and system based on unsupervised graph neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738154A (zh) * 2019-10-08 2020-01-31 南京熊猫电子股份有限公司 Pedestrian fall detection method based on human pose estimation
US20210082144A1 (en) * 2019-09-12 2021-03-18 Nec Laboratories America, Inc Keypoint based pose-tracking using entailment
CN112966628A (zh) * 2021-03-17 2021-06-15 广东工业大学 View-adaptive multi-target fall detection method based on graph convolutional neural network
WO2021114892A1 (zh) * 2020-05-29 2021-06-17 平安科技(深圳)有限公司 Human behavior recognition method, apparatus, device and storage medium based on environmental semantic understanding
CN113963445A (zh) * 2021-11-15 2022-01-21 河南理工大学 Pose estimation-based pedestrian fall action recognition method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800903B (zh) * 2021-01-19 2022-08-26 南京邮电大学 Dynamic expression recognition method and system based on spatio-temporal graph convolutional neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210082144A1 (en) * 2019-09-12 2021-03-18 Nec Laboratories America, Inc Keypoint based pose-tracking using entailment
CN110738154A (zh) * 2019-10-08 2020-01-31 南京熊猫电子股份有限公司 Pedestrian fall detection method based on human pose estimation
WO2021114892A1 (zh) * 2020-05-29 2021-06-17 平安科技(深圳)有限公司 Human behavior recognition method, apparatus, device and storage medium based on environmental semantic understanding
CN112966628A (zh) * 2021-03-17 2021-06-15 广东工业大学 View-adaptive multi-target fall detection method based on graph convolutional neural network
CN113963445A (zh) * 2021-11-15 2022-01-21 河南理工大学 Pose estimation-based pedestrian fall action recognition method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Master's Thesis", 1 May 2021, DONGHUA UNIVERSITY, CN, article GE, WEI: "Fall Detection Based on Human Pose Recognition", pages: 1 - 70, XP009545455, DOI: 10.27012/d.cnki.gdhuu *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116311542A (zh) * 2023-05-23 2023-06-23 广州英码信息科技有限公司 Human body fall detection method and system compatible with crowded and non-crowded scenes
CN116311542B (zh) * 2023-05-23 2023-08-04 广州英码信息科技有限公司 Human body fall detection method and system compatible with crowded and non-crowded scenes
CN116469132A (zh) * 2023-06-20 2023-07-21 济南瑞泉电子有限公司 Fall detection method, system, device and medium based on two-stream feature extraction
CN116469132B (zh) * 2023-06-20 2023-09-05 济南瑞泉电子有限公司 Fall detection method, system, device and medium based on two-stream feature extraction
CN116524601A (zh) * 2023-06-21 2023-08-01 深圳市金大智能创新科技有限公司 Adaptive multi-stage human behavior recognition model for assisting eldercare robot monitoring
CN116524601B (zh) * 2023-06-21 2023-09-12 深圳市金大智能创新科技有限公司 Adaptive multi-stage human behavior recognition model for assisting eldercare robot monitoring
CN116665309A (zh) * 2023-07-26 2023-08-29 山东睿芯半导体科技有限公司 Gait feature recognition method, apparatus, chip and terminal
CN116665309B (zh) * 2023-07-26 2023-11-14 山东睿芯半导体科技有限公司 Gait feature recognition method, apparatus, chip and terminal
CN117475518A (zh) * 2023-12-27 2024-01-30 华东交通大学 Synchronous human motion recognition and prediction method and system
CN117475518B (zh) * 2023-12-27 2024-03-22 华东交通大学 Synchronous human motion recognition and prediction method and system

Also Published As

Publication number Publication date
CN113963445A (zh) 2022-01-21
CN113963445B (zh) 2024-06-18

Similar Documents

Publication Publication Date Title
WO2023082882A1 (zh) 2023-05-19 Pose estimation-based pedestrian fall action recognition method and device
US10198823B1 (en) Segmentation of object image data from background image data
Jegham et al. Vision-based human action recognition: An overview and real world challenges
Zhang et al. A review on human activity recognition using vision‐based method
US9965865B1 (en) Image data segmentation using depth data
Shen et al. Multiobject tracking by submodular optimization
US9098740B2 (en) Apparatus, method, and medium detecting object pose
US10778988B2 (en) Method, an apparatus and a computer program product for object detection
JP5554984B2 (ja) 2014-07-23 Pattern recognition method and pattern recognition apparatus
Lim et al. A feature covariance matrix with serial particle filter for isolated sign language recognition
US11551060B2 (en) Identifying image aesthetics using region composition graphs
Yang et al. Facial expression recognition based on dual-feature fusion and improved random forest classifier
WO2021249114A1 (zh) 目标跟踪方法和目标跟踪装置
CN113628244B (zh) 基于无标注视频训练的目标跟踪方法、系统、终端及介质
WO2009152509A1 (en) Method and system for crowd segmentation
WO2021098802A1 (en) Object detection device, method, and systerm
KR20220004009A (ko) 2022-01-11 Key point detection method, apparatus, electronic device and storage medium
Feng Mask RCNN-based single shot multibox detector for gesture recognition in physical education
Huynh-The et al. Learning action images using deep convolutional neural networks for 3D action recognition
CN112560620B (zh) 一种基于目标检测和特征融合的目标跟踪方法及系统
Babu et al. Subject independent human action recognition using spatio-depth information and meta-cognitive RBF network
Liang et al. Egocentric hand pose estimation and distance recovery in a single RGB image
Alletto et al. Head pose estimation in first-person camera views
EP4287145A1 (en) Statistical model-based false detection removal algorithm from images
Yadav et al. DroneAttention: Sparse weighted temporal attention for drone-camera based activity recognition

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 202302960

Country of ref document: GB

Kind code of ref document: A

Free format text: PCT FILING DATE = 20220928

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22891683

Country of ref document: EP

Kind code of ref document: A1