GB2616733A - Pose estimation-based pedestrian fall action recognition method and device - Google Patents

Pose estimation-based pedestrian fall action recognition method and device

Info

Publication number
GB2616733A
Authority
GB
United Kingdom
Prior art keywords
spatio
pose
temporal
feature
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2302960.6A
Other versions
GB202302960D0 (en)
GB2616733A8 (en)
Inventor
Zhang Fukai
He Tiancheng
Zhang Haiyan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202111345550.3A (published as CN113963445A)
Application filed by Henan University of Technology filed Critical Henan University of Technology
Publication of GB202302960D0
Publication of GB2616733A
Publication of GB2616733A8
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/44Event detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/03Recognition of patterns in medical or anatomical images
    • G06V2201/033Recognition of patterns in medical or anatomical images of skeletal patterns

Abstract

The present application provides a pose estimation-based pedestrian fall action recognition method and device. According to the present application, a multi-scale adjacency matrix is used to realize information aggregation, and residual connections are introduced between upper and lower spatio-temporal merging modules having the same structure; spatio-temporal joint features of a pose are extracted separately for two streams (a key point stream and a bone edge stream); and finally the dual-stream results are combined to make a fall action determination, so that the influence of the background on the recognition effect is reduced, action recognition accuracy is improved, and the computational complexity is reduced.

Description

METHOD AND DEVICE FOR RECOGNIZING PEDESTRIAN FALL ACTION BASED
ON POSE ESTIMATION
CROSS REFERENCE TO RELATED APPLICATION
[0001] This application claims priority to the Chinese Patent Application No. 202111345550.3, filed with the China National Intellectual Property Administration (CNIPA) on November 15, 2021, and entitled "METHOD AND DEVICE FOR RECOGNIZING PEDESTRIAN FALL ACTION BASED ON POSE ESTIMATION", which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
[0002] The present disclosure relates to the field of computers, and in particular, to a method and device for recognizing pedestrian fall action based on pose estimation.
BACKGROUND
[0003] In the prior art, the data modalities commonly used in the current behavior recognition field are mainly divided into original red, green and blue (RGB) videos and series of human pose key points. The original RGB video contains human behavior and motion information, but also a lot of background information that affects the recognition accuracy, such as lighting and a cluttered surrounding environment. With the rapid improvement of intelligent hardware, pose estimation algorithms for obtaining key points of the human body are becoming increasingly capable of running in real time. Pose information of each person in a video may be extracted with the help of a highly robust pedestrian detection network, and the pose output results may finally be encapsulated into the required data form.
[0004] In a fall action recognition method, human pose coordinates extracted from the video constitute graph data, which is subjected to feature learning by a graph convolutional network (GCN). Early work proposed a skeleton-based spatio-temporal graph convolutional network (ST-GCN) for feature extraction, in which the natural connection graph of human body key points within an image frame (the spatial dimension) is subjected to graph convolution, followed by temporal convolution in the temporal dimension or feature fusion using a long short-term memory (LSTM) network. The skeleton-based ST-GCN makes good use of the natural connectivity of the human body structure and the linkage relationship between the joints involved in an action, and takes adjacent joints in space and time into account. However, the skeleton-based ST-GCN only takes local joint connectivity into account: it neither treats nearby and distant key points as equally influential, nor considers, for a given key point, distant key points in the global graph and the related key points in multiple frames before and after the current frame. Further, the alternation between temporal and spatial processing is not robust enough to capture complex spatio-temporal joint relations. In 2020, a G3D spatio-temporal graph convolutional operator was proposed, which links spatio-temporal information for three-dimensional (3D) convolution and takes the importance of distant neighbors into account. It can stably and accurately extract high-level semantics of the action itself in the 3D space, greatly improving the accuracy of action classification. However, complex backgrounds and insufficient action feature extraction still have a great impact on the accuracy of action recognition.

[0005] Therefore, it remains desirable to develop a method that addresses the impact of complex backgrounds and insufficient action feature extraction on the accuracy of fall action recognition in the RGB video.
SUMMARY
[0006] An objective of the present disclosure is to provide a method and device for recognizing pedestrian fall action based on pose estimation, so as to solve the problem in the prior art of how to reduce the impact of the background on the recognition effect, improve the accuracy, and reduce computation cost during fall recognition.
[0007] According to one aspect of the present disclosure, a method for recognizing pedestrian fall action based on pose estimation is provided, including: [0008] obtaining multiple frames of images in an original video stream, performing pedestrian detection and tracking on each frame to obtain a human body tracking number and key point information by pose estimation, and aggregating, for each key point, key point information of multiple adjacent image frames by using a multi-scale adjacency matrix to obtain pose graph data; [0009] inputting the pose graph data into a GCN, introducing residual connections between multiple spatio-temporal merging graph convolutional modules in the GCN, and performing feature extraction through the multiple spatio-temporal merging graph convolutional modules sequentially to obtain a spatio-temporal joint pose feature; and [0010] performing, in combination with a change characteristic of a fall action, action recognition on the spatio-temporal joint pose feature to obtain action recognition classification results.
[0011] Further, in the above method for recognizing pedestrian fall action based on pose estimation, the GCN includes a first spatio-temporal merging graph convolutional module, a second spatio-temporal merging graph convolutional module, and a third spatio-temporal merging graph convolutional module.
[0012] Each of the spatio-temporal merging graph convolutional modules includes a multi-window multi-scale 3D graph convolutional layer and a serialization component layer, and the serialization component layer includes a multi-scale graph convolution and two consecutive multi-scale temporal convolutions.
[0013] Further, in the above method for recognizing pedestrian fall action based on pose estimation, the inputting the pose graph data into a GCN, introducing residual connections between multiple spatio-temporal merging graph convolutional modules in the GCN, and performing feature extraction through the multiple spatio-temporal merging graph convolutional modules sequentially to obtain a spatio-temporal joint pose feature includes: [0014] inputting the pose graph data into the GCN, and performing normalization processing on the pose graph data to adjust an array shape of the pose graph data; [0015] inputting the adjusted pose graph data into the first spatio-temporal merging graph convolutional module for feature extraction, to obtain a first spatio-temporal pose feature; [0016] inputting the first spatio-temporal pose feature into the second spatio-temporal merging graph convolutional module for feature extraction, to obtain a second spatio-temporal pose feature; and [0017] inputting the first spatio-temporal pose feature after a residual connection with the second spatio-temporal pose feature, into the third spatio-temporal merging graph convolutional module for feature extraction to obtain the spatio-temporal joint pose feature.
[0018] Further, in the above method for recognizing pedestrian fall action based on pose estimation, the inputting the adjusted pose graph data into the first spatio-temporal merging graph convolutional module for feature extraction to obtain a first spatio-temporal pose feature includes: [0019] inputting the adjusted pose graph data into the multi-window multi-scale 3D graph convolutional layer and the serialization component layer in the first spatio-temporal merging graph convolutional module; [0020] performing feature extraction on the adjusted pose graph data through the multi-scale graph convolution and the two consecutive multi-scale temporal convolutions in the serialization component layer sequentially; [0021] performing feature extraction on the adjusted pose graph data through the multi-window multi-scale 3D graph convolutional layer; [0022] summing features output from the multi-window multi-scale 3D graph convolutional layer and the serialization component layer to obtain a sum, inputting the sum into an activation function to perform multi-scale temporal convolution feature extraction again to obtain the first spatio-temporal pose feature.
[0023] In the second spatio-temporal merging graph convolutional module and the third spatio-temporal merging graph convolutional module, the second spatio-temporal pose feature and the spatio-temporal joint pose feature are obtained by using a same method as in feature extraction of the first spatio-temporal merging graph convolutional module.
[0024] Further, in the above method for recognizing pedestrian fall action based on pose estimation, the pose graph data includes a human body key point set and a bone edge set, and a k-adjacency matrix of the pose graph data is expressed as follows:

$$[A_{(k)}]_{i,j}=\begin{cases}1 & \text{if } d(v_i,v_j)=k,\\ 1 & \text{if } i=j,\\ 0 & \text{otherwise}\end{cases}$$

where k represents different neighbor orders of key points, (i, j) represents the i-th and j-th key points, and d(v_i, v_j) represents a distance between the i-th and j-th key points.
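For example, a minimal Python sketch of computing such k-adjacency matrices from hop distances on the skeleton graph may look as follows; the helper is generic, and the number of joints and the edge list must be supplied by the caller, so no particular skeleton layout is assumed here:

```python
# Illustrative sketch: [A_(k)]_{i,j} = 1 if the hop distance d(v_i, v_j) = k or i = j, else 0.
import numpy as np
from collections import deque

def hop_distances(num_joints, edges):
    """All-pairs hop distance d(v_i, v_j) on the skeleton graph, via BFS from each joint."""
    neighbors = [[] for _ in range(num_joints)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    dist = np.full((num_joints, num_joints), np.inf)
    for start in range(num_joints):
        dist[start, start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if np.isinf(dist[start, v]):
                    dist[start, v] = dist[start, u] + 1
                    queue.append(v)
    return dist

def k_adjacency(num_joints, edges, k):
    """k-adjacency matrix A_(k): 1 where the hop distance equals k, and 1 on the diagonal."""
    d = hop_distances(num_joints, edges)
    a_k = (d == k).astype(np.float32)
    np.fill_diagonal(a_k, 1.0)
    return a_k
```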
[0025] Further, in the above method for recognizing pedestrian fall action based on pose estimation, the obtaining multiple image frames in an original video stream, and performing pedestrian detection and tracking on each frame to obtain a human body tracking number and key point information by pose estimation includes: [0026] obtaining the multiple image frames in the original video stream, and determining a to-be-tracked target; [0027] performing matching through a DeepSort-based pedestrian tracking algorithm by calculating a similarity of pedestrian bounding box feature information of the to-be-tracked target between two adjacent image frames, and assigning an identification (ID) to each to-be-tracked target to obtain a tracking result; and [0028] based on the tracking result, extracting key point coordinates of each to-be-tracked target by using a regional multi-person pose estimation (RMPE) algorithm, and outputting the key point information and the human body tracking number.
[0029] Further, in the above method for recognizing pedestrian fall action based on pose estimation, performing, in combination with a change characteristic of a fall action, action recognition on the spatio-temporal joint pose feature to obtain action recognition classification results includes: [0030] performing global average pooling processing on the spatio-temporal joint pose feature to obtain pooling results, and inputting the pooling results into a fully connected linear layer; and [0031] outputting, in combination with the change characteristic of the fall action, a category with a highest score corresponding to the spatio-temporal joint pose feature through a classifier to obtain the classification results.
[0032] According to another aspect of the present disclosure, a computer-readable medium is further provided, having computer-readable instructions stored thereon. When executed by a processor, the computer-readable instructions enable the processor to implement the method for recognizing fall action as mentioned above.
[0033] According to another aspect of the present disclosure, a device for recognizing pedestrian fall action based on pose estimation is further provided, including: [0034] one or more processors; and [0035] a computer-readable medium configured to store one or more computer-readable instructions.
[0036] When executed by the one or more processors, the one or more computer-readable instructions enable the one or more processors to implement the method for recognizing fall action as mentioned above.
[0037] Compared with the prior art, the method includes: obtaining multiple image frames in an original video stream, performing pedestrian detection and tracking on each image frame to obtain a human body tracking number and key point information by pose estimation, and aggregating key point information of multiple image frames adjacent to each key point by using a multi-scale adjacency matrix method to obtain pose graph data; inputting the pose graph data into a GCN, introducing residual connections between multiple spatio-temporal merging graph convolutional modules in the GCN, and performing feature extraction through the multiple spatio-temporal merging graph convolutional modules sequentially to obtain a spatio-temporal joint pose feature; and performing, in combination with a change characteristic of a fall action, action recognition on the spatio-temporal joint pose feature to obtain action recognition classification results. That is, the method uses a multi-scale adjacency matrix for information aggregation, introduces residual connections between upper and lower spatio-temporal joint modules with the same structure, extracts spatio-temporal joint features of a pose on dual streams (a key point stream and a bone edge stream), and finally merges the spatio-temporal joint features of the dual streams to make a judgment on a fall action, thereby reducing the impact of the background on the recognition effect, improving the accuracy of action recognition, and reducing the computation cost.
BRIEF DESCRIPTION OF THE DRAWINGS
[0038] Other features, objectives and advantages of the present disclosure will become more apparent upon reading the detailed description of the non-restrictive embodiments with reference to the following drawings.
[0039] FIG. 1 is a schematic flowchart of a method for recognizing pedestrian fall action based on pose estimation according to an aspect of the present disclosure; [0040] FIG. 2 is a schematic diagram of spatial information of key points in consecutive frames in a method for recognizing pedestrian fall action based on pose estimation according to an aspect of the present disclosure; [0041] FIG. 3 is a schematic diagram of position changes of key points during human fall in a method for recognizing pedestrian fall action based on pose estimation according to an aspect of the present disclosure; [0042] FIG. 4(a) is a schematic diagram of embedded representations after key point feature extraction in a method for recognizing pedestrian fall action based on pose estimation according to an aspect of the present disclosure; [0043] FIG. 4(b) is a schematic diagram of updating the 6-th key point by aggregating neighbor information in an embodiment of a method for recognizing pedestrian fall action based on pose estimation according to an aspect of the present disclosure; [0044] FIG. 5 is a schematic diagram of a data calculation process of feature extraction with a GCN in a method for recognizing pedestrian fall action based on pose estimation according to an aspect of the present disclosure; and [0045] FIG. 6 is a schematic diagram of a working process of pose estimation of a method for recognizing pedestrian fall action based on pose estimation according to an aspect of the present disclosure.
[0046] The same or similar reference numerals in the drawings represent the same or similar parts.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0047] The present disclosure is further described below with reference to the accompanying drawings.
[0048] In a typical configuration of the present disclosure, the terminal, the device of the web service, and the trusted party all include one or more processors (such as a central processing unit (CPU) or a graphics processing unit (GPU)), input/output (I/O) interfaces, network interfaces, and a memory.
[0049] The memory may include a non-persistent memory, a random access memory (RAM) and/or a non-volatile memory in computer-readable media, such as a read only memory (ROM) or a flash RAM. The memory is an example of computer-readable media.
[0050] The computer-readable media includes persistent and non-persistent, and removable and non-removable media, and storage of information may be implemented by any method or technology. The information may be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, a phase-change random access memory (PRAM), a static random access memory (SRAM), a dynamic random access memory (DRAM), other types of RAMs, a ROM, an electrically erasable programmable read-only memory (EEPROM), a flash memory or other memory technology, a compact disc read-only memory (CD-ROM), a digital versatile disk (DVD) or other optical storage, a magnetic cassette tape, a magnetic tape disk storage or other magnetic storage device or any other non-transmission medium that can be used to store information that can be accessed by a computing device. The computer-readable media, as defined herein, excludes transitory computer-readable media, such as modulated data signals and carrier waves. [0051] FIG. 1 is a schematic flowchart of a method for recognizing pedestrian fall action based on pose estimation according to an aspect of the present disclosure. The method is suitable for various scenes of human daily life, including but not limited to offices, homes, coffee rooms, and lecture rooms, and the method can be used to recognize various human actions, such as walking, standing, sitting, and standing up. The method includes step S11, step S12, and step S13, and specifically includes the following.
[0052] In step S11, multiple image frames in an original video stream are obtained. Pedestrian detection and tracking are performed on each image frame, to obtain a human body tracking number and key point information by pose estimation. Key point information of multiple image frames adjacent to each key point is aggregated by using a multi-scale adjacency matrix method to obtain pose graph data. Here, in real life, the fall action always depends on some other behaviors before it occurs, such as walking and standing. After the fall action occurs, it depends on some behaviors such as lying down, so it is necessary to establish a long-term association among key points of multiple frames to aggregate the information of each key point of a current image frame with the information of the corresponding key points in multiple image frames before and after the current frame, as shown in FIG. 2. FIG. 3 shows position changes of key points when a person falls over from a standing pose. The neighbor order of a key point refers to the number of hops between two key points, such as the k-order neighbor of the 11-th joint, k ∈ {1, 2, 3}. The 1-order neighbor [1, 12] and the 2-order neighbor [0, 2, 5, 13] have the same influence on the judgment of the behavior. By using the pose information of the human body, the pose graph data is constructed for fall recognition, which greatly reduces the impact of the background on the recognition effect and reduces the computation cost.
[0053] In step S12, the pose graph data is input into a GCN. Residual connections are introduced between multiple spatio-temporal merging graph convolutional modules in the GCN. Feature extraction is performed through the multiple spatio-temporal merging graph convolutional modules sequentially to obtain a spatio-temporal joint pose feature. Here, the essence of the GCN is to generate embedded representations of key points through convolution aggregation based on nearby network neighbors, as shown in FIG. 4(a). Key points have embedded representations at each layer. In FIG. 4(b), the embedded representation of layer 0 is the inputted key point feature X. The embedded representation of a key point in layer K is computed by aggregating the information of its neighbors in layer K-1. A directed line from a neighbor key point to this key point is message passing, and the intermediate aggregation is performed in the neural network mode. After the information of neighbors is transmitted, it is averaged and then aggregated. The complete aggregation process of a key point is expressed as the following formula:

$$h_v^{k}=\sigma\left(W_k\sum_{u\in N(v)}\frac{h_u^{k-1}}{|N(v)|}+B_k h_v^{k-1}\right),\quad \forall k\in\{1,2,\dots,K\}$$

[0054] First, the embedded representation $h_v^{0}=x_v$ of layer 0 is initialized. $h_v^{k-1}$ represents the embedded representation of a key point v in the previous layer. $\sigma$ represents a nonlinear activation function such as Sigmoid or ReLU. $\sum_{u\in N(v)}\frac{h_u^{k-1}}{|N(v)|}$ represents the aggregation of the neighbor features $h_u^{k-1}$ by the K-layer neural network, which means averaging the embedded representations of the previous layer.
[0055] Another general vector representation of the GCN is shown below:

$$H^{(l+1)}=\sigma\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right)$$

where $\tilde{A}$ is the sum of the adjacency matrix A and the identity matrix E, and $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ represents a normalization operation on $\tilde{A}$.
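As an illustration, the vector form above can be sketched in a few lines of Python; the NumPy-based shapes and the choice of ReLU as the nonlinearity are assumptions made for this example only:

```python
# Minimal sketch of H^(l+1) = sigma(D^-1/2 (A + E) D^-1/2 H^(l) W^(l)).
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution layer: H is (V, C_in), A is (V, V), W is (C_in, C_out)."""
    A_tilde = A + np.eye(A.shape[0])            # A + E (add self-loops)
    d_tilde = A_tilde.sum(axis=1)               # degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)       # ReLU as the nonlinearity sigma
```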
[0056] In step S13, in combination with change characteristics of a fall action, action recognition is performed on the spatio-temporal joint pose feature to obtain action recognition classification results. Here, the earlier use of the ST-GCN to obtain skeleton action features fails to jointly take the spatio-temporal characterization into account, so actions that rely heavily on spatio-temporal combined information cannot be recognized very well. The present disclosure extracts the spatio-temporal pose features on the basis of the G3D convolution operator and, in combination with the change characteristics of the fall action, integrates various algorithms to solve the problem of fall detection in real life.
[0057] In the above steps S11 to S13, first, multiple image frames in an original video stream are obtained. Pedestrian detection and tracking are performed on each image frame to obtain a human body tracking number and key point information by pose estimation. For each key point, its information is aggregated with the information of the corresponding key points in the multiple adjacent image frames by using a multi-scale adjacency matrix to obtain pose graph data. Then, the pose graph data is input into a GCN. Residual connections are introduced between multiple spatio-temporal merging graph convolutional modules in the GCN. Feature extraction is performed through the multiple spatio-temporal merging graph convolutional modules sequentially to obtain a spatio-temporal joint pose feature. Finally, in combination with change characteristics of a fall action, action recognition is performed on the spatio-temporal joint pose feature to obtain action recognition classification results. That is, the method according to the present disclosure uses a multi-scale adjacency matrix for information aggregation, introduces a residual connection between upper and lower spatio-temporal joint modules with the same structure, extracts spatio-temporal joint features of a pose on dual streams (a key point stream and a bone edge stream), and finally merges the spatio-temporal joint features of the dual streams to make a judgment on a fall action, thereby reducing the impact of the background on the recognition effect, improving the accuracy of action recognition, and reducing the computation cost.
[0058] For example, experiments are performed on two fall detection datasets, Le2i Fall Detection (LFD) and UR Fall Detection (URFD). LFD includes 191 videos of human activities covering 4 scenes: office, home, coffee room, and lecture room. The videos contain image frames with falls and image frames with no people. The format of the video is 320 x 240, at 25 frames per second. URFD contains 70 sequences (30 falls + 40 activities of daily living), and fall events are recorded by using 2 Microsoft Kinect cameras and a corresponding accelerometer.
[0059] During training, opencv and video editing tools are used to preprocess the original video for LFD, with a resolution of 640 x 480 and a frame rate of 30 FPS. According to the size of the sliding window, video sample duration is selected from 3 to 9 seconds. The original video contains some actions other than falling. The videos with fall action are treated as one group, and the others (videos with walking, standing, sitting, and standing up) are treated as another group and relabeled, and a total of 26,100 frames are selected. Since the 40 videos with daily activity in URFD are quite different from each other, they need to be relabeled as walking, sitting, and bending, and they are classified as a non-fall group.
[0060] Pedestrian detection is performed on image frames P1, P2, P3, ..., Pn through the pre-trained and fine-tuned yolov4 model, and the output bounding boxes are appropriately expanded. According to the output of the current frame, the trajectory tracking result of the next frame is predicted by using the Kalman filter algorithm. After that, the detection and prediction results are combined. The RMPE algorithm is used for pose estimation, the results are stored in an object list, and the trajectory state is finally updated. The matching threshold of the DeepSort algorithm is set to 30, and the backbone network of RMPE is resnet50. That is, pedestrian detection and tracking are performed on image frames P1, P2, P3, ..., Pn to obtain a human body tracking number ID and key point information X1, X2, X3, ..., Xn by pose estimation. Key point information of multiple image frames adjacent to each key point is aggregated by using a multi-scale adjacency matrix method to obtain pose graph data.
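A hypothetical per-frame sketch of this detection, tracking, and pose estimation pipeline is given below; `detector`, `tracker`, and `pose_estimator` stand in for the yolov4 detector, the DeepSort tracker, and the RMPE pose model, and the method names and the expansion ratio are placeholders rather than a real library API:

```python
def expand_box(box, ratio=0.1):
    """Slightly enlarge an (x1, y1, x2, y2) bounding box, as described above; ratio is assumed."""
    x1, y1, x2, y2 = box
    dw, dh = ratio * (x2 - x1), ratio * (y2 - y1)
    return (x1 - dw, y1 - dh, x2 + dw, y2 + dh)

def build_pose_sequences(frames, detector, tracker, pose_estimator):
    sequences = {}                                   # tracking ID -> list of per-frame key points
    for frame in frames:                             # frames P1, P2, ..., Pn
        boxes = [expand_box(b) for b in detector.detect(frame)]
        tracks = tracker.update(frame, boxes)        # Kalman prediction + matching, yields (id, box)
        for track_id, box in tracks:
            keypoints = pose_estimator(frame, box)   # 18 key points, each (x, y, c)
            sequences.setdefault(track_id, []).append(keypoints)
    return sequences                                 # later aggregated into pose graph data
```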
[0061] Then, the pose graph data is input into a GCN. Residual connections are introduced between multiple spatio-temporal merging graph convolutional modules B1, B2, B3, ..., Bn in the GCN. Feature extraction is performed through the multiple spatio-temporal merging graph convolutional modules B1, B2, B3, ..., Bn sequentially to obtain a spatio-temporal joint pose feature. In order to avoid overfitting during training on a dataset, the weight decay value is set to 0.0005. The model learning uses the stochastic gradient descent (SGD) optimizer. The initial learning rate is set to 0.05, training is performed for up to 80 epochs with a batch size of 8, and 0.1-fold learning rate decay is performed at the 35-th and 45-th epochs.
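A sketch of this training configuration, assuming a PyTorch implementation in which `model` and `train_loader` are already defined, may look as follows; the momentum value is an added assumption not taken from the disclosure:

```python
# Sketch of the schedule described above: SGD, lr 0.05, weight decay 0.0005,
# 80 epochs, batch size 8 (set in the DataLoader), 0.1x lr decay at epochs 35 and 45.
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.05,
                            momentum=0.9, weight_decay=0.0005)  # momentum assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[35, 45], gamma=0.1)

for epoch in range(80):
    for poses, labels in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(poses), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```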
[0062] Finally, in combination with change characteristics of a fall action, action recognition is performed on the spatio-temporal joint pose feature to obtain action recognition classification results. That is, the algorithm herein uses a multi-scale adjacency matrix for information aggregation, introduces a residual connection between upper and lower spatio-temporal joint modules with the same structure, extracts spatio-temporal joint features of a pose on dual streams (a key point stream and a bone edge stream), and finally merges the spatio-temporal joint features of the dual streams to make a judgment on a fall action, thereby reducing the impact of the background on the recognition effect, improving the accuracy of action recognition, and reducing the computation cost.
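A minimal sketch of the dual-stream fusion step is shown below; equal weighting of the key point stream and bone edge stream scores is an assumption made for illustration:

```python
import torch

def fuse_streams(joint_scores, bone_scores):
    """joint_scores, bone_scores: (N, num_classes) softmax outputs of the two streams."""
    fused = 0.5 * joint_scores + 0.5 * bone_scores   # equal weights assumed
    return fused.argmax(dim=1)                       # final category, e.g. fall vs. non-fall
```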
[0063] In addition, in order to reflect the advantages of pose data and the strong generalization ability of spatio-temporal 3D graph convolution for feature extraction, the popular video action classification algorithm SlowFast, the spatio-temporal graph convolution model (the two-stream adaptive graph convolutional network (2s-AGCN)), and some other algorithms that perform well on the UR Fall Detection dataset are compared with the algorithm according to the present disclosure. Compared with the SlowFast, 2s-AGCN, and Harrou et al. methods for recognizing pedestrian fall action, the method of the present disclosure greatly reduces the impact of the background on the recognition accuracy and produces more accurate action recognition results.
[0064] Following the above embodiment, in the method according to the present disclosure, the GCN includes a first spatio-temporal merging graph convolutional module B1, a second spatio-temporal merging graph convolutional module B2, and a third spatio-temporal merging graph convolutional module B3.
[0065] Each of the spatio-temporal merging graph convolutional modules includes a multi-window multi-scale 3D graph convolutional layer and a serialization component layer, and the serialization component layer includes a multi-scale graph convolution and two consecutive multi-scale temporal convolutions. Here, the multi-window multi-scale 3D graph convolutional layer is used to perform a 3D convolution that combines the temporal and spatial dimensions under different window sizes, in order to express the internal relationships of actions in the two dimensions. The multi-scale graph convolution may use the maximum distance between joint points to model the skeleton, and the two consecutive multi-scale temporal convolutions may be used to capture long-term or extended time-frame context information.
[0066] Following the above embodiment, in the method according to the present disclosure, a process of inputting the pose graph data into a GCN, introducing residual connections between multiple spatio-temporal merging graph convolutional modules in the GCN, and performing feature extraction through the multiple spatio-temporal merging graph convolutional modules sequentially to obtain a spatio-temporal joint pose feature includes the following sub-steps. [0067] The pose graph data is input into the GCN, and normalization processing is performed on the pose graph data to adjust an array shape of the pose graph data. For example, a 5-dimensional array (N,C,T,V,M) is input, where N represents a number of videos in a forward batch, C represents a number of characteristic information channels of a node, namely, 3 channels (x, y, acc), T represents a number of key frames in the video, V represents a number of joints, and M represents a number of people with the highest confidence in a frame. After batch normalization, the array shape is adjusted to 3 dimensions (N, C x V x M, T).
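A sketch of this normalization step in PyTorch is shown below; the example sizes and the reshape back to (N*M, C, T, V) before the graph-convolution blocks are assumptions based on common skeleton-GCN practice, not requirements of the disclosure:

```python
import torch
import torch.nn as nn

N, C, T, V, M = 8, 3, 90, 18, 1           # example sizes: batch, channels, frames, joints, persons
x = torch.randn(N, C, T, V, M)

data_bn = nn.BatchNorm1d(C * V * M)
x = x.permute(0, 1, 3, 4, 2).contiguous().view(N, C * V * M, T)   # (N, C x V x M, T)
x = data_bn(x)                                                    # batch normalization
# reshape back for the spatio-temporal merging graph convolutional modules
x = x.view(N, C, V, M, T).permute(0, 3, 1, 4, 2).contiguous().view(N * M, C, T, V)
```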
[0068] The adjusted pose graph data is input into the first spatio-temporal merging graph convolutional module for feature extraction to obtain a first spatio-temporal pose feature.
[0069] The first spatio-temporal pose feature is input into the second spatio-temporal merging graph convolutional module for feature extraction to obtain a second spatio-temporal pose feature.
[0070] The first spatio-temporal pose feature, after a residual connection with the second spatio-temporal pose feature, is input into the third spatio-temporal merging graph convolutional module for feature extraction to obtain the spatio-temporal joint pose feature. As shown in FIG. 5, in order to prevent loss of features caused by the increase in the number of layers, the output of the spatio-temporal merging graph convolutional module B1 is subjected to a convolution conversion and a residual connection with the module B2. The numbers in brackets in each sub-block are the numbers of input and output channels before and after the calculation.
[0071] Following the above embodiment, in the method according to the present disclosure, a process of inputting the adjusted pose graph data into the first spatio-temporal merging graph convolutional module for feature extraction to obtain a first spatio-temporal pose feature includes the following sub-steps.
[0072] The adjusted pose graph data is input into the multi-window multi-scale 3D graph convolutional layer and the serialization component layer in the first spatio-temporal merging graph convolutional module.
[0073] Feature extraction is performed on the adjusted pose graph data through the multi-scale graph convolution and the two consecutive multi-scale temporal convolutions in the serialization component layer sequentially.
[0074] Feature extraction is performed on the adjusted pose graph data through the multi-window multi-scale 3D graph convolutional layer.
[0075] Features output from the multi-window multi-scale 3D graph convolutional layer and the serialization component layer are summed, and the sum is input into an activation function, after which multi-scale temporal convolution feature extraction is performed again to obtain the first spatio-temporal pose feature.
[0076] In the second spatio-temporal merging graph convolutional module and the third spatio-temporal merging graph convolutional module, the second spatio-temporal pose feature and the spatio-temporal joint pose feature are obtained by using the same method as in feature extraction of the first spatio-temporal merging graph convolutional module. Here, the features output from the multi-window multi-scale 3D graph convolutional layer and the serialization component layer are summed, and the sum is sent to a relu() activation function, after which multi-scale temporal convolution feature extraction is performed again. The result is input into the next spatio-temporal merging graph convolutional module with the same logical processing structure, and finally the features are classified and output. The present disclosure equalizes the weights of high-order neighbor nodes in the layer-by-layer information aggregation, which helps improve the accuracy of action recognition.
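A structural sketch of one such spatio-temporal merging graph convolutional module is given below; `g3d_branch`, `ms_gcn`, and the `ms_tcn_*` arguments are placeholder sub-modules standing in for the multi-window multi-scale 3D graph convolution, the multi-scale graph convolution, and the multi-scale temporal convolutions, respectively:

```python
import torch.nn as nn

class STMergingBlock(nn.Module):
    """Two parallel branches summed, then ReLU and one more temporal convolution."""
    def __init__(self, g3d_branch, ms_gcn, ms_tcn1, ms_tcn2, ms_tcn_out):
        super().__init__()
        self.g3d_branch = g3d_branch                            # multi-window multi-scale 3D graph conv
        self.serial = nn.Sequential(ms_gcn, ms_tcn1, ms_tcn2)   # serialization component layer
        self.ms_tcn_out = ms_tcn_out                            # temporal conv applied after the sum
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.g3d_branch(x) + self.serial(x)   # sum the features of the two branches
        return self.ms_tcn_out(self.relu(out))      # activation, then temporal conv again
```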
[0077] According to yet another embodiment of the present disclosure, in the method according to the present disclosure, the pose graph data includes a human body key point set and a bone edge set. In order to reflect the importance of joint points and distant neighbors in multiple sequentially arranged frames, a k-adjacency matrix of the pose graph data is expressed as follows:

$$[A_{(k)}]_{i,j}=\begin{cases}1 & \text{if } d(v_i,v_j)=k,\\ 1 & \text{if } i=j,\\ 0 & \text{otherwise}\end{cases}$$

where k represents different neighbor orders of key points, (i, j) represents the i-th and j-th key points, and d(v_i, v_j) represents a distance between the i-th and j-th key points. Here, a bone edge refers to the connecting line between two key points. The present disclosure uses the pose graph data, which has more advantages than the original video, and greatly reduces the impact of the background on the recognition accuracy. That is, the pose information of the human body is used to construct the pose graph data for fall recognition, which greatly reduces the impact of the background on the recognition effect and also reduces the computation cost.
[0078] According to yet another embodiment of the present disclosure, in the method according to the present disclosure, a process of obtaining multiple image frames in an original video stream, and performing pedestrian detection and tracking on each image frame to obtain a human body tracking number and key point information by pose estimation includes the following sub-steps.
[0079] The multiple image frames in the original video stream are obtained to determine to-be-tracked targets.
[0080] Matching is performed through a DeepSort-based pedestrian tracking algorithm by calculating a similarity of pedestrian bounding box feature information of each to-be-tracked target between two adjacent image frames, and an ID is assigned to each of the to-be-tracked targets to obtain tracking results.
[0081] Based on the tracking results, key point coordinates of each of the to-be-tracked targets are extracted by using the RMPE algorithm, and the key point information and the human body tracking number are output.
[0082] For example, the target detection technology mainly locates and classifies objects in the image. At present, the yolo network among the mainstream target detection methods is widely used in real-life scenes due to its high real-time performance. The present disclosure uses a pedestrian detector based on yolov4, in which an input image is first divided into S x S grids at various scales by a feature extraction network, and then a series of anchor boxes are generated with the center of each grid as their center. These anchor boxes are classified and their bounding boxes are fine-tuned. Finally, the position of the pedestrian bounding box in the image is predicted.
[0083] A fall action video with timing information generally contains many pedestrians, and detecting the positions of the pedestrians and tracking them frame by frame is a prerequisite for constructing the pose graph data. The pedestrian tracking algorithm based on DeepSort performs matching by calculating a similarity of pedestrian bounding box feature information between two adjacent frames, and assigns an ID to each to-be-tracked target (i.e., assigns a number, in each frame, to each "person" whose trajectory is in the confirmed state). DeepSort mainly uses the Kalman filter and the Hungarian algorithm to ensure tracking accuracy. The Kalman filter algorithm predicts and updates the pedestrian trajectory, and the Hungarian algorithm performs optimized allocation for intersection over union (IOU) between the output results of the pedestrian detector and the tracking prediction results. In addition, cascade matching and a new trajectory confirmation mechanism are also introduced. Cascade matching refers to data association using a combination of pedestrian motion information and appearance information.
A matching degree between the motion information in the Kalman filter prediction result and the motion information in the pedestrian detector result is evaluated by using the Mahalanobis distance, which is calculated as follows:

$$d^{(1)}(i,j)=(d_j-y_i)^{T}S_i^{-1}(d_j-y_i)$$

where $d^{(1)}(i,j)$ represents the matching degree, $d_j$ represents the position of the j-th detection frame, $y_i$ represents the position of the target predicted by the i-th tracker, and $S_i$ represents the covariance matrix between the detection position and the average tracking position, which mainly takes the state uncertainty into account. The cosine distance metric is used for pedestrian appearance information matching, and if the Mahalanobis distance of a certain association is less than a specified threshold $t^{(1)}$, the association of the motion state is set as successful. The function used is as follows:

$$b_{i,j}^{(1)}=\mathbb{1}\left[d^{(1)}(i,j)\leq t^{(1)}\right]$$

where $b_{i,j}^{(1)}$ is used as a mark indicating whether the association is successful. New trajectory confirmation refers to dividing trajectories into a confirmed state and an unconfirmed state. A newly generated trajectory is in the unconfirmed state by default, and must be matched with the detection result of the pedestrian detector a certain number of times before it can be converted into the confirmed state. Confirmed-state trajectories must be mismatched with the detection result a certain number of times before they can be deleted.
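A small sketch of this Mahalanobis motion gate is shown below; the chi-square gating threshold value used as the default is an assumption, not a value taken from the present disclosure:

```python
import numpy as np

def mahalanobis_gate(d_j, y_i, S_i, threshold=9.4877):
    """Return (d1, ok): squared Mahalanobis distance and the association mark b^(1)."""
    diff = d_j - y_i
    d1 = float(diff.T @ np.linalg.inv(S_i) @ diff)   # (d_j - y_i)^T S_i^-1 (d_j - y_i)
    return d1, d1 <= threshold                       # gate on the specified threshold
```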
[0084] The pedestrians in a fixed time frame window are accurately tracked and assigned numbers, and key point coordinates of each individual pedestrian are extracted based on the tracking results. The idea of the RMPE algorithm is to detect each human body detection frame in an environment, and then independently detect the pose in each human body area. The output result contains the human body tracking number and the 3D information (x, y, c) of 18 key points of the human body, where (x, y) represents the coordinates, and c represents the confidence.
[0085] The pose estimation process is shown in FIG. 6. First, a spatial transformer network is responsible for receiving human body detection frames; it can convert inaccurate detection frames in the pedestrian detection results into high-quality human body detection frames, thereby making the human body detection frames more accurate. The pedestrian bounding box is cropped, and the key points of the human body pose are obtained by using the single person pose estimation (SPPE) algorithm. The spatial de-transformer network (SDTN) outputs candidate poses of the human body in the original image, and the redundant pose information of each pedestrian is filtered out by the pose non-maximum suppression method. The parallel SPPE branch is only used in the training phase, and its output is directly compared with the true value of the label of the human body pose, so as to back-propagate the error generated after pose positioning to the spatial transformer network, to help the spatial transformer network generate high-quality regional positions.
[0086] Following the above embodiment of the present disclosure, a process of performing action recognition on the spatio-temporal joint pose feature to obtain action recognition classification results by combining with change characteristics of a fall action includes the following sub-steps.
[0087] Global average pooling processing is performed on the spatio-temporal joint pose feature, and pooling results are input into a fully connected linear layer.
[0088] By combining with the change characteristics of the fall action, a category with a highest score corresponding to the spatio-temporal joint pose feature is output through a softmax classifier to obtain the classification results.
[0089] For example, the GCN has 384 output feature channels. Global average pooling is then performed on the output features over the spatio-temporal dimensions and the individual pedestrians. The pooling result is input to the fully connected linear layer (384 input channels, and the number of output channels corresponds to the number of categories), and finally the category with the highest score is output through the softmax classifier.
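A sketch of this classification head is shown below; the two-category (fall vs. non-fall) output and the feature layout are assumptions made for the example:

```python
import torch
import torch.nn as nn

num_classes = 2                       # fall vs. non-fall grouping assumed
fc = nn.Linear(384, num_classes)      # 384 input channels, one output per category

def classify(features, num_persons=1):
    """features: (N*M, 384, T, V) output of the last graph-convolution block."""
    nm, c, t, v = features.shape
    pooled = features.view(nm, c, t * v).mean(dim=2)                     # pool over time and joints
    pooled = pooled.view(nm // num_persons, num_persons, c).mean(dim=1)  # pool over persons
    scores = torch.softmax(fc(pooled), dim=1)
    return scores.argmax(dim=1)                                          # category with highest score
```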
[0090] According to another aspect of the present disclosure, a computer-readable medium is further provided, having computer-readable instructions stored thereon. When executed by a processor, the computer-readable instructions enable the processor to implement the method for recognizing pedestrian fall action as mentioned above.
[0091] According to another aspect of the present disclosure, a device for recognizing pedestrian fall action based on pose estimation is further provided, including: one or more processors and a computer-readable medium.
[0092] The one or more processors include a CPU and a GPU.
[0093] The computer-readable medium is configured to store one or more computer-readable instructions.
[0094] When executed by the one or more processors, the one or more computer-readable instructions enable the one or more processors to implement the method for recognizing pedestrian fall action as mentioned above.
[0095] Here, for the details of each embodiment of the device, reference may be made to the corresponding part of the above-mentioned embodiment of the method, which will not be repeated here.
[0096] In summary, the method includes: obtaining multiple image frames in an original video stream, performing pedestrian detection and tracking on each image frame to obtain a human body tracking number and key point information by pose estimation, and aggregating key point information of multiple image frames adjacent to each key point by using a multi-scale adjacency matrix method to obtain pose graph data; inputting the pose graph data into a GCN, introducing residual connections between multiple spatio-temporal merging graph convolutional modules in the GCN, and performing feature extraction through the multiple spatio-temporal merging graph convolutional modules sequentially to obtain a spatio-temporal joint pose feature; and performing action recognition on the spatio-temporal joint pose feature to obtain action recognition classification results, by combining with change characteristics of a fall action. That is, the method uses a multi-scale adjacency matrix for information aggregation, introduces a residual connection between upper and lower spatio-temporal joint modules with the same structure, extracts spatio-temporal joint features of a pose on dual streams (a key point stream and a bone edge stream), and finally merges the spatio-temporal joint features of the dual streams to make a judgment on a fall action, thereby reducing the impact of the background on the recognition effect, improving the accuracy of action recognition, and reducing the computation cost.

[0097] It should be noted that the present disclosure may be implemented in software and/or a combination of software and hardware. For example, the present disclosure may be implemented in an application specific integrated circuit (ASIC), a general purpose computer, or any other similar hardware device. In one embodiment, the software program of the present disclosure may be executed by a processor to implement the steps or functions described above. Likewise, the software program of the present disclosure (including associated data structures) may be stored on a computer-readable recording medium, such as a RAM memory, a magnetic or optical drive or a floppy disk. In addition, some steps or functions of the present disclosure may be implemented in hardware. For example, various steps or functions may be implemented by a circuit that cooperates with a processor.
[0098] In addition, a part of the present disclosure can be applied as a computer program product, such as a computer program instruction, which when executed by a computer, through the operation of the computer, can invoke or provide the methods and/or technical solutions according to the present disclosure. The program instructions for invoking the methods of the present disclosure may be stored in a fixed or removable recording medium, and/or transmitted via a data stream in a broadcast or other signal-bearing medium, and/or stored in the working memory of a computer device operating in accordance with the program instructions. Here, according to one embodiment of the present disclosure, an apparatus is provided, including a memory for storing computer program instructions and a processor for executing the program instructions. When the computer program instructions are executed by the processor, the apparatus is triggered to run the methods and/or technical solutions according to the foregoing embodiments of the present disclosure.
[0099] For those skilled in the art, it is obvious that the present disclosure is not limited to the details of the above embodiments, and the present disclosure can be implemented in other specific forms without departing from the spirit or basic features of the present disclosure. Therefore, the embodiments should be regarded as exemplary and non-limiting in every respect, and the scope of the present disclosure is defined by the appended claims rather than the above description. Therefore, all changes falling within the meaning and scope of equivalent elements of the claims should be included in the present disclosure. The reference numerals in the claims should not be considered as limiting the involved claims. In addition, it is apparent that the word "including" does not exclude other units or steps, and a singular number does not exclude a plural number. A plurality of units or apparatuses stated in the apparatus claims may also be implemented by a same apparatus or system through software or hardware. The terms such as "first" and "second" are used to denote names and do not indicate any particular order.

Claims (9)

  1. WHAT IS CLAIMED IS: 1. A method for recognizing pedestrian fall action based on pose estimation, comprising: obtaining a plurality of image frames in an original video stream, performing pedestrian detection and tracking on each image frame to obtain a human body tracking number and key point information by pose estimation, and aggregating key point information of a plurality of image frames adjacent to each key point by using a multi-scale adjacency matrix method to obtain pose graph data; inputting the pose graph data into a graph convolutional network (GCN), introducing residual connections between a plurality of spatio-temporal merging graph convolutional modules in the GCN, and performing feature extraction through the plurality of spatio-temporal merging graph convolutional modules sequentially to obtain a spatio-temporal joint pose feature; and performing, in combination with a change characteristic of a fall action, action recognition on the spatio-temporal joint pose feature to obtain action recognition classification results.
  2. 2. The method according to claim 1, wherein the GCN comprises a first spatio-temporal merging graph convolutional module, a second spatio-temporal merging graph convolutional module, and a third spatio-temporal merging graph convolutional module; and each of the spatio-temporal merging graph convolutional modules comprises a multi-window multi-scale three-dimensional (3D) graph convolutional layer and a serialization component layer, and the serialization component layer comprises a multi-scale graph convolution and two consecutive multi-scale temporal convolutions.
  3. 3. The method according to claim 2, wherein the inputting the pose graph data into a GCN, introducing residual connections between a plurality of spatio-temporal merging graph convolutional modules in the GCN, and performing feature extraction through the plurality of spatio-temporal merging graph convolutional modules sequentially to obtain a spatio-temporal joint pose feature, comprises: inputting the pose graph data into the GCN, and performing normalization processing on the pose graph data to adjust an array shape of the pose graph data; inputting the adjusted pose graph data into the first spatio-temporal merging graph convolutional module for feature extraction, to obtain a first spatio-temporal pose feature; inputting the first spatio-temporal pose feature into the second spatio-temporal merging graph convolutional module for feature extraction, to obtain a second spatio-temporal pose feature; and inputting the first spatio-temporal pose feature after a residual connection with the second spatio-temporal pose feature, into the third spatio-temporal merging graph convolutional module for feature extraction to obtain the spatio-temporal joint pose feature.
  4. 4. The method according to claim 3, wherein the inputting the adjusted pose graph data into the first spatio-temporal merging graph convolutional module for feature extraction to obtain a first spatio-temporal pose feature, comprises: inputting the adjusted pose graph data into the multi-window multi-scale 3D graph convolutional layer and the serialization component layer in the first spatio-temporal merging graph convolutional module; performing feature extraction on the adjusted pose graph data through the multi-scale graph convolution and the two consecutive multi-scale temporal convolutions in the serialization component layer sequentially; performing feature extraction on the adjusted pose graph data through the multi-window multi-scale 3D graph convolutional layer; summing features output from the multi-window multi-scale 3D graph convolutional layer and the serialization component layer to obtain a sum, inputting the sum into an activation function to perform multi-scale temporal convolution feature extraction again, to obtain the first spatio-temporal pose feature; and in the second spatio-temporal merging graph convolutional module and the third spatio-temporal merging graph convolutional module, obtaining the second spatio-temporal pose feature and the spatio-temporal joint pose feature by using a same method as in feature extraction of the first spatio-temporal merging graph convolutional module.
  5. The method according to claim 4, wherein the pose graph data comprises a human body key point set and a bone edge set, and a k-adjacency matrix of the pose graph data is expressed as follows:
$$[A_{(k)}]_{i,j} = \begin{cases} 1, & \text{if } d(v_i, v_j) = k \\ 1, & \text{if } i = j \\ 0, & \text{otherwise} \end{cases}$$
wherein k represents different neighbor orders of key points, (i, j) denotes the i-th and the j-th key points, and d(v_i, v_j) represents the distance between the i-th and the j-th key points.
  6. The method according to any one of claims 1 to 5, wherein the obtaining a plurality of image frames in an original video stream, and performing pedestrian detection and tracking on each image frame to obtain a human body tracking number and key point information by pose estimation comprises:
obtaining the plurality of image frames in the original video stream, and determining a to-be-tracked target;
performing matching through a DeepSort-based pedestrian tracking algorithm by calculating a similarity of pedestrian bounding box feature information of the to-be-tracked target between two adjacent image frames, and assigning an identification (ID) to each to-be-tracked target to obtain a tracking result; and
based on the tracking result, extracting key point coordinates of each to-be-tracked target by using a regional multi-person pose estimation (RMPE) algorithm, and outputting the key point information and the human body tracking number.
  7. The method according to claim 6, wherein the performing, in combination with a change characteristic of a fall action, action recognition on the spatio-temporal joint pose feature to obtain action recognition classification results comprises:
performing global average pooling processing on the spatio-temporal joint pose feature to obtain pooling results, and inputting the pooling results into a fully connected linear layer; and
outputting, in combination with the change characteristic of the fall action, a category with a highest score corresponding to the spatio-temporal joint pose feature through a classifier to obtain the classification results.
  8. A computer-readable medium, having computer-readable instructions stored thereon, wherein when executed by a processor, the computer-readable instructions enable the processor to implement the method according to any one of claims 1 to 7.
  9. A device for recognizing pedestrian fall action based on pose estimation, comprising:
one or more processors; and
a computer-readable medium configured to store one or more computer-readable instructions, wherein when executed by the one or more processors, the one or more computer-readable instructions enable the one or more processors to implement the method according to any one of claims 1 to 7.
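
Editor's illustrative sketches (not part of the claims). Claims 1 and 6 describe the data-preparation stage: pedestrian detection, DeepSort-based ID assignment, and RMPE key point extraction, assembled into per-track pose sequences. The minimal Python sketch below only shows the structure of that stage; the callables detect_fn, track_fn and pose_fn are hypothetical stand-ins for a detector, a DeepSort-style tracker and an RMPE-style pose estimator, since the patent text does not fix any concrete API.

```python
import numpy as np

def build_pose_sequences(frames, detect_fn, track_fn, pose_fn):
    """Turn raw video frames into per-track pose arrays of shape (C=3, T, V).

    Hypothetical callables (not APIs named in the patent):
      detect_fn(frame)        -> list of pedestrian bounding boxes
      track_fn(frame, boxes)  -> list of (track_id, box) pairs (DeepSort-style ID assignment)
      pose_fn(frame, box)     -> (V, 3) array of key points (x, y, confidence), RMPE-style
    """
    sequences = {}                                    # track_id -> list of (V, 3) arrays
    for frame in frames:
        boxes = detect_fn(frame)                      # pedestrian detection per image frame
        for track_id, box in track_fn(frame, boxes):  # similarity matching across adjacent frames
            keypoints = pose_fn(frame, box)           # key point coordinates for this target
            sequences.setdefault(track_id, []).append(keypoints)
    # Stack each track into (C, T, V) node data, the usual input layout for a skeleton GCN
    return {tid: np.stack(kps).transpose(2, 0, 1)     # (T, V, 3) -> (3, T, V)
            for tid, kps in sequences.items()}
```
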
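Claim 5 defines the k-adjacency matrix of the pose graph from the bone edge set. A minimal NumPy sketch of that definition follows; the 5-joint chain skeleton at the end is an illustrative assumption, not the key point layout used in the patent.

```python
import numpy as np

def hop_distance(A):
    """Shortest hop distance d(v_i, v_j) between every pair of key points,
    computed from the 1-hop adjacency matrix A via powers of A."""
    n = A.shape[0]
    d = np.full((n, n), np.inf)
    np.fill_diagonal(d, 0.0)
    power = np.eye(n)
    for hop in range(1, n):
        power = power @ A
        newly_reached = (power > 0) & np.isinf(d)
        d[newly_reached] = hop
    return d

def k_adjacency(A, k):
    """k-adjacency matrix of claim 5: [A_(k)]_{i,j} is 1 if d(v_i, v_j) = k,
    1 if i = j, and 0 otherwise."""
    d = hop_distance(A)
    Ak = (d == k).astype(np.float32)
    np.fill_diagonal(Ak, 1.0)           # the "1 if i = j" self-loop term
    return Ak

# Illustrative 5-joint chain skeleton (head-neck-hip-knee-ankle); the real
# bone edge set would come from the pose estimator's key point layout.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
A1 = np.zeros((5, 5), dtype=np.float32)
for i, j in edges:
    A1[i, j] = A1[j, i] = 1.0
A2 = k_adjacency(A1, k=2)               # second-order neighbors plus self-loops
```

Stacking A_(1), A_(2), ... for several k gives the multi-scale adjacency used to aggregate neighboring key point information in claim 1.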
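Claims 2 to 4 describe the GCN (three spatio-temporal merging graph convolutional modules with a residual connection), and claim 7 the pooling and classification head. The patent publishes no reference code; the PyTorch sketch below is only one interpretation under simplifying assumptions: the multi-window multi-scale 3D graph convolution and the multi-scale temporal convolutions are replaced by single-scale stand-ins (GraphConv, TemporalConv), and the adjacency matrix, channel width of 64 and two-class output are illustrative choices.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Plain spatial graph convolution, a stand-in for the patent's multi-scale graph convolution."""
    def __init__(self, in_ch, out_ch, A):
        super().__init__()
        self.register_buffer("A", A)                  # (V, V) adjacency matrix
        self.theta = nn.Conv2d(in_ch, out_ch, kernel_size=1)
    def forward(self, x):                             # x: (N, C, T, V)
        x = torch.einsum("nctv,vw->nctw", x, self.A)  # aggregate neighboring joints
        return self.theta(x)                          # mix channels

class TemporalConv(nn.Module):
    """Stand-in for a multi-scale temporal convolution (single window size)."""
    def __init__(self, channels, kernel_t=9):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=(kernel_t, 1),
                              padding=(kernel_t // 2, 0))
    def forward(self, x):
        return self.conv(x)

class STMergingBlock(nn.Module):
    """Sketch of one spatio-temporal merging graph convolutional module (claims 2 and 4):
    a 3D-graph-convolution-like branch and a serialization branch (graph conv followed by
    two temporal convs) are summed, activated, then passed through a final temporal conv."""
    def __init__(self, in_ch, out_ch, A):
        super().__init__()
        self.branch_3d = GraphConv(in_ch, out_ch, A)  # placeholder for the MW-MS 3D graph conv layer
        self.serial = nn.Sequential(GraphConv(in_ch, out_ch, A),
                                    TemporalConv(out_ch),
                                    TemporalConv(out_ch))
        self.act = nn.ReLU(inplace=True)
        self.tail = TemporalConv(out_ch)
    def forward(self, x):
        y = self.branch_3d(x) + self.serial(x)        # sum the two branch outputs
        return self.tail(self.act(y))                 # activation, then temporal conv again

class FallGCN(nn.Module):
    """Three blocks with the residual connection of claim 3 (the third block receives f1 + f2),
    followed by the pooling and fully connected classification head of claim 7."""
    def __init__(self, A, in_ch=3, ch=64, num_classes=2):
        super().__init__()
        self.bn = nn.BatchNorm1d(in_ch * A.shape[0])  # normalization / array-shape adjustment of claim 3
        self.block1 = STMergingBlock(in_ch, ch, A)
        self.block2 = STMergingBlock(ch, ch, A)
        self.block3 = STMergingBlock(ch, ch, A)
        self.fc = nn.Linear(ch, num_classes)
    def forward(self, x):                             # x: (N, C, T, V)
        n, c, t, v = x.shape
        x = self.bn(x.permute(0, 1, 3, 2).reshape(n, c * v, t))
        x = x.reshape(n, c, v, t).permute(0, 1, 3, 2)
        f1 = self.block1(x)
        f2 = self.block2(f1)
        f3 = self.block3(f1 + f2)                     # residual connection between modules
        pooled = f3.mean(dim=(2, 3))                  # global average pooling over time and joints
        return self.fc(pooled)                        # class scores; argmax gives the highest-score category

# Example usage with random data: V=17 joints, T=30 frames, identity adjacency as a stand-in
V = 17
model = FallGCN(torch.eye(V))
scores = model(torch.randn(2, 3, 30, V))              # (batch, channels, frames, joints) -> (batch, num_classes)
```
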
GB2302960.6A 2021-11-15 2022-09-28 Pose estimation-based pedestrian fall action recognition method and device Pending GB2616733A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111345550.3A CN113963445A (en) 2021-11-15 2021-11-15 Pedestrian falling action recognition method and device based on attitude estimation
PCT/CN2022/121935 WO2023082882A1 (en) 2021-11-15 2022-09-28 Pose estimation-based pedestrian fall action recognition method and device

Publications (3)

Publication Number Publication Date
GB202302960D0 GB202302960D0 (en) 2023-04-12
GB2616733A true GB2616733A (en) 2023-09-20
GB2616733A8 GB2616733A8 (en) 2023-10-11

Family

ID=87760146

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2302960.6A Pending GB2616733A (en) 2021-11-15 2022-09-28 Pose estimation-based pedestrian fall action recognition method and device

Country Status (1)

Country Link
GB (1) GB2616733A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210082144A1 (en) * 2019-09-12 2021-03-18 Nec Laboratories America, Inc Keypoint based pose-tracking using entailment
CN110738154A (en) * 2019-10-08 2020-01-31 Nanjing Panda Electronics Co., Ltd. Pedestrian falling detection method based on human body posture estimation
WO2021114892A1 (en) * 2020-05-29 2021-06-17 Ping An Technology (Shenzhen) Co., Ltd. Environmental semantic understanding-based body movement recognition method, apparatus, device, and storage medium
CN112966628A (en) * 2021-03-17 2021-06-15 Guangdong University of Technology Visual angle self-adaptive multi-target tumble detection method based on graph convolution neural network
CN113963445A (en) * 2021-11-15 2022-01-21 Henan University of Technology Pedestrian falling action recognition method and device based on attitude estimation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GE, Wei, "Fall Detection Based on Human Pose Recognition", Chinese Master's Theses Full-text Database, Information Science and Technology, No. 9, 15 September 2021 (2021-09-15), ISSN: 1674-0246, pages 28-50 *

Also Published As

Publication number Publication date
GB202302960D0 (en) 2023-04-12
GB2616733A8 (en) 2023-10-11

Similar Documents

Publication Publication Date Title
WO2023082882A1 (en) Pose estimation-based pedestrian fall action recognition method and device
Liu et al. Exploiting unlabeled data in CNNs by self-supervised learning to rank
CN111627045B (en) Multi-pedestrian online tracking method, device and equipment under single lens and storage medium
WO2019232894A1 (en) Complex scene-based human body key point detection system and method
WO2020042419A1 (en) Gait-based identity recognition method and apparatus, and electronic device
CN112639873A (en) Multi-object pose tracking device and method based on single-object pose estimator
WO2016107482A1 (en) Method and device for determining identity identifier of human face in human face image, and terminal
Wu et al. Online empirical evaluation of tracking algorithms
CN108960047B (en) Face duplication removing method in video monitoring based on depth secondary tree
WO2016166508A1 (en) Event detection and summarisation
Hou et al. Human tracking over camera networks: a review
CN114241548A (en) Small target detection algorithm based on improved YOLOv5
CN106971401A (en) Multiple target tracking apparatus and method
CN112884742A (en) Multi-algorithm fusion-based multi-target real-time detection, identification and tracking method
WO2021249114A1 (en) Target tracking method and target tracking device
Tang et al. Multiple-kernel adaptive segmentation and tracking (MAST) for robust object tracking
CN111860587A (en) Method for detecting small target of picture
WO2019167784A1 (en) Position specifying device, position specifying method, and computer program
CN114842553A (en) Behavior detection method based on residual shrinkage structure and non-local attention
CN116363694A (en) Multi-target tracking method of unmanned system crossing cameras matched with multiple pieces of information
Dai et al. Oamatcher: An overlapping areas-based network for accurate local feature matching
CN113011399B (en) Video abnormal event detection method and system based on generation cooperative discrimination network
Babu et al. Subject independent human action recognition using spatio-depth information and meta-cognitive RBF network
Liang et al. Multiple object tracking by reliable tracklets
Firouznia et al. Adaptive chaotic sampling particle filter to handle occlusion and fast motion in visual object tracking