CN112801060A - Motion action recognition method and device, model, electronic equipment and storage medium - Google Patents

Motion action recognition method and device, model, electronic equipment and storage medium

Info

Publication number
CN112801060A
CN112801060A (application CN202110371059.1A)
Authority
CN
China
Prior art keywords
building block
space
sequence
layer
time graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110371059.1A
Other languages
Chinese (zh)
Inventor
蔡建平
何喆
林型双
顾鹏坤
张帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University City College ZUCC
Original Assignee
Zhejiang University City College ZUCC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University City College ZUCC filed Critical Zhejiang University City College ZUCC
Priority to CN202110371059.1A priority Critical patent/CN112801060A/en
Publication of CN112801060A publication Critical patent/CN112801060A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The application discloses a motion action recognition method and device, a model, an electronic device and a storage medium, the method comprising the following steps: acquiring a skeleton sequence of a motion action captured by a pose estimation device; and inputting the skeleton sequence into a trained non-local space-time graph convolution model to obtain a motion action recognition result. The non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.

Description

Motion action recognition method and device, model, electronic equipment and storage medium
Technical Field
The present application relates to the technical field of deep neural networks, and in particular to a motion action recognition method and device, a model, an electronic device and a storage medium.
Background
Intelligent sports equipment needs the ability to identify types of human actions in order to judge a user's fitness movements (such as squats, push-ups, sit-ups and the like), and changes in the human joint sequence are very important for identifying the type of human action. Traditional methods for modeling joint-sequence variation often rely on hand-crafted features, which results in limited expressive power and difficulty in generalization. To overcome these limitations, a new approach is needed that can automatically capture the spatial and temporal patterns of change in joint sequences. Recently, graph convolutional neural networks (GCNs), which generalize convolutional neural networks (CNNs) to graphs of arbitrary structure, have received increasing attention and have been successfully adopted in many applications, such as image classification, document classification and semi-supervised learning.
The space-time graph convolution model was the first to apply graph convolution to the human action classification task. Although the space-time graph convolution model can model changes in the human skeleton sequence well, the locality of the convolution operation prevents it from representing long-range spatio-temporal dependencies well, and such dependencies are very important for recognizing some motion actions.
Disclosure of Invention
The embodiments of the application aim to provide a motion action recognition method and device, a model, an electronic device and a storage medium, so as to solve the problem that the space-time graph convolution model cannot model long-range spatio-temporal dependencies.
According to a first aspect of the embodiments of the present application, there is provided a motion action recognition method, including: acquiring a skeleton sequence of a motion action captured by a pose estimation device; and inputting the skeleton sequence into a trained non-local space-time graph convolution model to obtain a motion action recognition result. The non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
According to a second aspect of the embodiments of the present application, there is provided a motion action recognition apparatus, including: an acquisition module for acquiring a skeleton sequence of a motion action captured by a pose estimation device; and a recognition module for inputting the skeleton sequence into a trained non-local space-time graph convolution model to obtain a motion action recognition result. The non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
According to a third aspect of the embodiments of the present application, there is provided a non-local space-time graph convolution model. The non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
According to a fourth aspect of the embodiments of the present application, there is provided an electronic device, including: one or more processors; and a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described in the first aspect.
According to a fifth aspect of the embodiments of the present application, there is provided a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method according to the first aspect.
The technical solutions provided by the embodiments of the present application may have the following beneficial effects:
According to the above embodiments, a pose estimation device is used to obtain a skeleton sequence of the motion action, and the obtained skeleton sequence is input into a trained non-local space-time graph convolution model to obtain the motion action recognition result. Changes in the human skeleton sequence are crucial for recognizing the type of human action. Although the space-time graph convolution model can model such changes well, the locality of the convolution operation prevents it from representing long-range spatio-temporal dependencies, which are crucial for recognizing some motion actions. Non-local operations enhance the ability of the space-time graph convolution model to model the relationships between the human joint points within a frame, i.e., its spatial modeling ability. Through the skip connections, sequence information propagates better through the model, which enhances its temporal modeling ability. The combination of non-local operations, skip connections and space-time graph convolution therefore gives the model better spatio-temporal modeling capability.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flowchart illustrating a motion action recognition method according to an exemplary embodiment.
FIG. 2 is a space-time graph of a skeleton sequence used by the space-time graph convolution, shown according to an exemplary embodiment; the points in FIG. 2 represent the joints of the human body, the edges between body joints are defined according to the natural connections of the human body, the inter-frame edges connect the same joints between successive frames, and the joint coordinates serve as the input to the space-time graph convolution.
FIG. 3 is a diagram illustrating a distance partitioning strategy, according to an example embodiment.
FIG. 4 is a diagram of a non-local space-time graph convolution model architecture in accordance with an exemplary embodiment.
FIG. 5 is a non-local layer structure diagram shown in accordance with an example embodiment.
Fig. 6 is a block diagram illustrating a motion action recognition apparatus according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatuses and methods consistent with certain aspects of the present application, as recited in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
Fig. 1 is a flowchart illustrating a motion action recognition method according to an exemplary embodiment. Referring to fig. 1, the motion action recognition method according to an embodiment of the present invention may include:
step S11, acquiring a skeleton sequence of the motion action captured by the pose estimation device;
step S12, inputting the skeleton sequence into a trained non-local space-time graph convolution model to obtain a motion action recognition result;
The non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
According to this embodiment, changes in the human skeleton sequence are crucial for identifying the type of human action. The space-time graph convolution model can model such changes well, but due to the locality of the convolution operation it cannot represent long-range spatio-temporal dependencies well, and these dependencies are crucial for recognizing some motion actions.
In the specific implementation of step S11, a skeleton sequence of the motion action captured by the pose estimation device is acquired;
specifically, the attitude estimation device of the present embodiment adopts an Azure Kinect DK depth camera, which is certainly not limited thereto; and capturing a motion skeleton sequence in the motion action video through the depth camera.
In one possible implementation, the motion action video captured by the depth camera is a video composed of successive image frames in which a person performs some motion, such as push-ups, squats, pull-ups, and the like.
In the specific implementation of step S12, the skeleton sequence is input into the trained non-local space-time graph convolution model to obtain the motion action recognition result;
Specifically, FIG. 4 is a diagram of the non-local space-time graph convolution model architecture, shown according to an exemplary embodiment. Referring to fig. 4, the non-local space-time graph convolution model is formed by sequentially stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer; the building block group comprises building block one B1, building block two B2, building block three B3, building block four B4 and building block five B5, which are connected in sequence; an additional skip connection is further arranged between building block one and building block five, and another between building block two and building block four; and each building block is composed of two space-time graph convolution models and one non-local layer.
The implementation steps of the space-time graph convolution model comprise:
(1) Constructing a space-time graph of the joints on the motion action skeleton sequence; referring to fig. 2, the motion action skeleton sequence comprises a plurality of frames, and each frame comprises a human skeleton graph;
Specifically, the skeleton sequence is typically represented by the 2D or 3D coordinates of each human joint in each frame. In practical applications, the Azure Kinect DK is mainly used to collect the joint point data. In the space-time graph convolution model, the joint sequence is represented hierarchically by a space-time graph.
Given a joint sequence with $N$ joint points and $T$ frames, an undirected space-time graph $G = (V, E)$ is constructed on it. In this graph, the node set $V = \{ v_{ti} \mid t = 1, \dots, T;\; i = 1, \dots, N \}$ (where $v_{ti}$ denotes the $i$-th node on the $t$-th frame) comprises all the nodes, and the coordinate vector of each node $v_{ti}$ in the node sequence is input into the space-time graph convolution model as its feature vector. The edge set $E$ comprises two subsets. The first subset $E_S = \{ v_{ti} v_{tj} \mid (i, j) \in H \}$ (where $v_{tj}$ denotes the $j$-th joint point on the $t$-th frame, which forms a natural connecting edge with $v_{ti}$), in which $H$ is the set of natural connecting edges between human joint points, describes the connections between joint points within the same frame. The second subset $E_F = \{ v_{ti} v_{(t+1)i} \}$ comprises the inter-frame edges connecting the same joint points between successive frames. Therefore, for a particular joint point $i$, all the edges in $E_F$ represent its trajectory over time.
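As a concrete illustration (our own, not part of the original disclosure), the sketch below builds the intra-frame adjacency matrix for the edge subset $E_S$ from a list of natural bone connections. The 25-joint count and the partial `bone_pairs` list are assumptions for illustration; the exact joint layout depends on the pose estimation device.

```python
import torch

# Hypothetical example: a 25-joint skeleton whose natural bone connections
# are listed as (i, j) index pairs; the pair list here is partial and
# illustrative only.
N = 25
bone_pairs = [(0, 1), (1, 2), (2, 3), (2, 4), (4, 5), (5, 6)]

def build_spatial_adjacency(num_joints, pairs):
    """Symmetric adjacency matrix A for the intra-frame edge subset E_S."""
    A = torch.zeros(num_joints, num_joints)
    for i, j in pairs:
        A[i, j] = A[j, i] = 1.0  # the skeleton graph is undirected
    return A

A = build_spatial_adjacency(N, bone_pairs)
```

The inter-frame subset $E_F$ needs no explicit matrix: since every joint connects only to itself in adjacent frames, those edges are realized implicitly by an ordinary convolution along the frame axis, as formalized in steps (6) and (7) below.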
(2) Defining a distance-based sampling function on the spatial graph of each frame of the motion action skeleton sequence;
Specifically, on a single frame at time $t$ there are $N$ joint points $V_t$ and the skeleton edges $E_S(t) = \{ v_{ti} v_{tj} \mid (i, j) \in H \}$. In conventional convolution, when the input is a 2D grid, the output feature map of the convolution operation is also a 2D grid; using a stride of one and appropriate padding, the output feature map can have the same size as the input feature map, and we assume this case in the following description. Consider a conventional convolution with kernel size $K \times K$ applied to an input feature map $f_{in}$ with $c$ channels. Its output value at spatial position $\mathbf{x}$ is:

$$f_{out}(\mathbf{x}) = \sum_{h=1}^{K} \sum_{w=1}^{K} f_{in}\big(\mathbf{p}(\mathbf{x}, h, w)\big) \cdot \mathbf{w}(h, w)$$

where the sampling function $\mathbf{p}(\mathbf{x}, h, w)$ traverses the neighbors of position $\mathbf{x}$, and the weight function $\mathbf{w}(h, w)$ provides a weight vector in $c$-dimensional real space for computing the inner product with the $c$-dimensional input feature vector. The convolution operation on the graph is then defined by extending the above formula to the case where the input feature map lies on a spatial graph.

On the image, the sampling function $\mathbf{p}$ is defined on the pixels adjacent to the center position $\mathbf{x}$. On the graph, the sampling function can similarly be defined on the neighbor set $B(v_{ti}) = \{ v_{tj} \mid d(v_{tj}, v_{ti}) \le D \}$ of a node $v_{ti}$. Here $d(v_{tj}, v_{ti})$ denotes the minimum length of any path from $v_{tj}$ to $v_{ti}$, and $D$ indicates the selectable path length. Thus, the sampling function $\mathbf{p} : B(v_{ti}) \to V$ can be written as

$$\mathbf{p}(v_{ti}, v_{tj}) = v_{tj}$$
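For illustration only, the neighbor set $B(v_{ti})$ used by this sampling function can be enumerated from the adjacency matrix built above; the helper name and the power-iteration approach are our own choices, not part of the original disclosure.

```python
import torch

def neighbor_sets(A, D=1):
    """B(v_i) = { v_j : d(v_j, v_i) <= D }, computed from the adjacency matrix A."""
    num_joints = A.size(0)
    reach = torch.eye(num_joints, dtype=torch.bool)  # distance 0: each node itself
    frontier = torch.eye(num_joints)
    for _ in range(D):
        frontier = (frontier @ A).clamp(max=1)  # nodes one more hop away
        reach |= frontier.bool()
    return [torch.nonzero(reach[i]).flatten().tolist() for i in range(num_joints)]

# With D = 1 (the setting adopted by the distance partitioning strategy in
# step (3) below), B(v_i) is v_i itself plus its directly connected joints.
B = neighbor_sets(A, D=1)
```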
(3) Defining a mapping function from nodes to labels on the spatial graph, implemented by a distance partitioning strategy;
Specifically, a distance partitioning strategy is employed to implement the label map $l_{ti} : B(v_{ti}) \to \{0, \dots, K-1\}$. The specific strategy is described below, in conjunction with FIG. 3.

The distance partitioning strategy divides the neighbor set according to the distance $d(v_{tj}, v_{ti})$ from a node $v_{tj}$ to the root node $v_{ti}$, where $v_{tj}$ denotes the other joint points in the same frame. In the space-time graph convolution model, $D = 1$ is set, and the neighbor set is divided into two subsets: $d = 0$ represents the root node itself, and $d = 1$ represents the remaining adjacent nodes. The space-time graph convolution model will thus have two different weight vectors, which can model local differential properties. Formally,

$$l_{ti}(v_{tj}) = \begin{cases} 0, & \text{if } d(v_{tj}, v_{ti}) = 0 \\ 1, & \text{if } d(v_{tj}, v_{ti}) = 1 \end{cases}$$
(4) Defining a weight function based on the mapping function;
Specifically, the neighbor set $B(v_{ti})$ of the joint point $v_{ti}$ is divided by the distance partitioning strategy into two fixed subsets, each with a numeric label. We therefore have a map $l_{ti} : B(v_{ti}) \to \{0, 1\}$ that maps each neighboring node to the label of its corresponding subset. The weight function $\mathbf{w}(v_{ti}, v_{tj})$ can then be implemented by indexing a tensor of dimension $(K, c)$ (here $K = 2$):

$$\mathbf{w}(v_{ti}, v_{tj}) = \mathbf{w}'\big(l_{ti}(v_{tj})\big)$$
(5) Generalizing the conventional convolution to the spatial graph convolution, based on the sampling function and the weight function;
Specifically, the conventional convolution is now rewritten in the form of a graph convolution:

$$f_{out}(v_{ti}) = \sum_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})} f_{in}\big(\mathbf{p}(v_{ti}, v_{tj})\big) \cdot \mathbf{w}(v_{ti}, v_{tj})$$

The normalization term $Z_{ti}(v_{tj}) = \big|\{ v_{tk} \mid l_{ti}(v_{tk}) = l_{ti}(v_{tj}) \}\big|$ equals the cardinality of the corresponding subset; this term is added to balance the contributions of the different subsets to the output. Substituting the sampling function and the weight function, we obtain

$$f_{out}(v_{ti}) = \sum_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})} f_{in}(v_{tj}) \cdot \mathbf{w}'\big(l_{ti}(v_{tj})\big)$$
(6) Extending the sampling function and the mapping function to the time dimension, thereby generalizing the spatial graph convolution operation to the space-time domain;
Specifically, with the spatial graph convolution formulated, we now turn to the task of modeling the dynamics within the joint sequence over time. The notion of neighborhood is extended to also include temporally connected joints:

$$B(v_{ti}) = \big\{ v_{qj} \;\big|\; d(v_{tj}, v_{ti}) \le D,\; |q - t| \le \lfloor \Gamma / 2 \rfloor \big\}$$

The parameter $\Gamma$ controls the temporal extent of the neighborhood graph and can therefore be called the temporal convolution kernel size; $q$ denotes the $q$-th frame. To complete the convolution on the space-time graph, the space-time graph convolution also needs a sampling function; the sampling function and the weight function are the same as in the spatial-graph case, except that the label map $l_{ti}$ differs. Because the time axis is regular, the space-time graph convolution directly changes the spatio-temporal neighborhood label map rooted at $v_{ti}$ into

$$l_{ST}(v_{qj}) = l_{ti}(v_{tj}) + \big(q - t + \lfloor \Gamma / 2 \rfloor\big) \times K$$

where $K$ is the number of subsets produced by the partitioning strategy. In this way, the space-time graph convolution model defines a well-defined convolution operation on the constructed space-time graph.
(7) Performing the spatial graph convolution on the spatial graph and the temporal convolution on the time dimension, respectively, to realize the space-time graph convolution model.
Specifically, implementing graph-based convolution is not as straightforward as 2D or 3D convolution. Here we provide detailed implementation information of the space-time graph convolution for skeleton-based motion recognition.
The connections between the human joint points within a frame are represented by an adjacency matrix $A$, and the identity matrix $I$ represents the self-connections.

In the single-frame case, for the distance partitioning strategy, the adjacency matrix $A + I$ is split into several matrices $A_j$ such that $A + I = \sum_j A_j$, with $A_0 = I$ and $A_1 = A$. The spatial graph convolution can therefore be implemented by

$$f_{out} = \Lambda^{-\frac{1}{2}} (A + I) \Lambda^{-\frac{1}{2}} f_{in} W$$

and, in a similar manner with the partitioned subsets,

$$f_{out} = \sum_j \Lambda_j^{-\frac{1}{2}} A_j \Lambda_j^{-\frac{1}{2}} f_{in} W_j$$

where $\Lambda_j^{ii} = \sum_k A_j^{ik} + \alpha$ ($\Lambda_j^{ii}$ denotes the element in the $i$-th row and $i$-th column of $\Lambda_j$, and $A_j^{ik}$ denotes the element in the $i$-th row and $k$-th column of $A_j$), and $\Lambda_j$ is the degree matrix of $A_j$. We set $\alpha = 0.001$ to avoid all-zero rows in $A_j$.

In fact, in the spatio-temporal case, the input feature map can be represented as a tensor of dimension $(C, T, N)$. We implement the space-time graph convolution by performing the spatial graph convolution in the third dimension of the tensor, i.e., the joint dimension, and the temporal convolution in the second dimension, i.e., the frame dimension.
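A minimal PyTorch sketch of this adjacency-based implementation follows. It applies the formula above with $A_0 = I$ and $A_1 = A$ under the distance partitioning strategy, then appends a temporal convolution on the frame axis; the temporal kernel size of 9, the $(batch, C, T, N)$ tensor layout and all names are our assumptions rather than details fixed by the description.

```python
import torch
import torch.nn as nn

def normalize_subset(A_j, alpha=0.001):
    """Lambda_j^{-1/2} A_j Lambda_j^{-1/2}, with Lambda_j^{ii} = sum_k A_j^{ik} + alpha."""
    d_inv_sqrt = (A_j.sum(dim=1) + alpha).pow(-0.5)
    return torch.diag(d_inv_sqrt) @ A_j @ torch.diag(d_inv_sqrt)

class SpatialGraphConv(nn.Module):
    """f_out = sum_j Lambda_j^{-1/2} A_j Lambda_j^{-1/2} f_in W_j (distance partitioning, K = 2)."""
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        subsets = torch.stack([normalize_subset(torch.eye(A.size(0))),  # A_0 = I (root)
                               normalize_subset(A)])                    # A_1 = A (neighbors)
        self.register_buffer("A_hat", subsets)                          # (K, N, N)
        # the W_j, realized as one 1x1 convolution producing K groups of channels
        self.conv = nn.Conv2d(in_channels, out_channels * 2, kernel_size=1)

    def forward(self, x):                          # x: (batch, C, T, N)
        y = self.conv(x)
        b, kc, t, n = y.shape
        y = y.view(b, 2, kc // 2, t, n)
        # contract the joint dimension with each normalized subset A_j, then sum over j
        return torch.einsum("bkctn,knm->bctm", y, self.A_hat)

class STGCNUnit(nn.Module):
    """Spatial graph convolution on the joint axis, then temporal convolution on the frame axis."""
    def __init__(self, in_channels, out_channels, A, temporal_kernel=9):
        super().__init__()
        self.gcn = SpatialGraphConv(in_channels, out_channels, A)
        pad = (temporal_kernel - 1) // 2
        self.tcn = nn.Conv2d(out_channels, out_channels,
                             kernel_size=(temporal_kernel, 1), padding=(pad, 0))
        # ResNet-style residual on the unit, as stated later in the description
        self.residual = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, kernel_size=1))
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.tcn(self.gcn(x)) + self.residual(x))
```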
FIG. 5 is a diagram illustrating the structure of the non-local layer. The non-local layer comprises $1 \times 1$ 2D convolutions. $X \in \mathbb{R}^{T \times N \times C}$ denotes the input tensor, where $T$ represents the number of frames, $N$ represents the number of joint points, and $C$ represents the number of feature channels. $\theta$, $\phi$, $g$ and $W_z$ denote $1 \times 1$ 2D convolutions, $\otimes$ denotes matrix multiplication, and $\oplus$ denotes element-wise addition.
The specific calculation steps of the non-local layer are as follows.

Step one:

$$Y = \mathrm{softmax}\big(\theta(X)\, \phi(X)^{\mathsf{T}}\big)\, g(X)$$

(Note: $\theta(X) = X W_\theta$, $\phi(X) = X W_\phi$ and $g(X) = X W_g$, where $W_\theta$, $W_\phi$ and $W_g$ respectively denote the weights of the three $1 \times 1$ 2D convolutions $\theta$, $\phi$ and $g$.)

Step two:

$$Z = W_z Y + X$$

(Note: $Z$ denotes the output of the non-local layer, and $W_z$ denotes the weight of the $1 \times 1$ 2D convolution $W_z$.)
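Under these two steps, the non-local layer might be sketched in PyTorch as follows. The embedded-Gaussian form with a halved embedding width and the $(batch, C, T, N)$ layout are common choices borrowed from the non-local networks literature; they are assumptions here, not details fixed by this description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalLayer(nn.Module):
    """Non-local block over a (batch, C, T, N) skeleton feature map."""
    def __init__(self, channels):
        super().__init__()
        inter = max(channels // 2, 1)  # embedding width (assumed)
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)
        self.g = nn.Conv2d(channels, inter, kernel_size=1)
        self.w_z = nn.Conv2d(inter, channels, kernel_size=1)

    def forward(self, x):                                       # x: (batch, C, T, N)
        b, c, t, n = x.shape
        # Step one: Y = softmax(theta(X) phi(X)^T) g(X), over all T*N positions
        th = self.theta(x).view(b, -1, t * n).permute(0, 2, 1)  # (b, TN, C')
        ph = self.phi(x).view(b, -1, t * n)                     # (b, C', TN)
        attn = F.softmax(th @ ph, dim=-1)                       # (b, TN, TN)
        gx = self.g(x).view(b, -1, t * n).permute(0, 2, 1)      # (b, TN, C')
        y = (attn @ gx).permute(0, 2, 1).view(b, -1, t, n)      # (b, C', T, N)
        # Step two: Z = W_z Y + X (residual connection)
        return self.w_z(y) + x
```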
We describe here in more detail the flow of data in the model.
We first input the joint sequence into the batch normalization layer to normalize the data. The data is then input into building block one, which produces two identical outputs: one serves directly as the skip input of building block five, and the other is input into building block two. Building block two likewise produces two identical outputs: one serves directly as the skip input of building block four, and the other is input into building block three. The output of building block three is combined with the skip input from building block two to form the input of building block four. The output of building block four is combined with the skip input from building block one to form the input of building block five. The numbers of input/output feature channels of the five building blocks are (1, 16), (16, 32), (32, 64), (64, 128) and (128, 256), respectively. Each building block consists of two space-time graph convolution models and one non-local layer, and the ResNet mechanism is applied to each space-time graph convolution model. After each space-time graph convolution model, we also randomly drop features with a probability of 0.5 (dropout) to avoid overfitting. Global average pooling is then applied to the output of building block five to obtain a 256-dimensional feature vector for each motion action skeleton sequence. Finally, we feed these vectors to the Softmax classifier to obtain the classification result, as sketched below.
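Putting the pieces together, a schematic version of this data flow is sketched below, reusing the `STGCNUnit` and `NonLocalLayer` sketches above. The description does not state how a skip tensor is merged with a block output, so element-wise addition after a 1x1 channel projection is assumed here to reconcile the mismatched channel counts (16 vs. 128 and 32 vs. 64).

```python
import torch
import torch.nn as nn

class BuildingBlock(nn.Module):
    """Two space-time graph convolution units followed by one non-local layer."""
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        self.unit1 = STGCNUnit(in_channels, out_channels, A)
        self.unit2 = STGCNUnit(out_channels, out_channels, A)
        self.drop = nn.Dropout(0.5)  # feature dropout after each unit, per the description
        self.nl = NonLocalLayer(out_channels)

    def forward(self, x):
        return self.nl(self.drop(self.unit2(self.drop(self.unit1(x)))))

class NonLocalSTGCN(nn.Module):
    """BatchNorm -> B1..B5 with skips (B1->B5, B2->B4) -> global average pooling -> classifier."""
    def __init__(self, A, num_classes, num_joints=25):
        super().__init__()
        self.data_bn = nn.BatchNorm1d(1 * num_joints)  # one input feature channel, per the description
        chans = [(1, 16), (16, 32), (32, 64), (64, 128), (128, 256)]
        self.blocks = nn.ModuleList(BuildingBlock(i, o, A) for i, o in chans)
        # assumed 1x1 projections so the skip tensors can be added element-wise
        self.skip_1_to_5 = nn.Conv2d(16, 128, kernel_size=1)
        self.skip_2_to_4 = nn.Conv2d(32, 64, kernel_size=1)
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):                                  # x: (batch, 1, T, N)
        b, c, t, n = x.shape
        x = self.data_bn(x.permute(0, 1, 3, 2).reshape(b, c * n, t))
        x = x.reshape(b, c, n, t).permute(0, 1, 3, 2)
        x1 = self.blocks[0](x)
        x2 = self.blocks[1](x1)
        x3 = self.blocks[2](x2)
        x4 = self.blocks[3](x3 + self.skip_2_to_4(x2))     # skip: block two -> block four
        x5 = self.blocks[4](x4 + self.skip_1_to_5(x1))     # skip: block one -> block five
        out = x5.mean(dim=[2, 3])                          # global average pooling over T and N
        return self.fc(out)                                # class scores; Softmax follows
```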
The global average pooling is calculated as

$$f_c = \frac{1}{T \times N} \sum_{t=1}^{T} \sum_{n=1}^{N} F_{c,t,n}, \qquad c = 1, \dots, C$$

where $F \in \mathbb{R}^{C \times T \times N}$ is the input feature map and $f_c$ is the $c$-th component of the pooled feature vector.

The Softmax is calculated as

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{M} e^{z_j}}, \qquad i = 1, \dots, M$$

where $z_i$ is the score of the $i$-th class and $M$ is the number of action classes.
After the model is constructed, it is trained using stochastic gradient descent with an initial learning rate of 0.1, and the learning rate is decayed by a factor of 0.1 every 10 epochs.
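A sketch of this training setup, assuming PyTorch and reading "decayed by a factor of 0.1" as a multiplicative step decay every 10 epochs; `train_loader` and `num_epochs` are hypothetical placeholders.

```python
import torch

model = NonLocalSTGCN(A, num_classes=60)  # 60 classes for NTU RGB+D (see below)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()   # cross-entropy over the Softmax output

for epoch in range(num_epochs):             # num_epochs: assumed training budget
    for skeletons, labels in train_loader:  # hypothetical DataLoader of (sequence, label)
        optimizer.zero_grad()
        loss = criterion(model(skeletons), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # decay the learning rate by a factor of 0.1 every 10 epochs
```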
To verify the effect of the method provided by the embodiment of the present invention, NTU RGB+D is selected as the dataset and the method is compared with the existing ST-GCN and 2s-AGCN, so as to highlight the effect of the proposed method and model.
Briefly introducing NTU RGB+D (see Amir Shahroudy, Jun Liu, Tian-Tsong Ng, Gang Wang: NTU RGB+D: A Large Scale Benchmark for 3D Human Activity Analysis. CVPR 2016: 1010-1019): NTU RGB+D is a large-scale action recognition dataset containing 56,578 skeleton sequences of 60 action classes, captured from 40 different subjects and 3 different camera views. Each skeleton graph contains 25 human joints as nodes, with their 3D positions in space as initial features. Each action frame contains 1 to 2 subjects. The authors of NTU RGB+D suggest reporting classification accuracy under two settings: (1) Cross-Subject (X-Subject), in which the 40 subjects are split into a training group and a testing group, yielding 40,091 training examples and 16,487 testing examples; (2) Cross-View (X-View), in which all 18,932 samples collected from camera 1 are used for testing and the remaining 37,646 samples are used for training.
Experiments were performed on the NTU RGB+D dataset, and the results are shown in Table 1. The experimental results show that the method provided by the embodiment of the present invention achieves a substantial performance improvement.
Table 1 compares the accuracy of the method provided by the embodiments of the present invention with ST-GCN and 2s-AGCN under the two settings of the NTU RGB+D dataset.
Among them, ST-GCN refers to: Sijie Yan, Yuanjun Xiong, Dahua Lin: Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. AAAI 2018: 7444-7452. 2s-AGCN refers to: Lei Shi, Yifan Zhang, Jian Cheng, Hanqing Lu: Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. CVPR 2019: 12026-12035.
Corresponding to the embodiment of the motion action recognition method, the application also provides an embodiment of a motion action recognition device.
Fig. 6 is a block diagram illustrating a motion action recognition apparatus according to an example embodiment. Referring to fig. 6, the apparatus may include:
the acquisition module 31, configured to acquire a skeleton sequence of the motion action captured by the pose estimation device;
the recognition module 32, configured to input the skeleton sequence into a trained non-local space-time graph convolution model to obtain a motion action recognition result; wherein the non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the corresponding parts of the method embodiments for the relevant details. The device embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present application. Those of ordinary skill in the art can understand and implement this without inventive effort.
Correspondingly, the present application also provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the motion action recognition method as described above.
Accordingly, the present application also provides a computer-readable storage medium having computer instructions stored thereon, wherein the instructions, when executed by a processor, implement the motion action recognition method as described above.
The embodiment of the present invention further provides a non-local space-time graph convolution model. The non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
With respect to the non-local space-time graph convolution model in the above embodiment, the specific manner of each part thereof has been described in detail in the embodiment related to the method, and will not be elaborated herein.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (6)

1. A motion action recognition method, comprising:
acquiring a skeleton sequence of a motion action captured by a pose estimation device;
inputting the skeleton sequence into a trained non-local space-time diagram convolution model to obtain a motion action recognition result;
wherein the non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
2. The method of claim 1, wherein the pose estimation device employs an Azure Kinect DK depth camera.
3. A motion action recognition apparatus, comprising:
the acquisition module, used for acquiring a skeleton sequence of a motion action captured by a pose estimation device;
the recognition module is used for inputting the skeleton sequence into a trained non-local space-time graph convolution model to obtain a motion action recognition result;
wherein the non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
4. A non-local space-time graph convolution model, wherein the non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
5. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of claim 1.
6. A computer-readable storage medium having stored thereon computer instructions, which when executed by a processor, perform the steps of the method of claim 1.
CN202110371059.1A 2021-04-07 2021-04-07 Motion action recognition method and device, model, electronic equipment and storage medium Pending CN112801060A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110371059.1A CN112801060A (en) 2021-04-07 2021-04-07 Motion action recognition method and device, model, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN112801060A true CN112801060A (en) 2021-05-14

Family

ID=75816376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110371059.1A Pending CN112801060A (en) 2021-04-07 2021-04-07 Motion action recognition method and device, model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112801060A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109919232A (en) * 2019-03-11 2019-06-21 西安电子科技大学 Image classification method based on convolutional neural networks and non local connection network
CN110532925A (en) * 2019-08-22 2019-12-03 西安电子科技大学 Driver Fatigue Detection based on space-time diagram convolutional network
CN110796110A (en) * 2019-11-05 2020-02-14 西安电子科技大学 Human behavior identification method and system based on graph convolution network
CN111460928A (en) * 2020-03-17 2020-07-28 中国科学院计算技术研究所 Human body action recognition system and method
CN111601088A (en) * 2020-05-27 2020-08-28 大连成者科技有限公司 Sitting posture monitoring system based on monocular camera sitting posture identification technology
CN111612046A (en) * 2020-04-29 2020-09-01 杭州电子科技大学 Characteristic pyramid graph convolutional neural network and application thereof in 3D point cloud classification
CN111652124A (en) * 2020-06-02 2020-09-11 电子科技大学 Construction method of human behavior recognition model based on graph convolution network
CN111814719A (en) * 2020-07-17 2020-10-23 江南大学 Skeleton behavior identification method based on 3D space-time diagram convolution
CN111860267A (en) * 2020-07-13 2020-10-30 浙大城市学院 Multichannel body-building movement identification method based on human body bone joint point positions
CN111950406A (en) * 2020-07-28 2020-11-17 深圳职业技术学院 Finger vein identification method, device and storage medium
CN112232106A (en) * 2020-08-12 2021-01-15 北京工业大学 Two-dimensional to three-dimensional human body posture estimation method
CN112528811A (en) * 2020-12-02 2021-03-19 建信金融科技有限责任公司 Behavior recognition method and device


Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
LEI SHI et al.: "Non-Local Graph Convolutional Networks for Skeleton-Based Action Recognition", arXiv:1805.07694v2 *
LEI SHI et al.: "Skeleton-Based Action Recognition With Multi-Stream Adaptive Graph Convolutional Networks", IEEE Transactions on Image Processing *
LEI SHI et al.: "Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) *
SIJIE YAN et al.: "Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition", arXiv:1801.07455v2 *
XIAOLONG WANG et al.: "Non-local Neural Networks", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *
CAO Yi et al.: "Skeleton action recognition with spatio-temporal adaptive graph convolutional neural networks", Journal of Huazhong University of Science and Technology *
WANG Zhihua: "Research on human action recognition based on spatio-temporal graph convolutional neural networks", China Master's Theses Full-text Database, Information Science and Technology *
HUANG Chen: "Research on video human action recognition based on pose sequences", China Master's Theses Full-text Database, Information Science and Technology *

Similar Documents

Publication Publication Date Title
Zhang et al. Graph edge convolutional neural networks for skeleton-based action recognition
Liu et al. Trajectorycnn: a new spatio-temporal feature learning network for human motion prediction
Xia et al. Multi-scale mixed dense graph convolution network for skeleton-based action recognition
CN110472604B (en) Pedestrian and crowd behavior identification method based on video
Geng et al. Human action recognition based on convolutional neural networks with a convolutional auto-encoder
CN109558781A (en) A kind of multi-angle video recognition methods and device, equipment and storage medium
Bruce et al. Multimodal fusion via teacher-student network for indoor action recognition
CN112131908A (en) Action identification method and device based on double-flow network, storage medium and equipment
Fan et al. Context-aware cross-attention for skeleton-based human action recognition
CN108647571A (en) Video actions disaggregated model training method, device and video actions sorting technique
CN111401106A (en) Behavior identification method, device and equipment
Zhang et al. Graph convolutional LSTM model for skeleton-based action recognition
Jiang et al. Inception spatial temporal graph convolutional networks for skeleton-based action recognition
Chen et al. Hierarchical posture representation for robust action recognition
Wei et al. Dynamic hypergraph convolutional networks for skeleton-based action recognition
Bavil et al. Action Capsules: Human skeleton action recognition
Wu et al. Multimodal human action recognition based on spatio-temporal action representation recognition model
Xiaolong Simulation analysis of athletes’ motion recognition based on deep learning method and convolution algorithm
CN112801060A (en) Motion action recognition method and device, model, electronic equipment and storage medium
CN114782992A (en) Super-joint and multi-mode network and behavior identification method thereof
Raju Exercise detection and tracking using MediaPipe BlazePose and Spatial-Temporal Graph Convolutional Neural Network
CN112926517B (en) Artificial intelligence monitoring method
Zhong et al. Research on discriminative skeleton-based action recognition in spatiotemporal fusion and human-robot interaction
Sun et al. A Deep Learning Method for Intelligent Analysis of Sports Training Postures
Shi et al. Graph convolutional networks with objects for skeleton-based action recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210514