CN112801060A - Motion action recognition method and device, model, electronic equipment and storage medium - Google Patents

Motion action recognition method and device, model, electronic equipment and storage medium

Info

Publication number
CN112801060A
CN112801060A (application CN202110371059.1A)
Authority
CN
China
Prior art keywords
building block
space
sequence
layer
time graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110371059.1A
Other languages
Chinese (zh)
Inventor
蔡建平
何喆
林型双
顾鹏坤
张帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University City College ZUCC
Original Assignee
Zhejiang University City College ZUCC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University City College ZUCC filed Critical Zhejiang University City College ZUCC
Priority to CN202110371059.1A priority Critical patent/CN112801060A/en
Publication of CN112801060A publication Critical patent/CN112801060A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The application discloses a motion action recognition method and device, a model, an electronic device and a storage medium, the method comprising the following steps: acquiring a skeleton sequence of a motion action captured by a pose estimation device; and inputting the skeleton sequence into a trained non-local space-time graph convolution model to obtain a motion action recognition result. The non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.

Description

Motion action recognition method and device, model, electronic equipment and storage medium
Technical Field
The present application relates to the technical field of deep neural networks, and in particular to a motion action recognition method and device, a model, an electronic device and a storage medium.
Background
Intelligent sports equipment needs the ability to identify types of human actions in order to judge a user's fitness movements (such as squats, push-ups, sit-ups and the like), and changes in the human joint sequence are very important for identifying the type of human action. Traditional methods for modeling joint-sequence variation often rely on hand-crafted features, which results in limited expressive power and difficulty in generalization. To overcome these limitations, a new approach is needed that can automatically capture the spatial and temporal patterns of change in joint sequences. Recently, graph convolutional neural networks (GCNs), which generalize convolutional neural networks (CNNs) to graphs of arbitrary structure, have received increasing attention and have been successfully adopted in many applications, such as image classification, document classification and semi-supervised learning.
The space-time graph convolution model was the first to apply graph convolution to the human action classification task. Although the space-time graph convolution model can model changes in the human skeleton sequence well, the locality of the convolution operation prevents it from representing long-range spatio-temporal dependencies well, and such dependencies are very important for recognizing some motion actions.
Disclosure of Invention
The embodiments of the application aim to provide a motion action recognition method and device, a model, an electronic device and a storage medium, so as to solve the problem that the space-time graph convolution model cannot model long-range spatio-temporal dependencies.
According to a first aspect of the embodiments of the present application, there is provided a motion action recognition method, including: acquiring a skeleton sequence of a motion action captured by a pose estimation device; and inputting the skeleton sequence into a trained non-local space-time graph convolution model to obtain a motion action recognition result. The non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
According to a second aspect of the embodiments of the present application, there is provided a motion action recognition apparatus, including: an acquisition module for acquiring a skeleton sequence of a motion action captured by a pose estimation device; and a recognition module for inputting the skeleton sequence into a trained non-local space-time graph convolution model to obtain a motion action recognition result. The non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
According to a third aspect of the embodiments of the present application, there is provided a non-local space-time graph convolution model. The non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
According to a fourth aspect of the embodiments of the present application, there is provided an electronic device, including: one or more processors; and a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described in the first aspect.
According to a fifth aspect of the embodiments of the present application, there is provided a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method according to the first aspect.
The technical solutions provided by the embodiments of the present application may have the following beneficial effects:
According to the above embodiments, a pose estimation device is used to obtain a skeleton sequence of the motion action, and the obtained skeleton sequence is input into a trained non-local space-time graph convolution model to obtain the motion action recognition result. Changes in the human skeleton sequence are crucial for recognizing the type of human action. Although the space-time graph convolution model can model such changes well, the locality of the convolution operation prevents it from representing long-range spatio-temporal dependencies, which are crucial for recognizing some motion actions. Non-local operations enhance the ability of the space-time graph convolution model to model the relationships between the human joint points within a frame, i.e., its spatial modeling ability. Through the skip connections, sequence information propagates better through the model, which enhances its temporal modeling ability. The combination of non-local operations, skip connections and space-time graph convolution therefore gives the model better spatio-temporal modeling capability.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flowchart illustrating a motion action recognition method according to an exemplary embodiment.
FIG. 2 is a space-time graph of a skeleton sequence used by the space-time graph convolution, shown according to an exemplary embodiment; the points in FIG. 2 represent the joints of the human body, the edges between body joints are defined according to the natural connections of the human body, the inter-frame edges connect the same joints between successive frames, and the joint coordinates serve as the input to the space-time graph convolution.
FIG. 3 is a diagram illustrating a distance partitioning strategy, according to an example embodiment.
FIG. 4 is a diagram of a non-local space-time graph convolution model architecture in accordance with an exemplary embodiment.
FIG. 5 is a non-local layer structure diagram shown in accordance with an example embodiment.
Fig. 6 is a block diagram illustrating a motion action recognition apparatus according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatuses and methods consistent with certain aspects of the present application, as recited in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
Fig. 1 is a flowchart illustrating a motion action recognition method according to an exemplary embodiment. Referring to fig. 1, the motion action recognition method according to an embodiment of the present invention may include:
step S11, acquiring a skeleton sequence of the motion action captured by the pose estimation device;
step S12, inputting the skeleton sequence into a trained non-local space-time graph convolution model to obtain a motion action recognition result;
The non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
According to this embodiment, changes in the human skeleton sequence are crucial for identifying the type of human action. The space-time graph convolution model can model such changes well, but due to the locality of the convolution operation it cannot represent long-range spatio-temporal dependencies well, and these dependencies are crucial for recognizing some motion actions.
In the specific implementation of step S11, a skeleton sequence of the motion action captured by the pose estimation device is acquired;
specifically, the attitude estimation device of the present embodiment adopts an Azure Kinect DK depth camera, which is certainly not limited thereto; and capturing a motion skeleton sequence in the motion action video through the depth camera.
In one possible implementation, the motion action video captured by the depth camera is a video composed of successive image frames in which a person performs some motion, such as push-ups, squats, pull-ups, and the like.
In the specific implementation of step S12, the skeleton sequence is input into the trained non-local space-time graph convolution model to obtain the motion action recognition result;
Specifically, FIG. 4 is a diagram of the non-local space-time graph convolution model architecture, shown according to an exemplary embodiment. Referring to fig. 4, the non-local space-time graph convolution model is formed by sequentially stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer; the building block group comprises building block one B1, building block two B2, building block three B3, building block four B4 and building block five B5, which are connected in sequence; an additional skip connection is further arranged between building block one and building block five, and another between building block two and building block four; and each building block is composed of two space-time graph convolution models and one non-local layer.
The implementation steps of the space-time graph convolution model comprise:
(1) Constructing a space-time graph of the joints on the motion action skeleton sequence; referring to fig. 2, the motion action skeleton sequence comprises a plurality of frames, and each frame comprises a human skeleton graph;
Specifically, the skeleton sequence is typically represented by the 2D or 3D coordinates of each human joint in each frame. In practical applications, the Azure Kinect DK is mainly used to collect the joint point data. In the space-time graph convolution model, the joint sequence is represented hierarchically by a space-time graph.
Given a joint sequence with $N$ joint points and $T$ frames, an undirected space-time graph $G = (V, E)$ is constructed on it. In this graph, the node set $V = \{ v_{ti} \mid t = 1, \dots, T;\; i = 1, \dots, N \}$ (where $v_{ti}$ denotes the $i$-th node on the $t$-th frame) comprises all the nodes, and the coordinate vector of each node $v_{ti}$ in the node sequence is input into the space-time graph convolution model as its feature vector. The edge set $E$ comprises two subsets. The first subset $E_S = \{ v_{ti} v_{tj} \mid (i, j) \in H \}$ (where $v_{tj}$ denotes the $j$-th joint point on the $t$-th frame, which forms a natural connecting edge with $v_{ti}$), in which $H$ is the set of natural connecting edges between human joint points, describes the connections between joint points within the same frame. The second subset $E_F = \{ v_{ti} v_{(t+1)i} \}$ comprises the inter-frame edges connecting the same joint points between successive frames. Therefore, for a particular joint point $i$, all the edges in $E_F$ represent its trajectory over time.
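As a concrete illustration (our own, not part of the original disclosure), the sketch below builds the intra-frame adjacency matrix for the edge subset $E_S$ from a list of natural bone connections. The 25-joint count and the partial `bone_pairs` list are assumptions for illustration; the exact joint layout depends on the pose estimation device.

```python
import torch

# Hypothetical example: a 25-joint skeleton whose natural bone connections
# are listed as (i, j) index pairs; the pair list here is partial and
# illustrative only.
N = 25
bone_pairs = [(0, 1), (1, 2), (2, 3), (2, 4), (4, 5), (5, 6)]

def build_spatial_adjacency(num_joints, pairs):
    """Symmetric adjacency matrix A for the intra-frame edge subset E_S."""
    A = torch.zeros(num_joints, num_joints)
    for i, j in pairs:
        A[i, j] = A[j, i] = 1.0  # the skeleton graph is undirected
    return A

A = build_spatial_adjacency(N, bone_pairs)
```

The inter-frame subset $E_F$ needs no explicit matrix: since every joint connects only to itself in adjacent frames, those edges are realized implicitly by an ordinary convolution along the frame axis, as formalized in steps (6) and (7) below.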
(2) Defining a distance-based sampling function on the spatial graph of each frame of the motion action skeleton sequence;
Specifically, on a single frame at time $t$ there are $N$ joint points $V_t$ and the skeleton edges $E_S(t) = \{ v_{ti} v_{tj} \mid (i, j) \in H \}$. In conventional convolution, when the input is a 2D grid, the output feature map of the convolution operation is also a 2D grid; using a stride of one and appropriate padding, the output feature map can have the same size as the input feature map, and we assume this case in the following description. Consider a conventional convolution with kernel size $K \times K$ applied to an input feature map $f_{in}$ with $c$ channels. Its output value at spatial position $\mathbf{x}$ is:

$$f_{out}(\mathbf{x}) = \sum_{h=1}^{K} \sum_{w=1}^{K} f_{in}\big(\mathbf{p}(\mathbf{x}, h, w)\big) \cdot \mathbf{w}(h, w)$$

where the sampling function $\mathbf{p}(\mathbf{x}, h, w)$ traverses the neighbors of position $\mathbf{x}$, and the weight function $\mathbf{w}(h, w)$ provides a weight vector in $c$-dimensional real space for computing the inner product with the $c$-dimensional input feature vector. The convolution operation on the graph is then defined by extending the above formula to the case where the input feature map lies on a spatial graph.

On the image, the sampling function $\mathbf{p}$ is defined on the pixels adjacent to the center position $\mathbf{x}$. On the graph, the sampling function can similarly be defined on the neighbor set $B(v_{ti}) = \{ v_{tj} \mid d(v_{tj}, v_{ti}) \le D \}$ of a node $v_{ti}$. Here $d(v_{tj}, v_{ti})$ denotes the minimum length of any path from $v_{tj}$ to $v_{ti}$, and $D$ indicates the selectable path length. Thus, the sampling function $\mathbf{p} : B(v_{ti}) \to V$ can be written as

$$\mathbf{p}(v_{ti}, v_{tj}) = v_{tj}$$
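For illustration only, the neighbor set $B(v_{ti})$ used by this sampling function can be enumerated from the adjacency matrix built above; the helper name and the power-iteration approach are our own choices, not part of the original disclosure.

```python
import torch

def neighbor_sets(A, D=1):
    """B(v_i) = { v_j : d(v_j, v_i) <= D }, computed from the adjacency matrix A."""
    num_joints = A.size(0)
    reach = torch.eye(num_joints, dtype=torch.bool)  # distance 0: each node itself
    frontier = torch.eye(num_joints)
    for _ in range(D):
        frontier = (frontier @ A).clamp(max=1)  # nodes one more hop away
        reach |= frontier.bool()
    return [torch.nonzero(reach[i]).flatten().tolist() for i in range(num_joints)]

# With D = 1 (the setting adopted by the distance partitioning strategy in
# step (3) below), B(v_i) is v_i itself plus its directly connected joints.
B = neighbor_sets(A, D=1)
```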
(3) Defining a mapping function from nodes to labels on the spatial graph, implemented by a distance partitioning strategy;
Specifically, a distance partitioning strategy is employed to implement the label map $l_{ti} : B(v_{ti}) \to \{0, \dots, K-1\}$. The specific strategy is described below, in conjunction with FIG. 3.

The distance partitioning strategy divides the neighbor set according to the distance $d(v_{tj}, v_{ti})$ from a node $v_{tj}$ to the root node $v_{ti}$, where $v_{tj}$ denotes the other joint points in the same frame. In the space-time graph convolution model, $D = 1$ is set, and the neighbor set is divided into two subsets: $d = 0$ represents the root node itself, and $d = 1$ represents the remaining adjacent nodes. The space-time graph convolution model will thus have two different weight vectors, which can model local differential properties. Formally,

$$l_{ti}(v_{tj}) = \begin{cases} 0, & \text{if } d(v_{tj}, v_{ti}) = 0 \\ 1, & \text{if } d(v_{tj}, v_{ti}) = 1 \end{cases}$$
(4) Defining a weight function based on the mapping function;
Specifically, the neighbor set $B(v_{ti})$ of the joint point $v_{ti}$ is divided by the distance partitioning strategy into two fixed subsets, each with a numeric label. We therefore have a map $l_{ti} : B(v_{ti}) \to \{0, 1\}$ that maps each neighboring node to the label of its corresponding subset. The weight function $\mathbf{w}(v_{ti}, v_{tj})$ can then be implemented by indexing a tensor of dimension $(K, c)$ (here $K = 2$):

$$\mathbf{w}(v_{ti}, v_{tj}) = \mathbf{w}'\big(l_{ti}(v_{tj})\big)$$
(5) Generalizing the conventional convolution to the spatial graph convolution, based on the sampling function and the weight function;
Specifically, the conventional convolution is now rewritten in the form of a graph convolution:

$$f_{out}(v_{ti}) = \sum_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})} f_{in}\big(\mathbf{p}(v_{ti}, v_{tj})\big) \cdot \mathbf{w}(v_{ti}, v_{tj})$$

The normalization term $Z_{ti}(v_{tj}) = \big|\{ v_{tk} \mid l_{ti}(v_{tk}) = l_{ti}(v_{tj}) \}\big|$ equals the cardinality of the corresponding subset; this term is added to balance the contributions of the different subsets to the output. Substituting the sampling function and the weight function, we obtain

$$f_{out}(v_{ti}) = \sum_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})} f_{in}(v_{tj}) \cdot \mathbf{w}'\big(l_{ti}(v_{tj})\big)$$
(6) Extending the sampling function and the mapping function to the time dimension, thereby generalizing the spatial graph convolution operation to the space-time domain;
Specifically, with the spatial graph convolution formulated, we now turn to the task of modeling the dynamics within the joint sequence over time. The notion of neighborhood is extended to also include temporally connected joints:

$$B(v_{ti}) = \big\{ v_{qj} \;\big|\; d(v_{tj}, v_{ti}) \le D,\; |q - t| \le \lfloor \Gamma / 2 \rfloor \big\}$$

The parameter $\Gamma$ controls the temporal extent of the neighborhood graph and can therefore be called the temporal convolution kernel size; $q$ denotes the $q$-th frame. To complete the convolution on the space-time graph, the space-time graph convolution also needs a sampling function; the sampling function and the weight function are the same as in the spatial-graph case, except that the label map $l_{ti}$ differs. Because the time axis is regular, the space-time graph convolution directly changes the spatio-temporal neighborhood label map rooted at $v_{ti}$ into

$$l_{ST}(v_{qj}) = l_{ti}(v_{tj}) + \big(q - t + \lfloor \Gamma / 2 \rfloor\big) \times K$$

where $K$ is the number of subsets produced by the partitioning strategy. In this way, the space-time graph convolution model defines a well-defined convolution operation on the constructed space-time graph.
(7) Performing the spatial graph convolution on the spatial graph and the temporal convolution on the time dimension, respectively, to realize the space-time graph convolution model.
Specifically, implementing graph-based convolution is not as straightforward as 2D or 3D convolution. Here we provide detailed implementation information of the space-time graph convolution for skeleton-based motion recognition.
The connections between the human joint points within a frame are represented by an adjacency matrix $A$, and the identity matrix $I$ represents the self-connections.

In the single-frame case, for the distance partitioning strategy, the adjacency matrix $A + I$ is split into several matrices $A_j$ such that $A + I = \sum_j A_j$, with $A_0 = I$ and $A_1 = A$. The spatial graph convolution can therefore be implemented by

$$f_{out} = \Lambda^{-\frac{1}{2}} (A + I) \Lambda^{-\frac{1}{2}} f_{in} W$$

and, in a similar manner with the partitioned subsets,

$$f_{out} = \sum_j \Lambda_j^{-\frac{1}{2}} A_j \Lambda_j^{-\frac{1}{2}} f_{in} W_j$$

where $\Lambda_j^{ii} = \sum_k A_j^{ik} + \alpha$ ($\Lambda_j^{ii}$ denotes the element in the $i$-th row and $i$-th column of $\Lambda_j$, and $A_j^{ik}$ denotes the element in the $i$-th row and $k$-th column of $A_j$), and $\Lambda_j$ is the degree matrix of $A_j$. We set $\alpha = 0.001$ to avoid all-zero rows in $A_j$.

In fact, in the spatio-temporal case, the input feature map can be represented as a tensor of dimension $(C, T, N)$. We implement the space-time graph convolution by performing the spatial graph convolution in the third dimension of the tensor, i.e., the joint dimension, and the temporal convolution in the second dimension, i.e., the frame dimension.
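A minimal PyTorch sketch of this adjacency-based implementation follows. It applies the formula above with $A_0 = I$ and $A_1 = A$ under the distance partitioning strategy, then appends a temporal convolution on the frame axis; the temporal kernel size of 9, the $(batch, C, T, N)$ tensor layout and all names are our assumptions rather than details fixed by the description.

```python
import torch
import torch.nn as nn

def normalize_subset(A_j, alpha=0.001):
    """Lambda_j^{-1/2} A_j Lambda_j^{-1/2}, with Lambda_j^{ii} = sum_k A_j^{ik} + alpha."""
    d_inv_sqrt = (A_j.sum(dim=1) + alpha).pow(-0.5)
    return torch.diag(d_inv_sqrt) @ A_j @ torch.diag(d_inv_sqrt)

class SpatialGraphConv(nn.Module):
    """f_out = sum_j Lambda_j^{-1/2} A_j Lambda_j^{-1/2} f_in W_j (distance partitioning, K = 2)."""
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        subsets = torch.stack([normalize_subset(torch.eye(A.size(0))),  # A_0 = I (root)
                               normalize_subset(A)])                    # A_1 = A (neighbors)
        self.register_buffer("A_hat", subsets)                          # (K, N, N)
        # the W_j, realized as one 1x1 convolution producing K groups of channels
        self.conv = nn.Conv2d(in_channels, out_channels * 2, kernel_size=1)

    def forward(self, x):                          # x: (batch, C, T, N)
        y = self.conv(x)
        b, kc, t, n = y.shape
        y = y.view(b, 2, kc // 2, t, n)
        # contract the joint dimension with each normalized subset A_j, then sum over j
        return torch.einsum("bkctn,knm->bctm", y, self.A_hat)

class STGCNUnit(nn.Module):
    """Spatial graph convolution on the joint axis, then temporal convolution on the frame axis."""
    def __init__(self, in_channels, out_channels, A, temporal_kernel=9):
        super().__init__()
        self.gcn = SpatialGraphConv(in_channels, out_channels, A)
        pad = (temporal_kernel - 1) // 2
        self.tcn = nn.Conv2d(out_channels, out_channels,
                             kernel_size=(temporal_kernel, 1), padding=(pad, 0))
        # ResNet-style residual on the unit, as stated later in the description
        self.residual = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, kernel_size=1))
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.tcn(self.gcn(x)) + self.residual(x))
```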
FIG. 5 is a diagram illustrating the structure of the non-local layer. The non-local layer comprises $1 \times 1$ 2D convolutions. $X \in \mathbb{R}^{T \times N \times C}$ denotes the input tensor, where $T$ represents the number of frames, $N$ represents the number of joint points, and $C$ represents the number of feature channels. $\theta$, $\phi$, $g$ and $W_z$ denote $1 \times 1$ 2D convolutions, $\otimes$ denotes matrix multiplication, and $\oplus$ denotes element-wise addition.
The specific calculation steps of the non-local layer are as follows.

Step one:

$$Y = \mathrm{softmax}\big(\theta(X)\, \phi(X)^{\mathsf{T}}\big)\, g(X)$$

(Note: $\theta(X) = X W_\theta$, $\phi(X) = X W_\phi$ and $g(X) = X W_g$, where $W_\theta$, $W_\phi$ and $W_g$ respectively denote the weights of the three $1 \times 1$ 2D convolutions $\theta$, $\phi$ and $g$.)

Step two:

$$Z = W_z Y + X$$

(Note: $Z$ denotes the output of the non-local layer, and $W_z$ denotes the weight of the $1 \times 1$ 2D convolution $W_z$.)
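Under these two steps, the non-local layer might be sketched in PyTorch as follows. The embedded-Gaussian form with a halved embedding width and the $(batch, C, T, N)$ layout are common choices borrowed from the non-local networks literature; they are assumptions here, not details fixed by this description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalLayer(nn.Module):
    """Non-local block over a (batch, C, T, N) skeleton feature map."""
    def __init__(self, channels):
        super().__init__()
        inter = max(channels // 2, 1)  # embedding width (assumed)
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)
        self.g = nn.Conv2d(channels, inter, kernel_size=1)
        self.w_z = nn.Conv2d(inter, channels, kernel_size=1)

    def forward(self, x):                                       # x: (batch, C, T, N)
        b, c, t, n = x.shape
        # Step one: Y = softmax(theta(X) phi(X)^T) g(X), over all T*N positions
        th = self.theta(x).view(b, -1, t * n).permute(0, 2, 1)  # (b, TN, C')
        ph = self.phi(x).view(b, -1, t * n)                     # (b, C', TN)
        attn = F.softmax(th @ ph, dim=-1)                       # (b, TN, TN)
        gx = self.g(x).view(b, -1, t * n).permute(0, 2, 1)      # (b, TN, C')
        y = (attn @ gx).permute(0, 2, 1).view(b, -1, t, n)      # (b, C', T, N)
        # Step two: Z = W_z Y + X (residual connection)
        return self.w_z(y) + x
```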
We describe here in more detail the flow of data in the model.
We first input the joint sequence into the batch normalization layer to normalize the data. The data is then input into building block one, which produces two identical outputs: one serves directly as the skip input of building block five, and the other is input into building block two. Building block two likewise produces two identical outputs: one serves directly as the skip input of building block four, and the other is input into building block three. The output of building block three is combined with the skip input from building block two to form the input of building block four. The output of building block four is combined with the skip input from building block one to form the input of building block five. The numbers of input/output feature channels of the five building blocks are (1, 16), (16, 32), (32, 64), (64, 128) and (128, 256), respectively. Each building block consists of two space-time graph convolution models and one non-local layer, and the ResNet mechanism is applied to each space-time graph convolution model. After each space-time graph convolution model, we also randomly drop features with a probability of 0.5 (dropout) to avoid overfitting. Global average pooling is then applied to the output of building block five to obtain a 256-dimensional feature vector for each motion action skeleton sequence. Finally, we feed these vectors to the Softmax classifier to obtain the classification result, as sketched below.
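Putting the pieces together, a schematic version of this data flow is sketched below, reusing the `STGCNUnit` and `NonLocalLayer` sketches above. The description does not state how a skip tensor is merged with a block output, so element-wise addition after a 1x1 channel projection is assumed here to reconcile the mismatched channel counts (16 vs. 128 and 32 vs. 64).

```python
import torch
import torch.nn as nn

class BuildingBlock(nn.Module):
    """Two space-time graph convolution units followed by one non-local layer."""
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        self.unit1 = STGCNUnit(in_channels, out_channels, A)
        self.unit2 = STGCNUnit(out_channels, out_channels, A)
        self.drop = nn.Dropout(0.5)  # feature dropout after each unit, per the description
        self.nl = NonLocalLayer(out_channels)

    def forward(self, x):
        return self.nl(self.drop(self.unit2(self.drop(self.unit1(x)))))

class NonLocalSTGCN(nn.Module):
    """BatchNorm -> B1..B5 with skips (B1->B5, B2->B4) -> global average pooling -> classifier."""
    def __init__(self, A, num_classes, num_joints=25):
        super().__init__()
        self.data_bn = nn.BatchNorm1d(1 * num_joints)  # one input feature channel, per the description
        chans = [(1, 16), (16, 32), (32, 64), (64, 128), (128, 256)]
        self.blocks = nn.ModuleList(BuildingBlock(i, o, A) for i, o in chans)
        # assumed 1x1 projections so the skip tensors can be added element-wise
        self.skip_1_to_5 = nn.Conv2d(16, 128, kernel_size=1)
        self.skip_2_to_4 = nn.Conv2d(32, 64, kernel_size=1)
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):                                  # x: (batch, 1, T, N)
        b, c, t, n = x.shape
        x = self.data_bn(x.permute(0, 1, 3, 2).reshape(b, c * n, t))
        x = x.reshape(b, c, n, t).permute(0, 1, 3, 2)
        x1 = self.blocks[0](x)
        x2 = self.blocks[1](x1)
        x3 = self.blocks[2](x2)
        x4 = self.blocks[3](x3 + self.skip_2_to_4(x2))     # skip: block two -> block four
        x5 = self.blocks[4](x4 + self.skip_1_to_5(x1))     # skip: block one -> block five
        out = x5.mean(dim=[2, 3])                          # global average pooling over T and N
        return self.fc(out)                                # class scores; Softmax follows
```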
The global average pooling is calculated as

$$f_c = \frac{1}{T \times N} \sum_{t=1}^{T} \sum_{n=1}^{N} F_{c,t,n}, \qquad c = 1, \dots, C$$

where $F \in \mathbb{R}^{C \times T \times N}$ is the input feature map and $f_c$ is the $c$-th component of the pooled feature vector.

The Softmax is calculated as

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{M} e^{z_j}}, \qquad i = 1, \dots, M$$

where $z_i$ is the score of the $i$-th class and $M$ is the number of action classes.
After the model is constructed, it is trained using stochastic gradient descent with an initial learning rate of 0.1, and the learning rate is decayed by a factor of 0.1 every 10 epochs.
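A sketch of this training setup, assuming PyTorch and reading "decayed by a factor of 0.1" as a multiplicative step decay every 10 epochs; `train_loader` and `num_epochs` are hypothetical placeholders.

```python
import torch

model = NonLocalSTGCN(A, num_classes=60)  # 60 classes for NTU RGB+D (see below)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()   # cross-entropy over the Softmax output

for epoch in range(num_epochs):             # num_epochs: assumed training budget
    for skeletons, labels in train_loader:  # hypothetical DataLoader of (sequence, label)
        optimizer.zero_grad()
        loss = criterion(model(skeletons), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # decay the learning rate by a factor of 0.1 every 10 epochs
```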
To verify the effect of the method provided by the embodiment of the present invention, NTU RGB+D is selected as the dataset and the method is compared with the existing ST-GCN and 2s-AGCN, so as to highlight the effect of the proposed method and model.
Briefly introducing NTU RGB+D (see Amir Shahroudy, Jun Liu, Tian-Tsong Ng, Gang Wang: NTU RGB+D: A Large Scale Benchmark for 3D Human Activity Analysis. CVPR 2016: 1010-1019): NTU RGB+D is a large-scale action recognition dataset containing 56,578 skeleton sequences of 60 action classes, captured from 40 different subjects and 3 different camera views. Each skeleton graph contains 25 human joints as nodes, with their 3D positions in space as initial features. Each action frame contains 1 to 2 subjects. The authors of NTU RGB+D suggest reporting classification accuracy under two settings: (1) Cross-Subject (X-Subject), in which the 40 subjects are split into a training group and a testing group, yielding 40,091 training examples and 16,487 testing examples; (2) Cross-View (X-View), in which all 18,932 samples collected from camera 1 are used for testing and the remaining 37,646 samples are used for training.
Experiments were performed on the NTU RGB+D dataset, and the results are shown in Table 1. The experimental results show that the method provided by the embodiment of the present invention achieves a substantial performance improvement.
Table 1 compares the accuracy of the method provided by the embodiments of the present invention with ST-GCN and 2s-AGCN under the two settings of the NTU RGB+D dataset.
Among them, ST-GCN refers to: Sijie Yan, Yuanjun Xiong, Dahua Lin: Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. AAAI 2018: 7444-7452. 2s-AGCN refers to: Lei Shi, Yifan Zhang, Jian Cheng, Hanqing Lu: Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. CVPR 2019: 12026-12035.
Corresponding to the embodiment of the motion action recognition method, the application also provides an embodiment of a motion action recognition device.
Fig. 6 is a block diagram illustrating a motion action recognition apparatus according to an example embodiment. Referring to fig. 6, the apparatus may include:
the acquisition module 31, configured to acquire a skeleton sequence of the motion action captured by the pose estimation device;
the recognition module 32, configured to input the skeleton sequence into a trained non-local space-time graph convolution model to obtain a motion action recognition result; wherein the non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the corresponding parts of the method embodiments for the relevant details. The device embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present application. Those of ordinary skill in the art can understand and implement this without inventive effort.
Correspondingly, the present application also provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the motion action recognition method as described above.
Accordingly, the present application also provides a computer-readable storage medium having computer instructions stored thereon, wherein the instructions, when executed by a processor, implement the motion action recognition method as described above.
The embodiment of the present invention further provides a non-local space-time graph convolution model. The non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
With respect to the non-local space-time graph convolution model in the above embodiment, the specific manner of each part thereof has been described in detail in the embodiment related to the method, and will not be elaborated herein.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (6)

1. A motion action recognition method, comprising:
acquiring a skeleton sequence of a motion action captured by a pose estimation device;
inputting the skeleton sequence into a trained non-local space-time diagram convolution model to obtain a motion action recognition result;
wherein the non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
2. The method of claim 1, wherein the pose estimation device employs an Azure Kinect DK depth camera.
3. A motion action recognition apparatus, comprising:
the acquisition module, used for acquiring a skeleton sequence of a motion action captured by a pose estimation device;
the recognition module is used for inputting the skeleton sequence into a trained non-local space-time graph convolution model to obtain a motion action recognition result;
wherein the non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
4. A non-local space-time graph convolution model, wherein the non-local space-time graph convolution model is formed by stacking a batch normalization layer, a building block group, a global average pooling layer and a Softmax layer in sequence; the building block group comprises a first building block, a second building block, a third building block, a fourth building block and a fifth building block which are connected in sequence; an additional skip connection is arranged between the first building block and the fifth building block, and another additional skip connection is arranged between the second building block and the fourth building block; and each building block consists of two space-time graph convolution models and a non-local layer.
5. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of claim 1.
6. A computer-readable storage medium having stored thereon computer instructions, which when executed by a processor, perform the steps of the method of claim 1.
CN202110371059.1A 2021-04-07 2021-04-07 Motion action recognition method and device, model, electronic equipment and storage medium Pending CN112801060A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110371059.1A CN112801060A (en) 2021-04-07 2021-04-07 Motion action recognition method and device, model, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN112801060A true CN112801060A (en) 2021-05-14

Family

ID=75816376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110371059.1A Pending CN112801060A (en) 2021-04-07 2021-04-07 Motion action recognition method and device, model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112801060A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109919232A (en) * 2019-03-11 2019-06-21 西安电子科技大学 Image classification method based on convolutional neural networks and non local connection network
CN110532925A (en) * 2019-08-22 2019-12-03 西安电子科技大学 Driver Fatigue Detection based on space-time diagram convolutional network
CN110796110A (en) * 2019-11-05 2020-02-14 西安电子科技大学 Human behavior identification method and system based on graph convolution network
CN111460928A (en) * 2020-03-17 2020-07-28 中国科学院计算技术研究所 Human body action recognition system and method
CN111601088A (en) * 2020-05-27 2020-08-28 大连成者科技有限公司 Sitting posture monitoring system based on monocular camera sitting posture identification technology
CN111612046A (en) * 2020-04-29 2020-09-01 杭州电子科技大学 Characteristic pyramid graph convolutional neural network and application thereof in 3D point cloud classification
CN111652124A (en) * 2020-06-02 2020-09-11 电子科技大学 Construction method of human behavior recognition model based on graph convolution network
CN111814719A (en) * 2020-07-17 2020-10-23 江南大学 Skeleton behavior identification method based on 3D space-time diagram convolution
CN111860267A (en) * 2020-07-13 2020-10-30 浙大城市学院 Multichannel body-building movement identification method based on human body bone joint point positions
CN111950406A (en) * 2020-07-28 2020-11-17 深圳职业技术学院 Finger vein identification method, device and storage medium
CN112232106A (en) * 2020-08-12 2021-01-15 北京工业大学 Two-dimensional to three-dimensional human body posture estimation method
CN112528811A (en) * 2020-12-02 2021-03-19 建信金融科技有限责任公司 Behavior recognition method and device


Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
LEI SHI et al.: "Non-Local Graph Convolutional Networks for Skeleton-Based Action Recognition", arXiv:1805.07694v2 *
LEI SHI et al.: "Skeleton-Based Action Recognition With Multi-Stream Adaptive Graph Convolutional Networks", IEEE Transactions on Image Processing *
LEI SHI et al.: "Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) *
SIJIE YAN et al.: "Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition", arXiv:1801.07455v2 *
XIAOLONG WANG et al.: "Non-local Neural Networks", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *
CAO Yi et al.: "Skeleton action recognition with spatio-temporal adaptive graph convolutional neural networks", Journal of Huazhong University of Science and Technology *
WANG Zhihua: "Research on human action recognition based on spatio-temporal graph convolutional neural networks", China Master's Theses Full-text Database, Information Science and Technology *
HUANG Chen: "Research on video human action recognition based on pose sequences", China Master's Theses Full-text Database, Information Science and Technology *

Similar Documents

Publication Publication Date Title
Zhang et al. Graph edge convolutional neural networks for skeleton-based action recognition
Liu et al. Trajectorycnn: a new spatio-temporal feature learning network for human motion prediction
Xia et al. Multi-scale mixed dense graph convolution network for skeleton-based action recognition
CN110472604B (en) Pedestrian and crowd behavior identification method based on video
Geng et al. Human action recognition based on convolutional neural networks with a convolutional auto-encoder
CN109558781A (en) A kind of multi-angle video recognition methods and device, equipment and storage medium
Bruce et al. Multimodal fusion via teacher-student network for indoor action recognition
CN112131908A (en) Action identification method and device based on double-flow network, storage medium and equipment
Fan et al. Context-aware cross-attention for skeleton-based human action recognition
CN108647571A (en) Video actions disaggregated model training method, device and video actions sorting technique
CN111401106A (en) Behavior identification method, device and equipment
Zhang et al. Graph convolutional LSTM model for skeleton-based action recognition
Jiang et al. Inception spatial temporal graph convolutional networks for skeleton-based action recognition
Chen et al. Hierarchical posture representation for robust action recognition
Wei et al. Dynamic hypergraph convolutional networks for skeleton-based action recognition
Bavil et al. Action Capsules: Human skeleton action recognition
Wu et al. Multimodal human action recognition based on spatio-temporal action representation recognition model
Xiaolong Simulation analysis of athletes’ motion recognition based on deep learning method and convolution algorithm
CN112801060A (en) Motion action recognition method and device, model, electronic equipment and storage medium
CN114782992A (en) Super-joint and multi-mode network and behavior identification method thereof
Raju Exercise detection and tracking using MediaPipe BlazePose and Spatial-Temporal Graph Convolutional Neural Network
CN112926517B (en) Artificial intelligence monitoring method
Zhong et al. Research on discriminative skeleton-based action recognition in spatiotemporal fusion and human-robot interaction
Sun et al. A Deep Learning Method for Intelligent Analysis of Sports Training Postures
Shi et al. Graph convolutional networks with objects for skeleton-based action recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210514