CN114863325B - Action recognition method, apparatus, device and computer readable storage medium - Google Patents

Action recognition method, apparatus, device and computer readable storage medium

Info

Publication number
CN114863325B
Authority
CN
China
Prior art keywords
feature
key point
features
bone
skeleton
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210411105.0A
Other languages
Chinese (zh)
Other versions
CN114863325A (en)
Inventor
段浩东
王靖博
陈恺
王佳琪
林达华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai AI Innovation Center
Original Assignee
Shanghai AI Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai AI Innovation Center filed Critical Shanghai AI Innovation Center
Priority to CN202210411105.0A priority Critical patent/CN114863325B/en
Publication of CN114863325A publication Critical patent/CN114863325A/en
Application granted granted Critical
Publication of CN114863325B publication Critical patent/CN114863325B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the present application discloses an action recognition method, apparatus, device and computer readable storage medium. The method comprises the following steps: acquiring a skeleton feature corresponding to a video to be recognized, wherein the skeleton feature represents a three-dimensional sequence over the number of key points, the key point feature dimension and the number of video frames; dividing the skeleton feature along the key point feature dimension to obtain a plurality of first key point feature groups; and performing spatio-temporal fusion according to the plurality of first key point feature groups, in combination with the topology of each first key point feature group determined from that feature group and the multi-scale features of each first key point feature group, to obtain a target skeleton feature in which spatial features and temporal features are fused. Dynamically fusing the features obtained by spatial modeling and temporal modeling improves the accuracy of the target skeleton feature. Action recognition is then performed according to the target skeleton feature to obtain the action category corresponding to the video to be recognized, which improves the accuracy of action recognition.

Description

Action recognition method, apparatus, device and computer readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an action recognition method, apparatus, device, and computer readable storage medium.
Background
With the research and progress of artificial intelligence, AI techniques have been rapidly developed and applied in various fields. Taking human skeleton action recognition as an example, understanding whole-body body language makes it possible to detect whether a target person has fallen and to support teaching scenarios such as fitness, sports and dance.
Skeleton action recognition algorithms in the prior art are based on graph convolution networks: in the spatial dimension they perform spatial modeling of key points according to a manually predefined key point topology, and in the temporal dimension they model the motion of each key point separately by means of one-dimensional convolution.
However, relying on a manually predefined key point topology introduces extra steps and limits the design space of the graph convolution network. In addition, modeling only the motion of a single key point in temporal modeling is limited. Both factors reduce the accuracy of action recognition.
Disclosure of Invention
Embodiments of the present application provide an action recognition method, apparatus, device and computer readable storage medium, which improve the accuracy of action recognition.
The technical scheme of the embodiment of the application is realized as follows:
In a first aspect, an embodiment of the present application provides an action recognition method, the method comprising: acquiring a skeleton feature corresponding to a video to be recognized, wherein the skeleton feature represents a three-dimensional sequence over the number of key points, the key point feature dimension and the number of video frames; dividing the skeleton feature along the key point feature dimension to obtain a plurality of first key point feature groups; performing spatio-temporal fusion according to the plurality of first key point feature groups, in combination with the topology of each first key point feature group determined from that feature group and the multi-scale features of each first key point feature group, to obtain a target skeleton feature in which spatial features and temporal features are fused; and performing action recognition according to the target skeleton feature to obtain the action category corresponding to the video to be recognized.
In a second aspect, an embodiment of the present application provides an action recognition apparatus, comprising: an acquisition module configured to acquire a skeleton feature corresponding to a video to be recognized, wherein the skeleton feature represents a three-dimensional sequence over the number of key points, the key point feature dimension and the number of video frames; a dividing module configured to divide the skeleton feature along the key point feature dimension to obtain a plurality of first key point feature groups; a modeling module configured to perform spatio-temporal fusion according to the plurality of first key point feature groups, in combination with the topology of each first key point feature group determined from that feature group and the multi-scale features of each first key point feature group, to obtain a target skeleton feature in which spatial features and temporal features are fused; and a recognition module configured to perform action recognition according to the target skeleton feature to obtain the action category corresponding to the video to be recognized.
In a third aspect, an embodiment of the present application provides an action recognition device, the device comprising: a memory for storing an executable computer program; and a processor for implementing the above action recognition method when executing the executable computer program stored in the memory.
In a fourth aspect, an embodiment of the present application provides a computer readable storage medium storing a computer program for implementing the above-mentioned action recognition method when executed by a processor.
Embodiments of the present application provide an action recognition method, apparatus, device and computer readable storage medium. According to the solution provided by the embodiments of the present application, a skeleton feature corresponding to a video to be recognized is acquired, the skeleton feature representing a three-dimensional sequence over the number of key points, the key point feature dimension and the number of video frames; the skeleton feature is divided along the key point feature dimension to obtain a plurality of first key point feature groups; and spatio-temporal fusion is performed according to the plurality of first key point feature groups, in combination with the topology of each first key point feature group determined from that feature group and the multi-scale features of each first key point feature group, to obtain a target skeleton feature in which spatial and temporal features are fused. In the spatial modeling of the key point features and their corresponding topologies, the topology of each first key point feature group is different and is learned from that feature group itself, so no prior knowledge is relied upon and the design space of the graph convolution network is expanded. Performing multi-scale temporal modeling on the key point features and then dynamically fusing the features obtained by spatial and temporal modeling improves the accuracy of the target skeleton feature. Action recognition is then performed according to the target skeleton feature to obtain the action category corresponding to the video to be recognized, which improves the accuracy of action recognition.
Drawings
FIG. 1 is a flowchart illustrating optional steps of a method for motion recognition according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an exemplary framework of a graph convolution network according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an exemplary data enhancement provided by an embodiment of the present application;
FIG. 4 is an exemplary schematic diagram of a data enhanced performance result provided by an embodiment of the present application;
FIG. 5 is a flowchart illustrating alternative steps of another method for motion recognition according to an embodiment of the present application;
FIG. 6 is an exemplary schematic diagram of a dynamic multi-group spatial modeling module provided in an embodiment of the present application;
FIG. 7 is an exemplary schematic diagram of a spatial modeling module provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of an exemplary dynamic multi-group timing modeling module according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an exemplary timing modeling module according to an embodiment of the present application;
FIG. 10 is an exemplary schematic diagram of a topology provided by an embodiment of the present application;
FIG. 11 is a flowchart illustrating an optional step of a method for identifying actions according to an embodiment of the present application;
FIG. 12A is a schematic diagram of comparing motion recognition results according to an embodiment of the present application;
FIG. 12B is a diagram illustrating another comparison of motion recognition results according to an embodiment of the present application;
FIG. 13 is a schematic diagram showing still another comparison of motion recognition results according to an embodiment of the present application;
FIG. 14 is a schematic diagram of still another comparison of motion recognition results according to an embodiment of the present application;
FIG. 15 is a schematic diagram of an alternative configuration of an action recognition device according to an embodiment of the present application;
Fig. 16 is a schematic diagram of a composition structure of an action recognition device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It should be understood that some embodiments described herein are merely for explaining the technical solution of the present application, and are not intended to limit the technical scope of the present application.
In order to better understand the action recognition method provided in the embodiment of the present application, before the technical solution of the embodiment of the present application is introduced, an application background is described.
The action recognition method provided by the embodiments of the present application addresses the skeleton action recognition problem: it performs action recognition according to a skeleton point sequence (a 3D or 2D skeleton point sequence) and predicts the category of the ongoing action. Taking human skeleton points as an example, the key points of the human skeleton are very important for describing human posture and predicting human behavior, so human action categories can be judged and recognized by analyzing human skeleton point sequence data with a skeleton action recognition technique. Skeleton action recognition takes human skeleton point sequence data as input and, compared with action recognition methods based on other modalities (for example, RGB images or optical flow), is lightweight and robust to illumination and background changes.
The action recognition method provided by the embodiments of the present application can be applied to various video recognition tasks that rely on key point sequences, including but not limited to human body key points, facial key points and hand key points. An embodiment of the present application provides an action recognition method, as shown in FIG. 1. FIG. 1 is a flowchart of the steps of the action recognition method provided by the embodiment of the present application, and the method comprises the following steps:
S101, acquiring a skeleton feature corresponding to a video to be recognized, wherein the skeleton feature represents a three-dimensional sequence over the number of key points, the key point feature dimension and the number of video frames.
In the embodiment of the present application, the video to be recognized is processed by a pose estimation algorithm to obtain the skeleton feature. The skeleton feature may comprise two modalities, for example the coordinate positions of the key points and the bone motion state (the bone motion parameters between a later video frame and an earlier video frame); or it may comprise four modalities, for example the key point coordinate positions, the bone motion state, the difference between the coordinate positions of two adjacent key points, and the coordinate difference between two adjacent bone motion states. The design of the skeleton feature may be varied; for example, a position code of a key point may be added to a key point related modality. The coordinate position of a key point is a 3D (three-dimensional) coordinate comprising x, y and z, or a 2D coordinate comprising x and y; the position code differs from the coordinate position in that it characterizes which key point of which video frame the key point is.
In the embodiment of the present application, the video to be recognized consists of a plurality of video frames, each video frame includes the coordinates of a plurality of key points, and the key point feature of each key point comprises a skeleton point sequence under 2D or 3D coordinates. The skeleton feature can be understood as skeleton point sequence data. Taking a human body as an example, the number of key points refers to the number of skeletal joints of the human body (for example, 18 joints may be labeled for one person); it carries the spatial information and can be denoted Joint V. The key point feature refers to the feature of a joint; generally one joint has a three-dimensional feature such as (x, y, acc), where x and y are the coordinate position of the joint and acc is the confidence, and for a 3D skeleton the key point feature is four-dimensional. It can be denoted Dim C and may comprise a plurality of feature channels, for example 64 dimensions. The number of video frames refers to the number of frames in a video (for example, a video with 150 frames), can be denoted Temporal Length T, and its dimension can be understood as the time (or temporal) dimension. The skeleton feature can thus be understood as a C×V×T tensor.
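To make the tensor layout concrete, the following is a minimal sketch (in PyTorch, which the patent does not prescribe) of a C×V×T skeleton feature; the concrete values C = 3, V = 18, T = 150 and the variable names are assumptions for illustration only.

```python
import torch

C, V, T = 3, 18, 150              # key point feature dim (x, y, acc), 18 joints, 150 frames
skeleton = torch.randn(C, V, T)   # hypothetical skeleton feature of one person in one video

x, y, acc = skeleton[:, 0, 0]     # (x, y, confidence) of joint 0 in frame 0
motion = skeleton[:, :, 1:] - skeleton[:, :, :-1]  # frame-to-frame difference, one extra modality
print(skeleton.shape, motion.shape)  # torch.Size([3, 18, 150]) torch.Size([3, 18, 149])
```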
S102, dividing the bone features according to the key point feature dimensions in the bone features to obtain a plurality of first key point feature groups.
S103, according to the plurality of first key point feature groups, combining the topological structure of each first key point feature group determined according to each first key point feature group and the multi-scale features of each first key point feature group, and performing space-time fusion to obtain the target skeleton feature after the spatial feature and the time sequence feature are fused.
In the embodiment of the present application, a spatial temporal graph convolutional network (Spatial Temporal Graph Convolutional Networks, ST-GCN) can be adopted to solve the problem of human skeleton key point action recognition: pose estimation is performed on the video to obtain skeleton point sequence data, a spatio-temporal graph is constructed on the skeleton point sequence data, multiple layers of spatio-temporal graph convolution (ST-GCN) are applied to it to gradually generate higher-level feature maps, and a Softmax classifier then assigns the result to the corresponding action category, thereby recognizing the action category of the human body. Building on ST-GCN, the embodiment of the present application provides a dynamic group-wise spatial-temporal graph convolutional network (Dynamic Group-wise Spatial Temporal Graph Convolutional Networks, DG-STGCN). DG-STGCN comprises a key point spatial modeling module and a temporal modeling module, both based on a dynamic multi-group design. The dynamic multi-group spatial modeling module divides the key point features into K groups to obtain K first key point feature groups and performs a different spatial feature fusion on each first key point feature group; the coefficient matrix (i.e. the topology) used for the spatial feature fusion is learned entirely end-to-end from the skeleton point sequence data. The K fused groups of features are concatenated and sent to the temporal modeling module. The temporal modeling module comprises a dynamic joint-skeleton feature fusion module, which performs multi-scale modeling on the skeleton feature together with the global feature of the key points in the skeleton feature to obtain a combined skeleton feature. The combined skeleton feature is divided into M groups to obtain M second key point feature groups; a multi-branch temporal convolution network applies temporal modeling with dynamic receptive fields to the M second key point feature groups, performing a different temporal feature fusion on each second key point feature group, thereby realizing dynamic fusion of multi-scale features. The M fused groups are concatenated and then fused with the skeleton feature to obtain the target skeleton feature.
The related art relies on a manually defined key point topology, which not only introduces extra steps but also limits the design space of the graph convolution network. The embodiment of the present application provides a graph convolution network (the key point spatial modeling module) with a dynamic multi-group design, which models the key points with a topology learned entirely from the skeleton point sequence data without relying on any prior knowledge, thereby widening the design space of the graph convolution network. The related art is also limited in that its temporal modeling part only models the motion of a single key point. The embodiment of the present application performs multi-scale modeling on the key points and on the key points as a whole at the same time, and realizes temporal feature fusion by grouping the features and applying temporal convolutions with different receptive fields. The features learned by the two modeling parts (spatial modeling and temporal modeling) are then dynamically fused, thereby improving the accuracy of the target skeleton feature.
It should be noted that, in the embodiment of the present application, the first and second are merely to distinguish names, and do not represent a sequential relationship, and are not to be construed as indicating or implying relative importance or implying that the number of technical features indicated is indicated, for example, the first key feature set and the second key feature set each represent a key feature set, and the first and second are merely to distinguish key feature sets obtained by different division manners.
And S104, performing action recognition according to the target skeleton characteristics to obtain action categories corresponding to the videos to be recognized.
In the embodiment of the present application, the skeleton feature is a C×V×T tensor, and the target skeleton feature, obtained by fusing the spatial and temporal features, is also a C×V×T tensor. The target skeleton feature is input into a preset action recognition model, which outputs a confidence for each action category; the action category is determined from the ranking of the confidences, for example the action category with the highest confidence is taken as the action category corresponding to the video to be recognized.
In embodiments of the present application, the motion recognition model may be understood as a machine learning model, which may be any suitable neural network (Neural Networks, NN) model that can be used to perform motion recognition on the target bone features. The embodiment of the application does not limit the specific structure of the action recognition model, and comprises but is not limited to: convolutional neural networks (Convolutional Neural Network, CNN), feed forward neural networks (Feedforward neural network, FNN), and the like.
According to the solution provided by the embodiments of the present application, a skeleton feature corresponding to the video to be recognized is acquired, the skeleton feature representing a three-dimensional sequence over the number of key points, the key point feature dimension and the number of video frames; the skeleton feature is divided along the key point feature dimension to obtain a plurality of first key point feature groups; and spatio-temporal fusion is performed according to the plurality of first key point feature groups, in combination with the topology of each first key point feature group determined from that feature group and the multi-scale features of each first key point feature group, to obtain a target skeleton feature in which spatial and temporal features are fused. In the spatial modeling of the key point features and their corresponding topologies, the topology of each first key point feature group is different and is learned from that feature group itself, so no prior knowledge is relied upon and the design space of the graph convolution network is expanded. Performing multi-scale temporal modeling on the key point features and then dynamically fusing the features obtained by spatial and temporal modeling improves the accuracy of the target skeleton feature. Action recognition is then performed according to the target skeleton feature to obtain the action category corresponding to the video to be recognized, which improves the accuracy of action recognition.
In some embodiments, S104 in fig. 1 described above may be implemented in the following manner. Pooling the characteristics of the video frame number dimension and the key point number dimension in the target skeleton characteristics to obtain one-dimensional skeleton characteristics; and performing motion recognition according to the one-dimensional skeleton characteristics to obtain motion categories corresponding to the videos to be recognized.
In the embodiment of the present application, taking the target skeleton feature as a C×V×T tensor as an example, the preset action recognition model may include a pooling layer and a linear layer. The pooling layer pools the video-frame-number dimension (T) and the key-point-number dimension (V) of the target skeleton feature, reducing each of them to one dimension, so that the C×V×T tensor is pooled into a C-dimensional feature. The linear layer performs action recognition on the C-dimensional feature and outputs the action category.
The overall network architecture of DG-STGCN for skeleton action recognition according to the embodiment of the present application follows the design of ST-GCN and is illustrated in FIG. 2, which is an exemplary framework schematic diagram of a graph convolution network according to an embodiment of the present application. The architecture (Architectures) of the whole network is a stack of N graph convolution network units (GCN Block × N in FIG. 2), each comprising a spatial modeling module (Spatial Module) and a temporal modeling module (Temporal Module). FIG. 2 is illustrated with the skeleton feature as a C×V×T tensor: # Joint V denotes the number of key points, # Dim C denotes the key point feature dimension, and Temporal Length T denotes the number of video frames. Spatio-temporal fusion of the skeleton feature is carried out by each graph convolution network unit, and the skeleton feature runs through the whole spatio-temporal fusion process. In FIG. 2 the skeleton feature is input into the spatial modeling module, which extracts the spatial features and outputs a first skeleton feature fused with the spatial features; the first skeleton feature is input into the temporal modeling module, which extracts the temporal features and outputs a third skeleton feature fused with both the spatial and the temporal features. The fusion operator in FIG. 2 indicates that the input skeleton feature is fused again with the third skeleton feature to obtain the target skeleton feature; this residual connection preserves the original information of the input skeleton feature and thereby improves the accuracy of the target skeleton feature. N indicates that the process of extracting spatial features, extracting temporal features and fusing with the skeleton feature can be repeated N times, and the repeated extraction and fusion further improve the accuracy of the target skeleton feature. Then, the video-frame-number and key-point-number dimensions of the target skeleton feature are pooled (the T pooling layer in FIG. 2) to obtain a one-dimensional skeleton feature, action recognition is performed on the one-dimensional skeleton feature by the classifier linear layer in FIG. 2, and the prediction, i.e. the action category corresponding to the video to be recognized, is output. The classifier linear layer (Linear) includes, but is not limited to, a fully connected layer (FC), a dense layer (Dense), a multi-layer perceptron (Multilayer Perceptron, MLP) and the like.
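As a reading aid, the following is a structural sketch of the stacked network in FIG. 2 under stated assumptions: the spatial and temporal modules are stand-in placeholders (the real DG-GCN and DG-TCN are detailed later), and the channel count, block count and class count are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class GCNBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Placeholders standing in for DG-GCN and DG-TCN; the real modules are described later.
        self.spatial = nn.Conv2d(channels, channels, kernel_size=1)
        self.temporal = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))

    def forward(self, x):                  # x: (batch, C, V, T)
        out = self.spatial(x)              # first skeleton feature (spatial features fused)
        out = self.temporal(out)           # third skeleton feature (temporal features fused)
        return out + x                     # residual fusion keeps the original information

class ActionRecognizer(nn.Module):
    def __init__(self, channels=64, num_classes=120, num_blocks=4):
        super().__init__()
        self.blocks = nn.Sequential(*[GCNBlock(channels) for _ in range(num_blocks)])
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, x):                  # x: (batch, C, V, T)
        x = self.blocks(x)                 # target skeleton feature, still (batch, C, V, T)
        x = x.mean(dim=(2, 3))             # pool the V and T dimensions -> (batch, C)
        return self.classifier(x)          # confidence for each action category

logits = ActionRecognizer()(torch.randn(2, 64, 18, 100))
print(logits.shape)  # torch.Size([2, 120])
```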
It should be noted that when a graph convolution network unit performs feature extraction, the temporal features may also be extracted first and the spatial features second; FIG. 2 merely illustrates extracting the spatial features first and the temporal features second, and this is not a limitation of the embodiment of the present application.
In the embodiment of the present application, since the C×V×T tensor has too many dimensions, the action recognition result is easily disturbed if a linear layer operates on it directly, and using multiple stacked linear layers affects processing efficiency. Performing the pooling operation and the fully connected operation on the target skeleton feature realizes action recognition on the target skeleton feature and improves the accuracy of the action recognition result.
In some embodiments, S101 in fig. 1 described above may be implemented in the following manner. Acquiring a video to be identified; performing key point estimation on the video to be identified according to a preset gesture estimation model to obtain original skeleton characteristics; the preset gesture estimation model is used for estimating a skeleton point sequence of the video to be identified; dividing the corresponding original time sequence in the original skeleton characteristics into a plurality of subsequences; the difference value of the time sequence lengths of two adjacent subsequences is within a preset range, and the number of the plurality of subsequences is the preset length; one sub-sequence includes features corresponding to a plurality of video frames; in the original skeleton characteristics, sampling characteristics corresponding to a plurality of video frames in each sub-sequence to obtain time sequence enhancement characteristics corresponding to each sub-sequence; and connecting the time sequence enhancement features corresponding to the multiple subsequences to obtain skeleton features.
In the embodiment of the present application, key point estimation is performed on the video to be recognized according to a preset pose estimation model to obtain the original skeleton feature. The duration of each video to be recognized is different, i.e. the number of video frames differs, whereas the spatial and temporal feature extraction during recognition operates on a fixed number of video frames (such as T), i.e. the skeleton feature is a C×V×T tensor. Therefore the video-frame-number dimension of the original skeleton feature also needs to be adjusted, for example by increasing the number of video frames or pruning redundant video frames.
For example, as shown in FIG. 3, which is an exemplary schematic diagram of data enhancement provided by an embodiment of the present application, FIG. 3 shows an original skeleton feature, which may also be referred to as original skeleton point sequence data (original skeleton sequence). Three examples of obtaining the skeleton feature are described below. In the first example, a piece of sequence data is padded before or after the original skeleton point sequence data (cyclic padding) to form the skeleton feature; FIG. 3 shows padding after the sequence. In the second example, a piece of sequence data is randomly sampled from the original skeleton point sequence data, or a piece of sequence data is randomly inserted into it (Random Loop + Interpolate), as the skeleton feature, as shown in FIG. 3. In the third example, temporal data enhancement is realized with uniformly sampled sub-sequences. If a sub-sequence of temporal length T needs to be sampled from an original temporal sequence of length Q (which can be understood as the number of video frames), the original sequence is uniformly divided into T sub-sequences of similar length (each sub-sequence comprising several video frames), one sample (i.e. one video frame) is randomly selected from each sub-sequence, and the temporal enhancement features corresponding to the sampled sub-sequences are connected to obtain a sequence of length T, i.e. the skeleton feature. For example, with Q = 300 and T = 100, the original temporal sequence of length 300 is divided into 100 sub-sequences of 3 video frames each; the temporal feature of one video frame is sampled from each sub-sequence as the temporal enhancement feature of that sub-sequence, and the 100 sampled temporal enhancement features are connected to obtain the skeleton feature.
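A small sketch of the uniform sub-sequence sampling described above (assuming Q ≥ T and a (Q, V, C) array layout; the function and variable names are illustrative, not from the patent):

```python
import numpy as np

def uniform_sample(skeleton_seq: np.ndarray, T: int) -> np.ndarray:
    """skeleton_seq: (Q, V, C) original skeleton point sequence; returns a (T, V, C) sample."""
    Q = skeleton_seq.shape[0]                       # assumes Q >= T so each sub-sequence is non-empty
    bounds = np.linspace(0, Q, T + 1).astype(int)   # boundaries of T sub-sequences of similar length
    picked = [np.random.randint(lo, hi) for lo, hi in zip(bounds[:-1], bounds[1:])]
    return skeleton_seq[picked]                     # one randomly chosen frame per sub-sequence

sampled = uniform_sample(np.random.randn(300, 18, 3), T=100)
print(sampled.shape)  # (100, 18, 3)
```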
In the embodiment of the present application, uniformly sampling sub-sequences as temporal action enhancement improves the accuracy of the skeleton feature and therefore the accuracy of the action recognition result when action recognition is performed on that skeleton feature. The uniformly sampled sub-sequence enhancement provided by the embodiment of the present application has strong universality and brings a notable improvement across multiple models and multiple evaluation metrics. FIG. 4 is a schematic illustration of exemplary data-enhancement performance results provided by an embodiment of the present application.
In FIG. 4, Aug denotes the different models, including ST-GCN, AGCN, MS-G3D, CTR-GCN and DG-STGCN; NTU120-XSub and NTU120-XSet denote different datasets; None means the original skeleton feature is not processed; Random Loop means a piece of sequence data is randomly inserted; Uni-Sample is the uniform sub-sequence sampling enhancement proposed in the embodiment of the present application; Δ denotes the difference in the evaluation metric between uniform sub-sequence sampling and no data processing; and the numbers in FIG. 4 are accuracies (%). For the NTU120-XSub dataset in FIG. 4, under the None, Random Loop and Uni-Sample processing modes, the Δ values corresponding to ST-GCN, AGCN, MS-G3D, CTR-GCN and DG-STGCN are 1.1, 1.2, 0.9, 1.4 and 1.8 respectively. It can be seen that the data enhancement provided by the embodiment of the present application is universal, achieving a consistent performance gain across multiple models and evaluation benchmarks, and the gain is most obvious for the DG-STGCN provided by the embodiment of the present application.
In some embodiments, S102 in FIG. 1 described above may also include S201-S206. As shown in fig. 5, fig. 5 is a flowchart illustrating optional steps of another method for identifying actions according to an embodiment of the present application.
S201, determining the topological structure of each first key point feature group according to each first key point feature group.
S202, carrying out spatial feature fusion according to each first key point feature group and the topological structure of each first key point feature group to obtain first skeleton features.
Exemplarily, as shown in FIG. 6, which is an exemplary schematic diagram of the dynamic multi-group spatial modeling module provided by an embodiment of the present application, FIG. 6 shows the architecture (Architecture) of DG-GCN, where DG-GCN denotes the dynamic multi-group spatial modeling module, DG-GCN Input denotes the input skeleton feature, a C×V×T tensor, and DG-GCN Output denotes the output skeleton feature after fusing the spatial features, i.e. the first skeleton feature, also a C×V×T tensor. When extracting spatial features, DG-GCN divides the input skeleton feature into K groups to obtain K first key point feature groups (K Groups), each first key point feature group being a ⌊C/K⌋×V×T tensor. Each first key point feature group is fused with its corresponding topology to obtain a fourth key point feature group, and the K fourth key point feature groups are then concatenated and fused to obtain the first skeleton feature. In FIG. 6, a 1×1 convolution layer (1×1 Conv) is used to group the key point feature dimension into K first key point feature groups; since the skeleton feature is not necessarily evenly divisible into K groups, the ⌊·⌋ symbol in FIG. 6 denotes rounding down. Each first key point feature group uses a different coefficient matrix (i.e. topology) for spatial feature fusion, and the coefficient matrices are learned entirely from the data without relying on any prior, i.e. the topology of each first key point feature group is determined from that feature group; A1, A2, …, AK in FIG. 6 denote the coefficient matrices corresponding to the K first key point feature groups. The multiplication operator in FIG. 6 denotes the spatial feature fusion of a first key point feature group with its corresponding coefficient matrix, from which the first skeleton feature is determined.
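The grouped spatial fusion of FIG. 6 can be sketched as follows. For simplicity the per-group coefficient matrices A_k are shown here as static learned parameters, whereas in the embodiment they are computed dynamically from the data as described later (FIG. 10); the channel and joint counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GroupedSpatialFusion(nn.Module):
    def __init__(self, channels=64, num_joints=18, K=4):
        super().__init__()
        self.K = K
        self.pre = nn.Conv2d(channels, channels, kernel_size=1)        # 1x1 Conv before grouping
        self.A = nn.Parameter(torch.randn(K, num_joints, num_joints))  # one coefficient matrix per group
        self.post = nn.Conv2d(channels, channels, kernel_size=1)       # Concat + 1x1 Conv

    def forward(self, x):                           # x: (batch, C, V, T)
        groups = self.pre(x).chunk(self.K, dim=1)   # K groups, each (batch, C//K, V, T)
        fused = [torch.einsum('bcvt,vw->bcwt', g, A_k)  # spatial fusion with the group's topology
                 for g, A_k in zip(groups, self.A)]
        return self.post(torch.cat(fused, dim=1))   # first skeleton feature (batch, C, V, T)

out = GroupedSpatialFusion()(torch.randn(2, 64, 18, 100))
print(out.shape)  # torch.Size([2, 64, 18, 100])
```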
In the embodiment of the present application, to facilitate understanding of the spatial modeling manner provided by the embodiment of the present application, a comparative example is also provided, as shown in FIG. 7, which is an exemplary schematic diagram of a spatial modeling module. For each video frame (For T_i ∈ {1 … T}), the feature corresponding to the video frame is a C×V tensor and the coefficient matrix (Coeff Mat A) in FIG. 7 is a V×V matrix; the coefficient matrix is the same for every video frame i, and the feature of each video frame is fused with this same coefficient matrix to obtain the spatially fused feature of that frame. The multiplication operator in FIG. 7 denotes the spatial feature fusion of the feature of video frame i (Feature t_i) with the coefficient matrix (Coeff Mat A), giving the spatially fused feature of video frame i. With T video frames, the T spatially fused features are stitched together to obtain a C×V×T tensor (not shown in FIG. 7), corresponding to the first skeleton feature in FIG. 6.
In the embodiment of the present application, as can be seen from FIG. 6 and FIG. 7, the coefficient matrix in the related art is a manually predefined key point topology that is the same for every video frame i. In the dynamic multi-group spatial modeling module provided by the embodiment of the present application, the coefficient matrices corresponding to the plurality of first key point feature groups are different, and they are learned from the skeleton point sequence data without relying on any prior knowledge, which widens the design space of the graph convolution network.
And S203, averaging the key point features in the first bone features according to the key point feature dimensions in the first bone features to obtain first global features of the key points in the first bone features.
S204, performing multi-scale division on the first skeleton feature and the first global feature according to the key point feature dimension in the first skeleton feature to obtain a plurality of second key point feature groups;
S205, performing multi-branch time sequence convolution processing on the plurality of second key point feature groups to obtain a plurality of third key point feature groups.
S206, determining target bone characteristics according to the third key point characteristic groups and the bone characteristics.
Exemplarily, as shown in FIG. 8, which is an exemplary schematic diagram of the dynamic multi-group temporal modeling module provided by an embodiment of the present application, DG-TCN denotes the dynamic multi-group temporal modeling module, DG-TCN Input denotes the input, namely the first skeleton feature together with the first global feature of the key points in the first skeleton feature, a C×(V+1)×T tensor; grouping this C×(V+1)×T tensor realizes the multi-scale modeling. DG-TCN Output denotes the output skeleton feature after fusing the temporal features, i.e. the target skeleton feature. When extracting temporal features, DG-TCN divides the input first skeleton feature and first global feature into M groups to obtain M second key point feature groups, each being a ⌊C/M⌋×(V+1)×T tensor. The M second key point feature groups are input into their corresponding branch networks (M branch networks) and undergo temporal convolution with different receptive fields to obtain M third key point feature groups; the M third key point feature groups are concatenated (Concat) to obtain a target combined skeleton feature (not shown in FIG. 8), which is then separated (Split) again into a second skeleton feature and a target global feature; the second skeleton feature and the target global feature are fused by weighting (D-JSF), and a third skeleton feature is output. In FIG. 8, a 1×1 convolution layer (1×1 Conv) groups the key point feature dimension into M second key point feature groups; since the first skeleton feature and the first global feature are not necessarily evenly divisible into M groups, the ⌊·⌋ symbol in FIG. 8 denotes rounding down. The temporal convolution with different receptive fields applied to the M second key point feature groups is shown in FIG. 8 with six different receptive fields, including convolution of features at adjacent times with intervals of 1 to 4 (D = 1–4) using 3×1 convolution layers (3×1 Conv), max pooling of features at adjacent times using a 3×1 max pooling layer (3×1 Maxpool), and convolution of the second key point feature group with a 1×1 convolution layer (1×1 Conv), giving the M third key point feature groups. The M third key point feature groups are concatenated (Concat), separated (Split) and fused by weighting (D-JSF) to obtain the third skeleton feature. The third skeleton feature is then fused with the skeleton feature (preserving the original information) to obtain the target skeleton feature (this step is not shown in FIG. 8; see FIG. 2).
It should be noted that the M second key point feature groups are obtained by dividing a C×(V+1)×T tensor, which realizes the multi-scale modeling; the M second key point feature groups are passed through M branch networks that apply temporal convolution with different receptive fields, giving M third key point feature groups, which can therefore be understood as multi-scale features.
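A rough sketch of the DG-TCN flow in FIG. 8 under simplifying assumptions: one dilated-convolution branch per group (the max-pooling and 1×1 branches are omitted), and the D-JSF weighted fusion is reduced to a plain addition; shapes and names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class GroupedTemporalFusion(nn.Module):
    def __init__(self, channels=64, M=4):
        super().__init__()
        self.M = M
        self.pre = nn.Conv2d(channels, channels, kernel_size=1)
        # One branch per group with a different temporal receptive field (dilation 1..M).
        self.branches = nn.ModuleList([
            nn.Conv2d(channels // M, channels // M, kernel_size=(1, 3),
                      padding=(0, d), dilation=(1, d))
            for d in range(1, M + 1)
        ])
        self.post = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                               # x: (batch, C, V, T) first skeleton feature
        g = x.mean(dim=2, keepdim=True)                 # first global feature (batch, C, 1, T)
        y = self.pre(torch.cat([x, g], dim=2))          # combined feature (batch, C, V+1, T)
        groups = y.chunk(self.M, dim=1)                 # M second key point feature groups
        y = torch.cat([b(gr) for b, gr in zip(self.branches, groups)], dim=1)
        joints, glob = y[:, :, :-1], y[:, :, -1:]       # split back into joint / global parts
        return self.post(joints + glob)                 # simplified stand-in for the D-JSF fusion

out = GroupedTemporalFusion()(torch.randn(2, 64, 18, 100))
print(out.shape)  # torch.Size([2, 64, 18, 100])
```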
In the embodiment of the present application, to facilitate understanding of the temporal modeling manner provided by the embodiment of the present application, a comparative example is also provided, as shown in FIG. 9, which is an exemplary schematic diagram of a temporal modeling module. For each key point (For V_i ∈ {1 … V}), the feature corresponding to the key point is a C×T tensor, with t running from t_1 to t_T, i.e. the motion of the key point over time T. In FIG. 9, a one-dimensional convolution (1D Conv) is applied to the feature of each key point i (Feature v_i); the one-dimensional convolution operates along the time dimension and yields the temporally fused feature of each key point. With V key points, the V temporally fused features are stitched together to obtain a C×V×T tensor (not shown in FIG. 9), corresponding to the target skeleton feature in FIG. 8.
In the embodiment of the present application, as can be seen from FIG. 8 and FIG. 9, the related art is limited in that it only models the motion of a single key point. The dynamic multi-group temporal modeling module provided by the embodiment of the present application models the key points and the global key point feature at multiple scales simultaneously, and each second key point feature group is spliced together again after temporal convolution with a different receptive field, which improves the accuracy of the target skeleton feature.
In some embodiments, in S201 of fig. 5, for each first keypoint feature group, when determining the topology of the first keypoint feature group, S2011-S2014 may be further included.
S2011, pooling the number dimension of the video frames in the first key point feature group to obtain a two-dimensional feature matrix, wherein the two-dimensional feature matrix characterizes the number of key points and the two-dimensional sequence of key point features in the first key point feature group.
S2012, carrying out convolution processing of different parameters on the two-dimensional feature matrix to obtain a first parameter matrix and a second parameter matrix.
S2013, respectively determining a dynamic topological structure corresponding to the first key point feature group and a dynamic topological structure of a single feature channel corresponding to the key point in the first key point feature group according to the first parameter matrix and the second parameter matrix.
Exemplarily, as shown in FIG. 10, which is an exemplary schematic diagram of a topology according to an embodiment of the present application, the process of computing the topology is the same for every first key point feature group X, where X is a ⌊C/K⌋×V×T tensor; for convenience of description and simplicity of illustration, X is drawn as a T×V×C tensor in FIG. 10. When computing the topology A of a first key point feature group X, the video-frame-number dimension of X is pooled to obtain a two-dimensional feature matrix; the two-dimensional feature matrix is convolved with different parameters to obtain a first parameter matrix X_a and a second parameter matrix X_b; and the dynamic topology DA corresponding to X and the dynamic topology CA of the single feature channels corresponding to the key points in X are determined from X_a and X_b respectively. Pooling the video-frame-number dimension of the first key point feature group reduces it from T dimensions to one dimension, for example by taking the average over the time dimension (video-frame-number dimension), as done by the T pooling layer (T-pooling) in FIG. 10; the resulting two-dimensional feature matrix is a V×C matrix. The two-dimensional feature matrix is then convolved with different parameters: in FIG. 10, 1×1 convolution layers (1×1 Conv) with two different convolution kernels are applied to it, yielding the first parameter matrix X_a and the second parameter matrix X_b, both V×C matrices. Then, according to X_a and X_b, the dynamic topology corresponding to the first key point feature group X and the dynamic topology of the single feature channels corresponding to the key points in X are determined respectively.
S2014, performing a weighted summation of a preset shared key point topology, the dynamic topology corresponding to the first key point feature group together with its preset coefficient, and the dynamic topology of the single feature channels corresponding to the key points in the first key point feature group together with its preset coefficient, to obtain the topology of the first key point feature group.
In the embodiment of the present application, the coefficient matrix A used for each first key point feature group in FIG. 6 comprises three components: a preset shared key point topology (PA), a dynamic topology (DA) corresponding to the first key point feature group, and a dynamic topology (CA) of the single feature channels corresponding to the key points in the first key point feature group.
Exemplarily, PA denotes a set of network parameters shared across the whole dataset (which can be understood as the skeleton features) and represents a common key point topology; the parameters in PA are obtained by gradient optimization during DG-STGCN training, and PA is the same for every first key point feature group. PA_i in FIG. 10 denotes the preset shared key point topology of the i-th first key point feature group and is a V×V matrix. DA denotes the dynamic topology specific to a single sample (which can be understood as the first key point feature group); it is predicted by the network from the features of each sample, i.e. each sample has a different DA. CA denotes the dynamic topology specific to a single feature channel (which can be understood as a feature channel of the first key point feature group); it is predicted by the network from the features of each sample, i.e. each feature channel of each sample has a different CA. DG-STGCN then performs a weighted summation of these three components (PA, DA, CA) with a set of learnable coefficients (the preset coefficient α for DA and the preset coefficient β for CA in FIG. 10), and uses the result as the coefficient matrix for dynamic spatial feature fusion; the preset coefficients α and β are obtained after DG-STGCN training is completed. A_i in FIG. 10 denotes the topology of the i-th first key point feature group, and FIG. 10 shows the three components of A_i.
It should be noted that the dynamic coefficient matrix component may be designed in different manners, and the coefficient matrix a includes three components PA, DA and CA as an example only, which is not intended to limit the embodiments of the present application.
In some embodiments, S2013 described above may also be implemented in the following manner. The first parameter matrix is cross-multiplied with the transposed second parameter matrix to obtain the dynamic topology corresponding to the first key point feature group. For each key point feature in the two-dimensional feature matrix, the features along the key-point-number dimension of the first parameter matrix are subtracted pairwise from the features along the key-point-number dimension of the second parameter matrix to obtain a two-dimensional matrix for each key point feature; the two-dimensional matrix of each key point feature is normalized by an activation function to obtain the dynamic topology of the single feature channels corresponding to the key points in the first key point feature group.
For example, with reference to FIG. 10, DA and CA are described as follows. X_a and X_b are V×C matrices; X_b is transposed to obtain the C×V matrix X_bᵀ (i.e. the transposed second parameter matrix), and X_a is cross-multiplied with X_bᵀ (the Dot product + Normalize step in FIG. 10) to obtain a V×V matrix, namely DA; DA_i in FIG. 10 denotes the dynamic topology corresponding to the i-th first key point feature group. For each key point feature (feature channel), the features along the key-point-number dimension of X_a and X_b are subtracted pairwise (Pair-wise subtraction in FIG. 10) to obtain a two-dimensional V×V matrix for that feature channel. The two-dimensional matrix of each feature channel is normalized with an activation function (Tanh in FIG. 10), e.g. normalized to [-1, 1]. The two-dimensional matrices of the feature channels are then stacked to obtain a C×V×V tensor; CA_i in FIG. 10 denotes the dynamic topology of the single feature channels corresponding to the key points in the i-th first key point feature group.
In the embodiment of the application, the dynamic topological structure of each key point characteristic group and the dynamic topological structure of each characteristic channel are introduced on the basis of the predefined key point topological structure, and compared with the key point topological structure manually defined in the related art, the flexibility of the topological structure is improved.
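The computation of one group's coefficient matrix A = PA + α·DA + β·CA described above can be sketched as follows (the normalization of DA is omitted, and all sizes and names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class GroupTopology(nn.Module):
    def __init__(self, channels=16, num_joints=18):
        super().__init__()
        self.PA = nn.Parameter(torch.randn(num_joints, num_joints))  # shared key point topology
        self.proj_a = nn.Conv1d(channels, channels, kernel_size=1)   # 1x1 Conv producing X_a
        self.proj_b = nn.Conv1d(channels, channels, kernel_size=1)   # 1x1 Conv producing X_b
        self.alpha = nn.Parameter(torch.tensor(1.0))                 # learnable coefficient for DA
        self.beta = nn.Parameter(torch.tensor(1.0))                  # learnable coefficient for CA

    def forward(self, x):                    # x: (batch, C, V, T), one first key point feature group
        pooled = x.mean(dim=3)               # T-pooling -> (batch, C, V)
        xa = self.proj_a(pooled)             # (batch, C, V)
        xb = self.proj_b(pooled)             # (batch, C, V)
        DA = torch.einsum('bcv,bcw->bvw', xa, xb)           # per-sample topology (batch, V, V)
        CA = torch.tanh(xa.unsqueeze(3) - xb.unsqueeze(2))  # per-channel topology (batch, C, V, V)
        return self.PA + self.alpha * DA.unsqueeze(1) + self.beta * CA  # (batch, C, V, V)

A = GroupTopology()(torch.randn(2, 16, 18, 100))
print(A.shape)  # torch.Size([2, 16, 18, 18])
```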
In some embodiments, S202 described above may be implemented in the following manner. Performing cross multiplication on each first key point feature group and the topological structure of each first key point feature group to obtain a plurality of fourth key point feature groups; and splicing and fusing the fourth key point feature groups to obtain the first skeleton feature.
Exemplarily, as shown in FIG. 6, the multiplication operator in FIG. 6 denotes that each first key point feature group is fused with its corresponding coefficient matrix, giving K fourth key point feature groups, each a ⌊C/K⌋×V×T tensor. The K fourth key point feature groups are then concatenated and fused, the reverse of dividing the skeleton feature into the K first key point feature groups; in FIG. 6 the concatenation followed by a 1×1 convolution layer (Concat + 1×1 Conv) fuses the K fourth key point feature groups into the first skeleton feature.
In some embodiments, S204 in fig. 5 described above may be implemented in the following manner. Combining the first bone feature and the first global feature to obtain a combined bone feature; dividing the combined bone features according to the key point feature dimensions in the first bone features to obtain a plurality of second key point feature groups.
Illustratively, as shown in FIG. 8, the first global feature of the key points in the first bone feature is a c×1×t vector. Reducing the key point number dimension from V to one may be achieved by pooling over the key point number dimension in the first bone feature, for example by taking the average over that dimension. The first bone feature (a c×v×t vector) and the first global feature of the key points (a c×1×t vector) are then combined to obtain a combined bone feature, which is a c×(v+1)×t vector. The DG-TCN divides the combined skeleton feature into M groups according to the key point feature dimension to obtain M second key point feature groups.
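The pooling, combination and grouping can be sketched as follows, again assuming a batched (N, C, T, V) layout; the function name is a placeholder, and the choice of average pooling over the key point dimension follows the description above.

```python
import torch

def combine_and_group(x: torch.Tensor, m: int = 6):
    """x: (N, C, T, V) first bone feature.
    Returns M second key point feature groups of shape (N, C//M, T, V+1)."""
    g = x.mean(dim=-1, keepdim=True)       # (N, C, T, 1): first global feature (average over V)
    combined = torch.cat([x, g], dim=-1)   # (N, C, T, V+1): combined bone feature
    return torch.chunk(combined, m, dim=1) # split along the key point feature (channel) dim

groups = combine_and_group(torch.randn(2, 48, 100, 17), m=6)
print(len(groups), groups[0].shape)        # 6 torch.Size([2, 8, 100, 18])
```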
In some embodiments, S205 in FIG. 5 described above may include S2051-S2054.
S2051, convolving a first sub-key point feature group in the plurality of second key point feature groups according to different interval time to obtain a first feature group, wherein the first sub-key point feature group comprises at least two second key point feature groups.
In the embodiment of the present application, at least two second key point feature groups are selected from the M second key point feature groups. Because different key point feature groups need to be processed in different manners, for convenience of explanation the selected at least two second key point feature groups are referred to as a first sub-key point feature group. A convolution process is performed on each of these second key point feature groups. Illustratively, as shown in FIG. 8, each second key point feature group includes, in the video frame number dimension, the features of T video frames, so the features at certain times can be selected and convolved to obtain the first feature group.
It should be noted that the interval in the embodiment of the present application may be set by those skilled in the art according to the actual situation, so long as time sequence convolution processing can be performed on features of different receptive fields; the embodiment of the present application is not limited in this regard.
In some embodiments, the foregoing S2051 may further be implemented in the following manner for each second keypoint feature group of the first sub-keypoint feature group when the first feature group is obtained. Carrying out convolution processing on the features corresponding to the preset interval time in the second key point feature groups to obtain feature groups corresponding to the second key point feature groups; taking the feature group corresponding to each second key point feature group as a first feature group; the preset interval time comprises a current time, a first time before the current time and a second time after the current time, and the interval time between the current time and the first time is the same as the interval time between the second time and the current time.
As shown in FIG. 8, D=1 indicates that the interval is 1; taking the current time t as an example, one time before the current time is t-1 and one time after the current time is t+1, so the preset interval time includes time t-1, time t and time t+1. D=4 indicates that the interval is 4; taking the current time t as an example, four times before the current time is t-4 and four times after the current time is t+4, so the preset interval time includes time t-4, time t and time t+4. It can be understood that FIG. 8 only shows two sets of preset intervals; in practical application, time sequence convolution processing may be performed on features of different receptive field sizes, for example, D=2, where the preset interval time includes time t-2, time t and time t+2; D=3, where the preset interval time includes time t-3, time t and time t+3; D=5, where the preset interval time includes time t-5, time t and time t+5; the embodiments of the present application are not limited in this regard. By performing time sequence convolution processing on features of different receptive fields with a 3×1 convolution layer (3×1 Conv), the diversity and accuracy of the first feature group are improved, so that the time sequence features can be extracted better.
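A short sketch of one such branch, assuming a (N, C, T, V) layout: a 3×1 kernel with dilation D sees times t-D, t and t+D, and the padding keeps the number of frames unchanged.

```python
import torch
import torch.nn as nn

def dilated_branch(channels: int, d: int) -> nn.Conv2d:
    # 3x1 kernel along the frame axis; dilation d makes it cover t-d, t, t+d
    return nn.Conv2d(channels, channels, kernel_size=(3, 1),
                     padding=(d, 0), dilation=(d, 1))

x = torch.randn(2, 8, 100, 18)                 # one second key point feature group
for d in (1, 2, 3, 4):
    print(d, dilated_branch(8, d)(x).shape)    # frame count stays 100 for every dilation
```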
S2052, carrying out maximum pooling treatment on a second sub-key point feature group in the plurality of second key point feature groups to obtain a second feature group, wherein the second sub-key point feature group comprises a second key point feature group.
S2053, performing convolution processing on a third sub-key point feature set in the plurality of second key point feature sets to obtain a third feature set, wherein the third sub-key point feature set comprises a second key point feature set.
S2054, using the first feature set, the second feature set, and the third feature set as a plurality of third key point feature sets.
For example, as shown in fig. 8, one second keypoint feature group is selected from the plurality of second keypoint feature groups, and the selected one second keypoint feature group is referred to as a second sub-keypoint feature group for convenience of description. And carrying out maximum pooling treatment on the features corresponding to the adjacent time with the interval of 1 by adopting a 3×1 maximum pooling layer (3×1 Maxpool) to obtain a second feature group. The features corresponding to adjacent moments in which the interval is 1 include features corresponding to the moment t-1, the moment t and the moment t+1.
For example, as shown in fig. 8, one second keypoint feature group is selected from the plurality of second keypoint feature groups, and the selected one second keypoint feature group is referred to as a third sub-keypoint feature group for convenience of description. And carrying out convolution processing on the second key point feature set by adopting a 1×1 convolution layer (1×1 Conv) to obtain a third feature set.
It should be noted that the first feature set, the second feature set, and the third feature set may be referred to as a third key point feature set. The first sub-key point feature group, the second sub-key point feature group and the third sub-key point feature group in the embodiment of the application all belong to the second key point feature group.
In the embodiment of the present application, FIG. 8 takes M=6 as an example: the combined bone feature is divided into 6 groups to obtain 6 second key point feature groups. The DG-TCN includes a multi-branch network (M branches) comprising 3×1 convolution layers (3×1 Conv), a 3×1 max pooling layer (3×1 Maxpool) and a 1×1 convolution layer (1×1 Conv). Illustratively, a 3×1 convolution layer (3×1 Conv) is adopted to convolve the features corresponding to time t-1, time t and time t+1 in the 1st second key point feature group; a 3×1 convolution layer (3×1 Conv) is adopted to convolve the features corresponding to time t-2, time t and time t+2 in the 2nd second key point feature group; a 3×1 convolution layer (3×1 Conv) is adopted to convolve the features corresponding to time t-3, time t and time t+3 in the 3rd second key point feature group; and a 3×1 convolution layer (3×1 Conv) is adopted to convolve the features corresponding to time t-4, time t and time t+4 in the 4th second key point feature group. A 3×1 max pooling layer (3×1 Maxpool) is adopted to perform max pooling on the features corresponding to time t-1, time t and time t+1 in the 5th second key point feature group. The 6th second key point feature group is convolved with a 1×1 convolution layer (1×1 Conv). Through the multi-branch network, 6 feature groups can be obtained, and the 6 feature groups are used as 6 third key point feature groups.
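The six-branch layout of FIG. 8 can be sketched as below under the same (N, C, T, V) assumption; the equal per-branch channel width and the plain concatenation are simplifications, not a definitive implementation.

```python
import torch
import torch.nn as nn

class MultiBranchTCN(nn.Module):
    """Six branches over six channel groups: four dilated 3x1 convolutions
    (D = 1..4), one 3x1 max pooling and one 1x1 convolution (FIG. 8)."""
    def __init__(self, channels: int, m: int = 6):
        super().__init__()
        cg = channels // m
        self.m = m
        self.branches = nn.ModuleList(
            [nn.Conv2d(cg, cg, (3, 1), padding=(d, 0), dilation=(d, 1)) for d in (1, 2, 3, 4)]
            + [nn.MaxPool2d((3, 1), stride=1, padding=(1, 0)),
               nn.Conv2d(cg, cg, kernel_size=1)]
        )

    def forward(self, combined: torch.Tensor) -> torch.Tensor:
        groups = torch.chunk(combined, self.m, dim=1)    # M second key point feature groups
        outs = [branch(g) for branch, g in zip(self.branches, groups)]
        return torch.cat(outs, dim=1)                    # spliced third key point feature groups

combined = torch.randn(2, 48, 100, 18)                   # batched c x (v+1) x t combined bone feature
print(MultiBranchTCN(48)(combined).shape)                # torch.Size([2, 48, 100, 18])
```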
In some embodiments, S206 in fig. 5 described above may be implemented in the following manner. Splicing the key point dimensions of the plurality of third key point feature groups to obtain target combined skeleton features; separating the target combined skeleton features in the number dimension of key points to obtain a second skeleton feature and a target global feature; according to the preset coefficient corresponding to the second bone feature and the preset coefficient corresponding to the target global feature, carrying out weighted fusion on the second bone feature and the target global feature to obtain a third bone feature; and fusing the third bone feature and the bone feature to obtain the target bone feature.
Illustratively, as shown in FIG. 8, the 6 third key point feature groups are stitched (Concat), which is the reverse of the above-described process of dividing the combined bone feature into the 6 second key point feature groups. After the stitching is completed, a c×(v+1)×t vector is obtained, consistent with the vector dimensions of the combined bone feature. The stitched vector also needs to be split (Split), which is the reverse of the process of combining the first bone feature and the global feature of the key points into the combined bone feature; after the split is completed, a second bone feature and a target global feature are obtained, where the second bone feature is a c×v×t vector and the target global feature is a c×1×t vector. Concat & Split in FIG. 8 represent the stitching and splitting of the M third key point feature groups. The second bone feature and the target global feature further need to be weighted and fused (D-JSF), and the result is then fused with the bone feature (retaining the original information) to obtain the target bone feature, where the target bone feature is a c×v×t vector; D-JSF in FIG. 8 represents the weighted fusion of the second bone feature and the target global feature. The preset coefficients of the second bone feature and the target global feature may be obtained after the DG-TCN model training is completed.
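A sketch of this splice-split-fuse step, with two learnable scalars standing in for the preset coefficients of D-JSF (an assumption; the description above only states that the coefficients are obtained through training).

```python
import torch
import torch.nn as nn

class JointSceneFusion(nn.Module):
    """Split the spliced c x (v+1) x t feature back into a bone part and a
    global part, fuse them with weighted coefficients, then add the residual."""
    def __init__(self):
        super().__init__()
        self.w_bone = nn.Parameter(torch.tensor(1.0))     # stand-in preset coefficient
        self.w_glob = nn.Parameter(torch.tensor(0.5))     # stand-in preset coefficient

    def forward(self, spliced: torch.Tensor, skeleton: torch.Tensor) -> torch.Tensor:
        v = skeleton.shape[-1]
        bone, glob = spliced[..., :v], spliced[..., v:]   # (N,C,T,V) and (N,C,T,1)
        third = self.w_bone * bone + self.w_glob * glob   # weighted fusion (D-JSF), broadcast over V
        return third + skeleton                           # keep the original information

spliced = torch.randn(2, 48, 100, 18)
skeleton = torch.randn(2, 48, 100, 17)
print(JointSceneFusion()(spliced, skeleton).shape)        # torch.Size([2, 48, 100, 17])
```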
In some embodiments, S102 and S103 of FIG. 1 described above may also be implemented by S301-S308. As shown in fig. 11, fig. 11 is a flowchart illustrating optional steps of a further method for identifying actions according to an embodiment of the present application.
And S301, averaging the key point features in the bone features according to the key point feature dimensions in the bone features to obtain a second global feature of the key points in the bone features.
S302, performing multi-scale division on the bone features and the second global features according to the key point feature dimensions in the bone features to obtain a plurality of first key point feature groups.
S303, performing multi-branch time sequence convolution processing on the first key point feature groups to obtain fifth key point feature groups.
S304, determining fourth bone characteristics according to the fifth key point characteristic groups.
And S305, dividing the fourth bone features according to the key point feature dimensions in the fourth bone features to obtain a plurality of sixth key point feature groups.
S306, determining the topological structure of each sixth key point feature group according to each sixth key point feature group.
S307, carrying out spatial feature fusion according to the sixth key point feature groups and the topological structures of the sixth key point feature groups to obtain fifth skeleton features.
S308, determining target bone characteristics according to the fifth bone characteristics and the bone characteristics.
In the embodiment of the present application, in the motion recognition method provided in FIG. 5, a dynamic multi-group spatial modeling module (DG-GCN) first extracts spatial features from the bone features to obtain bone features with the spatial features fused; a dynamic multi-group temporal modeling module (DG-TCN) then extracts time sequence features from the bone features with the spatial features fused, and the result is fused with the bone features to obtain the target bone features. The motion recognition method provided in FIG. 11 first extracts time sequence features from the bone features with the DG-TCN to obtain bone features with the time sequence features fused, then extracts spatial features from those features with the DG-GCN, and fuses the result with the bone features to obtain the target bone features. In terms of information extraction, in both FIG. 5 and FIG. 11 the bone features (c×v×t vectors) run through the whole process, and the extraction steps of the time sequence features and of the spatial features are the same; only the inputs differ. Illustratively, in the motion recognition method provided in FIG. 5, the skeletal features are input into the DG-GCN, and the DG-GCN outputs the skeletal features with the spatial features fused (i.e., the first skeletal features); the first skeletal features are input into the DG-TCN, the DG-TCN outputs the skeletal features with the time sequence features fused, and these are then fused with the skeletal features (retaining the original information) to obtain the target skeletal features. In the motion recognition method provided in FIG. 11, the skeletal features are input into the DG-TCN, and the DG-TCN outputs the skeletal features with the time sequence features fused (i.e., the fourth skeletal features); the fourth skeletal features are input into the DG-GCN, the DG-GCN outputs the skeletal features with the spatial features fused, and these are then fused with the skeletal features (retaining the original information) to obtain the target skeletal features.
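Schematically, the two orderings differ only in the order in which the spatial and temporal modules are composed. In the sketch below, `spatial` and `temporal` stand for DG-GCN-like and DG-TCN-like modules (identity stand-ins here), and the final addition stands for the fusion with the original bone feature; this is an illustrative composition, not the exact wiring of FIG. 5 or FIG. 11.

```python
import torch
import torch.nn as nn

class DGBlock(nn.Module):
    """FIG. 5 ordering: spatial modelling first, then temporal modelling, then
    fusion with the incoming bone feature (original information retained).
    Swapping the two calls in forward() gives the FIG. 11 ordering."""
    def __init__(self, spatial: nn.Module, temporal: nn.Module):
        super().__init__()
        self.spatial, self.temporal = spatial, temporal

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.temporal(self.spatial(x))   # FIG. 11 would use self.spatial(self.temporal(x))
        return y + x                         # fuse with the bone feature

# identity stand-ins only show the data flow; real modules follow FIG. 6 and FIG. 8
block = DGBlock(nn.Identity(), nn.Identity())
print(block(torch.randn(2, 64, 100, 17)).shape)   # torch.Size([2, 64, 100, 17])
```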
In the embodiment of the present application, for the specific implementation process and beneficial effects of extracting the time sequence features in FIG. 11, reference may be made to S203-S206 of FIG. 5, FIG. 8 and FIG. 9; for the specific implementation process and beneficial effects of extracting the spatial features in FIG. 11, reference may be made to S201-S202 of FIG. 5, FIG. 6, FIG. 7 and FIG. 10. Details are not described herein again.
The embodiment of the application provides an efficient skeleton-based motion recognition algorithm which is superior to the related art on a plurality of test benchmarks and has the advantages of higher precision and a smaller amount of computation. Correspondingly, a graph convolutional network with a dynamic multi-group design is provided, which exceeds the related art on various evaluation benchmarks. As shown in FIG. 12A, FIG. 12B, FIG. 13 and FIG. 14, the four figures are exemplary schematic diagrams comparing four different action recognition results provided by the embodiment of the present application.
For example, FIG. 12A and FIG. 12B show comparison diagrams of motion recognition results; in FIG. 12A and FIG. 12B the abscissa is the computational cost (GFLOPs/Clip) and the other axis is the Top-1 accuracy (%). The calculation models for motion recognition include ST-GCN, AGCN, MS-G3D, CTR-GCN and DG-STGCN, and NTU120-XSub and NTU120-XSet represent different 3D datasets; FIG. 12A and FIG. 12B compare the accuracy of the motion recognition results on the different 3D datasets. Taking the data in FIG. 12A as an example, for the NTU120-XSub dataset in FIG. 12A, the action recognition algorithms ST-GCN, AGCN, MS-G3D, CTR-GCN and DG-STGCN are adopted respectively; from the distribution of each action recognition algorithm it can be seen that the graph convolutional network based on a dynamic multi-group design (DG-STGCN) provided by the embodiment of the application surpasses the related methods on various evaluation benchmarks, with higher precision and a smaller amount of computation, so that the accuracy of action recognition is improved.
For example, FIG. 13 shows a further comparison diagram of action recognition results; FIG. 13 compares the recognition accuracy of action recognition on 3D key points. The calculation models for motion recognition in FIG. 13 include: ST-GCN, SGN, AS-GCN, RA-GCN, AGCN, DGNN, FGCN, shiftGCN, DSTA-Net, MS-G3D, CTR-GCN, 2s DG-STGCN and DG-STGCN, and NTU60-XSub, NTU60-XView, NTU120-XSub, NTU120-XSet and Kinetics represent different datasets. The numbers in FIG. 13 represent the calculation accuracy (%). 2s DG-STGCN and DG-STGCN are graph convolutional networks based on a dynamic multi-group design provided by the embodiments of the present application; the skeletal features applied by 2s DG-STGCN include two modalities, for example, the key point coordinate positions and the skeletal motion state (the skeletal motion parameters between a subsequent video frame and a previous video frame), and the skeletal features applied by DG-STGCN include four modalities, for example, the key point coordinate positions, the skeletal motion state, the difference between two adjacent key point coordinate positions, and the coordinate difference between two adjacent skeletal motion states. As shown in FIG. 13, on the NTU120-XSub dataset, the action recognition algorithms ST-GCN, 2s DG-STGCN and DG-STGCN achieve precisions of 70.7, 89.2 and 89.6, respectively. The evaluation results of the graph convolutional networks based on a dynamic multi-group design (DG-STGCN and 2s DG-STGCN) provided by the embodiment of the application on the 3D key point datasets exceed the related methods, with higher precision, so the accuracy of action recognition is improved.
For example, FIG. 14 shows a further comparison diagram of action recognition results; FIG. 14 compares the recognition accuracy of action recognition on 2D key points. The calculation models for motion recognition in FIG. 14 include: MS-G3D++, PoseC3D, 2s DG-STGCN and DG-STGCN, and NTU60-XSub, NTU60-XView, NTU120-XSub and NTU120-XSet represent different 2D datasets. The meanings of 2s DG-STGCN and DG-STGCN are consistent with FIG. 13 described above. The numbers in FIG. 14 represent the calculation accuracy (%). For the NTU120-XSub dataset in FIG. 14, the motion recognition algorithms PoseC3D, 2s DG-STGCN and DG-STGCN achieve accuracies of 86.9, 87.3 and 87.5, respectively. It can be seen that the evaluation results of the graph convolutional networks based on a dynamic multi-group design (DG-STGCN and 2s DG-STGCN) provided by the embodiment of the application on the 2D key point datasets surpass the related methods, with higher precision, so the accuracy of action recognition is improved.
In order to implement the motion recognition method according to the embodiment of the present application, the embodiment of the present application further provides a motion recognition device, as shown in fig. 15, fig. 15 is an optional schematic structural diagram of the motion recognition device according to the embodiment of the present application, where the motion recognition device 150 includes: an obtaining module 1501, configured to obtain skeletal features corresponding to a video to be identified, where the skeletal features represent a three-dimensional sequence of a number of key points, a number of key point features, and a number of video frames; a partitioning module 1502, configured to partition the skeletal feature according to the key point feature dimensions in the skeletal feature, to obtain a plurality of first key point feature groups; the modeling module 1503 is configured to perform space-time fusion according to the plurality of first key point feature sets, and in combination with the topology structure of each first key point feature set determined according to each first key point feature set, and the multi-scale features of each first key point feature set, to obtain a target skeleton feature after the spatial feature and the time sequence feature are fused; the recognition module 1504 is configured to perform motion recognition according to the target skeleton feature, so as to obtain a motion category corresponding to the video to be recognized.
In some embodiments, the action recognition device 150 further includes a feature enhancement module;
The acquisition module 1501 is further configured to acquire a video to be identified;
The feature enhancement module is also used for estimating key points of the video to be identified according to a preset gesture estimation model to obtain original skeleton features; the preset gesture estimation model is used for estimating a skeleton point sequence of the video to be identified; dividing the corresponding original time sequence in the original skeleton characteristics into a plurality of subsequences; the difference value of the time sequence lengths of two adjacent subsequences is within a preset range, and the number of the plurality of subsequences is the preset length; one sub-sequence includes features corresponding to a plurality of video frames; in the original skeleton characteristics, sampling characteristics corresponding to a plurality of video frames in each sub-sequence to obtain time sequence enhancement characteristics corresponding to each sub-sequence; and connecting the time sequence enhancement features corresponding to the multiple subsequences to obtain skeleton features.
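A minimal sketch of this time sequence enhancement, assuming the original skeleton feature is a (T, V, C) tensor and that one frame is drawn from each sub-sequence (the exact per-sub-sequence sampling rule is an assumption).

```python
import torch

def uniform_sample(skeleton: torch.Tensor, num_segments: int) -> torch.Tensor:
    """skeleton: (T, V, C) original skeleton feature over T video frames.
    Splits the T frames into nearly equal-length sub-sequences and draws one
    frame from each, then connects the sampled features in time order."""
    t = skeleton.shape[0]
    bounds = torch.linspace(0, t, num_segments + 1).long()
    picks = [int(torch.randint(lo, hi, (1,))) if hi > lo else min(lo, t - 1)
             for lo, hi in zip(bounds[:-1].tolist(), bounds[1:].tolist())]
    return skeleton[picks]                      # (num_segments, V, C)

clip = torch.randn(97, 17, 3)                   # 97 frames, 17 key points, (x, y, score)
print(uniform_sample(clip, 32).shape)           # torch.Size([32, 17, 3])
```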
In some embodiments, the modeling module 1503 is further configured to determine a topology of each first keypoint feature group according to each first keypoint feature group; carrying out space feature fusion according to each first key point feature group and the topological structure of each first key point feature group to obtain first skeleton features; according to the feature dimension of the key points in the first bone features, the key point features in the first bone features are averaged to obtain first global features of the key points in the first bone features; performing multi-scale division on the first skeleton feature and the first global feature according to the key point feature dimension in the first skeleton feature to obtain a plurality of second key point feature groups; performing multi-branch time sequence convolution processing on the plurality of second key point feature groups to obtain a plurality of third key point feature groups; a target bone feature is determined based on the plurality of third keypoint feature sets and the bone feature.
In some embodiments, the modeling module 1503 is further configured to pool, for each first key point feature group, a number of dimensions of a video frame in the first key point feature group to obtain a two-dimensional feature matrix, where the two-dimensional feature matrix characterizes the number of key points and a two-dimensional sequence of key point features in the first key point feature group; carrying out convolution processing of different parameters on the two-dimensional feature matrix to obtain a first parameter matrix and a second parameter matrix; respectively determining a dynamic topological structure corresponding to the first key point feature set and a dynamic topological structure of a single feature channel corresponding to a key point in the first key point feature set according to the first parameter matrix and the second parameter matrix; and carrying out weighted summation according to a preset shared key point topological structure, a dynamic topological structure corresponding to the first key point characteristic group, a preset coefficient corresponding to the dynamic topological structure corresponding to the first key point characteristic group, a dynamic topological structure of a single characteristic channel corresponding to a key point in the first key point characteristic group and a preset coefficient corresponding to the dynamic topological structure of a single characteristic channel corresponding to a key point in the first key point characteristic group, so as to obtain the topological structure of the first key point characteristic group.
In some embodiments, the modeling module 1503 is further configured to cross-multiply the first parameter matrix and the transposed second parameter matrix to obtain a dynamic topology structure corresponding to the first key point feature set; aiming at the same key point feature in the two-dimensional feature matrix, subtracting the feature corresponding to the number dimension of each key point in the first parameter matrix from the feature corresponding to the number dimension of a plurality of key points in the second parameter matrix to obtain a two-dimensional matrix of each key point feature; and normalizing the two-dimensional matrix of each key point feature by an activation function to obtain a dynamic topological structure of a single feature channel corresponding to the key point in the first key point feature group.
In some embodiments, the modeling module 1503 is further configured to cross-multiply each first keypoint feature group with a topology of each first keypoint feature group to obtain a plurality of fourth keypoint feature groups; and splicing and fusing the fourth key point feature groups to obtain the first skeleton feature.
In some embodiments, the modeling module 1503 is further configured to combine the first bone feature and the first global feature to obtain a combined bone feature; dividing the combined bone features according to the key point feature dimensions in the first bone features to obtain a plurality of second key point feature groups.
In some embodiments, the modeling module 1503 is further configured to convolve a first sub-keypoint feature group of the plurality of second keypoint feature groups according to different intervals to obtain a first feature group, where the first sub-keypoint feature group includes at least two second keypoint feature groups; carrying out maximum pooling treatment on a second sub-key point feature group in the plurality of second key point feature groups to obtain a second feature group, wherein the second sub-key point feature group comprises a second key point feature group; carrying out convolution processing on a third sub-key point feature set in the plurality of second key point feature sets to obtain a third feature set, wherein the third sub-key point feature set comprises a second key point feature set; the first feature set, the second feature set, and the third feature set are used as a plurality of third key point feature sets.
In some embodiments, the modeling module 1503 is further configured to convolve, for each second keypoint feature group in the first sub-keypoint feature group, features corresponding to a preset interval time in the second keypoint feature group to obtain feature groups corresponding to each second keypoint feature group; taking the feature group corresponding to each second key point feature group as a first feature group; the preset interval time comprises a current time, a first time before the current time and a second time after the current time, and the interval time between the current time and the first time is the same as the interval time between the second time and the current time.
In some embodiments, the modeling module 1503 is further configured to splice the plurality of third keypoint feature groups in a keypoint dimension to obtain a target combined skeleton feature; separating the target combined skeleton features in the number dimension of key points to obtain a second skeleton feature and a target global feature; according to the preset coefficient corresponding to the second bone feature and the preset coefficient corresponding to the target global feature, carrying out weighted fusion on the second bone feature and the target global feature to obtain a third bone feature; and fusing the third bone feature and the bone feature to obtain the target bone feature.
In some embodiments, the identifying module 1504 is further configured to pool the features of the video frame number dimension and the key point number dimension in the target skeleton feature to obtain a one-dimensional skeleton feature; and performing motion recognition according to the one-dimensional skeleton characteristics to obtain motion categories corresponding to the videos to be recognized.
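A sketch of such a recognition head, assuming a (N, C, T, V) target bone feature, average pooling over the frame and key point dimensions, and a linear classifier; the number of action classes is a placeholder.

```python
import torch
import torch.nn as nn

class RecognitionHead(nn.Module):
    """Pool the target bone feature over the frame and key point dimensions,
    then classify into action categories."""
    def __init__(self, channels: int, num_classes: int = 120):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, target: torch.Tensor) -> torch.Tensor:
        one_dim = target.mean(dim=(2, 3))       # (N, C): one-dimensional skeleton feature
        return self.fc(one_dim)                 # action category scores

scores = RecognitionHead(64)(torch.randn(2, 64, 100, 17))
print(scores.shape, scores.argmax(dim=1).shape)   # torch.Size([2, 120]) torch.Size([2])
```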
In some embodiments, the partitioning module 1502 is further configured to average the key point features in the bone features according to the key point feature dimensions in the bone features, to obtain a second global feature of the key points in the bone features; according to the key point feature dimension in the bone feature, performing multi-scale division on the bone feature and the second global feature to obtain a plurality of first key point feature groups;
The modeling module 1503 is further configured to perform multi-branch time sequence convolution processing on the plurality of first key point feature sets to obtain a plurality of fifth key point feature sets; determining a fourth bone feature from the fifth plurality of key point feature sets; dividing the fourth bone feature according to the key point feature dimension in the fourth bone feature to obtain a plurality of sixth key point feature groups; determining the topological structure of each sixth key point feature group according to each sixth key point feature group; carrying out space feature fusion according to the sixth key point feature groups and the topological structures of the sixth key point feature groups to obtain fifth skeleton features; a target bone feature is determined based on the fifth bone feature and the bone feature.
It should be noted that, in the action recognition device provided in the above embodiment, only the division of each program module is used for illustration; in practical application, the processing may be allocated to different program modules as needed, that is, the internal structure of the device may be divided into different program modules so as to complete all or part of the processing described above. In addition, the action recognition device and the action recognition method provided in the foregoing embodiments belong to the same concept, and their specific implementation processes and beneficial effects are detailed in the method embodiments, which are not described herein again. For technical details not disclosed in the present apparatus embodiment, please refer to the description of the method embodiments of the present application.
In the embodiment of the present application, fig. 16 is a schematic diagram illustrating a composition structure of an action recognition device according to the embodiment of the present application, and as shown in fig. 16, a device 160 according to the embodiment of the present application includes a processor 1601, a memory 1602 storing an executable computer program, and the processor 1601 is configured to implement the action recognition method according to the embodiment of the present application when executing the executable computer program stored in the memory 1602. In some embodiments, the action recognition device 160 may also include a communication interface 1603, and a bus 1604 for connecting the processor 1601, memory 1602, and communication interface 1603.
In an embodiment of the present application, the processor 1601 may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It will be appreciated that the electronics for implementing the above-described processor functions may be different for different devices, and embodiments of the present application are not particularly limited.
In an embodiment of the present application, bus 1604 is used to connect communication interface 1603, processor 1601, and memory 1602 for communication between these devices.
The memory 1602 is used to store executable computer programs and data, the executable computer programs including computer operating instructions; the memory 1602 may comprise high-speed RAM memory, and may also include non-volatile memory, such as at least two disk memories. In practical applications, the memory 1602 may be a volatile memory, such as a Random Access Memory (RAM); or a non-volatile memory, such as a Read-Only Memory (ROM), a flash memory, a hard disk (Hard Disk Drive, HDD) or a solid state disk (Solid-State Drive, SSD); or a combination of the above memories, and provides executable computer programs and data to the processor 1601.
In addition, each functional module in the present embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional modules.
The integrated units, if implemented in the form of software functional modules, may be stored in a computer-readable storage medium, if not sold or used as separate products, and based on this understanding, the technical solution of the present embodiment may be embodied essentially or partly in the form of a software product, or all or part of the technical solution may be embodied in a storage medium, which includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) or processor (processor) to perform all or part of the steps of the method of the present embodiment. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
An embodiment of the present application provides a computer-readable storage medium storing a computer program for implementing the action recognition method according to any one of the embodiments above when executed by a processor.
For example, the program instructions corresponding to one action recognition method in the present embodiment may be stored on a storage medium such as an optical disc, a hard disc, or a usb disk, and when the program instructions corresponding to one action recognition method in the storage medium are read or executed by an electronic device, the action recognition method described in any of the foregoing embodiments may be implemented.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of implementations of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block and/or flow of the flowchart illustrations and/or block diagrams, and combinations of blocks and/or flow diagrams in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart block or blocks and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks and/or block diagram block or blocks.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the present application.

Claims (14)

1. A method of motion recognition, the method comprising:
acquiring skeleton characteristics corresponding to a video to be identified, wherein the skeleton characteristics represent three-dimensional sequences of the number of key points, the key point characteristics and the number of video frames;
Dividing the bone features according to the key point feature dimensions in the bone features to obtain a plurality of first key point feature groups;
According to the first key point feature groups, combining the topological structure of each first key point feature group determined according to each first key point feature group and the multi-scale features of each first key point feature group, and performing space-time fusion to obtain target skeleton features after fusion of space features and time sequence features;
performing action recognition according to the target skeleton characteristics to obtain action categories corresponding to the videos to be recognized;
The performing space-time fusion according to the plurality of first key point feature groups, combining the topological structure of each first key point feature group determined according to each first key point feature group, and the multi-scale features of each first key point feature group, to obtain a target skeleton feature after fusing the space feature and the time sequence feature, including:
determining the topological structure of each first key point feature group according to each first key point feature group;
Performing spatial feature fusion according to the first key point feature groups and the topological structures of the first key point feature groups to obtain first skeleton features;
according to the feature dimension of the key points in the first bone features, the key point features in the first bone features are averaged to obtain first global features of the key points in the first bone features;
performing multi-scale division on the first bone feature and the first global feature according to the key point feature dimension in the first bone feature to obtain a plurality of second key point feature groups;
Performing multi-branch time sequence convolution processing on the plurality of second key point feature groups to obtain a plurality of third key point feature groups;
determining the target bone feature from the plurality of third keypoint feature sets and the bone feature.
2. The method according to claim 1, wherein the acquiring the bone feature corresponding to the video to be identified comprises:
Acquiring a video to be identified;
Performing key point estimation on the video to be identified according to a preset gesture estimation model to obtain original skeleton characteristics; the preset gesture estimation model is used for estimating a skeleton point sequence of the video to be identified;
Dividing the corresponding original time sequence in the original skeleton characteristics into a plurality of subsequences; the difference value of the time sequence lengths of two adjacent subsequences is within a preset range, and the number of the subsequences is the preset length; one sub-sequence includes features corresponding to a plurality of video frames;
Sampling features corresponding to a plurality of video frames in each sub-sequence in the original skeleton features to obtain time sequence enhancement features corresponding to each sub-sequence;
and connecting the time sequence enhancement features corresponding to the subsequences to obtain the skeleton features.
3. The method of claim 1, wherein said determining a topology of each of said first keypoint feature sets from each of said first keypoint feature sets comprises:
Pooling the number dimension of the video frames in the first key point feature group aiming at each first key point feature group to obtain a two-dimensional feature matrix, wherein the two-dimensional feature matrix characterizes the number of key points and the two-dimensional sequence of key point features in the first key point feature group;
carrying out convolution processing of different parameters on the two-dimensional feature matrix to obtain a first parameter matrix and a second parameter matrix;
Respectively determining a dynamic topological structure corresponding to the first key point feature set and a dynamic topological structure of a single feature channel corresponding to a key point in the first key point feature set according to the first parameter matrix and the second parameter matrix;
And carrying out weighted summation according to a preset shared key point topological structure, a dynamic topological structure corresponding to the first key point characteristic group, a preset coefficient corresponding to the dynamic topological structure corresponding to the first key point characteristic group, a dynamic topological structure of a single characteristic channel corresponding to a key point in the first key point characteristic group and a preset coefficient corresponding to the dynamic topological structure of a single characteristic channel corresponding to a key point in the first key point characteristic group, so as to obtain the topological structure of the first key point characteristic group.
4. The method of claim 3, wherein determining the dynamic topology corresponding to the first keypoint feature group and the dynamic topology of the single feature channel corresponding to the keypoint in the first keypoint feature group according to the first parameter matrix and the second parameter matrix, respectively, comprises:
Performing cross multiplication on the first parameter matrix and the transposed second parameter matrix to obtain a dynamic topological structure corresponding to the first key point feature set;
For the same key point feature in the two-dimensional feature matrix, subtracting the feature corresponding to the number dimension of each key point in the first parameter matrix from the feature corresponding to the number dimension of a plurality of key points in the second parameter matrix to obtain a two-dimensional matrix of each key point feature;
normalizing the two-dimensional matrix of each key point feature by an activation function to obtain a dynamic topological structure of a single feature channel corresponding to the key point in the first key point feature group.
5. The method of claim 1, 3 or 4, wherein the performing spatial feature fusion according to the topology of each of the first keypoint feature groups and each of the first keypoint feature groups to obtain a first bone feature comprises:
performing cross multiplication on each first key point feature set and the topological structure of each first key point feature set to obtain a plurality of fourth key point feature sets;
And splicing and fusing the fourth key point feature groups to obtain the first skeleton feature.
6. The method of claim 1, wherein the multi-scale partitioning the first bone feature and the first global feature according to the keypoint feature dimension in the first bone feature to obtain a plurality of second keypoint feature groups includes:
Combining the first bone feature and the first global feature to obtain a combined bone feature;
and dividing the combined bone features according to the key point feature dimensions in the first bone features to obtain a plurality of second key point feature groups.
7. The method of claim 1, wherein the performing multi-branch time-sequential convolution processing on the plurality of second keypoint feature sets to obtain a plurality of third keypoint feature sets includes:
convolving a first sub-key point feature set in the plurality of second key point feature sets according to different interval time to obtain a first feature set, wherein the first sub-key point feature set comprises at least two second key point feature sets;
carrying out maximum pooling treatment on a second sub-key point feature group in the plurality of second key point feature groups to obtain a second feature group, wherein the second sub-key point feature group comprises one second key point feature group;
Performing convolution processing on a third sub-key point feature set in the plurality of second key point feature sets to obtain a third feature set, wherein the third sub-key point feature set comprises one second key point feature set;
and taking the first feature group, the second feature group and the third feature group as a plurality of third key point feature groups.
8. The method of claim 7, wherein convolving a first sub-set of the plurality of second set of keypoint features at different intervals to obtain a first set of features, comprising:
For each second key point feature group in the first sub-key point feature group, carrying out convolution processing on features corresponding to preset interval time in the second key point feature group to obtain feature groups corresponding to each second key point feature group;
taking the feature group corresponding to each second key point feature group as the first feature group;
The second key point feature sets correspond to different preset interval times, and the preset interval times comprise a current time, a first time before the current time and a second time after the current time, and the interval time between the current time and the first time is the same as the interval time between the second time and the current time.
9. The method of claim 1, 7 or 8, wherein said determining said target bone feature from a plurality of said third set of keypoint features and said bone feature comprises:
Splicing the key point dimensions of the plurality of third key point feature groups to obtain target combined skeleton features;
Separating the target combined skeleton features in the number dimension of key points to obtain a second skeleton feature and a target global feature;
according to the preset coefficient corresponding to the second bone feature and the preset coefficient corresponding to the target global feature, carrying out weighted fusion on the second bone feature and the target global feature to obtain a third bone feature;
And fusing the third bone feature and the bone feature to obtain the target bone feature.
10. The method according to claim 1 or 2, wherein the performing the motion recognition according to the target bone feature to obtain the motion category corresponding to the video to be recognized includes:
Pooling the characteristics of the video frame number dimension and the key point number dimension in the target skeleton characteristics to obtain one-dimensional skeleton characteristics;
And performing motion recognition according to the one-dimensional skeleton characteristics to obtain motion categories corresponding to the videos to be recognized.
11. The method according to claim 1 or 2, wherein the dividing the bone feature according to the key feature dimension in the bone feature to obtain a plurality of first key feature groups includes:
averaging the key point features in the bone features according to the key point feature dimensions in the bone features to obtain second global features of the key points in the bone features;
Performing multi-scale division on the bone feature and the second global feature according to the key point feature dimension in the bone feature to obtain a plurality of first key point feature groups;
The performing space-time fusion according to the plurality of first key point feature groups, combining the topological structure of each first key point feature group determined according to each first key point feature group, and the multi-scale features of each first key point feature group, to obtain a target skeleton feature after fusing the space feature and the time sequence feature, including:
Performing multi-branch time sequence convolution processing on the plurality of first key point feature groups to obtain a plurality of fifth key point feature groups;
Determining a fourth bone feature from the plurality of fifth set of key point features;
Dividing the fourth bone feature according to the key point feature dimension in the fourth bone feature to obtain a plurality of sixth key point feature groups;
Determining the topological structure of each sixth key point feature group according to each sixth key point feature group;
Performing spatial feature fusion according to the sixth key point feature groups and the topological structures of the sixth key point feature groups to obtain fifth skeleton features;
Determining the target bone feature from the fifth bone feature and the bone feature.
12. An action recognition device, the device comprising:
The acquisition module is used for acquiring skeleton features corresponding to the video to be identified, wherein the skeleton features represent three-dimensional sequences of the number of key points, the key point features and the number of video frames;
The dividing module is used for dividing the skeleton characteristics according to the key point characteristic dimension in the skeleton characteristics to obtain a plurality of first key point characteristic groups;
the modeling module is used for carrying out space-time fusion according to a plurality of first key point feature groups, combining the topological structure of each first key point feature group determined according to each first key point feature group and the multi-scale features of each first key point feature group to obtain target skeleton features after the spatial features and the time sequence features are fused;
the identification module is used for carrying out action identification according to the target skeleton characteristics to obtain action categories corresponding to the videos to be identified;
The modeling module is further used for determining the topological structure of each first key point feature group according to each first key point feature group; carrying out space feature fusion according to each first key point feature group and the topological structure of each first key point feature group to obtain first skeleton features; according to the feature dimension of the key points in the first bone features, the key point features in the first bone features are averaged to obtain first global features of the key points in the first bone features; performing multi-scale division on the first skeleton feature and the first global feature according to the key point feature dimension in the first skeleton feature to obtain a plurality of second key point feature groups; performing multi-branch time sequence convolution processing on the plurality of second key point feature groups to obtain a plurality of third key point feature groups; a target bone feature is determined based on the plurality of third keypoint feature sets and the bone feature.
13. An action recognition device, the device comprising:
A memory for storing an executable computer program;
A processor for implementing the method of any of claims 1-11 when executing an executable computer program stored in said memory.
14. A computer readable storage medium, characterized in that a computer program is stored for implementing the method of any one of claims 1-11 when being executed by a processor.
CN202210411105.0A 2022-04-19 2022-04-19 Action recognition method, apparatus, device and computer readable storage medium Active CN114863325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210411105.0A CN114863325B (en) 2022-04-19 2022-04-19 Action recognition method, apparatus, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210411105.0A CN114863325B (en) 2022-04-19 2022-04-19 Action recognition method, apparatus, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN114863325A CN114863325A (en) 2022-08-05
CN114863325B true CN114863325B (en) 2024-06-07

Family

ID=82632199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210411105.0A Active CN114863325B (en) 2022-04-19 2022-04-19 Action recognition method, apparatus, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114863325B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115862150B (en) * 2023-01-06 2023-05-23 吉林大学 Diver action recognition method based on three-dimensional human body skin

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472497A (en) * 2019-07-08 2019-11-19 西安工程大学 A kind of motion characteristic representation method merging rotation amount
CN112395945A (en) * 2020-10-19 2021-02-23 北京理工大学 Graph volume behavior identification method and device based on skeletal joint points
CN112446253A (en) * 2019-08-30 2021-03-05 中国移动通信有限公司研究院 Skeleton behavior identification method and device
CN113657349A (en) * 2021-09-01 2021-11-16 重庆邮电大学 Human body behavior identification method based on multi-scale space-time graph convolutional neural network
WO2022022063A1 (en) * 2020-07-27 2022-02-03 腾讯科技(深圳)有限公司 Three-dimensional human pose estimation method and related device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472497A (en) * 2019-07-08 2019-11-19 西安工程大学 A kind of motion characteristic representation method merging rotation amount
CN112446253A (en) * 2019-08-30 2021-03-05 中国移动通信有限公司研究院 Skeleton behavior identification method and device
WO2022022063A1 (en) * 2020-07-27 2022-02-03 腾讯科技(深圳)有限公司 Three-dimensional human pose estimation method and related device
CN112395945A (en) * 2020-10-19 2021-02-23 北京理工大学 Graph volume behavior identification method and device based on skeletal joint points
CN113657349A (en) * 2021-09-01 2021-11-16 重庆邮电大学 Human body behavior identification method based on multi-scale space-time graph convolutional neural network

Also Published As

Publication number Publication date
CN114863325A (en) 2022-08-05

Similar Documents

Publication Publication Date Title
US11093805B2 (en) Image recognition method and apparatus, image verification method and apparatus, learning method and apparatus to recognize image, and learning method and apparatus to verify image
CN110050267B (en) System and method for data management
CN109271933B (en) Method for estimating three-dimensional human body posture based on video stream
US10346726B2 (en) Image recognition method and apparatus, image verification method and apparatus, learning method and apparatus to recognize image, and learning method and apparatus to verify image
US11704549B2 (en) Event-based classification of features in a reconfigurable and temporally coded convolutional spiking neural network
KR102288280B1 (en) Device and method to generate image using image learning model
KR20190118387A (en) Convolutional neural network based image processing system and method
Prasetyo et al. Multi-level residual network VGGNet for fish species classification
CN112200057B (en) Face living body detection method and device, electronic equipment and storage medium
CN111626184B (en) Crowd density estimation method and system
CN114863325B (en) Action recognition method, apparatus, device and computer readable storage medium
CN113128424A (en) Attention mechanism-based graph convolution neural network action identification method
CN111833400A (en) Camera position and posture positioning method
KR102013649B1 (en) Image processing method for stereo matching and program using the same
Seow et al. Recurrent neural network as a linear attractor for pattern association
KR20230040111A (en) Image processing method and device
CN116189281B (en) End-to-end human behavior classification method and system based on space-time self-adaptive fusion
CN117058235A (en) Visual positioning method crossing various indoor scenes
CN116030537A (en) Three-dimensional human body posture estimation method based on multi-branch attention-seeking convolution
CN114444727B (en) Living body detection method and device, electronic model and storage medium
CN114863570A (en) Training and recognition method, device and medium of video motion recognition model
CN113033430A (en) Bilinear-based artificial intelligence method, system and medium for multi-modal information processing
Darma et al. GFF-CARVING: Graph Feature Fusion for the Recognition of Highly Varying and Complex Balinese Carving Motifs
Lang Multispectral skeleton-based human action recognition
CN115273141A (en) Self-selection receptive field block, image processing method and application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant