CN113963201A - Bone action recognition method and device, electronic equipment and storage medium - Google Patents

Bone action recognition method and device, electronic equipment and storage medium

Info

Publication number
CN113963201A
CN113963201A
Authority
CN
China
Prior art keywords
action
data
space
target
recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111211761.8A
Other languages
Chinese (zh)
Other versions
CN113963201B (en)
Inventor
陈恩庆
吕梦柯
辛华磊
高猛
马龙
丁英强
吕小永
郭新
张娟
井中纪
张文亚
张建林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan Xintong Intelligent Iot Co ltd
Zhengzhou University
Original Assignee
Henan Xintong Intelligent Iot Co ltd
Zhengzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan Xintong Intelligent Iot Co ltd, Zhengzhou University filed Critical Henan Xintong Intelligent Iot Co ltd
Priority to CN202111211761.8A priority Critical patent/CN113963201B/en
Publication of CN113963201A publication Critical patent/CN113963201A/en
Application granted granted Critical
Publication of CN113963201B publication Critical patent/CN113963201B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a skeleton action recognition method and device, an electronic device, and a storage medium, relating to the technical field of action recognition. The method comprises the following steps: generating a space-time graph based on skeleton point data and time-sequence edges of a video sequence, feeding the space-time graph into a constructed action recognition model, configuring a plurality of partition numbers in an iterative manner to recognize the space-time graph, and taking the partition number with the highest recognition accuracy as the target configuration partition number; generating target connection relations of the skeleton points based on the target configuration partition number, to replace the predefined connection relations represented by adjacency matrices in the action recognition model; and recognizing human body actions based on the target connection relations. This can solve the problem of low accuracy in existing skeleton action recognition.

Description

Bone action recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of motion recognition technologies, and in particular, to a bone motion recognition method, apparatus, electronic device, and storage medium.
Background
Compared with RGB video data, human skeleton point data has better environmental adaptability and action expression capability, so action recognition algorithms based on skeleton point data have received wide attention and research. In 2D or 3D coordinate form, the dynamic skeleton modality can be represented by a time series of human joint positions, and human behavior recognition can then be done by analyzing its motion pattern. However, conventional skeleton-based action recognition methods form feature vectors from the joint coordinates at each time step and perform time-series analysis on these vectors. Such methods do not explicitly exploit the spatial relationships among human joints, so the natural connection relations of the human joints cannot be fully utilized, and the accuracy of skeleton action recognition is consequently low.
Disclosure of Invention
Based on this, an object of the embodiments of the present application is to provide a bone motion recognition method, device, electronic device and storage medium, so as to solve the problem of low accuracy of bone motion recognition in the prior art.
In a first aspect, an embodiment of the present application provides a bone action recognition method, including: generating a space-time diagram based on skeleton point data and a time sequence edge of a video sequence, transmitting the space-time diagram into a constructed action recognition model, configuring the number of a plurality of partitions in an iterative mode to recognize the space-time diagram, and configuring the number of the partitions with the highest recognition accuracy as a target configuration partition number, wherein the partitions are a set formed by a plurality of skeleton points in the action recognition model;
generating target connection relations of skeleton points based on the target configuration partition number to replace predefined connection relations based on the adjacency matrix representation in the action recognition model;
and identifying the human body action based on the target connection relation.
In the implementation process, the partition number with the highest accuracy is screened out in an iterative manner as the target configuration partition number, and the skeleton point connection relations are acquired adaptively based on the target configuration partition number to replace the preset natural connection relations of human joints. This removes the constraint of the natural connection relations of human skeleton points, improves the adaptive learning capability of the adjacency matrix and the network performance of the model, and adapts to the varying characteristics of different actions, thereby improving the action recognition accuracy of the skeleton action recognition model.
Optionally, before generating the space-time diagram based on the bone point data and the timing edge of the video sequence, the method further comprises:
acquiring frame data from the video sequence, and acquiring the bone point data on a frame image corresponding to the frame data, wherein the bone point data comprises coordinate data of each bone point;
and constructing a space map based on the natural skeleton connection relation of the human body, and connecting the same skeleton points in the two frames of the space map to form the time sequence edge.
In the implementation process, the data are normalized by a batch normalization layer, the action features of the normalized data are extracted by a plurality of adaptive graph convolution network layers each cascaded with a temporal one-dimensional convolution, and the features are finally classified by a fully connected layer, so that the topological structure of the skeleton can be learned more accurately for different graph convolution layers and end-to-end skeleton samples, improving action recognition accuracy.
Optionally, the network structure of the motion recognition model includes a batch normalization layer, a plurality of adaptive graph convolution network layers, a plurality of time one-dimensional convolution layers, and a full connection layer;
the plurality of self-adaptive graph convolution network layers are respectively cascaded with the plurality of time one-dimensional convolutions one by one to form a plurality of network layers, and the batch normalization layer, the plurality of network layers and the full connection layer are sequentially connected;
the batch normalization layer is used for performing normalization processing on the space-time diagram;
the multiple self-adaptive graph convolution network layers and the multiple time one-dimensional convolutions are used for extracting action characteristics of the normalized data;
and the full connection layer is used for identifying the action characteristics.
Optionally, the plurality of network layers include a first network layer and second network layers; the first network layer is the L1 layer, and the second network layers are the network layers after the L1 layer. In each second network layer, the adaptive graph convolution network layer and the time one-dimensional convolution are joined by a residual connection.
In the implementation process, residual connections are added in the second network layers, and the sum of the shallow output and the deep output is used as the input of the next stage, so that gradients from the loss can reach shallow neurons in the model more easily, improving the stability of the model network.
Optionally, before generating the space-time diagram based on the bone point data and the timing edge of the video sequence, the method further comprises:
preprocessing the action data set based on random gradient descent, and setting an initial learning rate and a weight attenuation rate;
dividing the action data set into a training set and a verification set based on a cross-target division mode and a cross-view division mode, wherein the cross-target division mode is to divide the data set into the training set and the verification set, people in the training set and the verification set are different, the cross-view division mode is to divide the data set into the training set and the verification set, the training set comprises a plurality of videos shot by a first camera and a second camera, and the verification set comprises a plurality of videos shot by a third camera;
and training an initial motion recognition model based on the initial learning rate, the weight attenuation rate, the training set and the verification set to obtain the motion recognition model.
Optionally, after the identifying the human body action based on the target connection relation, the method further includes:
transmitting the recognition result of the model into a classifier, and performing behavior classification on the human body action;
and generating probability labels for the video sequence based on the behavior classification result, and taking the action category with the highest probability in the probability labels as a final identification result.
In the implementation process, the recognition result of the model is transmitted into the classifier, the classification result of the full-connection layer is normalized again, the probability label is generated, and the accuracy of bone motion recognition can be further improved.
In a second aspect, an embodiment of the present application provides a bone action recognition apparatus, including:
the partition determining module is used for generating a space-time diagram based on bone point data and a time sequence edge of a video sequence, transmitting the space-time diagram into a constructed action recognition model, configuring a plurality of partition numbers in an iterative mode to recognize the space-time diagram, and configuring the partition number with the highest recognition accuracy as a target configuration partition number, wherein the partition is a set formed by a plurality of bone points in the action recognition model;
the connection relation generation module is used for generating a target connection relation of the skeleton points based on the number of the target configuration partitions so as to replace a predefined connection relation expressed based on an adjacency matrix in the action recognition model;
and the identification module is used for identifying the human body action based on the target connection relation.
In the implementation process, the partition number with the highest accuracy is screened out in an iterative manner as the target configuration partition number, and the skeleton point connection relations are acquired adaptively based on the target configuration partition number to replace the preset natural connection relations of human joints. This removes the constraint of the natural connection relations of human skeleton points, improves the adaptive learning capability of the adjacency matrix and the network performance of the model, and adapts to the varying characteristics of different actions, thereby improving the action recognition accuracy of the skeleton action recognition model.
Optionally, the bone motion recognition device may further include:
and the frame data acquisition module is used for acquiring frame data from the video sequence before generating a space-time diagram based on the bone point data and the time sequence edge of the video sequence, and acquiring the bone point data on a frame image corresponding to the frame data, wherein the bone point data comprises coordinate data of each bone point.
And the construction module is used for constructing a space map based on the natural skeleton connection relation of the human body and connecting the same skeleton points in the two frames of the space map to form the time sequence edge.
In the implementation process, the data are normalized by a batch normalization layer, the action features of the normalized data are extracted by a plurality of adaptive graph convolution network layers each cascaded with a temporal one-dimensional convolution, and the features are finally classified by a fully connected layer, so that the topological structure of the skeleton can be learned more accurately for different graph convolution layers and end-to-end skeleton samples, improving action recognition accuracy.
Optionally, the bone motion recognition device may further include:
and the preprocessing module is used for preprocessing the action data set based on random gradient descent and setting an initial learning rate and a weight attenuation rate.
The dividing module is used for dividing the action data set into a training set and a verification set based on a cross-target dividing mode and a cross-visual angle dividing mode, wherein the cross-target dividing mode is to divide the data set into the training set and the verification set, people in the training set and the verification set are different, the cross-visual angle dividing mode is to divide the data set into the training set and the verification set, the training set comprises a plurality of videos shot by a first camera and a second camera, and the verification set comprises a plurality of videos shot by a third camera.
And the training module is used for training an initial motion recognition model based on the initial learning rate, the weight attenuation rate, the training set and the verification set so as to obtain the motion recognition model.
Optionally, the bone motion recognition device may further include:
the classification module is used for transmitting the recognition result of the model into a classifier after the human body action is recognized based on the target connection relation, and classifying the human body action; and generating probability labels for the video sequence based on the behavior classification result, and taking the action category with the highest probability in the probability labels as a final identification result.
In the implementation process, the recognition result of the model is transmitted into the classifier, the classification result of the full-connection layer is normalized again, the probability label is generated, and the accuracy of bone motion recognition can be further improved.
In a third aspect, an embodiment of the present application provides an electronic device, where the electronic device includes a memory and a processor, where the memory stores program instructions, and the processor executes steps in any one of the foregoing implementation manners when reading and executing the program instructions.
In a fourth aspect, an embodiment of the present application further provides a storage medium, where the readable storage medium stores computer program instructions, and the computer program instructions are read by a processor and executed to perform the steps in any of the foregoing implementation manners.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
FIG. 1 is a schematic diagram of a network structure of a dual-stream adaptive graph convolution network;
FIG. 2 is a schematic step diagram of a method for bone motion recognition according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the experimental results on the necessity of the adjacency matrix A_k provided in the embodiment of the present application;
fig. 4 is a schematic diagram of a model network structure provided in an embodiment of the present application;
fig. 5 is an experimental result diagram of the action recognition accuracy rates of different partition numbers according to the embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a step of obtaining skeleton point data and a time-series edge of a video sequence according to an embodiment of the present application;
fig. 7 is a schematic network structure diagram of a motion recognition model provided in an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating a process of training a skeletal motion recognition model according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram illustrating a step of classifying human body actions by using a classifier according to an embodiment of the present application;
FIG. 10 is a graph showing experimental results of comparative experiments provided in examples of the present application;
fig. 11 is a schematic view of a bone motion recognition device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. For example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In the research process, the applicant found that the field of skeleton point action recognition currently generally adopts the two-stream adaptive graph convolution network (2s-AGCN), in which skeleton point data are fixedly divided into three partitions, namely root nodes, centripetal points, and centrifugal points, according to the natural connection relations of human joints using the traditional spatial configuration partitioning strategy. The three configuration partitions are each represented by an adjacency matrix A_k, k ∈ {1, 2, 3}, expressing the natural connection relations of the human body; 2s-AGCN adaptively learns connection relations that do not exist and attempts to learn the data correlations between samples.
However, since 2s-AGCN uses a partitioning scheme with a fixed spatial configuration and requires manual setting of the connection relations between the skeletal joint points, it cannot adapt to the varying characteristics of different actions, nor can the partitioning strategy be guaranteed to be optimal. In practice, the characteristics of the skeleton data differ across action types, so the number of skeleton node configuration partitions should be adjusted accordingly; a fixed number of configuration partitions limits the model's ability to recognize skeleton data.
Specifically, referring to fig. 1, fig. 1 is a schematic diagram of the network structure of the two-stream adaptive graph convolution network, where f_in is the network input, f_out is the network output, K_v is the number of partitions, Res denotes the residual connection, C_in is the number of input channels, C_e is an intermediate channel number, C_out is the number of output channels, T is the sequence length, N is the number of human joint points, A_k denotes the human-joint adjacency matrix, B_k denotes a parameter learning matrix for learning connections other than the natural connections of human joints, C_k is an attention map between sample data, softmax is the similarity function, and θ_k and φ_k are the weights of the two Gaussian embedding functions, each a two-dimensional convolution with a 1 × 1 convolution kernel.
In 2s-AGCN, each partition, i.e., each convolution operator (adjacency matrix), is constructed manually according to the natural connection relations of human joints, and the three partitions must be manually limited to the root node set, the centripetal set, and the centrifugal set. When the number of partitions is greater than 3, the natural connection relations of human joints can no longer be fully utilized, which limits the adaptive learning capability of the adjacency matrix to a certain extent and affects the recognition accuracy of skeleton actions.
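For illustration, the following is a minimal PyTorch sketch of such an adaptive graph convolution unit, combining a fixed adjacency A_k, a learnable matrix B_k, and a data-dependent attention map C_k as described above. The class name, embedding width, and initialization details are assumptions for this sketch, not the patent's exact implementation; in the method of this application, the fixed A term is discarded, B is initialized with 0.000001, and the number of partitions is configurable.

```python
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    """Adaptive graph convolution over skeleton joints (after 2s-AGCN):
    for each partition k, the effective adjacency is A_k + B_k + C_k, where
    A_k is fixed, B_k is learned, and C_k is computed from the data by two
    Gaussian embedding functions (1 x 1 convolutions) and a softmax."""
    def __init__(self, c_in, c_out, A, c_embed=16):
        super().__init__()
        k = A.shape[0]                                   # number of partitions
        self.A = nn.Parameter(A.clone(), requires_grad=False)
        self.B = nn.Parameter(torch.full_like(A, 1e-6))  # learnable adjacency
        self.theta = nn.ModuleList(nn.Conv2d(c_in, c_embed, 1) for _ in range(k))
        self.phi = nn.ModuleList(nn.Conv2d(c_in, c_embed, 1) for _ in range(k))
        self.conv = nn.ModuleList(nn.Conv2d(c_in, c_out, 1) for _ in range(k))

    def forward(self, x):                                # x: (batch, C_in, T, N)
        b, c, t, n = x.shape
        out = 0
        for k in range(self.A.shape[0]):
            q = self.theta[k](x).permute(0, 3, 1, 2).reshape(b, n, -1)
            v = self.phi[k](x).reshape(b, -1, n)
            C = torch.softmax(torch.bmm(q, v), dim=-1)   # (batch, N, N) attention
            adj = self.A[k] + self.B[k] + C              # combined adjacency
            out = out + self.conv[k](torch.einsum('bctn,bnm->bctm', x, adj))
        return out
```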
Therefore, an embodiment of the present application provides a skeleton action recognition method to solve the problem that the fixed spatial configuration partitions used by the action recognition model easily limit the adaptive learning capability of the adjacency matrix and affect the recognition accuracy of skeleton actions. Please refer to fig. 2, which is a schematic diagram of the steps of the skeleton action recognition method provided in the embodiment of the present application; the method may include the following steps:
in step S22, a space-time diagram is generated based on the bone point data and the time-sequence edge of the video sequence, and the space-time diagram is introduced into the constructed motion recognition model, and the space-time diagram is recognized by allocating a plurality of partitions in an iterative manner, where the number of partitions with the highest recognition accuracy is used as the number of target allocation partitions, and the partitions are a set formed by a plurality of bone points in the motion recognition model.
The number of iterations can be set according to the hardware conditions on which the method is implemented, and is not limited in this application. Specifically, the embodiment of the present application takes 7 iterations as an example, recognizing the space-time graph with partition numbers from 1 to 7 respectively.
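A sketch of this iterative search, assuming a hypothetical train_and_evaluate() helper that trains the recognition model with a given partition number and returns its validation accuracy:

```python
def search_partition_number(train_set, val_set, max_partitions=7):
    """Iteratively configure 1..max_partitions partitions and keep the number
    with the highest recognition accuracy as the target configuration.

    train_and_evaluate() is a hypothetical helper assumed to train the action
    recognition model with k partitions and return its validation accuracy."""
    best_k, best_acc = 1, 0.0
    for k in range(1, max_partitions + 1):
        acc = train_and_evaluate(train_set, val_set, num_partitions=k)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k   # the target configuration partition number
```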
In step S23, a target connection relation of the bone points is generated based on the target configuration partition number to replace the predefined connection relation based on the adjacency matrix representation in the motion recognition model.
In step S24, a human body action is recognized based on the target connection relationship.
Because 2s-AGCN uses a scheme of three fixed spatial configuration partitions, the adaptive learning capability of the adjacency matrix is limited. Therefore, the embodiment of the present application abandons the adjacency matrix predefined in the skeleton action recognition model and uses the target configuration partition number obtained after iteration to adaptively acquire the connection relations of the skeleton points, replacing the preset natural connection relations of human joints.
Specifically, please refer to fig. 3 and fig. 4 in combination. Fig. 3 shows the experimental results on the necessity of the adjacency matrix A_k provided by the embodiment of the present application, and fig. 4 is a schematic diagram of the model network structure provided in the embodiment of the present application, where BC denotes initializing the adaptive adjacency matrix B_k with the adjacency matrix A_k, BC/A denotes initializing B_k with 0.000001, and 3P and 5P denote partition numbers 3 and 5, respectively (the 2 partitions redundant to the BC method are initialized with A_2 and A_3, and the 2 partitions redundant to the BC/A method are initialized with 0.000001). From the experimental results it can be found that when the number of partitions is 3, initializing B_k with the adjacency matrix A_k gives the highest network recognition accuracy, i.e., the natural connection constraint of human joints promotes network recognition performance; but when the number of partitions is not 3, the adjacency matrix A_k needs to be reconfigured. Therefore, this embodiment discards the adjacency matrix A_k and initializes B_k with 0.000001.
Illustratively, the 2s-AGCN method uses a strategy of 3 fixed configuration partitions. The applicant supposed that the classification performance of the network model would improve as the number of partitions increases, and therefore ran an experiment on the NTU-RGBD data set. Please refer to fig. 5, which shows the experimental action recognition accuracy for different partition numbers provided in the embodiment of the present application. The experimental results show that when the number of configuration partitions is small, the recognition accuracy of the network model increases with the number of configuration partitions; the network performance reaches its optimum when the number of configuration partitions is 5, and starts to decrease as the number of partitions continues to increase.
Therefore, the partition number with the highest accuracy is screened out in an iterative manner as the target configuration partition number, and the skeleton point connection relations are acquired adaptively based on the target configuration partition number to replace the preset natural connection relations of human joints. This removes the constraint of the natural connection relations of human skeleton points, improves the adaptive learning capability of the adjacency matrix and the network performance of the model, and adapts to the varying characteristics of different actions, thereby improving the action recognition accuracy of the skeleton action recognition model.
In an alternative embodiment, before step S22, an implementation step of obtaining bone point data and a timing edge of a video sequence is provided in the embodiment of the present application, please refer to fig. 6, where fig. 6 is a schematic diagram of a step of obtaining bone point data and a timing edge of a video sequence provided in the embodiment of the present application, and the step may include:
in step S211, frame data is obtained from the video sequence, and the bone point data is obtained on a frame image corresponding to the frame data, the bone point data including coordinate data of each bone point.
In step S212, a spatial map is constructed based on the natural skeleton connection relationship of the human body, and the same skeleton points in the spatial map are connected to form the time-series edge.
Frame data of multiple frames may be obtained from the video sequence by extracting one frame per preset number of frames, or by taking all frames of a video segment in the video sequence. The same skeleton points in the spatial maps of two adjacent extracted frames are connected to form the time-sequence edges; the key skeleton points of all input frames form the node set (Node Set), and all time-sequence edges form the edge set (Edge Set), which together constitute the required space-time graph.
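The construction can be sketched as follows; the joint count and bone list are dataset-specific assumptions and are abbreviated here:

```python
import numpy as np

N_JOINTS = 25                                # e.g. NTU-RGBD skeletons; assumed
BONES = [(0, 1), (1, 20), (20, 2), (2, 3)]   # natural connections, abbreviated

def build_spatiotemporal_graph(frames):
    """frames: list of (N_JOINTS, 3) joint-coordinate arrays, one per frame.

    Returns the node set (all joints of all input frames) and the edge set
    (spatial edges from the natural skeleton within each frame, plus
    time-sequence edges linking the same joint in adjacent frames)."""
    nodes = np.stack(frames)                 # (T, N_JOINTS, 3)
    edges = []
    for t in range(len(frames)):
        edges += [((t, i), (t, j)) for i, j in BONES]                 # spatial
        if t + 1 < len(frames):
            edges += [((t, v), (t + 1, v)) for v in range(N_JOINTS)]  # temporal
    return nodes, edges
```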
Optionally, please refer to fig. 7, where fig. 7 is a schematic diagram of a network structure of a motion recognition model provided in the embodiment of the present application, where the network structure of the motion recognition model includes a batch normalization layer, a plurality of adaptive graph convolution network layers, a plurality of time one-dimensional convolution layers, and a full connection layer;
the plurality of self-adaptive graph convolution network layers are respectively cascaded with the plurality of time one-dimensional convolutions one by one to form a plurality of network layers, and the batch normalization layer, the plurality of network layers and the full connection layer are sequentially connected; the batch normalization layer is used for performing normalization processing on the space-time diagram; the multiple self-adaptive graph convolution network layers and the multiple time one-dimensional convolutions are used for extracting action characteristics of the normalized data; and the full connection layer is used for identifying the action characteristics.
Wherein, X is the input of the model, BN is the batch normalization layer, MP-AGCN is the self-adaptive graph convolution network layer, TCN is the time one-dimensional convolution, FC is the full connection layer, L1, L2, L10 and the like represent the network layer of the motion recognition model, and the part in the dotted line represents the network layer formed by the self-adaptive graph convolution network layer and the time one-dimensional convolution cascade connection.
Therefore, in the network structure of the action recognition model provided by the embodiment of the application, the data are normalized by the batch normalization layer, the action features of the normalized data are extracted by the plurality of adaptive graph convolution network layers cascaded with temporal one-dimensional convolutions, and the features are finally classified by the fully connected layer, so that the topological structure of the skeleton can be learned more accurately for different graph convolution layers and end-to-end skeleton samples, improving action recognition accuracy.
Further, the plurality of network layers include a first network layer located at the L1 layer and a second network layer that is a network layer after the L1 layer, in which the adaptive graph convolution network layer is connected with the time one-dimensional convolution residual.
Referring to fig. 7, Res in fig. 7 represents the residual connection between the adaptive graph convolution network layer and the time one-dimensional convolution, L1 is the first network layer, and L2 to L10 are the second network layers.
Therefore, residual connections are added in the second network layers, and the sum of the shallow output and the deep output is used as the input of the next stage, so that gradients from the loss can reach shallow neurons in the model more easily, improving the stability of the model network.
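A minimal sketch of this layout, reusing the AdaptiveGraphConv unit sketched earlier; the channel sizes, the temporal kernel size of 9, and the three-stage depth are assumptions for illustration (fig. 7 shows ten network layers):

```python
import torch.nn as nn

class GraphTemporalBlock(nn.Module):
    """One network layer: adaptive graph convolution cascaded with a temporal
    one-dimensional convolution (a (9, 1) Conv2d over the time axis), with a
    residual connection in the second network layers."""
    def __init__(self, c_in, c_out, A, residual):
        super().__init__()
        self.gcn = AdaptiveGraphConv(c_in, c_out, A)
        self.tcn = nn.Sequential(
            nn.Conv2d(c_out, c_out, kernel_size=(9, 1), padding=(4, 0)),
            nn.BatchNorm2d(c_out),
            nn.ReLU())
        self.res = nn.Conv2d(c_in, c_out, 1) if residual else None

    def forward(self, x):
        y = self.tcn(self.gcn(x))
        return y + self.res(x) if self.res is not None else y

class SkeletonActionModel(nn.Module):
    """BN -> stacked (graph conv + temporal conv) layers -> FC classifier."""
    def __init__(self, num_class, num_joints, A, channels=(64, 128, 256)):
        super().__init__()
        self.bn = nn.BatchNorm1d(3 * num_joints)   # normalize joint coordinates
        blocks, c_prev = [], 3
        for i, c in enumerate(channels):
            blocks.append(GraphTemporalBlock(c_prev, c, A, residual=(i > 0)))
            c_prev = c
        self.layers = nn.Sequential(*blocks)
        self.fc = nn.Linear(c_prev, num_class)

    def forward(self, x):                          # x: (batch, 3, T, N)
        b, c, t, n = x.shape
        x = self.bn(x.permute(0, 1, 3, 2).reshape(b, c * n, t))
        x = x.reshape(b, c, n, t).permute(0, 1, 3, 2)
        x = self.layers(x)
        return self.fc(x.mean(dim=(2, 3)))         # pool over time and joints
```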
In addition, an implementation step of training a bone motion recognition model is further provided in an embodiment of the present application, please refer to fig. 8, fig. 8 is a schematic diagram of the step of training the bone motion recognition model provided in the embodiment of the present application, and before step S22, the method may include the following steps:
in step S81, the motion data set is preprocessed based on the random gradient descent, and an initial learning rate and a weight attenuation rate are set.
In step S82, the motion data set is divided into a training set and a verification set based on a cross-target division manner and a cross-view division manner, where the cross-target division manner is to divide the data set into the training set and the verification set, people in the training set and the verification set are different, the cross-view division manner is to divide the data set into the training set and the verification set, the training set includes a plurality of videos captured by the first camera and the second camera, and the verification set includes a plurality of videos captured by the third camera.
In step S83, an initial motion recognition model is trained based on the initial learning rate, the weight decay rate, the training set, and the verification set to obtain the motion recognition model.
Illustratively, the embodiment of the present application selects NTU-RGBD and Kinetics-Skeleton as experimental data sets. NTU-RGBD has 60 action categories and contains 56000 action samples; each action sample in the data set is composed of a series of skeleton action frames, each frame contains at most two skeletons, each skeleton has 25 skeleton nodes, and each skeleton node has corresponding three-dimensional space coordinate data. The Kinetics-Skeleton data set has 400 action categories and contains 300000 action samples; each action sample is likewise composed of a series of skeleton action frames, each frame contains at most two skeletons, each skeleton has 18 skeleton nodes, and each skeleton node has corresponding three-dimensional space coordinate data.
First, the two data sets are preprocessed. A stochastic gradient descent optimization strategy is used so that each data update proceeds in the correct direction and the training converges to an extreme point. The initial learning rate and the weight decay rate are set to 0.1 and 0.0001, respectively, and the batch sizes are set to 32 and 64 on the NTU-RGBD and Kinetics-Skeleton data sets, respectively. On the NTU-RGBD data set, the learning rate is divided by 10 at the 30th and 40th epochs, and the number of training epochs is set to 50. On the Kinetics-Skeleton data set, the learning rate is divided by 10 at the 45th and 55th epochs, and the number of training epochs is set to 65. The initial action recognition model is trained with the initial learning rate, the weight decay rate, the training set, and the verification set, finally obtaining the trained action recognition model.
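Under the settings above, a training-loop sketch for NTU-RGBD might look as follows; the momentum value, model construction, adjacency tensor A, and data loader are assumptions not specified in the text:

```python
import torch

# Assumed: SkeletonActionModel and the adjacency tensor A from the sketches
# above, and a train_loader yielding (skeletons, labels) batches of size 32.
model = SkeletonActionModel(num_class=60, num_joints=25, A=A)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9,      # assumed; text gives lr and decay only
                            weight_decay=0.0001)
# NTU-RGBD schedule: divide the learning rate by 10 at epochs 30 and 40.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[30, 40], gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(50):                        # 50 training epochs on NTU-RGBD
    for skeletons, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(skeletons), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```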
In an alternative embodiment, after step S24, the present application further provides an implementation step of classifying the human motion by using a classifier, please refer to fig. 9, where fig. 9 is a schematic diagram illustrating the step of classifying the human motion by using a classifier according to the present application, and the step may include the following steps:
in step S91, the recognition result of the model is transmitted to the classifier, and the human body action is classified.
In step S92, probability labels are generated for the video sequence based on the result of the behavior classification, and the action category with the highest probability in the probability labels is used as the final recognition result.
The embodiment of the application adopts a SoftMax classifier to recognize and classify the human body actions. The SoftMax classifier can be understood as the generalization of the logistic regression classifier to multiple classes, and its formula is:

L_i = e^{z_i} / Σ_j e^{z_j}

where L is the generated probability label and z_i is the score of the i-th action class output by the fully connected layer. The categories are sorted by action probability, and the category with the highest probability is selected as the final recognition result.
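For illustration, a minimal sketch of this step, applying SoftMax to the fully connected layer outputs and taking the class with the highest probability:

```python
import torch

def classify(logits):
    """Apply the SoftMax classifier to the fully connected layer outputs:
    probabilities L_i = exp(z_i) / sum_j exp(z_j); the class with the
    highest probability is the final recognition result."""
    probs = torch.softmax(logits, dim=-1)   # probability label for the sequence
    return probs, probs.argmax(dim=-1)      # (probabilities, predicted class)
```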
Specifically, the applicant compared the method with some current mainstream methods, including CNN-based and GCN-based methods, on the NTU-RGBD and Kinetics-Skeleton data sets. Please refer to fig. 10, which is a graph of the comparative experimental results provided in the embodiment of the present application. The skeleton action recognition method provided by the embodiment of the present application achieves higher action recognition accuracy.
Therefore, the recognition result of the model is transmitted into the classifier, the classification result of the full connection layer is normalized again, the probability label is generated, and the accuracy of bone motion recognition can be further improved.
Based on the same inventive concept, an embodiment of the present application further provides a bone motion recognition apparatus 110, please refer to fig. 11, where fig. 11 is a schematic diagram of the bone motion recognition apparatus provided in the embodiment of the present application, and the bone motion recognition apparatus 110 may include:
the partition determining module 111 is configured to generate a space-time diagram based on bone point data and a time sequence edge of a video sequence, transmit the space-time diagram into a constructed action recognition model, configure a plurality of partition numbers in an iterative manner to recognize the space-time diagram, and configure the partition number with the highest recognition accuracy as a target configuration partition number, where the partition is a set formed by a plurality of bone points in the action recognition model.
A connection relation generating module 112, configured to generate a target connection relation of the bone points based on the target configuration partition number to replace a predefined connection relation based on the adjacency matrix representation in the motion recognition model.
And the identification module 113 is configured to identify a human body action based on the target connection relationship.
In an alternative embodiment, the bone motion recognition device 110 may further comprise:
and the frame data acquisition module is used for acquiring frame data from the video sequence before generating a space-time diagram based on the bone point data and the time sequence edge of the video sequence, and acquiring the bone point data on a frame image corresponding to the frame data, wherein the bone point data comprises coordinate data of each bone point.
And the construction module is used for constructing a space map based on the natural skeleton connection relation of the human body and connecting the same skeleton points in the two frames of the space map to form the time sequence edge.
Optionally, the bone motion recognition device 110 may further comprise:
and the preprocessing module is used for preprocessing the action data set based on random gradient descent and setting an initial learning rate and a weight attenuation rate.
The dividing module is used for dividing the action data set into a training set and a verification set based on a cross-target dividing mode and a cross-visual angle dividing mode, wherein the cross-target dividing mode is to divide the data set into the training set and the verification set, people in the training set and the verification set are different, the cross-visual angle dividing mode is to divide the data set into the training set and the verification set, the training set comprises a plurality of videos shot by a first camera and a second camera, and the verification set comprises a plurality of videos shot by a third camera.
And the training module is used for training an initial motion recognition model based on the initial learning rate, the weight attenuation rate, the training set and the verification set so as to obtain the motion recognition model.
In an alternative embodiment, the bone motion recognition device 110 may further comprise:
the classification module is used for transmitting the recognition result of the model into a classifier after the human body action is recognized based on the target connection relation, and classifying the human body action; and generating probability labels for the video sequence based on the behavior classification result, and taking the action category with the highest probability in the probability labels as a final identification result.
Based on the same inventive concept, an embodiment of the present application further provides an electronic device, where the electronic device includes a memory and a processor, where the memory stores program instructions, and the processor executes the steps in any one of the above implementation manners when reading and executing the program instructions.
Based on the same inventive concept, an embodiment of the present application further provides a storage medium, where the readable storage medium stores computer program instructions, and the computer program instructions are read by a processor and executed to perform the steps in any of the above implementation manners.
The storage medium may be a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), or another medium capable of storing program code. The storage medium stores a program, and the processor executes the program after receiving an execution instruction.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
Alternatively, all or part of the implementation may be in software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part.
The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.).
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A bone action recognition method is characterized by comprising the following steps:
generating a space-time diagram based on skeleton point data and a time sequence edge of a video sequence, transmitting the space-time diagram into a constructed action recognition model, configuring the number of a plurality of partitions in an iterative mode to recognize the space-time diagram, and configuring the number of the partitions with the highest recognition accuracy as a target configuration partition number, wherein the partitions are a set formed by a plurality of skeleton points in the action recognition model;
generating target connection relations of skeleton points based on the target configuration partition number to replace predefined connection relations based on the adjacency matrix representation in the action recognition model;
and identifying the human body action based on the target connection relation.
2. The method of claim 1, wherein prior to generating a space-time diagram based on bone point data and temporal edges of a video sequence, the method further comprises:
acquiring frame data from the video sequence, and acquiring the bone point data on a frame image corresponding to the frame data, wherein the bone point data comprises coordinate data of each bone point;
and constructing a space map based on the natural skeleton connection relation of the human body, and connecting the same skeleton points in the two frames of the space map to form the time sequence edge.
3. The method of claim 1, wherein the network structure of the motion recognition model comprises a batch normalization layer, a plurality of adaptive graph convolution network layers, a plurality of time one-dimensional convolution and full connectivity layers;
the plurality of self-adaptive graph convolution network layers are respectively cascaded with the plurality of time one-dimensional convolutions one by one to form a plurality of network layers, and the batch normalization layer, the plurality of network layers and the full connection layer are sequentially connected;
the batch normalization layer is used for performing normalization processing on the space-time diagram;
the multiple self-adaptive graph convolution network layers and the multiple time one-dimensional convolutions are used for extracting action characteristics of the normalized data;
and the full connection layer is used for identifying the action characteristics.
4. The method of claim 3, wherein the plurality of network layers comprises a first network layer and a second network layer, the first network layer is located at layer L1, the second network layer is a network layer after layer L1, and in the second network layer, the adaptive graph convolution network layer is connected with the time one-dimensional convolution residual.
5. The method of claim 1, wherein prior to generating a space-time diagram based on bone point data and temporal edges of a video sequence, the method further comprises:
preprocessing the action data set based on random gradient descent, and setting an initial learning rate and a weight attenuation rate;
dividing the action data set into a training set and a verification set based on a cross-target division mode and a cross-view division mode, wherein the cross-target division mode is to divide the data set into the training set and the verification set, people in the training set and the verification set are different, the cross-view division mode is to divide the data set into the training set and the verification set, the training set comprises a plurality of videos shot by a first camera and a second camera, and the verification set comprises a plurality of videos shot by a third camera;
and training an initial motion recognition model based on the initial learning rate, the weight attenuation rate, the training set and the verification set to obtain the motion recognition model.
6. The method of claim 1, wherein after the identifying human actions based on the target connection relationship, the method further comprises:
transmitting the recognition result of the model into a classifier, and performing behavior classification on the human body action;
and generating probability labels for the video sequence based on the behavior classification result, and taking the action category with the highest probability in the probability labels as a final identification result.
7. A bone motion recognition device, comprising:
the partition determining module is used for generating a space-time diagram based on bone point data and a time sequence edge of a video sequence, transmitting the space-time diagram into a constructed action recognition model, configuring a plurality of partition numbers in an iterative mode to recognize the space-time diagram, and configuring the partition number with the highest recognition accuracy as a target configuration partition number, wherein the partition is a set formed by a plurality of bone points in the action recognition model;
the connection relation generation module is used for generating a target connection relation of the skeleton points based on the number of the target configuration partitions so as to replace a predefined connection relation expressed based on an adjacency matrix in the action recognition model;
and the identification module is used for identifying the human body action based on the target connection relation.
8. The apparatus of claim 7, further comprising:
the preprocessing module is used for preprocessing the action data set based on random gradient descent and setting an initial learning rate and a weight attenuation rate;
a data set dividing module, configured to divide the motion data set into a training set and a verification set based on a cross-target dividing manner and a cross-view dividing manner, where the cross-target dividing manner is to divide the data set into the training set and the verification set, characters in the training set and the verification set are different, the cross-view dividing manner is to divide the data set into the training set and the verification set, the training set includes multiple videos captured by a first camera and a second camera, and the verification set includes multiple videos captured by a third camera;
and the training module is used for training an initial motion recognition model based on the initial learning rate, the weight attenuation rate, the training set and the verification set so as to obtain the motion recognition model.
9. An electronic device comprising a memory having stored therein program instructions and a processor that, when executed, performs the steps of the method of any of claims 1-6.
10. A storage medium having stored thereon computer program instructions for executing the steps of the method according to any one of claims 1 to 6 when executed by a processor.
CN202111211761.8A 2021-10-18 2021-10-18 Skeleton action recognition method and device, electronic equipment and storage medium Active CN113963201B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111211761.8A CN113963201B (en) 2021-10-18 2021-10-18 Skeleton action recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111211761.8A CN113963201B (en) 2021-10-18 2021-10-18 Skeleton action recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113963201A true CN113963201A (en) 2022-01-21
CN113963201B CN113963201B (en) 2022-06-14

Family

ID=79464310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111211761.8A Active CN113963201B (en) 2021-10-18 2021-10-18 Skeleton action recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113963201B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019006473A1 (en) * 2017-06-30 2019-01-03 The Johns Hopkins University Systems and method for action recognition using micro-doppler signatures and recurrent neural networks
JP2019191981A (en) * 2018-04-26 2019-10-31 Kddi株式会社 Behavior recognition device, model construction device, and program
CN112101176A (en) * 2020-09-09 2020-12-18 元神科技(杭州)有限公司 User identity recognition method and system combining user gait information
CN112395945A (en) * 2020-10-19 2021-02-23 北京理工大学 Graph volume behavior identification method and device based on skeletal joint points
CN112381004A (en) * 2020-11-17 2021-02-19 华南理工大学 Framework-based double-flow self-adaptive graph convolution network behavior identification method
CN113408455A (en) * 2021-06-29 2021-09-17 山东大学 Action identification method, system and storage medium based on multi-stream information enhanced graph convolution network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LEI SHI, et al.: "Skeleton-Based Action Recognition With Multi-Stream Adaptive Graph Convolutional Networks", IEEE Transactions on Image Processing *
LEI SHI, et al.: "Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition", CVF Conference on Computer Vision and Pattern Recognition *
LIU Suolan, et al.: "Human Action Recognition Based on Associated Partitioning and ST-GCN", Computer Engineering and Applications *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114550308A (en) * 2022-04-22 2022-05-27 成都信息工程大学 Human skeleton action recognition method based on space-time diagram

Also Published As

Publication number Publication date
CN113963201B (en) 2022-06-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant