CN112651360B - Skeleton action recognition method under small sample - Google Patents


Info

Publication number
CN112651360B
CN112651360B (application CN202011616955.1A)
Authority
CN
China
Prior art keywords
skeleton
network
sequence
time
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011616955.1A
Other languages
Chinese (zh)
Other versions
CN112651360A (en)
Inventor
柯逍
杜鹏强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202011616955.1A priority Critical patent/CN112651360B/en
Publication of CN112651360A publication Critical patent/CN112651360A/en
Application granted granted Critical
Publication of CN112651360B publication Critical patent/CN112651360B/en
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a skeleton action recognition method under small samples, comprising the following steps. Step S1: construct a sequence-to-sequence generation network to generate skeleton motion sequences and build an enhanced data set. Step S2: construct a sequence quality evaluation network based on space-time graph convolution, solving the problem that poorly generated skeleton motion sequences would negatively affect the subsequent steps. Step S3: construct a skeleton action recognition network based on a multi-level skeleton segmentation algorithm. Step S4: integrate and use all the network models built in steps S1, S2 and S3: generate skeleton motion sequences with the generation network of step S1, filter out the poor-quality data with the quality evaluation network of step S2 to optimize the data set, and recognize skeleton actions with the recognition network of step S3 on the optimized data set. The method can be used for skeleton action recognition under small-sample conditions.

Description

Skeleton action recognition method under small sample
Technical Field
The invention relates to the technical field of data enhancement of pattern recognition and computer vision, in particular to a method for recognizing skeleton actions under a small sample.
Background
Computing is now ubiquitous in modern society and has had a tremendous impact on human life. Through computation, machines can analyze human-related data, draw constructive conclusions from it, and make people's lives easier. Computer vision is one of the fastest-developing fields of data analysis and plays an important role in many areas closely tied to human behavior, such as intelligent surveillance, smart homes and human-machine collaboration, which makes human behavior understanding based on computer vision especially important. A machine acquires data through visual sensors, detects and tracks people, analyzes and recognizes their movements, and combines the contextual information of the data to understand the purpose and semantics of actions. Traditional RGB digital video has various shortcomings for human action recognition: action recognition based on RGB image sequences suffers from heavy computation and poor robustness in the face of complex backgrounds and changes in human scale and viewpoint. Because a traditional RGB image can only describe information on a two-dimensional plane, graph-structured data representing three-dimensional information can describe motion more accurately and characterize the state of human movement more comprehensively from a three-dimensional perspective; compared with traditional digital video, human action recognition based on the 3D skeleton therefore has clear advantages and high value. However, action data are scarce in many specific scenes, and under such small-sample conditions the enhancement of human skeleton data is particularly important.
Disclosure of Invention
The invention provides a method for recognizing skeleton actions under a small sample, which can be used for recognizing skeleton actions under the condition of the small sample.
The invention adopts the following technical scheme.
A skeleton action recognition method under small samples, usable for skeleton action recognition under small-sample conditions, comprises the following steps:
step S1: construct a sequence-to-sequence generation network to generate skeleton motion sequences; because under small-sample conditions the human skeleton model is too complex for the limited data volume to support sequence generation, a bypass network model is used to strengthen the generation network's robustness under small samples, and an enhanced data set is finally constructed;
step S2: construct a sequence quality evaluation network based on space-time graph convolution to solve the problem that poorly generated skeleton motion sequences would negatively affect the subsequent steps; the evaluation network scores the generated sequences and optimizes the data set by filtering out the poor-quality ones;
step S3: construct a skeleton action recognition network based on a multi-level skeleton segmentation algorithm; the network model segments a skeleton motion sequence at three levels and feeds the segmented sequences into a graph convolutional neural network for feature fusion, so that the network extracts features of the skeleton motion sequence from different angles and becomes more robust to skeleton motion sequence data;
step S4: integrate and use all the network models constructed in steps S1, S2 and S3: generate skeleton motion sequences with the skeleton sequence generation network of step S1, filter out the poor-quality data with the skeleton sequence quality evaluation network of step S2 to optimize the data set, and recognize skeleton actions with the skeleton action recognition network of step S3 based on the optimized data set.
The step S1 specifically includes the following steps;
step S11: construct the skeleton motion sequence generation network with a sequence-to-sequence architecture by connecting two recurrent neural networks in series; the former acts as the encoder and the latter as the decoder, and each is built from 1024 gated recurrent units. The encoder is expressed as:

$x_{te} = E(x_{te-1},\, h_{te-1})$   (Formula 1)

where $E$ denotes the encoder, $x_{te-1}$ is the input at time $te$, $x_{te}$ is the output at time $te$ and also serves as the input at time $te+1$, and $h_{te-1}$ denotes the encoder state at time $te$.

The decoder is expressed as:

$y_{td} = D(y_{td-1},\, s_{td-1})$   (Formula 2)

where $D$ denotes the decoder, $y_{td-1}$ is the input at time $td$, $y_{td}$ is the output at time $td$ and also serves as the input at time $td+1$, and $s_{td-1}$ denotes the decoder state at time $td$;

step S12: on top of the original decoder structure, add a residual connection between each input and output, expressed as:

$y_{td} = D(y_{td-1},\, s_{td-1}) + y_{td-1}$   (Formula 3)

where $D$, $y_{td-1}$, $y_{td}$ and $s_{td-1}$ have the same meanings as in step S11; adding the residual structure changes only how the output is computed;
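As a concrete illustration of steps S11 and S12, the following is a minimal PyTorch sketch of the GRU encoder-decoder with the residual connection of Formula 3; the class and argument names are ours, and only the 1024-unit hidden size comes from the text:

```python
import torch
import torch.nn as nn

class SkeletonSeq2Seq(nn.Module):
    """Sketch of the generator in steps S11-S12: a GRU encoder,
    a GRU decoder, and a residual connection around each decoder step."""
    def __init__(self, pose_dim: int, hidden: int = 1024):
        super().__init__()
        self.encoder = nn.GRU(pose_dim, hidden, batch_first=True)   # E in Formula 1
        self.decoder_cell = nn.GRUCell(pose_dim, hidden)            # D in Formulas 2-3
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, seed_seq: torch.Tensor, gen_len: int) -> torch.Tensor:
        # Encode the observed prefix; the final hidden state initialises the decoder.
        _, h = self.encoder(seed_seq)          # h: (1, B, hidden)
        s = h.squeeze(0)
        y = seed_seq[:, -1]                    # last observed pose starts decoding
        outputs = []
        for _ in range(gen_len):
            s = self.decoder_cell(y, s)
            y = self.out(s) + y                # residual connection (Formula 3)
            outputs.append(y)
        return torch.stack(outputs, dim=1)     # (B, gen_len, pose_dim)
```

For a 25-joint skeleton with 3D coordinates one would instantiate it as `SkeletonSeq2Seq(pose_dim=75)`.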
step S13: construct a bypass network, also with a sequence-to-sequence architecture. The bypass network is a sequence generation network whose encoder and decoder are each built from 256 gated recurrent units. During network training, the skeleton motion sequence generation network splits off the backbone-skeleton part of its input as the input of the bypass network; during sequence generation, the backbone skeleton output by the bypass network is embedded into the input of the skeleton motion sequence generation network, so that it guides and corrects the generation of the whole skeleton model. The backbone skeleton consists of the mid-hip point, the left hip and the right hip of the human skeleton;

step S14: train the skeleton motion sequence generation network and the bypass network separately; both use an initial training learning rate of 0.005, a learning-rate decay of 0.95 and 10000 iterations;

step S15: combine the bypass network with the skeleton motion sequence generation network, that is, embed the output of the bypass network into the skeleton motion sequence generation network. The whole architecture is coupled by Formula 4, which is reproduced only as an image in the original; its variables are: $Mn$, the skeleton motion sequence generation (main) network; $Bp$, the bypass network; $sx_t$, the main network's output for the backbone-skeleton part at time $t$; $sp_t$, the main network's output for the remaining joints at time $t$; $p_t$, the output of the remaining joints after the residual structure; and $bx_t$, the bypass network's output for the backbone-skeleton part at time $t$. The parts $p_t$ and $bx_t$ are integrated and emitted as the output of the whole model.
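Because Formula 4 survives only as an image, its exact form is not recoverable; the sketch below shows one plausible reading of step S15, in which the bypass prediction overwrites the main network's backbone joints at every generated step. All names and the coupling itself are assumptions:

```python
import torch

def combine_step(main_net, bypass_net, x_t, backbone_idx, rest_idx, p_prev):
    """One generation step of the combined model of step S15 (a sketch:
    Formula 4 is an image in the source, so the exact coupling is assumed).
    The bypass network predicts the backbone joints (mid/left/right hip);
    the main network predicts the rest, with a residual over time."""
    sx_t, sp_t = main_net(x_t)                 # main-network outputs: backbone / rest
    p_t = sp_t + p_prev                        # residual structure on the rest
    bx_t = bypass_net(x_t[..., backbone_idx])  # bypass output for the backbone part
    out = torch.empty_like(x_t)
    out[..., backbone_idx] = bx_t              # bypass guides/corrects the backbone
    out[..., rest_idx] = p_t
    return out, p_t
```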
The step S2 specifically includes the following steps:
step S21: construct the negative-sample part of the quality assessment data set. The low-quality skeleton motion sequences generated by the partially trained model from step S1 serve as the negative samples; low-quality skeleton action sequences are samples with rigid motion, movement angles that violate objective physical laws, and the like;
step S22: construct the positive-sample part of the quality assessment data set using time-domain motion sequence interpolation, which models the motion trajectory of the pose between two adjacent frames of the same sequence:

(Formula 5, reproduced only as an image in the original, spherically interpolates the joint vector from $tq_1$ to $tq_2$ through the rotation angle $t\theta$)

where $tq_1$ and $tq_2$ are the joint vectors of the same joint point in two adjacent frames of the same skeleton motion sequence, $tq$ is the result of the time-domain motion sequence interpolation, and $t\theta$ is the rotation angle from joint vector $tq_1$ to joint vector $tq_2$;
step S23: construct the positive-sample part of the quality assessment data set using space-domain motion sequence interpolation; space-domain interpolation interpolates the spatial coordinates of the same joint point across two different motion poses:

(Formula 6, reproduced only as an image in the original, spherically interpolates the joint vector from $sq_1$ to $sq_2$ through the rotation angle $s\theta$ with weight $\omega$)

where $sq_1$ and $sq_2$ are the joint vectors of the same joint point in two different skeleton motion sequences, $sq$ is the result of the space-domain motion sequence interpolation, $s\theta$ is the rotation angle from joint vector $sq_1$ to joint vector $sq_2$, and $\omega$ is the weight of $sq_2$ in the interpolation result;
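Formulas 5 and 6 survive only as images, but their variable lists (a rotation angle between two joint vectors and, in the spatial case, a weight ω for the second vector) match standard spherical linear interpolation; the following is a sketch under that assumption:

```python
import numpy as np

def slerp(q1: np.ndarray, q2: np.ndarray, w: float) -> np.ndarray:
    """Spherical linear interpolation between two joint vectors (steps
    S22-S23). Formulas 5-6 are images in the source; the variable lists
    match standard slerp, so this is a sketch under that assumption."""
    q1 = q1 / np.linalg.norm(q1)
    q2 = q2 / np.linalg.norm(q2)
    theta = np.arccos(np.clip(np.dot(q1, q2), -1.0, 1.0))  # rotation angle
    if theta < 1e-6:                       # nearly parallel: fall back to lerp
        return (1 - w) * q1 + w * q2
    return (np.sin((1 - w) * theta) * q1 + np.sin(w * theta) * q2) / np.sin(theta)

# Time-domain positive sample: interpolate the same joint between adjacent
# frames t and t+1; space-domain: the same joint across two sequences.
# e.g. tq = slerp(seq[t, j], seq[t + 1, j], 0.5)
```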
step S24: integrating the skeleton motion sequence data obtained in the steps S21, S22 and S23 to obtain a quality evaluation data set;
step S25: construct the graph-convolution-based skeleton motion sequence quality evaluation network: a six-layer spatio-temporal graph convolutional neural network. The first and second layers have 64 channels with stride 1; the third layer has 128 channels with stride 2; the fourth layer has 128 channels with stride 1; the fifth layer has 256 channels with stride 2; the sixth layer is a fully connected layer. The network is trained on the quality assessment data set constructed in step S24 with an initial learning rate of 0.001, parameter decay rates of 0.95, a batch size of 64, and 80 epochs in total.
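Since step S25 describes a plain channel/stride schedule, it can be written down compactly; the sketch below (framework, class and variable names are ours, not the patent's) also shows a scoring head matching the 0.8 rejection threshold used later in step S42:

```python
import torch
import torch.nn as nn

# (out_channels, temporal stride) for the five ST-GCN blocks of step S25.
EVAL_CFG = [(64, 1), (64, 1), (128, 2), (128, 1), (256, 2)]

class QualityHead(nn.Module):
    """Sixth (fully connected) layer of the evaluator: pools the block
    features and emits a quality score in [0, 1], so that sequences
    scoring below 0.8 can be rejected in step S42."""
    def __init__(self, in_channels: int = 256):
        super().__init__()
        self.fc = nn.Linear(in_channels, 1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, T, N) output of the last ST-GCN block.
        pooled = feat.mean(dim=(2, 3))        # global average over time and joints
        return torch.sigmoid(self.fc(pooled))
```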
The step S3 specifically includes the following steps:
step S31: construct the human skeleton motion space-time graph; the graph contains $N$ joint points, which form the set $V = \{\, v_i^t \mid i = 1, \dots, N \,\}$ over all frames $t$. The space-time graph is built in two steps: first, according to the physical connectivity of the human body, joint points $v_i^t$ and $v_j^t$ within the same frame are connected by spatial edges $e_{ij}^t$; then, the spatially and semantically identical points $v_i^t$ and $v_i^{t+1}$ in consecutive frames are connected in temporal order by temporal edges $e_i^{t,t+1}$. Neither kind of connection needs additional manual definition;
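The two edge sets of step S31 are mechanical to build once the bone list of the skeleton format is fixed; a small sketch follows (the flattened node indexing is an illustrative choice):

```python
def build_st_graph(bones, num_joints, num_frames):
    """Build the two edge sets of step S31 for a skeleton space-time graph.
    `bones` is the list of physically connected joint pairs (i, j); the
    node v_i^t is flattened to index t * num_joints + i."""
    def node(t, i):
        return t * num_joints + i
    spatial = [(node(t, i), node(t, j))          # same-frame physical edges
               for t in range(num_frames) for (i, j) in bones]
    temporal = [(node(t, i), node(t + 1, i))     # same joint, consecutive frames
                for t in range(num_frames - 1) for i in range(num_joints)]
    return spatial, temporal
```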
step S32: define a segmentation on the human skeleton space-time graph as

$Seg(v_r^t) = \{\, v_i^t \mid v_i^t \leftrightarrow v_r^t \,\}$

where the superscript $t$ denotes time $t$ in the sequence, $v_r^t$ is a root node, and $v_i^t \leftrightarrow v_r^t$ expresses that node $v_i^t$ is connected to the root node, the symbol $\leftrightarrow$ indicating that the two nodes are associated under the given rule; the definition maps a root node to the set of nodes related to it.

Further, the set of all segmentations of the skeleton space-time graph is defined as

$SegA = \{\, Seg(v_i^t) \mid v_i^t \in V \,\}$

where $V$ is the set of joint points in the human skeleton motion space-time graph, $v_i^t$ is one of the joint points, and $Seg(v_i^t)$ is a partition with this node as its root node. The following steps all take the partition of a root node $v_r^t$ as the example;
step S33: segment the human skeleton space-time graph based on physical connection:

$Seg_{phy}(v_r^t) = \{\, v_i^t \mid d(v_i^t, v_r^t) \le 1 \,\}$

where $d(v_i^t, v_r^t)$ denotes the shortest-path length from node $v_i^t$ to node $v_r^t$. This segmentation means that joint points adjacent in physical structure and in time are treated as one set;
step S34: segment the human skeleton space-time graph based on spatial configuration:

$Seg_{spa}(v_r^t) = \{\, v_i^t \mid d(v_i^t, v_g^t) < d(v_r^t, v_g^t) \,\}$

where the function $d$ is as defined in step S33 and $v_g^t$ is the global root node. Under this skeleton segmentation, the nodes closer to the global root node than the root node $v_r^t$ itself form the spatial-configuration partition of $v_r^t$;
step S35: segment the human skeleton space-time graph based on symmetric semantics. Because the human body is symmetric, two body-symmetric nodes naturally carry related semantics; therefore, during node semantic segmentation, for a node $v_i^t$ and its symmetric counterpart $v_j^t$ there should hold $v_j^t \in Seg_{sym}(v_i^t)$;
Step S36: construction of space-time graph convolution and node based on multi-level segmentation
Figure BDA00028770247300000619
The corresponding convolution calculation method is as follows:
Figure BDA0002877024730000071
wherein the content of the first and second substances,
Figure BDA0002877024730000072
is mapped->
Figure BDA0002877024730000073
Feature corresponding to a node, is>
Figure BDA0002877024730000074
Is node->
Figure BDA0002877024730000075
And node->
Figure BDA0002877024730000076
Corresponding convolution weights, A s For one of the segmentation sets SegA (a), a decision is taken>
Figure BDA0002877024730000077
To divide into s Normalizing the term | A corresponding to the nth row and column elements of the Laplace matrix approach in spectrogram convolution s | is equal to A s The number of all elements in the set, i.e. A s Is added to balance the contributions of the different subsets to the output, preventing certain sets from being too large and causing a result to be biased, and>
Figure BDA0002877024730000078
is A s Dividing the mth row and nth column elements of the corresponding mask attention matrix;
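In code, the convolution of step S36 amounts to one 1 × 1 convolution per partition set, an adjacency matrix normalized by the set size, and an elementwise learnable mask; below is a PyTorch sketch under those assumptions (the exact normalization in the patent's image formula may differ):

```python
import torch
import torch.nn as nn

class PartitionGraphConv(nn.Module):
    """One graph-convolution layer over several partitions, in the spirit
    of step S36: each partition A_s contributes its own weights W_s, a
    1/|A_s|-style normalized adjacency Lambda_s, and a learnable mask M_s."""
    def __init__(self, in_ch: int, out_ch: int, adjacency: torch.Tensor):
        # adjacency: (S, N, N), one normalized matrix per partition set.
        super().__init__()
        self.register_buffer("A", adjacency)
        self.mask = nn.Parameter(torch.ones_like(adjacency))   # M_s
        self.weight = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=1) for _ in range(adjacency.size(0)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (B, C, T, N)
        out = 0
        for s, w in enumerate(self.weight):
            a = self.A[s] * self.mask[s]       # masked, normalized adjacency
            out = out + torch.einsum("bctn,nm->bctm", w(x), a)
        return out
```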
step S37: construct the multi-level-segmentation-based space-time graph convolution skeleton action recognition network: a ten-layer spatio-temporal graph convolutional neural network. Layers one to four have 64 channels with stride 1; layer five has 128 channels with stride 2; layers six and seven have 128 channels with stride 1; layer eight has 256 channels with stride 2; layer nine has 256 channels with stride 1; layer ten is a 1 × 1 full convolution layer.
The step S4 specifically includes the following steps:
step S41: use the skeleton action sequence generation network trained in step S1 to generate, for each action class, twice as many action sequences as the test set contains;
step S42: performing quality evaluation on all generated skeleton sequences by using the skeleton sequence quality evaluation network trained in the step S2, and rejecting skeleton sequences with evaluation values smaller than 0.8;
step S43: integrating the original data set and the generated data set after the quality evaluation is completed to form an enhanced data set;
step S44: train the multi-level-segmentation-based space-time graph convolution skeleton action recognition network of step S3 on the enhanced data set, with an initial learning rate of 0.001, parameter decay rates of 0.95, a batch size of 64, and 80 epochs in total; after training, the skeleton action recognition model under small samples is obtained.
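Putting steps S41 to S44 together, the whole of step S4 reduces to a short pipeline; the helper functions below (`train_generator`, `train_quality_evaluator`, `train_recognizer`, and the `sample`/`score` methods) are hypothetical wrappers around the networks of steps S1 to S3, named here only for illustration:

```python
def build_small_sample_recognizer(train_set, test_size, num_classes):
    """End-to-end pipeline of step S4 (a sketch; the helpers are assumed
    to wrap the networks from steps S1-S3)."""
    generator = train_generator(train_set)                 # step S1
    evaluator = train_quality_evaluator(train_set)         # step S2
    generated = [seq for c in range(num_classes)
                 for seq in generator.sample(c, 2 * test_size)]   # step S41
    kept = [s for s in generated if evaluator.score(s) >= 0.8]    # step S42
    enhanced = list(train_set) + kept                      # step S43
    return train_recognizer(enhanced, lr=0.001, decay=0.95,
                            batch_size=64, epochs=80)      # step S44
```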
Compared with the prior art, the invention has the following beneficial effects:
(1) The bypass-network skeleton action sequence generation network provided by the invention can expand skeleton action sequence data.
(2) The skeleton action sequence quality evaluation network provided by the invention can effectively filter out poor-quality skeleton sequences.
(3) The multi-level-segmentation space-time graph convolution network provided by the invention can extract more skeleton motion sequence features from multiple levels of segmentation.
(4) The small-sample skeleton action recognition framework provided by the invention can significantly improve skeleton action recognition performance when samples are scarce.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is a schematic flow diagram of an embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in FIG. 1, an embodiment of the method comprises steps S1 to S4, together with their sub-steps S11 to S44, exactly as set out in the Disclosure above; the step-by-step details are not repeated here.
This embodiment can effectively improve skeleton action recognition performance under small samples. First, the bypass-network-based skeleton action generation network addresses the shortage of skeleton action data under small samples: with it, large amounts of skeleton motion sequence data that would normally be hard to obtain can be generated. Next, the skeleton action sequence quality evaluation network scores the generated sequences so that poor-quality data can be filtered out; when the generator is partly unstable it tends to produce sequences with rigid motion or motion trajectories that violate objective physical laws, and the evaluation network identifies such poor-quality generated data. Then a skeleton action recognition network based on multi-level segmentation is constructed; by segmenting skeleton actions at multiple levels it guides the network to extract motion features of the skeleton sequence from several layers, enabling effective recognition. Finally, the network architectures are integrated and run in the logical order of generation, quality evaluation and action recognition, yielding a skeleton action recognition model for small-sample conditions.
The above description is only a preferred embodiment of the present invention, and all the equivalent changes and modifications made according to the claims of the present invention should be covered by the present invention.

Claims (2)

1. A skeleton action recognition method under small samples, used for skeleton action recognition under small-sample conditions, characterized in that the method comprises the following steps:
step S1: constructing a sequence-to-sequence generation network to generate skeleton motion sequences, wherein, because under small-sample conditions the human skeleton model is too complex for the limited data volume to support sequence generation, a bypass network model is used to strengthen the robustness of the generation network under small samples, and an enhanced data set is finally constructed;
step S2: constructing a sequence quality evaluation network based on space-time graph convolution to solve the problem that poorly generated skeleton motion sequences would negatively affect the subsequent steps, the evaluation network scoring the generated sequences and optimizing the data set by filtering out the poor-quality ones;
step S3: constructing a skeleton action recognition network based on a multi-level skeleton segmentation algorithm, wherein the network model segments a skeleton motion sequence at three levels and feeds the segmented sequences into a graph convolutional neural network for feature fusion, so that the network extracts features of the skeleton motion sequence from different angles and becomes more robust to skeleton motion sequence data;
step S4: integrating and using all the network models constructed in steps S1, S2 and S3: generating skeleton motion sequences with the skeleton sequence generation network of step S1, filtering out the poor-quality data with the skeleton sequence quality evaluation network of step S2 to optimize the data set, and recognizing skeleton actions with the skeleton action recognition network of step S3 based on the optimized data set;
the step S1 specifically includes the following steps;
step S11: constructing the skeleton motion sequence generation network with a sequence-to-sequence architecture by connecting two recurrent neural networks in series, the former serving as the encoder and the latter as the decoder, each built from 1024 gated recurrent units; the encoder is expressed as:

$x_{te} = E(x_{te-1},\, h_{te-1})$   (Formula 1)

wherein $E$ denotes the encoder, $x_{te-1}$ is the input at time $te$, $x_{te}$ is the output at time $te$ and also serves as the input at time $te+1$, and $h_{te-1}$ denotes the encoder state at time $te$;

the decoder is expressed as:

$y_{td} = D(y_{td-1},\, s_{td-1})$   (Formula 2)

wherein $D$ denotes the decoder, $y_{td-1}$ is the input at time $td$, $y_{td}$ is the output at time $td$ and also serves as the input at time $td+1$, and $s_{td-1}$ denotes the decoder state at time $td$;

step S12: adding a residual connection between each input and output on top of the original decoder structure, expressed as:

$y_{td} = D(y_{td-1},\, s_{td-1}) + y_{td-1}$   (Formula 3)

wherein $D$, $y_{td-1}$, $y_{td}$ and $s_{td-1}$ have the same meanings as in step S11, adding the residual structure changing only how the output is computed;

step S13: constructing a bypass network, also with a sequence-to-sequence architecture; the bypass network is a sequence generation network whose encoder and decoder are each built from 256 gated recurrent units; during network training, the skeleton motion sequence generation network splits off the backbone-skeleton part of its input as the input of the bypass network; during sequence generation, the backbone skeleton output by the bypass network is embedded into the input of the skeleton motion sequence generation network to guide and correct the generation of the whole skeleton model; the backbone skeleton consists of the mid-hip point, the left hip and the right hip of the human skeleton;

step S14: training the skeleton motion sequence generation network and the bypass network separately, both with an initial training learning rate of 0.005, a learning-rate decay of 0.95 and 10000 iterations;

step S15: combining the bypass network with the skeleton motion sequence generation network, namely embedding the output of the bypass network into the skeleton motion sequence generation network; the whole architecture is coupled by Formula 4, which is reproduced only as an image in the original, wherein $Mn$ is the skeleton motion sequence generation (main) network, $Bp$ is the bypass network, $sx_t$ is the main network's output for the backbone-skeleton part at time $t$, $sp_t$ is the main network's output for the remaining joints at time $t$, $p_t$ is the output of the remaining joints after the residual structure, $bx_t$ is the bypass network's output for the backbone-skeleton part at time $t$, and $p_t$ and $bx_t$ are integrated and emitted as the output of the whole model;
the step S2 specifically includes the following steps:
step S21: constructing the negative-sample part of the quality evaluation data set, wherein low-quality skeleton motion sequences generated by the partially trained model of step S1 serve as the negative samples; the low-quality skeleton action sequences are samples with rigid motion and movement angles that do not accord with objective physical laws;

step S22: constructing the positive-sample part of the quality evaluation data set using time-domain motion sequence interpolation, which models the motion trajectory of the pose between two adjacent frames of the same sequence:

(Formula 5, reproduced only as an image in the original, spherically interpolates the joint vector from $tq_1$ to $tq_2$ through the rotation angle $t\theta$)

wherein $tq_1$ and $tq_2$ are the joint vectors of the same joint point in two adjacent frames of the same skeleton motion sequence, $tq$ is the result of the time-domain motion sequence interpolation, and $t\theta$ is the rotation angle from joint vector $tq_1$ to joint vector $tq_2$;

step S23: constructing the positive-sample part of the quality evaluation data set using space-domain motion sequence interpolation, which interpolates the spatial coordinates of the same joint point across two different motion poses:

(Formula 6, reproduced only as an image in the original, spherically interpolates the joint vector from $sq_1$ to $sq_2$ through the rotation angle $s\theta$ with weight $\omega$)

wherein $sq_1$ and $sq_2$ are the joint vectors of the same joint point in two different skeleton motion sequences, $sq$ is the result of the space-domain motion sequence interpolation, $s\theta$ is the rotation angle from joint vector $sq_1$ to joint vector $sq_2$, and $\omega$ is the weight of $sq_2$ in the interpolation result;
step S24: integrating the skeleton motion sequence data obtained in the steps S21, S22 and S23 to obtain a quality evaluation data set;
step S25: constructing a skeleton motion sequence quality evaluation network based on graph convolution: a six-layer spatio-temporal graph convolutional neural network, wherein the first and second layers have 64 channels with stride 1, the third layer has 128 channels with stride 2, the fourth layer has 128 channels with stride 1, the fifth layer has 256 channels with stride 2, and the sixth layer is a fully connected layer; the network is trained on the quality evaluation data set constructed in step S24 with an initial learning rate of 0.001, parameter decay rates of 0.95, a batch size of 64, and 80 epochs in total;
the step S3 specifically includes the following steps:
step S31: constructing the human skeleton motion space-time graph, the graph containing $N$ joint points that form the set $V = \{\, v_i^t \mid i = 1, \dots, N \,\}$ over all frames $t$; the space-time graph is built in two steps: first, according to the physical connectivity of the human body, joint points $v_i^t$ and $v_j^t$ within the same frame are connected by spatial edges $e_{ij}^t$; then, the points $v_i^t$ and $v_i^{t+1}$ with identical spatial-semantic structure in consecutive frames are connected sequentially in time by temporal edges $e_i^{t,t+1}$; neither kind of connection needs additional manual definition;
step S32: defining a segmentation on the human skeleton space-time graph as

$Seg(v_r^t) = \{\, v_i^t \mid v_i^t \leftrightarrow v_r^t \,\}$

wherein the superscript $t$ denotes time $t$ in the sequence, $v_r^t$ is a root node, and $v_i^t \leftrightarrow v_r^t$ expresses that node $v_i^t$ is connected to the root node, the symbol $\leftrightarrow$ indicating that the two nodes are associated under the given rule; the definition maps a root node to the set of nodes related to it;

further, the set of all segmentations of the skeleton space-time graph is defined as

$SegA = \{\, Seg(v_i^t) \mid v_i^t \in V \,\}$

wherein $V$ is the set of joint points in the human skeleton motion space-time graph, $v_i^t$ is one of the joint points, and $Seg(v_i^t)$ is a partition with this node as its root node; the following steps all take the partition of a root node $v_r^t$ as the example;
step S33: segmenting the human skeleton space-time graph based on physical connection:

$Seg_{phy}(v_r^t) = \{\, v_i^t \mid d(v_i^t, v_r^t) \le 1 \,\}$

wherein $d(v_i^t, v_r^t)$ denotes the shortest-path length from node $v_i^t$ to node $v_r^t$; this segmentation means that joint points adjacent in physical structure and in time are treated as one set;
step S34: segmenting the human skeleton space-time graph based on spatial configuration:

$Seg_{spa}(v_r^t) = \{\, v_i^t \mid d(v_i^t, v_g^t) < d(v_r^t, v_g^t) \,\}$

wherein the function $d$ is as defined in step S33 and $v_g^t$ is the global root node; under this skeleton segmentation, the nodes closer to the global root node than the root node $v_r^t$ itself form the spatial-configuration partition of $v_r^t$;
step S35: segmenting the human skeleton space-time graph based on symmetric semantics; because the human body is symmetric, two body-symmetric nodes naturally carry related semantics, so during node semantic segmentation, for a node $v_i^t$ and its symmetric counterpart $v_j^t$ there should hold $v_j^t \in Seg_{sym}(v_i^t)$;
Step S36: construction of space-time graph convolution and node based on multi-level segmentation
Figure FDA0004029233350000064
The corresponding convolution calculation method is as follows:
Figure FDA0004029233350000065
/>
wherein the content of the first and second substances,
Figure FDA0004029233350000066
is mapped->
Figure FDA0004029233350000067
Feature corresponding to a node, is>
Figure FDA0004029233350000068
Is node->
Figure FDA0004029233350000069
And node->
Figure FDA00040292333500000610
Correspond toConvolution weight of, A s For one of the segmentation sets SegA (a), a decision is taken>
Figure FDA00040292333500000611
To divide into s Normalizing the term | A corresponding to the nth row and column elements of the Laplace matrix approach in spectrogram convolution s | is equal to A s The number of all elements in the set, i.e. A s Is added to balance the contributions of the different subsets to the output, preventing certain sets from being too large and causing a result to be biased, and>
Figure FDA00040292333500000612
is A s Dividing the mth row and nth column elements of the corresponding mask attention matrix;
step S37: constructing the multi-level-segmentation-based space-time graph convolution skeleton action recognition network: a ten-layer spatio-temporal graph convolutional neural network, wherein layers one to four have 64 channels with stride 1, layer five has 128 channels with stride 2, layers six and seven have 128 channels with stride 1, layer eight has 256 channels with stride 2, layer nine has 256 channels with stride 1, and layer ten is a 1 × 1 full convolution layer.
2. The method for recognizing the skeleton action under the small sample according to claim 1, characterized in that: the step S4 specifically includes the following steps:
step S41: using the skeleton action sequence generation network trained in step S1 to generate, for each action class, twice as many action sequences as the test set contains;
step S42: using the skeleton sequence quality evaluation network trained in step S2 to evaluate the quality of all generated skeleton sequences, and eliminating skeleton sequences with evaluation values smaller than 0.8;
step S43: integrating the original data set and the generated data set after the quality evaluation is completed to form an enhanced data set;
step S44: training the multi-level-segmentation-based space-time graph convolution skeleton action recognition network of step S3 on the enhanced data set, with an initial learning rate of 0.001, parameter decay rates of 0.95, a batch size of 64, and 80 epochs in total; after training, the skeleton action recognition model under small samples is obtained.
CN202011616955.1A 2020-12-31 2020-12-31 Skeleton action recognition method under small sample Active CN112651360B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011616955.1A CN112651360B (en) 2020-12-31 2020-12-31 Skeleton action recognition method under small sample

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011616955.1A CN112651360B (en) 2020-12-31 2020-12-31 Skeleton action recognition method under small sample

Publications (2)

Publication Number Publication Date
CN112651360A CN112651360A (en) 2021-04-13
CN112651360B (en) 2023-04-07

Family

ID=75364560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011616955.1A Active CN112651360B (en) 2020-12-31 2020-12-31 Skeleton action recognition method under small sample

Country Status (1)

Country Link
CN (1) CN112651360B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408455B (en) * 2021-06-29 2022-11-29 山东大学 Action identification method, system and storage medium based on multi-stream information enhanced graph convolution network
CN114818989B (en) * 2022-06-21 2022-11-08 中山大学深圳研究院 Gait-based behavior recognition method and device, terminal equipment and storage medium
CN116453648B (en) * 2023-06-09 2023-09-05 华侨大学 Rehabilitation exercise quality assessment system based on contrast learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881731A (en) * 2020-05-19 2020-11-03 广东国链科技股份有限公司 Behavior recognition method, system, device and medium based on human skeleton
CN111985343A (en) * 2020-07-23 2020-11-24 深圳大学 Method for constructing behavior recognition deep network model and behavior recognition method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8929612B2 (en) * 2011-06-06 2015-01-06 Microsoft Corporation System for recognizing an open or closed hand
CN106203363A (en) * 2016-07-15 2016-12-07 中国科学院自动化研究所 Human skeleton motion sequence Activity recognition method
CN110096950B (en) * 2019-03-20 2023-04-07 西北大学 Multi-feature fusion behavior identification method based on key frame
CN111199216B (en) * 2020-01-07 2022-10-28 上海交通大学 Motion prediction method and system for human skeleton
CN111325099B (en) * 2020-01-21 2022-08-26 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881731A (en) * 2020-05-19 2020-11-03 广东国链科技股份有限公司 Behavior recognition method, system, device and medium based on human skeleton
CN111985343A (en) * 2020-07-23 2020-11-24 深圳大学 Method for constructing behavior recognition deep network model and behavior recognition method

Also Published As

Publication number Publication date
CN112651360A (en) 2021-04-13

Similar Documents

Publication Publication Date Title
CN112651360B (en) Skeleton action recognition method under small sample
CN112395945A (en) Graph volume behavior identification method and device based on skeletal joint points
CN111814719A (en) Skeleton behavior identification method based on 3D space-time diagram convolution
CN108399435B (en) Video classification method based on dynamic and static characteristics
CN113469356A (en) Improved VGG16 network pig identity recognition method based on transfer learning
CN107679462A (en) A kind of depth multiple features fusion sorting technique based on small echo
CN110378208B (en) Behavior identification method based on deep residual error network
CN110135386B (en) Human body action recognition method and system based on deep learning
CN110728219A (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
KR102042168B1 (en) Methods and apparatuses for generating text to video based on time series adversarial neural network
CN113128424B (en) Method for identifying action of graph convolution neural network based on attention mechanism
CN109508686B (en) Human behavior recognition method based on hierarchical feature subspace learning
CN111753207B (en) Collaborative filtering method for neural map based on comments
CN113239897B (en) Human body action evaluation method based on space-time characteristic combination regression
CN104298974A (en) Human body behavior recognition method based on depth video sequence
CN107423747A (en) A kind of conspicuousness object detection method based on depth convolutional network
CN113221663A (en) Real-time sign language intelligent identification method, device and system
CN113689382A (en) Tumor postoperative life prediction method and system based on medical images and pathological images
CN110516724A (en) Visualize the high-performance multilayer dictionary learning characteristic image processing method of operation scene
CN115049739A (en) Binocular vision stereo matching method based on edge detection
CN113255569B (en) 3D attitude estimation method based on image hole convolutional encoder decoder
CN111626296A (en) Medical image segmentation system, method and terminal based on deep neural network
CN113887501A (en) Behavior recognition method and device, storage medium and electronic equipment
CN112052795B (en) Video behavior identification method based on multi-scale space-time feature aggregation
CN117115911A (en) Hypergraph learning action recognition system based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant