CN112651360B - Skeleton action recognition method under small sample - Google Patents


Info

Publication number
CN112651360B
CN112651360B (application CN202011616955.1A)
Authority
CN
China
Prior art keywords
skeleton
network
sequence
time
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011616955.1A
Other languages
Chinese (zh)
Other versions
CN112651360A (en)
Inventor
柯逍
杜鹏强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202011616955.1A priority Critical patent/CN112651360B/en
Publication of CN112651360A publication Critical patent/CN112651360A/en
Application granted granted Critical
Publication of CN112651360B publication Critical patent/CN112651360B/en
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a skeleton action recognition method under small samples, comprising the following steps. Step S1: construct a sequence-to-sequence generation network to generate skeleton motion sequences and build an enhanced data set. Step S2: construct a sequence quality evaluation network based on space-time graph convolution, solving the problem that poorly generated skeleton motion sequences would negatively affect the subsequent steps. Step S3: construct a skeleton action recognition network based on a multi-level skeleton segmentation algorithm. Step S4: integrate and use all the network models built in steps S1, S2 and S3: generate skeleton motion sequences with the generation network of step S1, filter out the poor-quality data with the quality evaluation network of step S2 to optimize the data set, and recognize skeleton actions with the recognition network of step S3 on the optimized data set. The method can be used for skeleton action recognition under small-sample conditions.

Description

Skeleton action recognition method under small sample
Technical Field
The invention relates to the technical field of data enhancement of pattern recognition and computer vision, in particular to a method for recognizing skeleton actions under a small sample.
Background
Computing is now ubiquitous in modern society and has had a tremendous impact on human life. Through computation, machines can analyze human-related data, draw constructive conclusions from it, and make people's lives easier. Computer vision is one of the fastest-developing fields of data analysis and plays an important role in many areas closely tied to human behavior, such as intelligent surveillance, smart homes and human-machine collaboration, which makes human behavior understanding based on computer vision especially important. A machine acquires data through visual sensors, detects and tracks people, analyzes and recognizes their movements, and combines the contextual information of the data to understand the purpose and semantics of actions. Traditional RGB digital video has various shortcomings for human action recognition: action recognition based on RGB image sequences suffers from heavy computation and poor robustness in the face of complex backgrounds and changes in human scale and viewpoint. Because a traditional RGB image can only describe information on a two-dimensional plane, graph-structured data representing three-dimensional information can describe motion more accurately and characterize the state of human movement more comprehensively from a three-dimensional perspective; compared with traditional digital video, human action recognition based on the 3D skeleton therefore has clear advantages and high value. However, action data are scarce in many specific scenes, and under such small-sample conditions the enhancement of human skeleton data is particularly important.
Disclosure of Invention
The invention provides a method for recognizing skeleton actions under a small sample, which can be used for recognizing skeleton actions under the condition of the small sample.
The invention adopts the following technical scheme.
A skeleton action recognition method under small samples, usable for skeleton action recognition under small-sample conditions, comprises the following steps:
step S1: construct a sequence-to-sequence generation network to generate skeleton motion sequences; because under small-sample conditions the human skeleton model is too complex for the limited data volume to support sequence generation, a bypass network model is used to strengthen the generation network's robustness under small samples, and an enhanced data set is finally constructed;
step S2: construct a sequence quality evaluation network based on space-time graph convolution to solve the problem that poorly generated skeleton motion sequences would negatively affect the subsequent steps; the evaluation network scores the generated sequences and optimizes the data set by filtering out the poor-quality ones;
step S3: construct a skeleton action recognition network based on a multi-level skeleton segmentation algorithm; the network model segments a skeleton motion sequence at three levels and feeds the segmented sequences into a graph convolutional neural network for feature fusion, so that the network extracts features of the skeleton motion sequence from different angles and becomes more robust to skeleton motion sequence data;
step S4: integrate and use all the network models constructed in steps S1, S2 and S3: generate skeleton motion sequences with the skeleton sequence generation network of step S1, filter out the poor-quality data with the skeleton sequence quality evaluation network of step S2 to optimize the data set, and recognize skeleton actions with the skeleton action recognition network of step S3 based on the optimized data set.
The step S1 specifically includes the following steps;
step S11: construct the skeleton motion sequence generation network with a sequence-to-sequence architecture by connecting two recurrent neural networks in series; the former acts as the encoder and the latter as the decoder, and each is built from 1024 gated recurrent units. The encoder is expressed as:

$x_{te} = E(x_{te-1},\, h_{te-1})$   (Formula 1)

where $E$ denotes the encoder, $x_{te-1}$ is the input at time $te$, $x_{te}$ is the output at time $te$ and also serves as the input at time $te+1$, and $h_{te-1}$ denotes the encoder state at time $te$.

The decoder is expressed as:

$y_{td} = D(y_{td-1},\, s_{td-1})$   (Formula 2)

where $D$ denotes the decoder, $y_{td-1}$ is the input at time $td$, $y_{td}$ is the output at time $td$ and also serves as the input at time $td+1$, and $s_{td-1}$ denotes the decoder state at time $td$;

step S12: on top of the original decoder structure, add a residual connection between each input and output, expressed as:

$y_{td} = D(y_{td-1},\, s_{td-1}) + y_{td-1}$   (Formula 3)

where $D$, $y_{td-1}$, $y_{td}$ and $s_{td-1}$ have the same meanings as in step S11; adding the residual structure changes only how the output is computed;
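As a concrete illustration of steps S11 and S12, the following is a minimal PyTorch sketch of the GRU encoder-decoder with the residual connection of Formula 3; the class and argument names are ours, and only the 1024-unit hidden size comes from the text:

```python
import torch
import torch.nn as nn

class SkeletonSeq2Seq(nn.Module):
    """Sketch of the generator in steps S11-S12: a GRU encoder,
    a GRU decoder, and a residual connection around each decoder step."""
    def __init__(self, pose_dim: int, hidden: int = 1024):
        super().__init__()
        self.encoder = nn.GRU(pose_dim, hidden, batch_first=True)   # E in Formula 1
        self.decoder_cell = nn.GRUCell(pose_dim, hidden)            # D in Formulas 2-3
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, seed_seq: torch.Tensor, gen_len: int) -> torch.Tensor:
        # Encode the observed prefix; the final hidden state initialises the decoder.
        _, h = self.encoder(seed_seq)          # h: (1, B, hidden)
        s = h.squeeze(0)
        y = seed_seq[:, -1]                    # last observed pose starts decoding
        outputs = []
        for _ in range(gen_len):
            s = self.decoder_cell(y, s)
            y = self.out(s) + y                # residual connection (Formula 3)
            outputs.append(y)
        return torch.stack(outputs, dim=1)     # (B, gen_len, pose_dim)
```

For a 25-joint skeleton with 3D coordinates one would instantiate it as `SkeletonSeq2Seq(pose_dim=75)`.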
step S13: construct a bypass network, also with a sequence-to-sequence architecture. The bypass network is a sequence generation network whose encoder and decoder are each built from 256 gated recurrent units. During network training, the skeleton motion sequence generation network splits off the backbone-skeleton part of its input as the input of the bypass network; during sequence generation, the backbone skeleton output by the bypass network is embedded into the input of the skeleton motion sequence generation network, so that it guides and corrects the generation of the whole skeleton model. The backbone skeleton consists of the mid-hip point, the left hip and the right hip of the human skeleton;

step S14: train the skeleton motion sequence generation network and the bypass network separately; both use an initial training learning rate of 0.005, a learning-rate decay of 0.95 and 10000 iterations;

step S15: combine the bypass network with the skeleton motion sequence generation network, that is, embed the output of the bypass network into the skeleton motion sequence generation network. The whole architecture is coupled by Formula 4, which is reproduced only as an image in the original; its variables are: $Mn$, the skeleton motion sequence generation (main) network; $Bp$, the bypass network; $sx_t$, the main network's output for the backbone-skeleton part at time $t$; $sp_t$, the main network's output for the remaining joints at time $t$; $p_t$, the output of the remaining joints after the residual structure; and $bx_t$, the bypass network's output for the backbone-skeleton part at time $t$. The parts $p_t$ and $bx_t$ are integrated and emitted as the output of the whole model.
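Because Formula 4 survives only as an image, its exact form is not recoverable; the sketch below shows one plausible reading of step S15, in which the bypass prediction overwrites the main network's backbone joints at every generated step. All names and the coupling itself are assumptions:

```python
import torch

def combine_step(main_net, bypass_net, x_t, backbone_idx, rest_idx, p_prev):
    """One generation step of the combined model of step S15 (a sketch:
    Formula 4 is an image in the source, so the exact coupling is assumed).
    The bypass network predicts the backbone joints (mid/left/right hip);
    the main network predicts the rest, with a residual over time."""
    sx_t, sp_t = main_net(x_t)                 # main-network outputs: backbone / rest
    p_t = sp_t + p_prev                        # residual structure on the rest
    bx_t = bypass_net(x_t[..., backbone_idx])  # bypass output for the backbone part
    out = torch.empty_like(x_t)
    out[..., backbone_idx] = bx_t              # bypass guides/corrects the backbone
    out[..., rest_idx] = p_t
    return out, p_t
```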
The step S2 specifically includes the following steps:
step S21: construct the negative-sample part of the quality assessment data set. The low-quality skeleton motion sequences generated by the partially trained model from step S1 serve as the negative samples; low-quality skeleton action sequences are samples with rigid motion, movement angles that violate objective physical laws, and the like;
step S22: construct the positive-sample part of the quality assessment data set using time-domain motion sequence interpolation, which models the motion trajectory of the pose between two adjacent frames of the same sequence:

(Formula 5, reproduced only as an image in the original, spherically interpolates the joint vector from $tq_1$ to $tq_2$ through the rotation angle $t\theta$)

where $tq_1$ and $tq_2$ are the joint vectors of the same joint point in two adjacent frames of the same skeleton motion sequence, $tq$ is the result of the time-domain motion sequence interpolation, and $t\theta$ is the rotation angle from joint vector $tq_1$ to joint vector $tq_2$;
step S23: construct the positive-sample part of the quality assessment data set using space-domain motion sequence interpolation; space-domain interpolation interpolates the spatial coordinates of the same joint point across two different motion poses:

(Formula 6, reproduced only as an image in the original, spherically interpolates the joint vector from $sq_1$ to $sq_2$ through the rotation angle $s\theta$ with weight $\omega$)

where $sq_1$ and $sq_2$ are the joint vectors of the same joint point in two different skeleton motion sequences, $sq$ is the result of the space-domain motion sequence interpolation, $s\theta$ is the rotation angle from joint vector $sq_1$ to joint vector $sq_2$, and $\omega$ is the weight of $sq_2$ in the interpolation result;
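Formulas 5 and 6 survive only as images, but their variable lists (a rotation angle between two joint vectors and, in the spatial case, a weight ω for the second vector) match standard spherical linear interpolation; the following is a sketch under that assumption:

```python
import numpy as np

def slerp(q1: np.ndarray, q2: np.ndarray, w: float) -> np.ndarray:
    """Spherical linear interpolation between two joint vectors (steps
    S22-S23). Formulas 5-6 are images in the source; the variable lists
    match standard slerp, so this is a sketch under that assumption."""
    q1 = q1 / np.linalg.norm(q1)
    q2 = q2 / np.linalg.norm(q2)
    theta = np.arccos(np.clip(np.dot(q1, q2), -1.0, 1.0))  # rotation angle
    if theta < 1e-6:                       # nearly parallel: fall back to lerp
        return (1 - w) * q1 + w * q2
    return (np.sin((1 - w) * theta) * q1 + np.sin(w * theta) * q2) / np.sin(theta)

# Time-domain positive sample: interpolate the same joint between adjacent
# frames t and t+1; space-domain: the same joint across two sequences.
# e.g. tq = slerp(seq[t, j], seq[t + 1, j], 0.5)
```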
step S24: integrating the skeleton motion sequence data obtained in the steps S21, S22 and S23 to obtain a quality evaluation data set;
step S25: construct the graph-convolution-based skeleton motion sequence quality evaluation network: a six-layer spatio-temporal graph convolutional neural network. The first and second layers have 64 channels with stride 1; the third layer has 128 channels with stride 2; the fourth layer has 128 channels with stride 1; the fifth layer has 256 channels with stride 2; the sixth layer is a fully connected layer. The network is trained on the quality assessment data set constructed in step S24 with an initial learning rate of 0.001, parameter decay rates of 0.95, a batch size of 64, and 80 epochs in total.
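Since step S25 describes a plain channel/stride schedule, it can be written down compactly; the sketch below (framework, class and variable names are ours, not the patent's) also shows a scoring head matching the 0.8 rejection threshold used later in step S42:

```python
import torch
import torch.nn as nn

# (out_channels, temporal stride) for the five ST-GCN blocks of step S25.
EVAL_CFG = [(64, 1), (64, 1), (128, 2), (128, 1), (256, 2)]

class QualityHead(nn.Module):
    """Sixth (fully connected) layer of the evaluator: pools the block
    features and emits a quality score in [0, 1], so that sequences
    scoring below 0.8 can be rejected in step S42."""
    def __init__(self, in_channels: int = 256):
        super().__init__()
        self.fc = nn.Linear(in_channels, 1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, T, N) output of the last ST-GCN block.
        pooled = feat.mean(dim=(2, 3))        # global average over time and joints
        return torch.sigmoid(self.fc(pooled))
```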
The step S3 specifically includes the following steps:
step S31: construct the human skeleton motion space-time graph; the graph contains $N$ joint points, which form the set $V = \{\, v_i^t \mid i = 1, \dots, N \,\}$ over all frames $t$. The space-time graph is built in two steps: first, according to the physical connectivity of the human body, joint points $v_i^t$ and $v_j^t$ within the same frame are connected by spatial edges $e_{ij}^t$; then, the spatially and semantically identical points $v_i^t$ and $v_i^{t+1}$ in consecutive frames are connected in temporal order by temporal edges $e_i^{t,t+1}$. Neither kind of connection needs additional manual definition;
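The two edge sets of step S31 are mechanical to build once the bone list of the skeleton format is fixed; a small sketch follows (the flattened node indexing is an illustrative choice):

```python
def build_st_graph(bones, num_joints, num_frames):
    """Build the two edge sets of step S31 for a skeleton space-time graph.
    `bones` is the list of physically connected joint pairs (i, j); the
    node v_i^t is flattened to index t * num_joints + i."""
    def node(t, i):
        return t * num_joints + i
    spatial = [(node(t, i), node(t, j))          # same-frame physical edges
               for t in range(num_frames) for (i, j) in bones]
    temporal = [(node(t, i), node(t + 1, i))     # same joint, consecutive frames
                for t in range(num_frames - 1) for i in range(num_joints)]
    return spatial, temporal
```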
step S32: define a segmentation on the human skeleton space-time graph as

$Seg(v_r^t) = \{\, v_i^t \mid v_i^t \leftrightarrow v_r^t \,\}$

where the superscript $t$ denotes time $t$ in the sequence, $v_r^t$ is a root node, and $v_i^t \leftrightarrow v_r^t$ expresses that node $v_i^t$ is connected to the root node, the symbol $\leftrightarrow$ indicating that the two nodes are associated under the given rule; the definition maps a root node to the set of nodes related to it.

Further, the set of all segmentations of the skeleton space-time graph is defined as

$SegA = \{\, Seg(v_i^t) \mid v_i^t \in V \,\}$

where $V$ is the set of joint points in the human skeleton motion space-time graph, $v_i^t$ is one of the joint points, and $Seg(v_i^t)$ is a partition with this node as its root node. The following steps all take the partition of a root node $v_r^t$ as the example;
step S33: segment the human skeleton space-time graph based on physical connection:

$Seg_{phy}(v_r^t) = \{\, v_i^t \mid d(v_i^t, v_r^t) \le 1 \,\}$

where $d(v_i^t, v_r^t)$ denotes the shortest-path length from node $v_i^t$ to node $v_r^t$. This segmentation means that joint points adjacent in physical structure and in time are treated as one set;
step S34: segment the human skeleton space-time graph based on spatial configuration:

$Seg_{spa}(v_r^t) = \{\, v_i^t \mid d(v_i^t, v_g^t) < d(v_r^t, v_g^t) \,\}$

where the function $d$ is as defined in step S33 and $v_g^t$ is the global root node. Under this skeleton segmentation, the nodes closer to the global root node than the root node $v_r^t$ itself form the spatial-configuration partition of $v_r^t$;
step S35: segment the human skeleton space-time graph based on symmetric semantics. Because the human body is symmetric, two body-symmetric nodes naturally carry related semantics; therefore, during node semantic segmentation, for a node $v_i^t$ and its symmetric counterpart $v_j^t$ there should hold $v_j^t \in Seg_{sym}(v_i^t)$;
Step S36: construction of space-time graph convolution and node based on multi-level segmentation
Figure BDA00028770247300000619
The corresponding convolution calculation method is as follows:
Figure BDA0002877024730000071
wherein the content of the first and second substances,
Figure BDA0002877024730000072
is mapped->
Figure BDA0002877024730000073
Feature corresponding to a node, is>
Figure BDA0002877024730000074
Is node->
Figure BDA0002877024730000075
And node->
Figure BDA0002877024730000076
Corresponding convolution weights, A s For one of the segmentation sets SegA (a), a decision is taken>
Figure BDA0002877024730000077
To divide into s Normalizing the term | A corresponding to the nth row and column elements of the Laplace matrix approach in spectrogram convolution s | is equal to A s The number of all elements in the set, i.e. A s Is added to balance the contributions of the different subsets to the output, preventing certain sets from being too large and causing a result to be biased, and>
Figure BDA0002877024730000078
is A s Dividing the mth row and nth column elements of the corresponding mask attention matrix;
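In code, the convolution of step S36 amounts to one 1 × 1 convolution per partition set, an adjacency matrix normalized by the set size, and an elementwise learnable mask; below is a PyTorch sketch under those assumptions (the exact normalization in the patent's image formula may differ):

```python
import torch
import torch.nn as nn

class PartitionGraphConv(nn.Module):
    """One graph-convolution layer over several partitions, in the spirit
    of step S36: each partition A_s contributes its own weights W_s, a
    1/|A_s|-style normalized adjacency Lambda_s, and a learnable mask M_s."""
    def __init__(self, in_ch: int, out_ch: int, adjacency: torch.Tensor):
        # adjacency: (S, N, N), one normalized matrix per partition set.
        super().__init__()
        self.register_buffer("A", adjacency)
        self.mask = nn.Parameter(torch.ones_like(adjacency))   # M_s
        self.weight = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=1) for _ in range(adjacency.size(0)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (B, C, T, N)
        out = 0
        for s, w in enumerate(self.weight):
            a = self.A[s] * self.mask[s]       # masked, normalized adjacency
            out = out + torch.einsum("bctn,nm->bctm", w(x), a)
        return out
```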
step S37: construct the multi-level-segmentation-based space-time graph convolution skeleton action recognition network: a ten-layer spatio-temporal graph convolutional neural network. Layers one to four have 64 channels with stride 1; layer five has 128 channels with stride 2; layers six and seven have 128 channels with stride 1; layer eight has 256 channels with stride 2; layer nine has 256 channels with stride 1; layer ten is a 1 × 1 full convolution layer.
The step S4 specifically includes the following steps:
step S41: use the skeleton action sequence generation network trained in step S1 to generate, for each action class, twice as many action sequences as the test set contains;
step S42: performing quality evaluation on all generated skeleton sequences by using the skeleton sequence quality evaluation network trained in the step S2, and rejecting skeleton sequences with evaluation values smaller than 0.8;
step S43: integrating the original data set and the generated data set after the quality evaluation is completed to form an enhanced data set;
step S44: train the multi-level-segmentation-based space-time graph convolution skeleton action recognition network of step S3 on the enhanced data set, with an initial learning rate of 0.001, parameter decay rates of 0.95, a batch size of 64, and 80 epochs in total; after training, the skeleton action recognition model under small samples is obtained.
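Putting steps S41 to S44 together, the whole of step S4 reduces to a short pipeline; the helper functions below (`train_generator`, `train_quality_evaluator`, `train_recognizer`, and the `sample`/`score` methods) are hypothetical wrappers around the networks of steps S1 to S3, named here only for illustration:

```python
def build_small_sample_recognizer(train_set, test_size, num_classes):
    """End-to-end pipeline of step S4 (a sketch; the helpers are assumed
    to wrap the networks from steps S1-S3)."""
    generator = train_generator(train_set)                 # step S1
    evaluator = train_quality_evaluator(train_set)         # step S2
    generated = [seq for c in range(num_classes)
                 for seq in generator.sample(c, 2 * test_size)]   # step S41
    kept = [s for s in generated if evaluator.score(s) >= 0.8]    # step S42
    enhanced = list(train_set) + kept                      # step S43
    return train_recognizer(enhanced, lr=0.001, decay=0.95,
                            batch_size=64, epochs=80)      # step S44
```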
Compared with the prior art, the invention has the following beneficial effects:
(1) The bypass-network skeleton action sequence generation network provided by the invention can expand skeleton action sequence data.
(2) The skeleton action sequence quality evaluation network provided by the invention can effectively filter out poor-quality skeleton sequences.
(3) The multi-level-segmentation space-time graph convolution network provided by the invention can extract more skeleton motion sequence features from multiple levels of segmentation.
(4) The small-sample skeleton action recognition framework provided by the invention can significantly improve skeleton action recognition performance when samples are scarce.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is a schematic flow diagram of an embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in FIG. 1, an embodiment of the method comprises steps S1 to S4, together with their sub-steps S11 to S44, exactly as set out in the Disclosure above; the step-by-step details are not repeated here.
This embodiment can effectively improve skeleton action recognition performance under small samples. First, the bypass-network-based skeleton action generation network addresses the shortage of skeleton action data under small samples: with it, large amounts of skeleton motion sequence data that would normally be hard to obtain can be generated. Next, the skeleton action sequence quality evaluation network scores the generated sequences so that poor-quality data can be filtered out; when the generator is partly unstable it tends to produce sequences with rigid motion or motion trajectories that violate objective physical laws, and the evaluation network identifies such poor-quality generated data. Then a skeleton action recognition network based on multi-level segmentation is constructed; by segmenting skeleton actions at multiple levels it guides the network to extract motion features of the skeleton sequence from several layers, enabling effective recognition. Finally, the network architectures are integrated and run in the logical order of generation, quality evaluation and action recognition, yielding a skeleton action recognition model for small-sample conditions.
The above description is only a preferred embodiment of the present invention, and all the equivalent changes and modifications made according to the claims of the present invention should be covered by the present invention.

Claims (2)

1. A skeleton action recognition method under small samples, used for skeleton action recognition under small-sample conditions, characterized in that the method comprises the following steps:
step S1: constructing a sequence-to-sequence generation network to generate skeleton motion sequences, wherein, because under small-sample conditions the human skeleton model is too complex for the limited data volume to support sequence generation, a bypass network model is used to strengthen the robustness of the generation network under small samples, and an enhanced data set is finally constructed;
step S2: constructing a sequence quality evaluation network based on space-time graph convolution to solve the problem that poorly generated skeleton motion sequences would negatively affect the subsequent steps, the evaluation network scoring the generated sequences and optimizing the data set by filtering out the poor-quality ones;
step S3: constructing a skeleton action recognition network based on a multi-level skeleton segmentation algorithm, wherein the network model segments a skeleton motion sequence at three levels and feeds the segmented sequences into a graph convolutional neural network for feature fusion, so that the network extracts features of the skeleton motion sequence from different angles and becomes more robust to skeleton motion sequence data;
step S4: integrating and using all the network models constructed in steps S1, S2 and S3: generating skeleton motion sequences with the skeleton sequence generation network of step S1, filtering out the poor-quality data with the skeleton sequence quality evaluation network of step S2 to optimize the data set, and recognizing skeleton actions with the skeleton action recognition network of step S3 based on the optimized data set;
the step S1 specifically includes the following steps;
step S11: constructing the skeleton motion sequence generation network with a sequence-to-sequence architecture by connecting two recurrent neural networks in series, the former serving as the encoder and the latter as the decoder, each built from 1024 gated recurrent units; the encoder is expressed as:

$x_{te} = E(x_{te-1},\, h_{te-1})$   (Formula 1)

wherein $E$ denotes the encoder, $x_{te-1}$ is the input at time $te$, $x_{te}$ is the output at time $te$ and also serves as the input at time $te+1$, and $h_{te-1}$ denotes the encoder state at time $te$;

the decoder is expressed as:

$y_{td} = D(y_{td-1},\, s_{td-1})$   (Formula 2)

wherein $D$ denotes the decoder, $y_{td-1}$ is the input at time $td$, $y_{td}$ is the output at time $td$ and also serves as the input at time $td+1$, and $s_{td-1}$ denotes the decoder state at time $td$;

step S12: adding a residual connection between each input and output on top of the original decoder structure, expressed as:

$y_{td} = D(y_{td-1},\, s_{td-1}) + y_{td-1}$   (Formula 3)

wherein $D$, $y_{td-1}$, $y_{td}$ and $s_{td-1}$ have the same meanings as in step S11, adding the residual structure changing only how the output is computed;

step S13: constructing a bypass network, also with a sequence-to-sequence architecture; the bypass network is a sequence generation network whose encoder and decoder are each built from 256 gated recurrent units; during network training, the skeleton motion sequence generation network splits off the backbone-skeleton part of its input as the input of the bypass network; during sequence generation, the backbone skeleton output by the bypass network is embedded into the input of the skeleton motion sequence generation network to guide and correct the generation of the whole skeleton model; the backbone skeleton consists of the mid-hip point, the left hip and the right hip of the human skeleton;

step S14: training the skeleton motion sequence generation network and the bypass network separately, both with an initial training learning rate of 0.005, a learning-rate decay of 0.95 and 10000 iterations;

step S15: combining the bypass network with the skeleton motion sequence generation network, namely embedding the output of the bypass network into the skeleton motion sequence generation network; the whole architecture is coupled by Formula 4, which is reproduced only as an image in the original, wherein $Mn$ is the skeleton motion sequence generation (main) network, $Bp$ is the bypass network, $sx_t$ is the main network's output for the backbone-skeleton part at time $t$, $sp_t$ is the main network's output for the remaining joints at time $t$, $p_t$ is the output of the remaining joints after the residual structure, $bx_t$ is the bypass network's output for the backbone-skeleton part at time $t$, and $p_t$ and $bx_t$ are integrated and emitted as the output of the whole model;
the step S2 specifically includes the following steps:
step S21: constructing the negative-sample part of the quality evaluation data set, wherein low-quality skeleton motion sequences generated by the partially trained model of step S1 serve as the negative samples; the low-quality skeleton action sequences are samples with rigid motion and movement angles that do not accord with objective physical laws;

step S22: constructing the positive-sample part of the quality evaluation data set using time-domain motion sequence interpolation, which models the motion trajectory of the pose between two adjacent frames of the same sequence:

(Formula 5, reproduced only as an image in the original, spherically interpolates the joint vector from $tq_1$ to $tq_2$ through the rotation angle $t\theta$)

wherein $tq_1$ and $tq_2$ are the joint vectors of the same joint point in two adjacent frames of the same skeleton motion sequence, $tq$ is the result of the time-domain motion sequence interpolation, and $t\theta$ is the rotation angle from joint vector $tq_1$ to joint vector $tq_2$;

step S23: constructing the positive-sample part of the quality evaluation data set using space-domain motion sequence interpolation, which interpolates the spatial coordinates of the same joint point across two different motion poses:

(Formula 6, reproduced only as an image in the original, spherically interpolates the joint vector from $sq_1$ to $sq_2$ through the rotation angle $s\theta$ with weight $\omega$)

wherein $sq_1$ and $sq_2$ are the joint vectors of the same joint point in two different skeleton motion sequences, $sq$ is the result of the space-domain motion sequence interpolation, $s\theta$ is the rotation angle from joint vector $sq_1$ to joint vector $sq_2$, and $\omega$ is the weight of $sq_2$ in the interpolation result;
step S24: integrating the skeleton motion sequence data obtained in the steps S21, S22 and S23 to obtain a quality evaluation data set;
step S25: constructing a skeleton motion sequence quality evaluation network based on graph convolution: a six-layer spatio-temporal graph convolutional neural network, wherein the first and second layers have 64 channels with stride 1, the third layer has 128 channels with stride 2, the fourth layer has 128 channels with stride 1, the fifth layer has 256 channels with stride 2, and the sixth layer is a fully connected layer; the network is trained on the quality evaluation data set constructed in step S24 with an initial learning rate of 0.001, parameter decay rates of 0.95, a batch size of 64, and 80 epochs in total;
the step S3 specifically includes the following steps:
step S31: constructing the human skeleton motion space-time graph, the graph containing $N$ joint points that form the set $V = \{\, v_i^t \mid i = 1, \dots, N \,\}$ over all frames $t$; the space-time graph is built in two steps: first, according to the physical connectivity of the human body, joint points $v_i^t$ and $v_j^t$ within the same frame are connected by spatial edges $e_{ij}^t$; then, the points $v_i^t$ and $v_i^{t+1}$ with identical spatial-semantic structure in consecutive frames are connected sequentially in time by temporal edges $e_i^{t,t+1}$; neither kind of connection needs additional manual definition;
step S32: defining a segmentation on the human skeleton space-time graph as

$Seg(v_r^t) = \{\, v_i^t \mid v_i^t \leftrightarrow v_r^t \,\}$

wherein the superscript $t$ denotes time $t$ in the sequence, $v_r^t$ is a root node, and $v_i^t \leftrightarrow v_r^t$ expresses that node $v_i^t$ is connected to the root node, the symbol $\leftrightarrow$ indicating that the two nodes are associated under the given rule; the definition maps a root node to the set of nodes related to it;

further, the set of all segmentations of the skeleton space-time graph is defined as

$SegA = \{\, Seg(v_i^t) \mid v_i^t \in V \,\}$

wherein $V$ is the set of joint points in the human skeleton motion space-time graph, $v_i^t$ is one of the joint points, and $Seg(v_i^t)$ is a partition with this node as its root node; the following steps all take the partition of a root node $v_r^t$ as the example;
step S33: segmenting the human skeleton space-time graph based on physical connection:

$Seg_{phy}(v_r^t) = \{\, v_i^t \mid d(v_i^t, v_r^t) \le 1 \,\}$

wherein $d(v_i^t, v_r^t)$ denotes the shortest-path length from node $v_i^t$ to node $v_r^t$; this segmentation means that joint points adjacent in physical structure and in time are treated as one set;
step S34: segmenting the human skeleton space-time graph based on spatial configuration:

$Seg_{spa}(v_r^t) = \{\, v_i^t \mid d(v_i^t, v_g^t) < d(v_r^t, v_g^t) \,\}$

wherein the function $d$ is as defined in step S33 and $v_g^t$ is the global root node; under this skeleton segmentation, the nodes closer to the global root node than the root node $v_r^t$ itself form the spatial-configuration partition of $v_r^t$;
step S35: segmenting the human skeleton space-time graph based on symmetric semantics; because the human body is symmetric, two body-symmetric nodes naturally carry related semantics, so during node semantic segmentation, for a node $v_i^t$ and its symmetric counterpart $v_j^t$ there should hold $v_j^t \in Seg_{sym}(v_i^t)$;
Step S36: construction of space-time graph convolution and node based on multi-level segmentation
Figure FDA0004029233350000064
The corresponding convolution calculation method is as follows:
Figure FDA0004029233350000065
/>
wherein the content of the first and second substances,
Figure FDA0004029233350000066
is mapped->
Figure FDA0004029233350000067
Feature corresponding to a node, is>
Figure FDA0004029233350000068
Is node->
Figure FDA0004029233350000069
And node->
Figure FDA00040292333500000610
Correspond toConvolution weight of, A s For one of the segmentation sets SegA (a), a decision is taken>
Figure FDA00040292333500000611
To divide into s Normalizing the term | A corresponding to the nth row and column elements of the Laplace matrix approach in spectrogram convolution s | is equal to A s The number of all elements in the set, i.e. A s Is added to balance the contributions of the different subsets to the output, preventing certain sets from being too large and causing a result to be biased, and>
Figure FDA00040292333500000612
is A s Dividing the mth row and nth column elements of the corresponding mask attention matrix;
step S37: constructing the multi-level-segmentation-based space-time graph convolution skeleton action recognition network: a ten-layer spatio-temporal graph convolutional neural network, wherein layers one to four have 64 channels with stride 1, layer five has 128 channels with stride 2, layers six and seven have 128 channels with stride 1, layer eight has 256 channels with stride 2, layer nine has 256 channels with stride 1, and layer ten is a 1 × 1 full convolution layer.
2. The method for recognizing the skeleton action under the small sample according to claim 1, characterized in that: the step S4 specifically includes the following steps:
step S41: using the skeleton action sequence generation network trained in step S1 to generate, for each action class, twice as many action sequences as the test set contains;
step S42: using the skeleton sequence quality evaluation network trained in step S2 to evaluate the quality of all generated skeleton sequences, and eliminating skeleton sequences with evaluation values smaller than 0.8;
step S43: integrating the original data set and the generated data set after the quality evaluation is completed to form an enhanced data set;
step S44: training the multi-level-segmentation-based space-time graph convolution skeleton action recognition network of step S3 on the enhanced data set, with an initial learning rate of 0.001, parameter decay rates of 0.95, a batch size of 64, and 80 epochs in total; after training, the skeleton action recognition model under small samples is obtained.
CN202011616955.1A 2020-12-31 2020-12-31 Skeleton action recognition method under small sample Active CN112651360B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011616955.1A CN112651360B (en) 2020-12-31 2020-12-31 Skeleton action recognition method under small sample

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011616955.1A CN112651360B (en) 2020-12-31 2020-12-31 Skeleton action recognition method under small sample

Publications (2)

Publication Number Publication Date
CN112651360A CN112651360A (en) 2021-04-13
CN112651360B (en) 2023-04-07

Family

ID=75364560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011616955.1A Active CN112651360B (en) 2020-12-31 2020-12-31 Skeleton action recognition method under small sample

Country Status (1)

Country Link
CN (1) CN112651360B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408455B (en) * 2021-06-29 2022-11-29 山东大学 Action identification method, system and storage medium based on multi-stream information enhanced graph convolution network
CN114818989B (en) * 2022-06-21 2022-11-08 中山大学深圳研究院 Gait-based behavior recognition method and device, terminal equipment and storage medium
CN116453648B (en) * 2023-06-09 2023-09-05 华侨大学 Rehabilitation exercise quality assessment system based on contrast learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881731A (en) * 2020-05-19 2020-11-03 广东国链科技股份有限公司 Behavior recognition method, system, device and medium based on human skeleton
CN111985343A (en) * 2020-07-23 2020-11-24 深圳大学 Method for constructing behavior recognition deep network model and behavior recognition method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8929612B2 (en) * 2011-06-06 2015-01-06 Microsoft Corporation System for recognizing an open or closed hand
CN106203363A (en) * 2016-07-15 2016-12-07 中国科学院自动化研究所 Human skeleton motion sequence Activity recognition method
CN110096950B (en) * 2019-03-20 2023-04-07 西北大学 Multi-feature fusion behavior identification method based on key frame
CN111199216B (en) * 2020-01-07 2022-10-28 上海交通大学 Motion prediction method and system for human skeleton
CN111325099B (en) * 2020-01-21 2022-08-26 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881731A (en) * 2020-05-19 2020-11-03 广东国链科技股份有限公司 Behavior recognition method, system, device and medium based on human skeleton
CN111985343A (en) * 2020-07-23 2020-11-24 深圳大学 Method for constructing behavior recognition deep network model and behavior recognition method

Also Published As

Publication number Publication date
CN112651360A (en) 2021-04-13

Similar Documents

Publication Publication Date Title
CN112651360B (en) Skeleton action recognition method under small sample
CN112395945A (en) Graph volume behavior identification method and device based on skeletal joint points
CN111814719A (en) Skeleton behavior identification method based on 3D space-time diagram convolution
CN108399435B (en) Video classification method based on dynamic and static characteristics
CN113469356A (en) Improved VGG16 network pig identity recognition method based on transfer learning
CN107679462A (en) A kind of depth multiple features fusion sorting technique based on small echo
CN110378208B (en) Behavior identification method based on deep residual error network
CN110135386B (en) Human body action recognition method and system based on deep learning
CN110728219A (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
KR102042168B1 (en) Methods and apparatuses for generating text to video based on time series adversarial neural network
CN113128424B (en) Method for identifying action of graph convolution neural network based on attention mechanism
CN109508686B (en) Human behavior recognition method based on hierarchical feature subspace learning
CN111753207B (en) Collaborative filtering method for neural map based on comments
CN113239897B (en) Human body action evaluation method based on space-time characteristic combination regression
CN104298974A (en) Human body behavior recognition method based on depth video sequence
CN107423747A (en) A kind of conspicuousness object detection method based on depth convolutional network
CN113221663A (en) Real-time sign language intelligent identification method, device and system
CN113689382A (en) Tumor postoperative life prediction method and system based on medical images and pathological images
CN110516724A (en) Visualize the high-performance multilayer dictionary learning characteristic image processing method of operation scene
CN115049739A (en) Binocular vision stereo matching method based on edge detection
CN113255569B (en) 3D attitude estimation method based on image hole convolutional encoder decoder
CN111626296A (en) Medical image segmentation system, method and terminal based on deep neural network
CN113887501A (en) Behavior recognition method and device, storage medium and electronic equipment
CN112052795B (en) Video behavior identification method based on multi-scale space-time feature aggregation
CN117115911A (en) Hypergraph learning action recognition system based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant