CN113642400A - Graph convolution action recognition method, device and equipment based on 2S-AGCN - Google Patents

Graph convolution action recognition method, device and equipment based on 2S-AGCN

Info

Publication number
CN113642400A
CN113642400A
Authority
CN
China
Prior art keywords
skeleton
neural network
convolution
time
recognition model
Prior art date
Legal status
Pending
Application number
CN202110785748.7A
Other languages
Chinese (zh)
Inventor
颜云辉
王森
宋克臣
张劲风
王仁根
Current Assignee
Northeastern University China
Original Assignee
Northeastern University China
Priority date
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN202110785748.7A
Publication of CN113642400A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a 2S-AGCN-based graph convolution action recognition method, apparatus and device, relates to the field of computer technology, and addresses the limited feature expression capability of action recognition based on deep learning neural networks. The method comprises the following steps: constructing the physical connection structure of the human body corresponding to the skeleton points of each frame in a sample set, and extracting skeleton point information and skeleton connection information from the physical connection structure; training an action recognition model with the fused features of the skeleton point information and the skeleton connection information, wherein the action recognition model is formed by alternately combining a graph convolutional neural network, which extracts spatial features, and a temporal convolutional network, which extracts temporal features; and, if the action recognition model is judged to be trained, inputting the physical connection structure information of each frame of a target sample into the action recognition model to obtain an action recognition result. The method and device are suitable for action recognition on compressed video.

Description

Graph convolution action recognition method, device and equipment based on 2S-AGCN
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, and a device for identifying a graph convolution action based on a 2S-AGCN.
Background
Deep learning convolutional neural networks offer advantages such as a wide range of recognizable categories, high accuracy and good robustness. In the same recognition scene, performing target recognition with a deep learning neural network can effectively resist interference from color, texture, illumination and similar factors. Therefore, action recognition methods based on deep learning convolutional neural networks are increasingly applied to target action recognition.
Although deep learning convolutional neural networks have many advantages, joint modeling of every part of the human body is required, which weakens the generalization capability of the model; the generalization capability can therefore be improved by learning the information of each body part using the locality and temporal dynamics of the convolutional neural network.
Existing deep learning graph neural network action recognition methods focus more on the local features of the graph convolutional neural network and neglect global features. In addition, 2S-AGCN only builds a skeleton point graph over joints that have a physical connection relation, paying more attention to information transfer between connected skeleton key points and neglecting information transfer between non-connected skeleton points, so the feature expression capability is limited.
In summary, an action recognition model which focuses on local and global features and considers information transmission among disconnected skeleton points needs to be designed, so that the graph convolution neural network captures a larger receptive field and improves the accuracy of action recognition.
Disclosure of Invention
In view of this, the present application provides a method, an apparatus, and a device for identifying a graph convolution action based on a 2S-AGCN, which are used to solve the problem of limited feature expression capability when performing action identification based on a deep learning neural network.
According to an aspect of the present application, there is provided a graph convolution action recognition method based on a 2S-AGCN, the method including:
constructing a physical connection structure of a human body corresponding to each frame of skeleton point in a sample set, and extracting skeleton point information and skeleton connection information from the physical connection structure;
training a motion recognition model by using the fusion characteristics of the skeleton point information and the skeleton connection information, wherein the motion recognition model is formed by alternately combining a graph convolution neural network and a time convolution network, the graph convolution neural network is used for extracting spatial characteristics, and the time convolution network is used for extracting time characteristics;
and if the action recognition model is judged to be trained completely, inputting the physical connection structure information of each frame in the target sample into the action recognition model, and obtaining an action recognition result.
Optionally, the constructing a physical connection structure of the human body corresponding to each frame of bone point in the sample set, and extracting bone point information and bone connection information from the physical connection structure includes:
respectively representing the number of channels, the number of skeleton points and the number of video frames of a sample in the sample set by the symbols C, V and T, wherein the initial number of channels C is 3, corresponding respectively to the abscissa, the ordinate and the coordinate confidence of the skeleton points;
creating an array for representing human body physical structure connection according to predefined skeleton point numerical indexes, wherein elements in the array consist of two skeleton points with connection relations, and skeleton connection information is determined by using the skeleton points with connection relations;
representing the human skeleton points in the sample as an undirected spatial human skeleton graph and a temporal human skeleton graph using the symbol G = (K, E), wherein K represents the skeleton point set of the t-th frame image, K = {k_ti | t = 1, 2, …, T; i = 1, 2, …, V}; E represents the set of edges connecting the skeleton points and has two subsets E_S and E_T, where E_S is the set of edges with a connection relation among the skeleton points of the t-th frame, representing the skeleton point connections within all video frames of a single sample, and E_T is the set of edges between the same skeleton point in frame t and frame t+1, representing the trajectory of a given skeleton point over time;
and dividing the skeleton points in the space human body skeleton diagram into 3 skeleton point sets representing the physical structure of the human body to obtain skeleton point information.
Optionally, the dividing the skeleton points in the human body space skeleton diagram into 3 skeleton point sets representing the physical structure of the human body to obtain skeleton point information includes:
calculating the barycentric coordinates of the human body according to the coordinates of the skeleton points in the sample set;
and according to the barycentric coordinate of the human body, dividing skeleton points in a human body space skeleton map into a first skeleton point set constructed by skeleton points, a second skeleton point set which has a connection relation with the skeleton points and is less than or equal to a preset distance threshold from the barycentric coordinate, and a third skeleton point set which has a connection relation with the skeleton points and is greater than the preset distance threshold from the barycentric coordinate.
Optionally, the method further comprises:
constructing a graph convolution neural network layer capable of extracting sample space characteristics, and improving standard two-dimensional convolution into graph convolution;
constructing a time convolution neural network layer capable of extracting sample time characteristics, and improving the standard two-dimensional convolution into time convolution;
constructing a motion recognition neural network layer, and embedding the graph convolution neural network layer and the time convolution neural network layer into the motion recognition neural network layer;
and generating a 9-layer action recognition model by utilizing the action recognition neural network layer.
Optionally, the constructing of a time convolution neural network layer capable of extracting the sample time features, and the improving of the standard two-dimensional convolution into a time convolution, include:
replacing the three parameters required by the standard two-dimensional convolution, namely the number of channels, the image width and the image height, with the parameters C, T and V respectively;
respectively inputting the features extracted by the graph convolution neural network layer into four 1 × 1 first convolution layers to raise the dimension of the feature map, so that the number of output channels is 1/8 of the number of final output channels of the time convolution neural network layer;
respectively inputting the output features of the first convolution layers into four 3 × 1 dilated convolution layers with dilation rates of 1, 2, 3 and 4, and extracting time features of different scales using dilated convolutions with different receptive fields, wherein the numbers of input and output channels before and after the dilated convolution layers are the same, namely 1/8 of the number of final output channels of the time convolution neural network layer;
splicing the 4 groups of time features pairwise, so that the number of output channels is 1/2 of the number of final output channels of the time convolution neural network layer;
inputting the pairwise splicing results into a 1 × 1 second convolution layer to raise the dimension of the feature map, so that the number of output channels is the same as the number of final output channels of the time convolution neural network layer;
inputting the output result of the second convolution layer into an SENet layer to improve the channel attention of the time convolution neural network layer;
and arranging a 1 × 1 third convolution layer with a stride of 2 between the input and the output of the time convolution neural network layer, the third convolution layer being used to stabilize training.
Optionally, the training of the motion recognition model using the fused features of the bone point information and the bone connection information includes:
fusing the skeleton point information and the skeleton connection information by using a weighted average method to obtain fusion characteristics;
inputting the fusion features into a fully connected layer and a Softmax layer in the action recognition model in sequence to obtain action category prediction results, wherein the fusion features carry action category labeling results;
and if the accuracy of the category prediction result is determined to be greater than a preset threshold value according to the action category labeling result, judging that the training of the action recognition model is finished.
Optionally, if it is determined that the motion recognition model training is completed, inputting physical connection structure information of each frame in a target sample into the motion recognition model to obtain a motion recognition result, where the method includes:
if the action recognition model is judged to be trained, extracting target fusion features formed by fusing the skeleton point information and the skeleton connection information of each frame in a target sample;
inputting the target fusion characteristics into a trained action recognition model, and acquiring evaluation values corresponding to all preset action categories;
and determining the preset action category with the highest evaluation score as the action recognition result of each frame in the target sample.
According to another aspect of the present application, there is provided a 2S-AGCN-based graph convolution action recognition apparatus, including:
the extraction module is used for constructing a physical connection structure of a human body corresponding to each frame of skeleton point in a sample set and extracting skeleton point information and skeleton connection information from the physical connection structure;
the training module is used for training a motion recognition model by utilizing the fusion characteristics of the skeleton point information and the skeleton connection information, the motion recognition model is formed by utilizing a graph convolution neural network and a time convolution network which are alternately combined, the graph convolution neural network is used for extracting spatial characteristics, and the time convolution network is used for extracting time characteristics;
and the obtaining module is used for inputting the physical connection structure information of each frame in the target sample into the action recognition model to obtain an action recognition result if the action recognition model is judged to be trained completely.
Optionally, the extracting module is specifically configured to:
respectively representing the number of channels, the number of skeleton points and the number of video frames of a sample in the sample set by the symbols C, V and T, wherein the initial number of channels C is 3, corresponding respectively to the abscissa, the ordinate and the coordinate confidence of the skeleton points;
creating an array for representing human body physical structure connection according to predefined skeleton point numerical indexes, wherein elements in the array consist of two skeleton points with connection relations, and skeleton connection information is determined by using the skeleton points with connection relations;
representing the human skeleton points in the sample as an undirected spatial human skeleton graph and a temporal human skeleton graph using the symbol G = (K, E), wherein K represents the skeleton point set of the t-th frame image, K = {k_ti | t = 1, 2, …, T; i = 1, 2, …, V}; E represents the set of edges connecting the skeleton points and has two subsets E_S and E_T, where E_S is the set of edges with a connection relation among the skeleton points of the t-th frame, representing the skeleton point connections within all video frames of a single sample, and E_T is the set of edges between the same skeleton point in frame t and frame t+1, representing the trajectory of a given skeleton point over time;
and dividing the skeleton points in the space human body skeleton diagram into 3 skeleton point sets representing the physical structure of the human body to obtain skeleton point information.
Optionally, the extracting module is specifically configured to:
calculating the barycentric coordinates of the human body according to the coordinates of the skeleton points in the sample set;
and according to the barycentric coordinate of the human body, dividing skeleton points in a human body space skeleton map into a first skeleton point set constructed by skeleton points, a second skeleton point set which has a connection relation with the skeleton points and is less than or equal to a preset distance threshold from the barycentric coordinate, and a third skeleton point set which has a connection relation with the skeleton points and is greater than the preset distance threshold from the barycentric coordinate.
Optionally, the apparatus further comprises: the system comprises a first building module, a second building module, a third building module and a generating module;
the first construction module is used for constructing a graph convolution neural network layer capable of extracting sample space characteristics, and improving standard two-dimensional convolution into graph convolution;
the second construction module is used for constructing a time convolution neural network layer capable of extracting the sample time characteristics, and improving the standard two-dimensional convolution into time convolution;
the third construction module is used for constructing an action recognition neural network layer and embedding the graph convolution neural network layer and the time convolution neural network layer into the action recognition neural network layer;
and the generating module is used for generating a 9-layer action recognition model by utilizing the action recognition neural network layer.
Optionally, the second building module is specifically configured to:
replacing the three parameters required by the standard two-dimensional convolution, namely the number of channels, the image width and the image height, with the parameters C, T and V respectively;
respectively inputting the features extracted by the graph convolution neural network layer into four 1 × 1 first convolution layers to raise the dimension of the feature map, so that the number of output channels is 1/8 of the number of final output channels of the time convolution neural network layer;
respectively inputting the output features of the first convolution layers into four 3 × 1 dilated convolution layers with dilation rates of 1, 2, 3 and 4, and extracting time features of different scales using dilated convolutions with different receptive fields, wherein the numbers of input and output channels before and after the dilated convolution layers are the same, namely 1/8 of the number of final output channels of the time convolution neural network layer;
splicing the 4 groups of time features pairwise, so that the number of output channels is 1/2 of the number of final output channels of the time convolution neural network layer;
inputting the pairwise splicing results into a 1 × 1 second convolution layer to raise the dimension of the feature map, so that the number of output channels is the same as the number of final output channels of the time convolution neural network layer;
inputting the output result of the second convolution layer into an SENet layer to improve the channel attention of the time convolution neural network layer;
and arranging a 1 × 1 third convolution layer with a stride of 2 between the input and the output of the time convolution neural network layer, the third convolution layer being used to stabilize training.
Optionally, the training module is specifically configured to:
fusing the skeleton point information and the skeleton connection information by using a weighted average method to obtain fusion characteristics;
inputting the fusion features into a fully connected layer and a Softmax layer in the action recognition model in sequence to obtain action category prediction results, wherein the fusion features carry action category labeling results;
and if the accuracy of the category prediction result is determined to be greater than a preset threshold value according to the action category labeling result, judging that the training of the action recognition model is finished.
Optionally, the obtaining module is specifically configured to:
if the action recognition model is judged to be trained, extracting target fusion features formed by fusing the skeleton point information and the skeleton connection information of each frame in a target sample;
inputting the target fusion characteristics into a trained action recognition model, and acquiring evaluation values corresponding to all preset action categories;
and determining the preset action category with the highest evaluation score as the action recognition result of each frame in the target sample.
According to still another aspect of the present application, there is provided a non-transitory readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described 2S-AGCN-based graph convolution action recognition method.
According to yet another aspect of the present application, there is provided a computer apparatus including a non-volatile readable storage medium, a processor, and a computer program stored on the non-volatile readable storage medium and executable on the processor, the processor implementing a 2S-AGCN based graph convolution action recognition method when executing the program.
By means of the technical scheme, compared with the current mode of carrying out motion recognition based on the 2S-AGCN, the method, the device and the equipment for recognizing the graph convolution motion based on the 2S-AGCN can train a motion recognition model formed by alternately combining a graph convolution neural network and a time convolution network based on the skeleton point information and the skeleton connection information, and carry out motion recognition on a target sample by using the trained motion recognition model to obtain a motion recognition result. According to the technical scheme, when the skeleton connection information is considered, information transmission among disconnected skeleton points is also considered, a space map convolution network structure and a time convolution network structure in the motion recognition model are improved, the receptive fields of a space domain and a time domain are enlarged, more information can be extracted, and the training precision of model motion recognition is improved.
The foregoing description is only an overview of the technical solutions of the present application, and the present application can be implemented according to the content of the description in order to make the technical means of the present application more clearly understood, and the following detailed description of the present application is given in order to make the above and other objects, features, and advantages of the present application more clearly understandable.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application to the disclosed embodiment. In the drawings:
fig. 1 is a schematic flowchart illustrating a method for identifying a graph convolution action based on a 2S-AGCN according to an embodiment of the present application;
fig. 2 is a schematic flowchart illustrating another graph convolution action recognition method based on a 2S-AGCN according to an embodiment of the present application;
fig. 3 is a schematic flowchart illustrating graph convolution action recognition based on a 2S-AGCN according to an embodiment of the present application;
fig. 4 is a schematic structural diagram illustrating a graph convolution action recognition apparatus based on a 2S-AGCN according to an embodiment of the present application;
fig. 5 is a schematic structural diagram illustrating another graph convolution action recognition apparatus based on a 2S-AGCN according to an embodiment of the present application.
Detailed Description
The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Aiming at the problem that the feature expression capability is limited when the action recognition is carried out based on the deep learning neural network at present, the embodiment of the application provides a graph convolution action recognition method based on 2S-AGCN, as shown in fig. 1, the method comprises the following steps:
101. and constructing a physical connection structure of each frame of skeleton point in the sample set corresponding to the human body, and extracting skeleton point information and skeleton connection information from the physical connection structure.
In real life, people usually need to coordinate different body parts when performing an action, but existing deep learning graph neural network action recognition methods only consider the relation between skeleton key points that have a direct connection relation and do not additionally consider the relation between skeleton key points without a physical connection; for example, some actions depend on the cooperation of the left wrist and the right wrist at the same time, yet the skeleton key points of these two parts have no direct connection relation. Therefore, the present application aims to consider not only information transfer between skeleton key points with a direct connection relation, but also information transfer between skeleton key points without a direct connection relation, thereby addressing the limited feature expression capability of action recognition based on deep learning neural networks.
The sample set can correspond to an existing public data set, such as the NTU-RGB+D skeleton point data set or the NEU-family data set. A single video frame in the NTU-RGB+D data set has 25 skeleton key points, so the human skeleton graph has three 25 × 25 adjacency matrices; the NEU-family data set has 17 skeleton key points per frame, so the human skeleton graph has three 17 × 17 adjacency matrices. The following embodiments take the 25 skeleton points of the NTU-RGB+D skeleton point data set as an example. Each skeleton point sample in the sample set is labeled with an action category, which provides the basis for training the action recognition model. The skeleton point information corresponds to the human skeleton point coordinates (x, y), and the skeleton connection information is a connection vector determined from the coordinates of skeleton points with a connection relation: assuming one skeleton point has coordinates (x1, y1) and the other has coordinates (x2, y2), the skeleton connection information can be represented by the vector (x2 - x1, y2 - y1).
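For illustration only, the following Python sketch computes such bone vectors from the per-frame joint coordinates; the helper name bone_vectors, the 0-based index convention of the example, and the array layout are assumptions for this sketch, not part of the patent.

```python
import numpy as np

def bone_vectors(joints, edges):
    """Compute bone connection vectors (x2 - x1, y2 - y1) for each connected joint pair.

    joints: array of shape (V, 3) holding (x, y, confidence) per skeleton point.
    edges:  list of (child, parent) index pairs with a physical connection (0-based here).
    """
    bones = np.zeros((len(edges), 2), dtype=np.float32)
    for i, (child, parent) in enumerate(edges):
        bones[i] = joints[child, :2] - joints[parent, :2]  # (x2 - x1, y2 - y1)
    return bones

# Example: two joints and one connection
joints = np.array([[0.2, 0.5, 0.9],
                   [0.3, 0.8, 0.8]], dtype=np.float32)
print(bone_vectors(joints, [(1, 0)]))  # -> approximately [[0.1, 0.3]]
```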
For this embodiment, when constructing the physical connection structure of the human body corresponding to each frame of the skeleton point in the sample set and extracting the skeleton point information and the skeleton connection information from the physical connection structure, the physical connection structure of the human body can be represented by predefining each frame of the skeleton point in the sample set, and accordingly, the method specifically includes: the number of channels, the number of skeleton points and the number of video frames of the samples in the sample set are respectively represented by a symbol C, V, T; an array is used for representing the connection relation among 25 skeleton points of each frame in the NTU-RGB + D skeleton data set, and skeleton connection information is determined by using the skeleton points with the connection relation; representing the human skeleton points in the sample as a spatial human skeleton map and a temporal human skeleton map in an undirected way by using symbols G (K, E); and dividing the neighbor set of the skeleton points in the space human skeleton diagram into 3 skeleton point sets, representing 3 subsets of the physical structure of the human body, and obtaining the skeleton point information.
The execution subject of this embodiment may be a graph convolution action recognition system, in which a trained action recognition model may be embedded, and action recognition of the target sample may be implemented based on the skeleton point information and the skeleton connection information by using the action recognition model.
102. And training a motion recognition model by using the fusion characteristics of the skeleton point information and the skeleton connection information, wherein the motion recognition model is formed by alternately combining a graph convolution neural network and a time convolution network, the graph convolution neural network is used for extracting the spatial characteristics, and the time convolution network is used for extracting the time characteristics.
The fusion feature is obtained by fusing the bone point information and the bone connection information by using a weighted average method.
For this embodiment, the spatial graph convolution network structure and the temporal convolution network structure in the action recognition model can be improved: the graph convolutional neural network and the temporal convolutional network are cascaded one after another and integrated, in an alternating fashion, into a neural network layer, which is then stacked to 9 layers to form the action recognition network; the action recognition model is then generated by training this network. Training the action recognition model with the improved network further expands the receptive fields of the spatial and temporal domains, so that more information can be extracted and the training accuracy of action recognition is improved. In addition to the alternately cascaded graph convolution layer and temporal convolution layer, each layer of the action recognition neural network also contains a Batch Normalization layer and a ReLU nonlinear activation layer. First, the spatial features of each frame in the sample skeleton key point sequence are extracted by the graph convolution layer; then Batch Normalization keeps the input of each network layer identically distributed, so that the network trains stably; the data are then nonlinearly activated and network overfitting is prevented by the ReLU function and the Dropout strategy, respectively; finally, the temporal features of consecutive frames in the sample skeleton key point sequence are extracted by the temporal convolution layer and nonlinearly activated with the ReLU function. In addition, the input and output of the neural network follow the residual block structure of ResNet, with a residual connection placed before and after, so that network training is more stable.
103. And if the training of the action recognition model is judged to be finished, inputting the physical connection structure information of each frame in the target sample into the action recognition model, and obtaining an action recognition result.
For this embodiment, if the action recognition accuracy of the action recognition model is judged to be greater than a preset threshold, the action recognition model can be judged to be trained; the action recognition result is the action category corresponding to the target sample. Furthermore, after the action recognition model is judged to be trained, it can be used to recognize the actions in an unknown compressed video: by inputting the physical connection structure information of each frame of the target sample into the action recognition model, the model determines the action category corresponding to the target sample according to the fused features of the target sample.
By the graph convolution motion recognition method based on the 2S-AGCN in the embodiment, a motion recognition model formed by alternately combining a graph convolution neural network and a time convolution network can be trained based on bone point information and bone connection information, and a motion recognition result is obtained by performing motion recognition on a target sample by using the trained motion recognition model. According to the technical scheme, when the skeleton connection information is considered, information transmission among disconnected skeleton points is also considered, a space map convolution network structure and a time convolution network structure in the motion recognition model are improved, the receptive fields of a space domain and a time domain are enlarged, more information can be extracted, and the training precision of model motion recognition is improved.
Further, as a refinement and an extension of the specific implementation of the foregoing embodiment, in order to fully illustrate the specific implementation process in this embodiment, another graph convolution action recognition method based on 2S-AGCN is provided, as shown in fig. 2, the method includes:
201. the number of channels, the number of skeleton points, and the number of video frames of the samples in the sample set are represented by C, V, T.
For this embodiment, the initial value of the channel number C is 3, corresponding respectively to the abscissa, the ordinate and the coordinate confidence of the skeleton points; the batch size during neural network training and testing is denoted by the symbol N; there are 25 skeleton points per frame in the NTU-RGB+D skeleton data set, so the number of skeleton points V is set to 25. When the action recognition neural network is trained, the batch size N and the number of skeleton points V are fixed values that do not change across layers, whereas the channel number C and the video frame number T change as the number of network layers increases.
202. And creating an array for representing the physical structure connection of the human body according to the predefined skeleton point numerical index, wherein elements in the array consist of two skeleton points with connection relations, and the skeleton connection information is determined by using the skeleton points with connection relations.
For the present embodiment, the elements in the array are composed of two bone points in a connection relationship, such as [ (1,2), (2,21), (3,21), (4,3), (5,21), (6,5), (7,6), (8,7), (9,21), (10,9), (11,10), (12,11), (13,1), (14,13), (15,14), (16,15), (17,1), (18,17), (19,18), (20,19), (22,23), (23,8), (24,25), (25,12) ], where each numerical index represents a bone point, and is predefined by the NTU-RGB + D bone point data set.
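As a hedged illustration of how such an edge array can be turned into an adjacency matrix, a minimal NumPy sketch follows; it assumes the 1-based NTU-RGB+D joint indices listed above and a symmetric 0/1 adjacency, which is one natural reading of the description rather than the patent's own code.

```python
import numpy as np

NTU_EDGES = [(1, 2), (2, 21), (3, 21), (4, 3), (5, 21), (6, 5), (7, 6), (8, 7),
             (9, 21), (10, 9), (11, 10), (12, 11), (13, 1), (14, 13), (15, 14),
             (16, 15), (17, 1), (18, 17), (19, 18), (20, 19), (22, 23), (23, 8),
             (24, 25), (25, 12)]

def adjacency_from_edges(edges, num_joints=25):
    """Build a symmetric 0/1 adjacency matrix from 1-based (joint, joint) pairs."""
    A = np.zeros((num_joints, num_joints), dtype=np.float32)
    for i, j in edges:
        A[i - 1, j - 1] = 1.0
        A[j - 1, i - 1] = 1.0
    return A

A = adjacency_from_edges(NTU_EDGES)
print(A.shape, int(A.sum()))  # (25, 25) 48 -> 24 undirected edges stored in both directions
```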
203. The human skeleton points in the sample are represented by the symbol G = (K, E) as undirected spatial and temporal human skeleton graphs.
Here K denotes the set of skeleton points of the t-th frame, K = {k_ti | t = 1, 2, …, T; i = 1, 2, …, V}; E represents the set of edges connecting the skeleton points and has two subsets E_S and E_T, which are respectively the spatial and temporal edges needed for extracting spatial and temporal features: E_S is the set of edges with a connection relation among the skeleton points of the t-th frame, representing the skeleton key point connections within all video frames of a single sample, and E_T is the set of edges between the same skeleton point in frame t and frame t+1, representing the time-varying trajectory of a given skeleton point.
204. The skeleton points in the space human skeleton diagram are divided into 3 skeleton point sets representing the physical structure of the human body, and the skeleton point information is obtained.
For this embodiment, step 204 may specifically include: calculating the barycentric coordinates of the human body according to the coordinates of the skeleton points in the sample set; and, according to the barycentric coordinates of the human body, dividing the skeleton points in the human spatial skeleton graph into a first skeleton point set constructed from the skeleton point itself, a second skeleton point set of points that have a connection relation with the skeleton point and whose distance from the barycentric coordinates is less than or equal to a preset distance threshold, and a third skeleton point set of points that have a connection relation with the skeleton point and whose distance from the barycentric coordinates is greater than the preset distance threshold.
Specifically, the barycentric coordinates can be obtained from the barycenter formula:

x_g = (x1 + x2 + … + x25) / 25, y_g = (y1 + y2 + … + y25) / 25

where x_g and y_g are respectively the abscissa and ordinate of the center of gravity, x1, x2, …, x25 are the abscissas of the 25 skeleton points of the human body, and y1, y2, …, y25 are the corresponding ordinates.
Furthermore, the skeleton points in the human spatial skeleton graph can be divided into a first skeleton point set constructed from the skeleton point itself, a second skeleton point set of points that have a connection relation with the skeleton point and whose distance from the barycentric coordinates is less than or equal to a preset distance threshold, and a third skeleton point set of points that have a connection relation with the skeleton point and whose distance from the barycentric coordinates is greater than the preset distance threshold.
The first skeleton point set is the skeleton point itself, represented by a 25 × 25 adjacency matrix A1 whose main diagonal elements are all 1.
The second skeleton point set is the centripetal skeleton point set, i.e. skeleton points that have a connection relation with the given skeleton point and are closer to the center of gravity, represented by a 25 × 25 adjacency matrix A2.
The third skeleton point set is the centrifugal skeleton point set, i.e. skeleton points that have a connection relation with the given skeleton point and are farther from the center of gravity, represented by a normalized 25 × 25 adjacency matrix A3.
in this example, A1、A2And A3The 3 adjacent matrixes are uniformly denoted by symbol AkAnd (4) showing.
As a preferable mode, in the present embodiment, in order to improve the accuracy of the motion recognition model, the spatial map convolution network structure and the temporal convolution network structure in the motion recognition model may be improved to further expand the receptive fields of the spatial domain and the temporal domain, so that more information can be extracted. Correspondingly, the embodiment steps may specifically include: constructing a graph convolution neural network layer capable of extracting sample space characteristics, and improving standard two-dimensional convolution into graph convolution; constructing a time convolution neural network layer capable of extracting sample time characteristics, and improving the standard two-dimensional convolution into time convolution; constructing a motion recognition neural network layer, and embedding the graph convolution neural network layer and the time convolution neural network layer into the motion recognition neural network layer; and generating a 9-layer action recognition model by utilizing the action recognition neural network layer. After completing the improved construction of the motion recognition model, the embodiment step 205 may be further performed, in which the motion recognition model is trained by using the fusion characteristics of the bone point information and the bone connection information.
When constructing a graph convolutional neural network layer capable of extracting sample spatial features and improving the standard two-dimensional convolution into a graph convolution, note that in the graph convolutional neural network the feature map is a C × T × V tensor, where C is the number of channels, T the number of frames and V the number of skeleton key points. To implement graph convolution in the neural network, the two-dimensional convolution formula can be modified and an adjacency matrix of disconnected skeleton key points added, giving:
f_out = Σ (k = 1 … Kv) W_k · f_in · (A_k + B_k + β·C_k + D_k)
where f_in and f_out are the input and output feature maps, W_k is the weight vector, and Kv = 3 because the skeleton point neighbor set in the human spatial skeleton graph has 3 subsets. A_k is the V × V adjacency matrix representing the physical structure of the human body. B_k is also a V × V adjacency matrix, but unlike A_k its parameters are continuously updated during network training. C_k is a V × V matrix which, unlike A_k, represents a fixed skeleton graph without links; the coefficient β applied to C_k is trained along with the network, and the initial value of β is set to 0. D_k is the V × V adjacency matrix used in 2S-AGCN to determine whether a connection exists between two skeleton key points and the strength of that connection.
The feature map input to the graph convolution layer has size C × T × V. The number of channels is first changed by 1 × 1 convolutions, the features extracted by the two convolutions are multiplied to measure their similarity, and the result is normalized by a softmax function to obtain the matrix D_k. Inspired by the non-local module, the features extracted by the 1 × 1 convolution of one branch are multiplied by this matrix, and another 1 × 1 convolution converts the feature map back to size C × T × V. The matrices A_k, B_k, C_k and D_k are then added and multiplied with the input, where A_k is the adjacency matrix defining the physical structure of the human body, which is the same for all convolutional layers and all samples, B_k is a matrix that is continuously updated as the network trains, and C_k is the adjacency matrix of disconnected human skeleton key points. The feature map is then converted to size C × T × V by another 1 × 1 convolution in the same manner. The two branches are added to form the output of the graph convolution module. Following the ResNet residual block structure, a 1 × 1 convolution connects the input and output to stabilize training.
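A condensed PyTorch-style sketch of the adaptive graph convolution described above follows; the class name, the embedding dimension and the exact placement of β are assumptions based on this description, not the patent's own code.

```python
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    """Graph convolution over A_k + B_k + beta*C_k + D_k, one branch per subset k."""
    def __init__(self, in_c, out_c, A, C, embed_c=16):
        super().__init__()
        # A and C are float tensors of shape (K, V, V): physical graph and fixed "no-link" graph
        self.K = A.shape[0]
        self.A = nn.Parameter(A.clone(), requires_grad=False)   # fixed physical graph A_k
        self.B = nn.Parameter(torch.zeros_like(A))               # learned graph B_k
        self.C = nn.Parameter(C.clone(), requires_grad=False)   # fixed no-link graph C_k
        self.beta = nn.Parameter(torch.zeros(1))                  # gate for C_k, initialised to 0
        self.theta = nn.ModuleList([nn.Conv2d(in_c, embed_c, 1) for _ in range(self.K)])
        self.phi = nn.ModuleList([nn.Conv2d(in_c, embed_c, 1) for _ in range(self.K)])
        self.conv = nn.ModuleList([nn.Conv2d(in_c, out_c, 1) for _ in range(self.K)])
        self.residual = nn.Conv2d(in_c, out_c, 1) if in_c != out_c else nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                                # x: (N, C, T, V)
        N, _, T, V = x.shape
        out = 0
        for k in range(self.K):
            # data-dependent matrix D_k via embedded similarity + softmax
            q = self.theta[k](x).permute(0, 3, 1, 2).reshape(N, V, -1)   # (N, V, C'*T)
            kf = self.phi[k](x).reshape(N, -1, V)                          # (N, C'*T, V)
            D = torch.softmax(torch.bmm(q, kf), dim=-1)                   # (N, V, V)
            G = self.A[k] + self.B[k] + self.beta * self.C[k] + D         # combined graph
            xk = torch.einsum('nctv,nvw->nctw', x, G)
            out = out + self.conv[k](xk)
        return self.relu(out + self.residual(x))
```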
Accordingly, the function of the temporal convolution module in the neural network layer is to extract the temporal features of the skeleton key points between adjacent video frames, and it only needs to use two-dimensional convolution as in the image domain. Since the temporal features extracted by a single 9 × 1 temporal convolution are limited, and the influence of other time ranges on the temporal convolution is often ignored, the present application provides a multi-scale temporal dilated convolution network based on ResNeXt, which serves as the temporal convolution module and is plug-and-play in the graph convolution action recognition network. Specifically, the features extracted by the graph convolution are first fed into four 1 × 1 convolution layers to raise the dimension of the feature map, the number of output channels being 1/8 of the final number of output channels; they are then fed into four 3 × 1 dilated convolution layers with dilation rates of 1, 2, 3 and 4, and temporal features of different scales are extracted using dilated convolutions with different receptive fields, the numbers of input and output channels remaining the same, namely 1/8 of the final number of output channels; the 4 groups of temporal features are spliced, giving 1/2 of the final number of output channels; a 1 × 1 convolution layer then raises the dimension of the feature map so that the number of output channels equals the final number of output channels; and a final SENet layer is used to boost the channel attention of the temporal convolution layer. In addition, following the shortcut structure of ResNet, a 1 × 1 convolution layer with a stride of 2 is added between the input and the output to stabilize training.
Correspondingly, the embodiment steps may specifically include: replacing the three parameters required by the standard two-dimensional convolution, namely the number of channels, the image width and the image height, with the parameters C, T and V respectively; respectively inputting the features extracted by the graph convolutional neural network layer into four 1 × 1 first convolution layers to raise the dimension of the feature map, so that the number of output channels is 1/8 of the final number of output channels of the temporal convolutional neural network layer; respectively inputting the output features of the first convolution layers into four 3 × 1 dilated convolution layers with dilation rates of 1, 2, 3 and 4, and extracting temporal features of different scales using dilated convolutions with different receptive fields, wherein the numbers of input and output channels before and after the dilated convolution layers are the same, namely 1/8 of the final number of output channels of the temporal convolutional neural network layer; splicing the 4 groups of temporal features pairwise, so that the number of output channels is 1/2 of the final number of output channels of the temporal convolutional neural network layer; inputting the pairwise splicing results into a 1 × 1 second convolution layer to raise the dimension of the feature map, so that the number of output channels is the same as the final number of output channels of the temporal convolutional neural network layer; inputting the output of the second convolution layer into an SENet layer to improve the channel attention of the temporal convolutional neural network layer; and arranging a 1 × 1 third convolution layer with a stride of 2 between the input and the output of the temporal convolutional neural network layer, the third convolution layer being used to stabilize training.
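The multi-scale temporal module described above might be sketched as follows; the SE block is a standard squeeze-and-excitation implementation and the reduction ratio is an assumption, since the patent only states that an SENet layer is used.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation over the channel dimension."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                                  # (N, C, T, V)
        w = self.fc(x.mean(dim=(2, 3)))                    # channel attention weights
        return x * w.view(x.size(0), -1, 1, 1)

class MultiScaleTemporalConv(nn.Module):
    """Four 3x1 dilated temporal convolutions (rates 1-4), pairwise concat, 1x1 fuse, SE, shortcut."""
    def __init__(self, in_c, out_c, stride=1):
        super().__init__()
        mid = out_c // 8                                   # each branch carries 1/8 of the final channels
        self.reduce = nn.ModuleList([nn.Conv2d(in_c, mid, 1) for _ in range(4)])
        self.dilated = nn.ModuleList([
            nn.Conv2d(mid, mid, (3, 1), stride=(stride, 1),
                      padding=(d, 0), dilation=(d, 1)) for d in (1, 2, 3, 4)])
        self.expand = nn.Conv2d(out_c // 2, out_c, 1)      # 4 * mid = out_c/2 -> out_c
        self.se = SEBlock(out_c)
        self.shortcut = nn.Conv2d(in_c, out_c, 1, stride=(stride, 1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                                  # (N, C, T, V)
        feats = [conv(red(x)) for red, conv in zip(self.reduce, self.dilated)]
        pairs = [torch.cat(feats[0:2], dim=1), torch.cat(feats[2:4], dim=1)]   # pairwise splicing
        y = self.expand(torch.cat(pairs, dim=1))           # back to out_c channels
        y = self.se(y)
        return self.relu(y + self.shortcut(x))
```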
When constructing the action recognition neural network layer and embedding the graph convolutional neural network layer and the temporal convolutional neural network layer into it, the method specifically includes: extracting the spatial features of each frame in the sample skeleton point sequence using the graph convolutional neural network layer; keeping the input of each network layer identically distributed using Batch Normalization, so that the network trains stably; performing nonlinear activation and preventing network overfitting using the ReLU function and the Dropout strategy, respectively; extracting the temporal features of consecutive frames in the sample skeleton point sequence using the temporal convolutional neural network layer; and applying residual connections before and after the input and output of the action recognition neural network layer, so that network training is more stable.
When the action recognition neural network layer is used to generate the 9-layer action recognition model, the method specifically includes: first, the input is normalized to eliminate the influence of different data dimensions; spatial and temporal features are then extracted through the 9 action recognition neural network layers, where the first 3 layers have 64 output channels, the middle 3 layers have 128 output channels and the last 3 layers have 256 output channels; the strides of layers 1, 2, 3, 5, 6, 8 and 9 are set to 1, and, to reduce computation, the stride is increased from 1 to 2 at layers 4 and 7; finally, the extracted features are pooled to retain the main features.
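A sketch of the resulting 9-layer backbone under the channel and stride schedule described above is given below, reusing the AdaptiveGraphConv and MultiScaleTemporalConv classes from the earlier sketches; the dropout rate and the choice of the fixed no-link graph are assumptions, as the patent does not fix them.

```python
import torch
import torch.nn as nn

class STBlock(nn.Module):
    """One action-recognition layer: graph conv -> BN -> ReLU -> Dropout -> temporal conv, with residual."""
    def __init__(self, in_c, out_c, A, stride=1, dropout=0.5):
        super().__init__()
        # torch.ones_like(A) - A is one simple choice for the fixed "no-link" graph C_k (assumption)
        self.gcn = AdaptiveGraphConv(in_c, out_c, A, torch.ones_like(A) - A)
        self.bn = nn.BatchNorm2d(out_c)
        self.drop = nn.Dropout(dropout)
        self.tcn = MultiScaleTemporalConv(out_c, out_c, stride=stride)
        self.residual = (nn.Identity() if in_c == out_c and stride == 1
                         else nn.Conv2d(in_c, out_c, 1, stride=(stride, 1)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.drop(self.relu(self.bn(self.gcn(x))))
        return self.relu(self.tcn(y) + self.residual(x))

class ActionRecognitionNet(nn.Module):
    """Nine stacked blocks: channels 64-64-64-128-128-128-256-256-256, stride 2 at layers 4 and 7."""
    def __init__(self, A, in_c=3, num_classes=60):
        super().__init__()
        self.data_bn = nn.BatchNorm1d(in_c * A.shape[-1])     # normalise the raw input coordinates
        cfg = [(in_c, 64, 1), (64, 64, 1), (64, 64, 1),
               (64, 128, 2), (128, 128, 1), (128, 128, 1),
               (128, 256, 2), (256, 256, 1), (256, 256, 1)]
        self.layers = nn.ModuleList([STBlock(i, o, A, s) for i, o, s in cfg])
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):                                     # x: (N, C, T, V)
        N, C, T, V = x.shape
        x = self.data_bn(x.permute(0, 1, 3, 2).reshape(N, C * V, T))
        x = x.reshape(N, C, V, T).permute(0, 1, 3, 2)
        for layer in self.layers:
            x = layer(x)
        x = x.mean(dim=(2, 3))                                # global average pooling over T and V
        return self.fc(x)
```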
205. And training the motion recognition model by using the fusion characteristics of the skeleton point information and the skeleton connection information.
For this embodiment, after the fused features are input sequentially into the fully connected layer and the Softmax layer of the action recognition model, category scores for each action category are obtained, and the action category with the highest score is the recognition result predicted by the action recognition model. The accuracy of the prediction can be calculated by matching the predicted result against the labeled action category of the sample compressed video, and this accuracy can then be used to determine whether training of the action recognition model is complete. Correspondingly, step 205 may specifically include: fusing the skeleton point information and the skeleton connection information using a weighted average method to obtain the fused features; inputting the fused features sequentially into the fully connected layer and the Softmax layer of the action recognition model to obtain action category prediction results, wherein the fused features carry the action category labeling results; and, if the accuracy of the category prediction results, evaluated against the action category labeling results, is determined to be greater than a preset threshold, judging that training of the action recognition model is complete.
Correspondingly, when the bone point information and the bone connection information are fused by using a weighted average method to obtain a fusion feature, the embodiment specifically includes the following steps: based on a preset weight ratio, calculating a weighted average value corresponding to the bone point information and the bone connection information; the weighted average is determined as the fusion feature. The preset weight ratio can be set according to an actual application scene, and the preset weights corresponding to the bone point information and the bone connection information can be both 50%. In addition, different preset weights can be configured for the bone point information and the bone connection information respectively by combining with an actual application scene, for example, the preset weight corresponding to the bone point information is set to be 40%, and the preset weight corresponding to the bone connection information is set to be 60%.
For this embodiment, when training the action recognition model, 2S-AGCN adopts a two-stream strategy specifically designed for graph convolution action recognition neural networks based on human skeleton key points: the action recognition neural network is trained with two different inputs, the human skeleton key points and the human bones, the scores computed by the Softmax classifier are then added to obtain the fused scores, and the category with the highest score is the action category determined by the algorithm. Referring to the schematic flowchart of 2S-AGCN-based graph convolution action recognition shown in fig. 3, the input of branch 1 is the coordinates (x, y) of the human skeleton points of a sample in the data set, i.e. the skeleton point information; the input of branch 2 is the bone connection information determined by two skeleton points with a connection relation: assuming one skeleton point has coordinates (x1, y1) and the other has coordinates (x2, y2), the bone connection information can be represented by the vector (x2 - x1, y2 - y1). The features of both inputs are extracted by the constructed 9-layer action recognition network and then fused to obtain the fused features; the fused result is fed sequentially into the fully connected layer and the Softmax layer to output category scores, and the action category with the highest score is the recognized action category.
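A hedged sketch of the two-stream inference just described: the joint stream and the bone stream are scored separately and their Softmax outputs are fused by a weighted average (50/50 by default, or for example 40/60 as mentioned above); the function and network names are those of the earlier sketches, not the patent's.

```python
import torch

def two_stream_predict(joint_net, bone_net, joints, bones, w_joint=0.5, w_bone=0.5):
    """Fuse the Softmax scores of the joint stream and the bone stream by a weighted average."""
    with torch.no_grad():
        s_joint = torch.softmax(joint_net(joints), dim=1)    # (N, num_classes)
        s_bone = torch.softmax(bone_net(bones), dim=1)
        fused = w_joint * s_joint + w_bone * s_bone           # weighted-average fusion
    return fused.argmax(dim=1), fused                         # predicted class index, fused scores
```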
When training the model, overfitting of the neural network weakens the generalization capability of the algorithm, so Dropout is used in the network before training to prevent overfitting and improve generalization; this addresses the problem that, after a certain number of training epochs, the error on the training set keeps decreasing while the error on the test set increases. The action recognition network in this application has 9 layers, and the output of each layer is the input of the next, so updating the parameters of one layer changes the data distribution of the input of the next layer; layer by layer, the change in data distribution grows larger for the higher layers, and convergence of the algorithm becomes more difficult. To suppress the influence of this shift in data distribution, a Batch Normalization data normalization strategy is introduced into the action recognition network, unifying the dimensions of the input of each layer so that the data distribution is the same, which reduces the internal covariate shift during training, helps the algorithm converge and accelerates training. The core idea of Batch Normalization is to force the distribution of the activation inputs, before the nonlinear transformation, back to a normal distribution with mean 0 and variance 1. This makes the input of the nonlinear transformation fall into the region sensitive to the input, keeps the gradient of the nonlinear activation function large, accelerates convergence and avoids the vanishing gradient problem. In the early stage of neural network training a larger learning rate is often used to optimize the algorithm, but if a large learning rate is kept throughout, the optimal solution of the network may be skipped and the number of iterations increases, so the learning rate needs to be decayed. The present application therefore adopts step decay of the learning rate: the initial learning rate is set to 0.1 and the number of training epochs to 50, and when training reaches epochs 15, 30 and 40 the learning rate is decayed, each time to one tenth of the previous learning rate, thereby adjusting the gradient descent direction and accelerating convergence of the algorithm.
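The step decay schedule described above (initial learning rate 0.1, 50 epochs, decay to one tenth at epochs 15, 30 and 40) can be expressed with a standard PyTorch scheduler; the optimizer choice is an assumption, since the patent does not name one.

```python
import torch
import torch.nn as nn

def train(model, train_loader, num_epochs=50):
    """Training loop illustrating the step learning-rate decay (0.1, /10 at epochs 15, 30, 40)."""
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)          # initial learning rate 0.1
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15, 30, 40], gamma=0.1)
    for epoch in range(num_epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()   # learning rate decays to one tenth at epochs 15, 30 and 40
```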
In addition, as a preferred mode, neural network training requires a large amount of data as support, but many data sets are not large enough to meet this requirement, which leads to over-fitting. Moreover, a neural network designed for a specific application places higher demands on the data set, and its designer often has to collect and shoot images or videos to build a dedicated data set, which consumes a lot of manpower and material resources. A data augmentation strategy can expand the data set without shooting new data and improve the generalization ability of the algorithm. Common data augmentation strategies include mirror flipping, rotation, zooming, cropping, colour perturbation and noise addition. Since the input of the motion recognition method in this application is the abscissa, the ordinate and the coordinate confidence of the human skeleton key points, only changes of the abscissa and ordinate of the skeleton key points matter, and colour perturbation or noise added to the original video has no effect; the augmentation strategies adopted here are therefore mirror flipping, zooming and cropping. Mirror flipping flips the original video horizontally and greatly expands the data set: the positions of all parts of the human body change after mirroring, the left side of the body in the original video becomes the right side in the new video, and the abscissa and ordinate of the skeleton key points of each part change accordingly. Zooming enlarges the video by a certain scale and then crops it back to the same length and width as the original video, so that the resolution before and after zooming stays the same; the abscissa and ordinate of the human skeleton key points in the new video also change after zooming. The cropping strategy differs from the enlarge-then-crop order used in zooming: the video is cropped first and then enlarged back to the original size, keeping the resolution before and after cropping the same.
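A minimal sketch of the two coordinate-level augmentations described above is given below; the array layout (T, V, 3) and the centring of the zoom are assumptions for illustration, and a full implementation would also swap the indices of left and right joints after mirroring.

```python
import numpy as np

def mirror_flip(skeleton, frame_width):
    # skeleton: (T, V, 3) array of (x, y, confidence); confidences are unchanged.
    flipped = skeleton.copy()
    flipped[..., 0] = frame_width - flipped[..., 0]   # left/right parts of the body swap sides
    # Note: a complete implementation would also swap left/right joint indices here.
    return flipped

def zoom_then_crop(skeleton, scale, frame_width, frame_height):
    # Enlarge by `scale` about the frame centre, then keep the original frame size,
    # which for the key-point coordinates amounts to scaling them about the centre.
    out = skeleton.copy()
    cx, cy = frame_width / 2.0, frame_height / 2.0
    out[..., 0] = (out[..., 0] - cx) * scale + cx
    out[..., 1] = (out[..., 1] - cy) * scale + cy
    return out
```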
Correspondingly, the loss function in deep learning measures the difference between the true probability distribution and the predicted probability distribution: the smaller the loss function value, the smaller the difference between the two distributions and the better the prediction effect of the algorithm. During training, whether the network is over-fitted can be judged by combining the loss function curve with the classification accuracy curve. The motion recognition task is essentially to classify the motion made by a human target in a video and is a fully supervised classification problem, so the cross entropy loss function is generally selected when constructing the motion recognition network; the smaller its value, the better the motion recognition effect of the algorithm.
The formula for the cross entropy loss function is:
L(p, q) = -∑_x p(x)·log q(x)
where p (x) is the true probability distribution of the sample and q (x) is the predicted probability distribution of the sample.
The score of each category is output by a fully connected layer in the neural network; after the Softmax layer these scores become probability values summing to 1, which then participate, together with the one-hot encoding of the true category label, in the calculation of the cross entropy loss function.
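The following numeric sketch (the class count and scores are arbitrary assumptions) shows how the per-class scores, the Softmax probabilities and the one-hot label enter the cross entropy computation; `F.cross_entropy` in PyTorch fuses the Softmax and the negative log term.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[2.0, 0.5, -1.0]])   # fully connected layer output for one sample, 3 classes
label = torch.tensor([0])                   # true class index (one-hot form: [1, 0, 0])

# Manual computation: Softmax, then -sum(p * log q) with the one-hot true distribution,
# which reduces to -log of the probability of the true class.
probs = torch.softmax(scores, dim=1)
manual = -torch.log(probs[0, label])

# The same value from the library call.
library = F.cross_entropy(scores, label)
print(manual.item(), library.item())        # both ≈ 0.241 for these scores
```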
In a specific application scenario, updating the parameters of the neural network is an unconstrained optimization problem, and an optimizer participates in learning and updating the network parameters during training, so that by back-propagating updates and iterating the gradients the network parameters approach their optimal values and the cross entropy loss function value is minimized. Commonly used optimizers include SGD, BGD, Momentum, AdaGrad, RMSprop, Adam and the like. The motion recognition neural network in this application adopts the SGD optimizer together with a momentum strategy to update the network parameters. SGD, i.e. Stochastic Gradient Descent, randomly selects a sample from the data set and iteratively updates the weights of the neurons along the negative gradient direction; its core formula is as follows:
w_{t+1} = w_t - α·∇L(w_t)

where w_t represents the weight of the neuron after iteration t, α represents the learning rate, and ∇L(w_t) represents the gradient calculated after back propagation.
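A toy sketch of one SGD-with-momentum update is shown below; the momentum formulation (velocity accumulates the gradient, the weight moves against the velocity) and the numbers are illustrative assumptions rather than the exact update used in this application.

```python
import torch

def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    # v_{t+1} = momentum * v_t + grad ;  w_{t+1} = w_t - lr * v_{t+1}
    velocity = momentum * velocity + grad
    w = w - lr * velocity
    return w, velocity

# One illustrative update on a toy weight and gradient.
w = torch.tensor([0.5, -0.3])
grad = torch.tensor([0.2, 0.1])      # gradient obtained after back propagation
v = torch.zeros_like(w)
w, v = sgd_momentum_step(w, grad, v)
print(w)   # tensor([0.4800, -0.3100]) with lr = 0.1 and zero initial velocity
```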
In this embodiment, the NTU-RGB+D skeleton point data set can be selected to train and test the designed motion recognition neural network model. Referring to the comparison of computation and accuracy of the pre-pruning motion recognition algorithms in Table 1, under the X-Sub benchmark the 88.5% accuracy of 2S-AGCN is improved to 89.2%, and under the X-View benchmark the 95.1% accuracy of 2S-AGCN is improved to 96.0%. With the technical approach of this application, the method still obtains an improvement on top of an already high baseline accuracy.
TABLE 1 comparison of computation and accuracy of pre-pruning action recognition algorithm
Method X-Sub(%) X-View(%)
Lie Group 50.1 82.8
HBRNN 59.1 64.0
Deep LSTM 60.7 67.3
Temporal Conv 74.3 83.1
Clips+CNN+MTLN 79.6 84.8
3scale ResNet152 85.0 92.3
ST-GCN 81.5 88.3
DPRL+GCNN 83.5 89.8
2S-AGCN 88.5 95.1
Ours 89.2 96.0
206. And if the training of the action recognition model is judged to be finished, inputting the physical connection structure information of each frame in the target sample into the action recognition model, and obtaining an action recognition result.
For the present embodiment, implementing step 206 may specifically include: if the training of the action recognition model is judged to be completed, extracting the target fusion features formed by fusing the skeleton point information and the bone information of each frame in the target sample; inputting the target fusion features into the trained action recognition model, and obtaining the evaluation score corresponding to each preset action category; and determining the preset action category with the highest evaluation score as the action recognition result of each frame in the target sample. The action recognition model is composed of 9 neural network layers. The input physical connection structure information is first normalized to eliminate the influence of different data dimensions, and spatial features and temporal features are then extracted by the 9 cascaded neural network layers, in which a graph convolution module and a time convolution module are used alternately. The first 3 layers have 64 output channels, the middle 3 layers have 128 output channels, and the last 3 layers have 256 output channels. The stride of layers 1, 2, 3, 5, 6, 8 and 9 is set to 1, and to reduce computation the stride is increased from 1 to 2 at layers 4 and 7. The extracted features are then pooled to reduce the feature dimensionality and finally sent in turn to the fully connected layer and the Softmax classifier to output the probability value of each action; the action category with the highest probability value is the action category predicted by the action recognition network.
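A hedged sketch of how the nine blocks could be stacked with the channel counts and strides listed above is given below; `st_block` is a hypothetical factory for one graph-convolution plus time-convolution block, and the class count of 60 (NTU-RGB+D) is an assumption.

```python
import torch.nn as nn

# (in_channels, out_channels, temporal stride) for the nine blocks described above:
# layers 1-3 output 64 channels, layers 4-6 output 128, layers 7-9 output 256,
# and the stride rises from 1 to 2 at layers 4 and 7 to reduce computation.
CONFIG = [
    (3, 64, 1), (64, 64, 1), (64, 64, 1),
    (64, 128, 2), (128, 128, 1), (128, 128, 1),
    (128, 256, 2), (256, 256, 1), (256, 256, 1),
]

def build_backbone(st_block, num_classes=60):
    # `st_block` is a hypothetical factory returning one graph-conv + time-conv block.
    blocks = [st_block(c_in, c_out, stride) for c_in, c_out, stride in CONFIG]
    return nn.Sequential(
        *blocks,
        nn.AdaptiveAvgPool2d(1),        # global pooling over time and skeleton points
        nn.Flatten(),
        nn.Linear(256, num_classes),    # fully connected layer before the Softmax classifier
    )
```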
By the graph convolution action recognition method based on 2S-AGCN described above, an action recognition model formed by alternately combining a graph convolution neural network and a time convolution network can be trained on the skeleton point information and the bone connection information, and the trained model is used to recognize the action of the target sample and obtain an action recognition result. In this technical scheme, in addition to the bone connection information, information transmission between disconnected skeleton points is also considered, and the spatial graph convolution and time convolution structures in the action recognition model are improved, which enlarges the receptive fields in the spatial and temporal domains, allows more information to be extracted, and improves the training precision of the model for action recognition.
Further, as a specific embodiment of the method shown in fig. 1 and fig. 2, an embodiment of the present application provides a graph convolution action recognition apparatus based on 2S-AGCN, as shown in fig. 4, the apparatus including: an extraction module 31, a training module 32 and an acquisition module 33;
the extraction module 31 is used for constructing a physical connection structure of a human body corresponding to each frame of skeleton point in a sample set, and extracting skeleton point information and skeleton connection information from the physical connection structure;
the training module 32 is used for training a motion recognition model by using the fusion characteristics of the skeleton point information and the skeleton connection information, the motion recognition model is formed by alternately combining a graph convolution neural network and a time convolution network, the graph convolution neural network is used for extracting spatial characteristics, and the time convolution network is used for extracting time characteristics;
the obtaining module 33 is configured to, if it is determined that the training of the motion recognition model is completed, input the physical connection structure information of each frame in the target sample into the motion recognition model, and obtain a motion recognition result.
In a specific application scenario, the extraction module 31 may be specifically configured to: represent the number of channels, the number of skeleton points and the number of video frames of a sample in the sample set by the symbols C, V and T respectively, where the initial value of the number of channels C is 3, corresponding to the abscissa, the ordinate and the coordinate confidence of the skeleton points; create an array representing the physical structure connections of the human body according to a predefined skeleton point number index, where each element of the array consists of two skeleton points having a connection relationship, and the bone connection information is determined from the skeleton points having a connection relationship; represent the human skeleton points in the sample as an undirected spatial human skeleton graph and a temporal human skeleton graph G(K, E), where K represents the set of skeleton points of the t-th frame image, K = {k_ti | t = 1, 2, …, T; i = 1, 2, …, V}, and E represents the set of edges connecting the skeleton points and has two subsets E_S and E_T, where E_S is the set of edges with connection relationships among the skeleton points of the t-th frame, representing the skeleton point connections in all video frames of a single sample, and E_T is the set of edges between the same skeleton point in frame t and frame t+1, representing the trajectory of a skeleton point over time; and divide the skeleton points in the spatial human skeleton graph into 3 skeleton point sets representing the physical structure of the human body to obtain the skeleton point information.
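As an illustration of the predefined bone-pair array and the undirected spatial graph built from it, the following sketch builds a symmetric adjacency matrix with self-connections; the pair indices and the joint count of 25 are assumptions for illustration, not the data set's actual numbering.

```python
import numpy as np

V = 25   # number of skeleton points per frame (NTU-RGB+D style; assumed here)

# Hypothetical fragment of the predefined index array: each element is a pair of
# connected skeleton points, from which the bone connection information is derived.
BONE_PAIRS = [(0, 1), (1, 20), (20, 2), (2, 3), (20, 4), (4, 5)]

def build_adjacency(pairs, num_nodes=V):
    # Undirected spatial graph: A[i, j] = A[j, i] = 1 for every connected pair,
    # plus self-connections on the diagonal.
    A = np.eye(num_nodes, dtype=np.float32)
    for i, j in pairs:
        A[i, j] = A[j, i] = 1.0
    return A
```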
Correspondingly, in order to divide the skeleton points in the spatial human skeleton graph into 3 skeleton point sets representing the physical structure of the human body and obtain the skeleton point information, the extraction module 31 may be specifically configured to: calculate the barycentric coordinates of the human body from the coordinates of the skeleton points in the sample set; and, according to the barycentric coordinates, divide the skeleton points in the spatial human skeleton graph into a first skeleton point set consisting of the skeleton point itself, a second skeleton point set of points that have a connection relationship with the skeleton point and whose distance from the barycentre is less than or equal to a preset distance threshold, and a third skeleton point set of points that have a connection relationship with the skeleton point and whose distance from the barycentre is greater than the preset distance threshold.
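A minimal sketch of this barycentre-based partition, assuming the threshold is given and single-frame coordinates are available, could look as follows; the function and variable names are illustrative.

```python
import numpy as np

def partition_neighbours(coords, pairs, threshold):
    # coords: (V, 2) array of skeleton-point coordinates for one frame.
    # Returns, for each skeleton point, the three subsets described above.
    centroid = coords.mean(axis=0)                      # barycentre of the human body
    dist = np.linalg.norm(coords - centroid, axis=1)    # distance of every point to the barycentre
    neighbours = {i: [] for i in range(len(coords))}
    for i, j in pairs:
        neighbours[i].append(j)
        neighbours[j].append(i)
    subsets = {}
    for i in range(len(coords)):
        root = [i]                                                  # first subset: the point itself
        near = [j for j in neighbours[i] if dist[j] <= threshold]   # second subset
        far = [j for j in neighbours[i] if dist[j] > threshold]     # third subset
        subsets[i] = (root, near, far)
    return subsets
```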
In a specific application scenario, in order to enlarge the receptive fields of the spatial domain and the time domain by improving the spatial map convolution network structure and the time convolution network structure in the motion recognition model, so that more information can be extracted, and the training precision of the motion recognition of the model is improved, as shown in fig. 5, the apparatus further includes: a first building module 34, a second building module 35, a third building module 36, a generation module 37;
the first construction module 34 is used for constructing a graph convolution neural network layer capable of extracting sample space features, and improving the standard two-dimensional convolution into graph convolution;
the second construction module 35 is configured to construct a time convolution neural network layer capable of extracting a sample time feature, and improve the standard two-dimensional convolution into a time convolution;
a third constructing module 36, configured to construct a motion recognition neural network layer, and embed the graph convolution neural network layer and the time convolution neural network layer therein;
and the generating module 37 is operable to generate a 9-layer motion recognition model by using the motion recognition neural network layer.
Correspondingly, the second construction module 35 is specifically configured to: replace the 3 parameters required by a standard two-dimensional convolution, namely the number of channels, the image width and the image height, with the parameters C, T and V respectively; input the features extracted by the graph convolution neural network layer into 4 parallel 1×1 first convolution layers to adjust the dimension of the feature map, so that the number of output channels of each branch is 1/8 of the number of final output channels of the time convolution neural network layer; input the output features of the first convolution layers into 4 3×1 dilated (atrous) convolution layers with dilation rates of 1, 2, 3 and 4 respectively, so that time features of different scales are extracted by dilated convolutions with different receptive fields, where the numbers of input and output channels before and after each dilated convolution layer are the same, namely 1/8 of the number of final output channels of the time convolution neural network layer; splice the 4 groups of time features pairwise so that the number of output channels becomes 1/2 of the number of final output channels of the time convolution neural network layer; input the spliced result into a 1×1 second convolution layer to raise the dimension of the feature map so that the number of output channels equals the number of final output channels of the time convolution neural network layer; input the output of the second convolution layer into an SENet layer to improve the channel attention of the time convolution neural network layer; and arrange a 1×1 third convolution layer with a stride of 2 between the input and the output of the time convolution neural network layer, the third convolution layer being used to stabilize training.
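A PyTorch-style sketch of such a time convolution layer is shown below; the SE reduction ratio, the way the stride is shared between the dilated branches and the residual path, and other bookkeeping details are assumptions for illustration rather than the verbatim structure of this application.

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-excitation channel attention (reduction ratio 16 is an assumed value)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                      # x: (N, C, T, V)
        w = x.mean(dim=(2, 3))                 # squeeze over time and skeleton points
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                           # re-weight the channels

class MultiScaleTemporalConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        branch = out_channels // 8             # each branch carries 1/8 of the final channels
        self.branches = nn.ModuleList()
        for d in (1, 2, 3, 4):                 # four dilation rates -> four temporal receptive fields
            self.branches.append(nn.Sequential(
                nn.Conv2d(in_channels, branch, kernel_size=1),            # first 1x1 convolution
                nn.Conv2d(branch, branch, kernel_size=(3, 1),
                          stride=(stride, 1), padding=(d, 0), dilation=(d, 1))))
        self.fuse = nn.Conv2d(4 * branch, out_channels, kernel_size=1)    # second 1x1 convolution
        self.se = SELayer(out_channels)
        self.residual = nn.Conv2d(in_channels, out_channels,
                                  kernel_size=1, stride=(stride, 1))      # third 1x1 convolution

    def forward(self, x):                      # x: (N, C, T, V) from the graph convolution layer
        y = torch.cat([b(x) for b in self.branches], dim=1)  # splice the four groups of time features
        y = self.se(self.fuse(y))
        return y + self.residual(x)            # residual path for stable training
```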
In a specific application scenario, the training module 32 may be configured to fuse the skeleton point information and the skeleton connection information by using a weighted average method to obtain a fusion characteristic; inputting the fusion characteristics into a full-link layer and a Softmax layer in the action recognition model in sequence to obtain action category prediction results, wherein the fusion characteristics carry action category marking results; and if the accuracy of the category prediction result is determined to be greater than the preset threshold value according to the action category labeling result, judging that the training of the action recognition model is finished.
Correspondingly, the obtaining module 33 is specifically configured to extract a target fusion feature formed by fusing each frame of skeleton point information and skeleton information in the target sample if it is determined that the motion recognition model training is completed; inputting the target fusion characteristics into the trained action recognition model, and acquiring evaluation values corresponding to all preset action categories; and determining the preset action category with the highest corresponding evaluation score as the action recognition result of each frame in the target sample.
It should be noted that other corresponding descriptions of the functional units related to the graph convolution action recognition apparatus based on 2S-AGCN provided in this embodiment may refer to the corresponding descriptions in fig. 1 to fig. 2, and are not repeated herein.
Based on the method shown in fig. 1 and fig. 2, correspondingly, an embodiment of the present application further provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the method for identifying a 2S-AGCN-based graph volume action as shown in fig. 1 and fig. 2.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method of the embodiments of the present application.
Based on the method shown in fig. 1 and fig. 2 and the virtual device embodiment shown in fig. 4 and fig. 5, in order to achieve the above object, an embodiment of the present application further provides a computer device, which may specifically be a personal computer, a server, a network device, and the like, where the entity device includes a storage medium and a processor; a storage medium for storing a computer program; a processor for executing a computer program to implement the graph convolution action recognition method based on 2S-AGCN as shown in FIG. 1 and FIG. 2.
Optionally, the computer device may also include a user interface, a network interface, a camera, Radio Frequency (RF) circuitry, sensors, audio circuitry, a WI-FI module, and so forth. The user interface may include a Display screen (Display), an input unit such as a keypad (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a bluetooth interface, WI-FI interface), etc.
It will be understood by those skilled in the art that the computer device structure provided in the present embodiment is not limited to the physical device, and may include more or less components, or combine some components, or arrange different components.
The nonvolatile readable storage medium can also include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the entity device for 2S-AGCN-based graph convolution action recognition and supports the running of the information processing program and other software and/or programs. The network communication module is used to realize communication among the components inside the nonvolatile readable storage medium as well as communication with other hardware and software in the entity device.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, or by hardware. Applying this technical scheme, compared with the prior art, the skeleton points can be divided into 3 subsets according to the number of channels of the skeleton point samples, the number of skeleton points, the number of video frames and the predefined physical connection structure of the human body; a graph of disconnected skeleton points is then added when constructing the graph convolution layer, and the constructed time convolution layer extracts more time features by using 4 convolution kernels with different dilation rates, so that the problem of limited feature expression ability can be effectively alleviated and the accuracy of model training improved. In this technical scheme, in addition to the bone connection information, information transmission between disconnected skeleton points is also considered, and the spatial graph convolution and time convolution structures in the action recognition model are improved, which enlarges the receptive fields of the spatial and temporal domains, allows more information to be extracted, improves the training precision of action recognition, and gives the action recognition model stronger feature expression ability.
Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The above application serial numbers are for description purposes only and do not represent the superiority or inferiority of the implementation scenarios. The above disclosure is only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present application.

Claims (10)

1. A graph convolution action recognition method based on 2S-AGCN is characterized by comprising the following steps:
constructing a physical connection structure of a human body corresponding to each frame of skeleton point in a sample set, and extracting skeleton point information and skeleton connection information from the physical connection structure;
training a motion recognition model by using the fusion characteristics of the skeleton point information and the skeleton connection information, wherein the motion recognition model is formed by alternately combining a graph convolution neural network and a time convolution network, the graph convolution neural network is used for extracting spatial characteristics, and the time convolution network is used for extracting time characteristics;
and if the action recognition model is judged to be trained completely, inputting the physical connection structure information of each frame in the target sample into the action recognition model, and obtaining an action recognition result.
2. The method according to claim 1, wherein constructing a physical connection structure of a human body corresponding to each frame of skeleton points in a sample set, and extracting skeleton point information and skeleton connection information from the physical connection structure comprises:
respectively representing the number of channels, the number of skeleton points and the number of video frames of a sample in a sample set by using symbols C, V, T, wherein the initial value of the number of the channels C is 3, and the initial value is respectively the horizontal coordinate, the vertical coordinate and the confidence coefficient of the coordinate of the skeleton points;
creating an array for representing human body physical structure connection according to predefined skeleton point numerical indexes, wherein elements in the array consist of two skeleton points with connection relations, and skeleton connection information is determined by using the skeleton points with connection relations;
representing the human skeleton points in the sample as an undirected space human skeleton map and a time human skeleton map G(K, E), wherein K represents the skeleton point set of the t-th frame image, K = {k_ti | t = 1, 2, …, T; i = 1, 2, …, V}; E represents the set of edges connecting the skeleton points and has two subsets E_S and E_T, wherein E_S is the set of edges with connection relations among the skeleton points of the t-th frame, representing the skeleton point connections in all the video frames of a single sample, and E_T is the set of edges between the same skeleton point in the t-th frame and the (t+1)-th frame, representing the change track of a skeleton point over time;
and dividing the skeleton points in the space human body skeleton diagram into 3 skeleton point sets representing the physical structure of the human body to obtain skeleton point information.
3. The method according to claim 2, wherein the dividing the skeleton points in the human body space skeleton map into 3 skeleton point sets representing the physical structure of the human body, and obtaining the skeleton point information comprises:
calculating the barycentric coordinates of the human body according to the coordinates of the skeleton points in the sample set;
and according to the barycentric coordinate of the human body, dividing skeleton points in a human body space skeleton map into a first skeleton point set constructed by skeleton points, a second skeleton point set which has a connection relation with the skeleton points and is less than or equal to a preset distance threshold from the barycentric coordinate, and a third skeleton point set which has a connection relation with the skeleton points and is greater than the preset distance threshold from the barycentric coordinate.
4. The method of claim 1, further comprising:
constructing a graph convolution neural network layer capable of extracting sample space characteristics, and improving standard two-dimensional convolution into graph convolution;
constructing a time convolution neural network layer capable of extracting sample time characteristics, and improving the standard two-dimensional convolution into time convolution;
constructing a motion recognition neural network layer, and embedding the graph convolution neural network layer and the time convolution neural network layer into the motion recognition neural network layer;
and generating a 9-layer action recognition model by utilizing the action recognition neural network layer.
5. The method of claim 4, wherein constructing a time convolution neural network layer capable of extracting sample time characteristics and improving the standard two-dimensional convolution into a time convolution comprises:
replacing the number of 3 parameter channels, the image width and the image height required by the standard two-dimensional convolution with parameters C, T and V respectively;
respectively inputting the features extracted by the graph convolution neural network layer into 4 first convolution layers of 1 multiplied by 1 so as to improve the dimension of the feature graph and ensure that the number of output channels of the graph convolution neural network layer is 1/8 of the number of final output channels of the time convolution neural network layer;
respectively inputting the output characteristics of the first convolutional layers into 4 3×1 dilated convolutional layers with dilation rates of 1, 2, 3 and 4, extracting time characteristics of different scales by using dilated convolutions with different receptive fields, wherein the numbers of input channels and output channels before and after the dilated convolutional layers are the same, namely 1/8 of the number of final output channels of the time convolutional neural network layer;
splicing the 4 groups of time characteristics pairwise to enable the number of output channels to be 1/2 of the number of final output channels of the time convolution neural network layer;
inputting the two-by-two splicing results into a 1 x 1 second convolution layer to improve the dimension of the feature diagram, so that the number of output channels is the same as the number of final output channels of the time convolution neural network layer;
inputting the output result of the second convolutional layer into a SEnet layer to improve the channel attention of the time convolutional neural network layer;
and a 1 x 1 third convolutional layer with the step length of 2 is arranged between the input and the output of the time convolutional neural network layer, and the third convolutional layer is used for stable training.
6. The method of claim 5, wherein the training of the motion recognition model using the fused features of the skeletal point information and the skeletal connection information comprises:
fusing the skeleton point information and the skeleton connection information by using a weighted average method to obtain fusion characteristics;
inputting the fusion characteristics into a full-link layer and a Softmax layer in an action recognition model in sequence to obtain action category prediction results, wherein the fusion characteristics carry action category marking results;
and if the accuracy of the category prediction result is determined to be greater than a preset threshold value according to the action category labeling result, judging that the training of the action recognition model is finished.
7. The method according to claim 1, wherein if it is determined that the motion recognition model training is completed, inputting physical connection structure information of each frame in a target sample into the motion recognition model to obtain a motion recognition result, and the method includes:
if the action recognition model is judged to be trained completely, extracting target fusion characteristics formed by fusion of skeleton point information and skeleton information of each frame in a target sample;
inputting the target fusion characteristics into a trained action recognition model, and acquiring evaluation values corresponding to all preset action categories;
and determining the preset action category with the highest evaluation score as the action recognition result of each frame in the target sample.
8. A graph convolution action recognition device based on 2S-AGCN, characterized by comprising:
the extraction module is used for constructing a physical connection structure of a human body corresponding to each frame of skeleton point in a sample set and extracting skeleton point information and skeleton connection information from the physical connection structure;
the training module is used for training a motion recognition model by utilizing the fusion characteristics of the skeleton point information and the skeleton connection information, the motion recognition model is formed by utilizing a graph convolution neural network and a time convolution network which are alternately combined, the graph convolution neural network is used for extracting spatial characteristics, and the time convolution network is used for extracting time characteristics;
and the obtaining module is used for inputting the physical connection structure information of each frame in the target sample into the action recognition model to obtain an action recognition result if the action recognition model is judged to be trained completely.
9. A non-transitory readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the method for 2S-AGCN-based graph convolution action recognition according to any one of claims 1 to 7.
10. A computer device comprising a non-volatile readable storage medium, a processor, and a computer program stored on the non-volatile readable storage medium and executable on the processor, wherein the processor implements the method for 2S-AGCN based graph convolution action recognition according to any one of claims 1 to 7 when executing the program.
CN202110785748.7A 2021-07-12 2021-07-12 Graph convolution action recognition method, device and equipment based on 2S-AGCN Pending CN113642400A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110785748.7A CN113642400A (en) 2021-07-12 2021-07-12 Graph convolution action recognition method, device and equipment based on 2S-AGCN

Publications (1)

Publication Number Publication Date
CN113642400A true CN113642400A (en) 2021-11-12

Family

ID=78417079

Country Status (1)

Country Link
CN (1) CN113642400A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059620A (en) * 2019-04-17 2019-07-26 安徽艾睿思智能科技有限公司 Bone Activity recognition method based on space-time attention
CN110309732A (en) * 2019-06-13 2019-10-08 浙江大学 Activity recognition method based on skeleton video
CN112395945A (en) * 2020-10-19 2021-02-23 北京理工大学 Graph volume behavior identification method and device based on skeletal joint points
CN112543936A (en) * 2020-10-29 2021-03-23 香港应用科技研究院有限公司 Motion structure self-attention-seeking convolutional network for motion recognition

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DUN YANG et al.: "Interactive two-stream graph neural network for skeleton-based action recognition", Journal of Electronic Imaging, vol. 30, no. 3, 17 June 2021 (2021-06-17), pages 033025-5 *
LEI SHI et al.: "Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 20 June 2019 (2019-06-20), pages 12028-12033 *
ZIYU LIU et al.: "Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 31 December 2020 (2020-12-31), pages 147-148 *
ZHENG WANQIANG (郑顽强): "Behavior recognition based on fusion of human skeleton graph convolution and image convolution" (基于人体骨架图卷积和图像卷积融合的行为识别), China Masters' Theses Full-text Database, Information Science and Technology, no. 2020, 15 December 2020 (2020-12-15), pages 138-262 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241514A (en) * 2021-11-15 2022-03-25 北京爱笔科技有限公司 Model training method and device for extracting human skeleton features
CN114241514B (en) * 2021-11-15 2024-05-28 北京爱笔科技有限公司 Model training method and device for extracting human skeleton characteristics
CN114781441A (en) * 2022-04-06 2022-07-22 电子科技大学 EEG motor imagery classification method and multi-space convolution neural network model
CN114781441B (en) * 2022-04-06 2024-01-26 电子科技大学 EEG motor imagery classification method and multi-space convolution neural network model
CN115761905A (en) * 2023-01-09 2023-03-07 吉林大学 Diver action identification method based on skeleton joint points
CN115984787A (en) * 2023-03-20 2023-04-18 齐鲁云商数字科技股份有限公司 Intelligent vehicle-mounted real-time alarm method for industrial brain public transport

Similar Documents

Publication Publication Date Title
CN113642400A (en) Graph convolution action recognition method, device and equipment based on 2S-AGCN
CN110362677B (en) Text data category identification method and device, storage medium and computer equipment
CN109754078A (en) Method for optimization neural network
CA3066029A1 (en) Image feature acquisition
US11816149B2 (en) Electronic device and control method thereof
CN109840531A (en) The method and apparatus of training multi-tag disaggregated model
CN112418292B (en) Image quality evaluation method, device, computer equipment and storage medium
CN110046634B (en) Interpretation method and device of clustering result
JP2007128195A (en) Image processing system
CN112308115B (en) Multi-label image deep learning classification method and equipment
CN111898703B (en) Multi-label video classification method, model training method, device and medium
CN112200041B (en) Video motion recognition method and device, storage medium and electronic equipment
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
CN113657087B (en) Information matching method and device
CN111373418A (en) Learning apparatus and learning method, recognition apparatus and recognition method, program, and recording medium
CN115035418A (en) Remote sensing image semantic segmentation method and system based on improved deep LabV3+ network
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
CN112749576B (en) Image recognition method and device, computing equipment and computer storage medium
CN114037056A (en) Method and device for generating neural network, computer equipment and storage medium
CN113743594A (en) Network flow prediction model establishing method and device, electronic equipment and storage medium
CN111709473A (en) Object feature clustering method and device
Firouznia et al. Adaptive chaotic sampling particle filter to handle occlusion and fast motion in visual object tracking
US20220366242A1 (en) Information processing apparatus, information processing method, and storage medium
CN116434010A (en) Multi-view pedestrian attribute identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination