CN115862150B - Diver action recognition method based on three-dimensional human body skin - Google Patents
- Publication number: CN115862150B (grant of application CN115862150A)
- Application number: CN202310015851.2A
- Authority: CN (China)
- Prior art date: 2023-01-06
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- Image Processing (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a diver action recognition method based on a three-dimensional human body skin, in the technical field of computer vision. Human body shape, posture and vertex data are extracted from diver videos by a three-dimensional human body shape and posture estimation method; the shape, posture and vertex data are passed through a data fusion module to obtain high-level semantic information; action recognition is performed on the high-level semantic information by a TCA-GCN module and, in parallel, by an STGCN module; and the recognition results of the two modules are linearly fused. By this technical scheme, three-dimensional posture and action estimation of the diver is realized, and the accuracy of action recognition is improved.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a diver action recognition method based on three-dimensional human body skin.
Background
Action recognition is the basis for a computer to understand human behavior; it plays an important role in fields such as human-computer interaction and video understanding, and has become a hot topic in computer vision. Because of the particular working environment of divers, they cannot communicate and express themselves in spoken language; however, since human limbs naturally carry rich semantic information, a diver working underwater can express special meanings through certain actions. For example, emergencies such as physical exhaustion, hypoxia and leg cramp can be expressed by different gestures. In such a scenario, how to recognize the diver's actions accurately and efficiently has become an important research direction.
Most existing diver action recognition methods are based on human skeleton points. However, because skeleton data lacks human surface information, it is abstract and low in semantics: it can only represent the motion characteristics of the body and cannot embody more specific, higher-level information such as shape features and vertex features, so it cannot represent human actions accurately. In order to exploit more specific, higher-level semantic information, the present application proposes a diver action recognition method based on a three-dimensional human skin. Since the human body structure can naturally be represented as a graph, many current methods are based on graph convolution. Graph convolution can more accurately capture the relations between different key points of the human body and obtain better spatial-dimension representations, and thus more accurate action recognition results. Because each diver action is sequence data, many current methods also model the relations between action sequences by means of LSTM, temporal convolution and the like, which extract better temporal-dimension information and achieve better performance. At present, SMPL is the mainstream three-dimensional human skin representation; it represents the shape and the posture of the human body by the two parameter sets β and θ, respectively. At the same time, given β and θ, SMPL can produce the vertex parameters v of the human mesh, which in turn provide more semantic information for the action recognition task. The three-dimensional human skin information thus represents the posture, shape and vertices of the human body; higher-level semantic information can be obtained through data fusion, and finally a more accurate diver action recognition result is obtained with a graph-convolution deep learning method.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention realizes diver action recognition by utilizing three-dimensional human skin information, and achieves a more accurate action recognition effect by using higher-level semantic information.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention provides a diver action recognition method based on a three-dimensional human skin, with the following technical scheme:
a diver action recognition method based on three-dimensional human skin, the method comprising the steps of:
step 1: extracting the human body shape, posture and vertex information of a diver video frame by a three-dimensional human body posture estimation method;
step 2: the human body shape, posture and vertex data are subjected to data fusion to obtain high-level semantic information;
step 3: performing action recognition by using the high-level semantic information through a TCA-GCN module;
step 4: performing action recognition by using the high-level semantic information through the STGCN module;
step 5: carrying out linear fusion on the recognition results of step 3 and step 4 to identify the actions of the diver.
Preferably, the step 2 specifically includes:
Downsampling the vertex information; at the same time, respectively passing the downsampled vertex information and the shape information through a convolution module in the feature extraction network to obtain coding information, and splicing the coding information onto the posture information to obtain the high-level semantic information.
Preferably, the step 3 specifically includes:
the TCA-GCN module comprises a TCA module and a TF module, wherein the TCA module mainly considers and combines space-time dimension characteristics of high-level semantic information, then the TF module fuses results of time modeling convolution with an attention method, and finally the extracted space-time information characteristics are subjected to a full-connection layer and a Softmax layer to obtain estimated action categories.
Preferably, the TCA module comprises time aggregation, topology generation and channel-dimension aggregation, the output F_out of the TCA module being represented by the following formulas:

A_out = TA(W, X) = (W_1, X_1) ‖ … ‖ (W_T, X_T)
S = μ(A_k) + α·Q
F_out = CA(A_out, S)

wherein CA denotes channel-dimension aggregation, ‖ denotes the splicing operation, A_out is the structure of the diver joint features after time aggregation, S denotes the result of topology generation on the features, F_out is the aggregation of the joint features in the channel dimension, TA is the time aggregation module, W is the time weight feature, X is the joint feature, W_1 and X_1 are the time weight feature and the joint feature at time step 1, W_T and X_T are those at time step T, μ is the normalization and dimension transformation operation on the third-order adjacency matrix, A_k is the adjacency matrix of the k-th channel, α is the trainable parameter of joint connection strength, and Q is the channel correlation matrix.
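By way of non-limiting illustration only, one plausible reading of these formulas is sketched below in PyTorch. The tensor layout (B, T, V, C), the parameter initialisations and the sizes are assumptions; only the three operations TA, μ(A_k) + α·Q and CA follow the formulas above.

```python
import torch
import torch.nn as nn

class TCABlock(nn.Module):
    """Sketch of A_out = TA(W, X), S = mu(A_k) + alpha*Q, F_out = CA(A_out, S)."""
    def __init__(self, channels=7, n_joints=24, n_frames=30):
        super().__init__()
        self.time_w = nn.Parameter(torch.ones(n_frames, 1, 1))               # time weights W_1..W_T
        self.adj = nn.Parameter(torch.eye(n_joints).repeat(channels, 1, 1))  # per-channel adjacency A_k
        self.alpha = nn.Parameter(torch.zeros(1))                            # joint-connection strength
        self.q = nn.Parameter(torch.eye(n_joints))                           # channel correlation Q

    def forward(self, x):
        # x: (B, T, V, C) high-level semantic features
        a_out = torch.cat([(self.time_w[t] * x[:, t]).unsqueeze(1)           # TA: weight each time step,
                           for t in range(x.shape[1])], dim=1)               # then splice along time
        s = self.adj / self.adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)    # mu: normalisation
        s = s + self.alpha * self.q                                          # S = mu(A_k) + alpha*Q
        return torch.einsum('kvw,btwk->btvk', s, a_out)                      # CA over the topology

out = TCABlock()(torch.randn(2, 30, 24, 7))
print(out.shape)  # torch.Size([2, 30, 24, 7])
```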
The extracted features then pass to the TF module, represented by Z_out = sk(MSCONV(F_out)), where MSCONV is a multi-convolution function; the final TCA-GCN is generated by further combining temporal modeling. The obtained spatio-temporal feature information is passed through a fully connected layer and Softmax to judge the action category, L1 loss is used as the loss function, and the real action category labels (Ground Truth) are used for supervised learning.
Preferably, the step 4 specifically includes:
the STGCN module comprises a graph convolution module and a time convolution module, local features of adjacent points in the space are learned through graph convolution, and time sequence information in the sequence data is learned through time convolution; and the extracted space-time information features are subjected to a full connection layer and a Softmax layer to obtain estimated action categories.
Preferably, the step 5 specifically includes:
Fusing the results of step 3 and step 4 and taking the fusion as the output, the output result being expressed by the following formula:

score = γ·score_st + (1 − γ)·score_tca

wherein score_st is the action recognition result of the STGCN module, γ is the weight of that result, score_tca denotes the recognition result of the TCA-GCN module, and score is the final weighted output result.
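By way of non-limiting illustration, the fusion itself is a one-line operation; the value γ = 0.5 and the 10 action classes below are assumed examples, since the patent leaves the weight free.

```python
import torch

def linear_fusion(score_st: torch.Tensor, score_tca: torch.Tensor, gamma: float = 0.5) -> torch.Tensor:
    """score = gamma * score_st + (1 - gamma) * score_tca."""
    return gamma * score_st + (1 - gamma) * score_tca

# e.g. two Softmax outputs over an assumed set of 10 diver action classes
pred = linear_fusion(torch.softmax(torch.randn(10), 0), torch.softmax(torch.randn(10), 0))
print(int(pred.argmax()))  # index of the recognised action
```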
A diver action recognition system based on three-dimensional human skin, the system comprising:
the data extraction module is used for extracting the human body shape, posture and vertex information of the video frame of the diver through a three-dimensional human body posture estimation method;
the data fusion module is used for: the human body shape, gesture and vertex data are subjected to data fusion to obtain high-level semantic information;
the TCA-GCN motion estimation module: performing action recognition by using the high-level semantic information through a TCA-GCN module;
STGCN action estimation module: performing action recognition by using the high-level semantic information through the STGCN module;
and the linear fusion module is used for carrying out linear fusion on the identification results of the TCA-GCN module and the STGCN module and identifying the actions of the diver.
A computer readable storage medium having stored thereon a computer program for execution by a processor for implementing a diver action recognition method based on a three-dimensional human skin.
A computer device comprising a memory storing a computer program and a processor implementing a diver action recognition method based on a three-dimensional human skin when executing the computer program.
The invention has the following beneficial effects:
compared with the prior art, the invention has the advantages that:
the invention extracts the shape, posture and vertex data of the human body from the video of the diver by a three-dimensional human body shape and posture estimation method; the human body shape, gesture and vertex data are subjected to a data fusion module to obtain high-level semantic information; performing action recognition by using the high-level semantic information through a TCA-GCN module; performing action recognition by using the high-level semantic information through the STGCN module; and linearly fusing the identification results of the two modules. By the technical scheme, the three-dimensional gesture motion estimation of the diver is realized, and the accuracy of motion recognition is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a diver action recognition method based on three-dimensional human skin data;
fig. 2 is a block diagram of a diver action recognition method based on three-dimensional human skin data.
Detailed Description
The following description of the embodiments of the present invention will be made apparent and fully in view of the accompanying drawings, in which some, but not all embodiments of the invention are shown. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The present invention will be described in detail with reference to specific examples.
First embodiment:
according to the specific optimization technical scheme adopted by the invention for solving the technical problems, as shown in the figures 1 to 2, the technical scheme is as follows: the invention relates to a diver action recognition method based on three-dimensional human body skin.
A diver action recognition method based on three-dimensional human skin, the method comprising the steps of:
step 1: extracting the human body shape, posture and vertex information of a diver video frame by a three-dimensional human body posture estimation method;
step 2: the human body shape, posture and vertex data are subjected to data fusion to obtain high-level semantic information;
step 3: performing action recognition by using the high-level semantic information through a TCA-GCN module;
step 4: performing action recognition by using the high-level semantic information through the STGCN module;
step 5: carrying out linear fusion on the recognition results of step 3 and step 4 to identify the actions of the diver.
Specific embodiment II:
the second embodiment of the present application differs from the first embodiment only in that:
the step 2 specifically comprises the following steps:
and downsampling the vertex information, simultaneously, respectively passing the downsampled vertex information and the shape information through a convolution module in the feature extraction network to obtain coding information, and splicing the coding information to the gesture information to obtain high-level semantic information.
Third embodiment:
the difference between the third embodiment and the second embodiment of the present application is only that:
the TCA-GCN module comprises a TCA module and a TF module, wherein the TCA module mainly considers and combines space-time dimension characteristics of high-level semantic information, then the TF module fuses results of time modeling convolution with an attention method, and finally the extracted space-time information characteristics are subjected to a full-connection layer and a Softmax layer to obtain estimated action categories.
Fourth embodiment:
the fourth embodiment of the present application differs from the third embodiment only in that:
the TCA module comprises time aggregation, topology generation and two-part channel dimension aggregation, wherein the TCA module is represented by the following formula:
wherein,,expressed as channel dimension aggregate +.>Represented as a stitching operation->Structure after time aggregation for diver joint characteristics, +.>Representing the result of a topology generation process of a feature, +.>For aggregation of joint features in the channel dimension, +.>For convolution result of joint number 1 in time dimension,/->As a result of topology processing of joint No. 1,structure after temporal aggregation for node 1 feature, +.>Topology generation processing result for joint point feature No. 1,/->For the time aggregation module, +.>For time weight feature, ++>For joint characteristics, < >>Time weight feature of the node No. 1, < ->For joint feature number 1>Time weight feature for the node of the No. T, < ->Is the characteristic of the articulation point of the T-th articulation point, < ->Normalization and dimension transformation operations for third-order adjacency matrix, < >>For the adjacency matrix of the kth channel, +.>Trainable parameters for joint strength, +.>Is a channel correlation matrix.
Fifth embodiment:
the fifth embodiment differs from the fourth embodiment only in that:
the TF module is represented by the following formula:
and finally, generating a final TCA-GCN by combining temporal modeling for the multi-convolution function, judging the action type of the obtained time-space characteristic information through a full connection layer and Softmax, using L1 loss as a loss function, and using a real action type label group Truth for supervised learning.
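By way of non-limiting illustration only, a PyTorch sketch of this formula follows: MSCONV is rendered as parallel temporal convolutions of several kernel sizes, and sk as a softmax selection over the branches. The text does not define the exact sk operator or the kernel sizes, so both are assumptions here.

```python
import torch
import torch.nn as nn

class TFBlock(nn.Module):
    """Sketch of Z_out = sk(MSCONV(F_out)) with an assumed branch-selection 'sk'."""
    def __init__(self, channels=7, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0)) for k in kernel_sizes
        )
        self.select = nn.Linear(channels, len(kernel_sizes))   # branch attention from pooled features

    def forward(self, f_out):
        # f_out: (B, C, T, V) spatio-temporal features from the TCA module
        feats = torch.stack([b(f_out) for b in self.branches], dim=1)  # (B, K, C, T, V)
        gap = f_out.mean(dim=(2, 3))                                   # global average pool, (B, C)
        attn = torch.softmax(self.select(gap), dim=1)                  # (B, K) branch weights
        return (attn[:, :, None, None, None] * feats).sum(dim=1)      # fused output, (B, C, T, V)

z = TFBlock()(torch.randn(2, 7, 30, 24))
print(z.shape)  # torch.Size([2, 7, 30, 24])
```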
Specific embodiment six:
the difference between the sixth embodiment and the fifth embodiment of the present application is only that:
the STGCN module comprises a graph convolution module and a time convolution module, local features of adjacent points in the space are learned through graph convolution, and time sequence information in the sequence data is learned through time convolution; and the extracted space-time information features are subjected to a full connection layer and a Softmax layer to obtain estimated action categories.
Specific embodiment seven:
the seventh embodiment of the present application differs from the sixth embodiment only in that:
fusing the results of the step 3 and the step 4, and expressing the output result as output by the following formula:
wherein,,is the result of the action recognition of the STGCN module, < >>Weight of the result, +.>And the recognition result of the TCA-GCN module is represented, and score is the final output result after weighting.
Specific embodiment eight:
the eighth embodiment of the present application differs from the seventh embodiment only in that:
the invention provides a diver action recognition system based on three-dimensional human skin, which comprises:
the data extraction module is used for extracting the human body shape, posture and vertex information of the video frame of the diver through a three-dimensional human body posture estimation method;
the data fusion module is used for: the human body shape, gesture and vertex data are subjected to data fusion to obtain high-level semantic information;
the TCA-GCN motion estimation module: performing action recognition by using the high-level semantic information through a TCA-GCN module;
STGCN action estimation module: performing action recognition by using the high-level semantic information through the STGCN module;
and the linear fusion module is used for carrying out linear fusion on the identification results of the TCA-GCN module and the STGCN module and identifying the actions of the diver.
Specific embodiment nine:
embodiment nine of the present application differs from embodiment eight only in that:
the present invention provides a computer-readable storage medium having stored thereon a computer program for execution by a processor for implementing, for example, a method for diver action recognition based on three-dimensional human skin.
The system comprises: a data extraction module, a data fusion module, an action estimation module and a fusion module.
The data extraction module extracts the human body shape, posture and vertex information of the diver video frame by using a three-dimensional human body posture estimation method.
The data fusion module extracts high-level semantic information by using the shape, the gesture and the vertex information of the human body.
The action estimation module performs action recognition by using high-level semantic information through the TCA-GCN module and the STGCN module respectively.
The fusion module is used for fusing the results in the motion estimation module to obtain more accurate diver motion recognition results.
The constructed system specifically comprises a feature extraction network, an STGCN network and a TCA-GCN network.
Step 21: the vertex information is downsampled; at the same time, the downsampled vertex information and the shape information are respectively passed through a convolution module in the feature extraction network to obtain coding information. The coding information is spliced onto the posture information to obtain the high-level semantic information.
Step 22: the STGCN comprises a graph convolution module and a temporal convolution module. Through graph convolution, local features of spatially adjacent points are learned; through temporal convolution, temporal information in the sequence data is learned. Finally, the extracted spatio-temporal features are passed through a fully connected layer and a Softmax layer to obtain the estimated action category.
Step 23: the TCA-GCN mainly consists of a TCA module and a TF module. The TCA module considers and combines the spatio-temporal dimension characteristics of the high-level semantic information, the TF module fuses the temporal modeling convolution result with an attention method, and finally the extracted spatio-temporal features are passed through a fully connected layer and a Softmax layer to obtain the estimated action category.
Step 24: according to the results of step 22 and step 23, a more accurate diver action recognition result is output by linear weighting.
The calculation process is represented by the following formulas:

(β_i, θ_i, v_i) = F(I_i)
H = D(β, θ, v)
score_st = S(H), score_tca = T(H)
score = L(score_st, score_tca) = γ·score_st + (1 − γ)·score_tca

wherein I_i denotes the i-th frame image extracted from the video; F denotes the human body posture and shape estimation method; β_i, θ_i and v_i denote the shape, posture and vertex information of the i-th frame respectively; D denotes the data fusion module, which yields the high-level semantic information H; S and T denote the STGCN module and the TCA-GCN module respectively, which applied to H give the two action recognition results score_st and score_tca; L denotes the linear fusion of the recognition results, with γ the result weight of the STGCN, finally giving the more accurate recognition result score.
Specific embodiment ten:
the tenth embodiment differs from the ninth embodiment only in that:
the invention provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes a diver action recognition method based on three-dimensional human skin when executing the computer program.
The method comprises the following steps:
Step 1, because a high-quality three-dimensional human skin improves the diver action recognition task, the method uses the ROMP network, which currently gives a good three-dimensional human posture estimation effect, to obtain the shape, posture and vertex parameters of the human body.
Step 2, fusing the shape, posture and vertex parameters of the human body using a data fusion module to obtain higher-level semantic information. Specifically, the vertex information is downsampled; the downsampled result and the shape parameters are then respectively passed through a convolution network to obtain the encoded information of the vertices and the shape; finally, the encoded information is spliced onto the posture parameters to obtain the higher-level semantic information.
Step 3, constructing a space-time graph based on the skin key points, as sketched below. The SMPL is represented by 24 skin key points, so a space-time graph can be constructed: a graph G = (V, E) is built on a key-point sequence containing N key points and T frames, with both intra-frame and inter-frame connections, i.e. a space-time graph.
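By way of non-limiting illustration, a small NumPy sketch of this construction follows. The bone list (the parent-child pairs of the SMPL kinematic tree) is passed in rather than reproduced, and the toy sizes in the usage lines are assumptions.

```python
import numpy as np

def build_st_graph(bones, n_joints=24, n_frames=30):
    """Space-time graph: intra-frame skeleton edges plus same-joint edges between consecutive frames."""
    edges = []
    for t in range(n_frames):
        base = t * n_joints
        edges += [(base + i, base + j) for i, j in bones]       # spatial (intra-frame) edges
        if t + 1 < n_frames:                                    # temporal (inter-frame) edges
            edges += [(base + i, base + n_joints + i) for i in range(n_joints)]
    adj = np.zeros((n_joints * n_frames, n_joints * n_frames))
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1
    return adj

A = build_st_graph([(0, 1), (0, 2), (1, 3)], n_joints=4, n_frames=2)  # toy 4-joint skeleton
print(A.shape)  # (8, 8)
```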
Step 4, obtaining rich spatial information using the GCN. The formula of the GCN is shown below:

f_out = Λ^(−1/2) (A + I) Λ^(−1/2) f_in W

wherein A is the connection relation matrix between the key points, Λ is the degree matrix used for normalization processing, and W is the weight matrix. In order to match the three-dimensional actions of the diver's underwater work, the key points are divided into root nodes, centrifugal points and centripetal points, imitating the movement trend of the actions. The adjacency matrix thereby becomes three-dimensional, and the specific formulation is as follows:

f_out = Σ_k Λ_k^(−1/2) A_k Λ_k^(−1/2) f_in W_k ⊙ M_k

wherein the partitions A_k for the different dimensions k ∈ {1, 2, 3} represent the root node, the centrifugal point and the centripetal point of the diver's movement respectively, and M_k is a trainable quantity that accommodates the different importance of each key point over different time periods.
Step 5, performing diver action recognition with the graph-convolution deep learning network STGCN. The dimension of the high-level semantic information obtained in step 2 is (S, 24, 7), where S represents the length of the action sequence, 24 represents the 24 human skin key points, and 7 represents the feature dimension of each key point. A sampling function is used to specify the range of neighboring nodes involved when performing the graph convolution operation on each node. Through graph convolution, local features of spatially adjacent points are learned; through temporal convolution, temporal information in the sequence data is learned. The obtained spatio-temporal feature information is passed through a fully connected layer and Softmax to judge the action category, L1 loss is used as the loss function, and the real action category labels (Ground Truth) are used for supervised learning.
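By way of non-limiting illustration, one ST-GCN stage plus the classification head might look as follows. The channel width (64), the number of action classes (10) and the temporal kernel size (9) are assumptions; only the order graph convolution → temporal convolution → fully connected layer → Softmax follows the text, and the sketch shows the forward pass only (the text trains it with L1 loss against the Ground Truth labels).

```python
import torch
import torch.nn as nn

class STGCNHead(nn.Module):
    """Sketch: spatial graph convolution, temporal convolution, then FC + Softmax."""
    def __init__(self, gcn, channels=64, n_classes=10, t_kernel=9):
        super().__init__()
        self.gcn = gcn                                  # e.g. PartitionedGCN(7, channels, A)
        self.tcn = nn.Conv2d(channels, channels, (t_kernel, 1), padding=(t_kernel // 2, 0))
        self.fc = nn.Linear(channels, n_classes)

    def forward(self, x):                               # x: (B, S, 24, 7)
        h = self.gcn(x)                                 # spatial features, (B, S, 24, channels)
        h = self.tcn(h.permute(0, 3, 1, 2))             # temporal convolution over the sequence
        return torch.softmax(self.fc(h.mean(dim=(2, 3))), dim=-1)  # pooled, then class probabilities

# a plain linear layer stands in for the spatial graph convolution in this usage sketch
probs = STGCNHead(gcn=nn.Linear(7, 64))(torch.randn(2, 30, 24, 7))
print(probs.shape)  # torch.Size([2, 10])
```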
Step 6, performing diver action recognition with the graph-convolution deep learning network TCA-GCN. The time aggregation module learns time-dimension features, and the channel aggregation module is used to effectively combine the spatially dynamic channel-level topology features with the temporally dynamic topology features. The network mainly comprises a TCA module and a TF module, where the TCA module is divided into time aggregation, topology generation and channel-dimension aggregation, with the specific formulas:

A_out = TA(W, X) = (W_1, X_1) ‖ … ‖ (W_T, X_T)
S = μ(A_k) + α·Q
F_out = CA(A_out, S)

wherein CA denotes channel-dimension aggregation, ‖ denotes the splicing operation, A_out is the structure of the diver joint features after time aggregation, and S is the result of topology generation on the features. The TF module is denoted Z_out = sk(MSCONV(F_out)), where MSCONV is a multi-convolution function, and the final TCA-GCN is generated by further combining temporal modeling. The obtained spatio-temporal feature information is passed through a fully connected layer and Softmax to judge the action category, L1 loss is used as the loss function, and the real action category labels (Ground Truth) are used for supervised learning.
Step 7, improving the accuracy of action recognition by weighted linear fusion. Because the two modules in step 5 and step 6 consider different data features and feature extraction methods, their results are fused and used as output, according to the formula:

score = γ·score_st + (1 − γ)·score_tca

wherein score_st is the action recognition result of the STGCN module, γ is the weight of that result, score_tca is the recognition result of the TCA-GCN module, and score is the final weighted result.
Specific example eleven:
embodiment eleven of the present application differs from embodiment eleven only in that:
the embodiment provides a diver action recognition method based on three-dimensional human skin data, which comprises the following steps:
step 1, extracting the shape, posture and vertex information of a diver by using a three-dimensional human body posture estimation method;
Specifically, in order to improve the accuracy of diver action estimation, the application uses ROMP, a three-dimensional human body posture estimation network with good current performance. The network outputs β, θ and v, which represent the human body shape, posture and vertex information respectively, wherein β ∈ R^10, θ ∈ R^(24×3), and v ∈ R^(6890×3).
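For reference, a tiny snippet pinning down these tensor sizes (the standard SMPL dimensions):

```python
import torch

beta = torch.zeros(10)        # shape coefficients, beta in R^10
theta = torch.zeros(24, 3)    # per-joint axis-angle posture, theta in R^(24x3)
verts = torch.zeros(6890, 3)  # mesh vertices, v in R^(6890x3)
print(beta.numel(), theta.numel(), verts.numel())  # 10 72 20670 parameters per frame
```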
Step 2, obtaining high-level semantic information from the shape, posture and vertex information through a data fusion module;
Specifically, the vertex information v is downsampled to obtain the downsampled result v'; v' is passed through a convolution network, which changes only the channel information and no other dimension, to obtain the high-level vertex coding information f_v. At the same time, the shape parameters β are also passed through a convolution network to obtain the shape coding information f_β. Finally, f_v and f_β are spliced (embedded) onto the posture parameters θ to obtain the higher-level semantic information H.
Step 3, performing action recognition by using the high-level semantic information through the STGCN module;
Specifically, the three-dimensional human skin parameters are represented by 24 human skin key points, so a space-time graph can be constructed: a graph G = (V, E) is built on a key-point sequence containing N key points and T frames, with both intra-frame and inter-frame connections. The high-level semantic information dimension is (S, 24, 7), where S represents the action sequence length, 24 represents the 24 human skin key points, and 7 represents the feature dimension of each key point. A sampling function is used to specify the range of neighboring nodes involved when performing the graph convolution operation on each node. Through graph convolution, local features of spatially adjacent points are learned; through temporal convolution, temporal information in the sequence data is learned. The extracted feature information is passed through a fully connected layer and Softmax to judge the action category, L1 loss is used as the loss function, and the real action category labels (Ground Truth) are used for supervised learning.
Step 4, performing action recognition by using the high-level semantic information through the TCA-GCN module;
Specifically, this module mainly comprises two sub-modules, a TCA module and a TF module. The TCA module considers and combines the temporal and spatial dimension characteristics of the sequence: the skin sequence data generates sample time weights through the time aggregation module, and the channel aggregation module is then used to effectively combine the spatially dynamic channel-level topology features with the temporally dynamic topology features, generating the input of the TF module. The TF module fuses previous temporal modeling convolution methods with an attention method. After the two sub-modules, better feature information is obtained; finally, the feature information is passed through a fully connected layer and Softmax to judge the action category, L1 loss is used as the loss function, and the real action category labels (Ground Truth) are used for supervised learning.
Step 5, fusing the results of the two modules and outputting the final action category.
Specifically, the motion recognition accuracy is improved by using a weighted linear fusion method. Because the data features considered by the two modules in the step 5 and the step 6 are different, the results of the two modules are fused and used as output, and the formula is as follows:
wherein the method comprises the steps ofIs the result of the action recognition of the STGCN module, < >>Weight of the result, +.>The recognition result of the TCA-GCN module is shown, and score is the final result after weighting.
According to the technical scheme, more specific and higher-level feature information is used to represent the diver's actions, and the shape, posture and vertex parameters are obtained with a three-dimensional human body posture estimation method. By passing the three-dimensional human body information through the action recognition modules (the STGCN module and the TCA-GCN module) and the linear weighting module, diver action recognition can be completed, which provides convenience for divers' communication during underwater work.
In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "examples," "particular embodiments," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and the different embodiments or examples described in this specification, and the features thereof, may be combined by those skilled in the art without contradiction.

Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, for example two or three, unless specifically defined otherwise.

Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process; and the scope of the preferred embodiments of the present invention includes further implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those reasonably skilled in the art.

The logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection (electronic device) having one or more wires, a portable computer cartridge (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM).
In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program may be captured electronically, for instance by optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory. It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented by any one, or a combination, of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
The above description is only a preferred embodiment of the diver action recognition method based on a three-dimensional human skin, and the protection scope of the method is not limited to the above embodiments; all technical solutions under this concept belong to the protection scope of the invention. It should be noted that modifications and variations made by those skilled in the art without departing from the principles of the invention are also considered to be within its protection scope.
Claims (8)
1. A diver action recognition method based on a three-dimensional human skin, characterized in that the method comprises the following steps:
step 1: extracting the human body shape, posture and vertex information of a diver video frame by a three-dimensional human body posture estimation method;
step 2: the human body shape, posture and vertex data are subjected to data fusion to obtain high-level semantic information;
the step 2 specifically comprises the following steps:
downsampling the vertex information; at the same time, respectively passing the downsampled vertex information and the shape information through a convolution module in a feature extraction network to obtain coding information, and splicing the coding information onto the posture information to obtain the high-level semantic information;
step 3: performing action recognition by using the high-level semantic information through a TCA-GCN module;
step 4: performing action recognition by using the high-level semantic information through the STGCN module;
step 5: the identification results in the step 3 and the step 4 are linearly fused, and the actions of the diver are identified;
the step 5 specifically comprises the following steps:
fusing the results of step 3 and step 4 and taking the fusion as the output, the output result being expressed by the following formula:
score = γ·score_st + (1 − γ)·score_tca

wherein score_st is the action recognition result of the STGCN module, γ is the weight of that result, score_tca denotes the recognition result of the TCA-GCN module, and score is the final weighted output result.
2. The method according to claim 1, characterized in that: the step 3 specifically comprises the following steps:
the TCA-GCN module comprises a TCA module and a TF module, wherein the TCA module mainly considers and combines space-time dimension characteristics of high-level semantic information, then the TF module fuses results of time modeling convolution with an attention method, and finally the extracted space-time information characteristics are subjected to a full-connection layer and a Softmax layer to obtain estimated action categories.
3. The method according to claim 2, characterized in that:
the TCA module comprises time aggregation, topology generation and two-part channel dimension aggregation, wherein the TCA module F out Represented by the formula:
A out =TA(W(W),X)=(W 1 ,X 1 )|||…||(A T ,X T )
S=μ(A k )+α·Q
wherein CA is expressed as channel dimension aggregation, and I is expressed as splicing operation, A out For the structure of diver joint features after time aggregation, S represents the result of topology generation processing of the features, F out For aggregation of joint features in the channel dimension, A out1 Is the convolution result of joint number 1 in the time dimension, S 1 As a result of topology processing of joint No. 1,structure after temporal aggregation for node 1 feature, +.>The method is characterized in that TA is a time aggregation module, W (W) is a time weight feature, X is a joint feature, and W is a result of topology generation processing of the joint feature No. 1 1 Is the time weight characteristic of the No. 1 joint point, X 1 Is the characteristic of the joint point No. 1, A T Is the time weight characteristic of the T-th joint point, X T For the characteristics of the joint point of the T joint point, mu is the normalization and dimension transformation operation of a third-order adjacency matrix, A k For the k-th channel adjacency matrix, α is the trainable parameter of joint connection strength, and Q is the channel correlation matrix.
4. A method according to claim 3, characterized in that:
The TF module output Z_out is represented by the following formula:

Z_out = sk(MSCONV(F_out))

wherein MSCONV is a multi-convolution function; the final TCA-GCN is generated by further combining temporal modeling, the obtained spatio-temporal feature information is passed through a fully connected layer and Softmax to judge the action category, L1 loss is used as the loss function, and the real action category labels (Ground Truth) are used for supervised learning.
5. The method according to claim 4, characterized in that: the step 4 specifically comprises the following steps:
the STGCN module comprises a graph convolution module and a time convolution module, local features of adjacent points in the space are learned through graph convolution, and time sequence information in the sequence data is learned through time convolution; and the extracted space-time information features are subjected to a full connection layer and a Softmax layer to obtain estimated action categories.
6. A diver action recognition system based on three-dimensional human skin is characterized in that: the system comprises:
the data extraction module is used for extracting the human body shape, posture and vertex information of the video frame of the diver through a three-dimensional human body posture estimation method;
the data fusion module is used for: the human body shape, gesture and vertex data are subjected to data fusion to obtain high-level semantic information;
downsampling the vertex information, simultaneously, respectively passing the downsampled vertex information and the shape information through a convolution module in a feature extraction network to obtain coding information, splicing the coding information to gesture information, and obtaining high-level semantic information;
the TCA-GCN motion estimation module: performing action recognition by using the high-level semantic information through a TCA-GCN module;
STGCN action estimation module: performing action recognition by using the high-level semantic information through the STGCN module;
the linear fusion module is used for carrying out linear fusion on the identification results of the TCA-GCN module and the STGCN module and identifying the actions of the diver;
the result fusion of the action recognition by the TCA-GCN module and the action recognition by the STGCN module is taken as output, and the output result is represented by the following formula:
score = γ·score_st + (1 − γ)·score_tca

wherein score_st is the action recognition result of the STGCN module, γ is the weight of that result, score_tca denotes the recognition result of the TCA-GCN module, and score is the final weighted output result.
7. A computer readable storage medium having stored thereon a computer program, characterized in that the program is executed by a processor for implementing the method according to any of claims 1-5.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized by: the processor, when executing the computer program, implements the method of any of claims 1-5.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202310015851.2A | 2023-01-06 | 2023-01-06 | Diver action recognition method based on three-dimensional human body skin

Publications (2)

Publication Number | Publication Date
---|---
CN115862150A | 2023-03-28
CN115862150B | 2023-05-23