CN115862150B - Diver action recognition method based on three-dimensional human body skin - Google Patents

Diver action recognition method based on three-dimensional human body skin

Info

Publication number
CN115862150B
CN115862150B (application CN202310015851.2A)
Authority
CN
China
Prior art keywords
module
information
tca
diver
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310015851.2A
Other languages
Chinese (zh)
Other versions
CN115862150A (en)
Inventor
姜宇
赵明浩
齐红
王跃航
王光诚
魏枫林
王凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Priority to CN202310015851.2A priority Critical patent/CN115862150B/en
Publication of CN115862150A publication Critical patent/CN115862150A/en
Application granted granted Critical
Publication of CN115862150B publication Critical patent/CN115862150B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a diver action recognition method based on three-dimensional human body skin. The invention relates to the technical field of computer vision, which extracts human body shape, posture and vertex data from diver videos by a three-dimensional human body shape and posture estimation method; the human body shape, gesture and vertex data are subjected to a data fusion module to obtain high-level semantic information; performing action recognition by using the high-level semantic information through a TCA-GCN module; performing action recognition by using the high-level semantic information through the STGCN module; and linearly fusing the identification results of the two modules. By the technical scheme, the three-dimensional gesture motion estimation of the diver is realized, and the accuracy of motion recognition is improved.

Description

Diver action recognition method based on three-dimensional human body skin
Technical Field
The invention relates to the technical field of computer vision, in particular to a diver action recognition method based on three-dimensional human body skin.
Background
Action recognition is the basis for computers to understand human behavior; it plays an important role in fields such as human-computer interaction and video understanding, and has become a hot topic in computer vision. Because of the special nature of their working environment, divers cannot communicate in spoken language, but since human limbs naturally carry rich semantic information, a diver working underwater can express special meanings through certain actions. For example, emergency situations such as physical exhaustion, hypoxia and leg cramps can be expressed by different gestures. In such a scenario, how to accurately and efficiently recognize the actions of a diver has become an important research direction.
Most existing diver action recognition methods are based on human skeleton points. However, because skeleton data lacks human surface information, it is relatively abstract and low in semantics: it can only represent the motion characteristics of the human body and cannot embody more concrete, higher-level information such as shape features and vertex features, so it cannot represent human actions accurately enough. In order to utilize more concrete and higher-level semantic information, the application provides a diver action recognition method based on three-dimensional human skin. Since the human body structure can naturally be represented as a graph, many current methods are based on graph convolution. Graph convolution can more accurately find the relations between different key points of the human body and obtain better-represented spatial-dimension information, leading to more accurate action recognition results. Because each diver action is a sequence, many current methods also obtain the relations within an action sequence by means of LSTM, temporal convolution and the like, which extract better time-dimension information and thus achieve better performance. Currently, SMPL is the mainstream three-dimensional human skin representation: two parameters, a shape parameter β and a pose parameter θ, represent the shape and posture of the human body respectively. At the same time, using β and θ, SMPL can obtain the vertex parameter v of the human mesh, which in turn provides more semantic information for the action recognition task. The three-dimensional human skin information represents the posture, shape and vertices of the human body; higher-level semantic information can be obtained through data fusion, and finally a more accurate diver action recognition result is obtained by using a graph-convolution deep learning method.
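As an illustration only (not part of the patented method), the following minimal sketch shows how a shape parameter β and pose parameter θ can be turned into mesh vertices v with the open-source smplx package; the model path and the zero-valued parameters are assumptions.

    import torch
    import smplx

    # Hypothetical local path to the downloaded SMPL model files.
    model = smplx.create("models/", model_type="smpl", gender="neutral")

    betas = torch.zeros(1, 10)          # shape parameters (beta)
    body_pose = torch.zeros(1, 69)      # axis-angle pose of the 23 body joints (theta)
    global_orient = torch.zeros(1, 3)   # root orientation

    output = model(betas=betas, body_pose=body_pose, global_orient=global_orient)
    vertices = output.vertices          # (1, 6890, 3) mesh vertex coordinates v
    print(vertices.shape)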
Disclosure of Invention
In order to overcome the defects of the prior art, the invention realizes the identification of the diver action by utilizing three-dimensional human skin information, and achieves more accurate action identification effect by using higher-level semantic information.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention provides a diver action recognition method based on three-dimensional human skin, which provides the following technical scheme:
a diver action recognition method based on three-dimensional human skin, the method comprising the steps of:
step 1: extracting the human body shape, posture and vertex information of a diver video frame by a three-dimensional human body posture estimation method;
step 2: the human body shape, gesture and vertex data are subjected to data fusion to obtain high-level semantic information;
step 3: performing action recognition by using the high-level semantic information through a TCA-GCN module;
step 4: performing action recognition by using the high-level semantic information through the STGCN module;
step 5: and (3) carrying out linear fusion on the identification results in the step (3) and the step (4) to identify the actions of the diver.
Preferably, the step 2 specifically includes:
the vertex information is downsampled; meanwhile, the downsampled vertex information and the shape information are each passed through a convolution module in the feature extraction network to obtain coding information, and the coding information is spliced onto the gesture information to obtain the high-level semantic information.
Preferably, the step 3 specifically includes:
the TCA-GCN module comprises a TCA module and a TF module, wherein the TCA module mainly considers and combines space-time dimension characteristics of high-level semantic information, then the TF module fuses results of time modeling convolution with an attention method, and finally the extracted space-time information characteristics are subjected to a full-connection layer and a Softmax layer to obtain estimated action categories.
Preferably, the TCA module comprises temporal aggregation, topology generation and channel dimension aggregation, where the output of the TCA module F_out is represented by the following formulas:

F_out = CA(A_out, S) = CA(A_out^1, S_1) ∥ … ∥ CA(A_out^T, S_T)

A_out = TA(W, X) = (W_1, X_1) ∥ … ∥ (W_T, X_T)

S = μ(A_k) + α·Q

wherein CA denotes channel dimension aggregation, ∥ denotes the splicing operation, A_out is the structure of the diver joint features after temporal aggregation, S represents the result of the topology generation processing of the features, F_out is the aggregation of the joint features in the channel dimension, A_out^1 is the convolution result of joint No. 1 in the time dimension, i.e. the structure of the No. 1 joint feature after temporal aggregation, S_1 is the topology generation processing result of joint feature No. 1, TA is the temporal aggregation module, W is the temporal weight feature, X is the joint feature, W_1 is the temporal weight feature of joint No. 1, X_1 is the feature of joint No. 1, W_T is the temporal weight feature of joint No. T, X_T is the feature of joint No. T, μ is the normalization and dimension transformation operation of the third-order adjacency matrix, A_k is the adjacency matrix of the k-th channel, α is the trainable parameter of joint connection strength, and Q is the channel correlation matrix.
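A toy numeric sketch of the aggregation formulas above (per-frame temporal weighting and splicing, topology generation S = μ(A_k) + α·Q, and a simple channel-dimension aggregation); the element-wise weighting and the matrix-product form of CA are assumptions made only to make the formulas concrete.

    import torch

    T, V, C = 16, 24, 7            # frames, joints, feature channels
    X = torch.randn(T, V, C)       # joint features per frame
    W = torch.rand(T, 1, 1)        # temporal weight feature per frame

    # Temporal aggregation TA(W, X): weight each frame and splice along time.
    A_out = torch.cat([(W[t] * X[t]).unsqueeze(0) for t in range(T)], dim=0)  # (T, V, C)

    # Topology generation S = mu(A_k) + alpha * Q for one channel group k.
    A_k = torch.rand(V, V)                          # adjacency of the k-th channel
    mu = lambda a: a / a.sum(dim=1, keepdim=True)   # row normalization (stand-in for mu)
    alpha = torch.tensor(0.5)                       # trainable joint-strength parameter
    Q = torch.rand(V, V)                            # channel correlation matrix
    S = mu(A_k) + alpha * Q                         # (V, V)

    # Channel-dimension aggregation CA: propagate features over the learned topology.
    F_out = torch.einsum("uv,tvc->tuc", S, A_out)   # (T, V, C)
    print(F_out.shape)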
Preferably, the output of the TF module Z_out is represented by the following formula:

Z_out = sk(MSCONV(F_out))

where MSCONV is a multi-convolution function, and the final TCA-GCN is generated by combining it with temporal modeling; the obtained spatio-temporal feature information passes through a fully connected layer and Softmax to judge the action category, L1 loss is used as the loss function, and the real action category labels (Ground Truth) are used for supervised learning.
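A minimal sketch of a multi-scale temporal convolution of the kind MSCONV could denote, followed by the classification head (fully connected layer, Softmax); the kernel sizes, the summation used to merge branches, and the name MultiScaleTemporalConv are assumptions.

    import torch
    import torch.nn as nn

    class MultiScaleTemporalConv(nn.Module):
        """Apply temporal convolutions with several kernel sizes and merge them."""
        def __init__(self, channels, kernel_sizes=(3, 5, 7)):
            super().__init__()
            self.branches = nn.ModuleList(
                nn.Conv2d(channels, channels, kernel_size=(k, 1), padding=(k // 2, 0))
                for k in kernel_sizes
            )

        def forward(self, f):            # f: (B, C, T, V) spatio-temporal features
            return sum(branch(f) for branch in self.branches)

    B, C, T, V, num_classes = 2, 64, 16, 24, 10
    f_out = torch.randn(B, C, T, V)
    z_out = MultiScaleTemporalConv(C)(f_out)                     # (B, C, T, V)
    logits = nn.Linear(C, num_classes)(z_out.mean(dim=(2, 3)))   # pool, then FC layer
    probs = torch.softmax(logits, dim=-1)                        # Softmax over action classes
    print(probs.shape)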
Preferably, the step 4 specifically includes:
the STGCN module comprises a graph convolution module and a time convolution module, local features of adjacent points in the space are learned through graph convolution, and time sequence information in the sequence data is learned through time convolution; and the extracted space-time information features are subjected to a full connection layer and a Softmax layer to obtain estimated action categories.
Preferably, the step 5 specifically includes:

fusing the results of step 3 and step 4 as the output, where the output result is represented by the following formula:

score = γ·score_st + (1 - γ)·score_tca

wherein score_st is the action recognition result of the STGCN module, γ is the weight of that result, score_tca represents the recognition result of the TCA-GCN module, and score is the final weighted output result.
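The weighted fusion of the two recognition scores can be written in a couple of lines; the value γ = 0.5 used here is only an illustrative assumption.

    import torch

    gamma = 0.5                                              # fusion weight (assumed value)
    score_st = torch.softmax(torch.randn(1, 10), dim=-1)     # STGCN class scores
    score_tca = torch.softmax(torch.randn(1, 10), dim=-1)    # TCA-GCN class scores
    score = gamma * score_st + (1 - gamma) * score_tca       # linear fusion
    action = score.argmax(dim=-1)                            # final diver action category
    print(action)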
A diver action recognition system based on three-dimensional human skin, the system comprising:
the data extraction module is used for extracting the human body shape, posture and vertex information of the video frame of the diver through a three-dimensional human body posture estimation method;
the data fusion module is used for: the human body shape, gesture and vertex data are subjected to data fusion to obtain high-level semantic information;
the TCA-GCN motion estimation module: performing action recognition by using the high-level semantic information through a TCA-GCN module;
STGCN action estimation module: performing action recognition by using the high-level semantic information through the STGCN module;
and the linear fusion module is used for carrying out linear fusion on the identification results of the TCA-GCN module and the STGCN module and identifying the actions of the diver.
A computer readable storage medium having stored thereon a computer program for execution by a processor for implementing a diver action recognition method based on a three-dimensional human skin.
A computer device comprising a memory storing a computer program and a processor implementing a diver action recognition method based on a three-dimensional human skin when executing the computer program.
The invention has the following beneficial effects:
compared with the prior art, the invention has the advantages that:
the invention extracts the shape, posture and vertex data of the human body from the video of the diver by a three-dimensional human body shape and posture estimation method; the human body shape, gesture and vertex data are subjected to a data fusion module to obtain high-level semantic information; performing action recognition by using the high-level semantic information through a TCA-GCN module; performing action recognition by using the high-level semantic information through the STGCN module; and linearly fusing the identification results of the two modules. By the technical scheme, the three-dimensional gesture motion estimation of the diver is realized, and the accuracy of motion recognition is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a diver action recognition method based on three-dimensional human skin data;
fig. 2 is a block diagram of a diver action recognition method based on three-dimensional human skin data.
Detailed Description
The following description of the embodiments of the present invention will be made apparent and fully in view of the accompanying drawings, in which some, but not all embodiments of the invention are shown. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The present invention will be described in detail with reference to specific examples.
First embodiment:
according to the specific optimization technical scheme adopted by the invention for solving the technical problems, as shown in the figures 1 to 2, the technical scheme is as follows: the invention relates to a diver action recognition method based on three-dimensional human body skin.
A diver action recognition method based on three-dimensional human skin, the method comprising the steps of:
step 1: extracting the human body shape, posture and vertex information of a diver video frame by a three-dimensional human body posture estimation method;
step 2: the human body shape, gesture and vertex data are subjected to data fusion to obtain high-level semantic information;
step 3: performing action recognition by using the high-level semantic information through a TCA-GCN module;
step 4: performing action recognition by using the high-level semantic information through the STGCN module;
step 5: and (3) carrying out linear fusion on the identification results in the step (3) and the step (4) to identify the actions of the diver.
Specific embodiment II:
the second embodiment of the present application differs from the first embodiment only in that:
the step 2 specifically comprises the following steps:
and downsampling the vertex information, simultaneously, respectively passing the downsampled vertex information and the shape information through a convolution module in the feature extraction network to obtain coding information, and splicing the coding information to the gesture information to obtain high-level semantic information.
Third embodiment:
the difference between the third embodiment and the second embodiment of the present application is only that:
the TCA-GCN module comprises a TCA module and a TF module, wherein the TCA module mainly considers and combines space-time dimension characteristics of high-level semantic information, then the TF module fuses results of time modeling convolution with an attention method, and finally the extracted space-time information characteristics are subjected to a full-connection layer and a Softmax layer to obtain estimated action categories.
Fourth embodiment:
The fourth embodiment of the present application differs from the third embodiment only in that:
the TCA module comprises temporal aggregation, topology generation and channel dimension aggregation, wherein the TCA module is represented by the following formulas:

F_out = CA(A_out, S) = CA(A_out^1, S_1) ∥ … ∥ CA(A_out^T, S_T)

A_out = TA(W, X) = (W_1, X_1) ∥ … ∥ (W_T, X_T)

S = μ(A_k) + α·Q

wherein CA denotes channel dimension aggregation, ∥ denotes the splicing operation, A_out is the structure of the diver joint features after temporal aggregation, S represents the result of the topology generation processing of the features, F_out is the aggregation of the joint features in the channel dimension, A_out^1 is the convolution result of joint No. 1 in the time dimension, i.e. the structure of the No. 1 joint feature after temporal aggregation, S_1 is the topology generation processing result of joint feature No. 1, TA is the temporal aggregation module, W is the temporal weight feature, X is the joint feature, W_1 is the temporal weight feature of joint No. 1, X_1 is the feature of joint No. 1, W_T is the temporal weight feature of joint No. T, X_T is the feature of joint No. T, μ is the normalization and dimension transformation operation of the third-order adjacency matrix, A_k is the adjacency matrix of the k-th channel, α is the trainable parameter of joint connection strength, and Q is the channel correlation matrix.
Fifth embodiment:
The fifth embodiment differs from the fourth embodiment only in that:
the output of the TF module Z_out is represented by the following formula:

Z_out = sk(MSCONV(F_out))

where MSCONV is a multi-convolution function, and the final TCA-GCN is generated by combining it with temporal modeling; the obtained spatio-temporal feature information passes through a fully connected layer and Softmax to judge the action category, L1 loss is used as the loss function, and the real action category labels (Ground Truth) are used for supervised learning.
Specific embodiment six:
the difference between the sixth embodiment and the fifth embodiment of the present application is only that:
the STGCN module comprises a graph convolution module and a time convolution module, local features of adjacent points in the space are learned through graph convolution, and time sequence information in the sequence data is learned through time convolution; and the extracted space-time information features are subjected to a full connection layer and a Softmax layer to obtain estimated action categories.
Specific embodiment seven:
The seventh embodiment of the present application differs from the sixth embodiment only in that:
fusing the results of step 3 and step 4 as the output, where the output result is represented by the following formula:

score = γ·score_st + (1 - γ)·score_tca

wherein score_st is the action recognition result of the STGCN module, γ is the weight of that result, score_tca represents the recognition result of the TCA-GCN module, and score is the final weighted output result.
Specific embodiment eight:
the eighth embodiment of the present application differs from the seventh embodiment only in that:
the invention provides a diver action recognition system based on three-dimensional human skin, which comprises:
the data extraction module is used for extracting the human body shape, posture and vertex information of the video frame of the diver through a three-dimensional human body posture estimation method;
the data fusion module is used for: the human body shape, gesture and vertex data are subjected to data fusion to obtain high-level semantic information;
the TCA-GCN motion estimation module: performing action recognition by using the high-level semantic information through a TCA-GCN module;
STGCN action estimation module: performing action recognition by using the high-level semantic information through the STGCN module;
and the linear fusion module is used for carrying out linear fusion on the identification results of the TCA-GCN module and the STGCN module and identifying the actions of the diver.
Specific embodiment nine:
The ninth embodiment of the present application differs from the eighth embodiment only in that:
the present invention provides a computer-readable storage medium having stored thereon a computer program that is executed by a processor to implement the diver action recognition method based on three-dimensional human skin.
The method is implemented by a system that comprises: a data extraction module, a data fusion module, an action estimation module and a fusion module.
The data extraction module extracts the human body shape, posture and vertex information of the diver video frames by using a three-dimensional human body posture estimation method.
The data fusion module extracts high-level semantic information by using the shape, gesture and vertex information of the human body.
The action estimation module performs action recognition with the high-level semantic information through the TCA-GCN module and the STGCN module respectively.
The fusion module fuses the results of the action estimation module to obtain a more accurate diver action recognition result.
The constructed modules specifically comprise a feature extraction network, an STGCN network and a TCA-GCN network.
Step 21: the vertex information is downsampled, and meanwhile the downsampled vertex information and the shape information are each passed through a convolution module in the feature extraction network to obtain coding information. The coding information is spliced onto the gesture information to obtain high-level semantic information.
Step 22: the STGCN comprises a graph convolution module and a time convolution module. Through graph convolution, the local features of adjacent points in space are learned; through time convolution, the time-sequence information in the sequence data is learned. Finally, the extracted space-time features pass through a fully connected layer and a Softmax layer to obtain the estimated action category.
Step 23: the TCA-GCN mainly consists of a TCA module and a TF module. The TCA module mainly considers and combines the space-time dimension features of the high-level semantic information, the TF module fuses the result of time-modeling convolution with an attention method, and finally the extracted space-time features pass through a fully connected layer and a Softmax layer to obtain the estimated action category.
Step 24: according to the results of step 22 and step 23, a more accurate diver action recognition result is output by means of linear weighting.
The calculation formulas of this process are as follows:

(β_i, θ_i, v_i) = H(I_i)

X = D(β, θ, v)

score_st = F_st(X), score_tca = F_tca(X)

score = L(score_st, score_tca) = γ·score_st + (1 - γ)·score_tca

wherein I_i represents the i-th frame image extracted from the video, H represents the human body posture and shape estimation method, and β_i, θ_i and v_i represent the shape, posture and vertex information of the i-th frame, respectively. D represents the data fusion module, which yields the high-level semantic information X. F_st and F_tca represent the STGCN module and the TCA-GCN module respectively; using X, they yield the two action recognition results score_st and score_tca. L represents the linear fusion of the recognition results, γ represents the weight of the STGCN result, and the more accurate final recognition result score is thereby obtained.
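For readability only, the data flow of these formulas can be written as the following Python sketch; the arguments estimate_smpl, fuse, stgcn and tca_gcn are hypothetical placeholders for the H, D, F_st and F_tca modules described above, and γ = 0.5 is an assumed fusion weight.

    import torch

    def recognize_diver_action(frames, estimate_smpl, fuse, stgcn, tca_gcn, gamma=0.5):
        # Step 1: per-frame estimation (beta_i, theta_i, v_i) = H(I_i)
        smpl_params = [estimate_smpl(frame) for frame in frames]
        # Step 2: data fusion X = D(beta, theta, v)
        x = fuse(smpl_params)
        # Steps 3-4: two graph-convolution recognizers on the same features
        score_st, score_tca = stgcn(x), tca_gcn(x)
        # Step 5: linear fusion score = L(score_st, score_tca)
        return gamma * score_st + (1 - gamma) * score_tca

    # Trivial stand-ins, just to show the data flow:
    frames = [torch.zeros(3, 224, 224) for _ in range(16)]
    score = recognize_diver_action(
        frames,
        estimate_smpl=lambda f: (torch.zeros(10), torch.zeros(24, 3), torch.zeros(6890, 3)),
        fuse=lambda p: torch.randn(len(p), 24, 7),
        stgcn=lambda x: torch.softmax(torch.randn(8), dim=0),
        tca_gcn=lambda x: torch.softmax(torch.randn(8), dim=0),
    )
    print(score.argmax())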
Specific embodiment ten:
The tenth embodiment differs from the ninth embodiment only in that:
the invention provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the diver action recognition method based on three-dimensional human skin.
The method comprises the following steps:
Step 1: because high-quality three-dimensional human skin data markedly improves the diver action recognition task, the method uses the ROMP network, which currently gives better three-dimensional human posture estimation results, to obtain the shape, posture and vertex parameters of the human body.
Step 2: the shape, posture and vertex parameters of the human body are fused by the data fusion module to obtain higher-level semantic information. Specifically, the vertex information is downsampled, then the downsampling result and the shape parameters are each passed through a convolution network to obtain the encoded vertex and shape information, and finally the encoded information is spliced onto the gesture parameters to obtain the higher-level semantic information.
Step 3: a space-time graph is constructed based on the skin key points. SMPL is represented by 24 skin key points, so a space-time graph can be constructed: a graph G = (V, E) is built on a key-point sequence containing N key points and T frames, with both intra-frame and inter-frame connections, i.e. a space-time graph.
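As an illustration of this construction, the sketch below builds intra-frame edges from a hypothetical skeleton edge list over the 24 SMPL key points and inter-frame edges that connect each key point to itself in the next frame; the concrete edge list is an assumption.

    import torch

    def build_space_time_graph(edges, n_keypoints=24, n_frames=4):
        """Adjacency of the space-time graph G=(V,E): N*T nodes,
        intra-frame skeleton edges plus inter-frame self-connections."""
        n = n_keypoints * n_frames
        A = torch.zeros(n, n)
        for t in range(n_frames):
            off = t * n_keypoints
            for i, j in edges:                       # intra-frame connections
                A[off + i, off + j] = A[off + j, off + i] = 1
            if t + 1 < n_frames:                     # inter-frame connections
                nxt = (t + 1) * n_keypoints
                for v in range(n_keypoints):
                    A[off + v, nxt + v] = A[nxt + v, off + v] = 1
        return A

    # Hypothetical subset of SMPL kinematic-tree edges (parent, child indices).
    edges = [(0, 1), (0, 2), (0, 3), (1, 4), (2, 5), (3, 6)]
    A = build_space_time_graph(edges)
    print(A.shape)   # torch.Size([96, 96]) for 24 key points over 4 frames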
Step 4: rich spatial information is obtained by using the GCN. The formula of the GCN is as follows:

f_out = Λ^(-1/2) (A + I) Λ^(-1/2) · f_in · W ⊙ M

wherein A is the connection relation matrix between the key points, f_in is the relevant feature matrix of the key points, W is the layer weight, M is the importance degree of the different joint points, and Λ^(-1/2)(A + I)Λ^(-1/2) denotes the normalization processing. In order to conform to the three-dimensional actions of the diver's underwater operation, the key points are divided into root nodes, centrifugal points and centripetal points, imitating the movement trend of the action. The adjacency matrix thereby becomes three-dimensional, and the specific formulation is as follows:

f_out = Σ_j Λ_j^(-1/2) A_j Λ_j^(-1/2) · f_in · W_j ⊙ M_j

where, for the different dimensions j, A_1, A_2 and A_3 represent the root nodes, the centrifugal points and the centripetal points of the diver's movement, respectively, and M_j is a trainable quantity that accommodates the different importance of each key point for different time periods.
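A small numeric sketch of the partitioned graph convolution described above, summing the normalized root/centrifugal/centripetal adjacencies with per-partition weights and importance masks; the random matrices stand in for learned quantities and are assumptions.

    import torch

    V, C_in, C_out, T = 24, 7, 64, 16
    x = torch.randn(T, V, C_in)                        # input key-point features

    def normalize(a):                                  # Lambda^(-1/2) A Lambda^(-1/2)
        d = a.sum(dim=1).clamp(min=1).sqrt()
        return a / d.unsqueeze(1) / d.unsqueeze(0)

    # Three partition adjacencies: root nodes, centrifugal points, centripetal points.
    A = [torch.eye(V), torch.rand(V, V), torch.rand(V, V)]
    W = [torch.randn(C_in, C_out) for _ in A]          # per-partition weights
    M = [torch.ones(V, V) for _ in A]                  # trainable importance (init to 1)

    f_out = sum(
        torch.einsum("uv,tvc,cd->tud", normalize(A_j) * M_j, x, W_j)
        for A_j, W_j, M_j in zip(A, W, M)
    )                                                  # (T, V, C_out)
    print(f_out.shape)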
Step 5: diver action recognition is performed by using the graph-convolution-based deep learning network STGCN. The dimension of the high-level semantic information obtained in step 2 is (S, 24, 7), where S represents the length of the action sequence, 24 represents the 24 human skin key points, and 7 represents the feature dimension of each key point. A sampling function is used to specify the range of neighboring nodes involved when the graph convolution operation is performed on each node. Through graph convolution, the local features of adjacent points in space are learned; through time convolution, the time-sequence information in the sequence data is learned. The obtained space-time feature information passes through a fully connected layer and Softmax to judge the action category, L1 loss is used as the loss function, and the real action category labels (Ground Truth) are used for supervised learning.
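As a sketch of the supervision described here (L1 loss against the real action category labels), the Softmax output can be compared with a one-hot encoding of the ground-truth label; using one-hot targets is an assumption about how the labels are encoded.

    import torch
    import torch.nn.functional as F

    num_classes = 8
    logits = torch.randn(4, num_classes, requires_grad=True)  # FC-layer output for 4 sequences
    labels = torch.tensor([0, 3, 3, 7])                       # ground-truth action categories
    probs = torch.softmax(logits, dim=-1)
    target = F.one_hot(labels, num_classes).float()           # one-hot encoding (assumed)
    loss = F.l1_loss(probs, target)                           # L1 loss for supervised learning
    loss.backward()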
Step 6: diver action recognition is performed by using the graph-convolution-based deep learning network TCA-GCN. The temporal aggregation module learns time-dimension features, and the channel aggregation module is used to effectively combine the spatially dynamic channel-level topology features with the temporally dynamic topology features. The network mainly comprises a TCA module and a TF module. The TCA module is divided into temporal aggregation, topology generation and channel dimension aggregation, and its specific formulas are as follows:

F_out = CA(A_out, S) = CA(A_out^1, S_1) ∥ … ∥ CA(A_out^T, S_T)

A_out = TA(W, X) = (W_1, X_1) ∥ … ∥ (W_T, X_T)

S = μ(A_k) + α·Q

wherein CA denotes channel dimension aggregation, ∥ denotes the splicing operation, A_out is the structure of the diver joint features after temporal aggregation, denoted TA(W, X), and S represents the result of the topology generation processing of the features, denoted μ(A_k) + α·Q. The TF module is denoted Z_out = sk(MSCONV(F_out)), where MSCONV is a multi-convolution function, and the final TCA-GCN is generated by the final combination with temporal modeling. The obtained space-time feature information passes through a fully connected layer and Softmax to judge the action category, L1 loss is used as the loss function, and the real action category labels (Ground Truth) are used for supervised learning.
Step 7: the accuracy of action recognition is improved by means of weighted linear fusion. Because the data features and the feature extraction methods considered by the two modules in step 5 and step 6 are different, the results of the two modules are fused and used as the output, according to the following formula:

score = γ·score_st + (1 - γ)·score_tca

wherein score_st is the action recognition result of the STGCN module, γ is the weight of that result, score_tca represents the recognition result of the TCA-GCN module, and score is the final weighted result.
Specific embodiment eleven:
The eleventh embodiment of the present application differs from the tenth embodiment only in that:
this embodiment provides a diver action recognition method based on three-dimensional human skin data, which comprises the following steps:
Step 1: the shape, posture and vertex information of the diver is extracted by using a three-dimensional human body posture estimation method.
Specifically, in order to improve the accuracy of diver motion estimation, the application uses ROMP, a three-dimensional human body posture estimation network with a better current effect. The network outputs (β, θ, v), which represent the human body shape, posture and vertex information respectively, where β ∈ R^10, θ ∈ R^(24×3) and v ∈ R^(6890×3).
Step 2: high-level semantic information is obtained from the shape, posture and vertex information through the data fusion module.
Specifically, the vertex information v is downsampled, and the downsampling result is passed through a convolution network that changes only the channel information and not the other dimensional information, giving high-level vertex coding information. At the same time, the shape parameter β is also passed through a convolution network to obtain shape coding information. Finally, the vertex and shape coding information is spliced (embedded) onto the pose parameter θ to obtain the higher-level semantic information.
Step 3: action recognition is performed with the high-level semantic information through the STGCN module.
Specifically, the three-dimensional human skin parameters are represented by 24 human skin key points, so a space-time graph can be constructed: a graph G = (V, E) is built on a key-point sequence containing N key points and T frames, with both intra-frame and inter-frame connections, i.e. a space-time graph. The high-level semantic information dimension is (S, 24, 7), where S represents the action sequence length, 24 represents the 24 human skin key points, and 7 represents the feature dimension of each key point. A sampling function is used to specify the range of neighboring nodes involved when the graph convolution operation is performed on each node. Through graph convolution, the local features of adjacent points in space are learned; through time convolution, the time-sequence information in the sequence data is learned. The extracted feature information passes through a fully connected layer and Softmax to judge the action category, L1 loss is used as the loss function, and the real action category labels (Ground Truth) are used for supervised learning.
Step 4: action recognition is performed with the high-level semantic information through the TCA-GCN module.
Specifically, this module mainly consists of two sub-modules, a TCA module and a TF module. The TCA module considers and combines the temporal and spatial dimension characteristics of the sequence. The skin sequence data generates sample time weights through the temporal aggregation module, and the channel aggregation module is then used to effectively combine the spatially dynamic channel-level topology features with the temporally dynamic topology features to generate the input of the TF module. The TF module fuses previous time-modeling convolution methods with an attention method. After the two sub-modules, better feature information is obtained; finally, the feature information passes through a fully connected layer and Softmax to judge the action category, L1 loss is used as the loss function, and the real action category labels (Ground Truth) are used for supervised learning.
Step 5: the results of the two modules are fused and the final action category is output.
Specifically, the accuracy of action recognition is improved by means of weighted linear fusion. Because the data features considered by the two modules in step 3 and step 4 are different, the results of the two modules are fused and used as the output, according to the following formula:

score = γ·score_st + (1 - γ)·score_tca

wherein score_st is the action recognition result of the STGCN module, γ is the weight of that result, score_tca represents the recognition result of the TCA-GCN module, and score is the final weighted result.
According to the technical scheme, more concrete, higher-level feature information is used to represent the actions of a diver, and the shape, posture and vertex parameters are obtained by a three-dimensional human body posture estimation method. By this method, the three-dimensional human body information passes through the action recognition modules (the STGCN module and the TCA-GCN module) and the linear weighting module, so that diver action recognition can be completed. This provides convenience for the communication of divers during underwater operations.
In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "examples," "particular embodiments," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "N" means at least two, for example, two, three, etc., unless specifically defined otherwise. Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more N executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present invention. Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or N wires, a portable computer cartridge (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). 
In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the N steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. As with the other embodiments, if implemented in hardware, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
The above description is only a preferred embodiment of a method for identifying the motions of a diver based on a three-dimensional human skin, and the protection scope of a method for identifying the motions of a diver based on a three-dimensional human skin is not limited to the above embodiments, and all technical solutions under the concept belong to the protection scope of the invention. It should be noted that modifications and variations can be made by those skilled in the art without departing from the principles of the present invention, which is also considered to be within the scope of the present invention.

Claims (8)

1. A diver action recognition method based on three-dimensional human skin is characterized by comprising the following steps: the method comprises the following steps:
step 1: extracting the human body shape, posture and vertex information of a diver video frame by a three-dimensional human body posture estimation method;
step 2: the human body shape, gesture and vertex data are subjected to data fusion to obtain high-level semantic information;
the step 2 specifically comprises the following steps:
downsampling the vertex information, simultaneously, respectively passing the downsampled vertex information and the shape information through a convolution module in a feature extraction network to obtain coding information, splicing the coding information to gesture information, and obtaining high-level semantic information;
step 3: performing action recognition by using the high-level semantic information through a TCA-GCN module;
step 4: performing action recognition by using the high-level semantic information through the STGCN module;
step 5: the identification results in the step 3 and the step 4 are linearly fused, and the actions of the diver are identified;
the step 5 specifically comprises the following steps:
fusing the results of the step 3 and the step 4, and expressing the output result as output by the following formula:
score = γ·score_st + (1 - γ)·score_tca
wherein score_st is the action recognition result of the STGCN module, γ is the weight of the result, score_tca represents the recognition result of the TCA-GCN module, and score is the final weighted output result.
2. The method according to claim 1, characterized in that: the step 3 specifically comprises the following steps:
the TCA-GCN module comprises a TCA module and a TF module, wherein the TCA module mainly considers and combines space-time dimension characteristics of high-level semantic information, then the TF module fuses results of time modeling convolution with an attention method, and finally the extracted space-time information characteristics are subjected to a full-connection layer and a Softmax layer to obtain estimated action categories.
3. The method according to claim 2, characterized in that:
the TCA module comprises temporal aggregation, topology generation and channel dimension aggregation, wherein the TCA module output F_out is represented by the following formulas:

F_out = CA(A_out, S) = CA(A_out^1, S_1) ∥ … ∥ CA(A_out^T, S_T)

A_out = TA(W, X) = (W_1, X_1) ∥ … ∥ (W_T, X_T)

S = μ(A_k) + α·Q

wherein CA denotes channel dimension aggregation, ∥ denotes the splicing operation, A_out is the structure of the diver joint features after temporal aggregation, S represents the result of the topology generation processing of the features, F_out is the aggregation of the joint features in the channel dimension, A_out^1 is the convolution result of joint No. 1 in the time dimension, i.e. the structure of the No. 1 joint feature after temporal aggregation, S_1 is the topology generation processing result of joint feature No. 1, TA is the temporal aggregation module, W is the temporal weight feature, X is the joint feature, W_1 is the temporal weight feature of joint No. 1, X_1 is the feature of joint No. 1, W_T is the temporal weight feature of joint No. T, X_T is the feature of joint No. T, μ is the normalization and dimension transformation operation of the third-order adjacency matrix, A_k is the adjacency matrix of the k-th channel, α is the trainable parameter of joint connection strength, and Q is the channel correlation matrix.
4. A method according to claim 3, characterized in that:
the TF module output Z_out is represented by the following formula:

Z_out = sk(MSCONV(F_out))

wherein MSCONV is a multi-convolution function, and the final TCA-GCN is generated by combining it with temporal modeling; the obtained spatio-temporal feature information passes through a fully connected layer and Softmax to judge the action category, L1 loss is used as the loss function, and the real action category labels (Ground Truth) are used for supervised learning.
5. The method according to claim 4, characterized in that: the step 4 specifically comprises the following steps:
the STGCN module comprises a graph convolution module and a time convolution module, local features of adjacent points in the space are learned through graph convolution, and time sequence information in the sequence data is learned through time convolution; and the extracted space-time information features are subjected to a full connection layer and a Softmax layer to obtain estimated action categories.
6. A diver action recognition system based on three-dimensional human skin is characterized in that: the system comprises:
the data extraction module is used for extracting the human body shape, posture and vertex information of the video frame of the diver through a three-dimensional human body posture estimation method;
the data fusion module is used for: the human body shape, gesture and vertex data are subjected to data fusion to obtain high-level semantic information;
downsampling the vertex information, simultaneously, respectively passing the downsampled vertex information and the shape information through a convolution module in a feature extraction network to obtain coding information, splicing the coding information to gesture information, and obtaining high-level semantic information;
the TCA-GCN motion estimation module: performing action recognition by using the high-level semantic information through a TCA-GCN module;
STGCN action estimation module: performing action recognition by using the high-level semantic information through the STGCN module;
the linear fusion module is used for carrying out linear fusion on the identification results of the TCA-GCN module and the STGCN module and identifying the actions of the diver;
the result fusion of the action recognition by the TCA-GCN module and the action recognition by the STGCN module is taken as output, and the output result is represented by the following formula:
score = γ·score_st + (1 - γ)·score_tca
wherein score_st is the action recognition result of the STGCN module, γ is the weight of the result, score_tca represents the recognition result of the TCA-GCN module, and score is the final weighted output result.
7. A computer readable storage medium having stored thereon a computer program, characterized in that the program is executed by a processor for implementing the method according to any of claims 1-5.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized by: the processor, when executing the computer program, implements the method of any of claims 1-5.
CN202310015851.2A 2023-01-06 2023-01-06 Diver action recognition method based on three-dimensional human body skin Active CN115862150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310015851.2A CN115862150B (en) 2023-01-06 2023-01-06 Diver action recognition method based on three-dimensional human body skin


Publications (2)

Publication Number Publication Date
CN115862150A CN115862150A (en) 2023-03-28
CN115862150B true CN115862150B (en) 2023-05-23

Family

ID=85656975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310015851.2A Active CN115862150B (en) 2023-01-06 2023-01-06 Diver action recognition method based on three-dimensional human body skin

Country Status (1)

Country Link
CN (1) CN115862150B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591560A (en) * 2021-06-23 2021-11-02 西北工业大学 Human behavior recognition method
CN114663593A (en) * 2022-03-25 2022-06-24 清华大学 Three-dimensional human body posture estimation method, device, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297955B (en) * 2021-05-21 2022-03-18 中国矿业大学 Sign language word recognition method based on multi-mode hierarchical information fusion
CN114863325B (en) * 2022-04-19 2024-06-07 上海人工智能创新中心 Action recognition method, apparatus, device and computer readable storage medium
CN114550308B (en) * 2022-04-22 2022-07-05 成都信息工程大学 Human skeleton action recognition method based on space-time diagram
CN114973422A (en) * 2022-07-19 2022-08-30 南京应用数学中心 Gait recognition method based on three-dimensional human body modeling point cloud feature coding




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant