CN113239824B - Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module - Google Patents

Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module

Info

Publication number
CN113239824B
CN113239824B CN202110544122.7A CN202110544122A
Authority
CN
China
Prior art keywords
channel
module
network
mode
convolution layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110544122.7A
Other languages
Chinese (zh)
Other versions
CN113239824A (en)
Inventor
李敬华
刘润泽
孔德慧
王少帆
王立春
尹宝才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110544122.7A priority Critical patent/CN113239824B/en
Publication of CN113239824A publication Critical patent/CN113239824A/en
Application granted granted Critical
Publication of CN113239824B publication Critical patent/CN113239824B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention relates to a dynamic gesture recognition method for multi-modal training and single-modal testing based on a 3D-Ghost module, which solves the problem of recognizing dynamic gestures when multiple modalities are available for training but only one for testing. Specifically, RGB data and depth data are used to train the whole network, which adopts a parallel two-channel collaborative learning structure in which the learning process is improved by transferring knowledge between the networks of different modalities: channel m recognizes dynamic gestures from the RGB data and channel n recognizes dynamic gestures from the depth data. After training is completed, dynamic gesture recognition is carried out by feeding RGB data into channel m or by feeding depth data into channel n. Each channel adopts an I3D network and improves it: an attention module is added, part of the 3D convolution layers are replaced by 3D-Ghost modules, and all the Inception-V1 sub-modules are improved.

Description

Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module
Technical Field
The invention belongs to the technical field of computer vision and man-machine interaction, and particularly relates to a dynamic gesture recognition method based on a multi-mode training single-mode test.
Background
Gestures are one of the ways of human-machine interaction and can provide a natural and intuitive mode of interaction. Gesture recognition aims at understanding human hand actions and plays an extremely important role in fields such as human-computer interaction, virtual reality and augmented reality. Efficient video-based gesture recognition is extremely difficult because many factors affect it, such as changes in lighting conditions, complex background environments, and gestures in real environments that deviate from standard gestures. In recent years, consumer-level depth sensors have become popular; depth images are not easily affected by illumination and make background segmentation easier, and, as an effective supplement to RGB images, they can enhance the expressive power of the data to a certain extent. Multimodal gesture recognition, however, typically requires a large amount of annotated, time-aligned RGB-D data that is difficult to obtain. The invention assumes that both color and depth video sequences are available during training, while only one modality of data is provided during testing. Therefore, the invention provides a dynamic gesture recognition method, based on a deep learning network, for multi-modal training and single-modal testing.
Disclosure of Invention
In order to solve the above problems, the invention provides an improved dynamic gesture recognition method based on multi-modal training and single-modal testing. The invention builds on a single-modality feature representation that uses spatial attention and the 3D-Ghost module, together with an improved inter-modality knowledge transfer method, and ultimately improves single-modality gesture recognition accuracy. The specific technical scheme is as follows:
the method trains the whole network using RGB data and depth data; the whole network adopts a parallel two-channel collaborative learning structure in which the learning process is improved by transferring knowledge between the networks of different modalities; the two channel networks have the same structure but do not share parameters; channel m recognizes dynamic gestures from the RGB data and channel n recognizes dynamic gestures from the depth data; after training is completed, dynamic gesture recognition is carried out by feeding RGB data into channel m or by feeding depth data into channel n;
wherein each channel adopts an I3D network and improves it; the I3D network sequentially comprises a first 3D convolution layer, a first 3D max-pooling layer, a second 3D convolution layer, a third 3D convolution layer, a second 3D max-pooling layer, a first Inception-V1 sub-module, a second Inception-V1 sub-module, a third 3D max-pooling layer, third to seventh Inception-V1 sub-modules, a fourth 3D max-pooling layer, eighth and ninth Inception-V1 sub-modules, an average pooling layer and a fourth 3D convolution layer,
the improvements are as follows: attention modules are added behind the first 3D convolution layer and the third 3D convolution layer respectively, the second 3D convolution layer is replaced by a 3D-Ghost module, and all the Inception-V1 sub-modules are improved;
the working process of the attention module is as follows:
the feature map F of size C×T×W×H output by the three-dimensional convolution layer preceding the attention module is average-pooled and max-pooled along the channel dimension, yielding two channel descriptions of size T×W×H×1; the two descriptions are concatenated along the channel dimension and then passed through a three-dimensional convolution layer with the same kernel size as the layer preceding the attention module, followed by a Sigmoid activation function, to obtain the weight coefficient M(F):

M(F) = σ(f^{d×d×d}([F_avg; F_max]))

where σ is the Sigmoid function, f^{d×d×d} denotes the three-dimensional convolution with the same kernel size as the layer before the attention module, and F_avg and F_max are the average-pooled and max-pooled channel descriptions, respectively;
finally, the original feature map F is multiplied element-wise with the weight coefficient M(F) to obtain the new attention-weighted feature F*:

F* = M(F) ⊗ F
The improvement to the Inception-V1 sub-module is specifically as follows:
the first three-dimensional convolution layer in each of the second and third paths is replaced by a 3D-Ghost module; an attention module is added to each of the second and third paths, specifically located after the latter three-dimensional convolution layer of those paths; and the three-dimensional convolution layer in the fourth path is replaced by a 3D-Ghost module, as sketched below.
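As an illustration only, the following PyTorch sketch shows how such an improved four-path Inception-V1 sub-module could be assembled; it is not the patented implementation. The channel arguments and the ghost_block and attn_block constructors (stand-ins for the 3D-Ghost module and the attention module, whose sketches appear later in the description) are assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn

class GhostInception3D(nn.Module):
    """Hypothetical four-path Inception-V1 sub-module with the modifications described above."""
    def __init__(self, cin, c1, c2_in, c2_out, c3_in, c3_out, c4, ghost_block, attn_block):
        super().__init__()
        self.b1 = nn.Conv3d(cin, c1, 1)                                # path 1: plain 1x1x1 conv
        self.b2 = nn.Sequential(ghost_block(cin, c2_in),               # path 2: first conv -> 3D-Ghost
                                nn.Conv3d(c2_in, c2_out, 3, padding=1),
                                attn_block())                          # attention after the latter conv
        self.b3 = nn.Sequential(ghost_block(cin, c3_in),               # path 3: same layout as path 2
                                nn.Conv3d(c3_in, c3_out, 3, padding=1),
                                attn_block())
        self.b4 = nn.Sequential(nn.MaxPool3d(3, stride=1, padding=1),  # path 4: pool, then conv -> 3D-Ghost
                                ghost_block(cin, c4))

    def forward(self, x):
        # concatenate the four paths along the channel dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
```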
Only one modality of data is available during testing, so during training the two channels use their own loss functions; in order to share the knowledge of the dominant modality, a spatio-temporal semantic alignment (SSA) loss is added to the loss function of the weak modality and the two channels are trained jointly,
when Δl > 0, the loss functions of channel m and channel n are respectively:

L^m = L_c^m + λ·L_SSA,   L^n = L_c^n

when Δl ≤ 0, the loss functions of channel m and channel n are respectively:

L^m = L_c^m,   L^n = L_c^n + λ·L_SSA

where λ is a trade-off hyper-parameter, L_c^m and L_c^n are the classification losses of channel m and channel n, and L_SSA is the spatio-temporal semantic alignment loss,
the baseline method realizes the transfer of knowledge from a good modality to the other modality by constraining the semantic consistency of the two modalities, expressed as:

L_SSA = ρ_{m,n} · ||corr(F^m) − corr(F^n)||_F^2

where corr(F^m) and corr(F^n) are the correlation matrices of the feature maps of channel m and channel n respectively, F^m and F^n are the output features of channel m and channel n, and ρ_{m,n} is the focal regularization parameter of the baseline network, which controls the direction of knowledge transfer; it is defined in terms of Δl, the difference between the classification losses of channels m and n, the classification losses L_c^m and L_c^n corresponding to channel m and channel n (both cross-entropy functions), and a parameter β that scales the SSA loss, so that the whole framework focuses on realizing positive knowledge transfer.
Advantageous effects
In this framework, single-modality gesture recognition based on multi-modal training data is studied; through bidirectional knowledge transfer between the modalities, multi-modal knowledge is embedded into each single-modality network, so that the performance of each single-modality network is improved and the flexibility of the framework is greatly increased. The invention also improves the classification network, so that the whole framework can realize efficient single-modality gesture recognition.
Drawings
FIG. 1 is a schematic diagram of the I3D network architecture;
FIG. 2 is a schematic diagram of the Inception-V1 sub-module structure;
FIG. 3 is a schematic diagram of an improved single channel network architecture;
FIG. 4 is a schematic diagram of the improved Inception-V1 sub-module structure;
FIG. 5 is a block diagram of the present invention;
FIG. 6 (a) is a conventional three-dimensional convolution;
FIG. 6 (b) is a schematic diagram of the 3D-Ghost module.
Detailed Description
The invention provides a dynamic gesture recognition method built on the idea of RGB-D training with RGB-color or depth single-modality testing, i.e., it studies single-modality gesture recognition trained with multi-modal data so as to exploit the advantages of multi-modal data. The overall framework of the invention is shown in FIG. 5. It adopts a two-channel collaborative learning structure whose aim is to improve the learning process by transferring knowledge between the networks of different modalities. The framework is mainly divided into a classification network module based on the 3D-Ghost module and a spatial attention mechanism, and a positive knowledge transfer module based on spatio-temporal semantic alignment (spatiotemporal semantic alignment, SSA) loss judgment; the positive knowledge transfer module is used only in the training stage, and this modular description merely serves to describe conveniently how the training loss of the whole network is calculated. In the training stage, V_m denotes the RGB-modality gesture video data and V_n denotes the depth-modality gesture video data; the video length is N, which is set to 64 in the invention. V_m and V_n are fed to the I3D networks, based on the 3D-Ghost module and the spatial attention mechanism, of the m-modality and the n-modality respectively, and F_m and F_n are the output features of the two modality networks. Since the color and depth modalities have different advantages when describing different actions (the color modality is more expressive for some gestures while the depth modality is more discriminative for others), the m-channel and n-channel network weights are not shared. In order to let the two channels learn each other's strengths, a positive knowledge transfer module based on the SSA loss is introduced into the classification network, i.e., the SSA loss is added to the branch network of the weak modality, thereby forcing it to learn the knowledge of the dominant modality. The SSA loss aims to minimize the semantic distance between the two modalities, which the invention formalizes as the Euclidean distance between the correlation matrices of the two modalities' features. The parameter ρ_{m,n} ensures that knowledge is transferred from the dominant-modality network (the one with high classification accuracy) to the weak-modality network (the one with low classification accuracy), avoiding negative knowledge transfer. Here the definition of ρ_{m,n} also takes into account the magnitude of the two modalities' classification errors: the larger the error difference, the larger ρ_{m,n}, and the more strongly the weak-modality network is pushed to learn from the dominant-modality network.
In the test phase, each 3D network based on the 3D-Ghost module and the spatial attention mechanism operates independently. Thus, once training of the network is completed, it is equivalent to embedding multi-modal knowledge into each single-modal network, i.e., high-precision gesture recognition can be achieved using the single-modal network.
Description of the basic model
The I3D network is a popular three-dimensional convolutional neural network. The invention extracts the spatio-temporal features of gesture videos based on the I3D network and improves it; the network modules involved in the invention are described below.
Spatial attention module
The attention mechanism is similar to the attention of human vision, namely, the attention is focused on important parts in a plurality of information, key information is selected, and other unimportant information is ignored, so that the performance of the model is improved. Traditional gesture behavior recognition models (e.g., LSTM, etc.) consider all features equally important when extracting gesture behavior data features. Whereas gesture data contains not only behavioral information but also a lot of noise (such as reflected signals of people, walls or other static objects). In order to make the three-dimensional convolutional neural network pay more attention to hand and arm characteristics, the invention fuses a spatial attention mechanism into a 3D-Ghost-based I3D network, namely, key characteristics are extracted based on an attention module for input RGB and depth video sequences.
The core of the attention mechanism is a weight parameter: first, the importance of the semantic information represented by each element of the feature map is learned, and then a weight is assigned to each element according to this importance, a larger weight indicating higher importance. For a given feature map F of size C×T×W×H, an attention map M(F) of size T×W×H×1 can be obtained through the attention module; M(F) is a weight used to characterize the importance of features. Finally, M(F) is multiplied element-wise with the original feature map F, extracting the key features in the images; the neural network thus learns the important features of each image, forming attention. The weight coefficient can be expressed as formula (1):

M(F) = σ(f^{d×d×d}([F_avg; F_max]))   (1)

where σ is the Sigmoid function, f^{d×d×d} is a three-dimensional convolution layer with d=7 or d=3 in this embodiment, and F_avg and F_max are the average-pooled and max-pooled channel descriptions, respectively. The input is a feature map F of size C×T×W×H. First, average pooling and max pooling along the channel dimension are applied, giving two channel descriptions of size T×W×H×1, which are concatenated along the channel dimension. The result is then passed through a 7×7×7 or 3×3×3 convolution layer and a Sigmoid activation function to obtain the weight coefficient M(F).
Then, via formula (2), the original feature map F is multiplied element-wise with the weight coefficient M(F) to obtain the new attention-weighted feature F*:

F* = M(F) ⊗ F   (2)
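As a minimal sketch only, formulas (1) and (2) could be implemented in PyTorch as follows; the (B, C, T, W, H) tensor layout and the class and parameter names are assumptions made here for illustration, not the patented code.

```python
import torch
import torch.nn as nn

class SpatialAttention3D(nn.Module):
    """Sketch of formulas (1) and (2): channel-wise avg/max pooling, d x d x d conv, Sigmoid."""
    def __init__(self, d=7):                       # d = 7 or d = 3 in this embodiment
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size=d, padding=d // 2, bias=False)

    def forward(self, F):                          # F: (B, C, T, W, H)
        F_avg = F.mean(dim=1, keepdim=True)        # average pooling over the channel dimension
        F_max = F.max(dim=1, keepdim=True).values  # max pooling over the channel dimension
        M = torch.sigmoid(self.conv(torch.cat([F_avg, F_max], dim=1)))   # formula (1)
        return F * M                               # formula (2): element-wise weighting

# example: F_star = SpatialAttention3D(d=7)(torch.randn(2, 64, 8, 28, 28))
```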
Classification network module based on 3D-Ghost module
GhostNet is a recently proposed simple and efficient classification network in which the Ghost module is based on the following observations and analysis:
feature maps of CNNs often contain redundant information, i.e. the existence of ghosts, which may be critical to network performance and may be obtained by simpler linear calculations. However, the Ghost module is only applied to two-dimensional CNN, and the invention aims at the problem of dynamic gesture video classification, and provides a 3D-Ghost module which is integrated into the three-dimensional convolutional neural network of I3D. The 3D-Ghost module can generate more rich characteristic diagrams through cheap linear operation, and adopts a group of inherent characteristic diagramsA series of low cost linear transformations produces a large number of phantom feature maps that fully reveal the inherent feature information. This 3D-Ghost module upgrades the existing convolutional neural network as a plug and play component. As shown in FIGS. 6 (a) and (b) which are a comparison of the conventional three-dimensional convolution operation and the 3D-Ghost module, it can be seen that the 3D-Ghost module generates a part of the inherent feature map by one convolution and then performs the linear operation phi 1 、Φ 2 ……Φ k A number of phantom feature maps are generated to enhance the features and to align the number of feature maps output.
In practical application, given input data X ∈ R^{c×T×h×w}, where c is the number of input channels, T is the number of frames of the input video, and h and w are the height and width of the input data respectively, the operation of a three-dimensional convolution layer generating n feature maps can be expressed as:
Y=X*f+b (3)
where * is the three-dimensional convolution operator, b is the bias term, Y ∈ R^{T′×h′×w′×n} is the output feature map with n channels, f ∈ R^{c×k×k×k×n} is the three-dimensional convolution kernel of this layer, h′ and w′ are the height and width of the output data respectively, and k×k×k is the size of the three-dimensional convolution kernel f.
The above is the conventional three-dimensional convolution operation; compared with two-dimensional convolution, its output feature maps contain even more redundancy, and some of them may be similar to one another. We therefore consider it unnecessary to generate these redundant three-dimensional feature maps with a large number of FLOPs and kernel parameters, and use a 3D-Ghost module in place of the three-dimensional convolution operation. Specifically, m intrinsic feature maps Y′ ∈ R^{T′×h′×w′×m} are first generated by one convolution:
Y′=X*f′ (4)
where f′ ∈ R^{c×k×k×k×m} is the filter used, m ≤ n, and the bias term is omitted for simplicity. The hyper-parameters (e.g., filter size, stride, padding) are the same as in the ordinary three-dimensional convolution (equation 3), so that the temporal and spatial sizes (i.e., T′, h′ and w′) of the output feature maps stay consistent. To further obtain the required n feature maps, a series of linear operations is applied to each intrinsic feature map in Y′ according to equation (5), generating s phantom feature maps per intrinsic map:

y_{ij} = Φ_{i,j}(y′_i),  i = 1, …, m,  j = 1, …, s   (5)

where y′_i is the i-th intrinsic feature map in Y′ and Φ_{i,j} denotes the j-th linear operation (except the last) used to generate the j-th phantom feature map y_{ij}. That is, y′_i may have one or more phantom feature maps {y_{i1}, …, y_{is}}, and the last operation Φ_{i,s} is the identity map, used to preserve the intrinsic feature map. From formula (5), n = m·s feature maps Y = [y_{11}, y_{12}, …, y_{ms}] are obtained as the output data of the 3D-Ghost module, as shown in FIG. 6(b).
In general, the working mechanism of the 3D-Ghost module is to generate intrinsic feature maps by one three-dimensional convolution, then generate phantom feature maps by applying a cheap convolution to those intrinsic feature maps, keep the intrinsic feature maps from the first convolution, and concatenate the intrinsic and phantom feature maps to replace the original three-dimensional convolution operation. Adding the 3D-Ghost module to each single-modality I3D network yields richer and more flexible feature map representations, so the semantic information contained in the correlation matrices is richer; by minimizing the SSA loss the two single-modality networks are driven toward a common correlation matrix, i.e., they share this rich semantic information. Because the model is a collaborative learning framework, adding the 3D-Ghost module allows richer semantic knowledge to be transferred between the two channels and thus improves the performance of the model. The I3D network based on the 3D-Ghost module is shown in FIG. 3, where Inc. denotes the Inception-V1 sub-module based on 3D-Ghost, whose specific structure is shown in FIG. 4.
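The following is a minimal PyTorch sketch of such a 3D-Ghost module, assuming (as in the original two-dimensional GhostNet) that the cheap linear operations Φ are implemented as a depthwise 3D convolution over the intrinsic maps; the ratio n/m and the kernel sizes are illustrative choices, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class Ghost3D(nn.Module):
    """Sketch of equations (3)-(5): primary 3D conv -> intrinsic maps, cheap depthwise conv -> phantom maps."""
    def __init__(self, cin, n_out, k=1, ratio=2, cheap_k=3):
        super().__init__()
        assert n_out % ratio == 0, "this illustrative sketch assumes n is divisible by the ratio"
        m = n_out // ratio                                               # number of intrinsic feature maps
        self.primary = nn.Conv3d(cin, m, k, padding=k // 2, bias=False)             # Y' = X * f'
        self.cheap = nn.Conv3d(m, n_out - m, cheap_k, padding=cheap_k // 2,
                               groups=m, bias=False)                     # Φ: one cheap op per intrinsic map

    def forward(self, x):
        y_intrinsic = self.primary(x)                                    # m intrinsic maps
        y_phantom = self.cheap(y_intrinsic)                              # (s-1)·m phantom maps
        return torch.cat([y_intrinsic, y_phantom], dim=1)                # identity branch keeps the intrinsic maps

# example: out = Ghost3D(cin=64, n_out=192, k=1)(torch.randn(1, 64, 8, 28, 28))   # 192 output channels
```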
Positive knowledge transfer module based on SSA loss
The invention is a two-channel collaboration framework, which encourages weak classification modes of an input network to obtain recognition results of strong classification modes. When a discriminative representation is not learned by a network of one modality during training, knowledge of the network of another modality can be used to refine its representation. The repeated occurrence of this situation may allow the network to get a better representation in a collaborative manner.
Let F^m, F^n ∈ R^{W×H×T×C} be the output features of the networks of the m-th and n-th modalities respectively, where W, H, T and C denote the width, height, number of frames and number of channels of the features; these features contain rich semantic knowledge. Taking F^m as an example, each element of F^m represents the content of a certain spatio-temporal position, and the correlation between all elements of F^m is represented by its correlation matrix corr(F^m).
the baseline method realizes the migration of knowledge from a good modality to another modality by constraining the semantic consistency of the two modalities:
Here ρ_{m,n} is a focal regularization parameter used to control the direction of knowledge transfer, and corr(F^m) and corr(F^n) denote the correlation matrices of the feature maps of the m-th and n-th channels respectively. The focal regularization parameter forces the better-performing network to deliver knowledge to the worse-performing network; its baseline form is given in equation (7).
In it, Δl denotes the difference between the classification losses of the m and n channels, where L_c^m and L_c^n are the classification losses of the networks of the m-th and n-th modalities respectively. A positive Δl indicates that network n performs better than network m, so when training network m we want ρ_{m,n} to take a large value, forcing the representation of network m to imitate that of network n. However, the baseline method assumes by default that the feature map representation of one modality network (depth) is stronger than that of the other (color), and the focal regularization parameter it uses only lets the well-performing depth network pass knowledge to the poorly-performing color network; this is purely a unidirectional process of transferring knowledge from the depth modality to the color modality. We found, however, that across different datasets and different training rounds it cannot be determined in advance that the feature representation of one modality is always better than that of the other. For this problem, we modify the focal regularization parameter ρ_{m,n} of the baseline so that the two channels can learn from each other adaptively and bidirectionally, and multiply Δl in front of the exponential term to accelerate the convergence of the network, yielding the improved parameter in equation (8).
During training, the improved focal regularization parameter can judge from the value of Δl which modality network currently has the better feature map representation, so as to realize correct and effective transfer of knowledge. This bidirectional learning between the channels greatly enhances the flexibility of the overall framework.
Loss function
Considering that only one modality of data is available at test time, the two channels use their own loss functions during training of the proposed model; in order to share the knowledge of the dominant modality, the spatio-temporal semantic alignment loss is added to the loss function of the weak modality. Thus,
when Δl > 0, the loss functions of channel m and channel n are respectively:

L^m = L_c^m + λ·L_SSA,   L^n = L_c^n

when Δl ≤ 0, the loss functions of channel m and channel n are respectively:

L^m = L_c^m,   L^n = L_c^n + λ·L_SSA

where λ = 0.05 is a trade-off hyper-parameter.
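A minimal sketch of assembling the two channel losses from the sign of Δl, as described above: the SSA term is added only to the channel that is currently weaker. The function signature and the l_ssa argument (the SSA term, e.g. from the sketch in the previous subsection) are illustrative assumptions, not the patent's code.

```python
import torch
import torch.nn.functional as nnf

def channel_losses(logits_m, logits_n, labels, l_ssa, lam=0.05):
    l_cls_m = nnf.cross_entropy(logits_m, labels)    # cross-entropy classification loss, channel m (RGB)
    l_cls_n = nnf.cross_entropy(logits_n, labels)    # cross-entropy classification loss, channel n (depth)
    delta_l = l_cls_m - l_cls_n                      # Δl > 0: channel n currently performs better
    if delta_l > 0:
        loss_m, loss_n = l_cls_m + lam * l_ssa, l_cls_n
    else:
        loss_m, loss_n = l_cls_m, l_cls_n + lam * l_ssa
    return loss_m, loss_n
```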
Experimental part
Experimental datasets: the invention conducts experiments on three dynamic gesture datasets: SKIG, VIVA and NVGesture. The SKIG dynamic gesture dataset contains a total of 1080 RGB-D gesture video sequences in 10 categories; the samples were performed by 6 subjects using 3 different hand postures (fist, open palm and index finger only) under 2 lighting conditions (strong and weak light) and against 3 backgrounds (white paper, wood texture and newspaper). The VIVA dynamic gesture dataset was captured by a Microsoft Kinect device and contains a total of 885 RGB and depth video sequences covering 19 different dynamic gestures performed by 8 subjects inside a car. The NVGesture dynamic gesture dataset, typically used for human-computer interaction, was collected with multiple sensors from multiple viewpoints; it contains 1532 RGB-D dynamic gesture sequences in 25 classes (1050 video sequences in the training set and 482 in the test set), recorded by 20 subjects in a car simulator with artificial lighting.
Evaluation index: the method is evaluated by using Top1 accuracy according to an evaluation standard protocol in the field of dynamic gesture recognition.
Experiment settings: the I3D network is adopted as the backbone network of both channels, and the output of the last Inception-V1 sub-module, 'Mixed 5c', is selected as the feature map to which the SSA loss is applied. In all experiments, λ is set to 0.05 and β = 2; the Adam optimizer is used to optimize the objective function with the learning rate set to 0.0001. During the training phase, the batch size is set to 2, i.e., two 64-frame RGB-D video sequences are fed into the network at a time. The whole model is implemented in PyTorch 1.7. During training, each modality's classification network is first pre-trained alone for 60 epochs, and then the SSA loss is added and training continues for another 15 epochs.
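For illustration, these optimization settings could be set up as follows; model_m and model_n are placeholders for the two single-modality channel networks, and the use of a single optimizer over both channels is an assumption, not something stated in the patent.

```python
import torch

def make_optimizer(model_m, model_n, lr=1e-4):
    # Adam with learning rate 0.0001 as stated above; optimizing both channels jointly is an assumption
    params = list(model_m.parameters()) + list(model_n.parameters())
    return torch.optim.Adam(params, lr=lr)

BATCH_SIZE = 2         # two 64-frame RGB-D video sequences per step
N_FRAMES = 64
LAMBDA, BETA = 0.05, 2
PRETRAIN_EPOCHS = 60   # per-modality classification pre-training
SSA_EPOCHS = 15        # continued joint training with the SSA loss added
```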
The performance of the model is compared with the most advanced dynamic gesture recognition methods. Tables 1, 2 and 3 show the results of the comparison on the SKIG, VIVA, NVGesture dynamic gesture dataset, respectively. The SKIG data set and the VIVA data set are not divided into a training set and a testing set, so that according to a common evaluation protocol of the two data sets, three-fold cross validation is adopted for the SKIG data set, and eight-fold cross validation is adopted for the VIVA data set.
Ablation experiment: ablation experiments are performed on the SKIG dataset to test the contributions of the different components of the network, as shown in Table 4. We use the base model as the baseline, i.e., without the spatial attention module and the 3D-Ghost module and with the original focal regularization parameter (equation 7); the spatial attention module (Spatial Attention Module) is denoted SAM, and the improved focal regularization parameter (equation 8) is denoted ρ_{m,n}.
Table 1 Accuracy comparison of different dynamic gesture recognition methods on the SKIG dataset
Table 2 Accuracy comparison of different dynamic gesture recognition methods on the VIVA dataset
Table 3 Accuracy comparison of different dynamic gesture recognition methods on the NVGesture dataset
Table 4 tests the effect of the various modules on dynamic gesture recognition performance, where the base model is denoted Baseline, the spatial attention module (Spatial Attention Module) is denoted SAM, and the improved focal regularization parameter is denoted ρ_{m,n}.

Claims (2)

1. A dynamic gesture recognition method for multi-modal training and single-modal testing based on a 3D-Ghost module, characterized by comprising the following steps: the method trains the whole network using RGB data and depth data; the whole network adopts a parallel two-channel collaborative learning structure in which the learning process is improved by transferring knowledge between the networks of different modalities; the two channel networks have the same structure but do not share parameters; channel m recognizes dynamic gestures from the RGB data and channel n recognizes dynamic gestures from the depth data; after training is completed, dynamic gesture recognition is carried out by feeding RGB data into channel m or by feeding depth data into channel n;
wherein each channel adopts an I3D network and improves it; the I3D network sequentially comprises a first 3D convolution layer, a first 3D max-pooling layer, a second 3D convolution layer, a third 3D convolution layer, a second 3D max-pooling layer, a first Inception-V1 sub-module, a second Inception-V1 sub-module, a third 3D max-pooling layer, third to seventh Inception-V1 sub-modules, a fourth 3D max-pooling layer, eighth and ninth Inception-V1 sub-modules, an average pooling layer and a fourth 3D convolution layer,
the improvements are as follows: attention modules are added behind the first 3D convolution layer and the third 3D convolution layer respectively, the second 3D convolution layer is replaced by a 3D-Ghost module, and all the Inception-V1 sub-modules are improved;
the working process of the designed 3D-Ghost module is as follows:
first, m intrinsic feature maps Y′ ∈ R^{T′×h′×w′×m} are generated by one three-dimensional convolution:
Y′=X*f′
where f′ ∈ R^{c×k×k×k×m} is the filter used and m ≤ n; secondly, in order to further obtain the required n feature maps, a series of linear operations is applied to each intrinsic feature map in Y′ to generate s phantom feature maps, as follows:

y_{ij} = Φ_{i,j}(y′_i),  i = 1, …, m,  j = 1, …, s

where y′_i is the i-th intrinsic feature map in Y′ and Φ_{i,j} denotes the j-th linear operation applied to y′_i to generate the j-th phantom feature map y_{ij}; in other words, y′_i has one or more phantom feature maps {y_{i1}, …, y_{is}}, and the last operation Φ_{i,s} is the identity map used to preserve the intrinsic feature map; finally, n = m·s feature maps Y = [y_{11}, y_{12}, …, y_{1s}, …, y_{m1}, y_{m2}, …, y_{ms}] are obtained as the output data of the 3D-Ghost module;
the working process of the attention module is as follows:
the feature map F of size C×T×W×H output by the three-dimensional convolution layer preceding the attention module is average-pooled and max-pooled along the channel dimension, yielding two channel descriptions of size T×W×H×1; the two descriptions are concatenated along the channel dimension and then passed through a three-dimensional convolution layer with the same kernel size as the layer preceding the attention module, followed by a Sigmoid activation function, to obtain the weight coefficient M(F):

M(F) = σ(f^{d×d×d}([F_avg; F_max]))

where σ is the Sigmoid function, f^{d×d×d} denotes the three-dimensional convolution with the same kernel size as the layer before the attention module, and F_avg and F_max are the average-pooled and max-pooled channel descriptions, respectively;
finally, the original feature map F is multiplied element-wise with the weight coefficient M(F) to obtain the new attention-weighted feature F*:

F* = M(F) ⊗ F
The improvement to the Inception-V1 sub-module is specifically as follows:
the first three-dimensional convolution layer in each of the second and third paths is replaced by a 3D-Ghost module; an attention module is added to each of the second and third paths, specifically located after the latter three-dimensional convolution layer of those paths; and the three-dimensional convolution layer in the fourth path is replaced by a 3D-Ghost module.
2. The method for dynamic gesture recognition based on the 3D-Ghost module for multi-modal training single-modal testing according to claim 1, wherein the method comprises the following steps:
only one modality of data is available during testing, so during training the two channels use their own loss functions; in order to share the knowledge of the dominant modality, a spatio-temporal semantic alignment (SSA) loss is added to the loss function of the weak modality and the two channels are trained jointly,
when Δl > 0, the loss functions of channel m and channel n are respectively:

L^m = L_c^m + λ·L_SSA,   L^n = L_c^n

when Δl ≤ 0, the loss functions of channel m and channel n are respectively:

L^m = L_c^m,   L^n = L_c^n + λ·L_SSA

where λ is a trade-off hyper-parameter,
the baseline method realizes the transfer of knowledge from a good modality to the other modality by constraining the semantic consistency of the two modalities, expressed as:

L_SSA = ρ_{m,n} · ||corr(F^m) − corr(F^n)||_F^2

where corr(F^m) and corr(F^n) are the correlation matrices of the feature maps of the m-th and n-th channels respectively, F^m and F^n are the output features of channel m and channel n, and ρ_{m,n} is the focal regularization parameter of the baseline network, which controls the direction of knowledge transfer; it is defined in terms of Δl, the difference between the classification losses of channels m and n, the classification losses L_c^m and L_c^n corresponding to channel m and channel n (both cross-entropy functions), and a parameter β that scales the SSA loss, so that the whole framework focuses on realizing positive knowledge transfer.
CN202110544122.7A 2021-05-19 2021-05-19 Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module Active CN113239824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110544122.7A CN113239824B (en) 2021-05-19 2021-05-19 Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110544122.7A CN113239824B (en) 2021-05-19 2021-05-19 Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module

Publications (2)

Publication Number Publication Date
CN113239824A CN113239824A (en) 2021-08-10
CN113239824B true CN113239824B (en) 2024-04-05

Family

ID=77137469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110544122.7A Active CN113239824B (en) 2021-05-19 2021-05-19 Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module

Country Status (1)

Country Link
CN (1) CN113239824B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113852858A (en) * 2021-08-19 2021-12-28 阿里巴巴(中国)有限公司 Video processing method and electronic equipment


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3467707B1 (en) * 2017-10-07 2024-03-13 Tata Consultancy Services Limited System and method for deep learning based hand gesture recognition in first person view

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105205475A (en) * 2015-10-20 2015-12-30 北京工业大学 Dynamic gesture recognition method
CN106991372A (en) * 2017-03-02 2017-07-28 北京工业大学 A kind of dynamic gesture identification method based on interacting depth learning model
CN109214250A (en) * 2017-07-05 2019-01-15 中南大学 A kind of static gesture identification method based on multiple dimensioned convolutional neural networks
CN111104929A (en) * 2019-12-31 2020-05-05 广州视声智能科技有限公司 Multi-modal dynamic gesture recognition method based on 3D (three-dimensional) volume sum and SPP (shortest Path P)
CN111814626A (en) * 2020-06-29 2020-10-23 中南民族大学 Dynamic gesture recognition method and system based on self-attention mechanism
CN112507898A (en) * 2020-12-14 2021-03-16 重庆邮电大学 Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Biometric recognition based on two-level feature fusion; Kong Jun; Journal of Beihua University (Natural Science Edition); 2020-01-10 (No. 01); full text *
Dynamic gesture recognition method based on deep learning; Wang Jian; Zhu Encheng; Huang Siniu; Ren Hua; Computer Simulation; 2018-02-15 (No. 02); full text *
Multi-modal sign language recognition fusing attention mechanism and connectionist temporal classification; Wang Jun; Lu Shu; Li Yunwei; Journal of Signal Processing; 2020-09-25 (No. 09); full text *

Also Published As

Publication number Publication date
CN113239824A (en) 2021-08-10

Similar Documents

Publication Publication Date Title
Ye et al. Deep joint depth estimation and color correction from monocular underwater images based on unsupervised adaptation networks
Xiaohua et al. Two-level attention with two-stage multi-task learning for facial emotion recognition
Vazquez et al. Virtual and real world adaptation for pedestrian detection
CN110414432A (en) Training method, object identifying method and the corresponding device of Object identifying model
CN108363973B (en) Unconstrained 3D expression migration method
CN109559332B (en) Sight tracking method combining bidirectional LSTM and Itracker
CN113255457A (en) Animation character facial expression generation method and system based on facial expression recognition
CN110135277B (en) Human behavior recognition method based on convolutional neural network
WO2023102224A1 (en) Data augmentation for multi-task learning for depth mapping and semantic segmentation
WO2023165361A1 (en) Data processing method and related device
CN113408343A (en) Classroom action recognition method based on double-scale space-time block mutual attention
CN113239824B (en) Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module
CN112507920A (en) Examination abnormal behavior identification method based on time displacement and attention mechanism
CN116092185A (en) Depth video behavior recognition method and system based on multi-view feature interaction fusion
CN108492275B (en) No-reference stereo image quality evaluation method based on deep neural network
CN114782596A (en) Voice-driven human face animation generation method, device, equipment and storage medium
CN113850182A (en) Action identification method based on DAMR-3 DNet
CN117115917A (en) Teacher behavior recognition method, device and medium based on multi-modal feature fusion
Chen et al. An improved dense-to-sparse cross-modal fusion network for 3D object detection in RGB-D images
CN106210710A (en) A kind of stereo image vision comfort level evaluation methodology based on multi-scale dictionary
CN115984949B (en) Low-quality face image recognition method and equipment with attention mechanism
CN113128517A (en) Tone mapping image mixed visual feature extraction model establishment and quality evaluation method
CN116433904A (en) Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution
Guo et al. Multi-level Fusion Based Deep Convolutional Network for Image Quality Assessment
CN112099330B (en) Holographic human body reconstruction method based on external camera and wearable display control equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant