CN113239824B - Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module - Google Patents

Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module

Info

Publication number
CN113239824B
CN113239824B CN202110544122.7A CN202110544122A
Authority
CN
China
Prior art keywords
channel
module
network
mode
convolution layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110544122.7A
Other languages
Chinese (zh)
Other versions
CN113239824A (en)
Inventor
李敬华
刘润泽
孔德慧
王少帆
王立春
尹宝才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110544122.7A priority Critical patent/CN113239824B/en
Publication of CN113239824A publication Critical patent/CN113239824A/en
Application granted granted Critical
Publication of CN113239824B publication Critical patent/CN113239824B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention relates to a dynamic gesture recognition method for multi-modal training and single-modal testing based on a 3D-Ghost module, which solves the problem of recognizing dynamic gestures when multiple modalities are available for training but only one for testing. Specifically, RGB data and depth data are used to train the whole network, which adopts a parallel two-channel collaborative learning structure in which the learning process is improved by transferring knowledge between the networks of different modalities: channel m recognizes dynamic gestures from the RGB data and channel n recognizes dynamic gestures from the depth data. After training is completed, dynamic gesture recognition is carried out by feeding RGB data into channel m or by feeding depth data into channel n. Each channel adopts an I3D network and improves it: an attention module is added, part of the 3D convolution layers are replaced by 3D-Ghost modules, and all the Inception-V1 sub-modules are improved.

Description

Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module
Technical Field
The invention belongs to the technical field of computer vision and man-machine interaction, and particularly relates to a dynamic gesture recognition method based on a multi-mode training single-mode test.
Background
Gestures are one of the ways of human-machine interaction and can provide a natural and intuitive mode of interaction. Gesture recognition aims at understanding human hand actions and plays an extremely important role in fields such as human-computer interaction, virtual reality and augmented reality. Efficient video-based gesture recognition is extremely difficult because many factors affect it, such as changes in lighting conditions, complex background environments, and gestures in real environments that deviate from standard gestures. In recent years, consumer-level depth sensors have become popular; depth images are not easily affected by illumination and make background segmentation easier, and, as an effective supplement to RGB images, they can enhance the expressive power of the data to a certain extent. Multimodal gesture recognition, however, typically requires a large amount of annotated, time-aligned RGB-D data that is difficult to obtain. The invention assumes that both color and depth video sequences are available during training, while only one modality of data is provided during testing. Therefore, the invention provides a dynamic gesture recognition method, based on a deep learning network, for multi-modal training and single-modal testing.
Disclosure of Invention
In order to solve the above problems, the invention provides an improved dynamic gesture recognition method based on multi-modal training and single-modal testing. The invention builds on a single-modality feature representation that uses spatial attention and the 3D-Ghost module, together with an improved inter-modality knowledge transfer method, and ultimately improves single-modality gesture recognition accuracy. The specific technical scheme is as follows:
the method trains the whole network using RGB data and depth data; the whole network adopts a parallel two-channel collaborative learning structure in which the learning process is improved by transferring knowledge between the networks of different modalities; the two channel networks have the same structure but do not share parameters; channel m recognizes dynamic gestures from the RGB data and channel n recognizes dynamic gestures from the depth data; after training is completed, dynamic gesture recognition is carried out by feeding RGB data into channel m or by feeding depth data into channel n;
wherein each channel adopts an I3D network and improves it; the I3D network sequentially comprises a first 3D convolution layer, a first 3D max-pooling layer, a second 3D convolution layer, a third 3D convolution layer, a second 3D max-pooling layer, a first Inception-V1 sub-module, a second Inception-V1 sub-module, a third 3D max-pooling layer, third to seventh Inception-V1 sub-modules, a fourth 3D max-pooling layer, eighth and ninth Inception-V1 sub-modules, an average pooling layer and a fourth 3D convolution layer,
the improvements are as follows: attention modules are added behind the first 3D convolution layer and the third 3D convolution layer respectively, the second 3D convolution layer is replaced by a 3D-Ghost module, and all the Inception-V1 sub-modules are improved;
the working process of the attention module is as follows:
the feature map F of size C×T×W×H output by the three-dimensional convolution layer preceding the attention module is average-pooled and max-pooled along the channel dimension, yielding two channel descriptions of size T×W×H×1; the two descriptions are concatenated along the channel dimension and then passed through a three-dimensional convolution layer with the same kernel size as the layer preceding the attention module, followed by a Sigmoid activation function, to obtain the weight coefficient M(F):

M(F) = σ(f^{d×d×d}([F_avg; F_max]))

where σ is the Sigmoid function, f^{d×d×d} denotes the three-dimensional convolution with the same kernel size as the layer before the attention module, and F_avg and F_max are the average-pooled and max-pooled channel descriptions, respectively;
finally, the original feature map F is multiplied element-wise with the weight coefficient M(F) to obtain the new attention-weighted feature F*:

F* = M(F) ⊗ F
The improvement to the Inception-V1 sub-module is specifically as follows:
the first three-dimensional convolution layer in each of the second and third paths is replaced by a 3D-Ghost module; an attention module is added to each of the second and third paths, specifically located after the latter three-dimensional convolution layer of those paths; and the three-dimensional convolution layer in the fourth path is replaced by a 3D-Ghost module, as sketched below.
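As an illustration only, the following PyTorch sketch shows how such an improved four-path Inception-V1 sub-module could be assembled; it is not the patented implementation. The channel arguments and the ghost_block and attn_block constructors (stand-ins for the 3D-Ghost module and the attention module, whose sketches appear later in the description) are assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn

class GhostInception3D(nn.Module):
    """Hypothetical four-path Inception-V1 sub-module with the modifications described above."""
    def __init__(self, cin, c1, c2_in, c2_out, c3_in, c3_out, c4, ghost_block, attn_block):
        super().__init__()
        self.b1 = nn.Conv3d(cin, c1, 1)                                # path 1: plain 1x1x1 conv
        self.b2 = nn.Sequential(ghost_block(cin, c2_in),               # path 2: first conv -> 3D-Ghost
                                nn.Conv3d(c2_in, c2_out, 3, padding=1),
                                attn_block())                          # attention after the latter conv
        self.b3 = nn.Sequential(ghost_block(cin, c3_in),               # path 3: same layout as path 2
                                nn.Conv3d(c3_in, c3_out, 3, padding=1),
                                attn_block())
        self.b4 = nn.Sequential(nn.MaxPool3d(3, stride=1, padding=1),  # path 4: pool, then conv -> 3D-Ghost
                                ghost_block(cin, c4))

    def forward(self, x):
        # concatenate the four paths along the channel dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
```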
Only one modality of data is available during testing, so during training the two channels use their own loss functions; in order to share the knowledge of the dominant modality, a spatio-temporal semantic alignment (SSA) loss is added to the loss function of the weak modality and the two channels are trained jointly,
when Δl > 0, the loss functions of channel m and channel n are respectively:

L^m = L_c^m + λ·L_SSA,   L^n = L_c^n

when Δl ≤ 0, the loss functions of channel m and channel n are respectively:

L^m = L_c^m,   L^n = L_c^n + λ·L_SSA

where λ is a trade-off hyper-parameter, L_c^m and L_c^n are the classification losses of channel m and channel n, and L_SSA is the spatio-temporal semantic alignment loss,
the baseline method realizes the transfer of knowledge from a good modality to the other modality by constraining the semantic consistency of the two modalities, expressed as:

L_SSA = ρ_{m,n} · ||corr(F^m) − corr(F^n)||_F^2

where corr(F^m) and corr(F^n) are the correlation matrices of the feature maps of channel m and channel n respectively, F^m and F^n are the output features of channel m and channel n, and ρ_{m,n} is the focal regularization parameter of the baseline network, which controls the direction of knowledge transfer; it is defined in terms of Δl, the difference between the classification losses of channels m and n, the classification losses L_c^m and L_c^n corresponding to channel m and channel n (both cross-entropy functions), and a parameter β that scales the SSA loss, so that the whole framework focuses on realizing positive knowledge transfer.
Advantageous effects
In this framework, single-modality gesture recognition based on multi-modal training data is studied; through bidirectional knowledge transfer between the modalities, multi-modal knowledge is embedded into each single-modality network, so that the performance of each single-modality network is improved and the flexibility of the framework is greatly increased. The invention also improves the classification network, so that the whole framework can realize efficient single-modality gesture recognition.
Drawings
FIG. 1 is a schematic diagram of the I3D network architecture;
FIG. 2 is a schematic diagram of the Inception-V1 sub-module structure;
FIG. 3 is a schematic diagram of an improved single channel network architecture;
FIG. 4 is a schematic diagram of the improved Inception-V1 sub-module structure;
FIG. 5 is a block diagram of the present invention;
FIG. 6 (a) is a conventional three-dimensional convolution;
FIG. 6 (b) is a schematic diagram of the 3D-Ghost module.
Detailed Description
The invention provides a dynamic gesture recognition method built on the idea of RGB-D training with RGB-color or depth single-modality testing, i.e., it studies single-modality gesture recognition trained with multi-modal data so as to exploit the advantages of multi-modal data. The overall framework of the invention is shown in FIG. 5. It adopts a two-channel collaborative learning structure whose aim is to improve the learning process by transferring knowledge between the networks of different modalities. The framework is mainly divided into a classification network module based on the 3D-Ghost module and a spatial attention mechanism, and a positive knowledge transfer module based on spatio-temporal semantic alignment (spatiotemporal semantic alignment, SSA) loss judgment; the positive knowledge transfer module is used only in the training stage, and this modular description merely serves to describe conveniently how the training loss of the whole network is calculated. In the training stage, V_m denotes the RGB-modality gesture video data and V_n denotes the depth-modality gesture video data; the video length is N, which is set to 64 in the invention. V_m and V_n are fed to the I3D networks, based on the 3D-Ghost module and the spatial attention mechanism, of the m-modality and the n-modality respectively, and F_m and F_n are the output features of the two modality networks. Since the color and depth modalities have different advantages when describing different actions (the color modality is more expressive for some gestures while the depth modality is more discriminative for others), the m-channel and n-channel network weights are not shared. In order to let the two channels learn each other's strengths, a positive knowledge transfer module based on the SSA loss is introduced into the classification network, i.e., the SSA loss is added to the branch network of the weak modality, thereby forcing it to learn the knowledge of the dominant modality. The SSA loss aims to minimize the semantic distance between the two modalities, which the invention formalizes as the Euclidean distance between the correlation matrices of the two modalities' features. The parameter ρ_{m,n} ensures that knowledge is transferred from the dominant-modality network (the one with high classification accuracy) to the weak-modality network (the one with low classification accuracy), avoiding negative knowledge transfer. Here the definition of ρ_{m,n} also takes into account the magnitude of the two modalities' classification errors: the larger the error difference, the larger ρ_{m,n}, and the more strongly the weak-modality network is pushed to learn from the dominant-modality network.
In the test phase, each 3D network based on the 3D-Ghost module and the spatial attention mechanism operates independently. Thus, once training of the network is completed, it is equivalent to embedding multi-modal knowledge into each single-modal network, i.e., high-precision gesture recognition can be achieved using the single-modal network.
Description of the basic model
The I3D network is a popular three-dimensional convolutional neural network. The invention extracts the spatio-temporal features of gesture videos based on the I3D network and improves it; the network modules involved in the invention are described below.
Spatial attention module
The attention mechanism is similar to the attention of human vision, namely, the attention is focused on important parts in a plurality of information, key information is selected, and other unimportant information is ignored, so that the performance of the model is improved. Traditional gesture behavior recognition models (e.g., LSTM, etc.) consider all features equally important when extracting gesture behavior data features. Whereas gesture data contains not only behavioral information but also a lot of noise (such as reflected signals of people, walls or other static objects). In order to make the three-dimensional convolutional neural network pay more attention to hand and arm characteristics, the invention fuses a spatial attention mechanism into a 3D-Ghost-based I3D network, namely, key characteristics are extracted based on an attention module for input RGB and depth video sequences.
The core of the attention mechanism is a weight parameter: first, the importance of the semantic information represented by each element of the feature map is learned, and then a weight is assigned to each element according to this importance, a larger weight indicating higher importance. For a given feature map F of size C×T×W×H, an attention map M(F) of size T×W×H×1 can be obtained through the attention module; M(F) is a weight used to characterize the importance of features. Finally, M(F) is multiplied element-wise with the original feature map F, extracting the key features in the images; the neural network thus learns the important features of each image, forming attention. The weight coefficient can be expressed as formula (1):

M(F) = σ(f^{d×d×d}([F_avg; F_max]))   (1)

where σ is the Sigmoid function, f^{d×d×d} is a three-dimensional convolution layer with d=7 or d=3 in this embodiment, and F_avg and F_max are the average-pooled and max-pooled channel descriptions, respectively. The input is a feature map F of size C×T×W×H. First, average pooling and max pooling along the channel dimension are applied, giving two channel descriptions of size T×W×H×1, which are concatenated along the channel dimension. The result is then passed through a 7×7×7 or 3×3×3 convolution layer and a Sigmoid activation function to obtain the weight coefficient M(F).
Then, via formula (2), the original feature map F is multiplied element-wise with the weight coefficient M(F) to obtain the new attention-weighted feature F*:

F* = M(F) ⊗ F   (2)
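As a minimal sketch only, formulas (1) and (2) could be implemented in PyTorch as follows; the (B, C, T, W, H) tensor layout and the class and parameter names are assumptions made here for illustration, not the patented code.

```python
import torch
import torch.nn as nn

class SpatialAttention3D(nn.Module):
    """Sketch of formulas (1) and (2): channel-wise avg/max pooling, d x d x d conv, Sigmoid."""
    def __init__(self, d=7):                       # d = 7 or d = 3 in this embodiment
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size=d, padding=d // 2, bias=False)

    def forward(self, F):                          # F: (B, C, T, W, H)
        F_avg = F.mean(dim=1, keepdim=True)        # average pooling over the channel dimension
        F_max = F.max(dim=1, keepdim=True).values  # max pooling over the channel dimension
        M = torch.sigmoid(self.conv(torch.cat([F_avg, F_max], dim=1)))   # formula (1)
        return F * M                               # formula (2): element-wise weighting

# example: F_star = SpatialAttention3D(d=7)(torch.randn(2, 64, 8, 28, 28))
```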
Classification network module based on 3D-Ghost module
GhostNet is a recently proposed simple and efficient classification network in which the Ghost module is based on the following observations and analysis:
feature maps of CNNs often contain redundant information, i.e. the existence of ghosts, which may be critical to network performance and may be obtained by simpler linear calculations. However, the Ghost module is only applied to two-dimensional CNN, and the invention aims at the problem of dynamic gesture video classification, and provides a 3D-Ghost module which is integrated into the three-dimensional convolutional neural network of I3D. The 3D-Ghost module can generate more rich characteristic diagrams through cheap linear operation, and adopts a group of inherent characteristic diagramsA series of low cost linear transformations produces a large number of phantom feature maps that fully reveal the inherent feature information. This 3D-Ghost module upgrades the existing convolutional neural network as a plug and play component. As shown in FIGS. 6 (a) and (b) which are a comparison of the conventional three-dimensional convolution operation and the 3D-Ghost module, it can be seen that the 3D-Ghost module generates a part of the inherent feature map by one convolution and then performs the linear operation phi 1 、Φ 2 ……Φ k A number of phantom feature maps are generated to enhance the features and to align the number of feature maps output.
In practical application, given input data X ∈ R^{c×T×h×w}, where c is the number of input channels, T is the number of frames of the input video, and h and w are the height and width of the input data respectively, the operation of a three-dimensional convolution layer generating n feature maps can be expressed as:
Y=X*f+b (3)
where * is the three-dimensional convolution operator, b is the bias term, Y ∈ R^{T′×h′×w′×n} is the output feature map with n channels, f ∈ R^{c×k×k×k×n} is the three-dimensional convolution kernel of this layer, h′ and w′ are the height and width of the output data respectively, and k×k×k is the size of the three-dimensional convolution kernel f.
The above is the conventional three-dimensional convolution operation; compared with two-dimensional convolution, its output feature maps contain even more redundancy, and some of them may be similar to one another. We therefore consider it unnecessary to generate these redundant three-dimensional feature maps with a large number of FLOPs and kernel parameters, and use a 3D-Ghost module in place of the three-dimensional convolution operation. Specifically, m intrinsic feature maps Y′ ∈ R^{T′×h′×w′×m} are first generated by one convolution:
Y′=X*f′ (4)
where f′ ∈ R^{c×k×k×k×m} is the filter used, m ≤ n, and the bias term is omitted for simplicity. The hyper-parameters (e.g., filter size, stride, padding) are the same as in the ordinary three-dimensional convolution (equation 3), so that the temporal and spatial sizes (i.e., T′, h′ and w′) of the output feature maps stay consistent. To further obtain the required n feature maps, a series of linear operations is applied to each intrinsic feature map in Y′ according to equation (5), generating s phantom feature maps per intrinsic map:

y_{ij} = Φ_{i,j}(y′_i),  i = 1, …, m,  j = 1, …, s   (5)

where y′_i is the i-th intrinsic feature map in Y′ and Φ_{i,j} denotes the j-th linear operation (except the last) used to generate the j-th phantom feature map y_{ij}. That is, y′_i may have one or more phantom feature maps {y_{i1}, …, y_{is}}, and the last operation Φ_{i,s} is the identity map, used to preserve the intrinsic feature map. From formula (5), n = m·s feature maps Y = [y_{11}, y_{12}, …, y_{ms}] are obtained as the output data of the 3D-Ghost module, as shown in FIG. 6(b).
In general, the working mechanism of the 3D-Ghost module is to generate intrinsic feature maps by one three-dimensional convolution, then generate phantom feature maps by applying a cheap convolution to those intrinsic feature maps, keep the intrinsic feature maps from the first convolution, and concatenate the intrinsic and phantom feature maps to replace the original three-dimensional convolution operation. Adding the 3D-Ghost module to each single-modality I3D network yields richer and more flexible feature map representations, so the semantic information contained in the correlation matrices is richer; by minimizing the SSA loss the two single-modality networks are driven toward a common correlation matrix, i.e., they share this rich semantic information. Because the model is a collaborative learning framework, adding the 3D-Ghost module allows richer semantic knowledge to be transferred between the two channels and thus improves the performance of the model. The I3D network based on the 3D-Ghost module is shown in FIG. 3, where Inc. denotes the Inception-V1 sub-module based on 3D-Ghost, whose specific structure is shown in FIG. 4.
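The following is a minimal PyTorch sketch of such a 3D-Ghost module, assuming (as in the original two-dimensional GhostNet) that the cheap linear operations Φ are implemented as a depthwise 3D convolution over the intrinsic maps; the ratio n/m and the kernel sizes are illustrative choices, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class Ghost3D(nn.Module):
    """Sketch of equations (3)-(5): primary 3D conv -> intrinsic maps, cheap depthwise conv -> phantom maps."""
    def __init__(self, cin, n_out, k=1, ratio=2, cheap_k=3):
        super().__init__()
        assert n_out % ratio == 0, "this illustrative sketch assumes n is divisible by the ratio"
        m = n_out // ratio                                               # number of intrinsic feature maps
        self.primary = nn.Conv3d(cin, m, k, padding=k // 2, bias=False)             # Y' = X * f'
        self.cheap = nn.Conv3d(m, n_out - m, cheap_k, padding=cheap_k // 2,
                               groups=m, bias=False)                     # Φ: one cheap op per intrinsic map

    def forward(self, x):
        y_intrinsic = self.primary(x)                                    # m intrinsic maps
        y_phantom = self.cheap(y_intrinsic)                              # (s-1)·m phantom maps
        return torch.cat([y_intrinsic, y_phantom], dim=1)                # identity branch keeps the intrinsic maps

# example: out = Ghost3D(cin=64, n_out=192, k=1)(torch.randn(1, 64, 8, 28, 28))   # 192 output channels
```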
Positive knowledge transfer module based on SSA loss
The invention is a two-channel collaboration framework, which encourages weak classification modes of an input network to obtain recognition results of strong classification modes. When a discriminative representation is not learned by a network of one modality during training, knowledge of the network of another modality can be used to refine its representation. The repeated occurrence of this situation may allow the network to get a better representation in a collaborative manner.
Let F^m, F^n ∈ R^{W×H×T×C} be the output features of the networks of the m-th and n-th modalities respectively, where W, H, T and C denote the width, height, number of frames and number of channels of the features; these features contain rich semantic knowledge. Taking F^m as an example, each element of F^m represents the content of a certain spatio-temporal position, and the correlation between all elements of F^m is represented by its correlation matrix corr(F^m).
the baseline method realizes the migration of knowledge from a good modality to another modality by constraining the semantic consistency of the two modalities:
Here ρ_{m,n} is a focal regularization parameter used to control the direction of knowledge transfer, and corr(F^m) and corr(F^n) denote the correlation matrices of the feature maps of the m-th and n-th channels respectively. The focal regularization parameter forces the better-performing network to deliver knowledge to the worse-performing network; its baseline form is given in equation (7).
In it, Δl denotes the difference between the classification losses of the m and n channels, where L_c^m and L_c^n are the classification losses of the networks of the m-th and n-th modalities respectively. A positive Δl indicates that network n performs better than network m, so when training network m we want ρ_{m,n} to take a large value, forcing the representation of network m to imitate that of network n. However, the baseline method assumes by default that the feature map representation of one modality network (depth) is stronger than that of the other (color), and the focal regularization parameter it uses only lets the well-performing depth network pass knowledge to the poorly-performing color network; this is purely a unidirectional process of transferring knowledge from the depth modality to the color modality. We found, however, that across different datasets and different training rounds it cannot be determined in advance that the feature representation of one modality is always better than that of the other. For this problem, we modify the focal regularization parameter ρ_{m,n} of the baseline so that the two channels can learn from each other adaptively and bidirectionally, and multiply Δl in front of the exponential term to accelerate the convergence of the network, yielding the improved parameter in equation (8).
During training, the improved focal regularization parameter can judge from the value of Δl which modality network currently has the better feature map representation, so as to realize correct and effective transfer of knowledge. This bidirectional learning between the channels greatly enhances the flexibility of the overall framework.
Loss function
Considering that only one modality of data is available at test time, the two channels use their own loss functions during training of the proposed model; in order to share the knowledge of the dominant modality, the spatio-temporal semantic alignment loss is added to the loss function of the weak modality. Thus,
when Δl > 0, the loss functions of channel m and channel n are respectively:

L^m = L_c^m + λ·L_SSA,   L^n = L_c^n

when Δl ≤ 0, the loss functions of channel m and channel n are respectively:

L^m = L_c^m,   L^n = L_c^n + λ·L_SSA

where λ = 0.05 is a trade-off hyper-parameter.
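A minimal sketch of assembling the two channel losses from the sign of Δl, as described above: the SSA term is added only to the channel that is currently weaker. The function signature and the l_ssa argument (the SSA term, e.g. from the sketch in the previous subsection) are illustrative assumptions, not the patent's code.

```python
import torch
import torch.nn.functional as nnf

def channel_losses(logits_m, logits_n, labels, l_ssa, lam=0.05):
    l_cls_m = nnf.cross_entropy(logits_m, labels)    # cross-entropy classification loss, channel m (RGB)
    l_cls_n = nnf.cross_entropy(logits_n, labels)    # cross-entropy classification loss, channel n (depth)
    delta_l = l_cls_m - l_cls_n                      # Δl > 0: channel n currently performs better
    if delta_l > 0:
        loss_m, loss_n = l_cls_m + lam * l_ssa, l_cls_n
    else:
        loss_m, loss_n = l_cls_m, l_cls_n + lam * l_ssa
    return loss_m, loss_n
```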
Experimental part
Experimental datasets: the invention conducts experiments on three dynamic gesture datasets: SKIG, VIVA and NVGesture. The SKIG dynamic gesture dataset contains a total of 1080 RGB-D gesture video sequences in 10 categories; the samples were performed by 6 subjects using 3 different hand postures (fist, open palm and index finger only) under 2 lighting conditions (strong and weak light) and against 3 backgrounds (white paper, wood texture and newspaper). The VIVA dynamic gesture dataset was captured by a Microsoft Kinect device and contains a total of 885 RGB and depth video sequences covering 19 different dynamic gestures performed by 8 subjects inside a car. The NVGesture dynamic gesture dataset, typically used for human-computer interaction, was collected with multiple sensors from multiple viewpoints; it contains 1532 RGB-D dynamic gesture sequences in 25 classes (1050 video sequences in the training set and 482 in the test set), recorded by 20 subjects in a car simulator with artificial lighting.
Evaluation index: the method is evaluated by using Top1 accuracy according to an evaluation standard protocol in the field of dynamic gesture recognition.
Experiment settings: the I3D network is adopted as the backbone network of both channels, and the output of the last Inception-V1 sub-module, 'Mixed 5c', is selected as the feature map to which the SSA loss is applied. In all experiments, λ is set to 0.05 and β = 2; the Adam optimizer is used to optimize the objective function with the learning rate set to 0.0001. During the training phase, the batch size is set to 2, i.e., two 64-frame RGB-D video sequences are fed into the network at a time. The whole model is implemented in PyTorch 1.7. During training, each modality's classification network is first pre-trained alone for 60 epochs, and then the SSA loss is added and training continues for another 15 epochs.
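For illustration, these optimization settings could be set up as follows; model_m and model_n are placeholders for the two single-modality channel networks, and the use of a single optimizer over both channels is an assumption, not something stated in the patent.

```python
import torch

def make_optimizer(model_m, model_n, lr=1e-4):
    # Adam with learning rate 0.0001 as stated above; optimizing both channels jointly is an assumption
    params = list(model_m.parameters()) + list(model_n.parameters())
    return torch.optim.Adam(params, lr=lr)

BATCH_SIZE = 2         # two 64-frame RGB-D video sequences per step
N_FRAMES = 64
LAMBDA, BETA = 0.05, 2
PRETRAIN_EPOCHS = 60   # per-modality classification pre-training
SSA_EPOCHS = 15        # continued joint training with the SSA loss added
```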
The performance of the model is compared with the most advanced dynamic gesture recognition methods. Tables 1, 2 and 3 show the results of the comparison on the SKIG, VIVA, NVGesture dynamic gesture dataset, respectively. The SKIG data set and the VIVA data set are not divided into a training set and a testing set, so that according to a common evaluation protocol of the two data sets, three-fold cross validation is adopted for the SKIG data set, and eight-fold cross validation is adopted for the VIVA data set.
Ablation experiment: ablation experiments are performed on the SKIG dataset to test the contributions of the different components of the network, as shown in Table 4. We use the base model as the baseline, i.e., without the spatial attention module and the 3D-Ghost module and with the original focal regularization parameter (equation 7); the spatial attention module (Spatial Attention Module) is denoted SAM, and the improved focal regularization parameter (equation 8) is denoted ρ_{m,n}.
Table 1 Accuracy comparison of different dynamic gesture recognition methods on the SKIG dataset
Table 2 Accuracy comparison of different dynamic gesture recognition methods on the VIVA dataset
Table 3 Accuracy comparison of different dynamic gesture recognition methods on the NVGesture dataset
Table 4 tests the effect of the various modules on dynamic gesture recognition performance, where the base model is denoted Baseline, the spatial attention module (Spatial Attention Module) is denoted SAM, and the improved focal regularization parameter is denoted ρ_{m,n}.

Claims (2)

1. A dynamic gesture recognition method for multi-modal training and single-modal testing based on a 3D-Ghost module, characterized by comprising the following steps: the method trains the whole network using RGB data and depth data; the whole network adopts a parallel two-channel collaborative learning structure in which the learning process is improved by transferring knowledge between the networks of different modalities; the two channel networks have the same structure but do not share parameters; channel m recognizes dynamic gestures from the RGB data and channel n recognizes dynamic gestures from the depth data; after training is completed, dynamic gesture recognition is carried out by feeding RGB data into channel m or by feeding depth data into channel n;
wherein each channel adopts an I3D network and improves it; the I3D network sequentially comprises a first 3D convolution layer, a first 3D max-pooling layer, a second 3D convolution layer, a third 3D convolution layer, a second 3D max-pooling layer, a first Inception-V1 sub-module, a second Inception-V1 sub-module, a third 3D max-pooling layer, third to seventh Inception-V1 sub-modules, a fourth 3D max-pooling layer, eighth and ninth Inception-V1 sub-modules, an average pooling layer and a fourth 3D convolution layer,
the improvements are as follows: attention modules are added behind the first 3D convolution layer and the third 3D convolution layer respectively, the second 3D convolution layer is replaced by a 3D-Ghost module, and all the Inception-V1 sub-modules are improved;
the working process of the designed 3D-Ghost module is as follows:
first, m intrinsic feature maps Y′ ∈ R^{T′×h′×w′×m} are generated by one three-dimensional convolution:
Y′=X*f′
where f′ ∈ R^{c×k×k×k×m} is the filter used and m ≤ n; secondly, in order to further obtain the required n feature maps, a series of linear operations is applied to each intrinsic feature map in Y′ to generate s phantom feature maps, as follows:

y_{ij} = Φ_{i,j}(y′_i),  i = 1, …, m,  j = 1, …, s

where y′_i is the i-th intrinsic feature map in Y′ and Φ_{i,j} denotes the j-th linear operation applied to y′_i to generate the j-th phantom feature map y_{ij}; in other words, y′_i has one or more phantom feature maps {y_{i1}, …, y_{is}}, and the last operation Φ_{i,s} is the identity map used to preserve the intrinsic feature map; finally, n = m·s feature maps Y = [y_{11}, y_{12}, …, y_{1s}, …, y_{m1}, y_{m2}, …, y_{ms}] are obtained as the output data of the 3D-Ghost module;
the working process of the attention module is as follows:
the feature map F of size C×T×W×H output by the three-dimensional convolution layer preceding the attention module is average-pooled and max-pooled along the channel dimension, yielding two channel descriptions of size T×W×H×1; the two descriptions are concatenated along the channel dimension and then passed through a three-dimensional convolution layer with the same kernel size as the layer preceding the attention module, followed by a Sigmoid activation function, to obtain the weight coefficient M(F):

M(F) = σ(f^{d×d×d}([F_avg; F_max]))

where σ is the Sigmoid function, f^{d×d×d} denotes the three-dimensional convolution with the same kernel size as the layer before the attention module, and F_avg and F_max are the average-pooled and max-pooled channel descriptions, respectively;
finally, the original feature map F is multiplied element-wise with the weight coefficient M(F) to obtain the new attention-weighted feature F*:

F* = M(F) ⊗ F
The improvement to the Inception-V1 sub-module is specifically as follows:
the first three-dimensional convolution layer in each of the second and third paths is replaced by a 3D-Ghost module; an attention module is added to each of the second and third paths, specifically located after the latter three-dimensional convolution layer of those paths; and the three-dimensional convolution layer in the fourth path is replaced by a 3D-Ghost module.
2. The method for dynamic gesture recognition based on the 3D-Ghost module for multi-modal training single-modal testing according to claim 1, wherein the method comprises the following steps:
only one modality of data is available during testing, so during training the two channels use their own loss functions; in order to share the knowledge of the dominant modality, a spatio-temporal semantic alignment (SSA) loss is added to the loss function of the weak modality and the two channels are trained jointly,
when Δl > 0, the loss functions of channel m and channel n are respectively:

L^m = L_c^m + λ·L_SSA,   L^n = L_c^n

when Δl ≤ 0, the loss functions of channel m and channel n are respectively:

L^m = L_c^m,   L^n = L_c^n + λ·L_SSA

where λ is a trade-off hyper-parameter,
the baseline method realizes the transfer of knowledge from a good modality to the other modality by constraining the semantic consistency of the two modalities, expressed as:

L_SSA = ρ_{m,n} · ||corr(F^m) − corr(F^n)||_F^2

where corr(F^m) and corr(F^n) are the correlation matrices of the feature maps of the m-th and n-th channels respectively, F^m and F^n are the output features of channel m and channel n, and ρ_{m,n} is the focal regularization parameter of the baseline network, which controls the direction of knowledge transfer; it is defined in terms of Δl, the difference between the classification losses of channels m and n, the classification losses L_c^m and L_c^n corresponding to channel m and channel n (both cross-entropy functions), and a parameter β that scales the SSA loss, so that the whole framework focuses on realizing positive knowledge transfer.
CN202110544122.7A 2021-05-19 2021-05-19 Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module Active CN113239824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110544122.7A CN113239824B (en) 2021-05-19 2021-05-19 Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110544122.7A CN113239824B (en) 2021-05-19 2021-05-19 Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module

Publications (2)

Publication Number Publication Date
CN113239824A CN113239824A (en) 2021-08-10
CN113239824B true CN113239824B (en) 2024-04-05

Family

ID=77137469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110544122.7A Active CN113239824B (en) 2021-05-19 2021-05-19 Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module

Country Status (1)

Country Link
CN (1) CN113239824B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113852858A (en) * 2021-08-19 2021-12-28 阿里巴巴(中国)有限公司 Video processing method and electronic equipment


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3467707B1 (en) * 2017-10-07 2024-03-13 Tata Consultancy Services Limited System and method for deep learning based hand gesture recognition in first person view

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105205475A (en) * 2015-10-20 2015-12-30 北京工业大学 Dynamic gesture recognition method
CN106991372A (en) * 2017-03-02 2017-07-28 北京工业大学 A kind of dynamic gesture identification method based on interacting depth learning model
CN109214250A (en) * 2017-07-05 2019-01-15 中南大学 A kind of static gesture identification method based on multiple dimensioned convolutional neural networks
CN111104929A (en) * 2019-12-31 2020-05-05 广州视声智能科技有限公司 Multi-modal dynamic gesture recognition method based on 3D (three-dimensional) volume sum and SPP (shortest Path P)
CN111814626A (en) * 2020-06-29 2020-10-23 中南民族大学 Dynamic gesture recognition method and system based on self-attention mechanism
CN112507898A (en) * 2020-12-14 2021-03-16 重庆邮电大学 Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Biometric recognition based on two-level feature fusion; Kong Jun; Journal of Beihua University (Natural Science Edition); 2020-01-10 (No. 01); full text *
Dynamic gesture recognition method based on deep learning; Wang Jian; Zhu Encheng; Huang Siniu; Ren Hua; Computer Simulation; 2018-02-15 (No. 02); full text *
Multi-modal sign language recognition fusing attention mechanism and connectionist temporal classification; Wang Jun; Lu Shu; Li Yunwei; Journal of Signal Processing; 2020-09-25 (No. 09); full text *

Also Published As

Publication number Publication date
CN113239824A (en) 2021-08-10

Similar Documents

Publication Publication Date Title
Ye et al. Deep joint depth estimation and color correction from monocular underwater images based on unsupervised adaptation networks
Xiaohua et al. Two-level attention with two-stage multi-task learning for facial emotion recognition
Vazquez et al. Virtual and real world adaptation for pedestrian detection
CN110414432A (en) Training method, object identifying method and the corresponding device of Object identifying model
CN108363973B (en) Unconstrained 3D expression migration method
CN109559332B (en) Sight tracking method combining bidirectional LSTM and Itracker
CN113255457A (en) Animation character facial expression generation method and system based on facial expression recognition
CN110135277B (en) Human behavior recognition method based on convolutional neural network
WO2023102224A1 (en) Data augmentation for multi-task learning for depth mapping and semantic segmentation
WO2023165361A1 (en) Data processing method and related device
CN113408343A (en) Classroom action recognition method based on double-scale space-time block mutual attention
CN113239824B (en) Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module
CN112507920A (en) Examination abnormal behavior identification method based on time displacement and attention mechanism
CN116092185A (en) Depth video behavior recognition method and system based on multi-view feature interaction fusion
CN108492275B (en) No-reference stereo image quality evaluation method based on deep neural network
CN114782596A (en) Voice-driven human face animation generation method, device, equipment and storage medium
CN113850182A (en) Action identification method based on DAMR-3 DNet
CN117115917A (en) Teacher behavior recognition method, device and medium based on multi-modal feature fusion
Chen et al. An improved dense-to-sparse cross-modal fusion network for 3D object detection in RGB-D images
CN106210710A (en) A kind of stereo image vision comfort level evaluation methodology based on multi-scale dictionary
CN115984949B (en) Low-quality face image recognition method and equipment with attention mechanism
CN113128517A (en) Tone mapping image mixed visual feature extraction model establishment and quality evaluation method
CN116433904A (en) Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution
Guo et al. Multi-level Fusion Based Deep Convolutional Network for Image Quality Assessment
CN112099330B (en) Holographic human body reconstruction method based on external camera and wearable display control equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant