CN112966644A - Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof - Google Patents

Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof

Info

Publication number
CN112966644A
Authority
CN
China
Prior art keywords
task
modal
model
module
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110311898.4A
Other languages
Chinese (zh)
Inventor
陈益强
李雅洁
谷洋
王永斌
张忠平
肖益珊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202110311898.4A priority Critical patent/CN112966644A/en
Publication of CN112966644A publication Critical patent/CN112966644A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Social Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)

Abstract

The invention provides a multi-modal multi-task model for gesture detection and gesture recognition and a training method thereof. The invention uses a multi-modal channel attention mechanism to fuse and select task-related multi-modal feature information, and uses soft attention values to dynamically adjust the weights of the different tasks in the multi-task loss function, so that the model can adjust the importance of the tasks in the training network in real time and obtain good results on all of the tasks simultaneously.

Description

Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof
Technical Field
The invention relates to the field of multi-mode fusion, in particular to the field of multi-task learning, and more particularly relates to a multi-mode multi-task model for gesture detection and gesture recognition and a training method thereof.
Background
In the field of human-computer interaction, human gesture recognition has great research significance and application value, for example in virtual environments, navigation, sign language recognition and other assistive systems. Many researchers have therefore devoted substantial work to gesture recognition, and high-precision gesture detection and classification remains an important and difficult research problem. In addition, in order for computers to better understand the human world and interact with humans, researchers have introduced data of various modalities to make up for the shortcomings of single-modality models, so the multi-modal research field is developing rapidly and multi-Modal Machine Learning (MML) has become a current research hotspot.
Further, with the introduction of Multi-task learning methods, AI requires a computer to imitate the way a human can not only receive several kinds of information at the same time but also handle several tasks at the same time while still completing the main task efficiently. Multi-modal multi-task learning is therefore a natural trend in modern AI development, with great potential and application prospects. Effective information shared between related tasks can play a complementary role, and training multiple tasks in one model saves computing resources and model storage space and improves the multi-task learning rate, achieving efficient processing. Therefore, a multi-modal multi-task gesture recognition model that exploits the complementarity of information and the linkage between tasks has great application prospects and research significance.
However, most existing multi-modal gesture detection and recognition models target a single task, do not fully exploit the complementary relationship between modalities, rarely use auxiliary tasks to assist the main task, and suffer from problems such as low detection accuracy. The prior art has the following disadvantages:
1. because of problems such as complex individual differences and varying observation and illumination conditions, subtle or similar gestures are difficult to distinguish;
2. the correlation and complementarity among multiple modalities are not fully mined and utilized, and existing models do not balance and exploit the information of different modalities well;
3. existing models are designed for a single task and cannot complete multiple tasks, cannot exploit the advantages of multiple tasks, and cannot use assistance between tasks to achieve efficient performance on the main task. Therefore, there is a need for a highly robust, high-performance gesture recognition model that processes multi-modal information and multiple tasks simultaneously.
Disclosure of Invention
To solve the above problems in the prior art, a multi-modal multi-task model for gesture detection and gesture recognition is provided, which includes a modal feature extraction module, a multi-modal fusion module, and a model multi-task classification module, wherein,
the modal feature extraction module comprises a network structure and a shared feature layer which respectively extract different modal data features, and is used for preprocessing the multimodal data and extracting shared multimodal features;
the multi-mode fusion module comprises a multi-mode channel attention module and a task related feature layer, the multi-mode fusion module is connected with the modal feature extraction module, the shared multi-mode features are used as input of the multi-mode channel attention module, and the fused task related features are extracted to obtain the task related feature layer;
the model multi-task classification module is connected with the multi-mode fusion module, and classifies each task by taking the fused task related characteristics as input;
and network parameters of the modal feature extraction module, the multi-modal fusion module and the model multi-task classification module are updated iteratively in the training process.
Preferably, the model dynamically adjusts the multi-task loss function based on a soft attention mechanism during training.
Preferably, the multi-modal channel attention module includes an upper branch and a lower branch, the upper branch is composed of a 2D convolution kernel, the lower branch is composed of a 2D convolution kernel with the same size as the upper branch and a sigmoid function, and a modal characteristic output by the upper branch and an attention value output by the lower branch are multiplied by a matrix to obtain a task-related characteristic.
Preferably, the multi-modal data comprises video data, skeletal data, audio data.
Preferably, the multi-task loss function is
L = λ1L1 + λ2L2, where L is the multi-task loss function, L1 is the binary cross-entropy loss function used for the gesture detection task, L2 is the multi-class cross-entropy loss function used for the gesture recognition task, and λ1 and λ2 are the weights of L1 and L2 in L respectively, whose sizes are dynamically adjusted during training using the following formulas,
λi(t) = 2·exp(wi(t-1)/T) / [exp(w1(t-1)/T) + exp(w2(t-1)/T)]
wi(t-1) = Li(t-1) / Li(t-2)
wherein i represents the i-th task, i = 1, 2; t is the number of network training iterations, wi(·) is the relative descent rate of the loss function, i.e. the ratio of the loss function of the current iteration to that of the previous iteration, and T is a hyper-parameter for controlling the task weights.
Preferably, the hyperparameter T is 2.
The invention also provides a training method of the model, which comprises the following steps:
step 1, extracting shared multi-modal characteristics of a training sample by adopting the modal characteristic extraction module;
step 2, extracting the fused task related features based on the shared multi-modal features by adopting the multi-modal channel attention module;
step 3, dynamically adjusting a multitask loss function based on a soft attention mechanism;
and 4, iteratively updating the weight values of different task loss functions in the multitask loss function and the network parameters of the modal feature extraction module, the multi-modal fusion module and the model multitask classification module until the model converges.
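As an illustration only, the following is a minimal training-loop sketch of steps 1 to 4, assuming PyTorch; the module objects (extractor, fusion, heads), the optimizer, the learning rate and the data-loader format are illustrative assumptions rather than part of the described invention, and the dynamic task weighting follows the soft-attention formulas given above.

```python
import math
import torch
import torch.nn as nn

def train(extractor, fusion, heads, loader, epochs=50, T=2.0):
    # extractor, fusion and heads are assumed to implement the three modules
    # described above (modal feature extraction, multi-modal fusion with
    # channel attention, and the two task-specific classification heads).
    params = (list(extractor.parameters()) + list(fusion.parameters())
              + list(heads.parameters()))
    optimizer = torch.optim.Adam(params, lr=1e-3)
    ce = nn.CrossEntropyLoss()
    history = [[], []]                                   # recorded values of L1 and L2

    for _ in range(epochs):
        for video, skeleton, audio, has_gesture, gesture_cls in loader:
            shared = extractor(video, skeleton, audio)   # step 1: shared multi-modal features
            detect_feat, recog_feat = fusion(shared)     # step 2: task-related features
            detect_out, recog_out = heads(detect_feat, recog_feat)

            l1 = ce(detect_out, has_gesture)             # gesture detection loss (2 classes)
            l2 = ce(recog_out, gesture_cls)              # gesture recognition loss (21 classes)

            # step 3: soft-attention weights from the relative descent rates
            if len(history[0]) < 2:
                w = [1.0, 1.0]                           # initialisation for t = 1, 2
            else:
                w = [h[-1] / h[-2] for h in history]
            exp_w = [math.exp(wi / T) for wi in w]
            lam = [2.0 * e / sum(exp_w) for e in exp_w]  # lambda_1, lambda_2

            loss = lam[0] * l1 + lam[1] * l2             # weighted multi-task loss
            optimizer.zero_grad()
            loss.backward()                              # step 4: update all module parameters
            optimizer.step()
            history[0].append(l1.item())
            history[1].append(l2.item())
```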
The invention also provides a method for performing gesture detection and gesture recognition by using the model generated by the training method, which comprises the following steps:
step 1, preprocessing the multi-modal data of a gesture to be recognized, and extracting shared multi-modal features;
step 2, extracting the fused task related features by adopting a multi-mode channel attention mechanism based on the shared multi-mode features;
and 3, performing gesture detection and gesture recognition by using a model multi-task classification module based on the fused task related characteristics.
The invention also provides a computer-readable storage medium, on which a computer program is stored, wherein the program realizes the steps of the above-mentioned method when executed by a processor.
The invention also provides a computer device comprising a memory and a processor, on which memory a computer program is stored that is executable on the processor, characterized in that the processor implements the steps of the above method when executing the program.
The invention has the following characteristics and beneficial effects: the model has a strong ability to fuse multi-modal information and a strong gesture detection capability, can handle multiple tasks cooperatively, and significantly improves the prediction accuracy of the tasks. The invention uses a multi-modal channel attention mechanism to fuse and select task-related multi-modal feature information, and uses soft attention values to dynamically adjust the weights of the different tasks in the multi-task loss function, so that the model can adjust the importance of the tasks in the training network in real time and obtain good results on all of the tasks simultaneously.
Drawings
FIG. 1 illustrates a system architecture according to one embodiment of the invention.
Fig. 2 shows a network architecture according to one embodiment of the invention.
FIG. 3 illustrates a channel attention module according to one embodiment of the present invention.
Detailed Description
The invention is described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The system architecture of the present invention is shown in fig. 1 and includes three modules: a modal feature extraction module, a multi-modal fusion module, and a model multi-task classification module. The functions of the modules are as follows:
a modal feature extraction module: the module is used for preprocessing the multi-modal data and extracting multi-modal representations.
A multimodal fusion module: the module is used for fusing multi-modal information of a specific task and is a core part of the invention. The model performs task-related multi-modal information fusion based on a multi-modal channel attention mechanism. For different tasks, a channel attention mechanism is utilized to obtain task-related feature layers from the shared feature layers. By applying the method, the association degree among multiple modes can be fully mined, and the influence degree and the importance of multi-mode information on different tasks can be fully mined. By utilizing the relation between the modes and different tasks, the redundancy of multi-mode information is balanced, and the multi-mode characteristic information which is more useful for the tasks is selected, so that the prediction capability of the model for multiple tasks is greatly improved, and the mutual interference between the tasks is reduced.
A model multi-task classification module: the module is mainly responsible for gesture detection and gesture recognition based on the fused multi-modal information of each task. After multi-modal fusion, the fused multi-modal features are fed into the fully connected layer module of the corresponding task; the multi-task loss function is then dynamically adjusted according to a soft attention mechanism, and the two tasks are trained cooperatively, finally yielding a gesture detection result (gesture present or not) and a classification prediction of the gesture category.
The system architecture of the present invention is briefly described above, and the present invention is described in detail below in conjunction with a data set and a network architecture.
The training and validation data set used by the present invention is described first.
According to one embodiment of the invention, the public Montalbano data set is used for training and verifying the detection capabilities of the invention. This dataset is a preprocessed version of the multi-modal gesture recognition dataset of Track 3 of the ChaLearn 2014 Looking at People Challenge. The data set consists of four modalities: RGB video data, depth video data, skeleton data, and audio data, containing Italian gesture categories and one non-gesture category performed by 20 performers. Depth video data differs from RGB video data in that it additionally encodes the distance of objects in the video from the camera, represented in grayscale.
In this embodiment, the multi-modal data provided by the Montalbano data set is used to complete two tasks of gesture detection and gesture recognition, establish a model to detect whether a gesture exists, and recognize 21 types of gesture categories.
The data set used by the present invention is introduced above, and the network architecture is introduced below.
The invention relates to the field of machine learning, and the system of the invention can be implemented as a neural network; fig. 2 shows the network architecture included in the system according to an embodiment of the invention. The network comprises 1 shared feature layer, 1 channel attention mechanism module, 2 task-related feature layers and 2 groups of fully connected layers, and is used to complete the two tasks of gesture recognition and gesture detection.
The data set and network architecture are introduced above, and the modules are described in detail below.
1. Modal feature extraction module
The modal feature extraction module is mainly used to process the video, skeleton and audio modal data in the Montalbano data set with different networks and to extract the features of the different modalities. It comprises the network structures that respectively extract the data features of the different modalities and a shared feature layer.
For the video modality: the video data includes a color modality and a depth modality that describe the gesture. The invention trains a left-hand network and a right-hand network respectively. Taking the left hand as an example, its modal data include a color modality and a depth modality, and features are extracted with the video network in Table 1, i.e. features are first extracted with 3D convolution and then further extracted with 2D convolution. The color and depth modal features of the left hand are then fused to form the video modal features of the left hand. The feature extraction for the right hand is the same as for the left hand. Finally, the right-hand and left-hand modal features are fused.
For the skeleton modality, skeleton features are extracted using fully connected layers.
For the audio modality, convolution operations are used to extract further features.
Table 1 presents a network architecture for extracting video, bone, and audio modality data features, according to an embodiment of the present invention.
TABLE 1 Modal feature extraction
Features of data of different modalities are extracted through the network in table 1, wherein the video features are one-dimensional data with a size of 84, the bone features are one-dimensional data with a size of 350, and the audio features are one-dimensional data with a size of 350.
The 3 networks in Table 1 extract one-dimensional video features, audio features and skeleton features respectively; for simplicity, only the output of the shared feature layer is shown in fig. 2. The output of the shared feature layer is the one-dimensional vector obtained by concatenating the one-dimensional video, audio and skeleton features, with size 350 + 350 + 84 = 784.
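As an illustration only, a minimal PyTorch-style sketch of the modality-specific feature extractors and the shared feature layer is given below; only the 84/350/350-dimensional outputs and the 784-dimensional concatenation follow the text, while input shapes, channel counts and intermediate layer sizes are assumptions, and the left/right-hand and color/depth fusion of the video branch is simplified into a single stream.

```python
import torch
import torch.nn as nn

class VideoBranch(nn.Module):
    """3D convolution followed by 2D convolution (simplified single-stream version)."""
    def __init__(self, out_dim=84):
        super().__init__()
        self.conv3d = nn.Conv3d(1, 8, kernel_size=3, padding=1)
        self.conv2d = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(16, out_dim)

    def forward(self, clip):                          # clip: (B, 1, T, H, W)
        x = torch.relu(self.conv3d(clip))
        x = x.mean(dim=2)                             # collapse the time dimension
        x = torch.relu(self.conv2d(x))
        return self.fc(self.pool(x).flatten(1))       # (B, 84)

class SkeletonBranch(nn.Module):
    """Fully connected layers over the skeleton joints."""
    def __init__(self, in_dim=66, out_dim=350):       # in_dim is an assumed joint-vector size
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, out_dim))
    def forward(self, x):
        return self.net(x)                            # (B, 350)

class AudioBranch(nn.Module):
    """1D convolution over the audio signal."""
    def __init__(self, out_dim=350):
        super().__init__()
        self.conv = nn.Conv1d(1, 8, kernel_size=5, padding=2)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(8, out_dim)
    def forward(self, x):                             # x: (B, 1, L)
        x = torch.relu(self.conv(x))
        return self.fc(self.pool(x).flatten(1))       # (B, 350)

def shared_features(video_feat, skel_feat, audio_feat):
    # Shared feature layer: concatenation, 84 + 350 + 350 = 784 as stated above.
    return torch.cat([video_feat, skel_feat, audio_feat], dim=1)
```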
2. Multi-modal fusion module
As shown in FIG. 2, the multi-modal fusion module is composed of a channel attention mechanism module and a task-related feature layer. The multi-modal fusion module takes the output of the modal feature extraction module as input, namely the output of the shared feature layer in the network, extracts task-related features for the different tasks, and forms the task-related feature layers. The multi-modal channel attention mechanism has the following advantages: useful information related to the modalities and the task is strengthened, and noise interference unrelated to the modalities and the task is weakened, so that different tasks can be predicted with high precision.
The channel attention mechanism module is described in detail below.
The channel attention mechanism module is used to dynamically adjust the multi-modal feature combination for different tasks. Each obtained modal feature is fed into the channel attention mechanism module to obtain feature values representing how strongly that modal feature relates to a given task, and the new modal features obtained for all the modal features are concatenated and combined to obtain the task-related feature layer for that task.
The attention mechanism module is adopted because different tasks require different combinations and granularities of modal information. For example, the gesture detection task only detects whether a gesture exists; it is a binary classification task that pays more attention to whether a gesture appears in the video frame and does not rely on gesture details to determine which category the gesture belongs to. The gesture recognition task judges the gesture category among 21 classes; it is more concerned with the details of the gesture, so the skeleton node information and the video detail information related to the gesture details are more important and need to be emphasized during training. It follows that the modal feature combinations differ between tasks.
FIG. 3 illustrates a configuration of the channel attention mechanism module according to one embodiment of the invention. The module is composed of two branches. The upper branch is composed of a 2D convolution kernel that convolves the original modal features to obtain convolved modal features; according to an embodiment of the present invention, the size of the 2D convolution kernel is 16 × 3. The lower branch is formed by a 2D convolution kernel of the same size as the upper branch followed by a sigmoid function, and computes from the original modal features an attention value of the same size as the convolved modal features of the upper branch; this attention value is also a matrix. The convolved modal features obtained by the upper branch and the attention value obtained by the lower branch are multiplied to obtain the final strengthened and selected new modal features. Through iterative training of the network, the attention values are continuously adjusted, the resulting new modal features are continuously adjusted, and the task-related feature layer obtained by combining the new modal features therefore changes dynamically.
In this way, feature combination layers related to different tasks can be obtained by exploiting the relevance between the modalities and the tasks, enabling efficient representation of the different tasks.
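As an illustration only, the following is a minimal PyTorch-style sketch of such a channel attention module. The 16 × 3 kernel size follows the embodiment above; the number of channels and the way the one-dimensional modal features are reshaped into 2D maps are assumptions, and the product of the two same-sized branch outputs is realised element-wise (Hadamard product), a common reading of the multiplication described above.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Two-branch channel attention over a 2D modal feature map."""
    def __init__(self, in_channels=1, out_channels=1, kernel_size=(16, 3)):
        super().__init__()
        # Upper branch: plain 2D convolution of the original modal features.
        self.feature_conv = nn.Conv2d(in_channels, out_channels, kernel_size)
        # Lower branch: same-size 2D convolution followed by a sigmoid,
        # producing attention values of the same shape as the upper output.
        self.attn_conv = nn.Conv2d(in_channels, out_channels, kernel_size)

    def forward(self, x):                        # x: (B, C, H, W), H >= 16, W >= 3
        feat = self.feature_conv(x)              # convolved modal features
        attn = torch.sigmoid(self.attn_conv(x))  # attention values in (0, 1)
        return feat * attn                       # strengthened, selected new modal features

# One attention module per task reweights each modal feature; the resulting new
# modal features are concatenated into that task's task-related feature layer.
attention = ChannelAttention()
new_feature = attention(torch.randn(8, 1, 28, 25))   # the example input shape is an assumption
```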
3. Model multi-task classification module
As shown in fig. 2, the model multi-task classification module includes two groups of fully connected layer modules corresponding to the gesture detection and gesture recognition tasks respectively. The previously fused, task-related multi-modal information is fed into the fully connected layer module of each task for further classification, and the model finally outputs its judgement: the gesture detection task outputs whether a gesture is present, and the gesture recognition task outputs the category to which the input gesture belongs.
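A minimal sketch of the two groups of fully connected layers, assuming PyTorch; the hidden-layer sizes are illustrative guesses, while the 2-way detection output and the 21-way recognition output follow the text.

```python
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, detect_in_dim, recog_in_dim, num_gestures=21):
        super().__init__()
        # Gesture detection head: gesture present / absent (2 classes).
        self.detect_head = nn.Sequential(
            nn.Linear(detect_in_dim, 128), nn.ReLU(), nn.Linear(128, 2))
        # Gesture recognition head: 21 gesture categories.
        self.recog_head = nn.Sequential(
            nn.Linear(recog_in_dim, 256), nn.ReLU(), nn.Linear(256, num_gestures))

    def forward(self, detect_feat, recog_feat):
        # Each head receives the task-related features fused for its own task.
        return self.detect_head(detect_feat), self.recog_head(recog_feat)
```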
According to an embodiment of the present invention, because of the high complexity of multi-task training, and in order to avoid the network becoming biased toward one task during training at the expense of the others, the present invention adopts a soft attention mechanism to dynamically adjust the loss function during training.
In the soft attention mechanism, for the two tasks of gesture detection and gesture recognition, the gesture detection task uses a binary cross-entropy loss function, denoted L1, and the gesture recognition task uses a multi-class cross-entropy loss function, denoted L2. The total loss function is:
L = λ1L1 + λ2L2  (1)
where λi (i = 1, 2) is the weight of the i-th task loss function in the total loss function, whose size is dynamically adjusted using a soft attention mechanism:
λi(t) = 2·exp(wi(t-1)/T) / [exp(w1(t-1)/T) + exp(w2(t-1)/T)]  (2)
wi(t-1) = Li(t-1) / Li(t-2)  (3)
where t is the number of network training iterations and wi(·) is the relative descent rate of the loss function, i.e., the ratio of the loss function of the current iteration to that of the previous iteration. According to an embodiment of the present invention, wi may be initialized to 1 for t = 1, 2. T is a hyper-parameter for controlling the task weights.
Through a soft attention mechanism, the weights of different task loss functions in the total loss function are dynamically adjusted in iterative training, so that the balance among a plurality of tasks can be maintained in the training process of the network, the training network is prevented from deviating to a certain simple task and neglecting the training requirement of a complex task, and the multi-task linkage training is better realized.
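As a worked illustration of formulas (1) to (3), the following sketch (the function name task_weights is hypothetical) computes λ1 and λ2 from the recorded loss values; scaling by the number of tasks follows the dynamic weight averaging scheme of the cited reference "End-to-End Multi-Task Learning with Attention".

```python
import math

def task_weights(loss_history, T=2.0, num_tasks=2):
    """loss_history: one list of recorded loss values per task, oldest first."""
    if min(len(h) for h in loss_history) < 2:
        w = [1.0] * num_tasks                          # w_i initialised to 1 for t = 1, 2
    else:
        w = [h[-1] / h[-2] for h in loss_history]      # relative descent rate, formula (3)
    exp_w = [math.exp(wi / T) for wi in w]
    return [num_tasks * e / sum(exp_w) for e in exp_w] # lambda_i, formula (2)

# Total loss for one iteration, formula (1): L = lambda_1 * L1 + lambda_2 * L2
lam1, lam2 = task_weights([[0.71, 0.65], [3.0, 2.9]])  # illustrative loss values only
```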
According to an embodiment of the present invention, there is also provided a gesture detection and gesture recognition method based on the above system, including:
step 1, preprocessing the multi-modal data and extracting shared multi-modal features;
step 2, extracting the fused task related features by adopting a multi-mode channel attention mechanism based on the shared multi-mode features;
and 3, performing gesture detection and gesture recognition based on the fused task related characteristics.
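A minimal end-to-end sketch of steps 1 to 3 at inference time, assuming PyTorch; extractor, fusion and heads are hypothetical names standing for the three modules sketched earlier.

```python
import torch

@torch.no_grad()
def predict(extractor, fusion, heads, video, skeleton, audio):
    shared = extractor(video, skeleton, audio)      # step 1: shared multi-modal features
    detect_feat, recog_feat = fusion(shared)        # step 2: task-related features via channel attention
    detect_logits, recog_logits = heads(detect_feat, recog_feat)  # step 3: classify each task
    has_gesture = detect_logits.argmax(dim=1)       # gesture present or not
    gesture_cls = recog_logits.argmax(dim=1)        # one of the 21 gesture categories
    return has_gesture, gesture_cls
```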
The inventors carried out experimental verification of the system. In the experiments, video, skeleton and audio modal data were fused, the multi-modal channel attention mechanism was used, the soft attention mechanism dynamically adjusted the loss function, and exploratory experiments were carried out on the weights of the different task loss functions. The gesture recognition accuracy results are shown in Table 2. When the hyper-parameter T = 2, the gesture detection task reaches an accuracy of 99.80%, and the 21-class gesture recognition task reaches an accuracy of 95.02%.
TABLE 2 results of the experiment
In general, the present invention utilizes a multi-modal channel attention mechanism to extract task-related features for different tasks from shared multi-modal features, forming task-related feature layers. For multi-task collaborative training, a soft attention mechanism is used to dynamically adjust the multi-task loss function and to adjust the importance of the different tasks to the model in real time, avoiding the situation in which the model becomes biased during training toward a task that is easy to learn while neglecting the others, which would degrade their results. The method achieves better fusion of multi-modal information and coordinates the relevance among multiple tasks. Compared with traditional gesture detection methods, the accuracy is significantly improved, and cooperative prediction of multiple tasks is realized.
It is to be noted and understood that various modifications and improvements can be made to the invention described in detail above without departing from the spirit and scope of the invention as claimed in the appended claims. Accordingly, the scope of the claimed subject matter is not limited by any of the specific exemplary teachings provided.

Claims (10)

1. A multi-mode multi-task model for gesture detection and gesture recognition comprises a modal feature extraction module, a multi-mode fusion module and a model multi-task classification module, wherein,
the modal feature extraction module comprises a network structure and a shared feature layer which respectively extract different modal data features, and is used for preprocessing the multimodal data and extracting shared multimodal features;
the multi-mode fusion module comprises a multi-mode channel attention module and a task related feature layer, the multi-mode fusion module is connected with the modal feature extraction module, the shared multi-mode features are used as input of the multi-mode channel attention module, and the fused task related features are extracted to obtain the task related feature layer;
the model multi-task classification module is connected with the multi-mode fusion module, and classifies each task by taking the fused task related characteristics as input;
and network parameters of the modal feature extraction module, the multi-modal fusion module and the model multi-task classification module are updated iteratively in the training process.
2. The model of claim 1, wherein a multi-task loss function is dynamically adjusted based on a soft attention mechanism during training.
3. The model of claim 1, the multi-modal channel attention module comprising an upper branch and a lower branch, the upper branch being formed by a 2D convolution kernel, the lower branch being formed by a 2D convolution kernel of the same size as the upper branch and a sigmoid function, modal characteristics output by the upper branch being matrix multiplied by attention values output by the lower branch to obtain task-related characteristics.
4. The model of claim 1, the multimodal data comprising video data, skeletal data, audio data.
5. The model of claim 2, wherein the multi-task loss function is
L = λ1L1 + λ2L2, where L is the multi-task loss function, L1 is the binary cross-entropy loss function used for the gesture detection task, L2 is the multi-class cross-entropy loss function used for the gesture recognition task, and λ1 and λ2 are the weights of L1 and L2 in L respectively, whose sizes are dynamically adjusted during training using the following formulas,
λi(t) = 2·exp(wi(t-1)/T) / [exp(w1(t-1)/T) + exp(w2(t-1)/T)]
wi(t-1) = Li(t-1) / Li(t-2)
wherein i represents the i-th task, i = 1, 2; t is the number of network training iterations, wi(·) is the relative descent rate of the loss function, i.e. the ratio of the loss function of the current iteration to that of the previous iteration, and T is a hyper-parameter for controlling the task weight.
6. The model of claim 5, wherein the hyper-parameter T = 2.
7. A training method for the model of claim 5, comprising:
step 1, extracting shared multi-modal characteristics of a training sample by adopting the modal characteristic extraction module;
step 2, extracting the fused task related features based on the shared multi-modal features by adopting the multi-modal channel attention module;
step 3, dynamically adjusting a multitask loss function based on a soft attention mechanism;
and 4, iteratively updating the weight values of different task loss functions in the multitask loss function and the network parameters of the modal feature extraction module, the multi-modal fusion module and the model multitask classification module until the model converges.
8. A method for gesture detection and gesture recognition using the model generated by the method of claim 7, comprising:
step 1, preprocessing the multi-modal data of a gesture to be recognized, and extracting shared multi-modal features;
step 2, extracting the fused task related features by adopting a multi-mode channel attention mechanism based on the shared multi-mode features;
and 3, performing gesture detection and gesture recognition by using a model multi-task classification module based on the fused task related characteristics.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to claim 7 or 8.
10. A computer device comprising a memory and a processor, a computer program being stored on the memory and being executable on the processor, characterized in that the steps of the method as claimed in claim 7 or 8 are implemented when the processor executes the program.
CN202110311898.4A 2021-03-24 2021-03-24 Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof Pending CN112966644A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110311898.4A CN112966644A (en) 2021-03-24 2021-03-24 Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110311898.4A CN112966644A (en) 2021-03-24 2021-03-24 Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof

Publications (1)

Publication Number Publication Date
CN112966644A true CN112966644A (en) 2021-06-15

Family

ID=76278286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110311898.4A Pending CN112966644A (en) 2021-03-24 2021-03-24 Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof

Country Status (1)

Country Link
CN (1) CN112966644A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378855A (en) * 2021-06-22 2021-09-10 北京百度网讯科技有限公司 Method for processing multitask, related device and computer program product
CN113705662A (en) * 2021-08-26 2021-11-26 中国银联股份有限公司 Collaborative training method and device and computer readable storage medium
CN117636481A (en) * 2024-01-25 2024-03-01 江西师范大学 Multi-mode joint gesture action generation method based on diffusion model
WO2024108377A1 (en) * 2022-11-22 2024-05-30 上海成电福智科技有限公司 Multimodal multi-task workshop target recognition method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
CN111246256A (en) * 2020-02-21 2020-06-05 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning
CN112163447A (en) * 2020-08-18 2021-01-01 桂林电子科技大学 Multi-task real-time gesture detection and recognition method based on Attention and Squeezenet
CN112183547A (en) * 2020-10-19 2021-01-05 中国科学院计算技术研究所 Multi-mode data-based multi-task learning method and system
CN112507947A (en) * 2020-12-18 2021-03-16 宜通世纪物联网研究院(广州)有限公司 Gesture recognition method, device, equipment and medium based on multi-mode fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
CN111246256A (en) * 2020-02-21 2020-06-05 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning
CN112163447A (en) * 2020-08-18 2021-01-01 桂林电子科技大学 Multi-task real-time gesture detection and recognition method based on Attention and Squeezenet
CN112183547A (en) * 2020-10-19 2021-01-05 中国科学院计算技术研究所 Multi-mode data-based multi-task learning method and system
CN112507947A (en) * 2020-12-18 2021-03-16 宜通世纪物联网研究院(广州)有限公司 Gesture recognition method, device, equipment and medium based on multi-mode fusion

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SHIKUN LIU等: "End-to-End Multi-Task Learning with Attention" *
YANG GU: "A Collaborative Multi-modal Fusion Method Based on Random Variational Information Bottleneck for Gesture Recognition" *
YINGWEI ZHANG 等: "Learning Effective Spatial–Temporal Features for sEMG Armband-Based Gesture Recognition", 《IEEE INTERNET OF THINGS JOURNAL》 *
高明柯;赵卓;逄涛;王天保;邹一波;黄晨;李德旭;: "Gesture recognition method based on attention mechanism and feature fusion" (基于注意力机制和特征融合的手势识别方法)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378855A (en) * 2021-06-22 2021-09-10 北京百度网讯科技有限公司 Method for processing multitask, related device and computer program product
CN113705662A (en) * 2021-08-26 2021-11-26 中国银联股份有限公司 Collaborative training method and device and computer readable storage medium
WO2024108377A1 (en) * 2022-11-22 2024-05-30 上海成电福智科技有限公司 Multimodal multi-task workshop target recognition method
CN117636481A (en) * 2024-01-25 2024-03-01 江西师范大学 Multi-mode joint gesture action generation method based on diffusion model
CN117636481B (en) * 2024-01-25 2024-05-14 江西师范大学 Multi-mode joint gesture action generation method based on diffusion model

Similar Documents

Publication Publication Date Title
CN112784764B (en) Expression recognition method and system based on local and global attention mechanism
CN112966644A (en) Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof
CN114398961A (en) Visual question-answering method based on multi-mode depth feature fusion and model thereof
CN113628294A (en) Image reconstruction method and device for cross-modal communication system
CN109783666A (en) A kind of image scene map generation method based on iteration fining
CN108073851B (en) Grabbing gesture recognition method and device and electronic equipment
CN111816169B (en) Method and device for training Chinese and English hybrid speech recognition model
CN111046661A (en) Reading understanding method based on graph convolution network
Qi et al. Personalized sketch-based image retrieval by convolutional neural network and deep transfer learning
CN110968235B (en) Signal processing device and related product
CN115223020A (en) Image processing method, image processing device, electronic equipment and readable storage medium
CN112667071A (en) Gesture recognition method, device, equipment and medium based on random variation information
CN114913590B (en) Data emotion recognition method, device and equipment and readable storage medium
Zhang et al. R2Net: Residual refinement network for salient object detection
Sahu et al. Dynamic routing using inter capsule routing protocol between capsules
Le et al. Multi visual and textual embedding on visual question answering for blind people
Rastgoo et al. Word separation in continuous sign language using isolated signs and post-processing
Li et al. Spatial-temporal dynamic hand gesture recognition via hybrid deep learning model
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
Doughan et al. Novel preprocessors for convolution neural networks
CN116244473B (en) Multi-mode emotion recognition method based on feature decoupling and graph knowledge distillation
Kamil et al. Literature Review of Generative models for Image-to-Image translation problems
CN114863548B (en) Emotion recognition method and device based on nonlinear space characteristics of human body movement gestures
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
He et al. ECS-SC: Long-tailed classification via data augmentation based on easily confused sample selection and combination

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210615

RJ01 Rejection of invention patent application after publication