CN112966644A - Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof - Google Patents

Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof

Info

Publication number
CN112966644A
Authority
CN
China
Prior art keywords
task
modal
model
module
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110311898.4A
Other languages
Chinese (zh)
Inventor
陈益强
李雅洁
谷洋
王永斌
张忠平
肖益珊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202110311898.4A priority Critical patent/CN112966644A/en
Publication of CN112966644A publication Critical patent/CN112966644A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Social Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)

Abstract

The invention provides a multi-modal multi-task model for gesture detection and gesture recognition and a training method thereof. The invention uses a multi-modal channel attention mechanism to fuse and select task-related multi-modal feature information, and uses soft attention values to dynamically adjust the weights of the different tasks in the multi-task loss function, so that the model can adjust the importance of the tasks in the training network in real time and obtain good results on all of the tasks simultaneously.

Description

Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof
Technical Field
The invention relates to the field of multi-mode fusion, in particular to the field of multi-task learning, and more particularly relates to a multi-mode multi-task model for gesture detection and gesture recognition and a training method thereof.
Background
In the field of human-computer interaction, human gesture recognition has great research significance and application value, for example in virtual environments, navigation, sign language recognition and other assistive systems. Many researchers have therefore devoted substantial work to gesture recognition, and high-precision gesture detection and classification remains an important and difficult research problem. In addition, in order for computers to better understand the human world and interact with humans, researchers have introduced data of various modalities to make up for the shortcomings of single-modality models, so the multi-modal research field is developing rapidly and multi-Modal Machine Learning (MML) has become a current research hotspot.
Further, with the introduction of Multi-task learning methods, AI requires a computer to imitate the way a human can not only receive several kinds of information at the same time but also handle several tasks at the same time while still completing the main task efficiently. Multi-modal multi-task learning is therefore a natural trend in modern AI development, with great potential and application prospects. Effective information shared between related tasks can play a complementary role, and training multiple tasks in one model saves computing resources and model storage space and improves the multi-task learning rate, achieving efficient processing. Therefore, a multi-modal multi-task gesture recognition model that exploits the complementarity of information and the linkage between tasks has great application prospects and research significance.
However, most existing multi-modal gesture detection and recognition models target a single task, do not fully exploit the complementary relationship between modalities, rarely use auxiliary tasks to assist the main task, and suffer from problems such as low detection accuracy. The prior art has the following disadvantages:
1. because of problems such as complex individual differences and varying observation and illumination conditions, subtle or similar gestures are difficult to distinguish;
2. the correlation and complementarity among multiple modalities are not fully mined and utilized, and existing models do not balance and exploit the information of different modalities well;
3. existing models are designed for a single task and cannot complete multiple tasks, cannot exploit the advantages of multiple tasks, and cannot use assistance between tasks to achieve efficient performance on the main task. Therefore, there is a need for a highly robust, high-performance gesture recognition model that processes multi-modal information and multiple tasks simultaneously.
Disclosure of Invention
To solve the above problems in the prior art, a multi-modal multi-task model for gesture detection and gesture recognition is provided, which includes a modal feature extraction module, a multi-modal fusion module, and a model multi-task classification module, wherein,
the modal feature extraction module comprises a network structure and a shared feature layer which respectively extract different modal data features, and is used for preprocessing the multimodal data and extracting shared multimodal features;
the multi-mode fusion module comprises a multi-mode channel attention module and a task related feature layer, the multi-mode fusion module is connected with the modal feature extraction module, the shared multi-mode features are used as input of the multi-mode channel attention module, and the fused task related features are extracted to obtain the task related feature layer;
the model multi-task classification module is connected with the multi-mode fusion module, and classifies each task by taking the fused task related characteristics as input;
and network parameters of the modal feature extraction module, the multi-modal fusion module and the model multi-task classification module are updated iteratively in the training process.
Preferably, the model dynamically adjusts the multi-task loss function based on a soft attention mechanism during training.
Preferably, the multi-modal channel attention module includes an upper branch and a lower branch, the upper branch is composed of a 2D convolution kernel, the lower branch is composed of a 2D convolution kernel with the same size as the upper branch and a sigmoid function, and a modal characteristic output by the upper branch and an attention value output by the lower branch are multiplied by a matrix to obtain a task-related characteristic.
Preferably, the multi-modal data comprises video data, skeletal data, audio data.
Preferably, the multi-task loss function is
L = λ1L1 + λ2L2, where L is the multi-task loss function, L1 is the binary cross-entropy loss function used for the gesture detection task, L2 is the multi-class cross-entropy loss function used for the gesture recognition task, and λ1 and λ2 are the weights of L1 and L2 in L respectively, whose sizes are dynamically adjusted during training using the following formulas,
λi(t) = 2·exp(wi(t-1)/T) / [exp(w1(t-1)/T) + exp(w2(t-1)/T)]
wi(t-1) = Li(t-1) / Li(t-2)
wherein i represents the i-th task, i = 1, 2; t is the number of network training iterations, wi(·) is the relative descent rate of the loss function, i.e. the ratio of the loss function of the current iteration to that of the previous iteration, and T is a hyper-parameter for controlling the task weights.
Preferably, the hyperparameter T is 2.
The invention also provides a training method of the model, which comprises the following steps:
step 1, extracting shared multi-modal characteristics of a training sample by adopting the modal characteristic extraction module;
step 2, extracting the fused task related features based on the shared multi-modal features by adopting the multi-modal channel attention module;
step 3, dynamically adjusting a multitask loss function based on a soft attention mechanism;
and 4, iteratively updating the weight values of different task loss functions in the multitask loss function and the network parameters of the modal feature extraction module, the multi-modal fusion module and the model multitask classification module until the model converges.
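As an illustration only, the following is a minimal training-loop sketch of steps 1 to 4, assuming PyTorch; the module objects (extractor, fusion, heads), the optimizer, the learning rate and the data-loader format are illustrative assumptions rather than part of the described invention, and the dynamic task weighting follows the soft-attention formulas given above.

```python
import math
import torch
import torch.nn as nn

def train(extractor, fusion, heads, loader, epochs=50, T=2.0):
    # extractor, fusion and heads are assumed to implement the three modules
    # described above (modal feature extraction, multi-modal fusion with
    # channel attention, and the two task-specific classification heads).
    params = (list(extractor.parameters()) + list(fusion.parameters())
              + list(heads.parameters()))
    optimizer = torch.optim.Adam(params, lr=1e-3)
    ce = nn.CrossEntropyLoss()
    history = [[], []]                                   # recorded values of L1 and L2

    for _ in range(epochs):
        for video, skeleton, audio, has_gesture, gesture_cls in loader:
            shared = extractor(video, skeleton, audio)   # step 1: shared multi-modal features
            detect_feat, recog_feat = fusion(shared)     # step 2: task-related features
            detect_out, recog_out = heads(detect_feat, recog_feat)

            l1 = ce(detect_out, has_gesture)             # gesture detection loss (2 classes)
            l2 = ce(recog_out, gesture_cls)              # gesture recognition loss (21 classes)

            # step 3: soft-attention weights from the relative descent rates
            if len(history[0]) < 2:
                w = [1.0, 1.0]                           # initialisation for t = 1, 2
            else:
                w = [h[-1] / h[-2] for h in history]
            exp_w = [math.exp(wi / T) for wi in w]
            lam = [2.0 * e / sum(exp_w) for e in exp_w]  # lambda_1, lambda_2

            loss = lam[0] * l1 + lam[1] * l2             # weighted multi-task loss
            optimizer.zero_grad()
            loss.backward()                              # step 4: update all module parameters
            optimizer.step()
            history[0].append(l1.item())
            history[1].append(l2.item())
```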
The invention also provides a method for performing gesture detection and gesture recognition by using the model generated by the training method, which comprises the following steps:
step 1, preprocessing the multi-modal data of a gesture to be recognized, and extracting shared multi-modal features;
step 2, extracting the fused task related features by adopting a multi-mode channel attention mechanism based on the shared multi-mode features;
and 3, performing gesture detection and gesture recognition by using a model multi-task classification module based on the fused task related characteristics.
The invention also provides a computer-readable storage medium, on which a computer program is stored, wherein the program realizes the steps of the above-mentioned method when executed by a processor.
The invention also provides a computer device comprising a memory and a processor, on which memory a computer program is stored that is executable on the processor, characterized in that the processor implements the steps of the above method when executing the program.
The invention has the following characteristics and beneficial effects: the model has a strong ability to fuse multi-modal information and a strong gesture detection capability, can handle multiple tasks cooperatively, and significantly improves the prediction accuracy of the tasks. The invention uses a multi-modal channel attention mechanism to fuse and select task-related multi-modal feature information, and uses soft attention values to dynamically adjust the weights of the different tasks in the multi-task loss function, so that the model can adjust the importance of the tasks in the training network in real time and obtain good results on all of the tasks simultaneously.
Drawings
FIG. 1 illustrates a system architecture according to one embodiment of the invention.
Fig. 2 shows a network architecture according to one embodiment of the invention.
FIG. 3 illustrates a channel attention module according to one embodiment of the present invention.
Detailed Description
The invention is described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The system architecture of the present invention is shown in fig. 1 and includes three modules: a modal feature extraction module, a multi-modal fusion module, and a model multi-task classification module. The functions of the modules are as follows:
a modal feature extraction module: the module is used for preprocessing the multi-modal data and extracting multi-modal representations.
A multimodal fusion module: the module is used for fusing multi-modal information of a specific task and is a core part of the invention. The model performs task-related multi-modal information fusion based on a multi-modal channel attention mechanism. For different tasks, a channel attention mechanism is utilized to obtain task-related feature layers from the shared feature layers. By applying the method, the association degree among multiple modes can be fully mined, and the influence degree and the importance of multi-mode information on different tasks can be fully mined. By utilizing the relation between the modes and different tasks, the redundancy of multi-mode information is balanced, and the multi-mode characteristic information which is more useful for the tasks is selected, so that the prediction capability of the model for multiple tasks is greatly improved, and the mutual interference between the tasks is reduced.
A model multi-task classification module: the module is mainly responsible for gesture detection and gesture recognition based on the fused multi-modal information of each task. After multi-modal fusion, the fused multi-modal features are fed into the fully connected layer module of the corresponding task; the multi-task loss function is then dynamically adjusted according to a soft attention mechanism, and the two tasks are trained cooperatively, finally yielding a gesture detection result (gesture present or not) and a classification prediction of the gesture category.
The system architecture of the present invention is briefly described above, and the present invention is described in detail below in conjunction with a data set and a network architecture.
The training and validation data set used by the present invention is described first.
According to one embodiment of the invention, the public Montalbano data set is used for training and verifying the detection capabilities of the invention. This dataset is a preprocessed version of the multi-modal gesture recognition dataset of Track 3 of the ChaLearn 2014 Looking at People Challenge. The data set consists of four modalities: RGB video data, depth video data, skeleton data, and audio data, containing Italian gesture categories and one non-gesture category performed by 20 performers. Depth video data differs from RGB video data in that it additionally encodes the distance of objects in the video from the camera, represented in grayscale.
In this embodiment, the multi-modal data provided by the Montalbano data set is used to complete two tasks of gesture detection and gesture recognition, establish a model to detect whether a gesture exists, and recognize 21 types of gesture categories.
The data set used by the present invention is introduced above, and the network architecture is introduced below.
The invention relates to the field of machine learning, and the system of the invention can be implemented as a neural network; fig. 2 shows the network architecture included in the system according to an embodiment of the invention. The network comprises 1 shared feature layer, 1 channel attention mechanism module, 2 task-related feature layers and 2 groups of fully connected layers, and is used to complete the two tasks of gesture recognition and gesture detection.
The data set and network architecture are introduced above, and the modules are described in detail below.
1. Modal feature extraction module
The modal feature extraction module is mainly used to process the video, skeleton and audio modal data in the Montalbano data set with different networks and to extract the features of the different modalities. It comprises the network structures that respectively extract the data features of the different modalities and a shared feature layer.
For the video modality: the video data includes a color modality and a depth modality that describe the gesture. The invention trains a left-hand network and a right-hand network respectively. Taking the left hand as an example, its modal data include a color modality and a depth modality, and features are extracted with the video network in Table 1, i.e. features are first extracted with 3D convolution and then further extracted with 2D convolution. The color and depth modal features of the left hand are then fused to form the video modal features of the left hand. The feature extraction for the right hand is the same as for the left hand. Finally, the right-hand and left-hand modal features are fused.
For the skeleton modality, skeleton features are extracted using fully connected layers.
For the audio modality, convolution operations are used to extract further features.
Table 1 presents a network architecture for extracting video, bone, and audio modality data features, according to an embodiment of the present invention.
TABLE 1 Modal feature extraction
Features of data of different modalities are extracted through the network in table 1, wherein the video features are one-dimensional data with a size of 84, the bone features are one-dimensional data with a size of 350, and the audio features are one-dimensional data with a size of 350.
The 3 networks in Table 1 extract one-dimensional video features, audio features and skeleton features respectively; for simplicity, only the output of the shared feature layer is shown in fig. 2. The output of the shared feature layer is the one-dimensional vector obtained by concatenating the one-dimensional video, audio and skeleton features, with size 350 + 350 + 84 = 784.
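As an illustration only, a minimal PyTorch-style sketch of the modality-specific feature extractors and the shared feature layer is given below; only the 84/350/350-dimensional outputs and the 784-dimensional concatenation follow the text, while input shapes, channel counts and intermediate layer sizes are assumptions, and the left/right-hand and color/depth fusion of the video branch is simplified into a single stream.

```python
import torch
import torch.nn as nn

class VideoBranch(nn.Module):
    """3D convolution followed by 2D convolution (simplified single-stream version)."""
    def __init__(self, out_dim=84):
        super().__init__()
        self.conv3d = nn.Conv3d(1, 8, kernel_size=3, padding=1)
        self.conv2d = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(16, out_dim)

    def forward(self, clip):                          # clip: (B, 1, T, H, W)
        x = torch.relu(self.conv3d(clip))
        x = x.mean(dim=2)                             # collapse the time dimension
        x = torch.relu(self.conv2d(x))
        return self.fc(self.pool(x).flatten(1))       # (B, 84)

class SkeletonBranch(nn.Module):
    """Fully connected layers over the skeleton joints."""
    def __init__(self, in_dim=66, out_dim=350):       # in_dim is an assumed joint-vector size
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, out_dim))
    def forward(self, x):
        return self.net(x)                            # (B, 350)

class AudioBranch(nn.Module):
    """1D convolution over the audio signal."""
    def __init__(self, out_dim=350):
        super().__init__()
        self.conv = nn.Conv1d(1, 8, kernel_size=5, padding=2)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(8, out_dim)
    def forward(self, x):                             # x: (B, 1, L)
        x = torch.relu(self.conv(x))
        return self.fc(self.pool(x).flatten(1))       # (B, 350)

def shared_features(video_feat, skel_feat, audio_feat):
    # Shared feature layer: concatenation, 84 + 350 + 350 = 784 as stated above.
    return torch.cat([video_feat, skel_feat, audio_feat], dim=1)
```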
2. Multi-modal fusion module
As shown in FIG. 2, the multi-modal fusion module is composed of a channel attention mechanism module and a task-related feature layer. The multi-modal fusion module takes the output of the modal feature extraction module as input, namely the output of the shared feature layer in the network, extracts task-related features for the different tasks, and forms the task-related feature layers. The multi-modal channel attention mechanism has the following advantages: useful information related to the modalities and the task is strengthened, and noise interference unrelated to the modalities and the task is weakened, so that different tasks can be predicted with high precision.
The channel attention mechanism module is described in detail below.
The channel attention mechanism module is used to dynamically adjust the multi-modal feature combination for different tasks. Each obtained modal feature is fed into the channel attention mechanism module to obtain feature values representing how strongly that modal feature relates to a given task, and the new modal features obtained for all the modal features are concatenated and combined to obtain the task-related feature layer for that task.
The attention mechanism module is adopted because different tasks require different combinations and granularities of modal information. For example, the gesture detection task only detects whether a gesture exists; it is a binary classification task that pays more attention to whether a gesture appears in the video frame and does not rely on gesture details to determine which category the gesture belongs to. The gesture recognition task judges the gesture category among 21 classes; it is more concerned with the details of the gesture, so the skeleton node information and the video detail information related to the gesture details are more important and need to be emphasized during training. It follows that the modal feature combinations differ between tasks.
FIG. 3 illustrates a configuration of the channel attention mechanism module according to one embodiment of the invention. The module is composed of two branches. The upper branch is composed of a 2D convolution kernel that convolves the original modal features to obtain convolved modal features; according to an embodiment of the present invention, the size of the 2D convolution kernel is 16 × 3. The lower branch is formed by a 2D convolution kernel of the same size as the upper branch followed by a sigmoid function, and computes from the original modal features an attention value of the same size as the convolved modal features of the upper branch; this attention value is also a matrix. The convolved modal features obtained by the upper branch and the attention value obtained by the lower branch are multiplied to obtain the final strengthened and selected new modal features. Through iterative training of the network, the attention values are continuously adjusted, the resulting new modal features are continuously adjusted, and the task-related feature layer obtained by combining the new modal features therefore changes dynamically.
In this way, feature combination layers related to different tasks can be obtained by exploiting the relevance between the modalities and the tasks, enabling efficient representation of the different tasks.
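As an illustration only, the following is a minimal PyTorch-style sketch of such a channel attention module. The 16 × 3 kernel size follows the embodiment above; the number of channels and the way the one-dimensional modal features are reshaped into 2D maps are assumptions, and the product of the two same-sized branch outputs is realised element-wise (Hadamard product), a common reading of the multiplication described above.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Two-branch channel attention over a 2D modal feature map."""
    def __init__(self, in_channels=1, out_channels=1, kernel_size=(16, 3)):
        super().__init__()
        # Upper branch: plain 2D convolution of the original modal features.
        self.feature_conv = nn.Conv2d(in_channels, out_channels, kernel_size)
        # Lower branch: same-size 2D convolution followed by a sigmoid,
        # producing attention values of the same shape as the upper output.
        self.attn_conv = nn.Conv2d(in_channels, out_channels, kernel_size)

    def forward(self, x):                        # x: (B, C, H, W), H >= 16, W >= 3
        feat = self.feature_conv(x)              # convolved modal features
        attn = torch.sigmoid(self.attn_conv(x))  # attention values in (0, 1)
        return feat * attn                       # strengthened, selected new modal features

# One attention module per task reweights each modal feature; the resulting new
# modal features are concatenated into that task's task-related feature layer.
attention = ChannelAttention()
new_feature = attention(torch.randn(8, 1, 28, 25))   # the example input shape is an assumption
```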
3. Model multi-task classification module
As shown in fig. 2, the model multi-task classification module includes two groups of fully connected layer modules corresponding to the gesture detection and gesture recognition tasks respectively. The previously fused, task-related multi-modal information is fed into the fully connected layer module of each task for further classification, and the model finally outputs its judgement: the gesture detection task outputs whether a gesture is present, and the gesture recognition task outputs the category to which the input gesture belongs.
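A minimal sketch of the two groups of fully connected layers, assuming PyTorch; the hidden-layer sizes are illustrative guesses, while the 2-way detection output and the 21-way recognition output follow the text.

```python
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, detect_in_dim, recog_in_dim, num_gestures=21):
        super().__init__()
        # Gesture detection head: gesture present / absent (2 classes).
        self.detect_head = nn.Sequential(
            nn.Linear(detect_in_dim, 128), nn.ReLU(), nn.Linear(128, 2))
        # Gesture recognition head: 21 gesture categories.
        self.recog_head = nn.Sequential(
            nn.Linear(recog_in_dim, 256), nn.ReLU(), nn.Linear(256, num_gestures))

    def forward(self, detect_feat, recog_feat):
        # Each head receives the task-related features fused for its own task.
        return self.detect_head(detect_feat), self.recog_head(recog_feat)
```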
According to an embodiment of the present invention, because of the high complexity of multi-task training, and in order to avoid the network becoming biased toward one task during training at the expense of the others, the present invention adopts a soft attention mechanism to dynamically adjust the loss function during training.
In the soft attention mechanism, for the two tasks of gesture detection and gesture recognition, the gesture detection task uses a binary cross-entropy loss function, denoted L1, and the gesture recognition task uses a multi-class cross-entropy loss function, denoted L2. The total loss function is:
L = λ1L1 + λ2L2  (1)
where λi (i = 1, 2) is the weight of the i-th task loss function in the total loss function, whose size is dynamically adjusted using a soft attention mechanism:
λi(t) = 2·exp(wi(t-1)/T) / [exp(w1(t-1)/T) + exp(w2(t-1)/T)]  (2)
wi(t-1) = Li(t-1) / Li(t-2)  (3)
where t is the number of network training iterations and wi(·) is the relative descent rate of the loss function, i.e., the ratio of the loss function of the current iteration to that of the previous iteration. According to an embodiment of the present invention, wi may be initialized to 1 for t = 1, 2. T is a hyper-parameter for controlling the task weights.
Through a soft attention mechanism, the weights of different task loss functions in the total loss function are dynamically adjusted in iterative training, so that the balance among a plurality of tasks can be maintained in the training process of the network, the training network is prevented from deviating to a certain simple task and neglecting the training requirement of a complex task, and the multi-task linkage training is better realized.
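As a worked illustration of formulas (1) to (3), the following sketch (the function name task_weights is hypothetical) computes λ1 and λ2 from the recorded loss values; scaling by the number of tasks follows the dynamic weight averaging scheme of the cited reference "End-to-End Multi-Task Learning with Attention".

```python
import math

def task_weights(loss_history, T=2.0, num_tasks=2):
    """loss_history: one list of recorded loss values per task, oldest first."""
    if min(len(h) for h in loss_history) < 2:
        w = [1.0] * num_tasks                          # w_i initialised to 1 for t = 1, 2
    else:
        w = [h[-1] / h[-2] for h in loss_history]      # relative descent rate, formula (3)
    exp_w = [math.exp(wi / T) for wi in w]
    return [num_tasks * e / sum(exp_w) for e in exp_w] # lambda_i, formula (2)

# Total loss for one iteration, formula (1): L = lambda_1 * L1 + lambda_2 * L2
lam1, lam2 = task_weights([[0.71, 0.65], [3.0, 2.9]])  # illustrative loss values only
```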
According to an embodiment of the present invention, there is also provided a gesture detection and gesture recognition method based on the above system, including:
step 1, preprocessing the multi-modal data and extracting shared multi-modal features;
step 2, extracting the fused task related features by adopting a multi-mode channel attention mechanism based on the shared multi-mode features;
and 3, performing gesture detection and gesture recognition based on the fused task related characteristics.
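A minimal end-to-end sketch of steps 1 to 3 at inference time, assuming PyTorch; extractor, fusion and heads are hypothetical names standing for the three modules sketched earlier.

```python
import torch

@torch.no_grad()
def predict(extractor, fusion, heads, video, skeleton, audio):
    shared = extractor(video, skeleton, audio)      # step 1: shared multi-modal features
    detect_feat, recog_feat = fusion(shared)        # step 2: task-related features via channel attention
    detect_logits, recog_logits = heads(detect_feat, recog_feat)  # step 3: classify each task
    has_gesture = detect_logits.argmax(dim=1)       # gesture present or not
    gesture_cls = recog_logits.argmax(dim=1)        # one of the 21 gesture categories
    return has_gesture, gesture_cls
```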
The inventors carried out experimental verification of the system. In the experiments, video, skeleton and audio modal data were fused, the multi-modal channel attention mechanism was used, the soft attention mechanism dynamically adjusted the loss function, and exploratory experiments were carried out on the weights of the different task loss functions. The gesture recognition accuracy results are shown in Table 2. When the hyper-parameter T = 2, the gesture detection task reaches an accuracy of 99.80%, and the 21-class gesture recognition task reaches an accuracy of 95.02%.
TABLE 2 results of the experiment
In general, the present invention utilizes a multi-modal channel attention mechanism to extract task-related features for different tasks from shared multi-modal features, forming task-related feature layers. For multi-task collaborative training, a soft attention mechanism is used to dynamically adjust the multi-task loss function and to adjust the importance of the different tasks to the model in real time, avoiding the situation in which the model becomes biased during training toward a task that is easy to learn while neglecting the others, which would degrade their results. The method achieves better fusion of multi-modal information and coordinates the relevance among multiple tasks. Compared with traditional gesture detection methods, the accuracy is significantly improved, and cooperative prediction of multiple tasks is realized.
It is to be noted and understood that various modifications and improvements can be made to the invention described in detail above without departing from the spirit and scope of the invention as claimed in the appended claims. Accordingly, the scope of the claimed subject matter is not limited by any of the specific exemplary teachings provided.

Claims (10)

1. A multi-mode multi-task model for gesture detection and gesture recognition comprises a modal feature extraction module, a multi-mode fusion module and a model multi-task classification module, wherein,
the modal feature extraction module comprises a network structure and a shared feature layer which respectively extract different modal data features, and is used for preprocessing the multimodal data and extracting shared multimodal features;
the multi-mode fusion module comprises a multi-mode channel attention module and a task related feature layer, the multi-mode fusion module is connected with the modal feature extraction module, the shared multi-mode features are used as input of the multi-mode channel attention module, and the fused task related features are extracted to obtain the task related feature layer;
the model multi-task classification module is connected with the multi-mode fusion module, and classifies each task by taking the fused task related characteristics as input;
and network parameters of the modal feature extraction module, the multi-modal fusion module and the model multi-task classification module are updated iteratively in the training process.
2. The model of claim 1, wherein a multi-task loss function is dynamically adjusted based on a soft attention mechanism during training.
3. The model of claim 1, the multi-modal channel attention module comprising an upper branch and a lower branch, the upper branch being formed by a 2D convolution kernel, the lower branch being formed by a 2D convolution kernel of the same size as the upper branch and a sigmoid function, modal characteristics output by the upper branch being matrix multiplied by attention values output by the lower branch to obtain task-related characteristics.
4. The model of claim 1, the multimodal data comprising video data, skeletal data, audio data.
5. The model of claim 2, wherein the multi-task loss function is
L = λ1L1 + λ2L2, where L is the multi-task loss function, L1 is the binary cross-entropy loss function used for the gesture detection task, L2 is the multi-class cross-entropy loss function used for the gesture recognition task, and λ1 and λ2 are the weights of L1 and L2 in L respectively, whose sizes are dynamically adjusted during training using the following formulas,
λi(t) = 2·exp(wi(t-1)/T) / [exp(w1(t-1)/T) + exp(w2(t-1)/T)]
wi(t-1) = Li(t-1) / Li(t-2)
wherein i represents the i-th task, i = 1, 2; t is the number of network training iterations, wi(·) is the relative descent rate of the loss function, i.e. the ratio of the loss function of the current iteration to that of the previous iteration, and T is a hyper-parameter for controlling the task weight.
6. The model of claim 5, wherein the hyper-parameter T = 2.
7. A training method for the model of claim 5, comprising:
step 1, extracting shared multi-modal characteristics of a training sample by adopting the modal characteristic extraction module;
step 2, extracting the fused task related features based on the shared multi-modal features by adopting the multi-modal channel attention module;
step 3, dynamically adjusting a multitask loss function based on a soft attention mechanism;
and 4, iteratively updating the weight values of different task loss functions in the multitask loss function and the network parameters of the modal feature extraction module, the multi-modal fusion module and the model multitask classification module until the model converges.
8. A method for gesture detection and gesture recognition using the model generated by the method of claim 7, comprising:
step 1, preprocessing the multi-modal data of a gesture to be recognized, and extracting shared multi-modal features;
step 2, extracting the fused task related features by adopting a multi-mode channel attention mechanism based on the shared multi-mode features;
and 3, performing gesture detection and gesture recognition by using a model multi-task classification module based on the fused task related characteristics.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to claim 7 or 8.
10. A computer device comprising a memory and a processor, a computer program being stored on the memory and being executable on the processor, characterized in that the steps of the method as claimed in claim 7 or 8 are implemented when the processor executes the program.
CN202110311898.4A 2021-03-24 2021-03-24 Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof Pending CN112966644A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110311898.4A CN112966644A (en) 2021-03-24 2021-03-24 Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110311898.4A CN112966644A (en) 2021-03-24 2021-03-24 Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof

Publications (1)

Publication Number Publication Date
CN112966644A true CN112966644A (en) 2021-06-15

Family

ID=76278286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110311898.4A Pending CN112966644A (en) 2021-03-24 2021-03-24 Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof

Country Status (1)

Country Link
CN (1) CN112966644A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378855A (en) * 2021-06-22 2021-09-10 北京百度网讯科技有限公司 Method for processing multitask, related device and computer program product
CN113705662A (en) * 2021-08-26 2021-11-26 中国银联股份有限公司 Collaborative training method and device and computer readable storage medium
CN117636481A (en) * 2024-01-25 2024-03-01 江西师范大学 Multi-mode joint gesture action generation method based on diffusion model
WO2024108377A1 (en) * 2022-11-22 2024-05-30 上海成电福智科技有限公司 Multimodal multi-task workshop target recognition method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
CN111246256A (en) * 2020-02-21 2020-06-05 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning
CN112163447A (en) * 2020-08-18 2021-01-01 桂林电子科技大学 Multi-task real-time gesture detection and recognition method based on Attention and Squeezenet
CN112183547A (en) * 2020-10-19 2021-01-05 中国科学院计算技术研究所 Multi-mode data-based multi-task learning method and system
CN112507947A (en) * 2020-12-18 2021-03-16 宜通世纪物联网研究院(广州)有限公司 Gesture recognition method, device, equipment and medium based on multi-mode fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
CN111246256A (en) * 2020-02-21 2020-06-05 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning
CN112163447A (en) * 2020-08-18 2021-01-01 桂林电子科技大学 Multi-task real-time gesture detection and recognition method based on Attention and Squeezenet
CN112183547A (en) * 2020-10-19 2021-01-05 中国科学院计算技术研究所 Multi-mode data-based multi-task learning method and system
CN112507947A (en) * 2020-12-18 2021-03-16 宜通世纪物联网研究院(广州)有限公司 Gesture recognition method, device, equipment and medium based on multi-mode fusion

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SHIKUN LIU等: "End-to-End Multi-Task Learning with Attention" *
YANG GU: "A Collaborative Multi-modal Fusion Method Based on Random Variational Information Bottleneck for Gesture Recognition" *
YINGWEI ZHANG 等: "Learning Effective Spatial–Temporal Features for sEMG Armband-Based Gesture Recognition", 《IEEE INTERNET OF THINGS JOURNAL》 *
高明柯;赵卓;逄涛;王天保;邹一波;黄晨;李德旭;: "Gesture recognition method based on attention mechanism and feature fusion" (基于注意力机制和特征融合的手势识别方法)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378855A (en) * 2021-06-22 2021-09-10 北京百度网讯科技有限公司 Method for processing multitask, related device and computer program product
CN113705662A (en) * 2021-08-26 2021-11-26 中国银联股份有限公司 Collaborative training method and device and computer readable storage medium
WO2024108377A1 (en) * 2022-11-22 2024-05-30 上海成电福智科技有限公司 Multimodal multi-task workshop target recognition method
CN117636481A (en) * 2024-01-25 2024-03-01 江西师范大学 Multi-mode joint gesture action generation method based on diffusion model
CN117636481B (en) * 2024-01-25 2024-05-14 江西师范大学 Multi-mode joint gesture action generation method based on diffusion model

Similar Documents

Publication Publication Date Title
CN112784764B (en) Expression recognition method and system based on local and global attention mechanism
CN112966644A (en) Multi-mode multi-task model for gesture detection and gesture recognition and training method thereof
CN114398961A (en) Visual question-answering method based on multi-mode depth feature fusion and model thereof
CN113628294A (en) Image reconstruction method and device for cross-modal communication system
CN109783666A (en) A kind of image scene map generation method based on iteration fining
CN108073851B (en) Grabbing gesture recognition method and device and electronic equipment
CN111816169B (en) Method and device for training Chinese and English hybrid speech recognition model
CN111046661A (en) Reading understanding method based on graph convolution network
Qi et al. Personalized sketch-based image retrieval by convolutional neural network and deep transfer learning
CN110968235B (en) Signal processing device and related product
CN115223020A (en) Image processing method, image processing device, electronic equipment and readable storage medium
CN112667071A (en) Gesture recognition method, device, equipment and medium based on random variation information
CN114913590B (en) Data emotion recognition method, device and equipment and readable storage medium
Zhang et al. R2Net: Residual refinement network for salient object detection
Sahu et al. Dynamic routing using inter capsule routing protocol between capsules
Le et al. Multi visual and textual embedding on visual question answering for blind people
Rastgoo et al. Word separation in continuous sign language using isolated signs and post-processing
Li et al. Spatial-temporal dynamic hand gesture recognition via hybrid deep learning model
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
Doughan et al. Novel preprocessors for convolution neural networks
CN116244473B (en) Multi-mode emotion recognition method based on feature decoupling and graph knowledge distillation
Kamil et al. Literature Review of Generative models for Image-to-Image translation problems
CN114863548B (en) Emotion recognition method and device based on nonlinear space characteristics of human body movement gestures
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
He et al. ECS-SC: Long-tailed classification via data augmentation based on easily confused sample selection and combination

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210615

RJ01 Rejection of invention patent application after publication