CN113537024B - Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism - Google Patents

Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism

Info

Publication number
CN113537024B
CN113537024B (application CN202110773432.6A)
Authority
CN
China
Prior art keywords
sign language
network
neural network
layer
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110773432.6A
Other languages
Chinese (zh)
Other versions
CN113537024A (en)
Inventor
袁甜甜
周乐员
张剑华
陈胜勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Technology filed Critical Tianjin University of Technology
Priority to CN202110773432.6A priority Critical patent/CN113537024B/en
Publication of CN113537024A publication Critical patent/CN113537024A/en
Application granted granted Critical
Publication of CN113537024B publication Critical patent/CN113537024B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

A weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism uses an encoder-decoder neural network built around that fusion mechanism, combined with a Transformer language model, to recognize continuous sign language from continuous sign language video and to generate translated sentences. Pose information is extracted by a transfer-learning convolution module pre-trained on the large-scale image dataset ImageNet; a bidirectional gated recurrent network and a multi-layer residual-stacked gated recurrent network encode temporal features; the multi-layer temporal attention fusion mechanism fuses low-level semantics with high-dimensional features of sign language grammar; greedy decoding and teacher-forcing training yield sign language recognition sentences that express the meaning of the signer's actions; and a language model addresses the difficulty of recognizing and translating continuous sign language video. The invention can facilitate communication between hearing people and deaf signers.

Description

Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism
Technical Field
The invention relates to the technical fields of computer vision, artificial intelligence, data mining, natural language processing, and deep learning, and in particular to a weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism.
Background
Computer vision enables a computer or machine to perceive the world as human eyes do, and is widely applied to graphics and images, three-dimensional reconstruction, target tracking, face detection, and video understanding. Natural language processing enables a computer or machine to reason with language as humans do, and is widely applied to machine translation, reading comprehension, language generation, and multi-turn dialogue. Before the advent of deep learning, computer vision and natural language processing relied heavily on manually extracted features and hand-crafted grammar rules. As data volumes grew and the cost of GPU computation fell, deep learning techniques typified by deep neural networks gradually emerged, and computer vision and natural language techniques based on deep learning became widespread. Because deep learning has strong representation learning capability, a neural network can learn and understand knowledge through end-to-end joint training on data and labels, without requiring people to manually extract features or formulate complicated rules. Therefore, by combining deep-learning-based computer vision perception with natural language processing cognition, a weakly supervised neural network algorithm with a multi-layer temporal attention fusion mechanism is designed for continuous sign language video recognition and translation, so that a computer can effectively understand the content expressed by the signer in a sign language video.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a weakly supervised neural network algorithm with a multi-layer temporal attention fusion mechanism for continuous sign language video recognition and translation. It addresses the difficulty of recognizing and translating continuous sign language video, so that a computer can learn to understand the meaning expressed by a signer and thereby facilitate communication between hearing people and signers.
The technical scheme of the invention is as follows:
a weakly supervised neural network sign language recognition method of a multi-layer time sequence attention fusion mechanism comprises the following steps:
1) A sign language video V contains frames (f_1, ..., f_u). For the continuous sign language recognition task, the neural network learns the conditional probability P(G|V) to generate a sign language recognition sequence G = (g_1, ..., g_n); for the continuous sign language translation task, the network learns the conditional probability P(L|G) to generate the natural language sequence L = (l_1, ..., l_u);
Each sign language video is sampled uniformly and randomly by frame using the OpenCV library so that all videos have the same number of frames; the corpus label sentences of the sign language videos are segmented into words, and each sign language video is annotated automatically using the Python programming language;
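By way of illustration, a minimal sketch of this uniform frame sampling with OpenCV is shown below; the target frame count and the helper name sample_frames are assumptions, since the patent does not fix them.

```python
import cv2
import numpy as np

def sample_frames(video_path, num_frames=128):
    """Uniformly sample a fixed number of frames from a sign language video
    so that every clip in a batch has the same length (num_frames is an
    assumed value, not specified in the patent)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    # Evenly spaced indices over the whole clip; indices repeat if the clip
    # is shorter than num_frames.
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    return [frames[i] for i in idx]
```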
2) Sign language videos of a specified batch size are fed frame by frame into the encoder part of the neural network. Features are first extracted from each sign language video frame by a pre-trained convolutional neural network module, and effective pose information is then obtained through two residual fully connected layers, serving as the spatial embedding of the network:

S_u = SpatialEmbedding(f_u)    (1)

where f_u denotes a sign language video frame and S_u is the spatial embedding vector obtained from the convolutional network's feature extraction;

The spatial embedding vectors of the sign language video contain rich feature information. They are fed into the next module, a bidirectional gated recurrent network, which effectively models the frame sequence data along the time dimension; forward and backward bidirectional modeling yields the sign language action context information B_u, and B_u is passed through a three-layer residual-stacked unidirectional gated recurrent network to obtain higher-dimensional abstract information E_u;

Through these operations, the encoder part of the neural network performs spatio-temporal encoding of the sign language video to obtain a hidden vector h_u, which is passed to the decoder part of the neural network. The decoder network combines h_u with the vector C_mix obtained from the multi-layer temporal attention fusion mechanism, obtains a recognized sign language word at each time step of the multi-layer residual gated recurrent network, and finally combines the words into a complete sign language sentence;
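The encoder path just described (frozen pre-trained CNN, two residual fully connected layers, a bidirectional GRU producing B_u, and a three-layer residual-stacked unidirectional GRU producing E_u) could be sketched in PyTorch roughly as follows; the class name, dimensions, and activation choices are illustrative assumptions rather than the patent's exact implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SignEncoder(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=2600, hidden_dim=1300):
        super().__init__()
        # Frozen ResNet-152 pre-trained on ImageNet, classification head removed.
        resnet = models.resnet152(pretrained=True)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.cnn.parameters():
            p.requires_grad = False
        # Two trainable residual fully connected layers (spatial embedding).
        self.fc1 = nn.Linear(feat_dim, embed_dim)
        self.fc2 = nn.Linear(embed_dim, embed_dim)
        # Bidirectional GRU: 2 * 1300 = 2600-dim output, matching embed_dim.
        self.bigru = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        # Three unidirectional GRU layers stacked with residual connections.
        self.gru_stack = nn.ModuleList(
            [nn.GRU(embed_dim, embed_dim, batch_first=True) for _ in range(3)])

    def forward(self, frames):                 # frames: (B, T, 3, 224, 224)
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)   # (B*T, 2048)
        s = torch.relu(self.fc1(feats))
        s = torch.relu(self.fc2(s)) + s                      # residual FC block
        s = s.view(B, T, -1)                                 # spatial embedding S_u
        b_u, _ = self.bigru(s)                               # context information B_u
        e_u = b_u
        for gru in self.gru_stack:                           # residual stack -> E_u
            out, _ = gru(e_u)
            e_u = out + e_u
        return b_u, e_u
```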
The fusion vector of the multi-layer temporal attention mechanism is computed as follows. First a score is calculated: the decoder hidden vector of the previous step, h_{n-1}, is used as the query term and is combined with E_u and B_u respectively, yielding two score vectors score1 and score2:

score1(h_{n-1}, E_u) = E_u W h_{n-1}^T    (2)

score2(h_{n-1}, B_u) = B_u D h_{n-1}^T    (3)

where W and D are trainable neural network weight parameters. The two scores from the scoring functions above are then normalized over the encoder time steps to obtain the temporal attention weights r and p of the sign language video, which align sign language video frames with words:

r_{n,k} = exp(score1(h_{n-1}, E_k)) / Σ_{k'} exp(score1(h_{n-1}, E_{k'}))    (4)

p_{n,k} = exp(score2(h_{n-1}, B_k)) / Σ_{k'} exp(score2(h_{n-1}, B_{k'}))    (5)

where k denotes the k-th time step in the encoder's temporal dimension and n denotes the n-th time step in the decoder's temporal dimension. The temporal attention weights r and p are then combined with E_u and B_u respectively to obtain two sign language attention context vectors C_t and C_b:

C_t = Σ_k r_{n,k} E_k    (6)

C_b = Σ_k p_{n,k} B_k    (7)

C_t and C_b are then fused to obtain C_mix:

C_mix = Fuse(C_t, C_b)    (8)

This attention context vector is called the sign language sequence context fusion vector C_mix.
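A rough PyTorch sketch of this fusion mechanism is given below; it assumes softmax normalization for the attention weights and element-wise addition for the final fusion, both of which are illustrative choices rather than necessarily the exact published operations.

```python
import torch
import torch.nn as nn

class MultiLevelTemporalAttention(nn.Module):
    def __init__(self, enc_dim=2600, dec_dim=800):
        super().__init__()
        self.W = nn.Linear(dec_dim, enc_dim, bias=False)  # trainable weight W, eq. (2)
        self.D = nn.Linear(dec_dim, enc_dim, bias=False)  # trainable weight D, eq. (3)

    def forward(self, h_prev, E, B):
        # h_prev: (batch, dec_dim) previous decoder hidden state h_{n-1}
        # E: (batch, T, enc_dim) high-level features; B: (batch, T, enc_dim) BiGRU context
        score1 = torch.bmm(E, self.W(h_prev).unsqueeze(2)).squeeze(2)   # (batch, T)
        score2 = torch.bmm(B, self.D(h_prev).unsqueeze(2)).squeeze(2)
        r = torch.softmax(score1, dim=1)                 # temporal attention weights r
        p = torch.softmax(score2, dim=1)                 # temporal attention weights p
        c_t = torch.bmm(r.unsqueeze(1), E).squeeze(1)    # context vector C_t
        c_b = torch.bmm(p.unsqueeze(1), B).squeeze(1)    # context vector C_b
        c_mix = c_t + c_b                                # assumed fusion for C_mix
        return c_mix
```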
3) In the decoding stage, decoding starts from an input <BOS> symbol, which serves as the start symbol of every training pass and is fed into the first time step of the decoder network. At the same time, C_mix is concatenated with the embedded sign language word vector and fed into the decoder at the current time step; after the non-linear operations of the decoder's four-layer residual-stacked gated recurrent network, the output passes through one fully connected layer to generate the most probable word for the current time step, and decoding proceeds cyclically until the <End> symbol is met, completing the generation of a full sign language recognition sentence.
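The greedy decoding loop of step 3) might look roughly like the following sketch; the decoder, attention, and embedding interfaces (init_hidden, the call signature, the vocabulary keys) are assumptions made for illustration.

```python
import torch

def greedy_decode(decoder, attention, E, B, embed, vocab, max_len=50):
    """Greedy decoding sketch: decoder is a stacked residual GRU taking
    [word embedding ; C_mix] per step; attention is the fusion module above.
    All interfaces are illustrative assumptions, not the patent's exact API."""
    bos, eos = vocab["<BOS>"], vocab["<End>"]
    token = torch.tensor([bos])
    h = decoder.init_hidden(batch_size=E.size(0))        # assumed helper
    words = []
    for _ in range(max_len):
        c_mix = attention(h[-1], E, B)                   # fuse encoder features
        step_in = torch.cat([embed(token), c_mix], dim=-1)
        logits, h = decoder(step_in, h)                  # one decoder time step
        token = logits.argmax(dim=-1)                    # greedy word choice
        if token.item() == eos:
            break
        words.append(token.item())
    return words
```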
Further, the language model generates natural language text that conforms to spoken expression; a Transformer is used as the language model for language learning, yielding the result of continuous sign language translation.

In the spatial embedding module of the encoder network, a convolutional neural network pre-trained on ImageNet is used and all of its parameters are frozen. The pre-trained convolutional network is ResNet-152, and the output of its penultimate or last layer is used; two trainable 2600-dimensional residual fully connected layers are added after the pre-trained convolutional neural network and residually connected to the output of the following bidirectional gated recurrent unit module. The hidden units of the encoder's bidirectional gated recurrent units are set to 1300 dimensions; because past and future information is concatenated, the output is 2600-dimensional, and the hidden unit dimension of each subsequent gated recurrent layer is also 2600 so that residual connections can be made. At the decoder stage, the word embedding dimension for sign language words is set to 256 and the hidden unit dimension of each gated recurrent network in the decoder is set to 800.

During training, PyTorch's default Adam optimizer and cross-entropy loss function are used, with each batch set to 10. The learning rate schedule has two stages: the first stage uses 0.00004, and after the 8th epoch the learning rate is adjusted to 0.000004 and training continues for 6 more epochs, completing the convergence of the neural network parameters.
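A minimal sketch of this training setup, under the assumption that the full model and a data loader yielding (frames, target word indices) pairs already exist; only the optimizer, loss, batch size, and two-stage learning rate follow the description above, everything else is illustrative.

```python
import torch
import torch.nn as nn

def train(model, train_loader, num_epochs=14, device="cuda"):
    """Two-stage schedule from the description: lr 4e-5 for the first 8 epochs,
    then 4e-6 for 6 more epochs; batches of 10, Adam, cross entropy."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=4e-5)
    model.to(device)
    for epoch in range(num_epochs):
        if epoch == 8:                                  # switch to the second stage
            for g in optimizer.param_groups:
                g["lr"] = 4e-6
        for frames, targets in train_loader:
            frames, targets = frames.to(device), targets.to(device)
            # Assumed model signature: teacher forcing feeds gold words to the decoder.
            logits = model(frames, targets)             # (batch, seq_len, vocab)
            loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```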
The technical conception of the invention is as follows: relying on the strong representation learning ability of deep learning and a large-scale sign language video dataset, the sign language video is modeled with the strong feature extraction capability of the convolutional neural network and the long-sequence modeling capability of the gated recurrent network, combined with the multi-layer temporal attention fusion technique and the strong translation capability of the Transformer language model, so that continuous sign language translation sentences in natural spoken word order are obtained from continuous sign language video. The proposed algorithm learns a sequence-to-sequence mapping: a sign language video V contains frames (f_1, ..., f_u); for the continuous sign language recognition task, the neural network learns the conditional probability P(G|V) to generate a sign language recognition sequence G = (g_1, ..., g_n); for the continuous sign language translation task, the network learns the conditional probability P(L|G) to generate the natural language sequence L = (l_1, ..., l_u).

The invention has the following beneficial effects: both continuous sign language recognition and translation are weakly supervised video-to-text tasks that use only sentence-level annotation; each word does not need to be individually labeled through temporal segmentation of the video, and with end-to-end deep neural network training the network learns to recognize and understand the signer's meaning from the sign language video.
Drawings
FIG. 1 is a block diagram of a weakly supervised neural network sign language identification method of a multi-layer temporal attention fusion mechanism.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, a weakly supervised neural network sign language recognition method of a multi-layer time-series attention fusion mechanism includes the following steps:
1) A sequence-to-sequence mapping is learned: a sign language video V contains frames (f_1, ..., f_u); for the continuous sign language recognition task, the neural network learns the conditional probability P(G|V) to generate a sign language recognition sequence G = (g_1, ..., g_n); for the continuous sign language translation task, the network learns the conditional probability P(L|G) to generate the natural language sequence L = (l_1, ..., l_u);

Each sign language video is sampled uniformly and randomly by frame using the OpenCV library so that all videos have the same number of frames; the corpus label sentences of the sign language videos are segmented into words, and each sign language video is annotated automatically using the Python programming language;

2) Sign language videos of a specified batch size are fed frame by frame into the encoder part of the neural network. Features are first extracted from each sign language video frame by a pre-trained convolutional neural network module, and effective pose information is then obtained through two residual fully connected layers, serving as the spatial embedding of the network:

S_u = SpatialEmbedding(f_u)    (1)

where f_u denotes a sign language video frame and S_u is the spatial embedding vector obtained from the convolutional network's feature extraction;

The spatial embedding vectors of the sign language video contain rich feature information. They are fed into the next module, a bidirectional gated recurrent network, which effectively models the frame sequence data along the time dimension; forward and backward bidirectional modeling yields the sign language action context information B_u, and B_u is passed through a three-layer residual-stacked unidirectional gated recurrent network to obtain higher-dimensional abstract information E_u;

Through these operations, the encoder part of the neural network performs spatio-temporal encoding of the sign language video to obtain a hidden vector h_u, which is passed to the decoder part of the neural network. The decoder network combines h_u with the vector C_mix obtained from the multi-layer temporal attention fusion mechanism, obtains a recognized sign language word at each time step of the multi-layer residual gated recurrent network, and finally combines the words into a complete sign language sentence;
The fusion vector of the multi-layer temporal attention mechanism is computed as follows. First a score is calculated: the decoder hidden vector of the previous step, h_{n-1}, is used as the query term and is combined with E_u and B_u respectively, yielding two score vectors score1 and score2:

score1(h_{n-1}, E_u) = E_u W h_{n-1}^T    (2)

score2(h_{n-1}, B_u) = B_u D h_{n-1}^T    (3)

where W and D are trainable neural network weight parameters. The two scores from the scoring functions above are then normalized over the encoder time steps to obtain the temporal attention weights r and p of the sign language video, which align sign language video frames with words:

r_{n,k} = exp(score1(h_{n-1}, E_k)) / Σ_{k'} exp(score1(h_{n-1}, E_{k'}))    (4)

p_{n,k} = exp(score2(h_{n-1}, B_k)) / Σ_{k'} exp(score2(h_{n-1}, B_{k'}))    (5)

where k denotes the k-th time step in the encoder's temporal dimension and n denotes the n-th time step in the decoder's temporal dimension. The temporal attention weights r and p are then combined with E_u and B_u respectively to obtain two sign language attention context vectors C_t and C_b:

C_t = Σ_k r_{n,k} E_k    (6)

C_b = Σ_k p_{n,k} B_k    (7)

C_t and C_b are then fused to obtain C_mix:

C_mix = Fuse(C_t, C_b)    (8)

This attention context vector is called the sign language sequence context fusion vector C_mix.
3) In the decoding stage, decoding starts from an input <BOS> symbol, which serves as the start symbol of every training pass and is fed into the first time step of the decoder network. At the same time, C_mix is concatenated with the embedded sign language word vector and fed into the decoder at the current time step; after the non-linear operations of the decoder's four-layer residual-stacked gated recurrent network, the output passes through one fully connected layer to generate the most probable word for the current time step, and decoding proceeds cyclically until the <End> symbol is met, completing the generation of a full sign language recognition sentence.
Further, the language model aims to generate natural language text that conforms to spoken expression. Since sentences produced by continuous sign language recognition may not match spoken descriptions, a Transformer is used as the language model for language learning, further obtaining the result of continuous sign language translation. A Transformer network with very few parameters is trained so that sign language recognition sentences are mapped one-to-one to translated natural language, and the originally static positional encoding in the Transformer structure is replaced with dynamically trainable positional encoding, which makes the positional relations between words in a sequence easier to learn.
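A small Transformer language model with trainable positional embeddings, as described here, could be sketched as follows; the layer counts, model width, and vocabulary handling are assumptions.

```python
import torch
import torch.nn as nn

class SmallTransformerLM(nn.Module):
    """Maps recognized gloss sequences to spoken-language sentences.
    Learnable position embeddings replace the static sinusoidal encoding."""
    def __init__(self, src_vocab, tgt_vocab, d_model=256, nhead=4,
                 num_layers=2, max_len=64):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)      # trainable positions
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def add_pos(self, x):
        pos = torch.arange(x.size(1), device=x.device)
        return x + self.pos_embed(pos)

    def forward(self, src_ids, tgt_ids):
        src = self.add_pos(self.src_embed(src_ids))
        tgt = self.add_pos(self.tgt_embed(tgt_ids))
        # Causal mask so each target position only attends to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1)).to(src_ids.device)
        h = self.transformer(src, tgt, tgt_mask=mask)
        return self.out(h)
```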
The language model is built with the PyTorch deep learning framework, and the network parameters are configured as follows: in the spatial embedding module of the encoder network, a convolutional neural network pre-trained on ImageNet is used and, for ease of training, all of its parameters are frozen. The pre-trained convolutional network is ResNet-152, and the output of its penultimate or last layer is used; two trainable 2600-dimensional residual fully connected layers are added after the pre-trained convolutional neural network and residually connected to the output of the following bidirectional gated recurrent unit module. The hidden units of the encoder's bidirectional gated recurrent units are set to 1300 dimensions; because past and future information is concatenated, the output is 2600-dimensional, and the hidden unit dimension of each gated recurrent layer is also set to 2600 so that residual connections can be made. At the decoder stage, the word embedding dimension for sign language words is set to 256 and the hidden unit dimension of each gated recurrent network in the decoder is set to 800.

During training, PyTorch's default Adam optimizer and cross-entropy loss function are used, with each batch set to 10. The learning rate is set in two stages: the first stage uses 0.00004; after the 8th epoch the learning rate is adjusted to 0.000004 and training continues for 6 more epochs, which completes the convergence of the neural network parameters.
Therefore, the weakly supervised neural network method with a multi-layer temporal attention fusion mechanism for continuous sign language video recognition and translation provided by the invention enables the network to recognize and understand the signer's meaning from sign language video.

Claims (4)

1. A weakly supervised neural network sign language recognition method of a multi-layer time sequence attention fusion mechanism is characterized by comprising the following steps:
1) a sign language video V contains frames (f_1, ..., f_u); for the continuous sign language recognition task, the neural network learns the conditional probability P(G|V) to generate a sign language recognition sequence G = (g_1, ..., g_n); for the continuous sign language translation task, the network learns the conditional probability P(L|G) to generate the natural language sequence L = (l_1, ..., l_u);

each sign language video is sampled uniformly and randomly by frame using the OpenCV library so that all videos have the same number of frames; the corpus label sentences of the sign language videos are segmented into words, and each sign language video is annotated automatically using the Python programming language;

2) sign language videos of a specified batch size are fed frame by frame into the encoder part of the neural network; features are first extracted from each sign language video frame by a pre-trained convolutional neural network module, and effective pose information is then obtained through two residual fully connected layers, serving as the spatial embedding of the network:

S_u = SpatialEmbedding(f_u)    (1)

where f_u denotes a sign language video frame and S_u is the spatial embedding vector obtained after the convolutional network's feature extraction;

the spatial embedding vectors of the sign language video contain rich feature information and are fed into the next module, a bidirectional gated recurrent network, which effectively models the frame sequence data along the time dimension; forward and backward bidirectional modeling yields the sign language action context information B_u, and B_u is passed through a three-layer residual-stacked unidirectional gated recurrent network to obtain higher-dimensional abstract information E_u;

through these operations, the encoder part of the neural network performs spatio-temporal encoding of the sign language video to obtain a hidden vector h_u, which is passed to the decoder part of the neural network; the decoder network combines h_u with the vector C_mix obtained from the multi-layer temporal attention fusion mechanism, obtains a recognized sign language word at each time step of the multi-layer residual gated recurrent network, and finally combines the words into a complete sign language sentence;
the fusion vector of the multi-layer temporal attention mechanism is computed as follows: first a score is calculated, with the decoder hidden vector of the previous step, h_{n-1}, used as the query term and combined with E_u and B_u respectively, yielding two score vectors score1 and score2:

score1(h_{n-1}, E_u) = E_u W h_{n-1}^T    (2)

score2(h_{n-1}, B_u) = B_u D h_{n-1}^T    (3)

where W and D are trainable neural network weight parameters; the two scores from the scoring functions above are then normalized over the encoder time steps to obtain the temporal attention weights r and p of the sign language video, which align sign language video frames with words:

r_{n,k} = exp(score1(h_{n-1}, E_k)) / Σ_{k'} exp(score1(h_{n-1}, E_{k'}))    (4)

p_{n,k} = exp(score2(h_{n-1}, B_k)) / Σ_{k'} exp(score2(h_{n-1}, B_{k'}))    (5)

where k denotes the k-th time step in the encoder's temporal dimension and n denotes the n-th time step in the decoder's temporal dimension; the temporal attention weights r and p are then combined with E_u and B_u respectively to obtain two sign language attention context vectors C_t and C_b:

C_t = Σ_k r_{n,k} E_k    (6)

C_b = Σ_k p_{n,k} B_k    (7)

C_t and C_b are then fused to obtain C_mix:

C_mix = Fuse(C_t, C_b)    (8)

this attention context vector is called the sign language sequence context fusion vector C_mix;
3) in the decoding stage, decoding starts from an input <BOS> symbol, which serves as the start symbol of every training pass and is fed into the first time step of the decoder network; at the same time, C_mix is concatenated with the embedded sign language word vector and fed into the decoder at the current time step; after the non-linear operations of the decoder's four-layer residual-stacked gated recurrent network, the output passes through one fully connected layer to generate the most probable word for the current time step, and decoding proceeds cyclically until the <End> symbol is met, completing the generation of a full sign language recognition sentence.
2. The method as claimed in claim 1, wherein the language model generates natural language text conforming to spoken expression, a Transformer is used as the language model for language learning to further obtain the result of continuous sign language translation, and the originally static positional encoding in the Transformer structure is changed to a dynamically trainable positional encoding.

3. The weakly supervised neural network sign language recognition method of the multi-layer time sequence attention fusion mechanism as claimed in claim 1 or 2, wherein a language model is built using the PyTorch deep learning framework and the network parameters are configured as follows: in the spatial embedding module of the encoder network, a convolutional neural network pre-trained on ImageNet is used and all of its parameters are frozen; the pre-trained convolutional network is ResNet-152, and the output of its penultimate or last layer is used; two trainable 2600-dimensional residual fully connected layers are added after the pre-trained convolutional neural network and residually connected to the output of the following bidirectional gated recurrent unit module; the hidden units of the encoder's bidirectional gated recurrent units are set to 1300 dimensions, and because past and future information is concatenated, the output is 2600-dimensional, with the hidden unit dimension of each gated recurrent layer also set to 2600 so that residual connections can be made; at the decoder stage, the word embedding dimension for sign language words is set to 256 and the hidden unit dimension of each gated recurrent network in the decoder is set to 800.

4. The weakly supervised neural network sign language recognition method of the multi-layer time sequence attention fusion mechanism as claimed in claim 3, wherein PyTorch's default Adam optimizer and cross-entropy loss function are adopted in the training process, each batch is set to 10, the learning rate is set in two stages, the first stage uses 0.00004, and after the 8th epoch the learning rate is adjusted to 0.000004 and training continues for 6 epochs, completing the convergence of the neural network parameters.
CN202110773432.6A 2021-07-08 2021-07-08 Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism Active CN113537024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110773432.6A CN113537024B (en) 2021-07-08 2021-07-08 Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110773432.6A CN113537024B (en) 2021-07-08 2021-07-08 Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism

Publications (2)

Publication Number Publication Date
CN113537024A CN113537024A (en) 2021-10-22
CN113537024B true CN113537024B (en) 2022-06-21

Family

ID=78127177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110773432.6A Active CN113537024B (en) 2021-07-08 2021-07-08 Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism

Country Status (1)

Country Link
CN (1) CN113537024B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114115531B (en) * 2021-11-11 2022-09-30 合肥工业大学 End-to-end sign language recognition method based on attention mechanism
CN113920989B (en) * 2021-12-13 2022-04-01 中国科学院自动化研究所 End-to-end system and equipment for voice recognition and voice translation
CN114812551B (en) * 2022-03-09 2024-07-26 同济大学 Indoor environment robot navigation natural language instruction generation method
CN114694076A (en) * 2022-04-08 2022-07-01 浙江理工大学 Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion
CN115345257B (en) * 2022-09-22 2023-06-06 中山大学 Flight trajectory classification model training method, classification method, device and storage medium
CN116089593B (en) * 2023-03-24 2023-06-13 齐鲁工业大学(山东省科学院) Multi-pass man-machine dialogue method and device based on time sequence feature screening coding module
CN117671730A (en) * 2023-11-29 2024-03-08 四川师范大学 Continuous sign language recognition method based on local self-attention

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN108647603A (en) * 2018-04-28 2018-10-12 清华大学 Semi-supervised continuous sign language interpretation method based on attention mechanism and device
CN109190578A (en) * 2018-09-13 2019-01-11 合肥工业大学 The sign language video interpretation method merged based on convolution network with Recognition with Recurrent Neural Network
CN110163181A (en) * 2019-05-29 2019-08-23 中国科学技术大学 Sign Language Recognition Method and device
CN110399850A (en) * 2019-07-30 2019-11-01 西安工业大学 A kind of continuous sign language recognition method based on deep neural network
CN111340006A (en) * 2020-04-16 2020-06-26 深圳市康鸿泰科技有限公司 Sign language identification method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325099B (en) * 2020-01-21 2022-08-26 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network
CN111339837B (en) * 2020-02-08 2022-05-03 河北工业大学 Continuous sign language recognition method
CN112101262B (en) * 2020-09-22 2022-09-06 中国科学技术大学 Multi-feature fusion sign language recognition method and network model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN108647603A (en) * 2018-04-28 2018-10-12 清华大学 Semi-supervised continuous sign language interpretation method based on attention mechanism and device
CN109190578A (en) * 2018-09-13 2019-01-11 合肥工业大学 The sign language video interpretation method merged based on convolution network with Recognition with Recurrent Neural Network
CN110163181A (en) * 2019-05-29 2019-08-23 中国科学技术大学 Sign Language Recognition Method and device
CN110399850A (en) * 2019-07-30 2019-11-01 西安工业大学 A kind of continuous sign language recognition method based on deep neural network
CN111340006A (en) * 2020-04-16 2020-06-26 深圳市康鸿泰科技有限公司 Sign language identification method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An Attention-Enhanced Multi-Scale and Dual Sign Language Recognition Network Based on a Graph Convolution Network; Lu Meng et al.; Sensors; 2021-02-05; full text *
Multi-modal sign language recognition fusing attention mechanism and connectionist temporal classification; 王军 et al.; 《信号处理》 (Journal of Signal Processing); 2020-09-30; Vol. 36, No. 9; full text *

Also Published As

Publication number Publication date
CN113537024A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN113537024B (en) Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism
Wu et al. Multimodal large language models: A survey
CN107979764B (en) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
CN110633683B (en) Chinese sentence-level lip language recognition method combining DenseNet and resBi-LSTM
CN111967272B (en) Visual dialogue generating system based on semantic alignment
CN111339837A (en) Continuous sign language recognition method
CN114245203B (en) Video editing method, device, equipment and medium based on script
CN113780059B (en) Continuous sign language identification method based on multiple feature points
CN109933773A (en) A kind of multiple semantic sentence analysis system and method
CN113221571B (en) Entity relation joint extraction method based on entity correlation attention mechanism
CN113792177A (en) Scene character visual question-answering method based on knowledge-guided deep attention network
CN114385802A (en) Common-emotion conversation generation method integrating theme prediction and emotion inference
CN116246213B (en) Data processing method, device, equipment and medium
CN111354246A (en) System and method for helping deaf-mute to communicate
CN111259785A (en) Lip language identification method based on time offset residual error network
CN115563335A (en) Model training method, image-text data processing device, image-text data processing equipment and image-text data processing medium
CN116091978A (en) Video description method based on advanced semantic information feature coding
CN113553445B (en) Method for generating video description
CN113032535A (en) Visual question and answer method and device for assisting visually impaired people, computing equipment and storage medium
CN117272237B (en) Multi-modal fusion-based patent drawing multi-language graphic generation method and system
CN117994622A (en) Multi-mode perception fusion emotion recognition method and robot emotion interaction method
CN116151226B (en) Machine learning-based deaf-mute sign language error correction method, equipment and medium
CN111818397A (en) Video description generation method based on long-time and short-time memory network variant
CN115240713B (en) Voice emotion recognition method and device based on multi-modal characteristics and contrast learning
CN116311493A (en) Two-stage human-object interaction detection method based on coding and decoding architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant