CN113537024A - Weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism

Info

Publication number
CN113537024A
Authority
CN
China
Prior art keywords
sign language
network
neural network
decoder
time sequence
Legal status
Granted
Application number
CN202110773432.6A
Other languages
Chinese (zh)
Other versions
CN113537024B (en)
Inventor
袁甜甜
周乐员
张剑华
陈胜勇
Current Assignee
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Application filed by Tianjin University of Technology
Priority to CN202110773432.6A
Publication of CN113537024A
Application granted
Publication of CN113537024B
Legal status: Active

Classifications

    • G06F18/25: Pattern recognition; Analysing; Fusion techniques
    • G06N3/045: Neural networks; Architecture; Combinations of networks
    • G06N3/08: Neural networks; Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

A weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism uses an encoder-decoder neural network equipped with the multi-layer attention fusion mechanism, combined with a Transformer language model, to recognize continuous sign language from continuous sign language video and to generate translated sentences. Pose information is extracted by a transfer-learning convolution module pre-trained on the large-scale image dataset ImageNet; a bidirectional gated recurrent network and a multi-layer residual-stacked gated recurrent network perform temporal feature encoding; the multi-layer temporal attention fusion mechanism fuses the low-level semantics and high-dimensional features of sign language grammar; greedy decoding and a teacher-forcing training method yield sign language recognition sentences that express the meaning of the signer's actions; and a language model addresses the difficulty of recognizing and translating continuous sign language video. The invention can promote communication between hearing people and deaf signers.

Description

Weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism
Technical Field
The invention relates to the technical fields of computer vision, artificial intelligence, data mining, natural language processing, and deep learning, and in particular to a weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism.
Background
Computer vision is a technology that enables a computer or machine to perceive the way human eyes do, and is widely applied in graphics and imaging, three-dimensional reconstruction, target tracking, face detection, and video understanding. Natural language processing is a technology that enables a computer or machine to think and reason like a human being, and is widely applied to tasks such as machine translation, reading comprehension, language generation, and multi-turn dialogue. Before the advent of deep learning, traditional computer vision and natural language processing relied heavily on manually extracted features and hand-written grammar rules. As data volumes grew and the cost of GPU computation fell, deep learning techniques typified by deep neural networks gradually emerged, and computer vision and natural language techniques based on deep learning became mainstream. Because deep learning has strong representation learning ability, a neural network can learn and understand knowledge through end-to-end joint training on data and labels alone, without humans manually extracting features or formulating complicated rules. Therefore, by combining deep-learning-based computer vision perception with the cognitive techniques of natural language processing, a weakly supervised neural network algorithm with a multi-layer temporal attention fusion mechanism is designed for continuous sign language video recognition and translation, so that a computer can effectively understand the content expressed by the signer in a sign language video.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention aims to design a weakly supervised neural network algorithm with a multi-layer temporal attention fusion mechanism for continuous sign language video recognition and translation, solving the difficulty of recognizing and translating continuous sign language video, so that a computer can learn to understand the meaning expressed by a signer and communication between hearing people and deaf signers is promoted.
The technical scheme of the invention is as follows:
A weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism comprises the following steps:
1) Given a sign language video V = (f_1, ..., f_u), for the continuous sign language recognition task the neural network learns the conditional probability P(G|V) to generate a sign language recognition sequence G = (g_1, ..., g_n); for the continuous sign language translation task the network learns the conditional probability P(L|G) to generate the natural language sequence L = (l_1, ..., l_u);
Uniform and random frame sampling is performed on each sign language video with the OpenCV library so that every video has the same number of frames; the corpus label sentences of the sign language videos are word-segmented, and each sign language video is automatically annotated using the Python programming language;
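For illustration only (not part of the original disclosure), the sampling step can be sketched in Python with OpenCV as follows; the target frame count num_frames is a hypothetical value, since the patent does not state one:

```python
import cv2
import numpy as np

def sample_frames(video_path, num_frames=64, jitter=True):
    """Uniform (optionally randomly jittered) sampling of a fixed number of
    frames, so that every sign language video ends up the same length.
    num_frames=64 is an assumed value, not taken from the patent."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Split the video into num_frames equal segments and draw one frame from each.
    edges = np.linspace(0, total, num_frames + 1, dtype=int)
    picks = [np.random.randint(lo, hi) if jitter and hi > lo else lo
             for lo, hi in zip(edges[:-1], edges[1:])]
    frames = []
    for idx in picks:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```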
2) Sign language videos of a given batch size are fed frame by frame into the encoder part of the neural network. Features are first extracted from each sign language video frame by a pre-trained convolutional neural network module, and effective pose information is then obtained through two residual fully-connected layers, serving as the spatial embedding of the network:
S_u = SpatialEmbedding(f_u)    (1)
where f_u denotes a sign language video frame and S_u is the spatial embedding vector extracted by the convolutional network;
The spatial embedding vectors of the sign language video contain rich feature information and are input into the next module, a bidirectional gated recurrent network. The gated recurrent network effectively models the sign language video frame sequence data along the time dimension, obtaining sign language action context information B_u through forward and backward bidirectional modeling; B_u is then passed through a three-layer residual-stacked unidirectional gated recurrent network to obtain higher-dimensional abstract information E_u;
Through these operations, the encoder part of the neural network performs spatio-temporal encoding of the sign language video to obtain a hidden vector h_u, which is passed to the decoder part of the neural network. The decoder network combines the h_u vector with the C_mix vector obtained from the multi-layer temporal attention fusion, produces a recognized sign language word at each time step of the multi-layer residual gated recurrent network, and finally combines the words into a complete sign language sentence;
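A minimal PyTorch sketch of this encoder follows, using the dimensions from the embodiment described later (resnet152 backbone, 2600-dimensional residual fully-connected spatial embedding, 1300-dimensional bidirectional GRU hidden units, three residual GRU layers); the class and variable names are illustrative, not from the patent:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SignEncoder(nn.Module):
    """Frozen pretrained CNN -> two residual fully-connected layers (spatial
    embedding S_u) -> bidirectional GRU (B_u) -> three residual GRU layers (E_u)."""
    def __init__(self, dim=2600):
        super().__init__()
        cnn = models.resnet152(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier head
        for p in self.cnn.parameters():                       # freeze all CNN parameters
            p.requires_grad = False
        self.fc1 = nn.Linear(2048, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.bigru = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)
        self.grus = nn.ModuleList([nn.GRU(dim, dim, batch_first=True) for _ in range(3)])

    def forward(self, frames):                  # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1)).flatten(1)   # (batch*time, 2048)
        s = self.fc1(x)
        s = s + self.fc2(torch.relu(s))                 # residual FC -> S_u
        s = s.view(b, t, -1)
        B_u, _ = self.bigru(s)                          # forward/backward context
        B_u = B_u + s                                   # residual connection
        E_u = B_u
        for gru in self.grus:                           # residual unidirectional stack
            out, _ = gru(E_u)
            E_u = E_u + out
        return B_u, E_u
```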
The fusion vector of the multi-layer temporal attention mechanism is computed as follows. First a score is calculated: the decoder hidden vector of the previous time step, h_{n-1}, serves as the query term and is scored against E_u and B_u respectively to obtain the two score vectors score1 and score2:
score1(h_{n-1}, E_u) = E_u W h_{n-1}^T    (2)
score2(h_{n-1}, B_u) = B_u D h_{n-1}^T    (3)
where W and D are trainable neural network weight parameters. The two scores from the scoring functions above are then normalized to obtain the temporal attention weights r and p of the sign language video, used to align sign language video frames with words; the calculation is as follows:
r_{n,k} = exp(score1(h_{n-1}, E_k)) / Σ_{k'=1..u} exp(score1(h_{n-1}, E_{k'}))    (4)
p_{n,k} = exp(score2(h_{n-1}, B_k)) / Σ_{k'=1..u} exp(score2(h_{n-1}, B_{k'}))    (5)
where k denotes the kth time step along the temporal dimension of the encoder network and n the nth time step along the temporal dimension of the decoder network. The temporal attention weights r and p are then combined with E_u and B_u respectively to compute two sign language attention background vectors C_t and C_b:
C_t = Σ_{k=1..u} r_{n,k} E_k    (6)
C_b = Σ_{k=1..u} p_{n,k} B_k    (7)
C_t and C_b are then fused to obtain C_mix:
C_mix = Fuse(C_t, C_b)    (8)
This attention background vector is called the sign language sequence context fusion vector C_mix;
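A PyTorch sketch of equations (2) through (8) follows. The concrete fusion in equation (8), here concatenation followed by a linear layer and tanh, is an assumption made for the sketch, since the source renders that equation only as an image:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionFusion(nn.Module):
    """Multi-layer temporal attention fusion: scores (2)-(3), softmax weights
    (4)-(5), background vectors (6)-(7), and an assumed fusion for (8)."""
    def __init__(self, enc_dim, dec_dim):
        super().__init__()
        self.W = nn.Linear(dec_dim, enc_dim, bias=False)   # trainable W for score1
        self.D = nn.Linear(dec_dim, enc_dim, bias=False)   # trainable D for score2
        self.fuse = nn.Linear(2 * enc_dim, enc_dim)        # assumed fusion layer

    def forward(self, h_prev, E, B):
        # h_prev: (batch, dec_dim); E, B: (batch, enc_steps, enc_dim)
        score1 = torch.bmm(E, self.W(h_prev).unsqueeze(2)).squeeze(2)  # eq. (2)
        score2 = torch.bmm(B, self.D(h_prev).unsqueeze(2)).squeeze(2)  # eq. (3)
        r = F.softmax(score1, dim=1)                                   # eq. (4)
        p = F.softmax(score2, dim=1)                                   # eq. (5)
        C_t = torch.bmm(r.unsqueeze(1), E).squeeze(1)                  # eq. (6)
        C_b = torch.bmm(p.unsqueeze(1), B).squeeze(1)                  # eq. (7)
        C_mix = torch.tanh(self.fuse(torch.cat([C_t, C_b], dim=1)))   # eq. (8), assumed form
        return C_mix, r, p
```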
3) In the decoding stage, decoding starts from the input <BOS> symbol, which serves as the start symbol of every training pass and is fed into the first time step of the decoder network. At each time step, C_mix is concatenated with the embedded sign language vocabulary vector and input into the decoder for the current time step; the output is obtained after the nonlinear operations of the decoder's four-layer residual-stacked gated recurrent network, and a fully-connected layer generates the word with the highest probability for the current time step. Decoding loops until the <End> symbol is met, completing the generation of a full sign language recognition sentence.
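The greedy decoding loop described above can be sketched as follows; decoder.gru, decoder.out, and the other names are illustrative stand-ins rather than identifiers from the patent:

```python
import torch

def greedy_decode(decoder, attention, embed, E, B, h, bos_id, eos_id, max_len=50):
    """Start from <BOS>; at each step concatenate the fused context C_mix with
    the previous word's embedding, run the stacked-GRU decoder, and emit the
    most probable word until <End>. Assumes batch size 1."""
    token = torch.tensor([bos_id])
    words = []
    for _ in range(max_len):
        C_mix, _, _ = attention(h[-1], E, B)            # h[-1] plays the role of h_{n-1}
        x = torch.cat([embed(token), C_mix], dim=1).unsqueeze(1)
        out, h = decoder.gru(x, h)                      # four-layer residual GRU stack
        logits = decoder.out(out.squeeze(1))
        token = logits.argmax(dim=1)                    # greedy choice
        if token.item() == eos_id:                      # stop at <End>
            break
        words.append(token.item())
    return words
```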
Further, the language model generates natural language text that conforms to spoken expression; a Transformer is used as the language model for language learning to obtain the result of continuous sign language translation.
In the spatial embedding module of the encoder network, a convolutional neural network pre-trained on ImageNet is used and all of its parameters are frozen. The resnet152 pre-trained convolutional network is used, taking the output of the penultimate or last layer; two 2600-dimensional trainable residual fully-connected layers are added after the pre-trained convolutional neural network and residually connected with the output of the subsequent bidirectional gated recurrent unit module. The hidden units of the encoder's bidirectional gated recurrent unit are set to 1300 dimensions; because past and future information is concatenated, the output is 2600-dimensional, and the hidden units of each gated recurrent layer are likewise set to 2600 dimensions so that residual connections can be made. At the decoder stage, the word embedding dimension of sign language words is set to 256 and the hidden unit dimension of each gated recurrent network in the decoder is set to 800.
The training process uses PyTorch's default Adam optimizer and cross entropy loss function, with each batch set to 10. The learning rate schedule has two stages: the first stage uses 0.00004; after the 8th epoch the learning rate is adjusted to 0.000004 and training continues for 6 more epochs, completing the convergence of the neural network parameters.
The technical conception of the invention is as follows: exploiting the strong representation learning ability of deep learning and a large-scale sign language video dataset, the sign language video is feature-modeled using the strong feature extraction ability of convolutional neural networks and the long-sequence modeling ability of gated recurrent networks, combined with the multi-layer temporal attention fusion technique and the strong translation ability of the Transformer language model, to obtain from continuous sign language video a continuous sign language translation that conforms to natural spoken word order. The proposed algorithm aims to learn a sequence-to-sequence mapping: given a sign language video V = (f_1, ..., f_u), for the continuous sign language recognition task the neural network learns the conditional probability P(G|V) to generate a sign language recognition sequence G = (g_1, ..., g_n), and for the continuous sign language translation task the network learns the conditional probability P(L|G) to generate the natural language sequence L = (l_1, ..., l_u).
The invention has the following beneficial effects: whether for continuous sign language recognition or translation, the task is a weakly supervised video-to-text task containing only sentence-level annotation; no per-word temporal segmentation of the video is required, and with end-to-end deep neural network training the network can recognize and understand the signer's meaning from the sign language video.
Drawings
FIG. 1 is a block diagram of the weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, a weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism comprises the following steps:
1) Learn the sequence-to-sequence mapping: given a sign language video V = (f_1, ..., f_u), for the continuous sign language recognition task the neural network learns the conditional probability P(G|V) to generate a sign language recognition sequence G = (g_1, ..., g_n), and for the continuous sign language translation task the network learns the conditional probability P(L|G) to generate the natural language sequence L = (l_1, ..., l_u);
Uniform and random frame sampling is performed on each sign language video with the OpenCV library so that every video has the same number of frames; the corpus label sentences of the sign language videos are word-segmented, and each sign language video is automatically annotated using the Python programming language;
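For illustration (not part of the original disclosure), the segmentation and annotation step might look as follows in Python; the jieba segmenter is an assumed choice, as the patent names no specific tool:

```python
import jieba  # common Chinese word segmenter; an assumption, the patent names no tool

def annotate(video_to_sentence):
    """Segment each corpus label sentence into words and attach the word list
    to its video: a sentence-level (weak) annotation, with no per-word timing."""
    annotations = {}
    for video_path, sentence in video_to_sentence.items():
        annotations[video_path] = list(jieba.cut(sentence))
    return annotations
```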
2) Sign language videos of a given batch size are fed frame by frame into the encoder part of the neural network. Features are first extracted from each sign language video frame by a pre-trained convolutional neural network module, and effective pose information is then obtained through two residual fully-connected layers, serving as the spatial embedding of the network:
S_u = SpatialEmbedding(f_u)    (1)
where f_u denotes a sign language video frame and S_u is the spatial embedding vector extracted by the convolutional network;
The spatial embedding vectors of the sign language video contain rich feature information and are input into the next module, a bidirectional gated recurrent network. The gated recurrent network effectively models the sign language video frame sequence data along the time dimension, obtaining sign language action context information B_u through forward and backward bidirectional modeling; B_u is then passed through a three-layer residual-stacked unidirectional gated recurrent network to obtain higher-dimensional abstract information E_u;
Through these operations, the encoder part of the neural network performs spatio-temporal encoding of the sign language video to obtain a hidden vector h_u, which is passed to the decoder part of the neural network. The decoder network combines the h_u vector with the C_mix vector obtained from the multi-layer temporal attention fusion, produces a recognized sign language word at each time step of the multi-layer residual gated recurrent network, and finally combines the words into a complete sign language sentence;
The fusion vector of the multi-layer temporal attention mechanism is computed as follows. First a score is calculated: the decoder hidden vector of the previous time step, h_{n-1}, serves as the query term and is scored against E_u and B_u respectively to obtain the two score vectors score1 and score2:
score1(h_{n-1}, E_u) = E_u W h_{n-1}^T    (2)
score2(h_{n-1}, B_u) = B_u D h_{n-1}^T    (3)
where W and D are trainable neural network weight parameters. The two scores from the scoring functions above are then normalized to obtain the temporal attention weights r and p of the sign language video, used to align sign language video frames with words; the mathematical operations are as follows:
r_{n,k} = exp(score1(h_{n-1}, E_k)) / Σ_{k'=1..u} exp(score1(h_{n-1}, E_{k'}))    (4)
p_{n,k} = exp(score2(h_{n-1}, B_k)) / Σ_{k'=1..u} exp(score2(h_{n-1}, B_{k'}))    (5)
where k denotes the kth time step along the temporal dimension of the encoder network and n the nth time step along the temporal dimension of the decoder network. The temporal attention weights r and p are then combined with E_u and B_u respectively to compute two sign language attention background vectors C_t and C_b:
C_t = Σ_{k=1..u} r_{n,k} E_k    (6)
C_b = Σ_{k=1..u} p_{n,k} B_k    (7)
C_t and C_b are then fused to obtain C_mix:
C_mix = Fuse(C_t, C_b)    (8)
This attention background vector is called the sign language sequence context fusion vector C_mix;
3) In the decoding stage, decoding starts from the input <BOS> symbol, which serves as the start symbol of every training pass and is fed into the first time step of the decoder network. At each time step, C_mix is concatenated with the embedded sign language vocabulary vector and input into the decoder for the current time step; the output is obtained after the nonlinear operations of the decoder's four-layer residual-stacked gated recurrent network, and a fully-connected layer generates the word with the highest probability for the current time step. Decoding loops until the <End> symbol is met, completing the generation of a full sign language recognition sentence.
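The abstract also mentions a forced teaching (teacher forcing) training method; one such training step can be sketched as follows, again with illustrative names (decoder, attention, embed) rather than identifiers from the patent:

```python
import torch

def teacher_forcing_step(decoder, attention, embed, E, B, h, target_ids, criterion):
    """One teacher-forcing step: the ground-truth previous word, not the
    model's own prediction, is fed to the decoder at every time step."""
    loss = 0.0
    for n in range(1, target_ids.size(1)):
        C_mix, _, _ = attention(h[-1], E, B)
        prev_gold = target_ids[:, n - 1]                 # ground-truth previous word
        x = torch.cat([embed(prev_gold), C_mix], dim=1).unsqueeze(1)
        out, h = decoder.gru(x, h)
        logits = decoder.out(out.squeeze(1))
        loss = loss + criterion(logits, target_ids[:, n])
    return loss / (target_ids.size(1) - 1)
```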
Further, the language model aims to generate natural language text that conforms to spoken expression. Because sentences produced by continuous sign language recognition may not match spoken description, a Transformer is used as the language model for language learning, further obtaining the result of continuous sign language translation. A Transformer network with very small model parameters is trained to map sign language recognition sentences one-to-one to translated natural language, and the originally static positional encoding in the Transformer structure is replaced with a dynamically trainable positional encoding, making the positional relations between words in a sequence easier to learn.
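A sketch of such a trainable positional encoding in PyTorch (the class name and the max_len bound are assumptions):

```python
import torch
import torch.nn as nn

class TrainablePositionalEncoding(nn.Module):
    """Replaces the Transformer's fixed sinusoidal table with a learned
    embedding indexed by position, trained jointly with the network."""
    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.pos_embed = nn.Embedding(max_len, d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pos_embed(positions)   # broadcast over the batch
```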
The language model is built with the PyTorch deep learning framework, and the network parameters are configured as follows: in the spatial embedding module of the encoder network, a convolutional neural network pre-trained on ImageNet is used, and all of its parameters are frozen for ease of training; the resnet152 pre-trained convolutional network is used, taking the output of the penultimate or last layer; two 2600-dimensional trainable residual fully-connected layers are added after the pre-trained convolutional neural network and residually connected with the output of the subsequent bidirectional gated recurrent unit module; the hidden units of the encoder's bidirectional gated recurrent unit are set to 1300 dimensions, and since past and future information is concatenated the output is 2600-dimensional, so the hidden units of each gated recurrent layer are likewise set to 2600 dimensions to allow residual connections; at the decoder stage, the word embedding dimension of sign language words is set to 256 and the hidden unit dimension of each gated recurrent network in the decoder is set to 800.
During training, PyTorch's default Adam optimizer and cross entropy loss function are used, with each batch set to 10. The learning rate is set in two stages: the first stage uses 0.00004; after the 8th epoch the learning rate is adjusted to 0.000004 and training continues for 6 more epochs, after which convergence of the neural network parameters is complete.
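The stated settings imply a training loop of roughly the following shape; model and loader are assumed to exist, and the tensor handling is illustrative:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=4e-5)   # stage one: 0.00004
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(14):                  # 8 epochs at 4e-5, then 6 more at 4e-6
    if epoch == 8:
        for group in optimizer.param_groups:
            group["lr"] = 4e-6           # stage two: 0.000004
    for frames, targets in loader:       # batches of 10 videos
        optimizer.zero_grad()
        logits = model(frames)           # (batch * steps, vocab_size)
        loss = criterion(logits, targets.flatten())
        loss.backward()
        optimizer.step()
```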
Therefore, the weakly supervised neural network method with a multi-layer temporal attention fusion mechanism for continuous sign language video recognition and translation provided by the invention enables the network to recognize and understand the signer's meaning from sign language video.

Claims (4)

1. A weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism, characterized by comprising the following steps:
1) given a sign language video V = (f_1, ..., f_u), for the continuous sign language recognition task the neural network learns the conditional probability P(G|V) to generate a sign language recognition sequence G = (g_1, ..., g_n), and for the continuous sign language translation task the network learns the conditional probability P(L|G) to generate the natural language sequence L = (l_1, ..., l_u);
uniform and random frame sampling is performed on each sign language video with the OpenCV library so that every video has the same number of frames; the corpus label sentences of the sign language videos are word-segmented, and each sign language video is automatically annotated using the Python programming language;
2) sign language videos of a given batch size are fed frame by frame into the encoder part of the neural network; features are first extracted from each sign language video frame by a pre-trained convolutional neural network module, and effective pose information is then obtained through two residual fully-connected layers, serving as the spatial embedding of the network:
S_u = SpatialEmbedding(f_u)    (1)
where f_u denotes a sign language video frame and S_u is the spatial embedding vector extracted by the convolutional network;
the spatial embedding vectors of the sign language video contain rich feature information and are input into the next module, a bidirectional gated recurrent network; the gated recurrent network effectively models the sign language video frame sequence data along the time dimension, obtaining sign language action context information B_u through forward and backward bidirectional modeling; B_u is then passed through a three-layer residual-stacked unidirectional gated recurrent network to obtain higher-dimensional abstract information E_u;
through these operations, the encoder part of the neural network performs spatio-temporal encoding of the sign language video to obtain a hidden vector h_u, which is passed to the decoder part of the neural network; the decoder network combines the h_u vector with the C_mix vector obtained from the multi-layer temporal attention fusion, produces a recognized sign language word at each time step of the multi-layer residual gated recurrent network, and finally combines the words into a complete sign language sentence;
the fusion vector of the multi-layer temporal attention mechanism is computed as follows: first a score is calculated, with the decoder hidden vector of the previous time step, h_{n-1}, serving as the query term; h_{n-1} is scored against E_u and B_u respectively to obtain the two score vectors score1 and score2:
score1(h_{n-1}, E_u) = E_u W h_{n-1}^T    (2)
score2(h_{n-1}, B_u) = B_u D h_{n-1}^T    (3)
where W and D are trainable neural network weight parameters; the two scores from the scoring functions above are then normalized to obtain the temporal attention weights r and p of the sign language video, used to align sign language video frames with words:
r_{n,k} = exp(score1(h_{n-1}, E_k)) / Σ_{k'=1..u} exp(score1(h_{n-1}, E_{k'}))    (4)
p_{n,k} = exp(score2(h_{n-1}, B_k)) / Σ_{k'=1..u} exp(score2(h_{n-1}, B_{k'}))    (5)
where k denotes the kth time step along the temporal dimension of the encoder network and n the nth time step along the temporal dimension of the decoder network; the temporal attention weights r and p are then combined with E_u and B_u respectively to compute two sign language attention background vectors C_t and C_b:
C_t = Σ_{k=1..u} r_{n,k} E_k    (6)
C_b = Σ_{k=1..u} p_{n,k} B_k    (7)
C_t and C_b are then fused to obtain C_mix:
C_mix = Fuse(C_t, C_b)    (8)
this attention background vector is called the sign language sequence context fusion vector C_mix;
3) in the decoding stage, decoding starts from the input <BOS> symbol, which serves as the start symbol of every training pass and is fed into the first time step of the decoder network; at each time step C_mix is concatenated with the embedded sign language vocabulary vector and input into the decoder for the current time step; the output is obtained after the nonlinear operations of the decoder's four-layer residual-stacked gated recurrent network, a fully-connected layer generates the word with the highest probability for the current time step, and decoding loops until the <End> symbol is met, completing the generation of a full sign language recognition sentence.
2. The weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism according to claim 1, characterized in that a language model generates natural language text conforming to spoken expression; a Transformer is used as the language model for language learning, further obtaining the result of continuous sign language translation; and the originally static positional encoding in the Transformer structure is replaced with a dynamically trainable positional encoding.
3. The weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism according to claim 1 or 2, characterized in that the language model is built with the PyTorch deep learning framework and the network parameters are configured as follows: in the spatial embedding module of the encoder network, a convolutional neural network pre-trained on ImageNet is used and all of its parameters are frozen; the resnet152 pre-trained convolutional network is used, taking the output of the penultimate or last layer; two 2600-dimensional trainable residual fully-connected layers are added after the pre-trained convolutional neural network and residually connected with the output of the subsequent bidirectional gated recurrent unit module; the hidden units of the encoder's bidirectional gated recurrent unit are set to 1300 dimensions, and since past and future information is concatenated the output is 2600-dimensional, so the hidden units of each gated recurrent layer are likewise set to 2600 dimensions to allow residual connections; at the decoder stage, the word embedding dimension of sign language words is set to 256 and the hidden unit dimension of each gated recurrent network in the decoder is set to 800.
4. The weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism according to claim 3, characterized in that the training process uses PyTorch's default Adam optimizer and cross entropy loss function, with each batch set to 10; the learning rate schedule has two stages: the first stage uses 0.00004, and after the 8th epoch the learning rate is adjusted to 0.000004 and training continues for 6 more epochs, completing the convergence of the neural network parameters.
CN202110773432.6A, filed 2021-07-08 (priority 2021-07-08): Weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism. Active; granted as CN113537024B.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110773432.6A (granted as CN113537024B) 2021-07-08 2021-07-08 Weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism


Publications (2)

Publication Number Publication Date
CN113537024A (en) 2021-10-22
CN113537024B (en) 2022-06-21

Family

ID=78127177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110773432.6A (active, granted as CN113537024B) 2021-07-08 2021-07-08 Weakly supervised neural network sign language recognition method with a multi-layer temporal attention fusion mechanism

Country Status (1)

Country Link
CN (1) CN113537024B (en)


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN108647603A (en) * 2018-04-28 2018-10-12 清华大学 Semi-supervised continuous sign language interpretation method based on attention mechanism and device
CN109190578A (en) * 2018-09-13 2019-01-11 合肥工业大学 The sign language video interpretation method merged based on convolution network with Recognition with Recurrent Neural Network
CN110163181A (en) * 2019-05-29 2019-08-23 中国科学技术大学 Sign Language Recognition Method and device
CN110399850A (en) * 2019-07-30 2019-11-01 西安工业大学 A kind of continuous sign language recognition method based on deep neural network
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network
CN111339837A (en) * 2020-02-08 2020-06-26 河北工业大学 Continuous sign language recognition method
CN111340006A (en) * 2020-04-16 2020-06-26 深圳市康鸿泰科技有限公司 Sign language identification method and system
CN112101262A (en) * 2020-09-22 2020-12-18 中国科学技术大学 Multi-feature fusion sign language recognition method and network model


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LU MENG ET AL: "An Attention-Enhanced Multi-Scale and Dual Sign Language Recognition Network Based on a Graph Convolution Network", 《SENSORS》, 5 February 2021 (2021-02-05) *
王军 等 [WANG Jun et al.]: "融合注意力机制和连接时序分类的多模态手语识别" [Multimodal sign language recognition fusing an attention mechanism and connectionist temporal classification], 《信号处理》 [Signal Processing], vol. 36, no. 9, 30 September 2020 (2020-09-30) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114115531A (en) * 2021-11-11 2022-03-01 合肥工业大学 End-to-end sign language identification method based on attention mechanism
CN114115531B (en) * 2021-11-11 2022-09-30 合肥工业大学 End-to-end sign language recognition method based on attention mechanism
US11475877B1 (en) * 2021-12-13 2022-10-18 Institute Of Automation, Chinese Academy Of Sciences End-to-end system for speech recognition and speech translation and device
CN115345257A (en) * 2022-09-22 2022-11-15 中山大学 Flight trajectory classification model training method, classification method, device and storage medium
CN115345257B (en) * 2022-09-22 2023-06-06 中山大学 Flight trajectory classification model training method, classification method, device and storage medium
CN116089593A (en) * 2023-03-24 2023-05-09 齐鲁工业大学(山东省科学院) Multi-pass man-machine dialogue method and device based on time sequence feature screening coding module
KR102610897B1 (en) * 2023-03-24 2023-12-07 Qilu University of Technology (Shandong Academy of Sciences) Method and device for multi-pass human-machine conversation based on time sequence feature screening and encoding module

Also Published As

Publication number Publication date
CN113537024B (en) 2022-06-21


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant