CN112132094B - Continuous sign language recognition system based on multi-language collaboration - Google Patents


Info

Publication number
CN112132094B
CN112132094B (application CN202011060272.2A)
Authority
CN
China
Prior art keywords
sign language
language
shared
sequence
sign
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011060272.2A
Other languages
Chinese (zh)
Other versions
CN112132094A (en)
Inventor
李厚强
周文罡
蒲俊福
胡鹤臻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202011060272.2A priority Critical patent/CN112132094B/en
Publication of CN112132094A publication Critical patent/CN112132094A/en
Application granted granted Critical
Publication of CN112132094B publication Critical patent/CN112132094B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a continuous sign language recognition system based on multi-language collaboration. A common visual feature encoder extracts feature expressions, and for sign languages of different languages, different temporal modeling networks (namely target sequence models) learn the linguistic characteristics of the corresponding sign language. A shared temporal encoder (namely a shared sequence model) expresses the visual patterns common to different sign languages and is initialized with language embedding vectors. Through multi-language collaborative training, multi-language sign language recognition is realized within a single framework, the visual commonalities among different sign languages are fully mined, and sign language recognition performance is improved.

Description

Continuous sign language recognition system based on multi-language collaboration
Technical Field
The invention relates to the technical field of action recognition in computer vision, and in particular to a continuous sign language recognition system based on multi-language collaboration.
Background
In the continuous sign language recognition problem, each sign language video is labeled with an ordered sequence of sign language words, so the problem can essentially be regarded as learning the mapping between a video sequence and a labeled text sequence. Generally, a continuous sign language recognition system consists of a visual feature encoder and a temporal modeling model. Feature expression of the video plays a very important role in continuous sign language recognition; hand-crafted features such as SIFT and HOG were used early on to characterize hand shapes and trajectories. With the successful application of deep learning in computer vision, two-dimensional convolutional neural networks for image representation and three-dimensional convolutional neural networks for video representation have been introduced into sign language recognition. Related work uses a 2D CNN in an end-to-end system to extract RGB image information and achieves good performance; to model temporal dependencies, sign language recognition methods based on three-dimensional convolution kernels have also been proposed. In another video representation approach, a 2D convolutional network and 1D temporal convolutions are used to perform spatio-temporal modeling of the sign language video, and the visual features extracted in this way outperform other methods on the continuous sign language recognition task.
The sequence learning model in continuous sign language recognition can be realized by connectionist temporal classification (CTC), hidden Markov models, encoder-decoder networks, and so on. Recurrent neural networks have been successfully applied to many sequence learning tasks and have been introduced into the continuous sign language recognition problem; the bidirectional LSTM-CTC structure is one of the most widely used baseline methods in sign language recognition. In addition, some work embeds hidden Markov models into neural networks for sign language recognition. Similar to machine translation, attention-based encoder-decoder networks are also used to learn the mapping between videos and annotations, thereby realizing sign language recognition and sign language translation.
In the machine translation task, most methods likewise focus on the single-language translation problem from a source language to a target language, and end-to-end solutions based on deep neural networks have made important progress on this type of problem. A machine translation system can extend a single-language translation method to the multi-language translation task in various ways. By adding a language identifier at the beginning of the sentence to be translated, a monolingual model can be applied to multi-language translation with a simple extension. To alleviate the problem of limited corpus resources, an end-to-end twin (Siamese) network has been used to generate more sentences from the corpus for data augmentation. Furthermore, by using different parameter-sharing strategies, the model size of a multi-language system can be balanced.
The prior art mainly has the following defects:
1) Like natural languages, the sign languages of different countries and regions are mutually distinct; they have their own unique grammatical structures and vocabularies. In other words, it is difficult for people using different sign languages to understand each other's sign language semantics. Existing video sign language recognition methods are usually designed for single-language sign language recognition, which limits the practical application and deployment of sign language recognition systems.
2) Most existing multi-language sign language recognition algorithms train multiple sets of model parameters for different sign languages on the same network architecture, using different sign language data sets. This approach can achieve a certain effect, but it ignores the fact that similar visual patterns exist among different sign languages, and separate independent training is not conducive to the model mining the commonalities among sign languages.
Disclosure of Invention
The invention aims to provide a continuous sign language recognition system based on multi-language collaboration, which realizes multi-language sign language recognition within a single framework and whose recognition performance is superior to that of separately trained models.
The purpose of the invention is realized by the following technical scheme:
a continuous sign language recognition system based on multi-language collaboration, comprising: the method comprises the steps of sharing a visual feature encoder, a sharing sequence model and a plurality of target sequence models; wherein:
the shared visual feature encoder is used for extracting visual features from the sign language videos of all languages and inputting the visual features to the shared sequence model and to each target sequence model respectively;
the shared sequence model is used for expressing the visual patterns shared among different sign languages, learning the commonality among the different sign languages, and is initialized with language embedding vectors of the different languages;
each target sequence model is used for learning, in conjunction with the output of the shared sequence model, the mapping between the visual features of the respective language and the corresponding sign language words;
in the training stage, all target sequence models are jointly optimized; each trained target sequence model can predict the probability distribution of the sign language words corresponding to a sign language video of the corresponding language.
According to the technical scheme provided by the invention, a common visual feature encoder (the shared visual feature encoder) is used to extract feature expressions, and for sign languages of different languages, different temporal modeling networks (namely the target sequence models) are used to learn the linguistic characteristics of the corresponding sign language. A shared temporal encoder (namely the shared sequence model) is used to express the visual patterns common to different sign languages and is initialized with language embedding vectors. Through multi-language collaborative training, multi-language sign language recognition is realized within a single framework, the visual commonalities among different sign languages are fully mined, and sign language recognition performance is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
FIG. 1 is a diagram illustrating three basic frameworks for multi-language identification according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a continuous sign language recognition system based on multi-language collaboration according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the network iterative optimization provided in the embodiment of the present invention;
FIG. 4 is a schematic diagram of obtaining alignment labels based on maximized probability according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Most existing sign language recognition frameworks only recognize a single sign language. Different sign languages share common visual patterns, and training multiple independent models on different sign language data sets ignores the commonalities among the sign languages of different languages. The invention expresses the visual patterns shared among different sign languages through a shared temporal encoder and realizes multi-language sign language recognition within a single framework through multi-language collaborative training; the recognition performance is superior to the recognition results of independent training.
Similar to multi-language machine translation, three system architectures for multi-language sign language recognition are considered, as shown in FIG. 1. 1) The simplest approach is to use a shared visual encoder and a shared sequence model, as shown in part (a) of FIG. 1; this simple architecture can be implemented easily without major changes to existing continuous sign language systems. However, it weakens the sequence model's ability to handle the correspondence mappings of multiple languages. 2) Different models are used for different sign languages while sharing the same visual encoder, as shown in part (b) of FIG. 1. Each branch functions the same as a classical sign language recognition system, and no information is shared between the target branches; this design is language-specific, but the complementarity between different sign languages cannot be exploited. 3) To combine the advantages of the first two architectures, the third approach adds an additional shared sequence model to learn the commonality between different sign languages, as shown in part (c) of FIG. 1.
The continuous sign language recognition system based on multi-language collaboration provided by the embodiment of the invention adopts the structure shown in part (c) of FIG. 1. It uses a common CNN-TCN to extract features for all input sign languages. An independent target sequence model is adopted to learn the correspondence between visual features and sign language words. Furthermore, in order to model the visual patterns shared between different sign languages, a shared sequence model is used to capture the common points among all sign languages. Each branch (target sequence model) in the system is optimized by a CTC loss. The main structure of the system is shown in FIG. 2 and comprises: a shared visual feature encoder, a shared sequence model, and several target sequence models. The principles of the various parts of the system, as well as the training scheme, are described below.
I. Shared visual feature encoder.
The shared visual feature encoder is used for extracting visual features from the sign language videos of all languages and inputting the visual features to the shared sequence model and to each target sequence model respectively.
In the embodiment of the invention, the shared visual feature encoder comprises, in sequence: a spatial convolutional network (Spatial CNN) and a temporal convolutional network (Temporal CNN), which are used to extract the visual features of the video.
As indicated by the dashed box at the bottom of FIG. 2:
the spatial convolutional network mainly comprises, in sequence: a first convolution layer, a first max-pooling layer, second and third convolution layers, two Inception layers, a second max-pooling layer, five Inception layers, a third max-pooling layer, two Inception layers, and a fourth max-pooling layer.
The temporal convolutional network mainly comprises two convolution layers and max-pooling layers, with the convolution and max-pooling layers arranged alternately.
The temporal receptive field of the shared visual feature encoder is 16 frames. Denote the shared visual feature encoder as E_v. For a sign language video of any language X = {x_1, x_2, …, x_N}, the visual features output by the shared visual feature encoder are represented as:
F = E_v(X) = {f_1, f_2, …, f_{N/4}},
where x_t represents the t-th video frame, N is the number of video frames, and f_i represents the visual feature of the i-th video segment (the video frames within one receptive field); the temporal length of the output is thus N/4 of that of the input.
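For illustration, a minimal PyTorch-style sketch of such a CNN-TCN encoder is given below. The spatial backbone is a small stand-in for the Inception-style stack described above, and the temporal part uses two stages of kernel-size-5 convolution followed by stride-2 max-pooling, which is consistent with the stated 16-frame receptive field and N/4 output length; all channel sizes, kernel sizes and layer counts are illustrative assumptions rather than the exact configuration of the embodiment.

```python
import torch
import torch.nn as nn

class SharedVisualEncoder(nn.Module):
    """Sketch of the shared visual feature encoder (Spatial CNN + Temporal CNN)."""

    def __init__(self, feat_dim=512):
        super().__init__()
        # Stand-in for the per-frame conv / max-pool / Inception stack.
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
            nn.Conv2d(64, 192, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
            nn.Conv2d(192, feat_dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),                      # -> (B*N, feat_dim, 1, 1)
        )
        # Temporal CNN: convolution and max-pooling layers arranged alternately.
        self.temporal = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool1d(2, stride=2),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool1d(2, stride=2),                    # total temporal downsampling: N -> N/4
        )

    def forward(self, video):                             # video: (B, N, 3, H, W)
        b, n, c, h, w = video.shape
        frame_feats = self.spatial(video.reshape(b * n, c, h, w)).reshape(b, n, -1)
        f = self.temporal(frame_feats.transpose(1, 2))    # (B, feat_dim, N/4)
        return f.transpose(1, 2)                          # (B, N/4, feat_dim)
```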
II. Sequence learning.
Long-range dependencies can be effectively modeled by a long short-term memory network (LSTM). The LSTM unit at the current time t is described by a cell state C_t and a hidden state h_t; its basic idea is to control the update of the cell state and the hidden state by introducing gate structures. The LSTM unit has three different gate structures, namely the input gate i_t, the forget gate g_t and the output gate o_t, computed as follows:
i_t = σ(W_i·[h_{t-1}, f_t] + b_i),
g_t = σ(W_g·[h_{t-1}, f_t] + b_g),
o_t = σ(W_o·[h_{t-1}, f_t] + b_o),
where σ is the activation function, t denotes the time step, f_t is the input feature, and W and b are the linear mapping weights and biases. The current cell state and hidden state are then updated as follows:
C̃_t = tanh(W_C·[h_{t-1}, f_t] + b_C),
C_t = g_t ⊙ C_{t-1} + i_t ⊙ C̃_t,
h_t = o_t ⊙ tanh(C_t),
where ⊙ denotes the element-wise product of vectors.
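The gate equations above can be made concrete with a short sketch of a single LSTM step (a real system would simply use a library implementation such as PyTorch's nn.LSTM); the packed weight layout is an assumption made here for compactness.

```python
import torch

def lstm_step(f_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate equations above.

    W: (4*H, H+D) packed weights, b: (4*H,) packed bias; the four chunks
    correspond to the input gate, forget gate, output gate and candidate
    cell state, all applied to the concatenation [h_{t-1}, f_t].
    """
    z = torch.cat([h_prev, f_t], dim=-1) @ W.t() + b
    i, g, o, c_hat = z.chunk(4, dim=-1)
    i, g, o = torch.sigmoid(i), torch.sigmoid(g), torch.sigmoid(o)
    c_t = g * c_prev + i * torch.tanh(c_hat)   # element-wise products, as in the update rule
    h_t = o * torch.tanh(c_t)
    return h_t, c_t
```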
In the present invention, in order to model bidirectional temporal information, bidirectional long short-term memory networks (BLSTM) are used for temporal encoding. Two different BLSTM networks are used in this framework for different purposes. On the one hand, a separate sequence model (i.e., a target sequence model) is used to learn the mapping between visual features and sign language words, since each language has its own unique rules. The separate sequence modeling branches help capture the characteristics of each particular sign language and can reduce the interference between cross-language sequence modeling. On the other hand, in order to encode the similar visual patterns in different sign languages, the invention introduces a shared sequence model to learn the commonality among different sign languages. To embed the language identity information, the states of the shared model are initialized with different language embedding vectors.
1. Shared sequence model.
The shared sequence model is used for expressing the visual patterns shared among different sign languages, learning the commonality among the different sign languages, and is initialized with the embedding vectors of the different languages.
In order to embed language-category information in the shared sequence model, the category information of each language is encoded by an embedding layer (Embedding Layer) and used to initialize the BLSTM of the shared sequence model, so as to distinguish the different languages.
For the input visual features F, the output features O_s are expressed as:
O_s = BLSTM_s(F; h_0 = e_k, c_0 = e_k),
where h_0 and c_0 are respectively the initial hidden state and cell state of the bidirectional long short-term memory network, e_k is the category embedding vector representing the k-th sign language, and BLSTM_s denotes the shared sequence model.
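A minimal sketch of this language-aware initialization is shown below, assuming a PyTorch bidirectional LSTM whose initial hidden and cell states in both directions are set to the language embedding e_k; the hidden size, and the choice to reuse the same embedding for both directions, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedSequenceModel(nn.Module):
    """Shared BLSTM whose states are initialized with a language embedding e_k."""

    def __init__(self, feat_dim=512, hidden=512, num_languages=2):
        super().__init__()
        self.lang_embed = nn.Embedding(num_languages, hidden)   # embedding layer for e_k
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, F, lang_id):                 # F: (B, T, feat_dim), lang_id: (B,)
        e_k = self.lang_embed(lang_id)             # (B, hidden)
        # Use e_k as the initial hidden and cell state of both directions.
        h0 = e_k.unsqueeze(0).repeat(2, 1, 1)      # (2, B, hidden)
        c0 = e_k.unsqueeze(0).repeat(2, 1, 1)
        O_s, _ = self.blstm(F, (h0, c0))           # (B, T, 2*hidden)
        return O_s
```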
2. Target sequence model.
Each target sequence model is used for learning, in conjunction with the output of the shared sequence model, the mapping between the visual features of the respective language and the corresponding sign language words.
target sequence model for kth sign language
Figure BDA0002712136890000061
Output the result
Figure BDA0002712136890000062
Expressed as:
Figure BDA0002712136890000063
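The corresponding target branch might look like the sketch below. The description only states that each target sequence model uses both the visual features F and the shared output O_s; concatenating them before the language-specific BLSTM is an assumption made here for illustration. The BLSTM states are left at their zero defaults, matching the zero-vector initialization, and the final fully connected classifier anticipates the mapping to word scores described in the optimization section (the extra class for the CTC blank is an assumption).

```python
import torch
import torch.nn as nn

class TargetSequenceModel(nn.Module):
    """Language-specific BLSTM plus a classifier over that language's vocabulary."""

    def __init__(self, feat_dim=512, shared_dim=1024, hidden=512, vocab_size=1000):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim + shared_dim, hidden,
                             batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, vocab_size + 1)   # +1 for the CTC blank (assumed)

    def forward(self, F, O_s):                     # F: (B, T, feat_dim), O_s: (B, T, shared_dim)
        x = torch.cat([F, O_s], dim=-1)            # fuse visual and shared features (assumption)
        O_k, _ = self.blstm(x)                     # zero-initialized states by default
        Y_k = self.classifier(O_k)                 # (B, T, vocab_size+1) unnormalized word scores
        return Y_k
```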
and thirdly, optimizing the model.
1. Multi-language joint optimization.
In the embodiment of the invention, in the training stage, all target sequence models are jointly optimized; each trained target sequence model can predict the probability distribution of the sign language words corresponding to a sign language video of the corresponding language.
In order to obtain the probability distribution over the target sign language words, the output O^(k) of the target sequence model is mapped to the unnormalized log-probability space with a fully connected layer, expressed as:
Y^(k) = W^(k)·O^(k) + b^(k),
where the superscript k denotes the category identifier of the sign language, W^(k) and b^(k) are respectively the weight and bias parameters of the fully connected layer, and Y^(k)_{t,s} is the probability that the t-th video segment belongs to the sign language word s.
In the training stage, the connectionist temporal classification (CTC) loss is adopted for optimization. With joint optimization, the total loss function is the sum of the CTC loss functions of all target sequence models, expressed as:
L = Σ_{k=1}^{K} L_CTC^(k),
where K is the total number of target sequence models and L_CTC^(k) is the CTC loss function of the target sequence model for the k-th sign language, computed from Y^(k).
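A sketch of this joint optimization with PyTorch's built-in CTC loss is given below; tensor shapes follow the (T, B, C) convention expected by torch.nn.CTCLoss, and the data layout of the targets is an assumption.

```python
import torch
import torch.nn.functional as F_nn

ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def joint_ctc_loss(branch_outputs, branch_targets):
    """branch_outputs[k]: (B, T, C) scores Y^(k); branch_targets[k]: (labels, label_lengths)."""
    total = 0.0
    for Y_k, (labels, label_lengths) in zip(branch_outputs, branch_targets):
        log_probs = F_nn.log_softmax(Y_k, dim=-1).transpose(0, 1)            # (T, B, C)
        input_lengths = torch.full((Y_k.size(0),), Y_k.size(1), dtype=torch.long)
        total = total + ctc(log_probs, labels, input_lengths, label_lengths)
    return total                                                             # L = sum_k L_CTC^(k)
```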
2. Shared visual feature encoder optimization.
Existing studies have shown that iterative training of the CNN is an effective way to further improve performance. The idea is to obtain the alignment between the input video and the sign language words and to use this alignment to fine-tune the feature extraction network; the optimization process is shown in FIG. 3. On this basis, the embodiment of the present invention provides a method for obtaining the alignment between a sign language video and its sign language annotation sequence based on a maximum-probability decoding algorithm, so as to fine-tune the shared visual feature encoder. The method is as follows:
After the probability distribution Y^(k) of the sign language words is obtained through the target sequence model, the category probability values (columns) corresponding to each sign language word in the annotation sequence are extracted in annotation order and assembled into a new probability matrix Y^(k)′, as shown in FIG. 4, where T is the number of video segments; a dynamic programming algorithm is then used to find the maximum-probability path over the new probability matrix Y^(k)′.
Let P_{i,j} denote the maximum probability between the feature sequence f_1, f_2, …, f_i and the annotation sequence s_1, s_2, …, s_j. The transfer equation of the dynamic programming is expressed as:
P_{i,j} = Y^(k)′_{i,j} + max(P_{i-1,j}, P_{i-1,j-1}),
where Y^(k)′_{i,j} is the element in the i-th row and j-th column of the new probability matrix Y^(k)′, i.e., the probability that the i-th video segment belongs to the sign language word s_j; i ≤ N/4.
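The dynamic program can be sketched as follows; the backtracking step that converts the maximum-probability path into one pseudo label per video segment is an implementation assumption, since the text only specifies the forward transfer equation.

```python
import numpy as np

def max_prob_alignment(Y_k, annotation):
    """Align T video segments to an ordered gloss annotation.

    Y_k: (T, C) per-segment probabilities over the vocabulary.
    annotation: list of J gloss indices s_1..s_J, in annotation order (J <= T).
    Returns a length-T array of pseudo labels (one gloss index per segment).
    """
    T, J = Y_k.shape[0], len(annotation)
    Y_prime = Y_k[:, annotation]                           # new probability matrix Y^(k)'
    P = np.full((T, J), -np.inf)
    P[0, 0] = Y_prime[0, 0]
    for i in range(1, T):
        for j in range(min(i + 1, J)):
            stay = P[i - 1, j]
            advance = P[i - 1, j - 1] if j > 0 else -np.inf
            P[i, j] = Y_prime[i, j] + max(stay, advance)   # transfer equation
    # Backtrack the maximum-probability path to obtain segment-level pseudo labels.
    labels = np.empty(T, dtype=int)
    j = J - 1
    for i in range(T - 1, -1, -1):
        labels[i] = annotation[j]
        if i > 0 and j > 0 and P[i - 1, j - 1] >= P[i - 1, j]:
            j -= 1
    return labels
```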
Through the above procedure, the alignment between the sign language video and the sign language word annotations is obtained, i.e., segment-level category pseudo labels (video segment pseudo labels) are obtained. The shared visual feature encoder is then optimized by treating this as a video classification task with the pseudo labels. The optimized shared visual feature encoder provides the pre-trained parameters, and the whole framework is trained end-to-end again, thereby realizing continuous iterative optimization.
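Assuming the segment pseudo labels above are used as targets of an ordinary classification objective, the fine-tuning step of this iterative optimization could be sketched as follows (reusing the encoder sketch given earlier); the temporary classification head, the cross-entropy loss and the feature dimension are assumptions, since the text only states that the encoder is optimized as a video classification task.

```python
import torch
import torch.nn as nn

def finetune_encoder(encoder, videos, pseudo_labels, vocab_size, epochs=1, lr=1e-4):
    """Fine-tune the shared visual encoder on segment-level pseudo labels."""
    head = nn.Linear(512, vocab_size)                       # temporary classification head (assumed dim)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for video, labels in zip(videos, pseudo_labels):    # labels: (T,) one gloss index per segment
            feats = encoder(video.unsqueeze(0)).squeeze(0)  # (T, 512) segment features
            loss = ce(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return encoder   # reused as pre-trained weights for the next end-to-end training round
```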
According to the scheme of the embodiment of the invention, on the one hand, multi-language sign language recognition within a single framework is realized through multi-language collaborative training, the visual commonalities among different sign languages are fully mined, and sign language recognition performance is improved. On the other hand, the shared visual feature encoder is improved by obtaining the alignment between the video and the sign language annotation sequence through a maximum-probability decoding algorithm.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (5)

1. A continuous sign language recognition system based on multi-language collaboration, comprising: a shared visual feature encoder, a shared sequence model, and a number of target sequence models; wherein:
the shared visual feature encoder is used for extracting visual features from the sign language videos of all languages and inputting the visual features to the shared sequence model and to each target sequence model respectively;
the shared sequence model is used for expressing the visual patterns shared among different sign languages, learning the commonality among the different sign languages, and is initialized with language embedding vectors of the different languages;
each target sequence model is used for learning, in conjunction with the output of the shared sequence model, the mapping between the visual features of the respective language and the corresponding sign language words;
in the training stage, all target sequence models are jointly optimized; each trained target sequence model can predict the probability distribution of the sign language words corresponding to a sign language video of the corresponding language;
a maximum-probability decoding algorithm is used for obtaining the alignment between the sign language video and the sign language annotation sequence, so as to fine-tune the shared visual feature encoder, implemented as follows:
after the probability distribution Y^(k) of the sign language words is obtained through the target sequence model, the probability values corresponding to each sign language word in the annotation sequence are extracted in annotation order and combined into a new probability matrix Y^(k)′; a dynamic programming algorithm is used to find the maximum-probability path over the new probability matrix Y^(k)′;
let P_{i,j} denote the maximum probability between the feature sequence f_1, f_2, …, f_i and the annotation sequence s_1, s_2, …, s_j; the transfer equation of the dynamic programming is expressed as:
P_{i,j} = Y^(k)′_{i,j} + max(P_{i-1,j}, P_{i-1,j-1}),
where Y^(k)′_{i,j} is the element in the i-th row and j-th column of the new probability matrix Y^(k)′, i.e., the probability that the i-th video segment belongs to the sign language word s_j;
through the above operation, the alignment between the sign language video and the sign language word annotations is obtained, i.e., the video segment pseudo labels are obtained, whereby the shared visual feature encoder is optimized.
2. A continuous sign language recognition system based on multi-language collaboration as claimed in claim 1, wherein the shared visual feature encoder comprises, in sequence: a spatial convolutional network and a temporal convolutional network; wherein:
the spatial convolutional network comprises, in sequence: a first convolution layer, a first max-pooling layer, second and third convolution layers, two Inception layers, a second max-pooling layer, five Inception layers, a third max-pooling layer, two Inception layers, and a fourth max-pooling layer;
the temporal convolutional network comprises two convolution layers and max-pooling layers, with the convolution and max-pooling layers arranged alternately;
denoting the shared visual feature encoder as E_v, for a sign language video of any language X = {x_1, x_2, …, x_N}, the visual features output by the shared visual feature encoder are represented as:
F = E_v(X) = {f_1, f_2, …, f_{N/4}},
where x_t is the t-th video frame, N is the number of video frames, and each video segment corresponds to the video frames within one temporal receptive field of the shared visual feature encoder.
3. A continuous sign language recognition system based on multi-language collaboration as claimed in claim 1, wherein the shared sequence model is implemented by a bidirectional long short-term memory network; for the input visual features F, the output O_s is expressed as:
O_s = BLSTM_s(F; h_0 = e_k, c_0 = e_k),
where h_0 and c_0 are respectively the initial hidden state and cell state of the bidirectional long short-term memory network, and e_k is the category embedding vector of the k-th sign language.
4. The continuous sign language recognition system based on multi-language collaboration as claimed in claim 1, wherein each target sequence model is implemented by a bidirectional long short-term memory network initialized with zero vectors;
for the target sequence model of the k-th sign language, the output is expressed as:
O^(k) = BLSTM^(k)(F, O_s; h_0 = 0, c_0 = 0),
where F and O_s are respectively the output of the shared visual feature encoder and of the shared sequence model, and h_0 and c_0 are respectively the initial hidden state and cell state of the bidirectional long short-term memory network.
5. The continuous sign language recognition system based on multi-language collaboration as claimed in claim 1, wherein
the output O^(k) of the target sequence model is mapped to the unnormalized log-probability space with a fully connected layer, expressed as:
Y^(k) = W^(k)·O^(k) + b^(k),
where the superscript k denotes the category identifier of the sign language, W^(k) and b^(k) are respectively the weight and bias parameters of the fully connected layer, and Y^(k)_{t,s} is the probability that the t-th video segment belongs to the sign language word s;
in the training stage, the connectionist temporal classification (CTC) loss is adopted for optimization; with joint optimization, the total loss function is the sum of the CTC loss functions of all target sequence models, expressed as:
L = Σ_{k=1}^{K} L_CTC^(k),
where K is the total number of target sequence models and L_CTC^(k) is the CTC loss function of the target sequence model for the k-th sign language, computed from Y^(k).
CN202011060272.2A 2020-09-30 2020-09-30 Continuous sign language recognition system based on multi-language collaboration Active CN112132094B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011060272.2A CN112132094B (en) 2020-09-30 2020-09-30 Continuous sign language recognition system based on multi-language collaboration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011060272.2A CN112132094B (en) 2020-09-30 2020-09-30 Continuous sign language recognition system based on multi-language collaboration

Publications (2)

Publication Number Publication Date
CN112132094A CN112132094A (en) 2020-12-25
CN112132094B true CN112132094B (en) 2022-07-15

Family

ID=73843529

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011060272.2A Active CN112132094B (en) 2020-09-30 2020-09-30 Continuous sign language recognition system based on multi-language collaboration

Country Status (1)

Country Link
CN (1) CN112132094B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861827B (en) * 2021-04-08 2022-09-06 中国科学技术大学 Sign language translation method and system using single language material translation
CN113992894A (en) * 2021-10-27 2022-01-28 甘肃风尚电子科技信息有限公司 Abnormal event identification system based on monitoring video time sequence action positioning and abnormal detection

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647603A (en) * 2018-04-28 2018-10-12 清华大学 Semi-supervised continuous sign language interpretation method based on attention mechanism and device
CN110210416A (en) * 2019-06-05 2019-09-06 中国科学技术大学 Based on the decoded sign Language Recognition optimization method and device of dynamic pseudo label
CN110874537A (en) * 2018-08-31 2020-03-10 阿里巴巴集团控股有限公司 Generation method of multi-language translation model, translation method and translation equipment
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10867595B2 (en) * 2017-05-19 2020-12-15 Baidu Usa Llc Cold fusing sequence-to-sequence models with language models

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647603A (en) * 2018-04-28 2018-10-12 清华大学 Semi-supervised continuous sign language interpretation method based on attention mechanism and device
CN110874537A (en) * 2018-08-31 2020-03-10 阿里巴巴集团控股有限公司 Generation method of multi-language translation model, translation method and translation equipment
CN110210416A (en) * 2019-06-05 2019-09-06 中国科学技术大学 Based on the decoded sign Language Recognition optimization method and device of dynamic pseudo label
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research of a Sign Language Translation System Based on Deep Learning; Siming He et al.; 2019 International Conference on Artificial Intelligence and Advanced Manufacturing (AIAM); 2020-01-09; pp. 392-396 *
Research on dynamic gesture recognition fusing wide residual and long short-term memory networks; Liang Zhijie et al.; Application Research of Computers; 2019-12-31; Vol. 36, No. 12; pp. 3846-3852 *

Also Published As

Publication number Publication date
CN112132094A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
Guo et al. Back to mlp: A simple baseline for human motion prediction
CN110334361B (en) Neural machine translation method for Chinese language
CN108733792B (en) Entity relation extraction method
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
Jang et al. Recurrent neural network-based semantic variational autoencoder for sequence-to-sequence learning
CN112100351A (en) Method and equipment for constructing intelligent question-answering system through question generation data set
CN111783462A (en) Chinese named entity recognition model and method based on dual neural network fusion
CN111368565A (en) Text translation method, text translation device, storage medium and computer equipment
Ruan et al. Survey: Transformer based video-language pre-training
CN112364174A (en) Patient medical record similarity evaluation method and system based on knowledge graph
CN110059324B (en) Neural network machine translation method and device based on dependency information supervision
CN112132094B (en) Continuous sign language recognition system based on multi-language collaboration
Tang et al. Deep sequential fusion LSTM network for image description
CN111985205A (en) Aspect level emotion classification model
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
Luo et al. Hierarchical transfer learning architecture for low-resource neural machine translation
CN113204633A (en) Semantic matching distillation method and device
Song et al. Parallel temporal encoder for sign language translation
Qing-Dao-Er-Ji et al. Research on Mongolian-Chinese machine translation based on the end-to-end neural network
Basmatkar et al. Survey on neural machine translation for multilingual translation system
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN116432019A (en) Data processing method and related equipment
CN114254645A (en) Artificial intelligence auxiliary writing system
Shirghasemi et al. The impact of active learning algorithm on a cross-lingual model in a Persian sentiment task
Wang et al. Multimodal object classification using bidirectional gated recurrent unit networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant