CN112132094A - Continuous sign language recognition system based on multi-language collaboration

Continuous sign language recognition system based on multi-language collaboration

Info

Publication number
CN112132094A
Authority
CN
China
Prior art keywords
sign language
language
shared
sequence
languages
Prior art date
Legal status
Granted
Application number
CN202011060272.2A
Other languages
Chinese (zh)
Other versions
CN112132094B (en)
Inventor
李厚强
周文罡
蒲俊福
胡鹤臻
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China (USTC)
Priority to CN202011060272.2A
Publication of CN112132094A
Application granted
Publication of CN112132094B
Legal status: Active
Anticipated expiration

Classifications

    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06F18/24: Pattern recognition; classification techniques
    • G06N3/044: Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06V20/40: Scenes; scene-specific elements in video content


Abstract

The invention discloses a continuous sign language recognition system based on multi-language collaboration. A common visual feature encoder is used to extract feature representations, and for sign languages of different languages, separate temporal modeling networks (namely the target sequence models) learn the linguistic characteristics of the corresponding sign language. A shared temporal encoder (namely the shared sequence model) represents the visual patterns common to different sign languages and is initialized with language embedding vectors. Through multi-language collaborative training, multi-language sign language recognition is realized within a single framework, the visual commonalities among different sign languages are fully mined, and sign language recognition performance is improved.

Description

Continuous sign language recognition system based on multi-language collaboration
Technical Field
The invention relates to the technical field of action recognition in computer vision, in particular to a continuous sign language recognition system based on multi-language collaboration.
Background
In the continuous sign language recognition problem, each sign language video is annotated with an ordered sequence of sign language words, so the problem can essentially be regarded as learning the mapping between a video sequence and an annotated text sequence. Generally, a continuous sign language recognition system consists of a visual feature encoder and a temporal modeling model. The feature representation of the video plays a very important role in continuous sign language recognition; early work used hand-crafted features such as SIFT and HOG to characterize hand shapes and trajectories. With the successful application of deep learning in computer vision, two-dimensional convolutional neural networks for image representation and three-dimensional convolutional neural networks for video representation have been introduced into sign language recognition. Related work extracts RGB image information with a 2D CNN in an end-to-end system and obtains good performance; to model temporal dependencies, sign language recognition methods based on three-dimensional convolution kernels have also been proposed. In another video representation scheme, a 2D convolutional network and 1D temporal convolutions are used for the spatio-temporal representation of sign language video, and the visual features extracted in this way outperform those of other methods on the continuous sign language recognition task.
The sequence learning model in continuous sign language recognition can be realized with connectionist temporal classification (CTC), hidden Markov models, encoder-decoder networks, and so on. Recurrent neural networks have been successfully applied to many sequence learning tasks and have been introduced into the continuous sign language recognition problem; the bidirectional LSTM-CTC structure is one of the most widely used baselines in sign language recognition. In addition, some work embeds hidden Markov models into neural networks for sign language recognition. Similar to machine translation, attention-based encoder-decoder networks are also used to learn the mapping between videos and annotations, thereby realizing sign language recognition and sign language translation.
In the machine translation task, most methods likewise focus on the single-language problem of translating from a source language to a target language, and end-to-end solutions based on deep neural networks have made important progress on this type of problem. A machine translation system can be extended from single-language translation to multi-language translation in several ways. By adding a language identifier at the beginning of the sentence to be translated, a monolingual model can be applied to multi-language translation with a simple extension. To alleviate the problem of limited corpus resources, attempts have been made to generate more sentences from the corpus for data augmentation using an end-to-end twin network. Furthermore, by using different parameter sharing strategies, the model size in a multi-language system can be balanced.
The prior art mainly has the following defects:
1) Like natural languages, the sign languages of different countries and regions are not the same; each has its own unique grammatical structure and vocabulary. In other words, it is difficult for people who use different sign languages to understand each other's sign language semantics. Existing video sign language recognition methods usually address the recognition of a single sign language, which limits the practical application and deployment of sign language recognition systems.
2) Most existing multi-language sign language recognition algorithms are based on different sign language data sets and train multiple sets of model parameters, one per sign language, on the same network architecture. This approach achieves a certain effect, but it ignores the fact that similar visual patterns exist among different sign languages, and separate, independent training does not help the model mine the commonalities of sign languages.
Disclosure of Invention
The invention aims to provide a continuous sign language recognition system based on multi-language collaboration, which realizes multi-language sign language recognition within a single framework and achieves recognition performance superior to that obtained by training each language independently.
The purpose of the invention is realized by the following technical scheme:
A continuous sign language recognition system based on multi-language collaboration comprises: a shared visual feature encoder, a shared sequence model, and a number of target sequence models; wherein:
the shared visual feature encoder is used for extracting the visual features in the sign language videos of all languages and inputting the visual features to the shared sequence model and to each target sequence model respectively;
the shared sequence model is used for representing the visual patterns common to different sign languages, learning the commonality among different sign languages, and is initialized with the embedding vectors of the different languages;
each target sequence model is used for learning the mapping between the visual features of the corresponding language and the corresponding sign language words, in conjunction with the output of the shared sequence model;
in the training stage, joint optimization is performed over all target sequence models; each trained target sequence model can predict the probability distribution of the sign language words corresponding to sign language videos of its language.
According to the technical scheme provided by the invention, a common visual feature encoder (the shared visual feature encoder) is used to extract feature representations, and different temporal modeling networks (the target sequence models) learn the linguistic characteristics of the corresponding sign language for sign languages of different languages. A shared temporal encoder (the shared sequence model) represents the visual patterns common to different sign languages and is initialized with language embedding vectors. Through multi-language collaborative training, multi-language sign language recognition is realized within a single framework, the visual commonalities among different sign languages are fully mined, and sign language recognition performance is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a diagram illustrating three basic frameworks for multi-language identification according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a continuous sign language recognition system based on multi-language collaboration according to an embodiment of the present invention;
fig. 3 is a schematic diagram of network iterative optimization provided in the embodiment of the present invention;
fig. 4 is a schematic diagram of obtaining alignment labels based on the maximized probability according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Most existing sign language recognition frameworks recognize only a single sign language. However, different sign languages share common visual patterns, and training several independent models on different sign language data sets ignores the commonalities among the sign languages of different languages. The invention represents the visual patterns shared by different sign languages through a shared temporal encoder and realizes multi-language sign language recognition within a single framework through multi-language collaborative training; the recognition performance is superior to the results of independent training.
Similar to multi-language machine translation, three system architectures with multi-language sign language recognition capability are considered, as shown in FIG. 1. 1) The simplest approach is to use a shared visual encoder and a shared sequence model, as shown in part (a) of fig. 1; this simple architecture can be implemented without major changes to existing continuous sign language systems. However, it impairs the sequence model's ability to handle the correspondence mappings of multiple languages. 2) Different models are used for different sign languages while sharing the same visual encoder, as shown in part (b) of fig. 1. Each branch functions as a classical sign language recognition system, and no information is shared between the target branches; this design is language-specific, but the complementarity between different sign languages cannot be explored. 3) To combine the advantages of the first two architectures, the third approach adds an additional shared sequence model to learn the commonality between different sign languages, as shown in part (c) of fig. 1.
The continuous sign language recognition system based on multi-language collaboration provided by the embodiment of the invention adopts the structure shown in part (c) of FIG. 1. It uses one common CNN-TCN to extract features for all input sign languages. An independent target sequence model is adopted for each language to learn the correspondence between the visual features and the sign language words. Furthermore, in order to model the visual patterns shared between different sign languages, a shared sequence model is introduced to capture the common points among all sign languages. Each branch of the system (target sequence model) is optimized with the CTC loss. The main structure of the system is shown in fig. 2 and comprises: a shared visual feature encoder, a shared sequence model, and several target sequence models. The principles of each part of the system and the training scheme are described below.
I. Shared visual feature encoder.
The shared visual feature encoder is used for extracting the visual features in the sign language videos of all languages and inputting them to the shared sequence model and to each target sequence model respectively.
In the embodiment of the invention, the shared visual feature encoder consists of, in sequence, a spatial convolutional network (Spatial CNN) and a temporal convolutional network (Temporal CNN), which are used to extract the visual features of the video.
As shown in the dashed box at the bottom of fig. 2:
The spatial convolutional network mainly comprises, in sequence: a first convolution layer, a first max-pooling layer, second and third convolution layers, two Inception layers, a second max-pooling layer, five Inception layers, a third max-pooling layer, two Inception layers, and a fourth max-pooling layer.
The temporal convolutional network mainly comprises two convolution layers and max-pooling layers arranged alternately.
The temporal receptive field of the shared visual feature encoder is 16 frames. Denote the shared visual feature encoder as E_v. For a sign language video of any language X = (x_1, x_2, ..., x_N), the visual features output by the shared visual feature encoder are represented as:

F = E_v(X) = (f_1, f_2, ..., f_{N/4}),

where x_t denotes the t-th video frame, N is the number of video frames, and f_i is the visual feature of the i-th video segment (the video frames falling in the corresponding receptive field); the output length is N/4 of the input length.
II. Sequence learning.
Long-term dependencies can be effectively modeled by a long short-term memory network (LSTM). At the current time t, an LSTM unit is described by its cell state C_t and hidden state h_t; its basic idea is to control the update of the cell and hidden states by introducing gate structures. An LSTM cell has three different gates: an input gate i_t, a forget gate g_t (written g_t here to avoid a clash with the input feature f_t), and an output gate o_t. They are computed as follows:

i_t = σ(W_i · [h_{t-1}, f_t] + b_i),
g_t = σ(W_g · [h_{t-1}, f_t] + b_g),
o_t = σ(W_o · [h_{t-1}, f_t] + b_o),

where σ is the activation function, t denotes the time step, f_t is the input feature, and W and b are the linear mapping weights and biases. The current cell state and hidden state are updated as follows:

C̃_t = tanh(W_C · [h_{t-1}, f_t] + b_C),
C_t = g_t ⊙ C_{t-1} + i_t ⊙ C̃_t,
h_t = o_t ⊙ tanh(C_t),

where ⊙ denotes the element-wise product of vectors.
In the present invention, in order to model bidirectional temporal information, a bidirectional long short-term memory network (BLSTM) is used for temporal encoding. Two different BLSTM networks are used in this framework for different purposes. On the one hand, a separate sequence model (the target sequence model) is used to learn the mapping between visual features and sign language words, because each language has its own unique rules. The separate sequence modeling branches help capture the characteristics of each particular sign language and can reduce interference between cross-language sequence modeling. On the other hand, in order to encode the similar visual patterns in different sign languages, the invention introduces a shared sequence model to learn the commonality between different sign languages. To embed the language identity information, the state of the shared model is initialized with different language embedding vectors.
1. Shared sequence model.
The shared sequence model is used for representing the visual patterns common to different sign languages, learning the commonality among different sign languages, and is initialized with the embedding vectors of the different languages.
In order to embed the language category information in the shared sequence model, the category information of the language is encoded by an embedding layer (Embedding Layer) and used to initialize the BLSTM in the shared sequence model, so as to distinguish different languages.
For the input visual features F, the output features O_s are expressed as:

O_s = BLSTM_s(F; h_0 = e_k, c_0 = e_k),

where h_0 and c_0 are respectively the initial hidden state and cell state of the bidirectional long short-term memory network, e_k is the category embedding vector of the k-th sign language, and BLSTM_s denotes the shared sequence model.
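As an illustration, a minimal PyTorch-style sketch of this language-embedding initialization is given below; the hidden size, the number of languages and the single-layer BLSTM are assumptions of the example:

import torch
import torch.nn as nn

class SharedSequenceModel(nn.Module):
    """Illustrative sketch of the shared BLSTM whose initial states carry the language identity."""

    def __init__(self, feat_dim: int = 1024, hidden: int = 512, num_languages: int = 2):
        super().__init__()
        self.lang_embed = nn.Embedding(num_languages, hidden)  # category embedding e_k
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, feats: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, feat_dim); lang_id: (B,) integer index k of the sign language
        e_k = self.lang_embed(lang_id)           # (B, hidden)
        h0 = e_k.unsqueeze(0).repeat(2, 1, 1)    # (num_layers * 2 directions, B, hidden)
        c0 = h0.clone()                          # c_0 = e_k as well
        o_s, _ = self.blstm(feats, (h0, c0))     # O_s: (B, T, 2 * hidden)
        return o_s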
2. Target sequence model.
Each target sequence model is used for learning the mapping between the visual features of the corresponding language and the corresponding sign language words, in conjunction with the output of the shared sequence model.
The target sequence model of the k-th sign language, denoted BLSTM_t^(k), takes the visual features F together with the shared output O_s and is initialized with zero vectors; its output result O_t^(k) is expressed as:

O_t^(k) = BLSTM_t^(k)([F, O_s]; h_0 = 0, c_0 = 0).
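A matching sketch of one language-specific branch is given below; feeding the concatenation of F and O_s to the branch BLSTM, the dimensions, and the vocabulary size are illustrative assumptions:

import torch
import torch.nn as nn

class TargetSequenceModel(nn.Module):
    """Illustrative sketch of the target branch for one sign language: a zero-initialised
    BLSTM over the visual features combined with the shared output, followed by a fully
    connected classifier over that language's sign vocabulary (plus the CTC blank)."""

    def __init__(self, feat_dim: int = 1024, shared_dim: int = 1024,
                 hidden: int = 512, vocab_size: int = 1200):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim + shared_dim, hidden,
                             batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, feats: torch.Tensor, o_s: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, feat_dim), o_s: (B, T, shared_dim); h_0 and c_0 default to zeros
        o_k, _ = self.blstm(torch.cat([feats, o_s], dim=-1))
        return self.classifier(o_k)  # Y^(k): (B, T, vocab_size + 1) un-normalised scores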
III. Model optimization.
1. Multi-language joint optimization.
In the embodiment of the invention, in the training stage, joint optimization is carried out over all target sequence models; each trained target sequence model can predict the probability distribution of the sign language words corresponding to sign language videos of its language.
In order to obtain the probability distribution of the target sign language words, the output O_t^(k) of the target sequence model is mapped to an un-normalized log-probability space with a fully connected layer, expressed as:

Y^(k) = W_fc^(k) · O_t^(k) + b_fc^(k),

where the superscript k identifies the sign language, W_fc^(k) and b_fc^(k) are respectively the weight and bias parameters of the fully connected layer, and Y_{t,s} is the probability that the t-th video segment belongs to the sign language word s.
In the training stage, the connectionist temporal classification (CTC) loss is adopted for optimization. With joint optimization, the total loss function is the sum of the CTC loss functions of all target sequence models, expressed as:

L = Σ_{k=1}^{K} L_CTC^(k),

where K is the total number of target sequence models and L_CTC^(k) is the CTC loss function of the target sequence model of the k-th sign language, computed using Y^(k).
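A minimal sketch of this joint objective is given below, assuming one mini-batch per language and PyTorch's built-in CTC loss; the variable names and the per-language batching are assumptions of the example:

import torch.nn.functional as F_nn

def joint_ctc_loss(logits_per_lang, targets_per_lang,
                   input_lengths_per_lang, target_lengths_per_lang, blank=0):
    """Total loss = sum of the CTC losses of all K target branches.
    Each element of logits_per_lang is a (B, T, V) tensor from one language's branch."""
    total = 0.0
    for logits, tgt, in_len, tgt_len in zip(logits_per_lang, targets_per_lang,
                                            input_lengths_per_lang, target_lengths_per_lang):
        log_probs = logits.log_softmax(-1).transpose(0, 1)  # CTC expects (T, B, V)
        total = total + F_nn.ctc_loss(log_probs, tgt, in_len, tgt_len,
                                      blank=blank, zero_infinity=True)
    return total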
2. Shared visual feature encoder optimization.
Existing studies have shown that iterative training of the CNN is an effective way to further improve performance. The idea is to obtain the alignment between the input video and the sign language words and use it to fine-tune the feature extraction network; the optimization process is shown in fig. 3. On this basis, the embodiment of the present invention provides a method for obtaining the alignment between the sign language video and the sign language annotation sequence based on a maximum-probability decoding algorithm, so as to fine-tune the shared visual feature encoder, which is as follows:
After the probability distribution Y^(k) of the sign language words is obtained from the target sequence model, the category probability values of each video segment for the current sign language word are extracted in turn, following the order of the sign language words in the sign language annotation sequence, and assembled into a new probability matrix Y^(k)', as shown in fig. 4, where T is the number of video segments. A dynamic programming algorithm is then used to find the maximum-probability path on the new probability matrix Y^(k)'.
Let P_{i,j} be the maximum probability between the feature sequence f_1, f_2, ..., f_i and the annotation sequence s_1, s_2, ..., s_j. The transfer equation of the dynamic programming is expressed as:

P_{i,j} = Y^(k)'_{i,j} + max(P_{i-1,j}, P_{i-1,j-1}),

where Y^(k)'_{i,j} is the element in the i-th row and j-th column of the new probability matrix Y^(k)', i.e. the probability that the i-th video segment belongs to the sign language word s_j; i ≤ N/4.
Through the above procedure, the alignment between the sign language video and the sign language word labels is obtained, i.e. the segment-level category pseudo labels (video segment pseudo labels) are obtained. The shared visual feature encoder is then optimized by treating this as a video classification task. The optimized shared visual feature encoder is used as pre-trained parameters and plugged back into the whole framework for end-to-end training, thereby realizing continuous iterative optimization.
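For illustration, a minimal NumPy sketch of this maximum-probability alignment is given below; it implements the transfer equation with backtracking and assumes the number of video segments T is at least the length of the annotation sequence. The resulting segment-level pseudo labels can then be used to fine-tune the shared visual feature encoder as a classification task:

import numpy as np

def max_prob_alignment(Y_prime: np.ndarray) -> np.ndarray:
    """Y_prime: (T, L) matrix, Y_prime[i, j] = probability that video segment i belongs to
    the j-th word of the annotation sequence. Returns a length-T array of word indices
    (segment-level pseudo labels) along the maximum-probability monotonic path."""
    T, L = Y_prime.shape
    P = np.full((T, L), -np.inf)
    came_from_prev = np.zeros((T, L), dtype=int)  # 1 if the path advanced from word j-1
    P[0, 0] = Y_prime[0, 0]                       # the path must start at the first word
    for i in range(1, T):
        for j in range(min(i + 1, L)):            # at segment i the path can be at word <= i
            stay = P[i - 1, j]
            advance = P[i - 1, j - 1] if j > 0 else -np.inf
            if advance > stay:
                P[i, j] = Y_prime[i, j] + advance
                came_from_prev[i, j] = 1
            else:
                P[i, j] = Y_prime[i, j] + stay
    # backtrack from the last segment / last word to recover the aligned word of each segment
    align = np.zeros(T, dtype=int)
    j = L - 1
    for i in range(T - 1, -1, -1):
        align[i] = j
        j -= came_from_prev[i, j]
    return align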
According to the scheme of the embodiment of the invention, on the one hand, multi-language sign language recognition within a single framework is realized through multi-language collaborative training, the visual commonality among different sign languages is fully mined, and sign language recognition performance is improved. On the other hand, the shared visual feature encoder is improved by obtaining the alignment between the video and the sign language annotation sequence through a maximum-probability decoding algorithm.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, or by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash disk, or a removable hard disk) and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A continuous sign language recognition system based on multi-language collaboration, comprising: a shared visual feature encoder, a shared sequence model, and a number of target sequence models; wherein:
the shared visual feature encoder is used for extracting the visual features in the sign language videos of all languages and inputting the visual features to the shared sequence model and to each target sequence model respectively;
the shared sequence model is used for representing the visual patterns common to different sign languages, learning the commonality among different sign languages, and is initialized with the embedding vectors of the different languages;
each target sequence model is used for learning the mapping between the visual features of the corresponding language and the corresponding sign language words, in conjunction with the output of the shared sequence model;
in the training stage, joint optimization is performed over all target sequence models; each trained target sequence model can predict the probability distribution of the sign language words corresponding to sign language videos of its language.
2. The system of claim 1, wherein the shared visual feature encoder comprises, in order: a spatial convolutional network and a temporal convolutional network; wherein:
the spatial convolutional network comprises, in order: a first convolution layer, a first max-pooling layer, second and third convolution layers, two Inception layers, a second max-pooling layer, five Inception layers, a third max-pooling layer, two Inception layers, and a fourth max-pooling layer;
the temporal convolutional network comprises two convolution layers and max-pooling layers arranged alternately;
denoting the shared visual feature encoder as E_v, for a sign language video of any language X = (x_1, x_2, ..., x_N), the visual features output by the shared visual feature encoder are represented as:

F = E_v(X) = (f_1, f_2, ..., f_{N/4}),

where x_t denotes the t-th video frame, and a video segment is the set of video frames corresponding to the temporal receptive field of the shared visual feature encoder.
3. The continuous sign language recognition system based on multi-language collaboration as claimed in claim 1, wherein the shared sequence model is implemented by a bidirectional long short-term memory network; for the input visual features F, the output result O_s is expressed as:

O_s = BLSTM_s(F; h_0 = e_k, c_0 = e_k)

where h_0 and c_0 are respectively the initial hidden state and cell state of the bidirectional long short-term memory network, and e_k is the category embedding vector of the k-th sign language.
4. The continuous sign language recognition system based on multi-language collaboration as claimed in claim 1, wherein each target sequence model is implemented by a bidirectional long short-term memory network initialized with zero vectors;
for the target sequence model of the k-th sign language, the output result O_t^(k) is expressed as:

O_t^(k) = BLSTM_t^(k)([F, O_s]; h_0 = 0, c_0 = 0)

where F and O_s are respectively the output of the shared visual feature encoder and of the shared sequence model, and h_0 and c_0 are respectively the initial hidden state and cell state of the bidirectional long short-term memory network.
5. The continuous sign language recognition system based on multi-language collaboration as claimed in claim 1, wherein
the output O_t^(k) of the target sequence model is mapped to an un-normalized log-probability space with a fully connected layer, expressed as:

Y^(k) = W_fc^(k) · O_t^(k) + b_fc^(k)

where the superscript k identifies the sign language, W_fc^(k) and b_fc^(k) are respectively the weight and bias parameters of the fully connected layer, and Y_{t,s} is the probability that the t-th video segment belongs to the sign language word s;
in the training stage, the connectionist temporal classification (CTC) loss is adopted for optimization;
with joint optimization, the total loss function is the sum of the CTC loss functions of all target sequence models, expressed as:

L = Σ_{k=1}^{K} L_CTC^(k)

where K is the total number of target sequence models, and L_CTC^(k) is the CTC loss function of the target sequence model of the k-th sign language, computed using Y^(k).
6. The system of claim 1, further comprising: obtaining the alignment between the sign language video and the sign language annotation sequence by a maximum-probability decoding algorithm so as to fine-tune the shared visual feature encoder, which comprises:
after the probability distribution Y^(k) of the sign language words is obtained from the target sequence model, extracting, following the order of the sign language words in the sign language annotation sequence, the probability values of each video segment for the current sign language word, and assembling them into a new probability matrix Y^(k)'; finding the maximum-probability path on the new probability matrix Y^(k)' with a dynamic programming algorithm;
letting P_{i,j} be the maximum probability between the feature sequence f_1, f_2, ..., f_i and the annotation sequence s_1, s_2, ..., s_j, the transfer equation of the dynamic programming is expressed as:

P_{i,j} = Y^(k)'_{i,j} + max(P_{i-1,j}, P_{i-1,j-1})

where Y^(k)'_{i,j} is the element in the i-th row and j-th column of the new probability matrix Y^(k)', i.e. the probability that the i-th video segment belongs to the sign language word s_j;
through the above operation, the alignment between the sign language video and the sign language word labels, i.e. the video segment pseudo labels, is obtained, and the shared visual feature encoder is optimized accordingly.
CN202011060272.2A (priority and filing date 2020-09-30): Continuous sign language recognition system based on multi-language collaboration; status: Active; granted as CN112132094B

Priority Applications (1)

Application Number: CN202011060272.2A; Priority date: 2020-09-30; Filing date: 2020-09-30; Title: Continuous sign language recognition system based on multi-language collaboration

Publications (2)

Publication Number Publication Date
CN112132094A 2020-12-25
CN112132094B CN112132094B (en) 2022-07-15

Family

ID=73843529

Family Applications (1)

Application Number: CN202011060272.2A; Status: Active (granted as CN112132094B); Priority date: 2020-09-30; Filing date: 2020-09-30; Title: Continuous sign language recognition system based on multi-language collaboration

Country Status (1)

Country Link
CN (1) CN112132094B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180336884A1 (en) * 2017-05-19 2018-11-22 Baidu Usa Llc Cold fusing sequence-to-sequence models with language models
CN108647603A (en) * 2018-04-28 2018-10-12 清华大学 Semi-supervised continuous sign language interpretation method based on attention mechanism and device
CN110874537A (en) * 2018-08-31 2020-03-10 阿里巴巴集团控股有限公司 Generation method of multi-language translation model, translation method and translation equipment
CN110210416A (en) * 2019-06-05 2019-09-06 中国科学技术大学 Based on the decoded sign Language Recognition optimization method and device of dynamic pseudo label
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Siming He et al.: "Research of a Sign Language Translation System Based on Deep Learning", 2019 International Conference on Artificial Intelligence and Advanced Manufacturing (AIAM) *
Liang Zhijie et al.: "Research on dynamic gesture recognition fusing wide residual and long short-term memory networks", Application Research of Computers *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861827A (en) * 2021-04-08 2021-05-28 中国科学技术大学 Sign language translation method and system using single language material translation
CN112861827B (en) * 2021-04-08 2022-09-06 中国科学技术大学 Sign language translation method and system using single language material translation
CN113992894A (en) * 2021-10-27 2022-01-28 甘肃风尚电子科技信息有限公司 Abnormal event identification system based on monitoring video time sequence action positioning and abnormal detection

Also Published As

Publication number Publication date
CN112132094B (en) 2022-07-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant