CN112132094A - Continuous sign language recognition system based on multi-language collaboration

Continuous sign language recognition system based on multi-language collaboration

Info

Publication number
CN112132094A
Authority
CN
China
Prior art keywords
sign language
language
shared
sequence
languages
Prior art date
Legal status
Granted
Application number
CN202011060272.2A
Other languages
Chinese (zh)
Other versions
CN112132094B (en)
Inventor
李厚强
周文罡
蒲俊福
胡鹤臻
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China (USTC)
Priority to CN202011060272.2A
Publication of CN112132094A
Application granted
Publication of CN112132094B
Legal status: Active
Anticipated expiration

Classifications

    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06F18/24: Pattern recognition; classification techniques
    • G06N3/044: Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06V20/40: Scenes; scene-specific elements in video content


Abstract

The invention discloses a continuous sign language recognition system based on multi-language collaboration. A common visual feature encoder is used to extract feature representations, and for sign languages of different languages, separate temporal modeling networks (namely the target sequence models) learn the linguistic characteristics of the corresponding sign language. A shared temporal encoder (namely the shared sequence model) represents the visual patterns common to different sign languages and is initialized with language embedding vectors. Through multi-language collaborative training, multi-language sign language recognition is realized within a single framework, the visual commonalities among different sign languages are fully mined, and sign language recognition performance is improved.

Description

Continuous sign language recognition system based on multi-language collaboration
Technical Field
The invention relates to the technical field of action recognition in computer vision, in particular to a continuous sign language recognition system based on multi-language collaboration.
Background
In the continuous sign language recognition problem, each sign language video is annotated with an ordered sequence of sign language words, so the problem can essentially be regarded as learning the mapping between a video sequence and an annotated text sequence. Generally, a continuous sign language recognition system consists of a visual feature encoder and a temporal modeling model. The feature representation of the video plays a very important role in continuous sign language recognition; early work used hand-crafted features such as SIFT and HOG to characterize hand shapes and trajectories. With the successful application of deep learning in computer vision, two-dimensional convolutional neural networks for image representation and three-dimensional convolutional neural networks for video representation have been introduced into sign language recognition. Related work extracts RGB image information with a 2D CNN in an end-to-end system and obtains good performance; to model temporal dependencies, sign language recognition methods based on three-dimensional convolution kernels have also been proposed. In another video representation scheme, a 2D convolutional network and 1D temporal convolutions are used for the spatio-temporal representation of sign language video, and the visual features extracted in this way outperform those of other methods on the continuous sign language recognition task.
The sequence learning model in continuous sign language recognition can be realized with connectionist temporal classification (CTC), hidden Markov models, encoder-decoder networks, and so on. Recurrent neural networks have been successfully applied to many sequence learning tasks and have been introduced into the continuous sign language recognition problem; the bidirectional LSTM-CTC structure is one of the most widely used baselines in sign language recognition. In addition, some work embeds hidden Markov models into neural networks for sign language recognition. Similar to machine translation, attention-based encoder-decoder networks are also used to learn the mapping between videos and annotations, thereby realizing sign language recognition and sign language translation.
In the machine translation task, most methods likewise focus on the single-language problem of translating from a source language to a target language, and end-to-end solutions based on deep neural networks have made important progress on this type of problem. A machine translation system can be extended from single-language translation to multi-language translation in several ways. By adding a language identifier at the beginning of the sentence to be translated, a monolingual model can be applied to multi-language translation with a simple extension. To alleviate the problem of limited corpus resources, attempts have been made to generate more sentences from the corpus for data augmentation using an end-to-end twin network. Furthermore, by using different parameter sharing strategies, the model size in a multi-language system can be balanced.
The prior art mainly has the following defects:
1) Like natural languages, the sign languages of different countries and regions are not the same; each has its own unique grammatical structure and vocabulary. In other words, it is difficult for people who use different sign languages to understand each other's sign language semantics. Existing video sign language recognition methods usually address the recognition of a single sign language, which limits the practical application and deployment of sign language recognition systems.
2) Most existing multi-language sign language recognition algorithms are based on different sign language data sets and train multiple sets of model parameters, one per sign language, on the same network architecture. This approach achieves a certain effect, but it ignores the fact that similar visual patterns exist among different sign languages, and separate, independent training does not help the model mine the commonalities of sign languages.
Disclosure of Invention
The invention aims to provide a continuous sign language recognition system based on multi-language collaboration, which realizes multi-language sign language recognition within a single framework and achieves recognition performance superior to that obtained by training each language independently.
The purpose of the invention is realized by the following technical scheme:
A continuous sign language recognition system based on multi-language collaboration comprises: a shared visual feature encoder, a shared sequence model, and a number of target sequence models; wherein:
the shared visual feature encoder is used for extracting the visual features in the sign language videos of all languages and inputting the visual features to the shared sequence model and to each target sequence model respectively;
the shared sequence model is used for representing the visual patterns common to different sign languages, learning the commonality among different sign languages, and is initialized with the embedding vectors of the different languages;
each target sequence model is used for learning the mapping between the visual features of the corresponding language and the corresponding sign language words, in conjunction with the output of the shared sequence model;
in the training stage, joint optimization is performed over all target sequence models; each trained target sequence model can predict the probability distribution of the sign language words corresponding to sign language videos of its language.
According to the technical scheme provided by the invention, a common visual feature encoder (the shared visual feature encoder) is used to extract feature representations, and different temporal modeling networks (the target sequence models) learn the linguistic characteristics of the corresponding sign language for sign languages of different languages. A shared temporal encoder (the shared sequence model) represents the visual patterns common to different sign languages and is initialized with language embedding vectors. Through multi-language collaborative training, multi-language sign language recognition is realized within a single framework, the visual commonalities among different sign languages are fully mined, and sign language recognition performance is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a diagram illustrating three basic frameworks for multi-language identification according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a continuous sign language recognition system based on multi-language collaboration according to an embodiment of the present invention;
fig. 3 is a schematic diagram of network iterative optimization provided in the embodiment of the present invention;
fig. 4 is a schematic diagram of obtaining alignment labels based on the maximized probability according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Most existing sign language recognition frameworks recognize only a single sign language. However, different sign languages share common visual patterns, and training several independent models on different sign language data sets ignores the commonalities among the sign languages of different languages. The invention represents the visual patterns shared by different sign languages through a shared temporal encoder and realizes multi-language sign language recognition within a single framework through multi-language collaborative training; the recognition performance is superior to the results of independent training.
Similar to multi-language machine translation, three system architectures with multi-language sign language recognition capability are considered, as shown in FIG. 1. 1) The simplest approach is to use a shared visual encoder and a shared sequence model, as shown in part (a) of fig. 1; this simple architecture can be implemented without major changes to existing continuous sign language systems. However, it impairs the sequence model's ability to handle the correspondence mappings of multiple languages. 2) Different models are used for different sign languages while sharing the same visual encoder, as shown in part (b) of fig. 1. Each branch functions as a classical sign language recognition system, and no information is shared between the target branches; this design is language-specific, but the complementarity between different sign languages cannot be explored. 3) To combine the advantages of the first two architectures, the third approach adds an additional shared sequence model to learn the commonality between different sign languages, as shown in part (c) of fig. 1.
The continuous sign language recognition system based on multi-language collaboration provided by the embodiment of the invention adopts the structure shown in part (c) of FIG. 1. It uses one common CNN-TCN to extract features for all input sign languages. An independent target sequence model is adopted for each language to learn the correspondence between the visual features and the sign language words. Furthermore, in order to model the visual patterns shared between different sign languages, a shared sequence model is introduced to capture the common points among all sign languages. Each branch of the system (target sequence model) is optimized with the CTC loss. The main structure of the system is shown in fig. 2 and comprises: a shared visual feature encoder, a shared sequence model, and several target sequence models. The principles of each part of the system and the training scheme are described below.
I. Shared visual feature encoder.
The shared visual feature encoder is used for extracting the visual features in the sign language videos of all languages and inputting them to the shared sequence model and to each target sequence model respectively.
In the embodiment of the invention, the shared visual feature encoder consists of, in sequence, a spatial convolutional network (Spatial CNN) and a temporal convolutional network (Temporal CNN), which are used to extract the visual features of the video.
As shown in the dashed box at the bottom of fig. 2:
The spatial convolutional network mainly comprises, in sequence: a first convolution layer, a first max-pooling layer, second and third convolution layers, two Inception layers, a second max-pooling layer, five Inception layers, a third max-pooling layer, two Inception layers, and a fourth max-pooling layer.
The temporal convolutional network mainly comprises two convolution layers and max-pooling layers arranged alternately.
The temporal receptive field of the shared visual feature encoder is 16 frames. Denote the shared visual feature encoder as E_v. For a sign language video of any language X = (x_1, x_2, ..., x_N), the visual features output by the shared visual feature encoder are represented as:

F = E_v(X) = (f_1, f_2, ..., f_{N/4}),

where x_t denotes the t-th video frame, N is the number of video frames, and f_i is the visual feature of the i-th video segment (the video frames falling in the corresponding receptive field); the output length is N/4 of the input length.
II. Sequence learning.
Long-term dependencies can be effectively modeled by a long short-term memory network (LSTM). At the current time t, an LSTM unit is described by its cell state C_t and hidden state h_t; its basic idea is to control the update of the cell and hidden states by introducing gate structures. An LSTM cell has three different gates: an input gate i_t, a forget gate g_t (written g_t here to avoid a clash with the input feature f_t), and an output gate o_t. They are computed as follows:

i_t = σ(W_i · [h_{t-1}, f_t] + b_i),
g_t = σ(W_g · [h_{t-1}, f_t] + b_g),
o_t = σ(W_o · [h_{t-1}, f_t] + b_o),

where σ is the activation function, t denotes the time step, f_t is the input feature, and W and b are the linear mapping weights and biases. The current cell state and hidden state are updated as follows:

C̃_t = tanh(W_C · [h_{t-1}, f_t] + b_C),
C_t = g_t ⊙ C_{t-1} + i_t ⊙ C̃_t,
h_t = o_t ⊙ tanh(C_t),

where ⊙ denotes the element-wise product of vectors.
In the present invention, in order to model bidirectional temporal information, a bidirectional long short-term memory network (BLSTM) is used for temporal encoding. Two different BLSTM networks are used in this framework for different purposes. On the one hand, a separate sequence model (the target sequence model) is used to learn the mapping between visual features and sign language words, because each language has its own unique rules. The separate sequence modeling branches help capture the characteristics of each particular sign language and can reduce interference between cross-language sequence modeling. On the other hand, in order to encode the similar visual patterns in different sign languages, the invention introduces a shared sequence model to learn the commonality between different sign languages. To embed the language identity information, the state of the shared model is initialized with different language embedding vectors.
1. Shared sequence model.
The shared sequence model is used for representing the visual patterns common to different sign languages, learning the commonality among different sign languages, and is initialized with the embedding vectors of the different languages.
In order to embed the language category information in the shared sequence model, the category information of the language is encoded by an embedding layer (Embedding Layer) and used to initialize the BLSTM in the shared sequence model, so as to distinguish different languages.
For the input visual features F, the output features O_s are expressed as:

O_s = BLSTM_s(F; h_0 = e_k, c_0 = e_k),

where h_0 and c_0 are respectively the initial hidden state and cell state of the bidirectional long short-term memory network, e_k is the category embedding vector of the k-th sign language, and BLSTM_s denotes the shared sequence model.
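As an illustration, a minimal PyTorch-style sketch of this language-embedding initialization is given below; the hidden size, the number of languages and the single-layer BLSTM are assumptions of the example:

import torch
import torch.nn as nn

class SharedSequenceModel(nn.Module):
    """Illustrative sketch of the shared BLSTM whose initial states carry the language identity."""

    def __init__(self, feat_dim: int = 1024, hidden: int = 512, num_languages: int = 2):
        super().__init__()
        self.lang_embed = nn.Embedding(num_languages, hidden)  # category embedding e_k
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, feats: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, feat_dim); lang_id: (B,) integer index k of the sign language
        e_k = self.lang_embed(lang_id)           # (B, hidden)
        h0 = e_k.unsqueeze(0).repeat(2, 1, 1)    # (num_layers * 2 directions, B, hidden)
        c0 = h0.clone()                          # c_0 = e_k as well
        o_s, _ = self.blstm(feats, (h0, c0))     # O_s: (B, T, 2 * hidden)
        return o_s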
2. Target sequence model.
Each target sequence model is used for learning the mapping between the visual features of the corresponding language and the corresponding sign language words, in conjunction with the output of the shared sequence model.
The target sequence model of the k-th sign language, denoted BLSTM_t^(k), takes the visual features F together with the shared output O_s and is initialized with zero vectors; its output result O_t^(k) is expressed as:

O_t^(k) = BLSTM_t^(k)([F, O_s]; h_0 = 0, c_0 = 0).
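A matching sketch of one language-specific branch is given below; feeding the concatenation of F and O_s to the branch BLSTM, the dimensions, and the vocabulary size are illustrative assumptions:

import torch
import torch.nn as nn

class TargetSequenceModel(nn.Module):
    """Illustrative sketch of the target branch for one sign language: a zero-initialised
    BLSTM over the visual features combined with the shared output, followed by a fully
    connected classifier over that language's sign vocabulary (plus the CTC blank)."""

    def __init__(self, feat_dim: int = 1024, shared_dim: int = 1024,
                 hidden: int = 512, vocab_size: int = 1200):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim + shared_dim, hidden,
                             batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, feats: torch.Tensor, o_s: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, feat_dim), o_s: (B, T, shared_dim); h_0 and c_0 default to zeros
        o_k, _ = self.blstm(torch.cat([feats, o_s], dim=-1))
        return self.classifier(o_k)  # Y^(k): (B, T, vocab_size + 1) un-normalised scores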
III. Model optimization.
1. Multi-language joint optimization.
In the embodiment of the invention, in the training stage, joint optimization is carried out over all target sequence models; each trained target sequence model can predict the probability distribution of the sign language words corresponding to sign language videos of its language.
In order to obtain the probability distribution of the target sign language words, the output O_t^(k) of the target sequence model is mapped to an un-normalized log-probability space with a fully connected layer, expressed as:

Y^(k) = W_fc^(k) · O_t^(k) + b_fc^(k),

where the superscript k identifies the sign language, W_fc^(k) and b_fc^(k) are respectively the weight and bias parameters of the fully connected layer, and Y_{t,s} is the probability that the t-th video segment belongs to the sign language word s.
In the training stage, the connectionist temporal classification (CTC) loss is adopted for optimization. With joint optimization, the total loss function is the sum of the CTC loss functions of all target sequence models, expressed as:

L = Σ_{k=1}^{K} L_CTC^(k),

where K is the total number of target sequence models and L_CTC^(k) is the CTC loss function of the target sequence model of the k-th sign language, computed using Y^(k).
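A minimal sketch of this joint objective is given below, assuming one mini-batch per language and PyTorch's built-in CTC loss; the variable names and the per-language batching are assumptions of the example:

import torch.nn.functional as F_nn

def joint_ctc_loss(logits_per_lang, targets_per_lang,
                   input_lengths_per_lang, target_lengths_per_lang, blank=0):
    """Total loss = sum of the CTC losses of all K target branches.
    Each element of logits_per_lang is a (B, T, V) tensor from one language's branch."""
    total = 0.0
    for logits, tgt, in_len, tgt_len in zip(logits_per_lang, targets_per_lang,
                                            input_lengths_per_lang, target_lengths_per_lang):
        log_probs = logits.log_softmax(-1).transpose(0, 1)  # CTC expects (T, B, V)
        total = total + F_nn.ctc_loss(log_probs, tgt, in_len, tgt_len,
                                      blank=blank, zero_infinity=True)
    return total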
2. Shared visual feature encoder optimization.
Existing studies have shown that iterative training of the CNN is an effective way to further improve performance. The idea is to obtain the alignment between the input video and the sign language words and use it to fine-tune the feature extraction network; the optimization process is shown in fig. 3. On this basis, the embodiment of the present invention provides a method for obtaining the alignment between the sign language video and the sign language annotation sequence based on a maximum-probability decoding algorithm, so as to fine-tune the shared visual feature encoder, which is as follows:
After the probability distribution Y^(k) of the sign language words is obtained from the target sequence model, the category probability values of each video segment for the current sign language word are extracted in turn, following the order of the sign language words in the sign language annotation sequence, and assembled into a new probability matrix Y^(k)', as shown in fig. 4, where T is the number of video segments. A dynamic programming algorithm is then used to find the maximum-probability path on the new probability matrix Y^(k)'.
Let P_{i,j} be the maximum probability between the feature sequence f_1, f_2, ..., f_i and the annotation sequence s_1, s_2, ..., s_j. The transfer equation of the dynamic programming is expressed as:

P_{i,j} = Y^(k)'_{i,j} + max(P_{i-1,j}, P_{i-1,j-1}),

where Y^(k)'_{i,j} is the element in the i-th row and j-th column of the new probability matrix Y^(k)', i.e. the probability that the i-th video segment belongs to the sign language word s_j; i ≤ N/4.
Through the above procedure, the alignment between the sign language video and the sign language word labels is obtained, i.e. the segment-level category pseudo labels (video segment pseudo labels) are obtained. The shared visual feature encoder is then optimized by treating this as a video classification task. The optimized shared visual feature encoder is used as pre-trained parameters and plugged back into the whole framework for end-to-end training, thereby realizing continuous iterative optimization.
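For illustration, a minimal NumPy sketch of this maximum-probability alignment is given below; it implements the transfer equation with backtracking and assumes the number of video segments T is at least the length of the annotation sequence. The resulting segment-level pseudo labels can then be used to fine-tune the shared visual feature encoder as a classification task:

import numpy as np

def max_prob_alignment(Y_prime: np.ndarray) -> np.ndarray:
    """Y_prime: (T, L) matrix, Y_prime[i, j] = probability that video segment i belongs to
    the j-th word of the annotation sequence. Returns a length-T array of word indices
    (segment-level pseudo labels) along the maximum-probability monotonic path."""
    T, L = Y_prime.shape
    P = np.full((T, L), -np.inf)
    came_from_prev = np.zeros((T, L), dtype=int)  # 1 if the path advanced from word j-1
    P[0, 0] = Y_prime[0, 0]                       # the path must start at the first word
    for i in range(1, T):
        for j in range(min(i + 1, L)):            # at segment i the path can be at word <= i
            stay = P[i - 1, j]
            advance = P[i - 1, j - 1] if j > 0 else -np.inf
            if advance > stay:
                P[i, j] = Y_prime[i, j] + advance
                came_from_prev[i, j] = 1
            else:
                P[i, j] = Y_prime[i, j] + stay
    # backtrack from the last segment / last word to recover the aligned word of each segment
    align = np.zeros(T, dtype=int)
    j = L - 1
    for i in range(T - 1, -1, -1):
        align[i] = j
        j -= came_from_prev[i, j]
    return align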
According to the scheme of the embodiment of the invention, on the one hand, multi-language sign language recognition within a single framework is realized through multi-language collaborative training, the visual commonality among different sign languages is fully mined, and sign language recognition performance is improved. On the other hand, the shared visual feature encoder is improved by obtaining the alignment between the video and the sign language annotation sequence through a maximum-probability decoding algorithm.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, or by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash disk, or a removable hard disk) and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A continuous sign language recognition system based on multi-language collaboration, comprising: a shared visual feature encoder, a shared sequence model, and a number of target sequence models; wherein:
the shared visual feature encoder is used for extracting the visual features in the sign language videos of all languages and inputting the visual features to the shared sequence model and to each target sequence model respectively;
the shared sequence model is used for representing the visual patterns common to different sign languages, learning the commonality among different sign languages, and is initialized with the embedding vectors of the different languages;
each target sequence model is used for learning the mapping between the visual features of the corresponding language and the corresponding sign language words, in conjunction with the output of the shared sequence model;
in the training stage, joint optimization is performed over all target sequence models; each trained target sequence model can predict the probability distribution of the sign language words corresponding to sign language videos of its language.
2. The system of claim 1, wherein the shared visual feature encoder comprises, in order: a spatial convolutional network and a temporal convolutional network; wherein:
the spatial convolutional network comprises, in order: a first convolution layer, a first max-pooling layer, second and third convolution layers, two Inception layers, a second max-pooling layer, five Inception layers, a third max-pooling layer, two Inception layers, and a fourth max-pooling layer;
the temporal convolutional network comprises two convolution layers and max-pooling layers arranged alternately;
denoting the shared visual feature encoder as E_v, for a sign language video of any language X = (x_1, x_2, ..., x_N), the visual features output by the shared visual feature encoder are represented as:

F = E_v(X) = (f_1, f_2, ..., f_{N/4}),

where x_t denotes the t-th video frame, and a video segment is the set of video frames corresponding to the temporal receptive field of the shared visual feature encoder.
3. The continuous sign language recognition system based on multi-language collaboration as claimed in claim 1, wherein the shared sequence model is implemented by a bidirectional long short-term memory network; for the input visual features F, the output result O_s is expressed as:

O_s = BLSTM_s(F; h_0 = e_k, c_0 = e_k)

where h_0 and c_0 are respectively the initial hidden state and cell state of the bidirectional long short-term memory network, and e_k is the category embedding vector of the k-th sign language.
4. The continuous sign language recognition system based on multi-language collaboration as claimed in claim 1, wherein each target sequence model is implemented by a bidirectional long short-term memory network initialized with zero vectors;
for the target sequence model of the k-th sign language, the output result O_t^(k) is expressed as:

O_t^(k) = BLSTM_t^(k)([F, O_s]; h_0 = 0, c_0 = 0)

where F and O_s are respectively the output of the shared visual feature encoder and of the shared sequence model, and h_0 and c_0 are respectively the initial hidden state and cell state of the bidirectional long short-term memory network.
5. The continuous sign language recognition system based on multi-language collaboration as claimed in claim 1, wherein
the output O_t^(k) of the target sequence model is mapped to an un-normalized log-probability space with a fully connected layer, expressed as:

Y^(k) = W_fc^(k) · O_t^(k) + b_fc^(k)

where the superscript k identifies the sign language, W_fc^(k) and b_fc^(k) are respectively the weight and bias parameters of the fully connected layer, and Y_{t,s} is the probability that the t-th video segment belongs to the sign language word s;
in the training stage, the connectionist temporal classification (CTC) loss is adopted for optimization;
with joint optimization, the total loss function is the sum of the CTC loss functions of all target sequence models, expressed as:

L = Σ_{k=1}^{K} L_CTC^(k)

where K is the total number of target sequence models, and L_CTC^(k) is the CTC loss function of the target sequence model of the k-th sign language, computed using Y^(k).
6. The system of claim 1, further comprising: obtaining the alignment between the sign language video and the sign language annotation sequence by a maximum-probability decoding algorithm so as to fine-tune the shared visual feature encoder, which comprises:
after the probability distribution Y^(k) of the sign language words is obtained from the target sequence model, extracting, following the order of the sign language words in the sign language annotation sequence, the probability values of each video segment for the current sign language word, and assembling them into a new probability matrix Y^(k)'; finding the maximum-probability path on the new probability matrix Y^(k)' with a dynamic programming algorithm;
letting P_{i,j} be the maximum probability between the feature sequence f_1, f_2, ..., f_i and the annotation sequence s_1, s_2, ..., s_j, the transfer equation of the dynamic programming is expressed as:

P_{i,j} = Y^(k)'_{i,j} + max(P_{i-1,j}, P_{i-1,j-1})

where Y^(k)'_{i,j} is the element in the i-th row and j-th column of the new probability matrix Y^(k)', i.e. the probability that the i-th video segment belongs to the sign language word s_j;
through the above operation, the alignment between the sign language video and the sign language word labels, i.e. the video segment pseudo labels, is obtained, and the shared visual feature encoder is optimized accordingly.
CN202011060272.2A (priority and filing date 2020-09-30): Continuous sign language recognition system based on multi-language collaboration; status: Active; granted as CN112132094B

Priority Applications (1)

Application Number: CN202011060272.2A; Priority date: 2020-09-30; Filing date: 2020-09-30; Title: Continuous sign language recognition system based on multi-language collaboration

Publications (2)

Publication Number Publication Date
CN112132094A 2020-12-25
CN112132094B CN112132094B (en) 2022-07-15

Family

ID=73843529

Family Applications (1)

Application Number: CN202011060272.2A; Status: Active (granted as CN112132094B); Priority date: 2020-09-30; Filing date: 2020-09-30; Title: Continuous sign language recognition system based on multi-language collaboration

Country Status (1)

Country Link
CN (1) CN112132094B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180336884A1 (en) * 2017-05-19 2018-11-22 Baidu Usa Llc Cold fusing sequence-to-sequence models with language models
CN108647603A (en) * 2018-04-28 2018-10-12 清华大学 Semi-supervised continuous sign language interpretation method based on attention mechanism and device
CN110874537A (en) * 2018-08-31 2020-03-10 阿里巴巴集团控股有限公司 Generation method of multi-language translation model, translation method and translation equipment
CN110210416A (en) * 2019-06-05 2019-09-06 中国科学技术大学 Based on the decoded sign Language Recognition optimization method and device of dynamic pseudo label
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Siming He et al.: "Research of a Sign Language Translation System Based on Deep Learning", 2019 International Conference on Artificial Intelligence and Advanced Manufacturing (AIAM) *
Liang Zhijie et al.: "Research on dynamic gesture recognition fusing wide residual and long short-term memory networks", Application Research of Computers *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861827A (en) * 2021-04-08 2021-05-28 中国科学技术大学 Sign language translation method and system using single language material translation
CN112861827B (en) * 2021-04-08 2022-09-06 中国科学技术大学 Sign language translation method and system using single language material translation
CN113992894A (en) * 2021-10-27 2022-01-28 甘肃风尚电子科技信息有限公司 Abnormal event identification system based on monitoring video time sequence action positioning and abnormal detection

Also Published As

Publication number Publication date
CN112132094B (en) 2022-07-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant