CN111723779A - Chinese sign language recognition system based on deep learning - Google Patents

Chinese sign language recognition system based on deep learning

Info

Publication number
CN111723779A
Authority
CN
China
Prior art keywords
sign language
module
neural network
network
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010699780.9A
Other languages
Chinese (zh)
Other versions
CN111723779B (en)
Inventor
张浩东
李威杰
谢亮
熊蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010699780.9A priority Critical patent/CN111723779B/en
Publication of CN111723779A publication Critical patent/CN111723779A/en
Application granted granted Critical
Publication of CN111723779B publication Critical patent/CN111723779B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06F 18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06N 3/045: Combinations of networks
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a Chinese sign language recognition system based on deep learning. The system operates in two modes, sign language word recognition and continuous sign language recognition, used respectively to recognize the words and the sentences expressed by sign language actions. The whole system consists of a data acquisition module, a data processing module, a recognition module and an output display module, where the sign language word recognition module consists of a graph convolutional neural network and a three-dimensional convolutional neural network, and the continuous sign language recognition module consists of an encoder-decoder network. The system collects images and joint data of sign language actions through the data acquisition module, preprocesses them, feeds the data into the recognition module, and finally outputs the corresponding sign language words or sentences. The invention can convert sign language into text and facilitate communication between hearing-impaired and ordinary people. It is highly practical and stable and easy to popularize and apply.

Description

Chinese sign language recognition system based on deep learning
Technical Field
The invention relates to a sign language recognition system, in particular to a Chinese sign language recognition system based on a deep learning algorithm.
Background
Sign language is a language expressed through gestures, body movements, facial expressions, etc. without using the voice. It is the main means of communication among hearing-impaired people, and ordinary people find it difficult to communicate with them without special training. Sign language recognition aims to convert sign language into speech or text and thereby facilitate communication between hearing-impaired and ordinary people, and it answers a wide social demand. According to the World Health Organization, about 466 million people worldwide suffer from disabling hearing loss, more than 5% of the world population. Hearing loss causes considerable inconvenience: people with hearing impairment find it hard to communicate with ordinary people and face great social pressure. It is therefore necessary to design a general-purpose sign language recognition system to address these problems.
The sign language recognition task can be divided into sign language word recognition and continuous sign language recognition: the former recognizes a single word and the latter recognizes a complete sentence. Input data sources include video, skeleton information, and the like. Traditional methods collect data with a data glove, convert the gestures and limb movements of sign language into hand-crafted features, and then classify the sign language with these features using machine learning methods. Such methods are not robust or practical enough because of the limited representational power of hand-crafted features and the inconvenience of wearing data gloves.
Disclosure of Invention
To address the social demand for sign language recognition and the problems above, the invention provides a Chinese sign language recognition system based on deep learning that is practical, robust, easy to use, low-cost and easy to popularize. The system can recognize 500 classes of sign language words and 100 classes of sign language sentences and display the recognition results in real time, helping to overcome the communication barrier between hearing-impaired and ordinary people.
The invention adopts the specific technical scheme that:
a deep learning based chinese sign language recognition system, comprising:
the data acquisition module is used for acquiring RGB images and human joint data when a human body makes sign language actions;
the data processing module is used for adjusting the size of the RGB image acquired by the data acquisition module and normalizing the pixel value;
the sign language word recognition module consists of a graph convolution neural network module, a three-dimensional convolution neural network module and a fusion module and is used for recognizing sign language words expressed by sign language actions;
the graph convolution neural network module constructs a joint graph based on the human body joint data acquired by the data acquisition module, performs graph convolution on all adjacent nodes and further outputs probability distribution of sign language word categories, wherein the adjacent nodes simultaneously comprise adjacent nodes in space dimension and adjacent nodes in time dimension;
the three-dimensional convolutional neural network module is used for convolving the RGB images processed by the data processing module with a three-dimensional convolution kernel spanning the spatial and temporal dimensions, extracting the spatio-temporal features of the RGB images and then outputting a probability distribution over the sign language word categories;
the fusion module is used for fusing the output results of the graph convolution neural network module and the three-dimensional convolution neural network module through weighted average to obtain a final sign language word prediction result;
the continuous sign language recognition module consists of an encoder module and a decoder module and is used for recognizing complete sentences expressed by continuous sign language actions;
the encoder module comprises a convolutional neural network and a recurrent neural network and is used for extracting the spatial and temporal features of the continuous sign language respectively and generating the global semantic information of the continuous sign language action;
the decoder module adopts a recurrent neural network and predicts the output at the next time step from the global semantic information encoded by the encoder, the output at the previous time step and the hidden state of the recurrent neural network.
And the output display module is used for enabling a user to select the sign language word recognition mode or the continuous sign language recognition mode and displaying an output result of the sign language word recognition module or the continuous sign language recognition module according to the selection of the user.
Preferably, the data acquisition module uses a Kinect depth camera and can acquire RGB images and 25 human body joint data when the human body performs sign language actions at the same time.
Preferably, the data processing module resizes the RGB image acquired by the data acquisition module to 224 × 224 image, and normalizes the pixel values so that they satisfy a gaussian distribution with a mean and a standard deviation of 0.5.
Preferably, in the graph convolution neural network module, the specific process is as follows:
11) a joint graph is first constructed from the human body joint data acquired by the data acquisition module: the nodes of the joint graph correspond to the coordinate information of the human joint points, its edges correspond to the connections between joint points, and the same node is also connected across frames in the time dimension;
12) a graph convolution operation is performed on the joint graph. Given a convolution kernel of size K × K and an input feature map f_in with c channels, the output f_out(x) of a single channel at spatial position x is:
f_out(x) = Σ_{v ∈ B(v_ti)} f_in(v) · w(v)
where f_in(v) denotes the feature value of node v in the input features f_in, w(v) denotes the weight of node v, and B(v_ti) is the set of nodes to be traversed:
B(v_ti) = { v_qj | d(v_ti, v_qj) ≤ D, |q - t| ≤ T }
where d(v_ti, v_qj) denotes the distance between nodes v_ti and v_qj, v_ti is the node at position x at time t, v_qj is a node near position x within the convolution kernel range at time q, D is the distance threshold in the spatial dimension and T is the distance threshold in the temporal dimension;
the outputs of all channels are combined by weighted summation to obtain the features of a single joint graph;
13) the features extracted from the single joint graph are mapped to a probability distribution over the sign language word categories through a fully connected layer and normalized with a normalization function.
Preferably, the three-dimensional convolutional neural network module uses a multi-layer three-dimensional deep residual network, and the specific process in each residual layer is as follows:
21) the input I of the residual network has size C × T × H × W, where C is the number of channels, T is the length in the time dimension, and H and W are the height and width of the image; the network convolves the input with a three-dimensional convolution kernel W:
F(x, y, t) = Σ_c Σ_{Δx, Δy, Δt} W_c(Δx, Δy, Δt) · I_c(x + Δx, y + Δy, t + Δt) + b
where F(x, y, t) denotes the convolution result, x and y denote pixel coordinates in the image, t denotes the position in the image sequence, c indexes the image channels, Δx, Δy and Δt denote offsets in the image height, width and time dimensions respectively, and b denotes the bias;
an identity mapping is added on top of the network, turning it into one that learns a residual function H(I) = F(I) + I, where F(I) is the convolution result of the network input I and H(I) is the output of the current residual layer, which serves as the input of the next residual layer;
22) after the three-dimensional deep residual network has extracted the spatio-temporal features of the RGB images, the features are mapped to a probability distribution over the sign language word categories through a fully connected layer and normalized.
Preferably, in the fusion module, the specific fusion process is as follows:
acquiring probability distribution vectors of sign language word categories output by a graph convolution neural network module and probability distribution vectors of sign language word categories output by a three-dimensional convolution neural network, wherein numerical values in the probability distribution vectors are all between 0 and 1 and represent the probability of each category of sign language words; and carrying out weighted average calculation on the probability values in the two vectors in a one-to-one correspondence manner to obtain the finally output probability distribution vector.
Further, in the fusion module, when performing weighted average calculation, the weights of the output vectors of the graph convolution neural network module and the three-dimensional convolution neural network are preferably 0.4 and 0.6, respectively.
Preferably, in the encoder module, the convolutional neural network is a multi-layer deep residual network and the recurrent neural network is a long short-term memory network; the encoder receives an image sequence of variable length as input, first extracts the spatial features of each frame with the deep residual network, and then completes the semantic encoding of the continuous sign language video with the long short-term memory network; the long short-term memory network takes the spatial features extracted by the convolutional neural network as input and continuously updates its hidden state, which thus contains features of the time dimension; the hidden states of all time steps are averaged to obtain the encoded semantic vector c, which serves as the input of the decoder:
c = (1/T') Σ_{t=1..T'} h_t
where T' is the length of the input sequence and h_t is the hidden state of the recurrent neural network at time t.
Preferably, in the decoder module, the decoder is composed of a long short-term memory network, a word embedding layer and a fully connected layer, and the long short-term memory network is initialized with the hidden state of the encoder at the last time step; the word embedding layer extracts a semantic vector from the word output at the previous time step, which is then concatenated with the features encoded by the encoder and used as the input of the long short-term memory network; after the long short-term memory network has been updated, the fully connected layer takes the hidden state at the current time step as input, generates the probability distribution of the currently output word, and finally the word with the highest probability is selected as the output.
Preferably, the long short-term memory networks in both the encoder module and the decoder module are bidirectional long short-term memory networks.
Preferably, the output display module can select the sign language recognition mode and output the top five candidate results with the highest probability in real time.
Compared with the prior art, the invention has the following beneficial effects:
1) in the invention, the graph convolution neural network can effectively process the joint data. The joint data is actually a topological map, not in the form of grid data, which makes it difficult to perform feature extraction using convolutional neural networks. By means of graph convolution, the topological structure of joint data can be fully utilized, and the internal relation between joint points can be extracted.
2) In the invention, the three-dimensional convolutional neural network convolves over the spatial and temporal dimensions simultaneously and directly extracts the spatio-temporal features of the sign language action. The three-dimensional deep residual network with a depth of 18 balances recognition accuracy against running speed, maintaining a good recognition accuracy while improving speed.
3) In the invention, multi-modal data is used as input, and the recognition results of the joint data and the RGB image are fused, so that the robustness and the stability of the sign language recognition system are improved.
4) In the invention, the encoder-decoder structure makes fuller use of the global features of the sign language actions and learns a language model of continuous sign language, thereby effectively solving the sequence recognition problem of continuous sign language and learning the mapping between a sign language video and the sentence it expresses.
Drawings
FIG. 1 is a system framework diagram of the present invention.
FIG. 2 is a data processing module program framework diagram of the present invention.
FIG. 3 is a flow chart of the algorithm of the graph convolution neural network module of the present invention.
FIG. 4 is an algorithmic flow diagram of the three-dimensional convolutional neural network module of the present invention.
Fig. 5 is a network architecture diagram of an encoder-decoder module of the present invention.
FIG. 6 is a schematic diagram of a "home page" interface of the output display module according to the present invention.
FIG. 7 is a schematic diagram of a "sign language word recognition" interface of the output display module of the present invention.
FIG. 8 is a schematic diagram of a "continuous sign language recognition" interface of the output display module according to the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
In the present embodiment, a system framework diagram of a deep learning based Chinese sign language recognition system is shown in fig. 1. The whole system comprises the following components:
the data acquisition module is used for acquiring sign language action information;
the data processing module is used for data preprocessing;
the graph convolutional neural network module is used for carrying out sign language word recognition on joint data;
the three-dimensional convolution neural network module is used for carrying out sign language word recognition on the RGB image;
the fusion module is used for fusing joint data and an RGB image recognition result;
an encoder module for encoding the continuous sign language information;
a decoder module for decoding the continuous sign language encoded information;
the output display module is used for displaying an output result in real time;
the output end of the data acquisition module is connected with the data processing module, the output of the data processing module can be respectively connected with the graph convolution neural network module, the three-dimensional convolution neural network module and the encoder module, the output of the graph convolution neural network and the three-dimensional convolution neural network can be connected with the fusion module, the encoder module can be connected with the decoder module, and the fusion module and the decoder module can be connected with the output display module.
Among the above modules, the graph convolution neural network module, the three-dimensional convolution neural network module and the fusion module form a sign language word recognition module for realizing the sign language word recognition mode, and the encoder module and the decoder module form a continuous sign language recognition module for realizing the continuous sign language recognition mode.
When the Chinese sign language recognition system is used, firstly, the data acquisition module acquires image data and joint data of sign language actions, and then the data processing module carries out preprocessing. And then performing sign language word recognition or continuous sign language recognition according to the selection of the user. In the sign language word recognition module, the graph convolution neural network can predict the probability of sign language word classes based on joint data, and the three-dimensional convolution neural network can predict the probability of sign language word classes based on RGB images. Their output is a vector of length 500, each element representing the probability of a sign language word for the corresponding category. The fusion module performs weighted average on the two results to obtain a final prediction result. In the continuous sign language recognition module, the complete sentence expressed by the sign language action is recognized through the encoder-decoder network. The recognition results of the two modes are displayed in real time through the output display module, and the first five candidate results with the highest probability are displayed.
The specific implementation of each module is described in detail below with reference to the accompanying drawings.
1. The data acquisition module is used for acquiring RGB images and human joint data while a person performs sign language actions. In this embodiment, the data acquisition module uses a Kinect depth camera, which simultaneously captures RGB images and automatically provides the position coordinates of 25 human body joints.
2. And the data processing module is used for adjusting the size of the RGB image acquired by the data acquisition module and normalizing the pixel value. In the present embodiment, a program flow diagram of the data processing module is shown in fig. 2. The data processing module pre-processes the RGB image, including resizing the RGB image to 224 x 224 and normalizing the pixel values for each channel to satisfy a gaussian distribution with both a mean and a standard deviation of 0.5.
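As a minimal illustration of this preprocessing step, the resizing and normalization described above could be written as follows in Python with torchvision (the library choice and the helper name preprocess_clip are assumptions of this sketch, not part of the patent):

```python
import torch
from torchvision import transforms

# Resize each RGB frame to 224 x 224 and normalize every channel
# with mean 0.5 and standard deviation 0.5, as described above.
preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),                      # scales pixel values to [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5],
                         std=[0.5, 0.5, 0.5]),
])

def preprocess_clip(frames):
    """frames: list of H x W x 3 uint8 arrays -> tensor of shape (T, 3, 224, 224)."""
    return torch.stack([preprocess(f) for f in frames])
```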
3. The sign language word recognition module is used for recognizing the sign language words expressed by sign language actions;
3.1, the graph convolution neural network module is used for constructing an articular graph based on the human body articular data acquired by the data acquisition module, carrying out graph convolution on all adjacent nodes and further outputting probability distribution of sign language word categories, wherein the adjacent nodes simultaneously comprise adjacent nodes in space dimension and adjacent nodes in time dimension.
In this embodiment, the algorithm flowchart of the graph convolutional neural network module is shown in fig. 3, and the specific process in the module is as follows:
11) A joint graph is first constructed from the human body joint data acquired by the data acquisition module: the nodes of the joint graph correspond to the coordinate information of the human joint points, its edges correspond to the connections between joint points, and the same node is also connected across frames in the time dimension;
12) A graph convolution operation is performed on the joint graph. Given a convolution kernel of size K × K and an input feature map f_in with c channels, and considering only the spatial distance, the output f_out(x) of a single channel at spatial position x is:
f_out(x) = Σ_{v ∈ B(v_i)} f_in(p(x, v)) · w(v)
where w is a weight function whose weight vector is inner-multiplied with the input features, similar to the weights of a convolutional neural network, and p is a sampling function that traverses the nodes near position x. Suppose v_i is the node at position x and d(·,·) is the node distance function; the distance of any traversed node must not exceed D, so the node set B(v_i) traversed by the sampling function is:
B(v_i) = { v_j | d(v_i, v_j) ≤ D }
With the formula above, the features of a single joint graph can be obtained when only the spatial distance is considered. To extract the temporal dynamics of the joint sequence, however, the same node in consecutive frames must be connected and the definition of the sampling function p must be extended. The sampling function p originally traverses only the nodes adjacent to position x; to cover the time dimension, the adjacent nodes must include not only the spatial neighbours but also the temporal neighbours. Suppose v_ti is the node at position x at time t and the time interval of any traversed node must not exceed T; the node set B(v_ti) traversed by the extended sampling function then includes both spatially and temporally nearby nodes.
Thus, in the present invention, the output f_out(x) of a single channel at spatial position x, taking both the spatial and temporal dimensions into account, can be expressed as follows:
f_out(x) = Σ_{v ∈ B(v_ti)} f_in(v) · w(v)
where f_in(v) denotes the feature value of node v in the input features f_in, w(v) denotes the weight of node v, and B(v_ti) is the set of nodes to be traversed:
B(v_ti) = { v_qj | d(v_ti, v_qj) ≤ D, |q - t| ≤ T }
where d(v_ti, v_qj) denotes the distance between nodes v_ti and v_qj, v_ti is the node at position x at time t, v_qj is a node near position x within the convolution kernel range at time q, D is the distance threshold in the spatial dimension and T is the distance threshold in the temporal dimension. In this formula, { v_qj | d(v_ti, v_qj) ≤ D, |q - t| ≤ T } denotes all nodes v_qj whose spatial distance from node v_ti does not exceed D and whose time interval from it does not exceed T.
After the output of each channel has been computed, the outputs of all channels are combined by weighted summation to obtain the features of a single joint graph.
13) The features extracted from the single joint graph are mapped to a probability distribution over the sign language word categories through a fully connected layer and normalized with a normalization function.
The graph convolutional neural network can process joint data effectively, and the entire model can be trained end-to-end by backpropagation.
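For illustration only, a single spatio-temporal graph convolution layer of the kind described above could be sketched as follows in Python/PyTorch; the tensor layout (batch, channels, frames, joints), the normalized adjacency matrix A and the layer sizes are assumptions of this sketch rather than details fixed by the patent:

```python
import torch
import torch.nn as nn

class STGraphConv(nn.Module):
    """One spatio-temporal graph convolution block (illustrative sketch).

    x: joint features of shape (N, C, T, V) - batch, channels, frames, joints
    A: normalized adjacency matrix of shape (V, V) encoding the joint graph.
    """
    def __init__(self, in_channels, out_channels, temporal_kernel=9):
        super().__init__()
        # 1x1 convolution plays the role of the learnable node weights w(v)
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # temporal convolution links the same joint across neighbouring frames
        pad = (temporal_kernel - 1) // 2
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(temporal_kernel, 1),
                                  padding=(pad, 0))
        self.relu = nn.ReLU()

    def forward(self, x, A):
        x = self.spatial(x)                       # mix channels per joint
        x = torch.einsum('nctv,vw->nctw', x, A)   # aggregate spatial neighbours B(v)
        x = self.temporal(x)                      # aggregate temporal neighbours
        return self.relu(x)
```

Stacking several such blocks, pooling over time and joints, and ending with a fully connected layer and a softmax would give the probability distribution over the 500 sign language word classes described above.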
3.2. The three-dimensional convolutional neural network module is used for convolving the RGB images processed by the data processing module with a three-dimensional convolution kernel spanning the spatial and temporal dimensions, extracting the spatio-temporal features of the RGB images and then outputting a probability distribution over the sign language word categories.
In the present embodiment, the algorithm flowchart of the three-dimensional convolutional neural network module is shown in fig. 4. The module uses a multi-layer three-dimensional deep residual network; in this embodiment its depth is 18. The specific process in each residual layer is as follows:
21) The input I of the residual network has size C × T × H × W, where C is the number of channels, T is the length in the time dimension, and H and W are the height and width of the image; the network convolves the input with a three-dimensional convolution kernel W:
F(x, y, t) = Σ_c Σ_{Δx, Δy, Δt} W_c(Δx, Δy, Δt) · I_c(x + Δx, y + Δy, t + Δt) + b
where F(x, y, t) denotes the convolution result, x and y denote pixel coordinates in the image, t denotes the position in the image sequence, c indexes the image channels (c ∈ C), Δx, Δy and Δt denote offsets in the image height, width and time dimensions respectively, and b denotes the bias.
As the depth of a convolutional neural network increases, the network becomes harder to train. To solve this problem, an identity mapping is added on top of the original convolutional neural network so that the network learns a residual function F(x) = H(x) - x. The output of the current residual layer is therefore H(I) = F(I) + I, where F(I) is the convolution result of the layer input I, and H(I) also serves as the input of the next residual layer. The three-dimensional convolutional neural network retains information in the time dimension and propagates it through the network.
The deep residual network with identity mappings is easier to optimize, and pre-training on the ImageNet dataset makes the network converge faster and realizes transfer learning.
22) After the three-dimensional deep residual network has extracted the spatio-temporal features of the RGB images, the features are mapped to a probability distribution over the sign language word categories through a fully connected layer and normalized.
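As a sketch of this branch, the 18-layer 3D residual network available in torchvision can serve as a stand-in for the network described above (the specific library model r3d_18 and the classifier head are assumptions; the patent does not name an implementation):

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class SignWord3DCNN(nn.Module):
    """3D deep residual network (depth 18) mapping an RGB clip to a
    probability distribution over 500 sign language word classes."""
    def __init__(self, num_classes=500):
        super().__init__()
        backbone = r3d_18(weights=None)          # 18-layer 3D ResNet
        backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
        self.backbone = backbone

    def forward(self, clip):
        # clip: (N, 3, T, H, W) - batch, channels, frames, height, width
        logits = self.backbone(clip)
        return torch.softmax(logits, dim=1)      # normalized class probabilities
```

Pre-trained weights (for example from a large image or video dataset) could be loaded into the backbone to realize the transfer learning mentioned above.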
3.3. The fusion module is used for fusing the output results of the graph convolutional neural network module and the three-dimensional convolutional neural network module through a weighted average to obtain the final sign language word prediction result.
In the fusion module, the specific fusion process is as follows:
The probability distribution vector of sign language word classes output by the graph convolutional neural network module and the one output by the three-dimensional convolutional neural network are first obtained. In this embodiment both output vectors have length 500, and every value lies between 0 and 1 and represents the probability of one class of sign language word. The fusion module then performs an element-wise weighted average of the probability values in the two vectors to obtain the finally output probability distribution vector. The weights can be adjusted as required; the weights of the output vectors of the graph convolutional neural network module and the three-dimensional convolutional neural network are preferably 0.4 and 0.6, respectively.
4. And the continuous sign language recognition module is used for recognizing the complete sentences expressed by the continuous sign language actions. In the present embodiment, a network configuration of the encoder-decoder module is shown in fig. 5.
4.1. The encoder module comprises a convolutional neural network and a recurrent neural network and is used for extracting the spatial and temporal features of the continuous sign language respectively and generating the global semantic information of the continuous sign language action.
In this embodiment, the convolutional neural network in the encoder module is a deep residual network with a depth of 18, and the recurrent neural network is a long short-term memory network. The deep residual network learns the spatial features of the video, and the long short-term memory network models the temporal information.
The long short-term memory network overcomes the vanishing gradient problem by adding an internal cell state to the original recurrent neural network structure and using an input gate, a forget gate and an update gate. A long short-term memory cell takes the cell state c_{t-1} and hidden state h_{t-1} of the previous time step and the input x_t at the current time step, and updates as follows:
i_t = σ(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi)
f_t = σ(W_if x_t + b_if + W_hf h_{t-1} + b_hf)
g_t = tanh(W_ig x_t + b_ig + W_hg h_{t-1} + b_hg)
o_t = σ(W_io x_t + b_io + W_ho h_{t-1} + b_ho)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
where h_t is the hidden state at the current time, c_t is the cell state at the current time, i_t, f_t, g_t and o_t are the input gate, forget gate, cell candidate and output gate respectively, the W terms are weight matrices, the b terms are biases, σ and tanh are activation functions, and ⊙ denotes element-wise multiplication.
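As a sanity check, the six update equations above can be written directly in Python; this bare-bones cell is for illustration only, and in practice a library implementation such as torch.nn.LSTM would be used:

```python
import torch

def lstm_cell_step(x_t, h_prev, c_prev, p):
    """One LSTM update step following the equations above.

    p is a dict holding the weight matrices (W_ii, W_hi, ...) and biases (b_ii, ...).
    """
    sig, tanh = torch.sigmoid, torch.tanh
    i_t = sig(p["W_ii"] @ x_t + p["b_ii"] + p["W_hi"] @ h_prev + p["b_hi"])
    f_t = sig(p["W_if"] @ x_t + p["b_if"] + p["W_hf"] @ h_prev + p["b_hf"])
    g_t = tanh(p["W_ig"] @ x_t + p["b_ig"] + p["W_hg"] @ h_prev + p["b_hg"])
    o_t = sig(p["W_io"] @ x_t + p["b_io"] + p["W_ho"] @ h_prev + p["b_ho"])
    c_t = f_t * c_prev + i_t * g_t      # c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
    h_t = o_t * tanh(c_t)               # h_t = o_t ⊙ tanh(c_t)
    return h_t, c_t
```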
The encoder receives an image sequence of variable length as input, first extracts the spatial features of each frame with the deep residual network, and then completes the semantic encoding of the continuous sign language video with the long short-term memory network. The long short-term memory network takes the spatial features extracted by the convolutional neural network as input and continuously updates its hidden state, which thus contains features of the time dimension. The hidden states of all time steps are averaged to obtain the encoded semantic vector c, which serves as the input of the decoder:
c = (1/T') Σ_{t=1..T'} h_t
where T' is the length of the input sequence and h_t is the hidden state of the recurrent neural network at time t.
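A condensed encoder sketch along these lines in Python/PyTorch follows; the per-frame 2D ResNet-18, the hidden size and the bidirectional flag are illustrative assumptions rather than values taken from the patent:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SignEncoder(nn.Module):
    """CNN + LSTM encoder: per-frame spatial features, then temporal encoding.
    Returns the mean of the hidden states as the semantic vector c."""
    def __init__(self, hidden=512):
        super().__init__()
        cnn = resnet18(weights=None)
        cnn.fc = nn.Identity()                   # keep the 512-d spatial feature
        self.cnn = cnn
        self.lstm = nn.LSTM(512, hidden, batch_first=True, bidirectional=True)

    def forward(self, frames):
        # frames: (N, T, 3, 224, 224)
        n, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(n, t, -1)   # (N, T, 512)
        hidden_states, state = self.lstm(feats)                  # (N, T, 2*hidden)
        c = hidden_states.mean(dim=1)            # semantic vector: average over time
        return c, state                          # state can initialize the decoder
```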
4.2. The decoder module adopts a recurrent neural network and predicts the output at the next time step from the global semantic information encoded by the encoder, the output at the previous time step and the hidden state of the recurrent neural network.
In the decoder module, the decoder consists of a long short-term memory network, a word embedding layer and a fully connected layer. The decoder takes the feature vector encoded by the encoder, the output at the previous time step and the previous hidden state as input, updates its hidden state, and predicts the currently output word from the updated hidden state and the previous output. The decoder's long short-term memory network is initialized with the hidden state of the encoder at its last time step. The word embedding layer extracts a semantic vector w from the word y_{t-1} output at the previous time step, which is concatenated with the semantic vector c to form the input x_t of the long short-term memory network. After the long short-term memory network has been updated, the fully connected layer takes the hidden state h_t at the current time step as input, generates the probability distribution of the currently output word, and finally the word with the highest probability is selected as the output, as follows:
w = WordEmbedding(y_{t-1})
x_t = [w, c]
h_t = φ(W_h x_t + U_h h_{t-1} + b_h)
y_t = φ(U_y h_t + b_y)
in both the encoder module and the decoder module, the long-short term memory network is a bidirectional long-short term memory network. The one-way long-short term memory network can only model the information of the current time step and the previous time step, and the input after the current time step does not contribute to generating the final output. In the continuous sign language recognition problem, since the sign language video represents a semantic sentence with a grammatical structure, contextual information should be fully utilized. The use of a two-way long-short term memory network enables better utilization of the context information.
5. And the output display module is used for enabling a user to select the sign language word recognition mode or the continuous sign language recognition mode and displaying an output result of the sign language word recognition module or the continuous sign language recognition module according to the selection of the user.
In this embodiment, a schematic diagram of the "home" interface of the output display module is shown in fig. 6; the physical form of the interface is a screen, such as a mobile phone or PC screen. The user selects either the sign language word recognition mode or the continuous sign language recognition mode through this screen. The sign language word recognition interface is shown in fig. 7 and the continuous sign language recognition interface in fig. 8; both display the top five candidate results with the highest probability.
Therefore, the method can convert the sign language into the text and promote the communication between the person with the hearing impairment and the ordinary person.
The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims (10)

1. A Chinese sign language recognition system based on deep learning is characterized by comprising:
the data acquisition module is used for acquiring RGB images and human joint data when a human body makes sign language actions;
the data processing module is used for adjusting the size of the RGB image acquired by the data acquisition module and normalizing the pixel value;
the sign language word recognition module consists of a graph convolution neural network module, a three-dimensional convolution neural network module and a fusion module and is used for recognizing sign language words expressed by sign language actions;
the graph convolution neural network module constructs a joint graph based on the human body joint data acquired by the data acquisition module, performs graph convolution on all adjacent nodes and further outputs probability distribution of sign language word categories, wherein the adjacent nodes simultaneously comprise adjacent nodes in space dimension and adjacent nodes in time dimension;
the three-dimensional convolutional neural network module is used for convolving the RGB images processed by the data processing module with a three-dimensional convolution kernel spanning the spatial and temporal dimensions, extracting the spatio-temporal features of the RGB images and then outputting a probability distribution over the sign language word categories;
the fusion module is used for fusing the output results of the graph convolution neural network module and the three-dimensional convolution neural network module through weighted average to obtain a final sign language word prediction result;
the continuous sign language recognition module consists of an encoder module and a decoder module and is used for recognizing complete sentences expressed by continuous sign language actions;
the encoder module comprises a convolutional neural network and a recurrent neural network and is used for extracting the spatial and temporal features of the continuous sign language respectively and generating the global semantic information of the continuous sign language action;
the decoder module adopts a recurrent neural network and predicts the output at the next time step from the global semantic information encoded by the encoder, the output at the previous time step and the hidden state of the recurrent neural network.
And the output display module is used for enabling a user to select the sign language word recognition mode or the continuous sign language recognition mode and displaying an output result of the sign language word recognition module or the continuous sign language recognition module according to the selection of the user.
2. The system as claimed in claim 1, wherein the data collection module uses a Kinect depth camera to collect RGB images and 25 human joint data when human body is doing sign language.
3. The system of claim 1, wherein the data processing module resizes the RGB image collected by the data collection module to 224 × 224 image, and normalizes the pixel values to satisfy gaussian distribution with mean and standard deviation of 0.5.
4. The system for Chinese sign language recognition based on deep learning of claim 1, wherein the specific process in the graph convolutional neural network module is as follows:
11) a joint graph is first constructed from the human body joint data acquired by the data acquisition module: the nodes of the joint graph correspond to the coordinate information of the human joint points, its edges correspond to the connections between joint points, and the same node is also connected across frames in the time dimension;
12) a graph convolution operation is performed on the joint graph. Given a convolution kernel of size K × K and an input feature map f_in with c channels, the output f_out(x) of a single channel at spatial position x is:
f_out(x) = Σ_{v ∈ B(v_ti)} f_in(v) · w(v)
where f_in(v) denotes the feature value of node v in the input features f_in, w(v) denotes the weight of node v, and B(v_ti) is the set of nodes to be traversed:
B(v_ti) = { v_qj | d(v_ti, v_qj) ≤ D, |q - t| ≤ T }
where d(v_ti, v_qj) denotes the distance between nodes v_ti and v_qj, v_ti is the node at position x at time t, v_qj is a node near position x within the convolution kernel range at time q, D is the distance threshold in the spatial dimension and T is the distance threshold in the temporal dimension;
the outputs of all channels are combined by weighted summation to obtain the features of a single joint graph;
13) the features extracted from the single joint graph are mapped to a probability distribution over the sign language word categories through a fully connected layer and normalized with a normalization function.
5. The system of claim 1, wherein the three-dimensional convolutional neural network module uses a plurality of layers of three-dimensional deep residual error networks, and the specific process in each layer of residual error network is as follows:
21) the input I of the residual network has size C × T × H × W, where C is the number of channels, T is the length in the time dimension, and H and W are the height and width of the image; the network convolves the input with a three-dimensional convolution kernel W:
F(x, y, t) = Σ_c Σ_{Δx, Δy, Δt} W_c(Δx, Δy, Δt) · I_c(x + Δx, y + Δy, t + Δt) + b
where F(x, y, t) denotes the convolution result, x and y denote pixel coordinates in the image, t denotes the position in the image sequence, c indexes the image channels, Δx, Δy and Δt denote offsets in the image height, width and time dimensions respectively, and b denotes the bias;
an identity mapping is added on top of the network, turning it into one that learns a residual function H(I) = F(I) + I, where F(I) is the convolution result of the network input I and H(I) is the output of the current residual layer, which serves as the input of the next residual layer;
22) after the three-dimensional deep residual network has extracted the spatio-temporal features of the RGB images, the features are mapped to a probability distribution over the sign language word categories through a fully connected layer and normalized.
6. The system for recognizing Chinese sign language based on deep learning of claim 1, wherein the fusion module specifically fuses as follows:
acquiring probability distribution vectors of sign language word categories output by a graph convolution neural network module and probability distribution vectors of sign language word categories output by a three-dimensional convolution neural network, wherein numerical values in the probability distribution vectors are all between 0 and 1 and represent the probability of each category of sign language words; and carrying out weighted average calculation on the probability values in the two vectors in a one-to-one correspondence manner to obtain the finally output probability distribution vector.
7. The system according to claim 1, wherein in the encoder module, the convolutional neural network is a multi-layer deep residual network and the recurrent neural network is a long short-term memory network; the encoder receives an image sequence of variable length as input, first extracts the spatial features of each frame with the deep residual network, and then completes the semantic encoding of the continuous sign language video with the long short-term memory network; the long short-term memory network takes the spatial features extracted by the convolutional neural network as input and continuously updates its hidden state, which thus contains features of the time dimension; the hidden states of all time steps are averaged to obtain the encoded semantic vector c, which serves as the input of the decoder:
c = (1/T') Σ_{t=1..T'} h_t
where T' is the length of the input sequence and h_t is the hidden state of the recurrent neural network at time t.
8. The system as claimed in claim 1, wherein the decoder module consists of a long short-term memory network, a word embedding layer and a fully connected layer, the long short-term memory network being initialized with the hidden state of the encoder at the last time step; the word embedding layer extracts a semantic vector from the word output at the previous time step, which is then concatenated with the features encoded by the encoder and used as the input of the long short-term memory network; after the long short-term memory network has been updated, the fully connected layer takes the hidden state at the current time step as input, generates the probability distribution of the currently output word, and finally the word with the highest probability is selected as the output.
9. The system as claimed in claim 1, wherein the long-short term memory network in the encoder module and the decoder module is a bidirectional long-short term memory network.
10. The system as claimed in claim 1, wherein the output display module is capable of selecting sign language recognition mode and outputting the top five candidate results with highest probability in real time.
CN202010699780.9A 2020-07-20 2020-07-20 Chinese sign language recognition system based on deep learning Active CN111723779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010699780.9A CN111723779B (en) 2020-07-20 2020-07-20 Chinese sign language recognition system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010699780.9A CN111723779B (en) 2020-07-20 2020-07-20 Chinese sign language recognition system based on deep learning

Publications (2)

Publication Number Publication Date
CN111723779A true CN111723779A (en) 2020-09-29
CN111723779B CN111723779B (en) 2023-05-02

Family

ID=72572899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010699780.9A Active CN111723779B (en) 2020-07-20 2020-07-20 Chinese sign language recognition system based on deep learning

Country Status (1)

Country Link
CN (1) CN111723779B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190138607A1 (en) * 2017-11-03 2019-05-09 Board Of Trustees Of Michigan State University System and apparatus for non-intrusive word and sentence level sign language translation
US10304208B1 (en) * 2018-02-12 2019-05-28 Avodah Labs, Inc. Automated gesture identification using neural networks
CN110084296A (en) * 2019-04-22 2019-08-02 中山大学 A kind of figure expression learning framework and its multi-tag classification method based on certain semantic
CN110399850A (en) * 2019-07-30 2019-11-01 西安工业大学 A kind of continuous sign language recognition method based on deep neural network
CN110796110A (en) * 2019-11-05 2020-02-14 西安电子科技大学 Human behavior identification method and system based on graph convolution network
CN111259804A (en) * 2020-01-16 2020-06-09 合肥工业大学 Multi-mode fusion sign language recognition system and method based on graph convolution
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Y. Liao et al.: "Dynamic Sign Language Recognition Based on Video Sequence With BLSTM-3D Residual Networks" *
李为斌; 刘佳: "Overview of vision-based dynamic gesture recognition" (基于视觉的动态手势识别概述) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149618A (en) * 2020-10-14 2020-12-29 紫清智行科技(北京)有限公司 Pedestrian abnormal behavior detection method and device suitable for inspection vehicle
CN113781876A (en) * 2021-08-05 2021-12-10 深兰科技(上海)有限公司 Method and device for converting text into sign language action video
CN113781876B (en) * 2021-08-05 2023-08-29 深兰科技(上海)有限公司 Conversion method and device for converting text into sign language action video
CN113792607A (en) * 2021-08-19 2021-12-14 辽宁科技大学 Neural network sign language classification and identification method based on Transformer
CN113792607B (en) * 2021-08-19 2024-01-05 辽宁科技大学 Neural network sign language classification and identification method based on Transformer

Also Published As

Publication number Publication date
CN111723779B (en) 2023-05-02

Similar Documents

Publication Publication Date Title
CN111723779B (en) Chinese sign language recognition system based on deep learning
CN111259804B (en) Multi-modal fusion sign language recognition system and method based on graph convolution
CN111339837B (en) Continuous sign language recognition method
CN109886072B (en) Face attribute classification system based on bidirectional Ladder structure
CN111984772B (en) Medical image question-answering method and system based on deep learning
CN109711356B (en) Expression recognition method and system
CN112329525A (en) Gesture recognition method and device based on space-time diagram convolutional neural network
Sharma et al. Vision-based sign language recognition system: A Comprehensive Review
CN111354246A (en) System and method for helping deaf-mute to communicate
CN114842547A (en) Sign language teaching method, device and system based on gesture action generation and recognition
CN113297955A (en) Sign language word recognition method based on multi-mode hierarchical information fusion
CN112906520A (en) Gesture coding-based action recognition method and device
CN112905762A (en) Visual question-answering method based on equal attention-deficit-diagram network
Krishnaraj et al. A Glove based approach to recognize Indian Sign Languages
CN113780059A (en) Continuous sign language identification method based on multiple feature points
Dixit et al. Audio to indian and american sign language converter using machine translation and nlp technique
CN113240714A (en) Human motion intention prediction method based on context-aware network
CN116453025A (en) Volleyball match group behavior identification method integrating space-time information in frame-missing environment
CN116229507A (en) Human body posture detection method and system
CN113609923B (en) Attention-based continuous sign language sentence recognition method
CN111178141B (en) LSTM human body behavior identification method based on attention mechanism
CN114821781A (en) Multi-source fusion lip language identification method and system based on infrared low-light-level telescope
CN113936333A (en) Action recognition algorithm based on human body skeleton sequence
CN112633224A (en) Social relationship identification method and device, electronic equipment and storage medium
CN111079661A (en) Sign language recognition system

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant