CN111723779B - Chinese sign language recognition system based on deep learning - Google Patents


Info

Publication number
CN111723779B
CN111723779B (application CN202010699780.9A)
Authority
CN
China
Prior art keywords
sign language
module
network
neural network
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010699780.9A
Other languages
Chinese (zh)
Other versions
CN111723779A (en)
Inventor
张浩东
李威杰
谢亮
熊蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010699780.9A priority Critical patent/CN111723779B/en
Publication of CN111723779A publication Critical patent/CN111723779A/en
Application granted granted Critical
Publication of CN111723779B publication Critical patent/CN111723779B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a Chinese sign language recognition system based on deep learning. The system provides two modes, sign language word recognition and continuous sign language recognition, which recognize the words and the sentences expressed by sign language actions respectively. The whole system consists of a data acquisition module, a data processing module, a recognition module and an output display module; the sign language word recognition module consists of a graph convolution neural network and a three-dimensional convolution neural network, and the continuous sign language recognition module consists of an encoder-decoder network. The system collects images and joint data of sign language actions through the data acquisition module, preprocesses them, inputs the data into the recognition module, and finally outputs the corresponding sign language words or sentences. The invention can convert sign language into text to facilitate communication between people with hearing impairment and hearing people. The invention is highly practical and stable, and is convenient to popularize and apply.

Description

Chinese sign language recognition system based on deep learning
Technical Field
The invention relates to a sign language recognition system, in particular to a Chinese sign language recognition system based on a deep learning algorithm.
Background
Sign language is a language expressed through gestures, limb movements, facial expressions and the like, without using voice. It is the main mode of communication among people with hearing impairment, and ordinary people can hardly communicate with them without special study. Sign language recognition aims to convert sign language into voice or text, facilitating communication between people with hearing impairment and hearing people. The sign language recognition task has wide social demand: according to World Health Organization statistics, about 466 million people worldwide suffer from disabling hearing loss, exceeding 5% of the world population. Hearing loss brings inconvenience to this population; people with hearing impairment find it difficult to communicate with hearing people and face tremendous social pressure. It is therefore necessary to design a general sign language recognition system to address these problems.
Sign language recognition tasks can be divided into sign language word recognition and continuous sign language recognition: sign language word recognition recognizes a single word, while continuous sign language recognition recognizes a complete sentence. Input data sources include video, skeleton information, and so on. Traditional methods use data gloves to collect data, convert the gestures and limb movements of sign language into manually designed features, and classify the sign language from these features with machine learning methods to complete the recognition task. Because the representation capability of manual features is limited and data gloves are not easy to use, these methods are not robust or practical enough.
Disclosure of Invention
Aiming at the social demand for sign language recognition and the problems of existing approaches, the invention provides a Chinese sign language recognition system based on deep learning that is practical, robust, easy to use, low-cost, and convenient to popularize. The invention can recognize 500 kinds of Chinese sign language words and 100 kinds of Chinese sign language sentences, display the recognition result in real time, and overcome the communication barrier between hearing-impaired people and hearing people.
The invention adopts the specific technical scheme that:
a chinese sign language recognition system based on deep learning, comprising:
the data acquisition module is used for acquiring RGB images and human joint data when the human body makes sign language actions;
the data processing module is used for adjusting the size of the RGB image acquired by the data acquisition module and normalizing the pixel value;
the sign language word recognition module consists of a graph convolution neural network module, a three-dimensional convolution neural network module and a fusion module and is used for recognizing sign language words expressed by sign language actions;
the graph convolution neural network module constructs a joint graph based on the human joint data acquired by the data acquisition module, and performs graph convolution on all adjacent nodes so as to output probability distribution of sign language word categories, wherein the adjacent nodes simultaneously comprise adjacent nodes in space dimension and adjacent nodes in time dimension;
the three-dimensional convolution neural network module convolves the RGB image processed by the data processing module with a three-dimensional convolution kernel covering both the spatial and temporal dimensions, extracts the spatio-temporal features of the RGB image, and then outputs a probability distribution over sign language word classes;
the fusion module is used for fusing the output results of the graph convolution neural network module and the three-dimensional convolution neural network module through weighted average to obtain a final sign language word prediction result;
the continuous sign language identification module consists of an encoder module and a decoder module and is used for identifying complete sentences expressed by continuous sign language actions;
the encoder module comprises a convolutional neural network and a cyclic neural network, and is used for respectively extracting the spatial features and the temporal features of continuous sign language and generating global semantic information of continuous sign language actions;
the decoder module adopts a cyclic neural network, and predicts the output of the next moment by using the global semantic information encoded by the encoder, the output of the last moment and the hidden layer state of the cyclic neural network.
And the output display module is used for enabling the user to select the sign language word recognition mode or the continuous sign language recognition mode and displaying the output result of the sign language word recognition module or the continuous sign language recognition module according to the selection of the user.
Preferably, the data acquisition module uses a Kinect depth camera to acquire RGB images and 25 human joint data simultaneously when a human body makes sign language actions.
Preferably, the data processing module adjusts the RGB image size acquired by the data acquisition module to 224×224 images, and normalizes the pixel values to satisfy gaussian distribution with both mean and standard deviation of 0.5.
Preferably, the specific process in the graph convolution neural network module is as follows:
11) First, a joint graph is constructed based on the human joint data acquired by the data acquisition module; the nodes of the joint graph correspond to the coordinate information of the human joint points, the edges of the joint graph correspond to the connections between the joint points, and the same node is connected across adjacent frames in the time dimension;
12) A graph convolution operation is then performed on the joint graph; given a K×K convolution kernel and an input feature f_in with c channels, the output f_out(x) of a single channel at spatial position x is:
f_out(x) = Σ_{v ∈ B(v_ti)} f_in(v) · w(v)
wherein: f_in(v) denotes the value of input feature f_in at node v, and w(v) denotes the weight of node v; B(v_ti) is the set of nodes to be traversed, where:
B(v_ti) = {v_qj | d(v_ti, v_qj) ≤ D, |q-t| ≤ T}
wherein: d(v_ti, v_qj) denotes the distance between node v_ti and node v_qj; v_ti is the node at position x at time t, and v_qj is a node near position x within the convolution kernel range at time q; D is the distance threshold in the spatial dimension and T is the distance threshold in the temporal dimension;
the outputs of all channels are weighted and summed to obtain the features of a single joint graph;
13) The features extracted from the single joint graph are mapped into a probability distribution over sign language word classes through a fully connected layer and normalized by a normalization function.
Preferably, the three-dimensional convolutional neural network module uses a multi-layer three-dimensional depth residual network, wherein the specific process in each layer of residual network is as follows:
21) The input I of the residual network has size C×T×H×W, where C is the number of channels, T is the length in the time dimension, and H and W are the height and width of the image, respectively; the network convolves the image with a three-dimensional convolution kernel w:
f(x, y, t) = Σ_c Σ_{δx, δy, δt} w_c(δx, δy, δt) · I_c(x + δx, y + δy, t + δt) + b
wherein: f(x, y, t) denotes the convolution result, x and y denote pixel coordinates in the image, t denotes the position in the image sequence, c denotes the image channel, δx, δy and δt denote offsets in the image height, width and time dimensions, respectively, and b denotes the bias;
an identity mapping is added on top of the network, converting it into the residual form H(I) = F(I) + I, where F(I) is the convolution result of the network input I and H(I) is the output of the current residual layer; H(I) simultaneously serves as the input of the next residual layer;
22) After the three-dimensional depth residual network extracts the spatio-temporal features of the RGB image, the features are mapped into a probability distribution over sign language word classes through a fully connected layer and normalized.
Preferably, in the fusion module, the specific fusion process is as follows:
acquiring probability distribution vectors of sign language word categories output by a graph convolution neural network module and probability distribution vectors of sign language word categories output by a three-dimensional convolution neural network, wherein the numerical values in the probability distribution vectors are all between 0 and 1, and represent the probability of each type of sign language word; and carrying out weighted average calculation on probability values in the two vectors in one-to-one correspondence to obtain a probability distribution vector which is finally output.
In the fusion module, when the weighted average calculation is performed, the weights of the output vectors of the graph convolution neural network module and the three-dimensional convolution neural network are preferably 0.4 and 0.6 respectively.
Preferably, in the encoder module, the convolutional neural network is a multi-layer depth residual network and the recurrent neural network is a long short-term memory network; the encoder receives an image sequence of variable length as input, first extracts the spatial features of each frame with the depth residual network, and then completes the semantic encoding of the continuous sign language video through the long short-term memory network; the long short-term memory network takes the spatial features extracted by the convolutional neural network as input and continuously updates its hidden state, which contains the features of the time dimension; the hidden states of all time steps are averaged to obtain the encoded semantic vector c, which serves as an input of the decoder:
c = (1/T') Σ_{t=1}^{T'} h_t
where T' is the length of the input sequence and h_t is the hidden state of the recurrent neural network at time t.
Preferably, in the decoder module, the decoder consists of a long short-term memory network, a word embedding layer and a fully connected layer; the long short-term memory network is initialized with the hidden state at the last time step of the encoder; the word embedding layer extracts a semantic vector from the word output at the previous time step, which is then concatenated with the features encoded by the encoder as the input of the long short-term memory network; after the long short-term memory network is updated, the fully connected layer takes the current hidden state as input, generates a probability distribution over the word currently being output, and finally the word with the highest probability is selected as the output.
Preferably, in the encoder module and the decoder module, the long short-term memory networks are both bidirectional long short-term memory networks.
Preferably, the output display module allows a sign language recognition mode to be selected and outputs the top five candidate results with the highest probability in real time.
Compared with the prior art, the invention has the following beneficial effects:
1) In the invention, the graph convolution neural network can effectively process the joint data. The joint data is actually a topological graph rather than a grid data form, which makes it difficult to use convolutional neural networks for feature extraction. By means of graph convolution, the topological structure of joint data can be fully utilized, and the internal relation between the joint points can be extracted.
2) According to the invention, the three-dimensional convolution neural network performs convolution in the spatial and temporal dimensions simultaneously and can directly extract the spatio-temporal features of sign language actions. The three-dimensional depth residual network with a depth of 18 balances recognition accuracy and running speed, maintaining a reasonable level of accuracy while improving speed.
3) In the invention, multi-mode data is used as input, and the joint data and the RGB image recognition result are fused, so that the robustness and stability of the sign language recognition system are improved.
4) In the invention, the encoder-decoder structure makes full use of the global features of sign language actions and learns the language model of continuous sign language; it can effectively solve the sequence-to-sequence problem of continuous sign language recognition and learn the mapping between a sign language video and the sentence it expresses.
Drawings
Fig. 1 is a system frame diagram of the present invention.
FIG. 2 is a block diagram of a data processing module program according to the present invention.
Fig. 3 is an algorithm flow chart of the graph convolution neural network module of the present invention.
Fig. 4 is an algorithm flow chart of the three-dimensional convolutional neural network module of the present invention.
Fig. 5 is a network structure diagram of an encoder-decoder module of the present invention.
FIG. 6 is a schematic diagram of a "home" interface of the output display module of the present invention.
FIG. 7 is a schematic diagram of a "sign language word recognition" interface of the output display module of the present invention.
Fig. 8 is a schematic diagram of a "continuous sign language recognition" interface of the output display module of the present invention.
Detailed Description
The invention is further illustrated and described below in conjunction with the drawings and detailed description.
In this embodiment, a system frame diagram of a chinese sign language recognition system based on deep learning is shown in fig. 1. The composition modules of the whole system comprise:
the data acquisition module is used for acquiring sign language action information;
the data processing module is used for preprocessing data;
the graph convolution neural network module is used for recognizing sign language words of the joint data;
the three-dimensional convolutional neural network module is used for recognizing sign language words of the RGB image;
the fusion module is used for fusing joint data and RGB image recognition results;
an encoder module for encoding continuous sign language information;
a decoder module for decoding successive sign language encoded information;
the output display module is used for displaying output results in real time;
the output end of the data acquisition module is connected with the data processing module, the output of the data processing module is respectively connected with the graph convolution neural network module, the three-dimensional convolution neural network module and the encoder module, the outputs of the graph convolution neural network and the three-dimensional convolution neural network are connected with the fusion module, the encoder module is connected with the decoder module, and the fusion module and the decoder module are connected with the output display module.
Among the above modules, a sign language word recognition module for realizing a sign language word recognition mode is formed by a graph convolution neural network module, a three-dimensional convolution neural network module and a fusion module, and a continuous sign language recognition module for realizing a continuous sign language recognition mode is formed by an encoder module and a decoder module, wherein one of the two modes operates selectively and can be selected according to the needs of users.
When the Chinese sign language recognition system is used, firstly, the data acquisition module acquires image data and joint data of sign language actions, and then the data processing module performs preprocessing. Then, sign language word recognition or continuous sign language recognition is performed according to the selection of the user. In the sign language word recognition module, the graph convolution neural network predicts the probability of the sign language word class based on the joint data, and the three-dimensional convolution neural network predicts the probability of the sign language word class based on the RGB image. The outputs are each a vector of length 500, each element representing the sign language word probability of the corresponding class. The fusion module performs weighted average on the two results to obtain a final prediction result. In the continuous sign language recognition module, the complete sentence expressed by the sign language action is recognized through the encoder-decoder network. The identification results of the two modes are displayed in real time through the output display module, and the first five candidate results with the highest probability are displayed.
Specific implementation forms of the modules are described in detail below with reference to the accompanying drawings.
1. And the data acquisition module is used for acquiring RGB images and human joint data when the human body makes sign language actions. In this embodiment, the data acquisition module uses a Kinect depth camera, which can acquire RGB images simultaneously and automatically obtain position coordinate data of 25 human joints.
2. And the data processing module is used for adjusting the size of the RGB image acquired by the data acquisition module and normalizing the pixel value. In this embodiment, a flow chart of the data processing module is shown in fig. 2. The data processing module performs preprocessing on the RGB image, wherein the preprocessing includes adjusting the size of the RGB image to 224×224, and normalizing the pixel value of each channel to satisfy a gaussian distribution with a mean value and a standard deviation of 0.5.
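For illustration only, the resize-and-normalize preprocessing described above could be sketched as follows, assuming a PyTorch/torchvision implementation; the function name preprocess_frame and the use of torchvision transforms are assumptions of this sketch, not part of the patent.

```python
# A minimal preprocessing sketch: resize an RGB frame to 224x224 and normalize
# each channel to mean 0.5 and standard deviation 0.5 (assumed PyTorch/torchvision).
import torch
from torchvision import transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),              # resize the RGB frame to 224x224
    transforms.ToTensor(),                      # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5],  # per-channel mean 0.5
                         std=[0.5, 0.5, 0.5]),  # per-channel std 0.5
])

def preprocess_frame(image: Image.Image) -> torch.Tensor:
    """Return a normalized 3x224x224 tensor for one RGB frame."""
    return preprocess(image)
```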
3. The sign language word recognition module is used for recognizing sign language words expressed by the sign language actions;
3.1, the function of the graph convolution neural network module is to construct a joint graph based on the human joint data acquired by the data acquisition module, and to carry out graph convolution on all adjacent nodes so as to output probability distribution of sign language word category, wherein the adjacent nodes simultaneously comprise adjacent nodes in space dimension and adjacent nodes in time dimension.
In this embodiment, the algorithm flow chart of the graph convolution neural network module is shown in fig. 3, and the specific process is as follows:
11) First, a joint graph is constructed based on the human joint data acquired by the data acquisition module; the nodes of the joint graph correspond to the coordinate information of the human joint points, the edges of the joint graph correspond to the connections between the joint points, and the same node is connected across adjacent frames in the time dimension;
12) A graph convolution operation is then performed on the joint graph. Given a K×K convolution kernel and an input feature f_in with c channels, and considering only the spatial distance, the output f_out(x) of a single channel at spatial position x is:
f_out(x) = Σ_{v_j ∈ B(v_i)} f_in(p(x, v_j)) · w(v_j)
where w is a weight function whose weight vector is inner-producted with the input features, similar to the weights of a convolutional neural network, and p is a sampling function that traverses the nodes near position x. Let v_i be the node at position x, let d be the node distance function, and require the distance of a traversed node to be at most D; then the node set B(v_i) is:
B(v_i) = {v_j | d(v_i, v_j) ≤ D}
With the above formula, the features of a single joint graph can be obtained when only the spatial distance is considered. However, to extract the temporal dynamics of the joint sequence, the same nodes of successive frames must be connected together and the definition of the sampling function p must be extended. The original role of the sampling function p is to traverse the nodes adjacent to position x; to capture information in the time dimension, the adjacent nodes must include not only the neighbors in the spatial dimension but also the neighbors in the temporal dimension. Let v_ti be the node at position x at time t, and require the time interval of a traversed node to be at most T; then the node set B(v_ti) should contain nodes that are close in space as well as nodes that are close in time.
Thus, in the present invention, after considering both the spatial and temporal dimensions, the output f_out(x) of a single channel at spatial position x can be expressed as:
f_out(x) = Σ_{v ∈ B(v_ti)} f_in(v) · w(v)
wherein: f_in(v) denotes the value of input feature f_in at node v, and w(v) denotes the weight of node v; B(v_ti) is the set of nodes to be traversed, where:
B(v_ti) = {v_qj | d(v_ti, v_qj) ≤ D, |q-t| ≤ T}
wherein: d(v_ti, v_qj) denotes the distance between node v_ti and node v_qj; v_ti is the node at position x at time t, and v_qj is a node near position x within the convolution kernel range at time q; D is the distance threshold in the spatial dimension and T is the distance threshold in the temporal dimension. In this formula, {v_qj | d(v_ti, v_qj) ≤ D, |q-t| ≤ T} denotes all nodes v_qj whose spatial distance to node v_ti does not exceed D and whose time interval does not exceed T.
After the output of each channel is computed, the outputs of all channels are weighted and summed to obtain the features of the single joint graph.
13) The features extracted from the single joint graph are mapped into a probability distribution over sign language word classes through a fully connected layer and normalized by a normalization function.
The graph convolution neural network can effectively process joint data. The entire model may be trained end-to-end by back propagation.
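As a rough illustration of the spatio-temporal graph convolution described above, the sketch below assumes a PyTorch implementation in which the spatial neighbor set B(v) is encoded by a normalized adjacency matrix A_hat over the joints and the temporal neighborhood |q-t| ≤ T by a one-dimensional temporal convolution; the class name, A_hat and the kernel sizes are illustrative assumptions, not the patent's exact network.

```python
# A minimal spatio-temporal graph-convolution sketch (assumed PyTorch implementation).
import torch
import torch.nn as nn

class SpatioTemporalGraphConv(nn.Module):
    """Spatial graph convolution over joints followed by a temporal convolution."""

    def __init__(self, in_channels, out_channels, A_hat, temporal_kernel=9):
        super().__init__()
        # A_hat: (V, V) normalized adjacency encoding the spatial neighbor set B(v)
        self.register_buffer("A_hat", A_hat)
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)  # node weights w(v)
        pad = (temporal_kernel - 1) // 2
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(temporal_kernel, 1),
                                  padding=(pad, 0))  # temporal neighbors (|q - t| <= T)

    def forward(self, x):
        # x: (N, C, T, V) -- batch, channels, frames, joints
        x = self.spatial(x)                                  # per-node weighting
        x = torch.einsum("nctv,vw->nctw", x, self.A_hat)     # sum over spatial neighbors
        return self.temporal(x)                              # aggregate over temporal neighbors
```

In this sketch the per-node weighting w(v) is realized by the 1×1 convolution and the summation over B(v) by the multiplication with A_hat.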
3.2, the three-dimensional convolution neural network module is used for convolving the RGB image processed by the data processing module with a three-dimensional convolution kernel covering both the spatial and temporal dimensions, extracting the spatio-temporal features of the RGB image, and then outputting a probability distribution over sign language word classes.
In this embodiment, an algorithm flow chart of the three-dimensional convolutional neural network module is shown in fig. 4. The three-dimensional convolutional neural network module uses a multi-layer three-dimensional depth residual network. The three-dimensional depth residual network depth of this embodiment is 18. The specific process in each layer of residual error network is as follows:
21) The input I of the residual network has size C×T×H×W, where C is the number of channels, T is the length in the time dimension, and H and W are the height and width of the image, respectively; the network convolves the image with a three-dimensional convolution kernel w:
f(x, y, t) = Σ_c Σ_{δx, δy, δt} w_c(δx, δy, δt) · I_c(x + δx, y + δy, t + δt) + b
wherein: f(x, y, t) denotes the convolution result, x and y denote pixel coordinates in the image, t denotes the position in the image sequence, c ∈ C denotes the image channel, δx, δy and δt denote offsets in the image height, width and time dimensions, respectively, and b denotes the bias.
As the depth of a convolutional neural network increases, the network may become more difficult to train. To solve this problem, an identity mapping is added on top of the original convolutional neural network so that the network instead learns a residual function F(x) = H(x) - x. Thus the output of the current residual layer is H(I) = F(I) + I, where F(I) is the convolution result of the network input I; this output is simultaneously the input of the next residual layer. The three-dimensional convolutional neural network retains the information of the time dimension and propagates it through the network.
The depth residual network with the identity mapping is easier to optimize, and by pre-training on the ImageNet dataset, the network can converge faster and migrate learning can be achieved.
22) After the three-dimensional depth residual network extracts the spatio-temporal features of the RGB image, the features are mapped into a probability distribution over sign language word classes through a fully connected layer and normalized.
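A minimal sketch of one residual layer implementing H(I) = F(I) + I with three-dimensional convolutions is given below, assuming a PyTorch implementation; the channel count and kernel sizes are illustrative, and the 18-layer network described above would stack several such blocks with downsampling stages.

```python
# A minimal 3-D residual block sketch (assumed PyTorch implementation).
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Two 3x3x3 convolutions with an identity shortcut: H(I) = F(I) + I."""

    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (N, C, T, H, W) -- the convolution runs over T, H and W simultaneously
        return self.relu(self.f(x) + x)   # F(I) + I
```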
And 3.3, a fusion module is used for fusing the output results of the graph convolution neural network module and the three-dimensional convolution neural network module through weighted average to obtain a final sign language word prediction result.
In the fusion module, the specific fusion process is as follows:
and obtaining probability distribution vectors of the sign language word categories output by the graph convolution neural network module and probability distribution vectors of the sign language word categories output by the three-dimensional convolution neural network. In this embodiment, the two network output probability distribution vectors are vectors with a length of 500, and the values are between 0 and 1, which represents the probability of each sign language word. And the fusion module carries out weighted average calculation on probability values in the two vectors in one-to-one correspondence to obtain a final output probability distribution vector. The specific weighting weight can be adjusted according to the requirement, and the weights of the output vectors of the graph convolutional neural network module and the three-dimensional convolutional neural network are preferably 0.4 and 0.6 respectively.
4. And the continuous sign language identification module is used for identifying complete sentences expressed by continuous sign language actions. In this embodiment, a network structure diagram of the encoder-decoder module is shown in fig. 5.
And 4.1, the encoder module comprises a convolutional neural network and a cyclic neural network, and is used for respectively extracting the spatial features and the temporal features of continuous sign language and generating global semantic information of continuous sign language actions.
In this embodiment, the convolutional neural network in the encoder module is a depth residual network with a depth of 18, and the recurrent neural network is a long-short-term memory network. The depth residual error network is used for learning the spatial characteristics of the video, and the long-term and short-term memory network is used for modeling the time sequence information.
The long short-term memory network overcomes the vanishing-gradient problem by adding an internal cell state to the original recurrent neural network structure and using an input gate, a forget gate and an update gate. A long short-term memory cell takes the cell state c_{t-1} and hidden state h_{t-1} of the previous time step together with the input x_t at the current time step, and updates as follows:
i_t = σ(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi)
f_t = σ(W_if x_t + b_if + W_hf h_{t-1} + b_hf)
g_t = tanh(W_ig x_t + b_ig + W_hg h_{t-1} + b_hg)
o_t = σ(W_io x_t + b_io + W_ho h_{t-1} + b_ho)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
where h_t is the hidden state at the current time step, c_t is the cell state at the current time step, and i_t, f_t, g_t and o_t are the input gate, forget gate, update gate and output gate, respectively. W denotes a weight matrix, b a bias, σ and tanh are activation functions, and ⊙ denotes element-wise multiplication.
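To make the gate arithmetic above concrete, the following from-scratch sketch implements the same update equations, assuming a PyTorch implementation; in practice a library LSTM (e.g. nn.LSTM) would normally be used, and the class name is illustrative.

```python
# A from-scratch sketch of the LSTM update equations above (assumed PyTorch code).
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.w_x = nn.Linear(input_size, 4 * hidden_size)   # input weights and biases
        self.w_h = nn.Linear(hidden_size, 4 * hidden_size)  # recurrent weights and biases

    def forward(self, x_t, h_prev, c_prev):
        gates = self.w_x(x_t) + self.w_h(h_prev)
        i, f, g, o = gates.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c_t = f * c_prev + i * g            # c_t = f_t (*) c_{t-1} + i_t (*) g_t
        h_t = o * torch.tanh(c_t)           # h_t = o_t (*) tanh(c_t)
        return h_t, c_t
```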
The encoder receives an image sequence of variable length as input; it first extracts the spatial features of each frame with the depth residual network and then completes the semantic encoding of the continuous sign language video through the long short-term memory network. The long short-term memory network takes the spatial features extracted by the convolutional neural network as input and continuously updates its hidden state, which in effect contains the features of the time dimension. The hidden states of all time steps are averaged to obtain the encoded semantic vector c, which serves as an input of the decoder:
c = (1/T') Σ_{t=1}^{T'} h_t
where T' is the length of the input sequence and h_t is the hidden state of the recurrent neural network at time t.
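A minimal encoder sketch following this description, assuming a PyTorch implementation, is shown below; feature_dim and hidden_dim are illustrative, the per-frame spatial features are assumed to come from a separate depth residual network, and a bidirectional LSTM is used as in the preferred embodiment.

```python
# A minimal encoder sketch: BiLSTM over per-frame CNN features, averaged hidden states
# (assumed PyTorch code).
import torch
import torch.nn as nn

class SignLanguageEncoder(nn.Module):
    def __init__(self, feature_dim=512, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, frame_features):
        # frame_features: (N, T', feature_dim) spatial features of each frame
        hidden_states, (h_n, c_n) = self.lstm(frame_features)  # (N, T', 2*hidden_dim)
        c = hidden_states.mean(dim=1)   # semantic vector: average over all T' time steps
        return c, hidden_states, (h_n, c_n)
```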
4.2 the decoder module adopts a cyclic neural network, and predicts the output of the next moment by using the global semantic information encoded by the encoder, the output of the last moment and the hidden layer state of the cyclic neural network.
In the decoder module, the decoder consists of a long short-term memory network, a word embedding layer and a fully connected layer. The decoder takes the feature vector encoded by the encoder, the output of the previous time step and the previous hidden state as inputs, updates the hidden state, and predicts the currently output word from the updated hidden state and the previous output. The long short-term memory network of the decoder is initialized with the hidden state at the last time step of the encoder. The word embedding layer extracts a semantic vector w from the word y_{t-1} output at the previous time step, which is then concatenated with the feature encoded by the encoder, i.e. the semantic vector c, as the input x_t of the long short-term memory network. After the long short-term memory network is updated, the fully connected layer takes the current hidden state h_t as input, generates a probability distribution over the currently output word, and finally the word with the highest probability is selected as the output. This is expressed as:
w = WordEmbedding(y_{t-1})
x_t = [w, c]
h_t = φ(W_h x_t + U_h h_{t-1} + b_h)
y_t = φ(U_y h_t + b_y)
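A single decoding step corresponding to the equations above could be sketched as follows, assuming a PyTorch implementation; a unidirectional LSTMCell is used here purely to keep the sketch short, and the vocabulary and dimension sizes are illustrative.

```python
# A minimal single-step decoder sketch: embed previous word, concatenate with the
# encoder's semantic vector, update the LSTM state, predict the next word
# (assumed PyTorch code).
import torch
import torch.nn as nn

class SignLanguageDecoder(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=128, context_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # word embedding layer
        self.cell = nn.LSTMCell(embed_dim + context_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)            # fully connected layer

    def step(self, y_prev, c, state):
        # y_prev: (N,) previous output word indices; c: (N, context_dim) encoder vector
        w = self.embed(y_prev)               # w = WordEmbedding(y_{t-1})
        x_t = torch.cat([w, c], dim=-1)      # x_t = [w, c]
        h_t, c_t = self.cell(x_t, state)     # update hidden and cell state
        logits = self.fc(h_t)                # distribution over the current output word
        y_t = logits.argmax(dim=-1)          # pick the most probable word
        return y_t, (h_t, c_t)
```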
in both the encoder and decoder modules, the long and short term memory network is a two-way long and short term memory network. The one-way long-short-term memory network can only model the information of the current time step and the previous time step, and the input after the current time step does not have any contribution to the generation of the final output. In the continuous sign language recognition problem, since sign language video represents a semantic sentence having a grammatical structure, the information of the context should be fully utilized. The use of a two-way long and short term memory network enables better utilization of context information.
5. And the output display module is used for enabling a user to select a sign language word recognition mode or a continuous sign language recognition mode and displaying the output result of the sign language word recognition module or the continuous sign language recognition module according to the selection of the user.
In this embodiment, the "home" interface of the output display module is shown in fig. 6, and the physical form of the interface is a screen, such as a mobile phone screen or a PC screen. The user may select either the word recognition mode by screen sign language or the continuous sign language recognition mode. The schematic of the "sign language word recognition" interface is shown in fig. 7, and the schematic of the "continuous sign language recognition" interface is shown in fig. 8, all of which show the first five candidate results with the highest probability.
Therefore, the invention can convert sign language into text and facilitate communication between people with hearing impairment and hearing people.
The above embodiment is only a preferred embodiment of the present invention, but it is not intended to limit the present invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, all the technical schemes obtained by adopting the equivalent substitution or equivalent transformation are within the protection scope of the invention.

Claims (8)

1. A chinese sign language recognition system based on deep learning, comprising:
the data acquisition module is used for acquiring RGB images and human joint data when the human body makes sign language actions;
the data processing module is used for adjusting the size of the RGB image acquired by the data acquisition module and normalizing the pixel value;
the sign language word recognition module consists of a graph convolution neural network module, a three-dimensional convolution neural network module and a fusion module and is used for recognizing sign language words expressed by sign language actions; the graph convolution neural network module constructs a joint graph based on the human joint data acquired by the data acquisition module, and performs graph convolution on all adjacent nodes so as to output probability distribution of sign language word categories, wherein the adjacent nodes simultaneously comprise adjacent nodes in space dimension and adjacent nodes in time dimension; the three-dimensional convolution neural network module convolves the RGB image processed by the data processing module by utilizing a three-dimensional convolution kernel comprising a space dimension and a time dimension, extracts the spatio-temporal features of the RGB image, and further outputs the probability distribution of sign language word class; the fusion module is used for fusing the output results of the graph convolution neural network module and the three-dimensional convolution neural network module through weighted average to obtain a final sign language word prediction result;
the continuous sign language identification module consists of an encoder module and a decoder module and is used for identifying complete sentences expressed by continuous sign language actions; the encoder module comprises a convolutional neural network and a cyclic neural network, and is used for respectively extracting the spatial features and the temporal features of continuous sign language and generating global semantic information of continuous sign language actions; the decoder module adopts a cyclic neural network, and predicts the output of the next moment by using the global semantic information encoded by the encoder, the output of the last moment and the hidden layer state of the cyclic neural network;
the output display module is used for enabling a user to select a sign language word recognition mode or a continuous sign language recognition mode and displaying the output result of the sign language word recognition module or the continuous sign language recognition module according to the selection of the user;
in the graph convolution neural network module, the specific process is as follows:
11) firstly, constructing a joint graph based on the human joint data acquired by the data acquisition module, wherein the nodes of the joint graph correspond to the coordinate information of human joint points, the edges of the joint graph correspond to the connections between the joint points, and the same nodes are connected in the time dimension;
12) then performing a graph convolution operation on the joint graph; in the operation process, given a K×K convolution kernel and an input feature f_in with c channels, the output f_out(x) of a single channel at spatial position x is:
f_out(x) = Σ_{v ∈ B(v_ti)} f_in(v) · w(v)
wherein: f_in(v) denotes the value of input feature f_in at node v, and w(v) denotes the weight of node v; B(v_ti) is the set of nodes to be traversed, wherein:
B(v_ti) = {v_qj | d(v_ti, v_qj) ≤ D, |q-t| ≤ T}
wherein: d(v_ti, v_qj) denotes the distance between node v_ti and node v_qj; v_ti is the node at position x at time t, and v_qj is a node near position x within the convolution kernel range at time q; D is a distance threshold in the spatial dimension, and T is a distance threshold in the temporal dimension;
the outputs of all channels are weighted and summed to obtain the features of a single joint graph;
13) mapping the features extracted from the single joint graph into a probability distribution over sign language word classes through a fully connected layer, and normalizing them with a normalization function;
the three-dimensional convolutional neural network module uses a multi-layer three-dimensional depth residual network, wherein the specific process in each layer of the residual network is as follows:
21) the input I of the residual network has size C×T×H×W, where C is the number of channels, T is the length in the time dimension, and H and W are the height and width of the image, respectively; the network convolves the image with a three-dimensional convolution kernel w:
f(x, y, t) = Σ_c Σ_{δx, δy, δt} w_c(δx, δy, δt) · I_c(x + δx, y + δy, t + δt) + b
wherein: f(x, y, t) denotes the convolution result, x and y denote pixel coordinates in the image, t denotes the position in the image sequence, c denotes the image channel, δx, δy and δt denote offsets in the image height, width and time dimensions, respectively, and b denotes the bias;
an identity mapping is added on top of the network, converting it into the residual form H(I) = F(I) + I, wherein F(I) is the convolution result of the network input I, H(I) is the output of the current layer of the residual network, and H(I) simultaneously serves as the input of the next layer of the residual network;
22) after the three-dimensional depth residual network extracts the spatio-temporal features of the RGB image, the features are mapped into a probability distribution over sign language word classes through a fully connected layer and normalized.
2. The chinese sign language recognition system of claim 1, wherein said data acquisition module uses a Kinect depth camera to simultaneously acquire RGB images and 25 human joint data when a human body makes a sign language motion.
3. The deep learning-based chinese sign language recognition system of claim 1, wherein the data processing module adjusts the RGB image size collected by the data collecting module to 224 x 224 images, and normalizes the pixel values to satisfy a gaussian distribution with a mean and standard deviation of 0.5.
4. The chinese sign language recognition system based on deep learning of claim 1, wherein the specific fusion process in the fusion module is as follows:
acquiring probability distribution vectors of sign language word categories output by a graph convolution neural network module and probability distribution vectors of sign language word categories output by a three-dimensional convolution neural network, wherein the numerical values in the probability distribution vectors are all between 0 and 1, and represent the probability of each type of sign language word; and carrying out weighted average calculation on probability values in the two vectors in one-to-one correspondence to obtain a probability distribution vector which is finally output.
5. The chinese sign language recognition system based on deep learning of claim 1, wherein in said encoder module, the convolutional neural network is a multi-layer depth residual network, and the recurrent neural network is a long short-term memory network; the encoder receives an image sequence of variable length as input, first extracts the spatial features of each frame with the depth residual network, and then completes the semantic encoding of the continuous sign language video through the long short-term memory network; the long short-term memory network takes the spatial features extracted by the convolutional neural network as input and continuously updates its hidden state, which contains the features of the time dimension; the hidden states of all time steps are averaged to obtain the encoded semantic vector c, which serves as an input of the decoder:
c = (1/T') Σ_{t=1}^{T'} h_t
where T' is the length of the input sequence and h_t is the hidden state of the recurrent neural network at time t.
6. The deep learning based chinese sign language recognition system of claim 1, wherein the decoder comprises a long-short term memory network, a word embedding layer and a full connection layer, the long-short term memory network is initialized by using a hidden layer state at the last moment of the encoder; the word embedding layer extracts semantic vectors from words output at the previous moment, and then connects the semantic vectors with the characteristics encoded by the encoder to serve as input of a long-term and short-term memory network; after the long-period memory network is updated, the full-connection layer takes the hidden layer state at the current moment as input, generates probability distribution of the word output at present, and finally selects the word with the highest probability as output.
7. The deep learning based chinese sign language recognition system of claim 1, wherein in the encoder module and the decoder module, the long short-term memory networks are both bidirectional long short-term memory networks.
8. The deep learning-based chinese sign language recognition system of claim 1, wherein the output display module is capable of selecting a sign language recognition mode and outputting the first five candidate results with highest probability in real time.
CN202010699780.9A 2020-07-20 2020-07-20 Chinese sign language recognition system based on deep learning Active CN111723779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010699780.9A CN111723779B (en) 2020-07-20 2020-07-20 Chinese sign language recognition system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010699780.9A CN111723779B (en) 2020-07-20 2020-07-20 Chinese sign language recognition system based on deep learning

Publications (2)

Publication Number Publication Date
CN111723779A CN111723779A (en) 2020-09-29
CN111723779B true CN111723779B (en) 2023-05-02

Family

ID=72572899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010699780.9A Active CN111723779B (en) 2020-07-20 2020-07-20 Chinese sign language recognition system based on deep learning

Country Status (1)

Country Link
CN (1) CN111723779B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149618B (en) * 2020-10-14 2022-09-09 紫清智行科技(北京)有限公司 Pedestrian abnormal behavior detection method and device suitable for inspection vehicle
CN113780059A (en) * 2021-07-24 2021-12-10 上海大学 Continuous sign language identification method based on multiple feature points
CN113781876B (en) * 2021-08-05 2023-08-29 深兰科技(上海)有限公司 Conversion method and device for converting text into sign language action video
CN113792607B (en) * 2021-08-19 2024-01-05 辽宁科技大学 Neural network sign language classification and identification method based on Transformer

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10304208B1 (en) * 2018-02-12 2019-05-28 Avodah Labs, Inc. Automated gesture identification using neural networks
CN110084296A (en) * 2019-04-22 2019-08-02 中山大学 A kind of figure expression learning framework and its multi-tag classification method based on certain semantic
CN110399850A (en) * 2019-07-30 2019-11-01 西安工业大学 A kind of continuous sign language recognition method based on deep neural network
CN110796110A (en) * 2019-11-05 2020-02-14 西安电子科技大学 Human behavior identification method and system based on graph convolution network
CN111259804A (en) * 2020-01-16 2020-06-09 合肥工业大学 Multi-mode fusion sign language recognition system and method based on graph convolution
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11263409B2 (en) * 2017-11-03 2022-03-01 Board Of Trustees Of Michigan State University System and apparatus for non-intrusive word and sentence level sign language translation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10304208B1 (en) * 2018-02-12 2019-05-28 Avodah Labs, Inc. Automated gesture identification using neural networks
CN110084296A (en) * 2019-04-22 2019-08-02 中山大学 A kind of figure expression learning framework and its multi-tag classification method based on certain semantic
CN110399850A (en) * 2019-07-30 2019-11-01 西安工业大学 A kind of continuous sign language recognition method based on deep neural network
CN110796110A (en) * 2019-11-05 2020-02-14 西安电子科技大学 Human behavior identification method and system based on graph convolution network
CN111259804A (en) * 2020-01-16 2020-06-09 合肥工业大学 Multi-mode fusion sign language recognition system and method based on graph convolution
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Y. Liao et al. Dynamic Sign Language Recognition Based on Video Sequence With BLSTM-3D Residual Networks. IEEE Access, 2019, 7: 38044-38054. *
李为斌; 刘佳. An Overview of Vision-Based Dynamic Gesture Recognition (基于视觉的动态手势识别概述). 计算机应用与软件 (Computer Applications and Software), 2020, (03), full text. *

Also Published As

Publication number Publication date
CN111723779A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN111723779B (en) Chinese sign language recognition system based on deep learning
CN112949622B (en) Bimodal character classification method and device for fusing text and image
CN113297955B (en) Sign language word recognition method based on multi-mode hierarchical information fusion
Sharma et al. Vision-based sign language recognition system: A Comprehensive Review
CN105373810B (en) Method and system for establishing motion recognition model
CN113792177A (en) Scene character visual question-answering method based on knowledge-guided deep attention network
Balasuriya et al. Learning platform for visually impaired children through artificial intelligence and computer vision
CN112905762A (en) Visual question-answering method based on equal attention-deficit-diagram network
CN106548194A (en) The construction method and localization method of two dimensional image human joint pointses location model
Krishnaraj et al. A Glove based approach to recognize Indian Sign Languages
CN113780059A (en) Continuous sign language identification method based on multiple feature points
Shinde et al. Sign language to text and vice versa recognition using computer vision in Marathi
CN112906520A (en) Gesture coding-based action recognition method and device
CN113240714B (en) Human motion intention prediction method based on context awareness network
CN112738555B (en) Video processing method and device
CN117218725A (en) Real-time sign language recognition and translation system and method based on edge equipment
CN111178141B (en) LSTM human body behavior identification method based on attention mechanism
CN114663910A (en) Multi-mode learning state analysis system
Ganpatye et al. Motion Based Indian Sign Language Recognition using Deep Learning
CN112633224A (en) Social relationship identification method and device, electronic equipment and storage medium
Jadhav et al. GoogLeNet application towards gesture recognition for ASL character identification
Mishra et al. Environment descriptor for the visually impaired
CN117576279B (en) Digital person driving method and system based on multi-mode data
CN113838218B (en) Speech driving virtual human gesture synthesis method for sensing environment
CN113111721B (en) Human behavior intelligent identification method based on multi-unmanned aerial vehicle visual angle image data driving

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant