CN111723779A - Chinese sign language recognition system based on deep learning - Google Patents

Chinese sign language recognition system based on deep learning

Info

Publication number
CN111723779A
Authority
CN
China
Prior art keywords
sign language
module
neural network
network
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010699780.9A
Other languages
Chinese (zh)
Other versions
CN111723779B (en)
Inventor
张浩东
李威杰
谢亮
熊蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010699780.9A priority Critical patent/CN111723779B/en
Publication of CN111723779A publication Critical patent/CN111723779A/en
Application granted granted Critical
Publication of CN111723779B publication Critical patent/CN111723779B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06F 18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06N 3/045: Combinations of networks
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a Chinese sign language recognition system based on deep learning. The system operates in two modes, sign language word recognition and continuous sign language recognition, used respectively to recognize the words and the sentences expressed by sign language actions. The whole system consists of a data acquisition module, a data processing module, a recognition module and an output display module, where the sign language word recognition module consists of a graph convolutional neural network and a three-dimensional convolutional neural network, and the continuous sign language recognition module consists of an encoder-decoder network. The system collects images and joint data of sign language actions through the data acquisition module, preprocesses them, feeds the data into the recognition module, and finally outputs the corresponding sign language words or sentences. The invention can convert sign language into text and facilitate communication between hearing-impaired and ordinary people. It is highly practical and stable and easy to popularize and apply.

Description

Chinese sign language recognition system based on deep learning
Technical Field
The invention relates to a sign language recognition system, in particular to a Chinese sign language recognition system based on a deep learning algorithm.
Background
Sign language is a language expressed through gestures, body movements, facial expressions, etc. without using the voice. It is the main means of communication among hearing-impaired people, and ordinary people find it difficult to communicate with them without special training. Sign language recognition aims to convert sign language into speech or text and thereby facilitate communication between hearing-impaired and ordinary people, and it answers a wide social demand. According to the World Health Organization, about 466 million people worldwide suffer from disabling hearing loss, more than 5% of the world population. Hearing loss causes considerable inconvenience: people with hearing impairment find it hard to communicate with ordinary people and face great social pressure. It is therefore necessary to design a general-purpose sign language recognition system to address these problems.
The sign language recognition task can be divided into sign language word recognition and continuous sign language recognition: the former recognizes a single word and the latter recognizes a complete sentence. Input data sources include video, skeleton information, and the like. Traditional methods collect data with a data glove, convert the gestures and limb movements of sign language into hand-crafted features, and then classify the sign language with these features using machine learning methods. Such methods are not robust or practical enough because of the limited representational power of hand-crafted features and the inconvenience of wearing data gloves.
Disclosure of Invention
To address the social demand for sign language recognition and the problems above, the invention provides a Chinese sign language recognition system based on deep learning that is practical, robust, easy to use, low-cost and easy to popularize. The system can recognize 500 classes of sign language words and 100 classes of sign language sentences and display the recognition results in real time, helping to overcome the communication barrier between hearing-impaired and ordinary people.
The invention adopts the specific technical scheme that:
a deep learning based chinese sign language recognition system, comprising:
the data acquisition module is used for acquiring RGB images and human joint data when a human body makes sign language actions;
the data processing module is used for adjusting the size of the RGB image acquired by the data acquisition module and normalizing the pixel value;
the sign language word recognition module consists of a graph convolution neural network module, a three-dimensional convolution neural network module and a fusion module and is used for recognizing sign language words expressed by sign language actions;
the graph convolution neural network module constructs a joint graph based on the human body joint data acquired by the data acquisition module, performs graph convolution on all adjacent nodes and further outputs probability distribution of sign language word categories, wherein the adjacent nodes simultaneously comprise adjacent nodes in space dimension and adjacent nodes in time dimension;
the three-dimensional convolutional neural network module is used for convolving the RGB images processed by the data processing module with a three-dimensional convolution kernel spanning the spatial and temporal dimensions, extracting the spatio-temporal features of the RGB images and then outputting a probability distribution over the sign language word categories;
the fusion module is used for fusing the output results of the graph convolution neural network module and the three-dimensional convolution neural network module through weighted average to obtain a final sign language word prediction result;
the continuous sign language recognition module consists of an encoder module and a decoder module and is used for recognizing complete sentences expressed by continuous sign language actions;
the encoder module comprises a convolutional neural network and a recurrent neural network and is used for extracting the spatial and temporal features of the continuous sign language respectively and generating the global semantic information of the continuous sign language action;
the decoder module adopts a recurrent neural network and predicts the output at the next time step from the global semantic information encoded by the encoder, the output at the previous time step and the hidden state of the recurrent neural network.
And the output display module is used for enabling a user to select the sign language word recognition mode or the continuous sign language recognition mode and displaying an output result of the sign language word recognition module or the continuous sign language recognition module according to the selection of the user.
Preferably, the data acquisition module uses a Kinect depth camera and can acquire RGB images and 25 human body joint data when the human body performs sign language actions at the same time.
Preferably, the data processing module resizes the RGB image acquired by the data acquisition module to 224 × 224 image, and normalizes the pixel values so that they satisfy a gaussian distribution with a mean and a standard deviation of 0.5.
Preferably, in the graph convolution neural network module, the specific process is as follows:
11) a joint graph is first constructed from the human body joint data acquired by the data acquisition module: the nodes of the joint graph correspond to the coordinate information of the human joint points, its edges correspond to the connections between joint points, and the same node is also connected across frames in the time dimension;
12) a graph convolution operation is performed on the joint graph. Given a convolution kernel of size K × K and an input feature map f_in with c channels, the output f_out(x) of a single channel at spatial position x is:
f_out(x) = Σ_{v ∈ B(v_ti)} f_in(v) · w(v)
where f_in(v) denotes the feature value of node v in the input features f_in, w(v) denotes the weight of node v, and B(v_ti) is the set of nodes to be traversed:
B(v_ti) = { v_qj | d(v_ti, v_qj) ≤ D, |q - t| ≤ T }
where d(v_ti, v_qj) denotes the distance between nodes v_ti and v_qj, v_ti is the node at position x at time t, v_qj is a node near position x within the convolution kernel range at time q, D is the distance threshold in the spatial dimension and T is the distance threshold in the temporal dimension;
the outputs of all channels are combined by weighted summation to obtain the features of a single joint graph;
13) the features extracted from the single joint graph are mapped to a probability distribution over the sign language word categories through a fully connected layer and normalized with a normalization function.
Preferably, the three-dimensional convolutional neural network module uses a multi-layer three-dimensional deep residual network, and the specific process in each residual layer is as follows:
21) the input I of the residual network has size C × T × H × W, where C is the number of channels, T is the length in the time dimension, and H and W are the height and width of the image; the network convolves the input with a three-dimensional convolution kernel W:
F(x, y, t) = Σ_c Σ_{Δx, Δy, Δt} W_c(Δx, Δy, Δt) · I_c(x + Δx, y + Δy, t + Δt) + b
where F(x, y, t) denotes the convolution result, x and y denote pixel coordinates in the image, t denotes the position in the image sequence, c indexes the image channels, Δx, Δy and Δt denote offsets in the image height, width and time dimensions respectively, and b denotes the bias;
an identity mapping is added on top of the network, turning it into one that learns a residual function H(I) = F(I) + I, where F(I) is the convolution result of the network input I and H(I) is the output of the current residual layer, which serves as the input of the next residual layer;
22) after the three-dimensional deep residual network has extracted the spatio-temporal features of the RGB images, the features are mapped to a probability distribution over the sign language word categories through a fully connected layer and normalized.
Preferably, in the fusion module, the specific fusion process is as follows:
acquiring probability distribution vectors of sign language word categories output by a graph convolution neural network module and probability distribution vectors of sign language word categories output by a three-dimensional convolution neural network, wherein numerical values in the probability distribution vectors are all between 0 and 1 and represent the probability of each category of sign language words; and carrying out weighted average calculation on the probability values in the two vectors in a one-to-one correspondence manner to obtain the finally output probability distribution vector.
Further, in the fusion module, when performing weighted average calculation, the weights of the output vectors of the graph convolution neural network module and the three-dimensional convolution neural network are preferably 0.4 and 0.6, respectively.
Preferably, in the encoder module, the convolutional neural network is a multi-layer deep residual network and the recurrent neural network is a long short-term memory network; the encoder receives an image sequence of variable length as input, first extracts the spatial features of each frame with the deep residual network, and then completes the semantic encoding of the continuous sign language video with the long short-term memory network; the long short-term memory network takes the spatial features extracted by the convolutional neural network as input and continuously updates its hidden state, which thus contains features of the time dimension; the hidden states of all time steps are averaged to obtain the encoded semantic vector c, which serves as the input of the decoder:
c = (1/T') Σ_{t=1..T'} h_t
where T' is the length of the input sequence and h_t is the hidden state of the recurrent neural network at time t.
Preferably, in the decoder module, the decoder is composed of a long short-term memory network, a word embedding layer and a fully connected layer, and the long short-term memory network is initialized with the hidden state of the encoder at the last time step; the word embedding layer extracts a semantic vector from the word output at the previous time step, which is then concatenated with the features encoded by the encoder and used as the input of the long short-term memory network; after the long short-term memory network has been updated, the fully connected layer takes the hidden state at the current time step as input, generates the probability distribution of the currently output word, and finally the word with the highest probability is selected as the output.
Preferably, the long short-term memory networks in both the encoder module and the decoder module are bidirectional long short-term memory networks.
Preferably, the output display module can select the sign language recognition mode and output the top five candidate results with the highest probability in real time.
Compared with the prior art, the invention has the following beneficial effects:
1) in the invention, the graph convolution neural network can effectively process the joint data. The joint data is actually a topological map, not in the form of grid data, which makes it difficult to perform feature extraction using convolutional neural networks. By means of graph convolution, the topological structure of joint data can be fully utilized, and the internal relation between joint points can be extracted.
2) In the invention, the three-dimensional convolutional neural network convolves over the spatial and temporal dimensions simultaneously and directly extracts the spatio-temporal features of the sign language action. The three-dimensional deep residual network with a depth of 18 balances recognition accuracy against running speed, maintaining a good recognition accuracy while improving speed.
3) In the invention, multi-modal data is used as input, and the recognition results of the joint data and the RGB image are fused, so that the robustness and the stability of the sign language recognition system are improved.
4) In the invention, the encoder-decoder structure makes fuller use of the global features of the sign language actions and learns a language model of continuous sign language, thereby effectively solving the sequence recognition problem of continuous sign language and learning the mapping between a sign language video and the sentence it expresses.
Drawings
FIG. 1 is a system framework diagram of the present invention.
FIG. 2 is a data processing module program framework diagram of the present invention.
FIG. 3 is a flow chart of the algorithm of the graph convolution neural network module of the present invention.
FIG. 4 is an algorithmic flow diagram of the three-dimensional convolutional neural network module of the present invention.
Fig. 5 is a network architecture diagram of an encoder-decoder module of the present invention.
FIG. 6 is a schematic diagram of a "home page" interface of the output display module according to the present invention.
FIG. 7 is a schematic diagram of a "sign language word recognition" interface of the output display module of the present invention.
FIG. 8 is a schematic diagram of a "continuous sign language recognition" interface of the output display module according to the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
In the present embodiment, a system framework diagram of a deep learning based Chinese sign language recognition system is shown in fig. 1. The whole system comprises the following components:
the data acquisition module is used for acquiring sign language action information;
the data processing module is used for data preprocessing;
the graph convolutional neural network module is used for carrying out sign language word recognition on joint data;
the three-dimensional convolution neural network module is used for carrying out sign language word recognition on the RGB image;
the fusion module is used for fusing joint data and an RGB image recognition result;
an encoder module for encoding the continuous sign language information;
a decoder module for decoding the continuous sign language encoded information;
the output display module is used for displaying an output result in real time;
the output end of the data acquisition module is connected with the data processing module, the output of the data processing module can be respectively connected with the graph convolution neural network module, the three-dimensional convolution neural network module and the encoder module, the output of the graph convolution neural network and the three-dimensional convolution neural network can be connected with the fusion module, the encoder module can be connected with the decoder module, and the fusion module and the decoder module can be connected with the output display module.
Among the above modules, the graph convolution neural network module, the three-dimensional convolution neural network module and the fusion module form a sign language word recognition module for realizing the sign language word recognition mode, and the encoder module and the decoder module form a continuous sign language recognition module for realizing the continuous sign language recognition mode.
When the Chinese sign language recognition system is used, firstly, the data acquisition module acquires image data and joint data of sign language actions, and then the data processing module carries out preprocessing. And then performing sign language word recognition or continuous sign language recognition according to the selection of the user. In the sign language word recognition module, the graph convolution neural network can predict the probability of sign language word classes based on joint data, and the three-dimensional convolution neural network can predict the probability of sign language word classes based on RGB images. Their output is a vector of length 500, each element representing the probability of a sign language word for the corresponding category. The fusion module performs weighted average on the two results to obtain a final prediction result. In the continuous sign language recognition module, the complete sentence expressed by the sign language action is recognized through the encoder-decoder network. The recognition results of the two modes are displayed in real time through the output display module, and the first five candidate results with the highest probability are displayed.
The specific implementation of each module is described in detail below with reference to the accompanying drawings.
1. The data acquisition module is used for acquiring RGB images and human joint data while a person performs sign language actions. In this embodiment, the data acquisition module uses a Kinect depth camera, which simultaneously captures RGB images and automatically provides the position coordinates of 25 human body joints.
2. And the data processing module is used for adjusting the size of the RGB image acquired by the data acquisition module and normalizing the pixel value. In the present embodiment, a program flow diagram of the data processing module is shown in fig. 2. The data processing module pre-processes the RGB image, including resizing the RGB image to 224 x 224 and normalizing the pixel values for each channel to satisfy a gaussian distribution with both a mean and a standard deviation of 0.5.
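As a minimal illustration of this preprocessing step, the resizing and normalization described above could be written as follows in Python with torchvision (the library choice and the helper name preprocess_clip are assumptions of this sketch, not part of the patent):

```python
import torch
from torchvision import transforms

# Resize each RGB frame to 224 x 224 and normalize every channel
# with mean 0.5 and standard deviation 0.5, as described above.
preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),                      # scales pixel values to [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5],
                         std=[0.5, 0.5, 0.5]),
])

def preprocess_clip(frames):
    """frames: list of H x W x 3 uint8 arrays -> tensor of shape (T, 3, 224, 224)."""
    return torch.stack([preprocess(f) for f in frames])
```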
3. The sign language word recognition module is used for recognizing the sign language words expressed by sign language actions;
3.1, the graph convolution neural network module is used for constructing an articular graph based on the human body articular data acquired by the data acquisition module, carrying out graph convolution on all adjacent nodes and further outputting probability distribution of sign language word categories, wherein the adjacent nodes simultaneously comprise adjacent nodes in space dimension and adjacent nodes in time dimension.
In this embodiment, the algorithm flowchart of the graph convolutional neural network module is shown in fig. 3, and the specific process in the module is as follows:
11) A joint graph is first constructed from the human body joint data acquired by the data acquisition module: the nodes of the joint graph correspond to the coordinate information of the human joint points, its edges correspond to the connections between joint points, and the same node is also connected across frames in the time dimension;
12) A graph convolution operation is performed on the joint graph. Given a convolution kernel of size K × K and an input feature map f_in with c channels, and considering only the spatial distance, the output f_out(x) of a single channel at spatial position x is:
f_out(x) = Σ_{v ∈ B(v_i)} f_in(p(x, v)) · w(v)
where w is a weight function whose weight vector is inner-multiplied with the input features, similar to the weights of a convolutional neural network, and p is a sampling function that traverses the nodes near position x. Suppose v_i is the node at position x and d(·,·) is the node distance function; the distance of any traversed node must not exceed D, so the node set B(v_i) traversed by the sampling function is:
B(v_i) = { v_j | d(v_i, v_j) ≤ D }
With the formula above, the features of a single joint graph can be obtained when only the spatial distance is considered. To extract the temporal dynamics of the joint sequence, however, the same node in consecutive frames must be connected and the definition of the sampling function p must be extended. The sampling function p originally traverses only the nodes adjacent to position x; to cover the time dimension, the adjacent nodes must include not only the spatial neighbours but also the temporal neighbours. Suppose v_ti is the node at position x at time t and the time interval of any traversed node must not exceed T; the node set B(v_ti) traversed by the extended sampling function then includes both spatially and temporally nearby nodes.
Thus, in the present invention, the output f_out(x) of a single channel at spatial position x, taking both the spatial and temporal dimensions into account, can be expressed as follows:
f_out(x) = Σ_{v ∈ B(v_ti)} f_in(v) · w(v)
where f_in(v) denotes the feature value of node v in the input features f_in, w(v) denotes the weight of node v, and B(v_ti) is the set of nodes to be traversed:
B(v_ti) = { v_qj | d(v_ti, v_qj) ≤ D, |q - t| ≤ T }
where d(v_ti, v_qj) denotes the distance between nodes v_ti and v_qj, v_ti is the node at position x at time t, v_qj is a node near position x within the convolution kernel range at time q, D is the distance threshold in the spatial dimension and T is the distance threshold in the temporal dimension. In this formula, { v_qj | d(v_ti, v_qj) ≤ D, |q - t| ≤ T } denotes all nodes v_qj whose spatial distance from node v_ti does not exceed D and whose time interval from it does not exceed T.
After the output of each channel has been computed, the outputs of all channels are combined by weighted summation to obtain the features of a single joint graph.
13) The features extracted from the single joint graph are mapped to a probability distribution over the sign language word categories through a fully connected layer and normalized with a normalization function.
The graph convolutional neural network can process joint data effectively, and the entire model can be trained end-to-end by backpropagation.
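For illustration only, a single spatio-temporal graph convolution layer of the kind described above could be sketched as follows in Python/PyTorch; the tensor layout (batch, channels, frames, joints), the normalized adjacency matrix A and the layer sizes are assumptions of this sketch rather than details fixed by the patent:

```python
import torch
import torch.nn as nn

class STGraphConv(nn.Module):
    """One spatio-temporal graph convolution block (illustrative sketch).

    x: joint features of shape (N, C, T, V) - batch, channels, frames, joints
    A: normalized adjacency matrix of shape (V, V) encoding the joint graph.
    """
    def __init__(self, in_channels, out_channels, temporal_kernel=9):
        super().__init__()
        # 1x1 convolution plays the role of the learnable node weights w(v)
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # temporal convolution links the same joint across neighbouring frames
        pad = (temporal_kernel - 1) // 2
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(temporal_kernel, 1),
                                  padding=(pad, 0))
        self.relu = nn.ReLU()

    def forward(self, x, A):
        x = self.spatial(x)                       # mix channels per joint
        x = torch.einsum('nctv,vw->nctw', x, A)   # aggregate spatial neighbours B(v)
        x = self.temporal(x)                      # aggregate temporal neighbours
        return self.relu(x)
```

Stacking several such blocks, pooling over time and joints, and ending with a fully connected layer and a softmax would give the probability distribution over the 500 sign language word classes described above.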
3.2. The three-dimensional convolutional neural network module is used for convolving the RGB images processed by the data processing module with a three-dimensional convolution kernel spanning the spatial and temporal dimensions, extracting the spatio-temporal features of the RGB images and then outputting a probability distribution over the sign language word categories.
In the present embodiment, the algorithm flowchart of the three-dimensional convolutional neural network module is shown in fig. 4. The module uses a multi-layer three-dimensional deep residual network; in this embodiment its depth is 18. The specific process in each residual layer is as follows:
21) The input I of the residual network has size C × T × H × W, where C is the number of channels, T is the length in the time dimension, and H and W are the height and width of the image; the network convolves the input with a three-dimensional convolution kernel W:
F(x, y, t) = Σ_c Σ_{Δx, Δy, Δt} W_c(Δx, Δy, Δt) · I_c(x + Δx, y + Δy, t + Δt) + b
where F(x, y, t) denotes the convolution result, x and y denote pixel coordinates in the image, t denotes the position in the image sequence, c indexes the image channels (c ∈ C), Δx, Δy and Δt denote offsets in the image height, width and time dimensions respectively, and b denotes the bias.
As the depth of a convolutional neural network increases, the network becomes harder to train. To solve this problem, an identity mapping is added on top of the original convolutional neural network so that the network learns a residual function F(x) = H(x) - x. The output of the current residual layer is therefore H(I) = F(I) + I, where F(I) is the convolution result of the layer input I, and H(I) also serves as the input of the next residual layer. The three-dimensional convolutional neural network retains information in the time dimension and propagates it through the network.
The deep residual network with identity mappings is easier to optimize, and pre-training on the ImageNet dataset makes the network converge faster and realizes transfer learning.
22) After the three-dimensional deep residual network has extracted the spatio-temporal features of the RGB images, the features are mapped to a probability distribution over the sign language word categories through a fully connected layer and normalized.
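As a sketch of this branch, the 18-layer 3D residual network available in torchvision can serve as a stand-in for the network described above (the specific library model r3d_18 and the classifier head are assumptions; the patent does not name an implementation):

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class SignWord3DCNN(nn.Module):
    """3D deep residual network (depth 18) mapping an RGB clip to a
    probability distribution over 500 sign language word classes."""
    def __init__(self, num_classes=500):
        super().__init__()
        backbone = r3d_18(weights=None)          # 18-layer 3D ResNet
        backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
        self.backbone = backbone

    def forward(self, clip):
        # clip: (N, 3, T, H, W) - batch, channels, frames, height, width
        logits = self.backbone(clip)
        return torch.softmax(logits, dim=1)      # normalized class probabilities
```

Pre-trained weights (for example from a large image or video dataset) could be loaded into the backbone to realize the transfer learning mentioned above.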
3.3. The fusion module is used for fusing the output results of the graph convolutional neural network module and the three-dimensional convolutional neural network module through a weighted average to obtain the final sign language word prediction result.
In the fusion module, the specific fusion process is as follows:
The probability distribution vector of sign language word classes output by the graph convolutional neural network module and the one output by the three-dimensional convolutional neural network are first obtained. In this embodiment both output vectors have length 500, and every value lies between 0 and 1 and represents the probability of one class of sign language word. The fusion module then performs an element-wise weighted average of the probability values in the two vectors to obtain the finally output probability distribution vector. The weights can be adjusted as required; the weights of the output vectors of the graph convolutional neural network module and the three-dimensional convolutional neural network are preferably 0.4 and 0.6, respectively.
4. And the continuous sign language recognition module is used for recognizing the complete sentences expressed by the continuous sign language actions. In the present embodiment, a network configuration of the encoder-decoder module is shown in fig. 5.
4.1. The encoder module comprises a convolutional neural network and a recurrent neural network and is used for extracting the spatial and temporal features of the continuous sign language respectively and generating the global semantic information of the continuous sign language action.
In this embodiment, the convolutional neural network in the encoder module is a deep residual network with a depth of 18, and the recurrent neural network is a long short-term memory network. The deep residual network learns the spatial features of the video, and the long short-term memory network models the temporal information.
The long short-term memory network overcomes the vanishing gradient problem by adding an internal cell state to the original recurrent neural network structure and using an input gate, a forget gate and an update gate. A long short-term memory cell takes the cell state c_{t-1} and hidden state h_{t-1} of the previous time step and the input x_t at the current time step, and updates as follows:
i_t = σ(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi)
f_t = σ(W_if x_t + b_if + W_hf h_{t-1} + b_hf)
g_t = tanh(W_ig x_t + b_ig + W_hg h_{t-1} + b_hg)
o_t = σ(W_io x_t + b_io + W_ho h_{t-1} + b_ho)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
where h_t is the hidden state at the current time, c_t is the cell state at the current time, i_t, f_t, g_t and o_t are the input gate, forget gate, cell candidate and output gate respectively, the W terms are weight matrices, the b terms are biases, σ and tanh are activation functions, and ⊙ denotes element-wise multiplication.
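As a sanity check, the six update equations above can be written directly in Python; this bare-bones cell is for illustration only, and in practice a library implementation such as torch.nn.LSTM would be used:

```python
import torch

def lstm_cell_step(x_t, h_prev, c_prev, p):
    """One LSTM update step following the equations above.

    p is a dict holding the weight matrices (W_ii, W_hi, ...) and biases (b_ii, ...).
    """
    sig, tanh = torch.sigmoid, torch.tanh
    i_t = sig(p["W_ii"] @ x_t + p["b_ii"] + p["W_hi"] @ h_prev + p["b_hi"])
    f_t = sig(p["W_if"] @ x_t + p["b_if"] + p["W_hf"] @ h_prev + p["b_hf"])
    g_t = tanh(p["W_ig"] @ x_t + p["b_ig"] + p["W_hg"] @ h_prev + p["b_hg"])
    o_t = sig(p["W_io"] @ x_t + p["b_io"] + p["W_ho"] @ h_prev + p["b_ho"])
    c_t = f_t * c_prev + i_t * g_t      # c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
    h_t = o_t * tanh(c_t)               # h_t = o_t ⊙ tanh(c_t)
    return h_t, c_t
```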
The encoder receives an image sequence of variable length as input, first extracts the spatial features of each frame with the deep residual network, and then completes the semantic encoding of the continuous sign language video with the long short-term memory network. The long short-term memory network takes the spatial features extracted by the convolutional neural network as input and continuously updates its hidden state, which thus contains features of the time dimension. The hidden states of all time steps are averaged to obtain the encoded semantic vector c, which serves as the input of the decoder:
c = (1/T') Σ_{t=1..T'} h_t
where T' is the length of the input sequence and h_t is the hidden state of the recurrent neural network at time t.
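A condensed encoder sketch along these lines in Python/PyTorch follows; the per-frame 2D ResNet-18, the hidden size and the bidirectional flag are illustrative assumptions rather than values taken from the patent:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SignEncoder(nn.Module):
    """CNN + LSTM encoder: per-frame spatial features, then temporal encoding.
    Returns the mean of the hidden states as the semantic vector c."""
    def __init__(self, hidden=512):
        super().__init__()
        cnn = resnet18(weights=None)
        cnn.fc = nn.Identity()                   # keep the 512-d spatial feature
        self.cnn = cnn
        self.lstm = nn.LSTM(512, hidden, batch_first=True, bidirectional=True)

    def forward(self, frames):
        # frames: (N, T, 3, 224, 224)
        n, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(n, t, -1)   # (N, T, 512)
        hidden_states, state = self.lstm(feats)                  # (N, T, 2*hidden)
        c = hidden_states.mean(dim=1)            # semantic vector: average over time
        return c, state                          # state can initialize the decoder
```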
4.2. The decoder module adopts a recurrent neural network and predicts the output at the next time step from the global semantic information encoded by the encoder, the output at the previous time step and the hidden state of the recurrent neural network.
In the decoder module, the decoder consists of a long short-term memory network, a word embedding layer and a fully connected layer. The decoder takes the feature vector encoded by the encoder, the output at the previous time step and the previous hidden state as input, updates its hidden state, and predicts the currently output word from the updated hidden state and the previous output. The decoder's long short-term memory network is initialized with the hidden state of the encoder at its last time step. The word embedding layer extracts a semantic vector w from the word y_{t-1} output at the previous time step, which is concatenated with the semantic vector c to form the input x_t of the long short-term memory network. After the long short-term memory network has been updated, the fully connected layer takes the hidden state h_t at the current time step as input, generates the probability distribution of the currently output word, and finally the word with the highest probability is selected as the output, as follows:
w = WordEmbedding(y_{t-1})
x_t = [w, c]
h_t = φ(W_h x_t + U_h h_{t-1} + b_h)
y_t = φ(U_y h_t + b_y)
in both the encoder module and the decoder module, the long-short term memory network is a bidirectional long-short term memory network. The one-way long-short term memory network can only model the information of the current time step and the previous time step, and the input after the current time step does not contribute to generating the final output. In the continuous sign language recognition problem, since the sign language video represents a semantic sentence with a grammatical structure, contextual information should be fully utilized. The use of a two-way long-short term memory network enables better utilization of the context information.
5. And the output display module is used for enabling a user to select the sign language word recognition mode or the continuous sign language recognition mode and displaying an output result of the sign language word recognition module or the continuous sign language recognition module according to the selection of the user.
In this embodiment, a schematic diagram of the "home" interface of the output display module is shown in fig. 6; the physical form of the interface is a screen, such as a mobile phone or PC screen. The user selects either the sign language word recognition mode or the continuous sign language recognition mode through this screen. The sign language word recognition interface is shown in fig. 7 and the continuous sign language recognition interface in fig. 8; both display the top five candidate results with the highest probability.
Therefore, the method can convert the sign language into the text and promote the communication between the person with the hearing impairment and the ordinary person.
The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims (10)

1. A Chinese sign language recognition system based on deep learning is characterized by comprising:
the data acquisition module is used for acquiring RGB images and human joint data when a human body makes sign language actions;
the data processing module is used for adjusting the size of the RGB image acquired by the data acquisition module and normalizing the pixel value;
the sign language word recognition module consists of a graph convolution neural network module, a three-dimensional convolution neural network module and a fusion module and is used for recognizing sign language words expressed by sign language actions;
the graph convolution neural network module constructs a joint graph based on the human body joint data acquired by the data acquisition module, performs graph convolution on all adjacent nodes and further outputs probability distribution of sign language word categories, wherein the adjacent nodes simultaneously comprise adjacent nodes in space dimension and adjacent nodes in time dimension;
the three-dimensional convolutional neural network module is used for convolving the RGB images processed by the data processing module with a three-dimensional convolution kernel spanning the spatial and temporal dimensions, extracting the spatio-temporal features of the RGB images and then outputting a probability distribution over the sign language word categories;
the fusion module is used for fusing the output results of the graph convolution neural network module and the three-dimensional convolution neural network module through weighted average to obtain a final sign language word prediction result;
the continuous sign language recognition module consists of an encoder module and a decoder module and is used for recognizing complete sentences expressed by continuous sign language actions;
the encoder module comprises a convolutional neural network and a recurrent neural network and is used for extracting the spatial and temporal features of the continuous sign language respectively and generating the global semantic information of the continuous sign language action;
the decoder module adopts a recurrent neural network and predicts the output at the next time step from the global semantic information encoded by the encoder, the output at the previous time step and the hidden state of the recurrent neural network.
And the output display module is used for enabling a user to select the sign language word recognition mode or the continuous sign language recognition mode and displaying an output result of the sign language word recognition module or the continuous sign language recognition module according to the selection of the user.
2. The system as claimed in claim 1, wherein the data collection module uses a Kinect depth camera to collect RGB images and 25 human joint data when human body is doing sign language.
3. The system of claim 1, wherein the data processing module resizes the RGB image collected by the data collection module to 224 × 224 image, and normalizes the pixel values to satisfy gaussian distribution with mean and standard deviation of 0.5.
4. The system for Chinese sign language recognition based on deep learning of claim 1, wherein the specific process in the graph convolutional neural network module is as follows:
11) a joint graph is first constructed from the human body joint data acquired by the data acquisition module: the nodes of the joint graph correspond to the coordinate information of the human joint points, its edges correspond to the connections between joint points, and the same node is also connected across frames in the time dimension;
12) a graph convolution operation is performed on the joint graph. Given a convolution kernel of size K × K and an input feature map f_in with c channels, the output f_out(x) of a single channel at spatial position x is:
f_out(x) = Σ_{v ∈ B(v_ti)} f_in(v) · w(v)
where f_in(v) denotes the feature value of node v in the input features f_in, w(v) denotes the weight of node v, and B(v_ti) is the set of nodes to be traversed:
B(v_ti) = { v_qj | d(v_ti, v_qj) ≤ D, |q - t| ≤ T }
where d(v_ti, v_qj) denotes the distance between nodes v_ti and v_qj, v_ti is the node at position x at time t, v_qj is a node near position x within the convolution kernel range at time q, D is the distance threshold in the spatial dimension and T is the distance threshold in the temporal dimension;
the outputs of all channels are combined by weighted summation to obtain the features of a single joint graph;
13) the features extracted from the single joint graph are mapped to a probability distribution over the sign language word categories through a fully connected layer and normalized with a normalization function.
5. The system of claim 1, wherein the three-dimensional convolutional neural network module uses a plurality of layers of three-dimensional deep residual error networks, and the specific process in each layer of residual error network is as follows:
21) the input I of the residual network has size C × T × H × W, where C is the number of channels, T is the length in the time dimension, and H and W are the height and width of the image; the network convolves the input with a three-dimensional convolution kernel W:
F(x, y, t) = Σ_c Σ_{Δx, Δy, Δt} W_c(Δx, Δy, Δt) · I_c(x + Δx, y + Δy, t + Δt) + b
where F(x, y, t) denotes the convolution result, x and y denote pixel coordinates in the image, t denotes the position in the image sequence, c indexes the image channels, Δx, Δy and Δt denote offsets in the image height, width and time dimensions respectively, and b denotes the bias;
an identity mapping is added on top of the network, turning it into one that learns a residual function H(I) = F(I) + I, where F(I) is the convolution result of the network input I and H(I) is the output of the current residual layer, which serves as the input of the next residual layer;
22) after the three-dimensional deep residual network has extracted the spatio-temporal features of the RGB images, the features are mapped to a probability distribution over the sign language word categories through a fully connected layer and normalized.
6. The system for recognizing Chinese sign language based on deep learning of claim 1, wherein the fusion module specifically fuses as follows:
acquiring probability distribution vectors of sign language word categories output by a graph convolution neural network module and probability distribution vectors of sign language word categories output by a three-dimensional convolution neural network, wherein numerical values in the probability distribution vectors are all between 0 and 1 and represent the probability of each category of sign language words; and carrying out weighted average calculation on the probability values in the two vectors in a one-to-one correspondence manner to obtain the finally output probability distribution vector.
7. The system according to claim 1, wherein in the encoder module, the convolutional neural network is a multi-layer deep residual network and the recurrent neural network is a long short-term memory network; the encoder receives an image sequence of variable length as input, first extracts the spatial features of each frame with the deep residual network, and then completes the semantic encoding of the continuous sign language video with the long short-term memory network; the long short-term memory network takes the spatial features extracted by the convolutional neural network as input and continuously updates its hidden state, which thus contains features of the time dimension; the hidden states of all time steps are averaged to obtain the encoded semantic vector c, which serves as the input of the decoder:
c = (1/T') Σ_{t=1..T'} h_t
where T' is the length of the input sequence and h_t is the hidden state of the recurrent neural network at time t.
8. The system as claimed in claim 1, wherein the decoder module consists of a long short-term memory network, a word embedding layer and a fully connected layer, the long short-term memory network being initialized with the hidden state of the encoder at the last time step; the word embedding layer extracts a semantic vector from the word output at the previous time step, which is then concatenated with the features encoded by the encoder and used as the input of the long short-term memory network; after the long short-term memory network has been updated, the fully connected layer takes the hidden state at the current time step as input, generates the probability distribution of the currently output word, and finally the word with the highest probability is selected as the output.
9. The system as claimed in claim 1, wherein the long-short term memory network in the encoder module and the decoder module is a bidirectional long-short term memory network.
10. The system as claimed in claim 1, wherein the output display module is capable of selecting sign language recognition mode and outputting the top five candidate results with highest probability in real time.
CN202010699780.9A 2020-07-20 2020-07-20 Chinese sign language recognition system based on deep learning Active CN111723779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010699780.9A CN111723779B (en) 2020-07-20 2020-07-20 Chinese sign language recognition system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010699780.9A CN111723779B (en) 2020-07-20 2020-07-20 Chinese sign language recognition system based on deep learning

Publications (2)

Publication Number Publication Date
CN111723779A true CN111723779A (en) 2020-09-29
CN111723779B CN111723779B (en) 2023-05-02

Family

ID=72572899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010699780.9A Active CN111723779B (en) 2020-07-20 2020-07-20 Chinese sign language recognition system based on deep learning

Country Status (1)

Country Link
CN (1) CN111723779B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190138607A1 (en) * 2017-11-03 2019-05-09 Board Of Trustees Of Michigan State University System and apparatus for non-intrusive word and sentence level sign language translation
US10304208B1 (en) * 2018-02-12 2019-05-28 Avodah Labs, Inc. Automated gesture identification using neural networks
CN110084296A (en) * 2019-04-22 2019-08-02 中山大学 A kind of figure expression learning framework and its multi-tag classification method based on certain semantic
CN110399850A (en) * 2019-07-30 2019-11-01 西安工业大学 A kind of continuous sign language recognition method based on deep neural network
CN110796110A (en) * 2019-11-05 2020-02-14 西安电子科技大学 Human behavior identification method and system based on graph convolution network
CN111259804A (en) * 2020-01-16 2020-06-09 合肥工业大学 Multi-mode fusion sign language recognition system and method based on graph convolution
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Y. Liao et al.: "Dynamic Sign Language Recognition Based on Video Sequence With BLSTM-3D Residual Networks" *
李为斌; 刘佳: "Overview of vision-based dynamic gesture recognition" (基于视觉的动态手势识别概述) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149618A (en) * 2020-10-14 2020-12-29 紫清智行科技(北京)有限公司 Pedestrian abnormal behavior detection method and device suitable for inspection vehicle
CN113781876A (en) * 2021-08-05 2021-12-10 深兰科技(上海)有限公司 Method and device for converting text into sign language action video
CN113781876B (en) * 2021-08-05 2023-08-29 深兰科技(上海)有限公司 Conversion method and device for converting text into sign language action video
CN113792607A (en) * 2021-08-19 2021-12-14 辽宁科技大学 Neural network sign language classification and identification method based on Transformer
CN113792607B (en) * 2021-08-19 2024-01-05 辽宁科技大学 Neural network sign language classification and identification method based on Transformer

Also Published As

Publication number Publication date
CN111723779B (en) 2023-05-02

Similar Documents

Publication Publication Date Title
CN111723779B (en) Chinese sign language recognition system based on deep learning
CN111259804B (en) Multi-modal fusion sign language recognition system and method based on graph convolution
CN111339837B (en) Continuous sign language recognition method
CN109886072B (en) Face attribute classification system based on bidirectional Ladder structure
CN111984772B (en) Medical image question-answering method and system based on deep learning
CN109711356B (en) Expression recognition method and system
CN112329525A (en) Gesture recognition method and device based on space-time diagram convolutional neural network
Sharma et al. Vision-based sign language recognition system: A Comprehensive Review
CN111354246A (en) System and method for helping deaf-mute to communicate
CN114842547A (en) Sign language teaching method, device and system based on gesture action generation and recognition
CN113297955A (en) Sign language word recognition method based on multi-mode hierarchical information fusion
CN112906520A (en) Gesture coding-based action recognition method and device
CN112905762A (en) Visual question-answering method based on equal attention-deficit-diagram network
Krishnaraj et al. A Glove based approach to recognize Indian Sign Languages
CN113780059A (en) Continuous sign language identification method based on multiple feature points
Dixit et al. Audio to indian and american sign language converter using machine translation and nlp technique
CN113240714A (en) Human motion intention prediction method based on context-aware network
CN116453025A (en) Volleyball match group behavior identification method integrating space-time information in frame-missing environment
CN116229507A (en) Human body posture detection method and system
CN113609923B (en) Attention-based continuous sign language sentence recognition method
CN111178141B (en) LSTM human body behavior identification method based on attention mechanism
CN114821781A (en) Multi-source fusion lip language identification method and system based on infrared low-light-level telescope
CN113936333A (en) Action recognition algorithm based on human body skeleton sequence
CN112633224A (en) Social relationship identification method and device, electronic equipment and storage medium
CN111079661A (en) Sign language recognition system

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant