CN111723779B - Chinese sign language recognition system based on deep learning - Google Patents


Info

Publication number
CN111723779B
CN111723779B (application CN202010699780.9A)
Authority
CN
China
Prior art keywords
sign language
module
network
neural network
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010699780.9A
Other languages
Chinese (zh)
Other versions
CN111723779A (en)
Inventor
张浩东
李威杰
谢亮
熊蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010699780.9A priority Critical patent/CN111723779B/en
Publication of CN111723779A publication Critical patent/CN111723779A/en
Application granted granted Critical
Publication of CN111723779B publication Critical patent/CN111723779B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a Chinese sign language recognition system based on deep learning. The system provides two modes, sign language word recognition and continuous sign language recognition, which recognize the words and the sentences expressed by sign language actions respectively. The whole system consists of a data acquisition module, a data processing module, a recognition module and an output display module; the sign language word recognition module consists of a graph convolution neural network and a three-dimensional convolution neural network, and the continuous sign language recognition module consists of an encoder-decoder network. The system collects images and joint data of sign language actions through the data acquisition module, preprocesses them, inputs the data into the recognition module, and finally outputs the corresponding sign language words or sentences. The invention can convert sign language into text to facilitate communication between people with hearing impairment and hearing people. The invention is highly practical and stable, and is convenient to popularize and apply.

Description

Chinese sign language recognition system based on deep learning
Technical Field
The invention relates to a sign language recognition system, in particular to a Chinese sign language recognition system based on a deep learning algorithm.
Background
Sign language is a language expressed through gestures, limb movements, facial expressions and the like, without using voice. It is the main mode of communication among people with hearing impairment, and ordinary people can hardly communicate with them without special study. Sign language recognition aims to convert sign language into voice or text, facilitating communication between people with hearing impairment and hearing people. The sign language recognition task has wide social demand: according to World Health Organization statistics, about 466 million people worldwide suffer from disabling hearing loss, exceeding 5% of the world population. Hearing loss brings inconvenience to this population; people with hearing impairment find it difficult to communicate with hearing people and face tremendous social pressure. It is therefore necessary to design a general sign language recognition system to address these problems.
Sign language recognition tasks can be divided into sign language word recognition and continuous sign language recognition: sign language word recognition recognizes a single word, while continuous sign language recognition recognizes a complete sentence. Input data sources include video, skeleton information, and so on. Traditional methods use data gloves to collect data, convert the gestures and limb movements of sign language into manually designed features, and classify the sign language from these features with machine learning methods to complete the recognition task. Because the representation capability of manual features is limited and data gloves are not easy to use, these methods are not robust or practical enough.
Disclosure of Invention
Aiming at the social demand for sign language recognition and the problems of existing approaches, the invention provides a Chinese sign language recognition system based on deep learning that is practical, robust, easy to use, low-cost, and convenient to popularize. The invention can recognize 500 kinds of Chinese sign language words and 100 kinds of Chinese sign language sentences, display the recognition result in real time, and overcome the communication barrier between hearing-impaired people and hearing people.
The invention adopts the specific technical scheme that:
a chinese sign language recognition system based on deep learning, comprising:
the data acquisition module is used for acquiring RGB images and human joint data when the human body makes sign language actions;
the data processing module is used for adjusting the size of the RGB image acquired by the data acquisition module and normalizing the pixel value;
the sign language word recognition module consists of a graph convolution neural network module, a three-dimensional convolution neural network module and a fusion module and is used for recognizing sign language words expressed by sign language actions;
the graph convolution neural network module constructs a joint graph based on the human joint data acquired by the data acquisition module, and performs graph convolution on all adjacent nodes so as to output probability distribution of sign language word categories, wherein the adjacent nodes simultaneously comprise adjacent nodes in space dimension and adjacent nodes in time dimension;
the three-dimensional convolution neural network module convolves the RGB image processed by the data processing module with a three-dimensional convolution kernel covering both the spatial and temporal dimensions, extracts the spatio-temporal features of the RGB image, and then outputs a probability distribution over sign language word classes;
the fusion module is used for fusing the output results of the graph convolution neural network module and the three-dimensional convolution neural network module through weighted average to obtain a final sign language word prediction result;
the continuous sign language identification module consists of an encoder module and a decoder module and is used for identifying complete sentences expressed by continuous sign language actions;
the encoder module comprises a convolutional neural network and a cyclic neural network, and is used for respectively extracting the spatial features and the temporal features of continuous sign language and generating global semantic information of continuous sign language actions;
the decoder module adopts a cyclic neural network, and predicts the output of the next moment by using the global semantic information encoded by the encoder, the output of the last moment and the hidden layer state of the cyclic neural network.
And the output display module is used for enabling the user to select the sign language word recognition mode or the continuous sign language recognition mode and displaying the output result of the sign language word recognition module or the continuous sign language recognition module according to the selection of the user.
Preferably, the data acquisition module uses a Kinect depth camera to acquire RGB images and 25 human joint data simultaneously when a human body makes sign language actions.
Preferably, the data processing module adjusts the RGB image size acquired by the data acquisition module to 224×224 images, and normalizes the pixel values to satisfy gaussian distribution with both mean and standard deviation of 0.5.
Preferably, the specific process in the graph convolution neural network module is as follows:
11) First, a joint graph is constructed based on the human joint data acquired by the data acquisition module; the nodes of the joint graph correspond to the coordinate information of the human joint points, the edges of the joint graph correspond to the connections between the joint points, and the same node is connected across adjacent frames in the time dimension;
12) A graph convolution operation is then performed on the joint graph; given a K×K convolution kernel and an input feature f_in with c channels, the output f_out(x) of a single channel at spatial position x is:
f_out(x) = Σ_{v ∈ B(v_ti)} f_in(v) · w(v)
wherein: f_in(v) denotes the value of input feature f_in at node v, and w(v) denotes the weight of node v; B(v_ti) is the set of nodes to be traversed, where:
B(v_ti) = {v_qj | d(v_ti, v_qj) ≤ D, |q-t| ≤ T}
wherein: d(v_ti, v_qj) denotes the distance between node v_ti and node v_qj; v_ti is the node at position x at time t, and v_qj is a node near position x within the convolution kernel range at time q; D is the distance threshold in the spatial dimension and T is the distance threshold in the temporal dimension;
the outputs of all channels are weighted and summed to obtain the features of a single joint graph;
13) The features extracted from the single joint graph are mapped into a probability distribution over sign language word classes through a fully connected layer and normalized by a normalization function.
Preferably, the three-dimensional convolutional neural network module uses a multi-layer three-dimensional depth residual network, wherein the specific process in each layer of residual network is as follows:
21) The input I of the residual network has size C×T×H×W, where C is the number of channels, T is the length in the time dimension, and H and W are the height and width of the image, respectively; the network convolves the image with a three-dimensional convolution kernel w:
f(x, y, t) = Σ_c Σ_{δx, δy, δt} w_c(δx, δy, δt) · I_c(x + δx, y + δy, t + δt) + b
wherein: f(x, y, t) denotes the convolution result, x and y denote pixel coordinates in the image, t denotes the position in the image sequence, c denotes the image channel, δx, δy and δt denote offsets in the image height, width and time dimensions, respectively, and b denotes the bias;
an identity mapping is added on top of the network, converting it into the residual form H(I) = F(I) + I, where F(I) is the convolution result of the network input I and H(I) is the output of the current residual layer; H(I) simultaneously serves as the input of the next residual layer;
22) After the three-dimensional depth residual network extracts the spatio-temporal features of the RGB image, the features are mapped into a probability distribution over sign language word classes through a fully connected layer and normalized.
Preferably, in the fusion module, the specific fusion process is as follows:
acquiring probability distribution vectors of sign language word categories output by a graph convolution neural network module and probability distribution vectors of sign language word categories output by a three-dimensional convolution neural network, wherein the numerical values in the probability distribution vectors are all between 0 and 1, and represent the probability of each type of sign language word; and carrying out weighted average calculation on probability values in the two vectors in one-to-one correspondence to obtain a probability distribution vector which is finally output.
In the fusion module, when the weighted average calculation is performed, the weights of the output vectors of the graph convolution neural network module and the three-dimensional convolution neural network are preferably 0.4 and 0.6 respectively.
Preferably, in the encoder module, the convolutional neural network is a multi-layer depth residual network and the recurrent neural network is a long short-term memory network; the encoder receives an image sequence of variable length as input, first extracts the spatial features of each frame with the depth residual network, and then completes the semantic encoding of the continuous sign language video through the long short-term memory network; the long short-term memory network takes the spatial features extracted by the convolutional neural network as input and continuously updates its hidden state, which contains the features of the time dimension; the hidden states of all time steps are averaged to obtain the encoded semantic vector c, which serves as an input of the decoder:
c = (1/T') Σ_{t=1}^{T'} h_t
where T' is the length of the input sequence and h_t is the hidden state of the recurrent neural network at time t.
Preferably, in the decoder module, the decoder consists of a long short-term memory network, a word embedding layer and a fully connected layer; the long short-term memory network is initialized with the hidden state at the last time step of the encoder; the word embedding layer extracts a semantic vector from the word output at the previous time step, which is then concatenated with the features encoded by the encoder as the input of the long short-term memory network; after the long short-term memory network is updated, the fully connected layer takes the current hidden state as input, generates a probability distribution over the word currently being output, and finally the word with the highest probability is selected as the output.
Preferably, in the encoder module and the decoder module, the long short-term memory networks are both bidirectional long short-term memory networks.
Preferably, the output display module allows a sign language recognition mode to be selected and outputs the top five candidate results with the highest probability in real time.
Compared with the prior art, the invention has the following beneficial effects:
1) In the invention, the graph convolution neural network can effectively process the joint data. The joint data is actually a topological graph rather than a grid data form, which makes it difficult to use convolutional neural networks for feature extraction. By means of graph convolution, the topological structure of joint data can be fully utilized, and the internal relation between the joint points can be extracted.
2) According to the invention, the three-dimensional convolution neural network performs convolution in the spatial and temporal dimensions simultaneously and can directly extract the spatio-temporal features of sign language actions. The three-dimensional depth residual network with a depth of 18 balances recognition accuracy and running speed, maintaining a reasonable level of accuracy while improving speed.
3) In the invention, multi-mode data is used as input, and the joint data and the RGB image recognition result are fused, so that the robustness and stability of the sign language recognition system are improved.
4) In the invention, the encoder-decoder structure makes full use of the global features of sign language actions and learns the language model of continuous sign language; it can effectively solve the sequence-to-sequence problem of continuous sign language recognition and learn the mapping between a sign language video and the sentence it expresses.
Drawings
Fig. 1 is a system frame diagram of the present invention.
FIG. 2 is a block diagram of a data processing module program according to the present invention.
Fig. 3 is an algorithm flow chart of the graph convolution neural network module of the present invention.
Fig. 4 is an algorithm flow chart of the three-dimensional convolutional neural network module of the present invention.
Fig. 5 is a network structure diagram of an encoder-decoder module of the present invention.
FIG. 6 is a schematic diagram of a "home" interface of the output display module of the present invention.
FIG. 7 is a schematic diagram of a "sign language word recognition" interface of the output display module of the present invention.
Fig. 8 is a schematic diagram of a "continuous sign language recognition" interface of the output display module of the present invention.
Detailed Description
The invention is further illustrated and described below in conjunction with the drawings and detailed description.
In this embodiment, a system frame diagram of a chinese sign language recognition system based on deep learning is shown in fig. 1. The composition modules of the whole system comprise:
the data acquisition module is used for acquiring sign language action information;
the data processing module is used for preprocessing data;
the graph convolution neural network module is used for recognizing sign language words of the joint data;
the three-dimensional convolutional neural network module is used for recognizing sign language words of the RGB image;
the fusion module is used for fusing joint data and RGB image recognition results;
an encoder module for encoding continuous sign language information;
a decoder module for decoding successive sign language encoded information;
the output display module is used for displaying output results in real time;
the output end of the data acquisition module is connected with the data processing module, the output of the data processing module is respectively connected with the graph convolution neural network module, the three-dimensional convolution neural network module and the encoder module, the outputs of the graph convolution neural network and the three-dimensional convolution neural network are connected with the fusion module, the encoder module is connected with the decoder module, and the fusion module and the decoder module are connected with the output display module.
Among the above modules, a sign language word recognition module for realizing a sign language word recognition mode is formed by a graph convolution neural network module, a three-dimensional convolution neural network module and a fusion module, and a continuous sign language recognition module for realizing a continuous sign language recognition mode is formed by an encoder module and a decoder module, wherein one of the two modes operates selectively and can be selected according to the needs of users.
When the Chinese sign language recognition system is used, firstly, the data acquisition module acquires image data and joint data of sign language actions, and then the data processing module performs preprocessing. Then, sign language word recognition or continuous sign language recognition is performed according to the selection of the user. In the sign language word recognition module, the graph convolution neural network predicts the probability of the sign language word class based on the joint data, and the three-dimensional convolution neural network predicts the probability of the sign language word class based on the RGB image. The outputs are each a vector of length 500, each element representing the sign language word probability of the corresponding class. The fusion module performs weighted average on the two results to obtain a final prediction result. In the continuous sign language recognition module, the complete sentence expressed by the sign language action is recognized through the encoder-decoder network. The identification results of the two modes are displayed in real time through the output display module, and the first five candidate results with the highest probability are displayed.
Specific implementation forms of the modules are described in detail below with reference to the accompanying drawings.
1. And the data acquisition module is used for acquiring RGB images and human joint data when the human body makes sign language actions. In this embodiment, the data acquisition module uses a Kinect depth camera, which can acquire RGB images simultaneously and automatically obtain position coordinate data of 25 human joints.
2. And the data processing module is used for adjusting the size of the RGB image acquired by the data acquisition module and normalizing the pixel value. In this embodiment, a flow chart of the data processing module is shown in fig. 2. The data processing module performs preprocessing on the RGB image, wherein the preprocessing includes adjusting the size of the RGB image to 224×224, and normalizing the pixel value of each channel to satisfy a gaussian distribution with a mean value and a standard deviation of 0.5.
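For illustration only, the resize-and-normalize preprocessing described above could be sketched as follows, assuming a PyTorch/torchvision implementation; the function name preprocess_frame and the use of torchvision transforms are assumptions of this sketch, not part of the patent.

```python
# A minimal preprocessing sketch: resize an RGB frame to 224x224 and normalize
# each channel to mean 0.5 and standard deviation 0.5 (assumed PyTorch/torchvision).
import torch
from torchvision import transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),              # resize the RGB frame to 224x224
    transforms.ToTensor(),                      # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5],  # per-channel mean 0.5
                         std=[0.5, 0.5, 0.5]),  # per-channel std 0.5
])

def preprocess_frame(image: Image.Image) -> torch.Tensor:
    """Return a normalized 3x224x224 tensor for one RGB frame."""
    return preprocess(image)
```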
3. The sign language word recognition module is used for recognizing sign language words expressed by the sign language actions;
3.1, the function of the graph convolution neural network module is to construct a joint graph based on the human joint data acquired by the data acquisition module, and to carry out graph convolution on all adjacent nodes so as to output probability distribution of sign language word category, wherein the adjacent nodes simultaneously comprise adjacent nodes in space dimension and adjacent nodes in time dimension.
In this embodiment, the algorithm flow chart of the graph convolution neural network module is shown in fig. 3, and the specific process is as follows:
11) First, a joint graph is constructed based on the human joint data acquired by the data acquisition module; the nodes of the joint graph correspond to the coordinate information of the human joint points, the edges of the joint graph correspond to the connections between the joint points, and the same node is connected across adjacent frames in the time dimension;
12) A graph convolution operation is then performed on the joint graph. Given a K×K convolution kernel and an input feature f_in with c channels, and considering only the spatial distance, the output f_out(x) of a single channel at spatial position x is:
f_out(x) = Σ_{v_j ∈ B(v_i)} f_in(p(x, v_j)) · w(v_j)
where w is a weight function whose weight vector is inner-producted with the input features, similar to the weights of a convolutional neural network, and p is a sampling function that traverses the nodes near position x. Let v_i be the node at position x, let d be the node distance function, and require the distance of a traversed node to be at most D; then the node set B(v_i) is:
B(v_i) = {v_j | d(v_i, v_j) ≤ D}
With the above formula, the features of a single joint graph can be obtained when only the spatial distance is considered. However, to extract the temporal dynamics of the joint sequence, the same nodes of successive frames must be connected together and the definition of the sampling function p must be extended. The original role of the sampling function p is to traverse the nodes adjacent to position x; to capture information in the time dimension, the adjacent nodes must include not only the neighbors in the spatial dimension but also the neighbors in the temporal dimension. Let v_ti be the node at position x at time t, and require the time interval of a traversed node to be at most T; then the node set B(v_ti) should contain nodes that are close in space as well as nodes that are close in time.
Thus, in the present invention, after considering both the spatial and temporal dimensions, the output f_out(x) of a single channel at spatial position x can be expressed as:
f_out(x) = Σ_{v ∈ B(v_ti)} f_in(v) · w(v)
wherein: f_in(v) denotes the value of input feature f_in at node v, and w(v) denotes the weight of node v; B(v_ti) is the set of nodes to be traversed, where:
B(v_ti) = {v_qj | d(v_ti, v_qj) ≤ D, |q-t| ≤ T}
wherein: d(v_ti, v_qj) denotes the distance between node v_ti and node v_qj; v_ti is the node at position x at time t, and v_qj is a node near position x within the convolution kernel range at time q; D is the distance threshold in the spatial dimension and T is the distance threshold in the temporal dimension. In this formula, {v_qj | d(v_ti, v_qj) ≤ D, |q-t| ≤ T} denotes all nodes v_qj whose spatial distance to node v_ti does not exceed D and whose time interval does not exceed T.
After the output of each channel is computed, the outputs of all channels are weighted and summed to obtain the features of the single joint graph.
13) The features extracted from the single joint graph are mapped into a probability distribution over sign language word classes through a fully connected layer and normalized by a normalization function.
The graph convolution neural network can effectively process joint data. The entire model may be trained end-to-end by back propagation.
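As a rough illustration of the spatio-temporal graph convolution described above, the sketch below assumes a PyTorch implementation in which the spatial neighbor set B(v) is encoded by a normalized adjacency matrix A_hat over the joints and the temporal neighborhood |q-t| ≤ T by a one-dimensional temporal convolution; the class name, A_hat and the kernel sizes are illustrative assumptions, not the patent's exact network.

```python
# A minimal spatio-temporal graph-convolution sketch (assumed PyTorch implementation).
import torch
import torch.nn as nn

class SpatioTemporalGraphConv(nn.Module):
    """Spatial graph convolution over joints followed by a temporal convolution."""

    def __init__(self, in_channels, out_channels, A_hat, temporal_kernel=9):
        super().__init__()
        # A_hat: (V, V) normalized adjacency encoding the spatial neighbor set B(v)
        self.register_buffer("A_hat", A_hat)
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)  # node weights w(v)
        pad = (temporal_kernel - 1) // 2
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(temporal_kernel, 1),
                                  padding=(pad, 0))  # temporal neighbors (|q - t| <= T)

    def forward(self, x):
        # x: (N, C, T, V) -- batch, channels, frames, joints
        x = self.spatial(x)                                  # per-node weighting
        x = torch.einsum("nctv,vw->nctw", x, self.A_hat)     # sum over spatial neighbors
        return self.temporal(x)                              # aggregate over temporal neighbors
```

In this sketch the per-node weighting w(v) is realized by the 1×1 convolution and the summation over B(v) by the multiplication with A_hat.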
3.2, the three-dimensional convolution neural network module is used for convolving the RGB image processed by the data processing module with a three-dimensional convolution kernel covering both the spatial and temporal dimensions, extracting the spatio-temporal features of the RGB image, and then outputting a probability distribution over sign language word classes.
In this embodiment, an algorithm flow chart of the three-dimensional convolutional neural network module is shown in fig. 4. The three-dimensional convolutional neural network module uses a multi-layer three-dimensional depth residual network. The three-dimensional depth residual network depth of this embodiment is 18. The specific process in each layer of residual error network is as follows:
21) The input I of the residual network has size C×T×H×W, where C is the number of channels, T is the length in the time dimension, and H and W are the height and width of the image, respectively; the network convolves the image with a three-dimensional convolution kernel w:
f(x, y, t) = Σ_c Σ_{δx, δy, δt} w_c(δx, δy, δt) · I_c(x + δx, y + δy, t + δt) + b
wherein: f(x, y, t) denotes the convolution result, x and y denote pixel coordinates in the image, t denotes the position in the image sequence, c ∈ C denotes the image channel, δx, δy and δt denote offsets in the image height, width and time dimensions, respectively, and b denotes the bias.
As the depth of a convolutional neural network increases, the network may become more difficult to train. To solve this problem, an identity mapping is added on top of the original convolutional neural network so that the network instead learns a residual function F(x) = H(x) - x. Thus the output of the current residual layer is H(I) = F(I) + I, where F(I) is the convolution result of the network input I; this output is simultaneously the input of the next residual layer. The three-dimensional convolutional neural network retains the information of the time dimension and propagates it through the network.
The depth residual network with the identity mapping is easier to optimize, and by pre-training on the ImageNet dataset, the network can converge faster and migrate learning can be achieved.
22) After the three-dimensional depth residual network extracts the spatio-temporal features of the RGB image, the features are mapped into a probability distribution over sign language word classes through a fully connected layer and normalized.
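A minimal sketch of one residual layer implementing H(I) = F(I) + I with three-dimensional convolutions is given below, assuming a PyTorch implementation; the channel count and kernel sizes are illustrative, and the 18-layer network described above would stack several such blocks with downsampling stages.

```python
# A minimal 3-D residual block sketch (assumed PyTorch implementation).
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Two 3x3x3 convolutions with an identity shortcut: H(I) = F(I) + I."""

    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (N, C, T, H, W) -- the convolution runs over T, H and W simultaneously
        return self.relu(self.f(x) + x)   # F(I) + I
```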
And 3.3, a fusion module is used for fusing the output results of the graph convolution neural network module and the three-dimensional convolution neural network module through weighted average to obtain a final sign language word prediction result.
In the fusion module, the specific fusion process is as follows:
and obtaining probability distribution vectors of the sign language word categories output by the graph convolution neural network module and probability distribution vectors of the sign language word categories output by the three-dimensional convolution neural network. In this embodiment, the two network output probability distribution vectors are vectors with a length of 500, and the values are between 0 and 1, which represents the probability of each sign language word. And the fusion module carries out weighted average calculation on probability values in the two vectors in one-to-one correspondence to obtain a final output probability distribution vector. The specific weighting weight can be adjusted according to the requirement, and the weights of the output vectors of the graph convolutional neural network module and the three-dimensional convolutional neural network are preferably 0.4 and 0.6 respectively.
4. And the continuous sign language identification module is used for identifying complete sentences expressed by continuous sign language actions. In this embodiment, a network structure diagram of the encoder-decoder module is shown in fig. 5.
And 4.1, the encoder module comprises a convolutional neural network and a cyclic neural network, and is used for respectively extracting the spatial features and the temporal features of continuous sign language and generating global semantic information of continuous sign language actions.
In this embodiment, the convolutional neural network in the encoder module is a depth residual network with a depth of 18, and the recurrent neural network is a long-short-term memory network. The depth residual error network is used for learning the spatial characteristics of the video, and the long-term and short-term memory network is used for modeling the time sequence information.
The long short-term memory network overcomes the vanishing-gradient problem by adding an internal cell state to the original recurrent neural network structure and using an input gate, a forget gate and an update gate. A long short-term memory cell takes the cell state c_{t-1} and hidden state h_{t-1} of the previous time step together with the input x_t at the current time step, and updates as follows:
i_t = σ(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi)
f_t = σ(W_if x_t + b_if + W_hf h_{t-1} + b_hf)
g_t = tanh(W_ig x_t + b_ig + W_hg h_{t-1} + b_hg)
o_t = σ(W_io x_t + b_io + W_ho h_{t-1} + b_ho)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
where h_t is the hidden state at the current time step, c_t is the cell state at the current time step, and i_t, f_t, g_t and o_t are the input gate, forget gate, update gate and output gate, respectively. W denotes a weight matrix, b a bias, σ and tanh are activation functions, and ⊙ denotes element-wise multiplication.
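To make the gate arithmetic above concrete, the following from-scratch sketch implements the same update equations, assuming a PyTorch implementation; in practice a library LSTM (e.g. nn.LSTM) would normally be used, and the class name is illustrative.

```python
# A from-scratch sketch of the LSTM update equations above (assumed PyTorch code).
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.w_x = nn.Linear(input_size, 4 * hidden_size)   # input weights and biases
        self.w_h = nn.Linear(hidden_size, 4 * hidden_size)  # recurrent weights and biases

    def forward(self, x_t, h_prev, c_prev):
        gates = self.w_x(x_t) + self.w_h(h_prev)
        i, f, g, o = gates.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c_t = f * c_prev + i * g            # c_t = f_t (*) c_{t-1} + i_t (*) g_t
        h_t = o * torch.tanh(c_t)           # h_t = o_t (*) tanh(c_t)
        return h_t, c_t
```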
The encoder receives an image sequence of variable length as input; it first extracts the spatial features of each frame with the depth residual network and then completes the semantic encoding of the continuous sign language video through the long short-term memory network. The long short-term memory network takes the spatial features extracted by the convolutional neural network as input and continuously updates its hidden state, which in effect contains the features of the time dimension. The hidden states of all time steps are averaged to obtain the encoded semantic vector c, which serves as an input of the decoder:
c = (1/T') Σ_{t=1}^{T'} h_t
where T' is the length of the input sequence and h_t is the hidden state of the recurrent neural network at time t.
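A minimal encoder sketch following this description, assuming a PyTorch implementation, is shown below; feature_dim and hidden_dim are illustrative, the per-frame spatial features are assumed to come from a separate depth residual network, and a bidirectional LSTM is used as in the preferred embodiment.

```python
# A minimal encoder sketch: BiLSTM over per-frame CNN features, averaged hidden states
# (assumed PyTorch code).
import torch
import torch.nn as nn

class SignLanguageEncoder(nn.Module):
    def __init__(self, feature_dim=512, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, frame_features):
        # frame_features: (N, T', feature_dim) spatial features of each frame
        hidden_states, (h_n, c_n) = self.lstm(frame_features)  # (N, T', 2*hidden_dim)
        c = hidden_states.mean(dim=1)   # semantic vector: average over all T' time steps
        return c, hidden_states, (h_n, c_n)
```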
4.2 the decoder module adopts a cyclic neural network, and predicts the output of the next moment by using the global semantic information encoded by the encoder, the output of the last moment and the hidden layer state of the cyclic neural network.
In the decoder module, the decoder consists of a long short-term memory network, a word embedding layer and a fully connected layer. The decoder takes the feature vector encoded by the encoder, the output of the previous time step and the previous hidden state as inputs, updates the hidden state, and predicts the currently output word from the updated hidden state and the previous output. The long short-term memory network of the decoder is initialized with the hidden state at the last time step of the encoder. The word embedding layer extracts a semantic vector w from the word y_{t-1} output at the previous time step, which is then concatenated with the feature encoded by the encoder, i.e. the semantic vector c, as the input x_t of the long short-term memory network. After the long short-term memory network is updated, the fully connected layer takes the current hidden state h_t as input, generates a probability distribution over the currently output word, and finally the word with the highest probability is selected as the output. This is expressed as:
w = WordEmbedding(y_{t-1})
x_t = [w, c]
h_t = φ(W_h x_t + U_h h_{t-1} + b_h)
y_t = φ(U_y h_t + b_y)
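A single decoding step corresponding to the equations above could be sketched as follows, assuming a PyTorch implementation; a unidirectional LSTMCell is used here purely to keep the sketch short, and the vocabulary and dimension sizes are illustrative.

```python
# A minimal single-step decoder sketch: embed previous word, concatenate with the
# encoder's semantic vector, update the LSTM state, predict the next word
# (assumed PyTorch code).
import torch
import torch.nn as nn

class SignLanguageDecoder(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=128, context_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # word embedding layer
        self.cell = nn.LSTMCell(embed_dim + context_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)            # fully connected layer

    def step(self, y_prev, c, state):
        # y_prev: (N,) previous output word indices; c: (N, context_dim) encoder vector
        w = self.embed(y_prev)               # w = WordEmbedding(y_{t-1})
        x_t = torch.cat([w, c], dim=-1)      # x_t = [w, c]
        h_t, c_t = self.cell(x_t, state)     # update hidden and cell state
        logits = self.fc(h_t)                # distribution over the current output word
        y_t = logits.argmax(dim=-1)          # pick the most probable word
        return y_t, (h_t, c_t)
```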
in both the encoder and decoder modules, the long and short term memory network is a two-way long and short term memory network. The one-way long-short-term memory network can only model the information of the current time step and the previous time step, and the input after the current time step does not have any contribution to the generation of the final output. In the continuous sign language recognition problem, since sign language video represents a semantic sentence having a grammatical structure, the information of the context should be fully utilized. The use of a two-way long and short term memory network enables better utilization of context information.
5. And the output display module is used for enabling a user to select a sign language word recognition mode or a continuous sign language recognition mode and displaying the output result of the sign language word recognition module or the continuous sign language recognition module according to the selection of the user.
In this embodiment, the "home" interface of the output display module is shown in fig. 6, and the physical form of the interface is a screen, such as a mobile phone screen or a PC screen. The user may select either the word recognition mode by screen sign language or the continuous sign language recognition mode. The schematic of the "sign language word recognition" interface is shown in fig. 7, and the schematic of the "continuous sign language recognition" interface is shown in fig. 8, all of which show the first five candidate results with the highest probability.
Therefore, the invention can convert sign language into text and facilitate communication between people with hearing impairment and hearing people.
The above embodiment is only a preferred embodiment of the present invention, but it is not intended to limit the present invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, all the technical schemes obtained by adopting the equivalent substitution or equivalent transformation are within the protection scope of the invention.

Claims (8)

1. A chinese sign language recognition system based on deep learning, comprising:
the data acquisition module is used for acquiring RGB images and human joint data when the human body makes sign language actions;
the data processing module is used for adjusting the size of the RGB image acquired by the data acquisition module and normalizing the pixel value;
the sign language word recognition module consists of a graph convolution neural network module, a three-dimensional convolution neural network module and a fusion module and is used for recognizing sign language words expressed by sign language actions; the graph convolution neural network module constructs a joint graph based on the human joint data acquired by the data acquisition module, and performs graph convolution on all adjacent nodes so as to output probability distribution of sign language word categories, wherein the adjacent nodes simultaneously comprise adjacent nodes in space dimension and adjacent nodes in time dimension; the three-dimensional convolution neural network module convolves the RGB image processed by the data processing module by utilizing a three-dimensional convolution kernel comprising a space dimension and a time dimension, extracts the spatio-temporal features of the RGB image, and further outputs the probability distribution of sign language word class; the fusion module is used for fusing the output results of the graph convolution neural network module and the three-dimensional convolution neural network module through weighted average to obtain a final sign language word prediction result;
the continuous sign language identification module consists of an encoder module and a decoder module and is used for identifying complete sentences expressed by continuous sign language actions; the encoder module comprises a convolutional neural network and a cyclic neural network, and is used for respectively extracting the spatial features and the temporal features of continuous sign language and generating global semantic information of continuous sign language actions; the decoder module adopts a cyclic neural network, and predicts the output of the next moment by using the global semantic information encoded by the encoder, the output of the last moment and the hidden layer state of the cyclic neural network;
the output display module is used for enabling a user to select a sign language word recognition mode or a continuous sign language recognition mode and displaying the output result of the sign language word recognition module or the continuous sign language recognition module according to the selection of the user;
in the graph convolution neural network module, the specific process is as follows:
11) firstly, constructing a joint graph based on the human joint data acquired by the data acquisition module, wherein the nodes of the joint graph correspond to the coordinate information of human joint points, the edges of the joint graph correspond to the connections between the joint points, and the same nodes are connected in the time dimension;
12) then performing a graph convolution operation on the joint graph; in the operation process, given a K×K convolution kernel and an input feature f_in with c channels, the output f_out(x) of a single channel at spatial position x is:
f_out(x) = Σ_{v ∈ B(v_ti)} f_in(v) · w(v)
wherein: f_in(v) denotes the value of input feature f_in at node v, and w(v) denotes the weight of node v; B(v_ti) is the set of nodes to be traversed, wherein:
B(v_ti) = {v_qj | d(v_ti, v_qj) ≤ D, |q-t| ≤ T}
wherein: d(v_ti, v_qj) denotes the distance between node v_ti and node v_qj; v_ti is the node at position x at time t, and v_qj is a node near position x within the convolution kernel range at time q; D is a distance threshold in the spatial dimension, and T is a distance threshold in the temporal dimension;
the outputs of all channels are weighted and summed to obtain the features of a single joint graph;
13) mapping the features extracted from the single joint graph into a probability distribution over sign language word classes through a fully connected layer, and normalizing them with a normalization function;
the three-dimensional convolutional neural network module uses a multi-layer three-dimensional depth residual network, wherein the specific process in each layer of the residual network is as follows:
21) the input I of the residual network has size C×T×H×W, where C is the number of channels, T is the length in the time dimension, and H and W are the height and width of the image, respectively; the network convolves the image with a three-dimensional convolution kernel w:
f(x, y, t) = Σ_c Σ_{δx, δy, δt} w_c(δx, δy, δt) · I_c(x + δx, y + δy, t + δt) + b
wherein: f(x, y, t) denotes the convolution result, x and y denote pixel coordinates in the image, t denotes the position in the image sequence, c denotes the image channel, δx, δy and δt denote offsets in the image height, width and time dimensions, respectively, and b denotes the bias;
an identity mapping is added on top of the network, converting it into the residual form H(I) = F(I) + I, wherein F(I) is the convolution result of the network input I, H(I) is the output of the current layer of the residual network, and H(I) simultaneously serves as the input of the next layer of the residual network;
22) after the three-dimensional depth residual network extracts the spatio-temporal features of the RGB image, the features are mapped into a probability distribution over sign language word classes through a fully connected layer and normalized.
2. The chinese sign language recognition system of claim 1, wherein said data acquisition module uses a Kinect depth camera to simultaneously acquire RGB images and 25 human joint data when a human body makes a sign language motion.
3. The deep learning-based chinese sign language recognition system of claim 1, wherein the data processing module adjusts the RGB image size collected by the data collecting module to 224 x 224 images, and normalizes the pixel values to satisfy a gaussian distribution with a mean and standard deviation of 0.5.
4. The chinese sign language recognition system based on deep learning of claim 1, wherein the specific fusion process in the fusion module is as follows:
acquiring probability distribution vectors of sign language word categories output by a graph convolution neural network module and probability distribution vectors of sign language word categories output by a three-dimensional convolution neural network, wherein the numerical values in the probability distribution vectors are all between 0 and 1, and represent the probability of each type of sign language word; and carrying out weighted average calculation on probability values in the two vectors in one-to-one correspondence to obtain a probability distribution vector which is finally output.
5. The chinese sign language recognition system based on deep learning of claim 1, wherein in said encoder module, the convolutional neural network is a multi-layer depth residual network, and the recurrent neural network is a long short-term memory network; the encoder receives an image sequence of variable length as input, first extracts the spatial features of each frame with the depth residual network, and then completes the semantic encoding of the continuous sign language video through the long short-term memory network; the long short-term memory network takes the spatial features extracted by the convolutional neural network as input and continuously updates its hidden state, which contains the features of the time dimension; the hidden states of all time steps are averaged to obtain the encoded semantic vector c, which serves as an input of the decoder:
c = (1/T') Σ_{t=1}^{T'} h_t
where T' is the length of the input sequence and h_t is the hidden state of the recurrent neural network at time t.
6. The deep learning based chinese sign language recognition system of claim 1, wherein the decoder comprises a long-short term memory network, a word embedding layer and a full connection layer, the long-short term memory network is initialized by using a hidden layer state at the last moment of the encoder; the word embedding layer extracts semantic vectors from words output at the previous moment, and then connects the semantic vectors with the characteristics encoded by the encoder to serve as input of a long-term and short-term memory network; after the long-period memory network is updated, the full-connection layer takes the hidden layer state at the current moment as input, generates probability distribution of the word output at present, and finally selects the word with the highest probability as output.
7. The deep learning based chinese sign language recognition system of claim 1, wherein in the encoder module and the decoder module, the long short-term memory networks are both bidirectional long short-term memory networks.
8. The deep learning-based chinese sign language recognition system of claim 1, wherein the output display module is capable of selecting a sign language recognition mode and outputting the first five candidate results with highest probability in real time.
CN202010699780.9A 2020-07-20 2020-07-20 Chinese sign language recognition system based on deep learning Active CN111723779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010699780.9A CN111723779B (en) 2020-07-20 2020-07-20 Chinese sign language recognition system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010699780.9A CN111723779B (en) 2020-07-20 2020-07-20 Chinese sign language recognition system based on deep learning

Publications (2)

Publication Number Publication Date
CN111723779A CN111723779A (en) 2020-09-29
CN111723779B true CN111723779B (en) 2023-05-02

Family

ID=72572899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010699780.9A Active CN111723779B (en) 2020-07-20 2020-07-20 Chinese sign language recognition system based on deep learning

Country Status (1)

Country Link
CN (1) CN111723779B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149618B (en) * 2020-10-14 2022-09-09 紫清智行科技(北京)有限公司 Pedestrian abnormal behavior detection method and device suitable for inspection vehicle
CN113780059A (en) * 2021-07-24 2021-12-10 上海大学 Continuous sign language identification method based on multiple feature points
CN113781876B (en) * 2021-08-05 2023-08-29 深兰科技(上海)有限公司 Conversion method and device for converting text into sign language action video
CN113792607B (en) * 2021-08-19 2024-01-05 辽宁科技大学 Neural network sign language classification and identification method based on Transformer

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10304208B1 (en) * 2018-02-12 2019-05-28 Avodah Labs, Inc. Automated gesture identification using neural networks
CN110084296A (en) * 2019-04-22 2019-08-02 中山大学 A kind of figure expression learning framework and its multi-tag classification method based on certain semantic
CN110399850A (en) * 2019-07-30 2019-11-01 西安工业大学 A kind of continuous sign language recognition method based on deep neural network
CN110796110A (en) * 2019-11-05 2020-02-14 西安电子科技大学 Human behavior identification method and system based on graph convolution network
CN111259804A (en) * 2020-01-16 2020-06-09 合肥工业大学 Multi-mode fusion sign language recognition system and method based on graph convolution
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11263409B2 (en) * 2017-11-03 2022-03-01 Board Of Trustees Of Michigan State University System and apparatus for non-intrusive word and sentence level sign language translation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10304208B1 (en) * 2018-02-12 2019-05-28 Avodah Labs, Inc. Automated gesture identification using neural networks
CN110084296A (en) * 2019-04-22 2019-08-02 中山大学 A kind of figure expression learning framework and its multi-tag classification method based on certain semantic
CN110399850A (en) * 2019-07-30 2019-11-01 西安工业大学 A kind of continuous sign language recognition method based on deep neural network
CN110796110A (en) * 2019-11-05 2020-02-14 西安电子科技大学 Human behavior identification method and system based on graph convolution network
CN111259804A (en) * 2020-01-16 2020-06-09 合肥工业大学 Multi-mode fusion sign language recognition system and method based on graph convolution
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Y. Liao et al. Dynamic Sign Language Recognition Based on Video Sequence With BLSTM-3D Residual Networks. IEEE Access, 2019, 7: 38044-38054. *
李为斌; 刘佳. An Overview of Vision-Based Dynamic Gesture Recognition (基于视觉的动态手势识别概述). 计算机应用与软件 (Computer Applications and Software), 2020, (03), full text. *

Also Published As

Publication number Publication date
CN111723779A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN111723779B (en) Chinese sign language recognition system based on deep learning
CN112949622B (en) Bimodal character classification method and device for fusing text and image
CN113297955B (en) Sign language word recognition method based on multi-mode hierarchical information fusion
Sharma et al. Vision-based sign language recognition system: A Comprehensive Review
CN105373810B (en) Method and system for establishing motion recognition model
CN113792177A (en) Scene character visual question-answering method based on knowledge-guided deep attention network
Balasuriya et al. Learning platform for visually impaired children through artificial intelligence and computer vision
CN112905762A (en) Visual question-answering method based on equal attention-deficit-diagram network
CN106548194A (en) The construction method and localization method of two dimensional image human joint pointses location model
Krishnaraj et al. A Glove based approach to recognize Indian Sign Languages
CN113780059A (en) Continuous sign language identification method based on multiple feature points
Shinde et al. Sign language to text and vice versa recognition using computer vision in Marathi
CN112906520A (en) Gesture coding-based action recognition method and device
CN113240714B (en) Human motion intention prediction method based on context awareness network
CN112738555B (en) Video processing method and device
CN117218725A (en) Real-time sign language recognition and translation system and method based on edge equipment
CN111178141B (en) LSTM human body behavior identification method based on attention mechanism
CN114663910A (en) Multi-mode learning state analysis system
Ganpatye et al. Motion Based Indian Sign Language Recognition using Deep Learning
CN112633224A (en) Social relationship identification method and device, electronic equipment and storage medium
Jadhav et al. GoogLeNet application towards gesture recognition for ASL character identification
Mishra et al. Environment descriptor for the visually impaired
CN117576279B (en) Digital person driving method and system based on multi-mode data
CN113838218B (en) Speech driving virtual human gesture synthesis method for sensing environment
CN113111721B (en) Human behavior intelligent identification method based on multi-unmanned aerial vehicle visual angle image data driving

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant