CN111797777B - Sign language recognition system and method based on space-time semantic features - Google Patents


Info

Publication number
CN111797777B
CN111797777B (Application CN202010648991.XA)
Authority
CN
China
Prior art keywords
module
sign language
video
convolution
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010648991.XA
Other languages
Chinese (zh)
Other versions
CN111797777A (en)
Inventor
殷亚凤
甘世维
谢磊
陆桑璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202010648991.XA priority Critical patent/CN111797777B/en
Publication of CN111797777A publication Critical patent/CN111797777A/en
Application granted granted Critical
Publication of CN111797777B publication Critical patent/CN111797777B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/32 Normalisation of the pattern dimensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language

Abstract

The invention discloses a sign language recognition system and method based on space-time semantic features. The system comprises a video data acquisition module, a video data preprocessing module and a sign language recognition module. The video data acquisition module is used for acquiring sign language video data; the video data preprocessing module is used for preprocessing the video data; and the sign language recognition module is used for carrying out sign language recognition and outputting a prediction result. The invention bridges the semantic gap between sign language image data and text data, enables convenient communication with deaf-mute people, and uses a state-of-the-art neural network as the translation tool to ensure translation accuracy. In addition, the method can serve as a medium for human-computer interaction, controlling intelligent devices by analyzing continuous gestures of a user.

Description

Sign language recognition system and method based on space-time semantic features
Technical Field
The invention belongs to the technical field of sign language identification, and particularly relates to a sign language identification system and method based on space-time semantic features.
Background
According to World Health Organization (WHO) data, about 466 million people worldwide had hearing impairment in 2018, and the number is expected to exceed 900 million by 2050. People with hearing impairment and deaf-mute people use sign language as their communication medium; however, few ordinary people master sign language, so communication barriers exist between deaf-mute people and the general public.
Current solutions mainly include manual translation, sign language translation based on gloves of different colors, and translation based on smart watches. Manual translation is feasible only on certain specific and formal occasions, is difficult to scale to the general public, and it is impractical to expect most people to take sign language courses. Glove-based and smart-watch-based translation introduce an extra equipment burden, requiring the signer to wear dedicated devices, which greatly degrades the communication experience of deaf-mute people.
In view of this, the vision-based method provided by the invention can obtain sign language video through an ordinary surveillance camera, an external camera, the built-in camera of an intelligent device (such as a smartphone or smart glasses) or other camera modules, without requiring any custom equipment from the signer. In addition, the method can be applied to novel human-computer interaction scenarios, where instruction control of intelligent devices is completed through analysis of the user's gestures.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a sign language recognition system and method based on space-time semantic features, so as to solve the communication barrier between deaf-mute people and the general public. The invention facilitates communication between deaf-mute people and ordinary people, so that ordinary people can understand the meaning expressed in sign language without learning sign language.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the invention discloses a sign language recognition system based on space-time semantic features, which comprises: the system comprises a video data acquisition module, a video data preprocessing module and a sign language recognition module; wherein, the liquid crystal display device comprises a liquid crystal display device,
the video data acquisition module is used for acquiring sign language video data; data are collected through an ordinary surveillance camera, an external camera, the built-in camera of an intelligent device (such as a smartphone or smart glasses) or other camera modules;
the video preprocessing module comprises: the device comprises a video frame size adjusting module, a data normalization module and a video framing module;
the video frame size adjusting module is used for scaling the size of the acquired sign language video frame to a uniform size;
the data normalization module normalizes the pixel value of the video frame with the adjusted size from 0-255 to 0-1;
the video framing module divides sign language video data into video clips (clips) with fixed frame numbers;
the sign language identification module comprises: the device comprises a space-time feature extraction module, a semantic feature mining module and a decoding module;
the space-time feature extraction module is used for extracting a 4096-dimensional space-time feature from each video clip;
the semantic feature mining module is used for acquiring semantic association features in the space-time features;
and the decoding module is used for carrying out sign language identification according to the semantic association characteristics and outputting a sign language identification result to a user.
Further, the video frame resizing module comprises a center cropping module and a resizing module; the center cropping module crops the redundant blank regions of the video frames, and the resizing module adjusts the cropped images to a uniform size using the resize (reshape) function of the open-source computer vision library (OpenCV).
Further, the data normalization module normalizes all video frame pixel values to 0-1 by dividing by 255.
Further, the video framing module uses a sliding window algorithm to divide the video into fixed frame number video segments.
Further, the window size of the sliding window algorithm is 16, and the step size is 8.
Further, the spatio-temporal feature extraction module adopts a modified pseudo-3D residual network (P3D) model to extract each video clip as a 4096-dimensional spatio-temporal feature; specifically, the modified P3D model adopts a residual network (ResNet-50) as the basic framework, with its residual structures replaced by the following modules;
the space-time feature extraction module comprises: a residual module, a P3D-A module, a P3D-B module, a P3D-C module, and a series of 3D convolution modules, pooling modules and fully connected modules; wherein,
residual module: a depthwise convolution F_d (3×3 kernel) and two pointwise convolutions F_p (1×1 kernel) are used to implement the function of a two-dimensional convolution; the formula is as follows:
y = F_p(F_d(F_p(x))) + x
P3D-A module: a 2D spatial convolution F_s (1×3×3 kernel), a 1D temporal convolution F_t (3×1×1 kernel) and two pointwise convolutions F_p (1×1×1 kernel) are used instead of a 3D convolution; the formula is as follows:
y = F_p(F_t(F_s(F_p(x)))) + x
P3D-B module: composed of a 2D spatial convolution F_s (1×3×3 kernel), a 1D temporal convolution F_t (3×1×1 kernel) and two pointwise convolutions F_p (1×1×1 kernel); P3D-B arranges the 2D spatial convolution F_s and the 1D temporal convolution F_t in parallel, giving:
y = F_p(F_t(F_p(x)) + F_s(F_p(x))) + x
P3D-C module: composed of a 2D spatial convolution F_s (1×3×3 kernel), a 1D temporal convolution F_t (3×1×1 kernel) and two pointwise convolutions F_p (1×1×1 kernel); P3D-C processes the 2D spatial convolution F_s and the 1D temporal convolution F_t in series, similarly to the residual module, but additionally takes the output of F_s as a branch and adds it to the output of the 1D temporal convolution F_t as the input of the final pointwise convolution F_p; the formula is as follows:
y = F_p(F_s(F_p(x)) + F_t(F_s(F_p(x)))) + x;
3D convolution module: extracts motion information between successive frames; for an input feature map X, a 3D convolution with kernel K outputs a feature map Y,
where C, T, H and W denote, respectively, the number of channels, the sequence length, the height and the width of the input X;
pooling module: computes an output for the elements within a fixed-shape window (also called a pooling window) of the input data; the pooling layer that takes the maximum of the elements within the pooling window is called max pooling (Max Pool), and the pooling layer that takes the average of the elements within the pooling window is called average pooling (Avg Pool);
fully connected module: used for outputting the final feature vector; the formula is as follows:
Y=WX+b
wherein X is an input vector, Y is an output vector, W is a weight coefficient, and b is a bias value.
Further, the semantic feature mining module acquires the semantic association features in the space-time features using a three-layer bidirectional long short-term memory network (BiLSTM) with memory cells; it specifically comprises:
bidirectional long short-term memory (BiLSTM) module: consists of a forward LSTM and a backward LSTM stacked together, and contains an input gate i_t, an output gate o_t, a forget gate f_t and a cell state c_t; the mathematical formalization of the LSTM is expressed as follows:
i_t := sigm(W_xi x_t + W_hi h_{t-1})
f_t := sigm(W_xf x_t + W_hf h_{t-1})
o_t := sigm(W_xo x_t + W_ho h_{t-1})
c̃_t := tanh(W_xc x_t + W_hc h_{t-1})
c_t := f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t := o_t ⊙ tanh(c_t)
where ⊙ denotes element-wise multiplication, sigm is the sigmoid function, tanh is the hyperbolic tangent function, x_t is the input data, h_{t-1} is the hidden node of the previous step, c̃_t is the cell state update value, and W_x and W_h are the corresponding weights;
memory cell module: connects two adjacent LSTM layers end to end so that context information can be better mined; specifically, the hidden-state output of the upper layer is passed through the memory cell and used as the hidden-state input of the next layer; the memory cell consists of a hyperbolic tangent function (tanh) and a fully connected layer (fc), expressed as h_in^(l+1) = tanh(fc(h_out^(l)));
attention mechanism module: lets the decoding module perceive the weight of the input sequence on the current predicted value; the formula is expressed as follows:
a_t = softmax(v · tanh(atten(H_{t-1}, X)))
where a_t is the attention vector representing the weight of the input sequence on the current word prediction, v is a weight coefficient, and atten is a fully connected layer.
Further, the decoding module consists of a single-layer unidirectional long short-term memory network plus a prediction module, and predicts from the semantic association features acquired by the semantic feature mining module; it comprises:
LSTM module: decodes the hidden state of the previous unit through the long short-term memory network, and computes the current unit state from the previously output word vector;
prediction module: predicts the current output according to the current unit state, and displays the predicted sign language recognition result to the user in the form of text information; it consists of a fully connected layer and a normalized exponential function (softmax), expressed as p(y_t | z_t) = softmax(W z_t),
where exp denotes the exponential function used inside the softmax and W is a weight coefficient.
The invention discloses a sign language identification method based on space-time semantic features, which comprises the following steps:
1) Acquiring sign language video data;
2) Performing center cropping on each frame of the sign language video data and adjusting the video frames to a uniform size;
3) Normalizing each frame of the uniformly sized sign language video data;
4) Framing the normalized video data: the video sequence is divided into a series of video clips with a window size of 16 frames and a step size of 8;
5) Extracting each video clip into a 4096-dimensional spatio-temporal feature vector through the spatio-temporal feature extraction module;
6) Performing semantic mining on the obtained 4096-dimensional spatio-temporal feature vectors to obtain the semantic association features between video clips;
7) Carrying out sign language recognition through the decoding module according to the semantic association features, and outputting the sign language recognition result to the user.
The invention has the beneficial effects that:
the invention can collect sign language video through the common camera, and can facilitate communication between the deaf and dumb person and the ordinary person without the need of custom equipment for sign language expressive person, so that the ordinary person can understand the meaning expressed by sign language without learning sign language; the hardware burden in user communication is reduced, and more natural intelligent device communication physical examination is brought to the user.
In modeling sign language or gestures, the proposed method extracts multiple implicit features related to hand actions, hand motion trajectories and facial expressions, such as gesture shapes and the motion trajectories during gesture transitions, so that sign language is perceived from multiple dimensions and a more accurate recognition effect is achieved.
Drawings
FIG. 1 is a schematic diagram of a system module according to the present invention.
FIG. 2 is a schematic diagram of a sign language recognition model according to the present invention.
FIG. 3 is a schematic diagram of the space-time feature model of the present invention.
FIG. 4 is a schematic diagram showing the operation of the memory cell according to the present invention.
Detailed Description
The invention will be further described below with reference to the embodiments and the accompanying drawings, which are not intended to limit the scope of the invention.
Referring to fig. 1, a sign language recognition system based on spatiotemporal semantic features of the present invention includes: a video data acquisition module, a video data preprocessing module and a sign language recognition module; wherein,
the video data acquisition module is used for acquiring sign language video data through an ordinary surveillance camera, an external camera, the built-in camera of an intelligent device (such as a smartphone or smart glasses) or other camera modules;
the video preprocessing module comprises: the device comprises a video frame size adjusting module, a data normalization module and a video framing module;
the video frame size adjusting module is used for scaling the size of the acquired sign language video frame to a uniform size; it comprises the following steps: the device comprises a central clipping module and a size adjusting module, wherein the central clipping module is used for clipping redundant blank places of video frames, and the size adjusting module adjusts the clipped images to a uniform size by adopting a size adjusting (reshape) function of an open source computer vision library (opencv). Considering that the recorded video content is mainly in the central area, the central cropping is used to crop out the side ineffective parts of the image, and then the image is uniformly reduced to the size of 256,256,3 by using the resizing (reshape) function of the open source computer vision library (opencv).
The data normalization module normalizes the pixel values of the resized video frames from the range 0-255 to the range 0-1. Since non-normalized data cannot be used directly for prediction, maximum normalization is adopted: all frame pixel values are divided by 255 to obtain the normalized data.
the video framing module is used for dividing sign language video data into video clips (clips) with fixed frame numbers by using a sliding window algorithm; considering that the lengths of recorded videos may be inconsistent, a framing process needs to be performed on the video, the video data is divided into a series of video frames with a length of 16 frames and a step length of 8, and the video framing module uses a sliding window algorithm to divide the video into video clips (clips) with fixed frames.
Referring to fig. 2, the sign language recognition module includes: the device comprises a space-time feature extraction module, a semantic feature mining module and a decoding module;
the space-time feature extraction module is used for extracting a 4096-dimensional space-time feature from each video clip;
the semantic feature mining module is used for acquiring semantic association features in the space-time features;
and the decoding module is used for identifying sign language according to the semantic association characteristics and outputting corresponding text information to a user.
Referring to fig. 3, the spatio-temporal feature extraction module adopts a modified pseudo-3D residual network (P3D) model to extract each video clip as a 4096-dimensional spatio-temporal feature; specifically, the modified P3D model adopts a residual network (ResNet-50) as the basic framework, with its residual structures replaced by the following modules;
the space-time feature extraction module comprises: a residual module, a P3D-A module, a P3D-B module, a P3D-C module, and a series of 3D convolution modules, pooling modules and fully connected modules; wherein,
residual module: a depthwise convolution F_d (3×3 kernel) and two pointwise convolutions F_p (1×1 kernel) are used to implement the function of a two-dimensional convolution; the formula is as follows:
y = F_p(F_d(F_p(x))) + x
P3D-A module: a 2D spatial convolution F_s (1×3×3 kernel), a 1D temporal convolution F_t (3×1×1 kernel) and two pointwise convolutions F_p (1×1×1 kernel) are used instead of a 3D convolution; the formula is as follows:
y = F_p(F_t(F_s(F_p(x)))) + x
P3D-B module: composed of a 2D spatial convolution F_s (1×3×3 kernel), a 1D temporal convolution F_t (3×1×1 kernel) and two pointwise convolutions F_p (1×1×1 kernel); P3D-B arranges the 2D spatial convolution F_s and the 1D temporal convolution F_t in parallel, giving:
y = F_p(F_t(F_p(x)) + F_s(F_p(x))) + x
P3D-C module: composed of a 2D spatial convolution F_s (1×3×3 kernel), a 1D temporal convolution F_t (3×1×1 kernel) and two pointwise convolutions F_p (1×1×1 kernel); P3D-C processes the 2D spatial convolution F_s and the 1D temporal convolution F_t in series, similarly to the residual module, but additionally takes the output of F_s as a branch and adds it to the output of the 1D temporal convolution F_t as the input of the final pointwise convolution F_p; the formula is as follows:
y = F_p(F_s(F_p(x)) + F_t(F_s(F_p(x)))) + x;
3D convolution module: extracts motion information between successive frames; for an input feature map X, a 3D convolution with kernel K outputs a feature map Y,
where C, T, H and W denote, respectively, the number of channels, the sequence length, the height and the width of the input X;
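With K, C, T, H and W defined as above, a standard form of such a 3D convolution can be written as follows; the exact indexing convention is an assumption of this description rather than a quotation of the original formula:

Y(t, h, w) = \sum_{c=1}^{C} \sum_{i,j,k} K(c, i, j, k) \, X(c,\, t+i,\, h+j,\, w+k)

That is, each output value aggregates information over all input channels and over a local spatio-temporal neighbourhood of the clip.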
pooling module: the pooling module reduces information redundancy, improves the scale invariance and rotation invariance of the model, and prevents overfitting. Each time, it computes an output for the elements within a fixed-shape window (also called a pooling window) of the input data; the pooling layer that takes the maximum of the elements within the pooling window is called max pooling (Max Pool), and the pooling layer that takes the average of the elements within the pooling window is called average pooling (Avg Pool);
fully connected module: used for outputting the final feature vector; the formula is as follows:
Y=WX+b
wherein X is an input vector, Y is an output vector, W is a weight coefficient, and b is a bias value.
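To make the factorized residual structure above concrete, a minimal PyTorch sketch of a P3D-A style block is given below (pointwise F_p, spatial 1×3×3 convolution F_s, temporal 3×1×1 convolution F_t, pointwise F_p, plus the shortcut connection). The channel sizes, the omission of batch normalization and the example input shape are assumptions of this sketch rather than the exact configuration of the modified P3D model.

import torch
import torch.nn as nn

class P3DABlock(nn.Module):
    """P3D-A style residual block: F_p -> F_s (1x3x3) -> F_t (3x1x1) -> F_p, plus shortcut."""
    def __init__(self, channels, bottleneck):
        super().__init__()
        self.reduce = nn.Conv3d(channels, bottleneck, kernel_size=1)             # pointwise F_p
        self.spatial = nn.Conv3d(bottleneck, bottleneck, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1))                               # 2D spatial F_s
        self.temporal = nn.Conv3d(bottleneck, bottleneck, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))                              # 1D temporal F_t
        self.expand = nn.Conv3d(bottleneck, channels, kernel_size=1)              # pointwise F_p
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (batch, C, T, H, W)
        y = self.relu(self.reduce(x))
        y = self.relu(self.spatial(y))
        y = self.relu(self.temporal(y))
        y = self.expand(y)
        return self.relu(y + x)                 # y = F_p(F_t(F_s(F_p(x)))) + x

# feature map for a 16-frame clip: (batch, channels, T, H, W)
clip = torch.randn(1, 64, 16, 56, 56)
print(P3DABlock(64, 16)(clip).shape)            # torch.Size([1, 64, 16, 56, 56])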
Referring to fig. 4, the semantic feature mining module acquires the semantic association features in the space-time features using a three-layer bidirectional long short-term memory network (BiLSTM) with memory cells; it specifically comprises:
bidirectional long short-term memory (BiLSTM) module: consists of a forward LSTM and a backward LSTM stacked together, and contains an input gate i_t, an output gate o_t, a forget gate f_t and a cell state c_t; the mathematical formalization of the LSTM is expressed as follows:
i_t := sigm(W_xi x_t + W_hi h_{t-1})
f_t := sigm(W_xf x_t + W_hf h_{t-1})
o_t := sigm(W_xo x_t + W_ho h_{t-1})
c̃_t := tanh(W_xc x_t + W_hc h_{t-1})
c_t := f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t := o_t ⊙ tanh(c_t)
where ⊙ denotes element-wise multiplication, sigm is the sigmoid function, tanh is the hyperbolic tangent function, x_t is the input data, h_{t-1} is the hidden node of the previous step, c̃_t is the cell state update value, and W_x and W_h are the corresponding weights;
memory cell module: connects two adjacent LSTM layers end to end so that context information can be better mined; specifically, the hidden-state output of the upper layer is passed through the memory cell and used as the hidden-state input of the next layer; the memory cell consists of a hyperbolic tangent function (tanh) and a fully connected layer (fc), expressed as h_in^(l+1) = tanh(fc(h_out^(l)));
attention mechanism module: lets the decoding module perceive the weight of the input sequence on the current predicted value; the formula is expressed as follows:
a_t = softmax(v · tanh(atten(H_{t-1}, X)))
where a_t is the attention vector representing the weight of the input sequence on the current word prediction, v is a weight coefficient, and atten is a fully connected layer.
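The semantic mining stage described above (three BiLSTM layers joined by tanh+fc memory cells, followed by an additive attention over the layer outputs) can be sketched in PyTorch as follows; the hidden size, the placement of a memory cell after every layer and the exact attention wiring are assumptions of this sketch, not the exact configuration of the invention.

import torch
import torch.nn as nn

class SemanticMiner(nn.Module):
    """Three BiLSTM layers joined by memory cells (tanh + fully connected layer), with attention."""
    def __init__(self, feat_dim=4096, hidden=512, layers=3):
        super().__init__()
        self.lstms, self.memory_cells = nn.ModuleList(), nn.ModuleList()
        dim = feat_dim
        for _ in range(layers):
            self.lstms.append(nn.LSTM(dim, hidden, bidirectional=True, batch_first=True))
            self.memory_cells.append(nn.Linear(2 * hidden, 2 * hidden))  # fc inside the memory cell
            dim = 2 * hidden
        self.attn = nn.Linear(2 * hidden, 2 * hidden)    # atten(.)
        self.v = nn.Linear(2 * hidden, 1, bias=False)    # weight vector v

    def forward(self, x):                                # x: (batch, num_clips, 4096)
        for lstm, cell in zip(self.lstms, self.memory_cells):
            x, _ = lstm(x)
            x = torch.tanh(cell(x))                      # memory cell: h_next = tanh(fc(h_prev))
        scores = self.v(torch.tanh(self.attn(x)))        # a = softmax(v * tanh(atten(H)))
        weights = torch.softmax(scores, dim=1)
        return x, weights

feats = torch.randn(2, 10, 4096)                         # 10 clips of 4096-d spatio-temporal features
ctx, attn = SemanticMiner()(feats)
print(ctx.shape, attn.shape)                             # torch.Size([2, 10, 1024]) torch.Size([2, 10, 1])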
The decoding module consists of a single-layer unidirectional long short-term memory network plus a prediction module, and predicts from the semantic association features acquired by the semantic feature mining module; it comprises:
LSTM module: decodes the hidden state of the previous unit through the long short-term memory network, and computes the current unit state from the previously output word vector;
prediction module: predicts the current output according to the current unit state, and displays the predicted sign language recognition result to the user in the form of text information; it consists of a fully connected layer and a normalized exponential function (softmax), expressed as p(y_t | z_t) = softmax(W z_t),
where exp denotes the exponential function used inside the softmax, W is a weight coefficient, and the text corresponding to the maximum value of p(y_t | z_t) is the output result.
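Correspondingly, a hedged sketch of the decoding stage is given below: a single-layer unidirectional LSTM whose state is projected through a fully connected layer and a softmax to give p(y_t | z_t), after which the word with the maximum probability is taken as the output. The vocabulary size and the input dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Single-layer unidirectional LSTM plus a fully connected + softmax prediction head."""
    def __init__(self, ctx_dim=1024, hidden=512, vocab_size=1000):
        super().__init__()
        self.lstm = nn.LSTM(ctx_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab_size)         # W in p(y_t|z_t) = softmax(W z_t)

    def forward(self, context):                                 # context: (batch, steps, ctx_dim)
        z, _ = self.lstm(context)
        logits = self.classifier(z)
        return torch.softmax(logits, dim=-1)                    # per-step word probabilities

probs = Decoder()(torch.randn(2, 10, 1024))
prediction = probs.argmax(dim=-1)                               # word with the maximum probability
print(prediction.shape)                                         # torch.Size([2, 10])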
The invention discloses a sign language identification method based on space-time semantic features, which comprises the following steps:
1) Acquiring sign language video data;
2) Performing center cropping on each frame of the sign language video data and adjusting the video frames to a uniform size;
3) Normalizing each frame of the uniformly sized sign language video data;
4) Framing the normalized video data: the video sequence is divided into a series of video clips with a window size of 16 frames and a step size of 8;
5) Extracting each video clip into a 4096-dimensional spatio-temporal feature vector through the spatio-temporal feature extraction module;
6) Performing semantic mining on the obtained 4096-dimensional spatio-temporal feature vectors to obtain the semantic association features between video clips;
7) Carrying out sign language prediction through the decoding module according to the semantic association features, and outputting the corresponding text information to the user (an illustrative end-to-end sketch follows this list).
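Putting steps 1)-7) together, an end-to-end sketch is given below; it reuses the illustrative helpers from the earlier sketches (preprocess_frames, sliding_window_clips, a SemanticMiner-style miner and the Decoder) and treats the spatio-temporal extractor and the vocabulary as placeholders supplied by the caller, so it is an outline under those assumptions rather than the exact implementation of the invention.

import torch

def recognize_sign_language(frames, extractor, miner, decoder, vocab):
    """Illustrative pipeline: frames -> clips -> 4096-d features -> semantics -> text."""
    frames = preprocess_frames(frames)                        # steps 2)-3): crop, resize, normalize
    clips = sliding_window_clips(frames, window=16, step=8)   # step 4): fixed-length clips
    clips = torch.from_numpy(clips).permute(0, 4, 1, 2, 3)    # (num_clips, 3, 16, 256, 256)
    with torch.no_grad():
        feats = extractor(clips)                              # step 5): (num_clips, 4096) features
        ctx, _ = miner(feats.unsqueeze(0))                    # step 6): semantic association features
        probs = decoder(ctx)                                  # step 7): per-step word probabilities
    return [vocab[i] for i in probs.argmax(dim=-1).squeeze(0).tolist()]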
It should be noted that in practice the neural network model may vary, for example convolution layers may be added to or removed from individual modules, convolution-layer parameters may be modified, or the preprocessing procedure may be changed; as long as sign language recognition or gesture recognition is performed based on a neural network, such implementations or modifications should not be considered as going beyond the scope of the present invention.
The present invention has been described in terms of the preferred embodiments thereof, and it should be understood by those skilled in the art that various modifications can be made without departing from the principles of the invention, and such modifications should also be considered as being within the scope of the invention.

Claims (7)

1. A sign language recognition system based on spatiotemporal semantic features, comprising: the system comprises a video data acquisition module, a video data preprocessing module and a sign language recognition module;
the video data acquisition module is used for acquiring sign language video data;
the video preprocessing module comprises: the device comprises a video frame size adjusting module, a data normalization module and a video framing module;
the video frame size adjusting module is used for scaling the size of the acquired sign language video frame to a uniform size;
the data normalization module normalizes pixel values in the video frame after the size adjustment from 0-255 to a range of 0-1;
the video framing module divides sign language video data into video clips with fixed frame numbers;
the sign language identification module comprises: the device comprises a space-time feature extraction module, a semantic feature mining module and a decoding module;
the space-time feature extraction module is used for extracting a 4096-dimensional space-time feature from each video clip;
the semantic feature mining module is used for acquiring semantic association features in the space-time features;
the decoding module is used for carrying out sign language identification according to the semantic association characteristics and outputting a sign language identification result to a user;
the space-time feature extraction module adopts a modified pseudo-3D residual network model to extract each video clip as a 4096-dimensional space-time feature; specifically, the modified pseudo-3D residual network model adopts a residual network as the basic framework, with its residual structures replaced by the following modules;
the space-time feature extraction module comprises: a residual module, a P3D-A module, a P3D-B module, a P3D-C module, and a series of 3D convolution modules, pooling modules and fully connected modules;
residual module: a depthwise convolution F_d and two pointwise convolutions F_p are used to implement the function of a two-dimensional convolution; the formula is expressed as follows:
y = F_p(F_d(F_p(x))) + x;
P3D-A module: a 2D spatial convolution F_s, a 1D temporal convolution F_t and two pointwise convolutions F_p are used instead of a 3D convolution; the formula is expressed as follows:
y = F_p(F_t(F_s(F_p(x)))) + x;
P3D-B module: composed of a 2D spatial convolution F_s, a 1D temporal convolution F_t and two pointwise convolutions F_p; P3D-B arranges the 2D spatial convolution F_s and the 1D temporal convolution F_t in parallel, giving:
y = F_p(F_t(F_p(x)) + F_s(F_p(x))) + x;
P3D-C module: composed of a 2D spatial convolution F_s, a 1D temporal convolution F_t and two pointwise convolutions F_p; P3D-C processes the 2D spatial convolution F_s and the 1D temporal convolution F_t in series, similarly to the residual module, but additionally takes the output of F_s as a branch and adds it to the output of the 1D temporal convolution F_t as the input of the final pointwise convolution F_p; the formula is expressed as follows:
y = F_p(F_s(F_p(x)) + F_t(F_s(F_p(x)))) + x;
3D convolution module: extracts motion information between successive frames; for an input feature map X, a 3D convolution with kernel K outputs a feature map Y,
where C, T, H and W denote, respectively, the number of channels, the sequence length, the height and the width of the input X;
pooling module: calculates an output for the elements within a fixed-shape window of the input data; the pooling layer that takes the maximum of the elements within the pooling window is called max pooling, and the pooling layer that takes the average of the elements within the pooling window is called average pooling;
fully connected module: used for outputting the final feature vector; the formula is as follows:
Y=WX+b
wherein X is an input vector, Y is an output vector, W is a weight coefficient, and b is a bias value;
the semantic feature mining module acquires the semantic association features in the space-time features using a three-layer bidirectional long short-term memory network with memory cells; it specifically comprises:
bidirectional long short-term memory network module: consists of a forward LSTM and a backward LSTM stacked together, and contains an input gate i_t, an output gate o_t, a forget gate f_t and a cell state c_t; the mathematical formalization of the LSTM is expressed as follows:
i_t := sigm(W_xi x_t + W_hi h_{t-1})
f_t := sigm(W_xf x_t + W_hf h_{t-1})
o_t := sigm(W_xo x_t + W_ho h_{t-1})
c̃_t := tanh(W_xc x_t + W_hc h_{t-1})
c_t := f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t := o_t ⊙ tanh(c_t)
where ⊙ denotes element-wise multiplication, sigm is the sigmoid function, tanh is the hyperbolic tangent function, x_t is the input data, h_{t-1} is the hidden node of the previous step, c̃_t is the cell state update value, and W_x and W_h are the corresponding weights;
memory cell module: connects two adjacent LSTM layers end to end; the hidden-state output of the upper layer is passed through the memory cell and used as the hidden-state input of the next layer; the memory cell consists of a hyperbolic tangent function tanh and a fully connected layer fc, expressed as h_in^(l+1) = tanh(fc(h_out^(l)));
attention mechanism module: lets the decoding module perceive the weight of the input sequence on the current predicted value.
2. The spatiotemporal semantic feature based sign language recognition system of claim 1, wherein the video frame resizing module comprises: a center cropping module and a resizing module, wherein the center cropping module is used for cropping the redundant blank regions of the video frames, and the resizing module adjusts the cropped images to a uniform size using the resize function of an open-source computer vision library.
3. The spatiotemporal semantic feature based sign language recognition system of claim 1, wherein the data normalization module normalizes all video frame pixel values to 0-1 by dividing by 255.
4. The sign language recognition system based on spatiotemporal semantic features of claim 1, wherein the video framing module uses a sliding window algorithm to divide video into fixed frame number video segments.
5. The sign language recognition system based on the temporal and spatial semantic features according to claim 4, wherein the sliding window algorithm has a window size of 16 and a step size of 8.
6. The sign language recognition system based on space-time semantic features according to claim 1, wherein the decoding module adopts a single-layer unidirectional long short-term memory network plus a prediction module, and predicts from the semantic association features acquired by the semantic feature mining module; it specifically comprises:
long-term memory network module: decoding the implicit state of the last unit through the long-short-term memory network, and calculating the current unit state according to the last output word vector;
prediction module: predicts the current output according to the current unit state, and displays the predicted sign language recognition result to the user in the form of text information; it consists of a fully connected layer and a normalized exponential function, expressed as p(y_t | z_t) = softmax(W z_t),
where exp denotes the exponential function used inside the softmax and W is a weight coefficient.
7. A sign language recognition method based on spatiotemporal semantic features, based on the system of claim 1, comprising the steps of:
1) Acquiring sign language video data;
2) Performing center cropping on each frame of the sign language video data and adjusting the video frames to a uniform size;
3) Normalizing each frame of the uniformly sized sign language video data;
4) Framing the normalized video data: the video sequence is divided into a series of video clips with a window size of 16 frames and a step size of 8;
5) Extracting each video clip into a 4096-dimensional spatio-temporal feature vector through the spatio-temporal feature extraction module;
6) Performing semantic mining on the obtained 4096-dimensional spatio-temporal feature vectors to obtain the semantic association features between video clips;
7) Carrying out sign language recognition on the acquired semantic association features through the decoding module, and outputting the sign language recognition result to the user.
CN202010648991.XA 2020-07-07 2020-07-07 Sign language recognition system and method based on space-time semantic features Active CN111797777B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010648991.XA CN111797777B (en) 2020-07-07 2020-07-07 Sign language recognition system and method based on space-time semantic features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010648991.XA CN111797777B (en) 2020-07-07 2020-07-07 Sign language recognition system and method based on space-time semantic features

Publications (2)

Publication Number Publication Date
CN111797777A CN111797777A (en) 2020-10-20
CN111797777B true CN111797777B (en) 2023-10-17

Family

ID=72810405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010648991.XA Active CN111797777B (en) 2020-07-07 2020-07-07 Sign language recognition system and method based on space-time semantic features

Country Status (1)

Country Link
CN (1) CN111797777B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487939A (en) * 2020-11-26 2021-03-12 深圳市热丽泰和生命科技有限公司 Pure vision light weight sign language recognition system based on deep learning
CN113869178B (en) * 2021-09-18 2022-07-15 合肥工业大学 Feature extraction system and video quality evaluation system based on space-time dimension
CN114155562A (en) * 2022-02-09 2022-03-08 北京金山数字娱乐科技有限公司 Gesture recognition method and device

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095866A (en) * 2015-07-17 2015-11-25 重庆邮电大学 Rapid behavior identification method and system
CN105389539A (en) * 2015-10-15 2016-03-09 电子科技大学 Three-dimensional gesture estimation method and three-dimensional gesture estimation system based on depth data
DE102016100075A1 (en) * 2016-01-04 2017-07-06 Volkswagen Aktiengesellschaft Method for evaluating gestures
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN108171198A (en) * 2018-01-11 2018-06-15 合肥工业大学 Continuous sign language video automatic translating method based on asymmetric multilayer LSTM
CN110110602A (en) * 2019-04-09 2019-08-09 南昌大学 A kind of dynamic sign Language Recognition Method based on three-dimensional residual error neural network and video sequence
CN110309761A (en) * 2019-06-26 2019-10-08 深圳市微纳集成电路与系统应用研究院 Continuity gesture identification method based on the Three dimensional convolution neural network with thresholding cycling element
CN111091045A (en) * 2019-10-25 2020-05-01 重庆邮电大学 Sign language identification method based on space-time attention mechanism
CN111126112A (en) * 2018-10-31 2020-05-08 顺丰科技有限公司 Candidate region determination method and device
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network
CN111339837A (en) * 2020-02-08 2020-06-26 河北工业大学 Continuous sign language recognition method
CN111340006A (en) * 2020-04-16 2020-06-26 深圳市康鸿泰科技有限公司 Sign language identification method and system
CN111361700A (en) * 2020-03-23 2020-07-03 南京畅淼科技有限责任公司 Ship empty and heavy load identification method based on machine vision

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2950713A1 (en) * 2009-09-29 2011-04-01 Movea Sa SYSTEM AND METHOD FOR RECOGNIZING GESTURES
US11195057B2 (en) * 2014-03-18 2021-12-07 Z Advanced Computing, Inc. System and method for extremely efficient image and pattern recognition and artificial intelligence platform

Also Published As

Publication number Publication date
CN111797777A (en) 2020-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant