CN111797777A - Sign language recognition system and method based on space-time semantic features - Google Patents
- Publication number
- CN111797777A (application number CN202010648991.XA)
- Authority
- CN
- China
- Prior art keywords
- module
- sign language
- video
- convolution
- semantic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/32—Normalisation of the pattern dimensions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
Abstract
The invention discloses a sign language recognition system and method based on space-time semantic features. The system comprises a video data acquisition module, a video data preprocessing module and a sign language recognition module. The video data acquisition module is used for acquiring sign language video data; the video data preprocessing module is used for preprocessing the video data; and the sign language recognition module is used for carrying out sign language recognition and outputting a prediction result. The invention eliminates the semantic gap between sign language image data and text data, enables convenient communication with deaf-mute people, and ensures the accuracy of translation by using a leading-edge neural network algorithm as the translation tool. In addition, the invention can also serve as a medium for human-computer interaction: intelligent devices can be command-controlled by analyzing a user's continuous gestures.
Description
Technical Field
The invention belongs to the technical field of sign language recognition, and particularly relates to a sign language recognition system and method based on space-time semantic features.
Background
According to World Health Organization (WHO) data, about 466 million people worldwide had disabling hearing loss in 2018, and the number is expected to exceed 900 million by 2050. Hearing-impaired and deaf-mute people use sign language as their communication medium, but few ordinary people can master sign language and communicate in it, so communication barriers exist between deaf-mute people and the general public.
The current solutions are mainly manual interpretation, sign language translation based on gloves of different colors, and translation based on smartwatches. Manual interpretation is feasible only on certain formal occasions and is difficult to make available to the public, and it is impractical for most people to take sign language courses. Glove-based and smartwatch-based translation introduce an extra equipment burden: the sign language presenter must wear certain devices, which greatly degrades the communication experience of deaf-mute people.
In view of this, the vision-based method provided by the invention can acquire the sign language video through a common monitoring camera, an external camera, a built-in camera of an intelligent device (such as a smart phone, smart glasses, and the like) or other camera modules, and does not require a sign language presenter to be equipped with a customized device. In addition, the invention can also be applied to a novel man-machine interaction scene, and command control of the gesture on the intelligent device is completed through gesture analysis of the user.
Disclosure of Invention
In view of the above disadvantages of the prior art, the present invention provides a sign language recognition system and method based on spatiotemporal semantic features to solve the problem that the deaf-mute and the general public have communication disorders in the prior art. The invention can facilitate the communication between the deaf and dumb people and the ordinary people, so that the ordinary people can understand the meaning expressed by the sign language without learning the sign language.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
the invention relates to a sign language recognition system based on space-time semantic features, which comprises: a video data acquisition module, a video data preprocessing module and a sign language recognition module; wherein:
the video data acquisition module is used for acquiring sign language video data; data are acquired through a common monitoring camera, an external camera, a built-in camera of intelligent equipment (such as a smart phone, smart glasses and the like) or other camera modules;
the video preprocessing module comprises: the video frame size adjusting module, the data normalizing module and the video framing module;
the video frame size adjusting module is used for scaling the collected sign language video frames to a uniform size;
the data normalization module normalizes the pixel value of the video frame after the size adjustment from 0-255 to 0-1 range;
the video framing module is used for dividing sign language video data into video clips (clips) with fixed frame numbers;
the sign language recognition module comprises: the system comprises a space-time feature extraction module, a semantic feature mining module and a decoding module;
the space-time feature extraction module extracts a video segment into 4096-dimensional space-time features;
the semantic feature mining module is used for acquiring semantic association features in the space-time features;
and the decoding module is used for carrying out sign language recognition according to the semantic association characteristics and outputting a sign language recognition result to a user.
Further, the video frame resizing module comprises: a center cropping module and a resizing module, wherein the center cropping module crops away the redundant blank regions of the video frames, and the resizing module uses the resize (reshape) function of the open source computer vision library (OpenCV) to scale the cropped images to a uniform size.
Further, the data normalization module normalizes all video frame pixel values to 0-1 by dividing by 255.
Further, the video framing module uses a sliding window algorithm to divide the video into video segments of a fixed frame number.
Further, the window size of the sliding window algorithm is 16, and the step size is 8.
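As a concrete illustration, the sliding-window framing with window size 16 and step size 8 can be sketched in a few lines of Python (the function name and the stand-in frame list are illustrative, not from the patent):

```python
def split_into_clips(frames, window=16, step=8):
    """Split a frame sequence into fixed-length clips with a sliding window.
    Window size 16 and step size 8 follow the parameters given in the text."""
    clips = []
    for start in range(0, len(frames) - window + 1, step):
        clips.append(frames[start:start + window])
    return clips

video = list(range(40))   # stand-in for 40 decoded frames
clips = split_into_clips(video)
# windows start at frames 0, 8, 16 and 24, giving 4 overlapping clips
```

Adjacent clips overlap by 8 frames, so motion at a clip boundary is still seen in full by the next clip.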
Further, the spatio-temporal feature extraction module uses a modified pseudo-3D residual network (P3D) model to extract each video segment into a 4096-dimensional spatio-temporal feature. Specifically, the modified P3D model adopts a residual network (ResNet-50) as its basic framework, and the residual structure is obtained by substituting the following modules;
the spatio-temporal feature extraction module comprises: a residual module, a P3D-A module, a P3D-B module, a P3D-C module, a series of 3D convolution modules, a pooling module and a fully connected module; wherein:
a residual module: uses a depthwise convolution F_d (convolution kernel 3 x 3) and two pointwise convolutions F_p (convolution kernel 1 x 1) to implement the function of a two-dimensional convolution; the formula is expressed as follows:
y = F_p(F_d(F_p(x))) + x
P3D-A module: uses a 2D spatial convolution F_s (convolution kernel 1 x 3 x 3), a 1D temporal convolution F_t (convolution kernel 3 x 1 x 1) and two pointwise convolutions F_p (convolution kernel 1 x 1 x 1) in place of a 3D convolution; the formula is as follows:
y = F_p(F_t(F_s(F_p(x)))) + x
P3D-B module: consists of a 2D spatial convolution F_s (convolution kernel 1 x 3 x 3), a 1D temporal convolution F_t (convolution kernel 3 x 1 x 1) and two pointwise convolutions F_p (convolution kernel 1 x 1 x 1); P3D-B arranges the 2D spatial convolution F_s and the 1D temporal convolution F_t in parallel, giving:
y = F_p(F_t(F_p(x)) + F_s(F_p(x))) + x
P3D-C module: consists of a 2D spatial convolution F_s (convolution kernel 1 x 3 x 3), a 1D temporal convolution F_t (convolution kernel 3 x 1 x 1) and two pointwise convolutions F_p (convolution kernel 1 x 1 x 1); P3D-C treats the 2D spatial convolution F_s as a branch: its output is both kept directly and further processed by the 1D temporal convolution F_t, the two results are added, and the sum is fed to a pointwise convolution F_p; the formula is as follows:
y = F_p(F_s(F_p(x)) + F_t(F_s(F_p(x)))) + x;
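The three P3D wiring patterns above differ only in how the spatial and temporal convolutions are composed. A minimal structural sketch, with each convolution modeled as an abstract scalar operator (toy stand-ins, not a real network):

```python
def p3d_a(x, Fs, Ft, Fp):
    # serial: y = Fp(Ft(Fs(Fp(x)))) + x
    return Fp(Ft(Fs(Fp(x)))) + x

def p3d_b(x, Fs, Ft, Fp):
    # parallel branches summed: y = Fp(Ft(Fp(x)) + Fs(Fp(x))) + x
    h = Fp(x)
    return Fp(Ft(h) + Fs(h)) + x

def p3d_c(x, Fs, Ft, Fp):
    # spatial output reused by the temporal branch:
    # y = Fp(Fs(Fp(x)) + Ft(Fs(Fp(x)))) + x
    s = Fs(Fp(x))
    return Fp(s + Ft(s)) + x

# toy stand-ins: Fs doubles, Ft adds 1, Fp is the identity
Fs, Ft, Fp = (lambda v: 2 * v), (lambda v: v + 1), (lambda v: v)
```

With these stand-ins, p3d_a(1.0, ...) evaluates (2*1 + 1) + 1; the point is only to make the three residual-unit topologies directly comparable.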
a 3D convolution module: extracts motion information between continuous frames; for an input feature map X, the output feature map is Y. The formula of the 3D convolution is expressed as follows:
Y(t, h, w) = sum over c, t', h', w' of K(c, t', h', w') * X(c, t + t', h + h', w + w')
wherein K is the 3D convolution kernel, and C, T, H and W respectively denote the number of channels, the sequence length, the height and the width of the input X;
a pooling module: computes an output for the elements in a fixed-shape window (also known as a pooling window) of the input data; the pooling layer that takes the maximum value of the elements within the pooling window is called maximum pooling (Max Pool), and the pooling layer that takes the average value of the elements within the pooling window is called average pooling (Avg Pool);
a full connection module: for outputting the final feature vector, the formula is as follows:
Y=WX+b
wherein X is the input vector, Y is the output vector, W is the weight coefficient, and b is the bias value.
Further, the semantic feature mining module adopts a three-layer bidirectional long short-term memory network (BiLSTM) with memory cells to obtain the semantic association features within the spatio-temporal features; specifically:
bidirectional long short-term memory network (BiLSTM) module: the network consists of a forward and a backward long short-term memory network stacked together, and comprises an input gate i_t, an output gate o_t, a forget gate f_t and a cell state c_t; the mathematical formalization of the long short-term memory network is expressed as follows:
i_t := sigm(W_xi * x_t + W_hi * h_{t-1})
f_t := sigm(W_xf * x_t + W_hf * h_{t-1})
o_t := sigm(W_xo * x_t + W_ho * h_{t-1})
g_t := tanh(W_xc * x_t + W_hc * h_{t-1})
c_t := f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t := o_t ⊙ tanh(c_t)
wherein ⊙ denotes element-wise multiplication, sigm is the sigmoid function, tanh is the hyperbolic tangent function, x_t is the input data, h_{t-1} is the previous hidden state, g_t is the cell-state update value, and W_x, W_h are the corresponding weights;
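One step of the LSTM gate equations above can be sketched for scalar states in plain Python (the uniform 0.5 weights are toy values chosen for illustration only):

```python
import math

def sigm(x):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W):
    i = sigm(W["xi"] * x_t + W["hi"] * h_prev)       # input gate
    f = sigm(W["xf"] * x_t + W["hf"] * h_prev)       # forget gate
    o = sigm(W["xo"] * x_t + W["ho"] * h_prev)       # output gate
    g = math.tanh(W["xc"] * x_t + W["hc"] * h_prev)  # cell-state update value
    c = f * c_prev + i * g                           # new cell state
    h = o * math.tanh(c)                             # new hidden state
    return h, c

W = {k: 0.5 for k in ("xi", "hi", "xf", "hf", "xo", "ho", "xc", "hc")}
h, c = lstm_step(1.0, 0.0, 0.0, W)
```

A real BiLSTM runs this recurrence over vectors in both temporal directions and concatenates the two hidden states.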
memory cell module: connects two adjacent long short-term memory network layers end to end so that context information can be better mined. Specifically, the hidden-state output h_t^l of the upper layer is passed through a memory cell to serve as the input h_t^{l+1} of the lower layer's hidden state; the memory cell consists of a hyperbolic tangent function (tanh) and a fully connected layer (fc), and the formula is expressed as:
h_t^{l+1} = fc(tanh(h_t^l))
attention mechanism (attention) module: enables the decoding module to perceive the weighted influence of the input sequence on the current predicted value; the formula is expressed as follows:
a_t = softmax(v * tanh(atten(H_{t-1}, X)))
wherein a_t is the attention vector, representing the weight of the input sequence on the current word prediction, v is a weight coefficient, and atten(·,·) is a fully connected layer.
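The attention computation can be sketched as follows; the multiplicative scoring function standing in for the fully connected layer atten(·,·) is a hypothetical simplification, not the patent's layer:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def attention_weights(h_prev, inputs, v, atten):
    # a_t = softmax(v * tanh(atten(H_{t-1}, X))), scored per input element
    return softmax([v * math.tanh(atten(h_prev, x)) for x in inputs])

atten = lambda h, x: h * x   # hypothetical stand-in scoring function
weights = attention_weights(0.5, [1.0, 2.0, 3.0], v=1.0, atten=atten)
```

The weights always sum to 1, so they act as a soft selection over the input sequence.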
Furthermore, the decoding module is a single-layer unidirectional long short-term memory network with an added prediction module, and performs prediction on the semantic association features obtained by the semantic feature mining module; it comprises:
a long short-term memory network module: decodes the hidden state of the previous cell through the long short-term memory network, and computes the current cell state from the previously output word vector;
a prediction module: predicts the current output from the current cell state and displays the predicted sign language recognition result to the user as text information; it consists of a fully connected layer and a normalized exponential function (softmax), and the formula is expressed as follows:
p(y_t | z_t) = softmax(W * z_t) = exp(W_{y_t} * z_t) / Σ_k exp(W_k * z_t)
wherein exp is the exponential function and W is the weight coefficient.
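The fully-connected-plus-softmax prediction step can be sketched as below; the three-word vocabulary and the weight values are illustrative only:

```python
import math

def predict_word(z, W, b, vocab):
    """Fully connected layer followed by softmax; returns the most
    probable word and its probability."""
    logits = [sum(w * zi for w, zi in zip(row, z)) + bj
              for row, bj in zip(W, b)]
    m = max(logits)                      # stabilize the exponentials
    probs = [math.exp(l - m) for l in logits]
    s = sum(probs)
    probs = [p / s for p in probs]
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs[best]

vocab = ["hello", "thanks", "goodbye"]
W = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
b = [0.0, 0.0, 0.0]
word, p = predict_word([0.2, 1.0], W, b, vocab)
```

The logits here are [0.2, 2.0, 0.6], so the second vocabulary entry wins.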
The invention further relates to a sign language recognition method based on space-time semantic features, which comprises the following steps:
1) acquiring sign language video data;
2) performing center cropping on each frame image of the sign language video data, and resizing the video frames to a uniform size;
3) normalizing each frame image of the uniformly sized sign language video data;
4) framing the normalized video data: the video sequence is divided into a series of video clips using a window of 16 frames and a step size of 8;
5) extracting each video clip with the spatio-temporal feature module to obtain a 4096-dimensional spatio-temporal feature vector;
6) performing semantic mining on the obtained 4096-dimensional spatio-temporal feature vectors to obtain the semantic association features between video clips;
7) performing sign language recognition through the decoding module according to the semantic association features, and outputting the sign language recognition result to the user.
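The seven steps above can be walked through at the level of tensor shapes; each stage is a stub that only tracks shapes. The 256 x 256 frame size is an assumption for illustration, while the 16-frame window, step size 8 and 4096-dimensional feature come from the method itself:

```python
def pipeline_shapes(num_frames, frame_hw=(256, 256)):
    """Trace the tensor shapes through steps 2)-6) of the method."""
    h, w = frame_hw
    # steps 2)-3): crop/resize each frame to (h, w, 3) and normalize to 0-1
    frames = [(h, w, 3)] * num_frames
    # step 4): sliding window of 16 frames with step 8
    n_clips = (num_frames - 16) // 8 + 1
    clips = [(16, h, w, 3)] * n_clips
    # step 5): each clip becomes one 4096-dim spatio-temporal feature
    st_feats = [(4096,)] * n_clips
    # step 6): semantic mining keeps one feature vector per clip
    sem_feats = [(4096,)] * n_clips
    return clips, st_feats, sem_feats

clips, st, sem = pipeline_shapes(48)   # a 48-frame video yields 5 clips
```

Step 7) would then decode the per-clip semantic features into a word sequence.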
The invention has the beneficial effects that:
the invention can collect sign language videos through a common camera, and can facilitate the communication between deaf-mute people and ordinary people without the sign language expressors equipped with customized equipment, so that the ordinary people can understand the meaning expressed by the sign language without learning the sign language; hardware burden in user communication is reduced, and more natural intelligent device communication physical examination is brought to the user.
In modeling sign language or gestures, the extracted features are implicit and relate to hand movements, hand movement trajectories and facial expressions, such as the gesture shape and the movement trajectory during gesture transitions; sign language is thus perceived from multiple dimensions, achieving a more accurate recognition effect.
Drawings
FIG. 1 is a block diagram of a system according to the present invention.
FIG. 2 is a diagram illustrating a sign language recognition model according to the present invention.
FIG. 3 is a schematic diagram of a spatiotemporal feature model according to the present invention.
FIG. 4 is a diagram illustrating the operation of the memory cell of the present invention.
Detailed Description
In order to facilitate understanding of those skilled in the art, the present invention will be further described with reference to the following examples and drawings, which are not intended to limit the present invention.
Referring to fig. 1, a sign language recognition system based on spatiotemporal semantic features of the present invention includes: a video data acquisition module, a video data preprocessing module and a sign language recognition module; wherein:
the video data acquisition module acquires sign language video data through a common surveillance camera, an external camera, the built-in camera of an intelligent device (such as a smartphone or smart glasses) or another camera module;
the video preprocessing module comprises: the video frame size adjusting module, the data normalizing module and the video framing module;
the video frame resizing module scales the collected sign language video frames to a uniform size. It comprises a center cropping module and a resizing module: the center cropping module crops away the redundant blank regions of the video frames, and the resizing module uses the resize (reshape) function of the open source computer vision library (OpenCV) to scale the cropped images to a uniform size. Considering that the effective content of a recorded video lies mainly in the central area, center cropping is used to remove the invalid parts at the sides of the image, and the image is then uniformly reduced to a size of (256, 256, 3) with the OpenCV resize (reshape) function.
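The center-crop arithmetic described above amounts to computing a crop box; a minimal sketch (the resize itself would be done with OpenCV's resize function, which is not reproduced here):

```python
def center_crop_box(width, height, crop_w, crop_h):
    """Return the (left, top, right, bottom) box of a centered crop,
    used to trim the blank side regions before resizing."""
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h

# crop a 480 x 480 square out of the center of a 640 x 480 frame
box = center_crop_box(640, 480, 480, 480)
```

Each frame is then cropped to this box and scaled down to the network's uniform input size.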
The data normalization module normalizes the pixel values of the resized video frames from the range 0-255 to 0-1. Since non-normalized data cannot be used directly for prediction, the normalization module applies maximum-value normalization: the pixel values of all frames are divided by 255 to obtain the normalized data;
the video framing module divides the sign language video data into video clips with a fixed number of frames using a sliding window algorithm. Considering that the lengths of recorded videos may differ, the video needs to be framed: the video data is divided into a series of clips of length 16 frames with a step size of 8.
Referring to fig. 2, the sign language recognition module includes: the system comprises a space-time feature extraction module, a semantic feature mining module and a decoding module;
the space-time feature extraction module extracts a video segment into 4096-dimensional space-time features;
the semantic feature mining module is used for acquiring semantic association features in the space-time features;
and the decoding module is used for carrying out sign language recognition according to the semantic association characteristics and outputting corresponding text information to a user.
Referring to fig. 3, the spatio-temporal feature extraction module uses a modified pseudo-3D residual network (P3D) model to extract each video segment into a 4096-dimensional spatio-temporal feature. Specifically, the modified P3D model adopts a residual network (ResNet-50) as its basic framework, and the residual structure is obtained by substituting the following modules;
the spatio-temporal feature extraction module comprises: a residual module, a P3D-A module, a P3D-B module, a P3D-C module, a series of 3D convolution modules, a pooling module and a fully connected module; wherein:
a residual module: uses a depthwise convolution F_d (convolution kernel 3 x 3) and two pointwise convolutions F_p (convolution kernel 1 x 1) to implement the function of a two-dimensional convolution; the formula is expressed as follows:
y = F_p(F_d(F_p(x))) + x
P3D-A module: uses a 2D spatial convolution F_s (convolution kernel 1 x 3 x 3), a 1D temporal convolution F_t (convolution kernel 3 x 1 x 1) and two pointwise convolutions F_p (convolution kernel 1 x 1 x 1) in place of a 3D convolution; the formula is as follows:
y = F_p(F_t(F_s(F_p(x)))) + x
P3D-B module: consists of a 2D spatial convolution F_s (convolution kernel 1 x 3 x 3), a 1D temporal convolution F_t (convolution kernel 3 x 1 x 1) and two pointwise convolutions F_p (convolution kernel 1 x 1 x 1); P3D-B arranges the 2D spatial convolution F_s and the 1D temporal convolution F_t in parallel, giving:
y = F_p(F_t(F_p(x)) + F_s(F_p(x))) + x
P3D-C module: consists of a 2D spatial convolution F_s (convolution kernel 1 x 3 x 3), a 1D temporal convolution F_t (convolution kernel 3 x 1 x 1) and two pointwise convolutions F_p (convolution kernel 1 x 1 x 1); P3D-C treats the 2D spatial convolution F_s as a branch: its output is both kept directly and further processed by the 1D temporal convolution F_t, the two results are added, and the sum is fed to a pointwise convolution F_p; the formula is as follows:
y = F_p(F_s(F_p(x)) + F_t(F_s(F_p(x)))) + x;
a 3D convolution module: extracts motion information between continuous frames; for an input feature map X, the output feature map is Y. The formula of the 3D convolution is expressed as follows:
Y(t, h, w) = sum over c, t', h', w' of K(c, t', h', w') * X(c, t + t', h + h', w + w')
wherein K is the 3D convolution kernel, and C, T, H and W respectively denote the number of channels, the sequence length, the height and the width of the input X;
a pooling module: the pooling module can reduce information redundancy, improve scale invariance and rotation invariance of the model, and prevent overfitting. The pooling module calculates the output for elements in one fixed-shape window (also called pooling window) of the input data at a time; the pooling layer for calculating the maximum value of the element within the pooling window is called Max Pool (Avg Pool), and the pooling layer for calculating the average value of the element within the pooling window is called average Pool (Avg Pool);
a full connection module: for outputting the final feature vector, the formula is as follows:
Y=WX+b
wherein X is the input vector, Y is the output vector, W is the weight coefficient, and b is the bias value.
Referring to fig. 4, the semantic feature mining module obtains the semantic association features within the spatio-temporal features using a three-layer bidirectional long short-term memory network (BiLSTM) with memory cells; specifically:
bidirectional long short-term memory network (BiLSTM) module: the network consists of a forward and a backward long short-term memory network stacked together, and comprises an input gate i_t, an output gate o_t, a forget gate f_t and a cell state c_t; the mathematical formalization of the long short-term memory network is expressed as follows:
i_t := sigm(W_xi * x_t + W_hi * h_{t-1})
f_t := sigm(W_xf * x_t + W_hf * h_{t-1})
o_t := sigm(W_xo * x_t + W_ho * h_{t-1})
g_t := tanh(W_xc * x_t + W_hc * h_{t-1})
c_t := f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t := o_t ⊙ tanh(c_t)
wherein ⊙ denotes element-wise multiplication, sigm is the sigmoid function, tanh is the hyperbolic tangent function, x_t is the input data, h_{t-1} is the previous hidden state, g_t is the cell-state update value, and W_x, W_h are the corresponding weights;
memory cell module: connects two adjacent long short-term memory network layers end to end so that context information can be better mined. Specifically, the hidden-state output h_t^l of the upper layer is passed through a memory cell to serve as the input h_t^{l+1} of the lower layer's hidden state; the memory cell consists of a hyperbolic tangent function (tanh) and a fully connected layer (fc), and the formula is expressed as:
h_t^{l+1} = fc(tanh(h_t^l))
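The memory-cell step — tanh followed by a fully connected layer carrying the upper layer's hidden state down to the next layer — can be sketched with toy weights (the identity weight matrix is illustrative only):

```python
import math

def memory_cell(h_upper, W, b):
    """tanh then a fully connected layer: h^{l+1} = fc(tanh(h^l))."""
    t = [math.tanh(v) for v in h_upper]
    return [sum(wij * tj for wij, tj in zip(row, t)) + bi
            for row, bi in zip(W, b)]

# pass a 2-dim hidden state through an identity fc layer
h_next = memory_cell([0.5, -0.5], W=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0])
```

With the identity weights, the output is simply tanh applied componentwise, which squashes the upper layer's activations into (-1, 1) before the next layer consumes them.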
attention mechanism (attention) module: enables the decoding module to perceive the weighted influence of the input sequence on the current predicted value; the formula is expressed as follows:
a_t = softmax(v * tanh(atten(H_{t-1}, X)))
wherein a_t is the attention vector, representing the weight of the input sequence on the current word prediction, v is a weight coefficient, and atten(·,·) is a fully connected layer.
The decoding module is a single-layer unidirectional long short-term memory network with an added prediction module, and performs prediction on the semantic association features obtained by the semantic feature mining module; it comprises:
a long short-term memory network module: decodes the hidden state of the previous cell through the long short-term memory network, and computes the current cell state from the previously output word vector;
a prediction module: predicts the current output from the current cell state and displays the predicted sign language recognition result to the user as text information; it consists of a fully connected layer and a normalized exponential function (softmax), and the formula is expressed as follows:
p(y_t | z_t) = softmax(W * z_t) = exp(W_{y_t} * z_t) / Σ_k exp(W_k * z_t)
wherein exp is the exponential function, W is the weight coefficient, and the text corresponding to the maximum of p(y_t | z_t) is the output result.
The invention further relates to a sign language recognition method based on space-time semantic features, which comprises the following steps:
1) acquiring sign language video data;
2) performing center cropping on each frame image of the sign language video data, and resizing the video frames to a uniform size;
3) normalizing each frame image of the uniformly sized sign language video data;
4) framing the normalized video data: the video sequence is divided into a series of video clips using a window of 16 frames and a step size of 8;
5) extracting each video clip with the spatio-temporal feature module to obtain a 4096-dimensional spatio-temporal feature vector;
6) performing semantic mining on the obtained 4096-dimensional spatio-temporal feature vectors to obtain the semantic association features between video clips;
7) performing sign language prediction through the decoding module according to the semantic association features, and outputting the corresponding text information to the user.
It should be noted that the neural network model may vary in actual implementations, for example by adding or deleting convolutional layers in a module, changing convolutional-layer parameters, or changing the preprocessing process; as long as sign language recognition or gesture recognition is still performed based on the neural network itself, such implementations or changes should not be considered beyond the scope of the present invention.
While the invention has been described in terms of its preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.
Claims (9)
1. A sign language recognition system based on spatiotemporal semantic features, comprising: the device comprises a video data acquisition module, a video data preprocessing module and a sign language recognition module;
the video data acquisition module is used for acquiring sign language video data;
the video data preprocessing module comprises: a video frame resizing module, a data normalization module, and a video framing module;
the video frame resizing module is used for scaling the collected sign language video frames to a uniform size;
the data normalization module normalizes the pixel values in the resized video frames from the range 0-255 to the range 0-1;
the video framing module is used for dividing the sign language video data into video segments with a fixed number of frames;
the sign language recognition module comprises: the system comprises a space-time feature extraction module, a semantic feature mining module and a decoding module;
the space-time feature extraction module extracts each video segment into a 4096-dimensional space-time feature;
the semantic feature mining module is used for acquiring semantic association features in the space-time features;
and the decoding module is used for carrying out sign language recognition according to the semantic association characteristics and outputting a sign language recognition result to a user.
2. The spatiotemporal semantic feature-based sign language recognition system of claim 1, wherein the video frame resizing module comprises: a center cropping module and a resizing module, wherein the center cropping module is used for cropping away redundant blank regions of the video frames, and the resizing module adopts the resize function of an open-source computer vision library to adjust the cropped images to a uniform size.
3. The system according to claim 1, wherein the data normalization module normalizes all video frame pixel values to 0-1 by dividing by 255.
4. The system according to claim 1, wherein the video framing module uses a sliding window algorithm to divide the video into fixed frame number video segments.
5. The system for sign language recognition based on spatiotemporal semantic features of claim 4 wherein the sliding window algorithm has a window size of 16 and a step size of 8.
6. The sign language recognition system based on spatiotemporal semantic features of claim 1, wherein the space-time feature extraction module adopts a modified pseudo-3D residual network model to extract each video segment into a 4096-dimensional space-time feature, specifically: the modified pseudo-3D residual network model adopts a residual network as the basic framework, with the residual structure replaced by the following modules;
the space-time feature extraction module comprises: a residual module, a P3D-A module, a P3D-B module, a P3D-C module, a series of 3D convolution modules, a pooling module, and a fully connected module;
a residual module: one depthwise convolution F_d and two point-by-point convolutions F_p are used to implement the function of a two-dimensional convolution, expressed by the formula:
y = F_p(F_d(F_p(x))) + x;
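As a concrete illustration of this structure (pointwise, depthwise, pointwise, plus the skip connection), here is a small numpy sketch; the 3x3 depthwise kernel size and the zero padding are assumptions made for the example.

```python
import numpy as np

def pointwise(x, w):
    # F_p: 1x1 convolution, i.e. a channel mix at every spatial position
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)
    return np.einsum('oc,chw->ohw', w, x)

def depthwise(x, k):
    # F_d: 3x3 depthwise convolution, one kernel per channel, zero padded
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    y = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                y[c, i, j] = np.sum(xp[c, i:i + 3, j:j + 3] * k[c])
    return y

def residual_block(x, w1, k, w2):
    # y = F_p(F_d(F_p(x))) + x, matching the formula above
    return pointwise(depthwise(pointwise(x, w1), k), w2) + x

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
eye = np.eye(2)
delta = np.zeros((2, 3, 3))
delta[:, 1, 1] = 1.0          # depthwise kernel that passes input through
y = residual_block(x, eye, delta, eye)   # identity weights -> y == 2 * x
```

The attraction of this factorization is cost: a depthwise pass touches each channel separately and the 1x1 passes mix channels, which together need far fewer multiplications than one dense 2D convolution.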
a P3D-A module: composed of one 2D spatial convolution F_s, one 1D temporal convolution F_t, and two point-by-point convolutions F_p in series, replacing a 3D convolution; expressed by the formula:
y = F_p(F_t(F_s(F_p(x)))) + x;
a P3D-B module: composed of one 2D spatial convolution F_s, one 1D temporal convolution F_t, and two point-by-point convolutions F_p; P3D-B places the 2D spatial convolution F_s and the 1D temporal convolution F_t in parallel, giving:
y = F_p(F_t(F_p(x)) + F_s(F_p(x))) + x;
a P3D-C module: composed of one 2D spatial convolution F_s, one 1D temporal convolution F_t, and two point-by-point convolutions F_p; P3D-C applies residual-style processing to the 2D spatial convolution F_s and the 1D temporal convolution F_t, taking the output of the 2D spatial convolution F_s as a branch, adding it to the result of the 1D temporal convolution F_t, and feeding the sum as the input of the point-by-point convolution F_p; expressed by the formula:
y = F_p(F_s(F_p(x)) + F_t(F_s(F_p(x)))) + x;
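The three variants differ only in how F_s and F_t are wired together. The sketch below shows just that topology; the callables stand in for the actual convolutions, which keeps the example self-contained.

```python
def p3d_a(x, Fs, Ft, Fp1, Fp2):
    # serial: spatial then temporal, per y = F_p(F_t(F_s(F_p(x)))) + x
    return Fp2(Ft(Fs(Fp1(x)))) + x

def p3d_b(x, Fs, Ft, Fp1, Fp2):
    # parallel: spatial and temporal branches summed before the last F_p
    h = Fp1(x)
    return Fp2(Ft(h) + Fs(h)) + x

def p3d_c(x, Fs, Ft, Fp1, Fp2):
    # spatial output feeds the temporal branch and is also added back
    h = Fs(Fp1(x))
    return Fp2(h + Ft(h)) + x

# toy stand-ins: scaling by 2 (spatial), by 3 (temporal), identity (F_p)
Fs = lambda v: 2 * v
Ft = lambda v: 3 * v
ident = lambda v: v
```

Running the toy stand-ins through each wiring makes the structural difference visible: the serial, parallel, and skip arrangements produce different outputs from the same input and the same component operations.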
a 3D convolution module: extracts motion information between consecutive frames; for an input feature map X, the output feature map is Y; the 3D convolution is expressed by the formula:
Y(t, h, w) = Σ_c Σ_i Σ_j Σ_k K(c, i, j, k)·X(c, t+i, h+j, w+k)
wherein K is the 3D convolution kernel, and C, T, H, and W respectively denote the number of channels, the sequence length, the height, and the width of the input X;
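A direct, unoptimized numpy rendering of this operation; "valid" (no-padding) boundaries and a single output channel are assumptions of the sketch.

```python
import numpy as np

def conv3d(x, k):
    # x: (C, T, H, W) input feature map, k: (C, t, h, w) kernel
    # returns the (T-t+1, H-h+1, W-w+1) output feature map Y
    C, T, H, W = x.shape
    _, t, h, w = k.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for a in range(out.shape[0]):
        for b in range(out.shape[1]):
            for c in range(out.shape[2]):
                out[a, b, c] = np.sum(x[:, a:a + t, b:b + h, c:c + w] * k)
    return out

y = conv3d(np.ones((1, 3, 3, 3)), np.ones((1, 2, 2, 2)))
# each output element sums a 1x2x2x2 block of ones, i.e. 8
```

Because the kernel slides along the time axis as well as the spatial axes, each output element mixes information from adjacent frames, which is exactly the motion information the module is meant to capture.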
a pooling module: computes an output for the elements in a fixed-shape window of the input data; a pooling layer that takes the maximum of the elements in the pooling window is called max pooling, and a pooling layer that takes the average of the elements in the pooling window is called average pooling;
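Both pooling variants can be sketched in a few lines of numpy; the 2x2 window and stride below are example values chosen for the sketch, not taken from the patent.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode='max'):
    # max pooling takes the window maximum, average pooling the mean
    H, W = x.shape
    oh, ow = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            win = x[i * stride:i * stride + size,
                    j * stride:j * stride + size]
            out[i, j] = win.max() if mode == 'max' else win.mean()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 1.],
              [0., 4., 1., 1.]])
```

On this 4x4 input, each 2x2 window collapses to one value, halving both spatial dimensions while keeping either the strongest activation (max) or the local average.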
a fully connected module: used for outputting the final feature vector, expressed by the formula:
Y = WX + b
wherein X is the input vector, Y is the output vector, W is the weight matrix, and b is the bias value.
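The formula Y = WX + b is a single matrix-vector product; the small dimensions below are arbitrary example values (the module's real output is the 4096-dimensional feature described above).

```python
import numpy as np

def fully_connected(x, W, b):
    # Y = W X + b
    return W @ x + b

W = np.array([[1., 0., 2.],
              [0., 1., 1.]])
b = np.array([1., -1.])
y = fully_connected(np.array([1., 2., 3.]), W, b)  # -> [8., 4.]
```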
7. The sign language recognition system based on spatiotemporal semantic features of claim 1, wherein the semantic feature mining module adopts a three-layer bidirectional long short-term memory network with memory cells to obtain the semantic association features in the space-time features; specifically comprising:
the bidirectional long short-term memory network module: the network consists of a forward long short-term memory network and a backward long short-term memory network stacked together, and comprises an input gate i_t, an output gate o_t, a forget gate f_t, and a cell state c_t; the long short-term memory network is mathematically formalized as follows:
i_t := sigm(W_xi·x_t + W_hi·h_{t-1})
f_t := sigm(W_xf·x_t + W_hf·h_{t-1})
o_t := sigm(W_xo·x_t + W_ho·h_{t-1})
c̃_t := tanh(W_xc·x_t + W_hc·h_{t-1})
c_t := f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t := o_t ⊙ tanh(c_t)
wherein "⊙" denotes element-wise multiplication, sigm is the sigmoid function, tanh is the hyperbolic tangent function, x_t is the input data, h_{t-1} is the previous hidden state, c̃_t is the cell state update value, and W_x and W_h are the corresponding weight matrices;
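One time step of these equations in numpy. Biases are omitted to match the formulas as written, and the all-zero demo weights are chosen only to make the result easy to verify by hand.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    i = sigm(W['xi'] @ x + W['hi'] @ h_prev)           # input gate
    f = sigm(W['xf'] @ x + W['hf'] @ h_prev)           # forget gate
    o = sigm(W['xo'] @ x + W['ho'] @ h_prev)           # output gate
    c_new = np.tanh(W['xc'] @ x + W['hc'] @ h_prev)    # cell update value
    c = f * c_prev + i * c_new                         # new cell state
    h = o * np.tanh(c)                                 # new hidden state
    return h, c

n = 2
W = {key: np.zeros((n, n))
     for key in ('xi', 'hi', 'xf', 'hf', 'xo', 'ho', 'xc', 'hc')}
h, c = lstm_step(np.zeros(n), np.zeros(n), np.ones(n), W)
# with all-zero weights every gate is sigm(0) = 0.5, so c = 0.5 * c_prev
```

A bidirectional layer simply runs this step forward and backward over the clip-feature sequence and concatenates the two hidden states at each position.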
a memory cell module: two adjacent long short-term memory network layers are connected end to end, and the hidden state h_t^(l) output by the previous layer is passed through a memory cell as the input of the hidden state of the next layer; the memory cell comprises a hyperbolic tangent function tanh and a fully connected layer fc, expressed by the formula:
x_t^(l+1) = fc(tanh(h_t^(l)));
an attention mechanism module: enables the decoding module to perceive the weight of each part of the input sequence on the current predicted value, expressed by the formula:
a_t = softmax(v·tanh(atten(H_{t-1}, X)))
wherein a_t is the attention vector representing the weights of the input sequence for the currently predicted word, v is a weight coefficient, and atten(·) is a fully connected layer.
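A numpy sketch of the attention weights; modeling atten(·) as one fully connected layer over the concatenated decoder state and each input feature is an assumption about the exact layout, which the claim does not pin down.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_weights(h_prev, X, Wa, v):
    # a_t = softmax(v · tanh(atten(H_{t-1}, X)))
    scores = np.array([v @ np.tanh(Wa @ np.concatenate([h_prev, x]))
                       for x in X])
    return softmax(scores)

h_prev = np.zeros(2)
X = [np.array([1., 0.]), np.array([0., 1.]), np.array([1., 1.])]
a = attention_weights(h_prev, X, np.zeros((3, 4)), np.ones(3))
# zero weights give equal scores, hence uniform attention over 3 inputs
```

The softmax guarantees the weights are positive and sum to one, so a_t can be read directly as how much each input clip feature should influence the next predicted word.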
8. The sign language recognition system based on spatiotemporal semantic features of claim 1, wherein the decoding module adopts a one-layer unidirectional long short-term memory network and a prediction module to predict on the semantic association features obtained by the semantic feature mining module; specifically comprising:
the long short-term memory network module: decodes the hidden state of the previous unit through the long short-term memory network, and calculates the current unit state from the previously output word vector;
a prediction module: predicts the current output from the current unit state, and presents the predicted sign language recognition result to the user as text information; the prediction module comprises a fully connected layer and a normalized exponential function, expressed by the formula:
p_t = softmax(W·h_t), where softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
wherein exp is the exponential function and W is the weight coefficient.
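The prediction step is a fully connected layer followed by a softmax over the vocabulary. A toy sketch follows; the vocabulary and weight values are invented for the example.

```python
import numpy as np

def predict_word(h, W, vocab):
    # fully connected layer, then the normalized exponential (softmax)
    logits = W @ h
    e = np.exp(logits - logits.max())
    probs = e / e.sum()
    return vocab[int(np.argmax(probs))], probs

vocab = ['hello', 'thanks', 'goodbye']
W = np.array([[0.1, 0.2],
              [1.5, 1.0],
              [0.0, 0.3]])
word, probs = predict_word(np.array([1., 1.]), W, vocab)  # -> 'thanks'
```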
9. A sign language recognition method based on space-time semantic features, characterized by comprising the following steps:
1) acquiring sign language video data;
2) performing center cropping on each frame image of the sign language video data, and adjusting the video frames to a uniform size;
3) normalizing each frame image of the uniformly sized sign language video data;
4) framing the normalized video data, dividing the video sequence into a series of video segments with a window size of 16 and a step length of 8;
5) passing the series of video segments through the space-time feature module to obtain a 4096-dimensional space-time feature vector;
6) performing semantic mining on the obtained 4096-dimensional space-time feature vector to obtain the semantic association features among the video segments;
7) performing sign language recognition on the obtained semantic association features through the decoding module, and outputting the sign language recognition result to the user.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010648991.XA CN111797777B (en) | 2020-07-07 | 2020-07-07 | Sign language recognition system and method based on space-time semantic features |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111797777A true CN111797777A (en) | 2020-10-20 |
CN111797777B CN111797777B (en) | 2023-10-17 |
Family
ID=72810405
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010648991.XA Active CN111797777B (en) | 2020-07-07 | 2020-07-07 | Sign language recognition system and method based on space-time semantic features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111797777B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112487939A (en) * | 2020-11-26 | 2021-03-12 | 深圳市热丽泰和生命科技有限公司 | Pure vision light weight sign language recognition system based on deep learning |
CN113869178A (en) * | 2021-09-18 | 2021-12-31 | 合肥工业大学 | Feature extraction system and video quality evaluation system based on space-time dimension |
CN114155562A (en) * | 2022-02-09 | 2022-03-08 | 北京金山数字娱乐科技有限公司 | Gesture recognition method and device |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120323521A1 (en) * | 2009-09-29 | 2012-12-20 | Commissariat A L'energie Atomique Et Aux Energies Al Ternatives | System and method for recognizing gestures |
CN105095866A (en) * | 2015-07-17 | 2015-11-25 | 重庆邮电大学 | Rapid behavior identification method and system |
CN105389539A (en) * | 2015-10-15 | 2016-03-09 | 电子科技大学 | Three-dimensional gesture estimation method and three-dimensional gesture estimation system based on depth data |
DE102016100075A1 (en) * | 2016-01-04 | 2017-07-06 | Volkswagen Aktiengesellschaft | Method for evaluating gestures |
CN107451552A (en) * | 2017-07-25 | 2017-12-08 | 北京联合大学 | A kind of gesture identification method based on 3D CNN and convolution LSTM |
CN108171198A (en) * | 2018-01-11 | 2018-06-15 | 合肥工业大学 | Continuous sign language video automatic translating method based on asymmetric multilayer LSTM |
CN110110602A (en) * | 2019-04-09 | 2019-08-09 | 南昌大学 | A kind of dynamic sign Language Recognition Method based on three-dimensional residual error neural network and video sequence |
CN110309761A (en) * | 2019-06-26 | 2019-10-08 | 深圳市微纳集成电路与系统应用研究院 | Continuity gesture identification method based on the Three dimensional convolution neural network with thresholding cycling element |
CN111091045A (en) * | 2019-10-25 | 2020-05-01 | 重庆邮电大学 | Sign language identification method based on space-time attention mechanism |
CN111126112A (en) * | 2018-10-31 | 2020-05-08 | 顺丰科技有限公司 | Candidate region determination method and device |
US20200184278A1 (en) * | 2014-03-18 | 2020-06-11 | Z Advanced Computing, Inc. | System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform |
CN111325099A (en) * | 2020-01-21 | 2020-06-23 | 南京邮电大学 | Sign language identification method and system based on double-current space-time diagram convolutional neural network |
CN111340006A (en) * | 2020-04-16 | 2020-06-26 | 深圳市康鸿泰科技有限公司 | Sign language identification method and system |
CN111339837A (en) * | 2020-02-08 | 2020-06-26 | 河北工业大学 | Continuous sign language recognition method |
CN111361700A (en) * | 2020-03-23 | 2020-07-03 | 南京畅淼科技有限责任公司 | Ship empty and heavy load identification method based on machine vision |
Also Published As
Publication number | Publication date |
---|---|
CN111797777B (en) | 2023-10-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111091045B (en) | Sign language identification method based on space-time attention mechanism | |
CN111797777B (en) | Sign language recognition system and method based on space-time semantic features | |
WO2021008320A1 (en) | Sign language recognition method and apparatus, computer-readable storage medium, and computer device | |
EP3989111A1 (en) | Video classification method and apparatus, model training method and apparatus, device and storage medium | |
CN109815826B (en) | Method and device for generating face attribute model | |
EP4099220A1 (en) | Processing apparatus, method and storage medium | |
CN109086706B (en) | Motion recognition method based on segmentation human body model applied to human-computer cooperation | |
CN111310676A (en) | Video motion recognition method based on CNN-LSTM and attention | |
CN112800903B (en) | Dynamic expression recognition method and system based on space-time diagram convolutional neural network | |
CN104361316B (en) | Dimension emotion recognition method based on multi-scale time sequence modeling | |
EP4276684A1 (en) | Capsule endoscope image recognition method based on deep learning, and device and medium | |
CN108875836B (en) | Simple-complex activity collaborative recognition method based on deep multitask learning | |
EP4006777A1 (en) | Image classification method and device | |
CN111832393A (en) | Video target detection method and device based on deep learning | |
CN114360067A (en) | Dynamic gesture recognition method based on deep learning | |
Alam et al. | Two dimensional convolutional neural network approach for real-time bangla sign language characters recognition and translation | |
Boukdir et al. | Isolated video-based Arabic sign language recognition using convolutional and recursive neural networks | |
CN114724224A (en) | Multi-mode emotion recognition method for medical care robot | |
CN115205336A (en) | Feature fusion target perception tracking method based on multilayer perceptron | |
Zatout et al. | Semantic scene synthesis: application to assistive systems | |
CN114724251A (en) | Old people behavior identification method based on skeleton sequence under infrared video | |
CN112668543B (en) | Isolated word sign language recognition method based on hand model perception | |
CN102663369B (en) | Human motion tracking method on basis of SURF (Speed Up Robust Feature) high efficiency matching kernel | |
CN116797799A (en) | Single-target tracking method and tracking system based on channel attention and space-time perception | |
Howell et al. | Active vision techniques for visually mediated interaction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||