CN111797777A - Sign language recognition system and method based on space-time semantic features - Google Patents

Sign language recognition system and method based on space-time semantic features

Info

Publication number
CN111797777A
Authority
CN
China
Prior art keywords
module
sign language
video
convolution
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010648991.XA
Other languages
Chinese (zh)
Other versions
CN111797777B (en)
Inventor
殷亚凤
甘世维
谢磊
陆桑璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202010648991.XA priority Critical patent/CN111797777B/en
Publication of CN111797777A publication Critical patent/CN111797777A/en
Application granted granted Critical
Publication of CN111797777B publication Critical patent/CN111797777B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/32Normalisation of the pattern dimensions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a sign language recognition system and method based on space-time semantic features. The system comprises a video data acquisition module, a video data preprocessing module and a sign language recognition module. The video data acquisition module is used for acquiring sign language video data; the video data preprocessing module is used for preprocessing the video data; and the sign language recognition module is used for performing sign language recognition and outputting a prediction result. The invention bridges the semantic gap between sign language image data and text data, enables convenient communication with deaf-mute people, and guarantees translation accuracy by using a state-of-the-art neural network algorithm as the translation tool. In addition, the invention can also serve as a medium for human-computer interaction: intelligent devices can be commanded and controlled by analyzing the continuous gestures of the user.

Description

Sign language recognition system and method based on space-time semantic features
Technical Field
The invention belongs to the technical field of sign language recognition, and particularly relates to a sign language recognition system and method based on space-time semantic features.
Background
According to World Health Organization (WHO) data, about 466 million people worldwide had hearing impairment in 2018, and it is expected that by 2050 the number of people with hearing impairment will exceed 900 million. Hearing-impaired and deaf-mute people use sign language as their communication medium, but few ordinary people master sign language and can communicate with it, so there is a communication barrier between deaf-mute people and ordinary people.
The current solutions are mainly manual translation, sign language translation based on gloves of different colors, and translation based on smartwatches. Manual translation is feasible only on certain formal occasions and is difficult to make available to the general public, and it is impractical for most people to take sign language courses. Glove-based sign language translation and smartwatch-based translation introduce an additional equipment burden: the signer must be equipped with specific hardware, which greatly degrades the communication experience of deaf-mute people.
In view of this, the vision-based method provided by the invention can acquire the sign language video through a common surveillance camera, an external camera, the built-in camera of an intelligent device (such as a smartphone or smart glasses) or other camera modules, and does not require the signer to be equipped with customized equipment. In addition, the invention can also be applied to novel human-computer interaction scenarios, where command and control of intelligent devices is accomplished by analyzing the user's gestures.
Disclosure of Invention
In view of the above disadvantages of the prior art, the present invention provides a sign language recognition system and method based on spatiotemporal semantic features to solve the problem in the prior art that there is a communication barrier between deaf-mute people and the general public. The invention facilitates communication between deaf-mute people and ordinary people, so that ordinary people can understand the meaning expressed by sign language without having to learn it.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
the invention relates to a sign language recognition system based on space-time semantic features, which comprises: the device comprises a video data acquisition module, a video data preprocessing module and a sign language recognition module; wherein the content of the first and second substances,
the video data acquisition module is used for acquiring sign language video data; data are acquired through a common monitoring camera, an external camera, a built-in camera of intelligent equipment (such as a smart phone, smart glasses and the like) or other camera modules;
the video preprocessing module comprises: the video frame size adjusting module, the data normalizing module and the video framing module;
the video frame size adjusting module is used for scaling the collected sign language video frames to a uniform size;
the data normalization module normalizes the pixel values of the resized video frames from the range 0-255 to 0-1;
the video framing module is used for dividing sign language video data into video clips (clips) with fixed frame numbers;
the sign language recognition module comprises: the system comprises a space-time feature extraction module, a semantic feature mining module and a decoding module;
the space-time feature extraction module extracts a video segment into 4096-dimensional space-time features;
the semantic feature mining module is used for acquiring semantic association features in the space-time features;
and the decoding module is used for carrying out sign language recognition according to the semantic association characteristics and outputting a sign language recognition result to a user.
Further, the video frame resizing module comprises: a center cropping module and a size adjusting module, wherein the center cropping module is used for cropping away redundant blank regions of the video frames, and the size adjusting module adopts the resizing (reshape) function of the open-source computer vision library OpenCV to adjust the cropped images to a uniform size.
Further, the data normalization module normalizes all video frame pixel values to 0-1 by dividing by 255.
Further, the video framing module uses a sliding window algorithm to divide the video into video segments of a fixed frame number.
Further, the window size of the sliding window algorithm is 16, and the step size is 8.
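For illustration, a minimal Python sketch of this sliding-window framing step is given below; the function name, the padding policy for short videos, and the handling of the trailing remainder are assumptions of the sketch and are not specified by the invention.

    import numpy as np

    def frame_video(frames, window=16, step=8):
        """Split a decoded video (sequence of frames) into fixed-length clips.

        A sliding window of `window` frames moves forward `step` frames at a
        time; videos shorter than one window are padded by repeating the last
        frame, and a trailing remainder is covered by one extra clip taken
        from the end (padding policy is an assumption). Assumes at least one frame.
        """
        frames = list(frames)
        if len(frames) < window:
            frames += [frames[-1]] * (window - len(frames))
        clips = []
        for start in range(0, len(frames) - window + 1, step):
            clips.append(np.stack(frames[start:start + window]))
        if (len(frames) - window) % step != 0:
            clips.append(np.stack(frames[-window:]))   # cover the trailing remainder
        return clips                                   # each clip: (16, H, W, C)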
Further, the spatio-temporal feature extraction module extracts each video segment into a 4096-dimensional spatio-temporal feature by using a modified pseudo-3D residual network (P3D) model, specifically: the modified P3D model adopts a residual network (ResNet-50) as its basic framework, with the residual structures replaced by the following modules;
the space-time feature extraction module comprises: a residual module, a P3D-A module, a P3D-B module, a P3D-C module, a series of 3D convolution modules, a pooling module and a full-connection module; wherein:
a residual module: uses one depthwise convolution Fd (kernel 3×3) and two point-wise convolutions Fp (kernel 1×1) to implement the function of a two-dimensional convolution; the formula is expressed as follows:
y=Fp(Fd(Fp(x)))+x
P3D-A module: uses one 2D spatial convolution Fs (kernel 1×3×3), one 1D temporal convolution Ft (kernel 3×1×1) and two point-wise convolutions Fp (kernel 1×1×1) in place of a 3D convolution; the formula is expressed as follows:
y=Fp(Ft(Fs(Fp(x))))+x
P3D-B module: composed of one 2D spatial convolution Fs (kernel 1×3×3), one 1D temporal convolution Ft (kernel 3×1×1) and two point-wise convolutions Fp (kernel 1×1×1); P3D-B arranges the 2D spatial convolution Fs and the 1D temporal convolution Ft in parallel, giving:
y=Fp(Ft(Fp(x))+Fs(Fp(x)))+x
P3D-C module: composed of one 2D spatial convolution Fs (kernel 1×3×3), one 1D temporal convolution Ft (kernel 3×1×1) and two point-wise convolutions Fp (kernel 1×1×1); P3D-C treats the 2D spatial convolution Fs and the 1D temporal convolution Ft in a residual-like manner: the output of the 2D spatial convolution Fs is taken as one branch, the result of further processing it with the 1D temporal convolution Ft is added to it, and the sum is fed into the point-wise convolution Fp; the formula is expressed as follows:
y=Fp(Fs(Fp(x))+Ft(Fs(Fp(x))))+x;
a 3D convolution module: extracts motion information between consecutive frames; for an input feature map X the output feature map is Y, and the formula of the 3D convolution is expressed as follows:
Y(t, h, w) = Σc Σi Σj Σk K(c, i, j, k) · X(c, t+i, h+j, w+k)
wherein K is the 3D convolution kernel, and C, T, H and W respectively denote the number of channels, the sequence length, the height and the width of the input X;
a pooling module: computing output for elements in a fixed-shape window (also known as a pooling window) of input data; the pooling layer that calculates the maximum value of an element within the pooling window is called maximum pooling (Maxpool), and the pooling layer that calculates the average value of the element within the pooling window is called average pooling (Avg Pool);
a full connection module: for outputting the final feature vector, the formula is as follows:
Y=WX+b
wherein, X is an input vector, Y is an output vector, W is a weight coefficient, and b is an offset value.
Further, the semantic feature mining module adopts a three-layer bidirectional long short-term memory network (BiLSTM) with memory cells (Memory Cells) to obtain the semantic association features in the spatio-temporal features; it specifically comprises:
a bidirectional long short-term memory network (BiLSTM) module: the network consists of a forward long short-term memory network and a backward long short-term memory network stacked together, and comprises an input gate it, an output gate ot, a forget gate ft and a cell state ct; the mathematical formalization of the long short-term memory network is expressed as follows:
it := sigm(Wxi·xt + Whi·ht-1)
ft := sigm(Wxf·xt + Whf·ht-1)
ot := sigm(Wxo·xt + Who·ht-1)
c̃t := tanh(Wxc·xt + Whc·ht-1)
ct := ft ⊙ ct-1 + it ⊙ c̃t
ht := ot ⊙ tanh(ct)
wherein ⊙ denotes element-wise multiplication, sigm is the sigmoid function, tanh is the hyperbolic tangent function, xt is the input data, ht-1 is the hidden state of the previous step, c̃t is the cell state update value, and Wx, Wh are the corresponding weights;
a memory cell (Memory Cell) module: connects two adjacent long short-term memory network layers end to end so that context information can be better mined. Specifically, the hidden state output h(l) of the upper layer is passed through the memory cell and serves as the input hidden state h(l+1) of the lower layer; the memory cell consists of a hyperbolic tangent function (tanh) and a fully connected layer (fc), and the formula is expressed as:
h(l+1) = fc(tanh(h(l)))
an attention mechanism (attention) module: allows the decoding module to perceive the weighted influence of the input sequence on the current predicted value; the formula is expressed as follows:
at = softmax(v·tanh(atten(Ht-1, X)))
where at is the attention vector representing the weights of the input sequence for the current word prediction, v is a weight coefficient, and atten(·,·) is a fully connected layer.
Furthermore, the decoding module consists of a one-layer unidirectional long short-term memory network plus a prediction module, and performs prediction on the semantic association features obtained by the semantic feature mining module; it comprises:
a long short-term memory network module: decodes the hidden state of the previous unit through the long short-term memory network, and computes the current unit state from the previously output word vector;
a prediction module: predicts the current output according to the current unit state zt and presents the predicted sign language recognition result to the user as text; the prediction module consists of a fully connected layer and a normalized exponential function (softmax), and the formula is expressed as follows:
p(yt | zt) = softmax(W·zt) = exp(W·zt) / Σ exp(W·zt)
where exp is the exponential function and W is the weight coefficient.
The invention relates to a sign language identification method based on space-time semantic features, which comprises the following steps:
1) acquiring sign language video data;
2) center-cropping each frame image of the sign language video data, and resizing the video frames to a uniform size;
3) normalizing each frame image of the uniformly sized sign language video data;
4) framing the normalized video data, dividing the video sequence into a series of video clips with a window size of 16 frames and a step size of 8;
5) passing the series of video clips through the space-time feature extraction module to obtain 4096-dimensional space-time feature vectors;
6) performing semantic mining on the obtained 4096-dimensional space-time feature vectors to obtain the semantic association features among video clips;
7) performing sign language recognition through the decoding module according to the semantic association features, and outputting the sign language recognition result to the user.
The invention has the beneficial effects that:
the invention can collect sign language videos through a common camera, and can facilitate the communication between deaf-mute people and ordinary people without the sign language expressors equipped with customized equipment, so that the ordinary people can understand the meaning expressed by the sign language without learning the sign language; hardware burden in user communication is reduced, and more natural intelligent device communication physical examination is brought to the user.
In the modeling of sign language or gestures, the extracted features are implicit; they are related to hand movements, hand motion trajectories and facial expressions, such as hand shapes and the motion trajectories during gesture transitions, so that sign language is perceived from multiple dimensions and a more accurate recognition result is achieved.
Drawings
FIG. 1 is a block diagram of a system according to the present invention.
FIG. 2 is a diagram illustrating a sign language recognition model according to the present invention.
FIG. 3 is a schematic diagram of a spatiotemporal feature model according to the present invention.
FIG. 4 is a diagram illustrating the operation of the memory cell of the present invention.
Detailed Description
In order to facilitate understanding of those skilled in the art, the present invention will be further described with reference to the following examples and drawings, which are not intended to limit the present invention.
Referring to fig. 1, a sign language recognition system based on spatiotemporal semantic features of the present invention includes: a video data acquisition module, a video data preprocessing module and a sign language recognition module; wherein:
the video data acquisition module acquires sign language video data through a common surveillance camera, an external camera, the built-in camera of an intelligent device (such as a smartphone or smart glasses) or other camera modules;
the video preprocessing module comprises: the video frame size adjusting module, the data normalizing module and the video framing module;
the video frame size adjusting module is used for zooming the size of the collected sign language video frame to a uniform size; it includes: the system comprises a center cutting module and a size adjusting module, wherein the center cutting module is used for cutting off the redundant blank places of video frames, and the size adjusting module adopts a size adjusting (reshape) function of an open source computer vision library (opencv) to adjust the cut images to be in a uniform size. Considering that the effective content of the recorded video is mainly in the central area, the central cropping is adopted to crop out the side ineffective parts in the image, and then the image is uniformly reduced to the size of (256, 3) by using the resizing (reshape) function of an open source computer vision library (opencv).
The data normalization module normalizes the pixel values of the resized video frames from the range 0-255 to 0-1; non-normalized data cannot be used directly for prediction, so the normalization module normalizes the pixel values of the cropped video frames from 0-255 to 0-1 using maximum-value normalization: the pixel values of all frames are divided by 255 to obtain the normalized data;
the video framing module divides sign language video data into video clips (clips) with fixed frame numbers by using a sliding window algorithm; considering that the lengths of the recorded videos may be inconsistent, the video needs to be framed, the video data is divided into a series of video frames with the length of 16 frames and the step length of 8, and the video framing module divides the video into video clips (clips) with fixed frame numbers by using a sliding window algorithm.
Referring to fig. 2, the sign language recognition module includes: the system comprises a space-time feature extraction module, a semantic feature mining module and a decoding module;
the space-time feature extraction module extracts a video segment into 4096-dimensional space-time features;
the semantic feature mining module is used for acquiring semantic association features in the space-time features;
and the decoding module is used for carrying out sign language recognition according to the semantic association characteristics and outputting corresponding text information to a user.
Referring to fig. 3, the spatio-temporal feature extraction module extracts a 4096-dimensional spatio-temporal feature from each video segment by using a modified pseudo-3D residual network (P3D) model, specifically: the modified P3D model adopts a residual network (ResNet-50) as its basic framework, with the residual structures replaced by the following modules;
the space-time feature extraction module comprises: a residual module, a P3D-A module, a P3D-B module, a P3D-C module, a series of 3D convolution modules, a pooling module and a full-connection module; wherein:
a residual module: uses one depthwise convolution Fd (kernel 3×3) and two point-wise convolutions Fp (kernel 1×1) to implement the function of a two-dimensional convolution; the formula is expressed as follows:
y=Fp(Fd(Fp(x)))+x
P3D-A module: uses one 2D spatial convolution Fs (kernel 1×3×3), one 1D temporal convolution Ft (kernel 3×1×1) and two point-wise convolutions Fp (kernel 1×1×1) in place of a 3D convolution; the formula is expressed as follows:
y=Fp(Ft(Fs(Fp(x))))+x
P3D-B module: composed of one 2D spatial convolution Fs (kernel 1×3×3), one 1D temporal convolution Ft (kernel 3×1×1) and two point-wise convolutions Fp (kernel 1×1×1); P3D-B arranges the 2D spatial convolution Fs and the 1D temporal convolution Ft in parallel, giving:
y=Fp(Ft(Fp(x))+Fs(Fp(x)))+x
P3D-C module: composed of one 2D spatial convolution Fs (kernel 1×3×3), one 1D temporal convolution Ft (kernel 3×1×1) and two point-wise convolutions Fp (kernel 1×1×1); P3D-C treats the 2D spatial convolution Fs and the 1D temporal convolution Ft in a residual-like manner: the output of the 2D spatial convolution Fs is taken as one branch, the result of further processing it with the 1D temporal convolution Ft is added to it, and the sum is fed into the point-wise convolution Fp; the formula is expressed as follows:
y=Fp(Fs(Fp(x))+Ft(Fs(Fp(x))))+x;
a 3D convolution module: extracts motion information between consecutive frames; for an input feature map X the output feature map is Y, and the formula of the 3D convolution is expressed as follows:
Y(t, h, w) = Σc Σi Σj Σk K(c, i, j, k) · X(c, t+i, h+j, w+k)
wherein K is the 3D convolution kernel, and C, T, H and W respectively denote the number of channels, the sequence length, the height and the width of the input X;
a pooling module: the pooling module can reduce information redundancy, improve the scale invariance and rotation invariance of the model, and prevent overfitting. The pooling module computes an output for the elements in one fixed-shape window (also called a pooling window) of the input data at a time; the pooling layer that calculates the maximum value of the elements within the pooling window is called maximum pooling (Max Pool), and the pooling layer that calculates the average value of the elements within the pooling window is called average pooling (Avg Pool);
a full connection module: for outputting the final feature vector, the formula is as follows:
Y=WX+b
wherein, X is an input vector, Y is an output vector, W is a weight coefficient, and b is an offset value.
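To make the module formulas above concrete, the following is a minimal PyTorch sketch of a P3D-A style bottleneck block implementing y = Fp(Ft(Fs(Fp(x)))) + x with a 1×3×3 spatial convolution, a 3×1×1 temporal convolution and 1×1×1 point-wise convolutions; the channel sizes, activation placement and the example input shape are assumptions of the sketch and do not reflect a fixed configuration of the invention.

    import torch
    import torch.nn as nn

    class P3DABlock(nn.Module):
        """P3D-A bottleneck: y = Fp(Ft(Fs(Fp(x)))) + x on a 5D tensor (N, C, T, H, W)."""
        def __init__(self, channels, mid_channels):
            super().__init__()
            self.fp_in = nn.Conv3d(channels, mid_channels, kernel_size=1)           # point-wise
            self.fs = nn.Conv3d(mid_channels, mid_channels, kernel_size=(1, 3, 3),
                                padding=(0, 1, 1))                                   # 2D spatial
            self.ft = nn.Conv3d(mid_channels, mid_channels, kernel_size=(3, 1, 1),
                                padding=(1, 0, 0))                                   # 1D temporal
            self.fp_out = nn.Conv3d(mid_channels, channels, kernel_size=1)           # point-wise
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            y = self.relu(self.fp_in(x))
            y = self.relu(self.fs(y))     # spatial first, then temporal, as in P3D-A
            y = self.relu(self.ft(y))
            y = self.fp_out(y)
            return self.relu(y + x)       # residual connection

    # Example: a batch of 2 clips, 3 channels, 16 frames, 112x112 resolution (assumed sizes)
    clip = torch.randn(2, 3, 16, 112, 112)
    block = P3DABlock(channels=3, mid_channels=16)
    out = block(clip)                     # same shape as the input clip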
Referring to fig. 4, the semantic feature mining module obtains the semantic association features in the spatio-temporal features by using a three-layer bidirectional long short-term memory network (BiLSTM) with memory cells; it specifically comprises:
a bidirectional long short-term memory network (BiLSTM) module: the network consists of a forward long short-term memory network and a backward long short-term memory network stacked together, and comprises an input gate it, an output gate ot, a forget gate ft and a cell state ct; the mathematical formalization of the long short-term memory network is expressed as follows:
it := sigm(Wxi·xt + Whi·ht-1)
ft := sigm(Wxf·xt + Whf·ht-1)
ot := sigm(Wxo·xt + Who·ht-1)
c̃t := tanh(Wxc·xt + Whc·ht-1)
ct := ft ⊙ ct-1 + it ⊙ c̃t
ht := ot ⊙ tanh(ct)
wherein ⊙ denotes element-wise multiplication, sigm is the sigmoid function, tanh is the hyperbolic tangent function, xt is the input data, ht-1 is the hidden state of the previous step, c̃t is the cell state update value, and Wx, Wh are the corresponding weights;
a memory cell (Memory Cell) module: connects two adjacent long short-term memory network layers end to end so that context information can be better mined. Specifically, the hidden state output h(l) of the upper layer is passed through the memory cell and serves as the input hidden state h(l+1) of the lower layer; the memory cell consists of a hyperbolic tangent function (tanh) and a fully connected layer (fc), and the formula is expressed as:
h(l+1) = fc(tanh(h(l)))
an attention mechanism (attention) module: allows the decoding module to perceive the weighted influence of the input sequence on the current predicted value; the formula is expressed as follows:
at = softmax(v·tanh(atten(Ht-1, X)))
where at is the attention vector representing the weights of the input sequence for the current word prediction, v is a weight coefficient, and atten(·,·) is a fully connected layer.
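A minimal PyTorch sketch of this semantic feature mining stage is given below: three stacked bidirectional LSTM layers whose adjacent layers are bridged by a memory cell (tanh followed by a fully connected layer), with a simple additive attention over the clip sequence; the layer sizes are assumptions, and the dependence of the attention on the decoder state Ht-1 is omitted for brevity.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SemanticMiner(nn.Module):
        """Three stacked BiLSTM layers; a memory cell (tanh + fc) transforms the
        output of each layer before it is fed to the next layer."""
        def __init__(self, in_dim=4096, hidden=512, layers=3):
            super().__init__()
            self.lstms = nn.ModuleList()
            self.memory_cells = nn.ModuleList()
            dim = in_dim
            for i in range(layers):
                self.lstms.append(nn.LSTM(dim, hidden, batch_first=True, bidirectional=True))
                if i < layers - 1:
                    # memory cell bridging two adjacent BiLSTM layers: fc(tanh(h))
                    self.memory_cells.append(nn.Linear(2 * hidden, 2 * hidden))
                dim = 2 * hidden
            self.att_fc = nn.Linear(2 * hidden, 2 * hidden)   # atten(.,.) as a fc layer
            self.v = nn.Linear(2 * hidden, 1, bias=False)

        def forward(self, x):              # x: (batch, n_clips, 4096) space-time features
            h = x
            for i, lstm in enumerate(self.lstms):
                h, _ = lstm(h)
                if i < len(self.memory_cells):
                    h = self.memory_cells[i](torch.tanh(h))
            # additive attention weights over the clip sequence
            a = F.softmax(self.v(torch.tanh(self.att_fc(h))), dim=1)   # (batch, n_clips, 1)
            return h, a

    features = torch.randn(2, 10, 4096)    # 2 videos, 10 clips each (assumed sizes)
    miner = SemanticMiner()
    context, attention = miner(features)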
The decoding module consists of a one-layer unidirectional long short-term memory network plus a prediction module, and performs prediction on the semantic association features obtained by the semantic feature mining module; it comprises:
a long short-term memory network module: decodes the hidden state of the previous unit through the long short-term memory network, and computes the current unit state from the previously output word vector;
a prediction module: predicts the current output according to the current unit state zt and presents the predicted sign language recognition result to the user as text; the prediction module consists of a fully connected layer and a normalized exponential function (softmax), and the formula is expressed as follows:
p(yt | zt) = softmax(W·zt) = exp(W·zt) / Σ exp(W·zt)
where exp is the exponential function, W is the weight coefficient, and the text corresponding to the maximum p(yt | zt) is taken as the output result.
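The decoding step can be sketched as follows: a one-layer unidirectional LSTM updates its unit state from the previously emitted word, and a fully connected layer followed by softmax yields p(yt | zt), the word with the maximum probability being emitted as text; the vocabulary size, embedding dimension and greedy decoding loop are assumptions of the sketch.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Decoder(nn.Module):
        """One-layer unidirectional LSTM decoder with a fc + softmax prediction head."""
        def __init__(self, vocab_size=1000, embed_dim=256, hidden=1024):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.cell = nn.LSTMCell(embed_dim, hidden)
            self.fc = nn.Linear(hidden, vocab_size)        # W in p(y_t | z_t) = softmax(W z_t)

        def step(self, prev_word, state):
            h, c = self.cell(self.embed(prev_word), state)  # update unit state from last word
            probs = F.softmax(self.fc(h), dim=-1)           # p(y_t | z_t)
            next_word = probs.argmax(dim=-1)                # greedy: word with maximum probability
            return next_word, (h, c)

    decoder = Decoder()
    # initial decoder state would come from the semantic feature mining module (assumption)
    state = (torch.zeros(1, 1024), torch.zeros(1, 1024))
    word = torch.tensor([1])                                # assumed <start> token id
    for _ in range(5):                                      # emit a few words greedily
        word, state = decoder.step(word, state)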
The invention relates to a sign language identification method based on space-time semantic features, which comprises the following steps:
1) acquiring sign language video data;
2) center-cropping each frame image of the sign language video data, and resizing the video frames to a uniform size;
3) normalizing each frame image of the uniformly sized sign language video data;
4) framing the normalized video data, dividing the video sequence into a series of video clips with a window size of 16 frames and a step size of 8;
5) passing the series of video clips through the space-time feature extraction module to obtain 4096-dimensional space-time feature vectors;
6) performing semantic mining on the obtained 4096-dimensional space-time feature vectors to obtain the semantic association features among video clips;
7) performing sign language prediction through the decoding module according to the semantic association features, and outputting the corresponding text information to the user.
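Putting the above steps together, a high-level sketch of the recognition pipeline might look as follows; it reuses the illustrative helpers sketched earlier (preprocess_frame, frame_video, the P3D block, SemanticMiner and Decoder), and the way the decoder is initialized and driven by the semantic features is simplified here, so this is one possible implementation under stated assumptions rather than the invention's reference code.

    import cv2
    import torch

    def recognize_sign_language(video_path, backbone, miner, decoder, max_words=20):
        """End-to-end sketch: video -> clips -> space-time features -> semantic
        features -> decoded word sequence (all component models are assumed)."""
        # steps 2) and 3): read, crop, resize and normalize every frame
        cap = cv2.VideoCapture(video_path)
        frames = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(preprocess_frame(frame))
        cap.release()

        # step 4): sliding-window framing into clips of 16 frames, step 8
        clips = frame_video(frames, window=16, step=8)

        with torch.no_grad():
            # step 5): one 4096-dimensional feature vector per clip (backbone is
            # assumed to be the P3D extractor producing a flat 4096-dim output)
            feats = []
            for clip in clips:
                x = torch.from_numpy(clip).permute(3, 0, 1, 2).unsqueeze(0).float()  # (1, C, T, H, W)
                feats.append(backbone(x).flatten())
            feats = torch.stack(feats).unsqueeze(0)          # (1, n_clips, 4096)

            # step 6): semantic association features over the clip sequence
            context, attention = miner(feats)

            # step 7): greedy decoding into a word sequence; in the full system the
            # decoder state would be initialized from / attended over `context`
            state = (context.new_zeros(1, 1024), context.new_zeros(1, 1024))
            word, words = torch.tensor([1]), []               # assumed <start> token id = 1
            for _ in range(max_words):
                word, state = decoder.step(word, state)
                words.append(int(word))
        return words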
It should be noted that in actual implementation the neural network model may vary, for example by adding or removing convolutional layers in a module, changing convolutional layer parameters, or changing the preprocessing procedure; as long as sign language recognition or gesture recognition is performed on the basis of the neural network itself, such implementations or changes should not be considered as going beyond the scope of the present invention.
While the invention has been described in terms of its preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (9)

1. A sign language recognition system based on spatiotemporal semantic features, comprising: a video data acquisition module, a video data preprocessing module and a sign language recognition module;
the video data acquisition module is used for acquiring sign language video data;
the video preprocessing module comprises: the video frame size adjusting module, the data normalizing module and the video framing module;
the video frame size adjusting module is used for zooming the size of the collected sign language video frame to a uniform size;
the data normalization module normalizes the pixel values in the video frame after the size adjustment from 0-255 to 0-1 range;
the video framing module is used for dividing sign language video data into video segments with fixed frame numbers;
the sign language recognition module comprises: the system comprises a space-time feature extraction module, a semantic feature mining module and a decoding module;
the space-time feature extraction module extracts a video segment into 4096-dimensional space-time features;
the semantic feature mining module is used for acquiring semantic association features in the space-time features;
and the decoding module is used for carrying out sign language recognition according to the semantic association characteristics and outputting a sign language recognition result to a user.
2. The spatiotemporal semantic feature-based sign language recognition system of claim 1, wherein the video frame resizing module comprises: a central cropping module and a size adjusting module, wherein the central cropping module is used for cropping away redundant blank regions of the video frames, and the size adjusting module adopts a resizing function of an open-source computer vision library to adjust the cropped images to a uniform size.
3. The system according to claim 1, wherein the data normalization module normalizes all video frame pixel values to 0-1 by dividing by 255.
4. The system according to claim 1, wherein the video framing module uses a sliding window algorithm to divide the video into fixed frame number video segments.
5. The system for sign language recognition based on spatiotemporal semantic features of claim 4 wherein the sliding window algorithm has a window size of 16 and a step size of 8.
6. The sign language recognition system based on spatiotemporal semantic features of claim 1, wherein the spatiotemporal feature extraction module adopts a modified pseudo-3D residual network model to extract each video segment as a 4096-dimensional spatiotemporal feature, specifically: the modified pseudo-3D residual network model adopts a residual network as the basic framework, with its residual structures replaced by the following modules;
the space-time feature extraction module comprises: a residual module, a P3D-A module, a P3D-B module, a P3D-C module, a series of 3D convolution modules, a pooling module and a full-connection module;
a residual module: uses one depthwise convolution Fd and two point-wise convolutions Fp to implement the function of a two-dimensional convolution; the formula is expressed as follows:
y=Fp(Fd(Fp(x)))+x;
P3D-A module: uses one 2D spatial convolution Fs, one 1D temporal convolution Ft and two point-wise convolutions Fp in place of a 3D convolution; the formula is expressed as follows:
y=Fp(Ft(Fs(Fp(x))))+x;
P3D-B module: composed of one 2D spatial convolution Fs, one 1D temporal convolution Ft and two point-wise convolutions Fp; P3D-B arranges the 2D spatial convolution Fs and the 1D temporal convolution Ft in parallel, yielding:
y=Fp(Ft(Fp(x))+Fs(Fp(x)))+x;
P3D-C module: composed of one 2D spatial convolution Fs, one 1D temporal convolution Ft and two point-wise convolutions Fp; P3D-C treats the 2D spatial convolution Fs and the 1D temporal convolution Ft in a residual-like manner: the output of the 2D spatial convolution Fs is taken as one branch, the result of further processing it with the 1D temporal convolution Ft is added to it, and the sum is fed into the point-wise convolution Fp; the formula is expressed as follows:
y=Fp(Fs(Fp(x))+Ft(Fs(Fp(x))))+x;
a 3D convolution module: extracts motion information between consecutive frames; for an input feature map X the output feature map is Y, and the formula of the 3D convolution is expressed as follows:
Y(t, h, w) = Σc Σi Σj Σk K(c, i, j, k) · X(c, t+i, h+j, w+k)
wherein K is the 3D convolution kernel, and C, T, H and W respectively denote the number of channels, the sequence length, the height and the width of the input X;
a pooling module: computing output for elements in a fixed-shape window of input data; the pooling layer for calculating the maximum value of the elements in the pooling window is called maximum pooling, and the pooling layer for calculating the average value of the elements in the pooling window is called average pooling;
a full connection module: for outputting the final feature vector, the formula is as follows:
Y=WX+b
wherein, X is an input vector, Y is an output vector, W is a weight coefficient, and b is an offset value.
7. The sign language recognition system based on spatiotemporal semantic features of claim 1, wherein the semantic feature mining module adopts a three-layer bidirectional long-short term memory network with memory cells to obtain semantic related features in spatiotemporal features; the method specifically comprises the following steps:
the bidirectional long short-term memory network module: the network consists of a forward long short-term memory network and a backward long short-term memory network stacked together, and comprises an input gate it, an output gate ot, a forget gate ft and a cell state ct; the mathematical formalization of the long short-term memory network is expressed as follows:
it := sigm(Wxi·xt + Whi·ht-1)
ft := sigm(Wxf·xt + Whf·ht-1)
ot := sigm(Wxo·xt + Who·ht-1)
c̃t := tanh(Wxc·xt + Whc·ht-1)
ct := ft ⊙ ct-1 + it ⊙ c̃t
ht := ot ⊙ tanh(ct)
wherein ⊙ denotes element-wise multiplication, sigm is the sigmoid function, tanh is the hyperbolic tangent function, xt is the input data, ht-1 is the hidden state of the previous step, c̃t is the cell state update value, and Wx, Wh are the corresponding weights;
a memory cell module: connects two adjacent long short-term memory network layers end to end; the hidden state output h(l) of the upper layer is passed through the memory cell and serves as the input hidden state h(l+1) of the lower layer; the memory cell consists of a hyperbolic tangent function tanh and a fully connected layer fc, and the formula is expressed as:
h(l+1) = fc(tanh(h(l)))
an attention mechanism module: allows the decoding module to perceive the weighted influence of the input sequence on the current predicted value; the formula is expressed as follows:
at = softmax(v·tanh(atten(Ht-1, X)))
where at is the attention vector representing the weights of the input sequence for the current predicted word, v is a weight coefficient, and atten(·,·) is a fully connected layer.
8. The sign language recognition system based on spatiotemporal semantic features of claim 1, wherein the decoding module adopts a layer of unidirectional long-short term memory network and a prediction module to predict semantic associated features obtained by a semantic mining module; the method specifically comprises the following steps:
long-short term memory network module: decoding the hidden state of the last unit through a long-short term memory network, and calculating the current unit state according to the last output word vector;
a prediction module: predicts the current output according to the current unit state zt and presents the predicted sign language recognition result to the user as text; the prediction module consists of a fully connected layer and a normalized exponential function (softmax), and the formula is expressed as follows:
p(yt | zt) = softmax(W·zt) = exp(W·zt) / Σ exp(W·zt)
where exp is the exponential function and W is the weight coefficient.
9. A sign language identification method based on space-time semantic features is characterized by comprising the following steps:
1) acquiring sign language video data;
2) center-cropping each frame image of the sign language video data, and resizing the video frames to a uniform size;
3) normalizing each frame image of the uniformly sized sign language video data;
4) framing the normalized video data, dividing the video sequence into a series of video segments with a window size of 16 frames and a step size of 8;
5) passing the series of video segments through the spatiotemporal feature module to obtain 4096-dimensional spatiotemporal feature vectors;
6) performing semantic mining on the obtained 4096-dimensional spatiotemporal feature vectors to obtain the semantic association features among video segments;
7) performing sign language recognition on the obtained semantic association features through the decoding module, and outputting the sign language recognition result to the user.
CN202010648991.XA 2020-07-07 2020-07-07 Sign language recognition system and method based on space-time semantic features Active CN111797777B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010648991.XA CN111797777B (en) 2020-07-07 2020-07-07 Sign language recognition system and method based on space-time semantic features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010648991.XA CN111797777B (en) 2020-07-07 2020-07-07 Sign language recognition system and method based on space-time semantic features

Publications (2)

Publication Number Publication Date
CN111797777A true CN111797777A (en) 2020-10-20
CN111797777B CN111797777B (en) 2023-10-17

Family

ID=72810405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010648991.XA Active CN111797777B (en) 2020-07-07 2020-07-07 Sign language recognition system and method based on space-time semantic features

Country Status (1)

Country Link
CN (1) CN111797777B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487939A (en) * 2020-11-26 2021-03-12 深圳市热丽泰和生命科技有限公司 Pure vision light weight sign language recognition system based on deep learning
CN113869178A (en) * 2021-09-18 2021-12-31 合肥工业大学 Feature extraction system and video quality evaluation system based on space-time dimension
CN114155562A (en) * 2022-02-09 2022-03-08 北京金山数字娱乐科技有限公司 Gesture recognition method and device

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120323521A1 (en) * 2009-09-29 2012-12-20 Commissariat A L'energie Atomique Et Aux Energies Al Ternatives System and method for recognizing gestures
CN105095866A (en) * 2015-07-17 2015-11-25 重庆邮电大学 Rapid behavior identification method and system
CN105389539A (en) * 2015-10-15 2016-03-09 电子科技大学 Three-dimensional gesture estimation method and three-dimensional gesture estimation system based on depth data
DE102016100075A1 (en) * 2016-01-04 2017-07-06 Volkswagen Aktiengesellschaft Method for evaluating gestures
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN108171198A (en) * 2018-01-11 2018-06-15 合肥工业大学 Continuous sign language video automatic translating method based on asymmetric multilayer LSTM
CN110110602A (en) * 2019-04-09 2019-08-09 南昌大学 A kind of dynamic sign Language Recognition Method based on three-dimensional residual error neural network and video sequence
CN110309761A (en) * 2019-06-26 2019-10-08 深圳市微纳集成电路与系统应用研究院 Continuity gesture identification method based on the Three dimensional convolution neural network with thresholding cycling element
CN111091045A (en) * 2019-10-25 2020-05-01 重庆邮电大学 Sign language identification method based on space-time attention mechanism
CN111126112A (en) * 2018-10-31 2020-05-08 顺丰科技有限公司 Candidate region determination method and device
US20200184278A1 (en) * 2014-03-18 2020-06-11 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network
CN111340006A (en) * 2020-04-16 2020-06-26 深圳市康鸿泰科技有限公司 Sign language identification method and system
CN111339837A (en) * 2020-02-08 2020-06-26 河北工业大学 Continuous sign language recognition method
CN111361700A (en) * 2020-03-23 2020-07-03 南京畅淼科技有限责任公司 Ship empty and heavy load identification method based on machine vision

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120323521A1 (en) * 2009-09-29 2012-12-20 Commissariat A L'energie Atomique Et Aux Energies Al Ternatives System and method for recognizing gestures
US20200184278A1 (en) * 2014-03-18 2020-06-11 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
CN105095866A (en) * 2015-07-17 2015-11-25 重庆邮电大学 Rapid behavior identification method and system
CN105389539A (en) * 2015-10-15 2016-03-09 电子科技大学 Three-dimensional gesture estimation method and three-dimensional gesture estimation system based on depth data
DE102016100075A1 (en) * 2016-01-04 2017-07-06 Volkswagen Aktiengesellschaft Method for evaluating gestures
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN108171198A (en) * 2018-01-11 2018-06-15 合肥工业大学 Continuous sign language video automatic translating method based on asymmetric multilayer LSTM
CN111126112A (en) * 2018-10-31 2020-05-08 顺丰科技有限公司 Candidate region determination method and device
CN110110602A (en) * 2019-04-09 2019-08-09 南昌大学 A kind of dynamic sign Language Recognition Method based on three-dimensional residual error neural network and video sequence
CN110309761A (en) * 2019-06-26 2019-10-08 深圳市微纳集成电路与系统应用研究院 Continuity gesture identification method based on the Three dimensional convolution neural network with thresholding cycling element
CN111091045A (en) * 2019-10-25 2020-05-01 重庆邮电大学 Sign language identification method based on space-time attention mechanism
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network
CN111339837A (en) * 2020-02-08 2020-06-26 河北工业大学 Continuous sign language recognition method
CN111361700A (en) * 2020-03-23 2020-07-03 南京畅淼科技有限责任公司 Ship empty and heavy load identification method based on machine vision
CN111340006A (en) * 2020-04-16 2020-06-26 深圳市康鸿泰科技有限公司 Sign language identification method and system


Also Published As

Publication number Publication date
CN111797777B (en) 2023-10-17

Similar Documents

Publication Publication Date Title
CN111091045B (en) Sign language identification method based on space-time attention mechanism
CN111797777B (en) Sign language recognition system and method based on space-time semantic features
WO2021008320A1 (en) Sign language recognition method and apparatus, computer-readable storage medium, and computer device
EP3989111A1 (en) Video classification method and apparatus, model training method and apparatus, device and storage medium
CN109815826B (en) Method and device for generating face attribute model
EP4099220A1 (en) Processing apparatus, method and storage medium
CN109086706B (en) Motion recognition method based on segmentation human body model applied to human-computer cooperation
CN111310676A (en) Video motion recognition method based on CNN-LSTM and attention
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN104361316B (en) Dimension emotion recognition method based on multi-scale time sequence modeling
EP4276684A1 (en) Capsule endoscope image recognition method based on deep learning, and device and medium
CN108875836B (en) Simple-complex activity collaborative recognition method based on deep multitask learning
EP4006777A1 (en) Image classification method and device
CN111832393A (en) Video target detection method and device based on deep learning
CN114360067A (en) Dynamic gesture recognition method based on deep learning
Alam et al. Two dimensional convolutional neural network approach for real-time bangla sign language characters recognition and translation
Boukdir et al. Isolated video-based Arabic sign language recognition using convolutional and recursive neural networks
CN114724224A (en) Multi-mode emotion recognition method for medical care robot
CN115205336A (en) Feature fusion target perception tracking method based on multilayer perceptron
Zatout et al. Semantic scene synthesis: application to assistive systems
CN114724251A (en) Old people behavior identification method based on skeleton sequence under infrared video
CN112668543B (en) Isolated word sign language recognition method based on hand model perception
CN102663369B (en) Human motion tracking method on basis of SURF (Speed Up Robust Feature) high efficiency matching kernel
CN116797799A (en) Single-target tracking method and tracking system based on channel attention and space-time perception
Howell et al. Active vision techniques for visually mediated interaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant