CN111797777A - Sign language recognition system and method based on space-time semantic features - Google Patents
- Publication number
- CN111797777A (application number CN202010648991.XA)
- Authority
- CN
- China
- Prior art keywords
- module
- sign language
- video
- convolution
- semantic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/32—Normalisation of the pattern dimensions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
Abstract
The invention discloses a sign language recognition system and method based on space-time semantic features. The system comprises a video data acquisition module, a video data preprocessing module and a sign language recognition module. The video data acquisition module is used for acquiring sign language video data; the video data preprocessing module is used for preprocessing the video data; and the sign language recognition module is used for carrying out sign language recognition and outputting a prediction result. The invention eliminates the semantic gap between sign language image data and text data, enables convenient communication with deaf-mute people, and ensures the accuracy of translation by using a leading-edge neural network algorithm as the translation tool. In addition, the invention can also serve as a medium for human-computer interaction: intelligent devices can be command-controlled by analyzing a user's continuous gestures.
Description
Technical Field
The invention belongs to the technical field of sign language recognition, and particularly relates to a sign language recognition system and method based on space-time semantic features.
Background
According to World Health Organization (WHO) data, about 466 million people worldwide had disabling hearing loss in 2018, and the number is expected to exceed 900 million by 2050. Hearing-impaired and deaf-mute people use sign language as their communication medium, but few ordinary people can master sign language and communicate in it, so communication barriers exist between deaf-mute people and the general public.
The current solutions are mainly manual interpretation, sign language translation based on gloves of different colors, and translation based on smartwatches. Manual interpretation is feasible only on certain formal occasions and is difficult to make available to the public, and it is impractical for most people to take sign language courses. Glove-based and smartwatch-based translation introduce an extra equipment burden: the sign language presenter must wear certain devices, which greatly degrades the communication experience of deaf-mute people.
In view of this, the vision-based method provided by the invention can acquire the sign language video through a common monitoring camera, an external camera, a built-in camera of an intelligent device (such as a smart phone, smart glasses, and the like) or other camera modules, and does not require a sign language presenter to be equipped with a customized device. In addition, the invention can also be applied to a novel man-machine interaction scene, and command control of the gesture on the intelligent device is completed through gesture analysis of the user.
Disclosure of Invention
In view of the above disadvantages of the prior art, the present invention provides a sign language recognition system and method based on spatiotemporal semantic features to solve the problem that the deaf-mute and the general public have communication disorders in the prior art. The invention can facilitate the communication between the deaf and dumb people and the ordinary people, so that the ordinary people can understand the meaning expressed by the sign language without learning the sign language.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
the invention relates to a sign language recognition system based on space-time semantic features, which comprises: a video data acquisition module, a video data preprocessing module and a sign language recognition module; wherein:
the video data acquisition module is used for acquiring sign language video data; data are acquired through a common monitoring camera, an external camera, a built-in camera of intelligent equipment (such as a smart phone, smart glasses and the like) or other camera modules;
the video preprocessing module comprises: the video frame size adjusting module, the data normalizing module and the video framing module;
the video frame size adjusting module is used for scaling the collected sign language video frames to a uniform size;
the data normalization module normalizes the pixel value of the video frame after the size adjustment from 0-255 to 0-1 range;
the video framing module is used for dividing sign language video data into video clips (clips) with fixed frame numbers;
the sign language recognition module comprises: the system comprises a space-time feature extraction module, a semantic feature mining module and a decoding module;
the space-time feature extraction module extracts a video segment into 4096-dimensional space-time features;
the semantic feature mining module is used for acquiring semantic association features in the space-time features;
and the decoding module is used for carrying out sign language recognition according to the semantic association characteristics and outputting a sign language recognition result to a user.
Further, the video frame resizing module comprises: a center cropping module and a resizing module, wherein the center cropping module crops away the redundant blank regions of the video frames, and the resizing module uses the resize (reshape) function of the open source computer vision library (OpenCV) to scale the cropped images to a uniform size.
Further, the data normalization module normalizes all video frame pixel values to 0-1 by dividing by 255.
Further, the video framing module uses a sliding window algorithm to divide the video into video segments of a fixed frame number.
Further, the window size of the sliding window algorithm is 16, and the step size is 8.
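As a concrete illustration, the sliding-window framing with window size 16 and step size 8 can be sketched in a few lines of Python (the function name and the stand-in frame list are illustrative, not from the patent):

```python
def split_into_clips(frames, window=16, step=8):
    """Split a frame sequence into fixed-length clips with a sliding window.
    Window size 16 and step size 8 follow the parameters given in the text."""
    clips = []
    for start in range(0, len(frames) - window + 1, step):
        clips.append(frames[start:start + window])
    return clips

video = list(range(40))   # stand-in for 40 decoded frames
clips = split_into_clips(video)
# windows start at frames 0, 8, 16 and 24, giving 4 overlapping clips
```

Adjacent clips overlap by 8 frames, so motion at a clip boundary is still seen in full by the next clip.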
Further, the spatio-temporal feature extraction module uses a modified pseudo-3D residual network (P3D) model to extract each video segment into a 4096-dimensional spatio-temporal feature. Specifically, the modified P3D model adopts a residual network (ResNet-50) as its basic framework, and the residual structure is obtained by substituting the following modules;
the spatio-temporal feature extraction module comprises: a residual module, a P3D-A module, a P3D-B module, a P3D-C module, a series of 3D convolution modules, a pooling module and a fully connected module; wherein:
a residual module: uses a depthwise convolution F_d (convolution kernel 3 x 3) and two pointwise convolutions F_p (convolution kernel 1 x 1) to implement the function of a two-dimensional convolution; the formula is expressed as follows:
y = F_p(F_d(F_p(x))) + x
P3D-A module: uses a 2D spatial convolution F_s (convolution kernel 1 x 3 x 3), a 1D temporal convolution F_t (convolution kernel 3 x 1 x 1) and two pointwise convolutions F_p (convolution kernel 1 x 1 x 1) in place of a 3D convolution; the formula is as follows:
y = F_p(F_t(F_s(F_p(x)))) + x
P3D-B module: consists of a 2D spatial convolution F_s (convolution kernel 1 x 3 x 3), a 1D temporal convolution F_t (convolution kernel 3 x 1 x 1) and two pointwise convolutions F_p (convolution kernel 1 x 1 x 1); P3D-B arranges the 2D spatial convolution F_s and the 1D temporal convolution F_t in parallel, giving:
y = F_p(F_t(F_p(x)) + F_s(F_p(x))) + x
P3D-C module: consists of a 2D spatial convolution F_s (convolution kernel 1 x 3 x 3), a 1D temporal convolution F_t (convolution kernel 3 x 1 x 1) and two pointwise convolutions F_p (convolution kernel 1 x 1 x 1); P3D-C treats the 2D spatial convolution F_s as a branch: its output is both kept directly and further processed by the 1D temporal convolution F_t, the two results are added, and the sum is fed to a pointwise convolution F_p; the formula is as follows:
y = F_p(F_s(F_p(x)) + F_t(F_s(F_p(x)))) + x;
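The three P3D wiring patterns above differ only in how the spatial and temporal convolutions are composed. A minimal structural sketch, with each convolution modeled as an abstract scalar operator (toy stand-ins, not a real network):

```python
def p3d_a(x, Fs, Ft, Fp):
    # serial: y = Fp(Ft(Fs(Fp(x)))) + x
    return Fp(Ft(Fs(Fp(x)))) + x

def p3d_b(x, Fs, Ft, Fp):
    # parallel branches summed: y = Fp(Ft(Fp(x)) + Fs(Fp(x))) + x
    h = Fp(x)
    return Fp(Ft(h) + Fs(h)) + x

def p3d_c(x, Fs, Ft, Fp):
    # spatial output reused by the temporal branch:
    # y = Fp(Fs(Fp(x)) + Ft(Fs(Fp(x)))) + x
    s = Fs(Fp(x))
    return Fp(s + Ft(s)) + x

# toy stand-ins: Fs doubles, Ft adds 1, Fp is the identity
Fs, Ft, Fp = (lambda v: 2 * v), (lambda v: v + 1), (lambda v: v)
```

With these stand-ins, p3d_a(1.0, ...) evaluates (2*1 + 1) + 1; the point is only to make the three residual-unit topologies directly comparable.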
a 3D convolution module: extracts motion information between continuous frames; for an input feature map X, the output feature map is Y. The formula of the 3D convolution is expressed as follows:
Y(t, h, w) = sum over c, t', h', w' of K(c, t', h', w') * X(c, t + t', h + h', w + w')
wherein K is the 3D convolution kernel, and C, T, H and W respectively denote the number of channels, the sequence length, the height and the width of the input X;
a pooling module: computes an output for the elements in a fixed-shape window (also known as a pooling window) of the input data; the pooling layer that takes the maximum value of the elements within the pooling window is called maximum pooling (Max Pool), and the pooling layer that takes the average value of the elements within the pooling window is called average pooling (Avg Pool);
a full connection module: for outputting the final feature vector, the formula is as follows:
Y=WX+b
wherein X is the input vector, Y is the output vector, W is the weight coefficient, and b is the bias value.
Further, the semantic feature mining module adopts a three-layer bidirectional long short-term memory network (BiLSTM) with memory cells to obtain the semantic association features within the spatio-temporal features; specifically:
bidirectional long short-term memory network (BiLSTM) module: the network consists of a forward and a backward long short-term memory network stacked together, and comprises an input gate i_t, an output gate o_t, a forget gate f_t and a cell state c_t; the mathematical formalization of the long short-term memory network is expressed as follows:
i_t := sigm(W_xi * x_t + W_hi * h_{t-1})
f_t := sigm(W_xf * x_t + W_hf * h_{t-1})
o_t := sigm(W_xo * x_t + W_ho * h_{t-1})
g_t := tanh(W_xc * x_t + W_hc * h_{t-1})
c_t := f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t := o_t ⊙ tanh(c_t)
wherein ⊙ denotes element-wise multiplication, sigm is the sigmoid function, tanh is the hyperbolic tangent function, x_t is the input data, h_{t-1} is the previous hidden state, g_t is the cell-state update value, and W_x, W_h are the corresponding weights;
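One step of the LSTM gate equations above can be sketched for scalar states in plain Python (the uniform 0.5 weights are toy values chosen for illustration only):

```python
import math

def sigm(x):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W):
    i = sigm(W["xi"] * x_t + W["hi"] * h_prev)       # input gate
    f = sigm(W["xf"] * x_t + W["hf"] * h_prev)       # forget gate
    o = sigm(W["xo"] * x_t + W["ho"] * h_prev)       # output gate
    g = math.tanh(W["xc"] * x_t + W["hc"] * h_prev)  # cell-state update value
    c = f * c_prev + i * g                           # new cell state
    h = o * math.tanh(c)                             # new hidden state
    return h, c

W = {k: 0.5 for k in ("xi", "hi", "xf", "hf", "xo", "ho", "xc", "hc")}
h, c = lstm_step(1.0, 0.0, 0.0, W)
```

A real BiLSTM runs this recurrence over vectors in both temporal directions and concatenates the two hidden states.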
memory cell module: connects two adjacent long short-term memory network layers end to end so that context information can be better mined. Specifically, the hidden-state output h_t^l of the upper layer is passed through a memory cell to serve as the input h_t^{l+1} of the lower layer's hidden state; the memory cell consists of a hyperbolic tangent function (tanh) and a fully connected layer (fc), and the formula is expressed as:
h_t^{l+1} = fc(tanh(h_t^l))
attention mechanism (attention) module: enables the decoding module to perceive the weighted influence of the input sequence on the current predicted value; the formula is expressed as follows:
a_t = softmax(v * tanh(atten(H_{t-1}, X)))
wherein a_t is the attention vector, representing the weight of the input sequence on the current word prediction, v is a weight coefficient, and atten(·,·) is a fully connected layer.
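The attention computation can be sketched as follows; the multiplicative scoring function standing in for the fully connected layer atten(·,·) is a hypothetical simplification, not the patent's layer:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def attention_weights(h_prev, inputs, v, atten):
    # a_t = softmax(v * tanh(atten(H_{t-1}, X))), scored per input element
    return softmax([v * math.tanh(atten(h_prev, x)) for x in inputs])

atten = lambda h, x: h * x   # hypothetical stand-in scoring function
weights = attention_weights(0.5, [1.0, 2.0, 3.0], v=1.0, atten=atten)
```

The weights always sum to 1, so they act as a soft selection over the input sequence.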
Furthermore, the decoding module is a single-layer unidirectional long short-term memory network with an added prediction module, and performs prediction on the semantic association features obtained by the semantic feature mining module; it comprises:
a long short-term memory network module: decodes the hidden state of the previous cell through the long short-term memory network, and computes the current cell state from the previously output word vector;
a prediction module: predicts the current output from the current cell state and displays the predicted sign language recognition result to the user as text information; it consists of a fully connected layer and a normalized exponential function (softmax), and the formula is expressed as follows:
p(y_t | z_t) = softmax(W * z_t) = exp(W_{y_t} * z_t) / Σ_k exp(W_k * z_t)
wherein exp is the exponential function and W is the weight coefficient.
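The fully-connected-plus-softmax prediction step can be sketched as below; the three-word vocabulary and the weight values are illustrative only:

```python
import math

def predict_word(z, W, b, vocab):
    """Fully connected layer followed by softmax; returns the most
    probable word and its probability."""
    logits = [sum(w * zi for w, zi in zip(row, z)) + bj
              for row, bj in zip(W, b)]
    m = max(logits)                      # stabilize the exponentials
    probs = [math.exp(l - m) for l in logits]
    s = sum(probs)
    probs = [p / s for p in probs]
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs[best]

vocab = ["hello", "thanks", "goodbye"]
W = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
b = [0.0, 0.0, 0.0]
word, p = predict_word([0.2, 1.0], W, b, vocab)
```

The logits here are [0.2, 2.0, 0.6], so the second vocabulary entry wins.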
The invention further relates to a sign language recognition method based on space-time semantic features, which comprises the following steps:
1) acquiring sign language video data;
2) performing center cropping on each frame image of the sign language video data, and resizing the video frames to a uniform size;
3) normalizing each frame image of the uniformly sized sign language video data;
4) framing the normalized video data: the video sequence is divided into a series of video clips using a window of 16 frames and a step size of 8;
5) extracting each video clip with the spatio-temporal feature module to obtain a 4096-dimensional spatio-temporal feature vector;
6) performing semantic mining on the obtained 4096-dimensional spatio-temporal feature vectors to obtain the semantic association features between video clips;
7) performing sign language recognition through the decoding module according to the semantic association features, and outputting the sign language recognition result to the user.
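The seven steps above can be walked through at the level of tensor shapes; each stage is a stub that only tracks shapes. The 256 x 256 frame size is an assumption for illustration, while the 16-frame window, step size 8 and 4096-dimensional feature come from the method itself:

```python
def pipeline_shapes(num_frames, frame_hw=(256, 256)):
    """Trace the tensor shapes through steps 2)-6) of the method."""
    h, w = frame_hw
    # steps 2)-3): crop/resize each frame to (h, w, 3) and normalize to 0-1
    frames = [(h, w, 3)] * num_frames
    # step 4): sliding window of 16 frames with step 8
    n_clips = (num_frames - 16) // 8 + 1
    clips = [(16, h, w, 3)] * n_clips
    # step 5): each clip becomes one 4096-dim spatio-temporal feature
    st_feats = [(4096,)] * n_clips
    # step 6): semantic mining keeps one feature vector per clip
    sem_feats = [(4096,)] * n_clips
    return clips, st_feats, sem_feats

clips, st, sem = pipeline_shapes(48)   # a 48-frame video yields 5 clips
```

Step 7) would then decode the per-clip semantic features into a word sequence.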
The invention has the beneficial effects that:
the invention can collect sign language videos through a common camera, and can facilitate the communication between deaf-mute people and ordinary people without the sign language expressors equipped with customized equipment, so that the ordinary people can understand the meaning expressed by the sign language without learning the sign language; hardware burden in user communication is reduced, and more natural intelligent device communication physical examination is brought to the user.
In modeling sign language or gestures, the extracted features are implicit and relate to hand movements, hand movement trajectories and facial expressions, such as the gesture shape and the movement trajectory during gesture transitions; sign language is thus perceived from multiple dimensions, achieving a more accurate recognition effect.
Drawings
FIG. 1 is a block diagram of a system according to the present invention.
FIG. 2 is a diagram illustrating a sign language recognition model according to the present invention.
FIG. 3 is a schematic diagram of a spatiotemporal feature model according to the present invention.
FIG. 4 is a diagram illustrating the operation of the memory cell of the present invention.
Detailed Description
In order to facilitate understanding of those skilled in the art, the present invention will be further described with reference to the following examples and drawings, which are not intended to limit the present invention.
Referring to fig. 1, a sign language recognition system based on spatiotemporal semantic features of the present invention includes: a video data acquisition module, a video data preprocessing module and a sign language recognition module; wherein:
the video data acquisition module acquires sign language video data through a common surveillance camera, an external camera, the built-in camera of an intelligent device (such as a smartphone or smart glasses) or another camera module;
the video preprocessing module comprises: the video frame size adjusting module, the data normalizing module and the video framing module;
the video frame resizing module scales the collected sign language video frames to a uniform size. It comprises a center cropping module and a resizing module: the center cropping module crops away the redundant blank regions of the video frames, and the resizing module uses the resize (reshape) function of the open source computer vision library (OpenCV) to scale the cropped images to a uniform size. Considering that the effective content of a recorded video lies mainly in the central area, center cropping is used to remove the invalid parts at the sides of the image, and the image is then uniformly reduced to a size of (256, 256, 3) with the OpenCV resize (reshape) function.
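The center-crop arithmetic described above amounts to computing a crop box; a minimal sketch (the resize itself would be done with OpenCV's resize function, which is not reproduced here):

```python
def center_crop_box(width, height, crop_w, crop_h):
    """Return the (left, top, right, bottom) box of a centered crop,
    used to trim the blank side regions before resizing."""
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h

# crop a 480 x 480 square out of the center of a 640 x 480 frame
box = center_crop_box(640, 480, 480, 480)
```

Each frame is then cropped to this box and scaled down to the network's uniform input size.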
The data normalization module normalizes the pixel values of the resized video frames from the range 0-255 to 0-1. Since non-normalized data cannot be used directly for prediction, the normalization module applies maximum-value normalization: the pixel values of all frames are divided by 255 to obtain the normalized data;
the video framing module divides the sign language video data into video clips with a fixed number of frames using a sliding window algorithm. Considering that the lengths of recorded videos may differ, the video needs to be framed: the video data is divided into a series of clips of length 16 frames with a step size of 8.
Referring to fig. 2, the sign language recognition module includes: the system comprises a space-time feature extraction module, a semantic feature mining module and a decoding module;
the space-time feature extraction module extracts a video segment into 4096-dimensional space-time features;
the semantic feature mining module is used for acquiring semantic association features in the space-time features;
and the decoding module is used for carrying out sign language recognition according to the semantic association characteristics and outputting corresponding text information to a user.
Referring to fig. 3, the spatio-temporal feature extraction module uses a modified pseudo-3D residual network (P3D) model to extract each video segment into a 4096-dimensional spatio-temporal feature. Specifically, the modified P3D model adopts a residual network (ResNet-50) as its basic framework, and the residual structure is obtained by substituting the following modules;
the spatio-temporal feature extraction module comprises: a residual module, a P3D-A module, a P3D-B module, a P3D-C module, a series of 3D convolution modules, a pooling module and a fully connected module; wherein:
a residual module: uses a depthwise convolution F_d (convolution kernel 3 x 3) and two pointwise convolutions F_p (convolution kernel 1 x 1) to implement the function of a two-dimensional convolution; the formula is expressed as follows:
y = F_p(F_d(F_p(x))) + x
P3D-A module: uses a 2D spatial convolution F_s (convolution kernel 1 x 3 x 3), a 1D temporal convolution F_t (convolution kernel 3 x 1 x 1) and two pointwise convolutions F_p (convolution kernel 1 x 1 x 1) in place of a 3D convolution; the formula is as follows:
y = F_p(F_t(F_s(F_p(x)))) + x
P3D-B module: consists of a 2D spatial convolution F_s (convolution kernel 1 x 3 x 3), a 1D temporal convolution F_t (convolution kernel 3 x 1 x 1) and two pointwise convolutions F_p (convolution kernel 1 x 1 x 1); P3D-B arranges the 2D spatial convolution F_s and the 1D temporal convolution F_t in parallel, giving:
y = F_p(F_t(F_p(x)) + F_s(F_p(x))) + x
P3D-C module: consists of a 2D spatial convolution F_s (convolution kernel 1 x 3 x 3), a 1D temporal convolution F_t (convolution kernel 3 x 1 x 1) and two pointwise convolutions F_p (convolution kernel 1 x 1 x 1); P3D-C treats the 2D spatial convolution F_s as a branch: its output is both kept directly and further processed by the 1D temporal convolution F_t, the two results are added, and the sum is fed to a pointwise convolution F_p; the formula is as follows:
y = F_p(F_s(F_p(x)) + F_t(F_s(F_p(x)))) + x;
a 3D convolution module: extracts motion information between continuous frames; for an input feature map X, the output feature map is Y. The formula of the 3D convolution is expressed as follows:
Y(t, h, w) = sum over c, t', h', w' of K(c, t', h', w') * X(c, t + t', h + h', w + w')
wherein K is the 3D convolution kernel, and C, T, H and W respectively denote the number of channels, the sequence length, the height and the width of the input X;
a pooling module: the pooling module can reduce information redundancy, improve scale invariance and rotation invariance of the model, and prevent overfitting. The pooling module calculates the output for elements in one fixed-shape window (also called pooling window) of the input data at a time; the pooling layer for calculating the maximum value of the element within the pooling window is called Max Pool (Avg Pool), and the pooling layer for calculating the average value of the element within the pooling window is called average Pool (Avg Pool);
a full connection module: for outputting the final feature vector, the formula is as follows:
Y=WX+b
wherein X is the input vector, Y is the output vector, W is the weight coefficient, and b is the bias value.
Referring to fig. 4, the semantic feature mining module obtains the semantic association features within the spatio-temporal features using a three-layer bidirectional long short-term memory network (BiLSTM) with memory cells; specifically:
bidirectional long short-term memory network (BiLSTM) module: the network consists of a forward and a backward long short-term memory network stacked together, and comprises an input gate i_t, an output gate o_t, a forget gate f_t and a cell state c_t; the mathematical formalization of the long short-term memory network is expressed as follows:
i_t := sigm(W_xi * x_t + W_hi * h_{t-1})
f_t := sigm(W_xf * x_t + W_hf * h_{t-1})
o_t := sigm(W_xo * x_t + W_ho * h_{t-1})
g_t := tanh(W_xc * x_t + W_hc * h_{t-1})
c_t := f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t := o_t ⊙ tanh(c_t)
wherein ⊙ denotes element-wise multiplication, sigm is the sigmoid function, tanh is the hyperbolic tangent function, x_t is the input data, h_{t-1} is the previous hidden state, g_t is the cell-state update value, and W_x, W_h are the corresponding weights;
memory cell module: connects two adjacent long short-term memory network layers end to end so that context information can be better mined. Specifically, the hidden-state output h_t^l of the upper layer is passed through a memory cell to serve as the input h_t^{l+1} of the lower layer's hidden state; the memory cell consists of a hyperbolic tangent function (tanh) and a fully connected layer (fc), and the formula is expressed as:
h_t^{l+1} = fc(tanh(h_t^l))
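The memory-cell step — tanh followed by a fully connected layer carrying the upper layer's hidden state down to the next layer — can be sketched with toy weights (the identity weight matrix is illustrative only):

```python
import math

def memory_cell(h_upper, W, b):
    """tanh then a fully connected layer: h^{l+1} = fc(tanh(h^l))."""
    t = [math.tanh(v) for v in h_upper]
    return [sum(wij * tj for wij, tj in zip(row, t)) + bi
            for row, bi in zip(W, b)]

# pass a 2-dim hidden state through an identity fc layer
h_next = memory_cell([0.5, -0.5], W=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0])
```

With the identity weights, the output is simply tanh applied componentwise, which squashes the upper layer's activations into (-1, 1) before the next layer consumes them.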
attention mechanism (attention) module: enables the decoding module to perceive the weighted influence of the input sequence on the current predicted value; the formula is expressed as follows:
a_t = softmax(v * tanh(atten(H_{t-1}, X)))
wherein a_t is the attention vector, representing the weight of the input sequence on the current word prediction, v is a weight coefficient, and atten(·,·) is a fully connected layer.
The decoding module is a single-layer unidirectional long short-term memory network with an added prediction module, and performs prediction on the semantic association features obtained by the semantic feature mining module; it comprises:
a long short-term memory network module: decodes the hidden state of the previous cell through the long short-term memory network, and computes the current cell state from the previously output word vector;
a prediction module: predicts the current output from the current cell state and displays the predicted sign language recognition result to the user as text information; it consists of a fully connected layer and a normalized exponential function (softmax), and the formula is expressed as follows:
p(y_t | z_t) = softmax(W * z_t) = exp(W_{y_t} * z_t) / Σ_k exp(W_k * z_t)
wherein exp is the exponential function, W is the weight coefficient, and the text corresponding to the maximum of p(y_t | z_t) is the output result.
The invention further relates to a sign language recognition method based on space-time semantic features, which comprises the following steps:
1) acquiring sign language video data;
2) performing center cropping on each frame image of the sign language video data, and resizing the video frames to a uniform size;
3) normalizing each frame image of the uniformly sized sign language video data;
4) framing the normalized video data: the video sequence is divided into a series of video clips using a window of 16 frames and a step size of 8;
5) extracting each video clip with the spatio-temporal feature module to obtain a 4096-dimensional spatio-temporal feature vector;
6) performing semantic mining on the obtained 4096-dimensional spatio-temporal feature vectors to obtain the semantic association features between video clips;
7) performing sign language prediction through the decoding module according to the semantic association features, and outputting the corresponding text information to the user.
It should be noted that the neural network model may vary in actual implementations, for example by adding or deleting convolutional layers in a module, changing convolutional-layer parameters, or changing the preprocessing process; as long as sign language recognition or gesture recognition is still performed based on the neural network itself, such implementations or changes should not be considered beyond the scope of the present invention.
While the invention has been described in terms of its preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.
Claims (9)
1. A sign language recognition system based on spatiotemporal semantic features, comprising: the device comprises a video data acquisition module, a video data preprocessing module and a sign language recognition module;
the video data acquisition module is used for acquiring sign language video data;
the video data preprocessing module comprises: a video frame resizing module, a data normalization module, and a video framing module;
the video frame resizing module is used for scaling the collected sign language video frames to a uniform size;
the data normalization module normalizes the pixel values in the resized video frames from the range 0-255 to the range 0-1;
the video framing module is used for dividing the sign language video data into video segments with a fixed number of frames;
the sign language recognition module comprises: the system comprises a space-time feature extraction module, a semantic feature mining module and a decoding module;
the space-time feature extraction module extracts each video segment into a 4096-dimensional space-time feature;
the semantic feature mining module is used for acquiring semantic association features in the space-time features;
and the decoding module is used for carrying out sign language recognition according to the semantic association characteristics and outputting a sign language recognition result to a user.
2. The spatiotemporal semantic feature-based sign language recognition system of claim 1, wherein the video frame resizing module comprises: a center cropping module and a resizing module, wherein the center cropping module is used for cropping away redundant blank regions of the video frames, and the resizing module adopts the resize function of an open-source computer vision library to adjust the cropped images to a uniform size.
3. The system according to claim 1, wherein the data normalization module normalizes all video frame pixel values to 0-1 by dividing by 255.
4. The system according to claim 1, wherein the video framing module uses a sliding window algorithm to divide the video into fixed frame number video segments.
5. The system for sign language recognition based on spatiotemporal semantic features of claim 4 wherein the sliding window algorithm has a window size of 16 and a step size of 8.
6. The sign language recognition system based on spatiotemporal semantic features of claim 1, wherein the space-time feature extraction module adopts a modified pseudo-3D residual network model to extract each video segment into a 4096-dimensional space-time feature, specifically: the modified pseudo-3D residual network model adopts a residual network as the basic framework, with the residual structure replaced by the following modules;
the space-time feature extraction module comprises: a residual module, a P3D-A module, a P3D-B module, a P3D-C module, a series of 3D convolution modules, a pooling module, and a fully connected module;
a residual module: one depthwise convolution F_d and two point-by-point convolutions F_p are used to implement the function of a two-dimensional convolution, expressed by the formula:
y = F_p(F_d(F_p(x))) + x;
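As a concrete illustration of this structure (pointwise, depthwise, pointwise, plus the skip connection), here is a small numpy sketch; the 3x3 depthwise kernel size and the zero padding are assumptions made for the example.

```python
import numpy as np

def pointwise(x, w):
    # F_p: 1x1 convolution, i.e. a channel mix at every spatial position
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)
    return np.einsum('oc,chw->ohw', w, x)

def depthwise(x, k):
    # F_d: 3x3 depthwise convolution, one kernel per channel, zero padded
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    y = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                y[c, i, j] = np.sum(xp[c, i:i + 3, j:j + 3] * k[c])
    return y

def residual_block(x, w1, k, w2):
    # y = F_p(F_d(F_p(x))) + x, matching the formula above
    return pointwise(depthwise(pointwise(x, w1), k), w2) + x

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
eye = np.eye(2)
delta = np.zeros((2, 3, 3))
delta[:, 1, 1] = 1.0          # depthwise kernel that passes input through
y = residual_block(x, eye, delta, eye)   # identity weights -> y == 2 * x
```

The attraction of this factorization is cost: a depthwise pass touches each channel separately and the 1x1 passes mix channels, which together need far fewer multiplications than one dense 2D convolution.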
a P3D-A module: composed of one 2D spatial convolution F_s, one 1D temporal convolution F_t, and two point-by-point convolutions F_p in series, replacing a 3D convolution; expressed by the formula:
y = F_p(F_t(F_s(F_p(x)))) + x;
a P3D-B module: composed of one 2D spatial convolution F_s, one 1D temporal convolution F_t, and two point-by-point convolutions F_p; P3D-B places the 2D spatial convolution F_s and the 1D temporal convolution F_t in parallel, giving:
y = F_p(F_t(F_p(x)) + F_s(F_p(x))) + x;
a P3D-C module: composed of one 2D spatial convolution F_s, one 1D temporal convolution F_t, and two point-by-point convolutions F_p; P3D-C applies residual-style processing to the 2D spatial convolution F_s and the 1D temporal convolution F_t, taking the output of the 2D spatial convolution F_s as a branch, adding it to the result of the 1D temporal convolution F_t, and feeding the sum as the input of the point-by-point convolution F_p; expressed by the formula:
y = F_p(F_s(F_p(x)) + F_t(F_s(F_p(x)))) + x;
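The three variants differ only in how F_s and F_t are wired together. The sketch below shows just that topology; the callables stand in for the actual convolutions, which keeps the example self-contained.

```python
def p3d_a(x, Fs, Ft, Fp1, Fp2):
    # serial: spatial then temporal, per y = F_p(F_t(F_s(F_p(x)))) + x
    return Fp2(Ft(Fs(Fp1(x)))) + x

def p3d_b(x, Fs, Ft, Fp1, Fp2):
    # parallel: spatial and temporal branches summed before the last F_p
    h = Fp1(x)
    return Fp2(Ft(h) + Fs(h)) + x

def p3d_c(x, Fs, Ft, Fp1, Fp2):
    # spatial output feeds the temporal branch and is also added back
    h = Fs(Fp1(x))
    return Fp2(h + Ft(h)) + x

# toy stand-ins: scaling by 2 (spatial), by 3 (temporal), identity (F_p)
Fs = lambda v: 2 * v
Ft = lambda v: 3 * v
ident = lambda v: v
```

Running the toy stand-ins through each wiring makes the structural difference visible: the serial, parallel, and skip arrangements produce different outputs from the same input and the same component operations.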
a 3D convolution module: extracts motion information between consecutive frames; for an input feature map X, the output feature map is Y; the 3D convolution is expressed by the formula:
Y(t, h, w) = Σ_c Σ_i Σ_j Σ_k K(c, i, j, k)·X(c, t+i, h+j, w+k)
wherein K is the 3D convolution kernel, and C, T, H, and W respectively denote the number of channels, the sequence length, the height, and the width of the input X;
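A direct, unoptimized numpy rendering of this operation; "valid" (no-padding) boundaries and a single output channel are assumptions of the sketch.

```python
import numpy as np

def conv3d(x, k):
    # x: (C, T, H, W) input feature map, k: (C, t, h, w) kernel
    # returns the (T-t+1, H-h+1, W-w+1) output feature map Y
    C, T, H, W = x.shape
    _, t, h, w = k.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for a in range(out.shape[0]):
        for b in range(out.shape[1]):
            for c in range(out.shape[2]):
                out[a, b, c] = np.sum(x[:, a:a + t, b:b + h, c:c + w] * k)
    return out

y = conv3d(np.ones((1, 3, 3, 3)), np.ones((1, 2, 2, 2)))
# each output element sums a 1x2x2x2 block of ones, i.e. 8
```

Because the kernel slides along the time axis as well as the spatial axes, each output element mixes information from adjacent frames, which is exactly the motion information the module is meant to capture.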
a pooling module: computes an output for the elements in a fixed-shape window of the input data; a pooling layer that takes the maximum of the elements in the pooling window is called max pooling, and a pooling layer that takes the average of the elements in the pooling window is called average pooling;
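Both pooling variants can be sketched in a few lines of numpy; the 2x2 window and stride below are example values chosen for the sketch, not taken from the patent.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode='max'):
    # max pooling takes the window maximum, average pooling the mean
    H, W = x.shape
    oh, ow = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            win = x[i * stride:i * stride + size,
                    j * stride:j * stride + size]
            out[i, j] = win.max() if mode == 'max' else win.mean()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 1.],
              [0., 4., 1., 1.]])
```

On this 4x4 input, each 2x2 window collapses to one value, halving both spatial dimensions while keeping either the strongest activation (max) or the local average.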
a fully connected module: used for outputting the final feature vector, expressed by the formula:
Y = WX + b
wherein X is the input vector, Y is the output vector, W is the weight matrix, and b is the bias value.
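The formula Y = WX + b is a single matrix-vector product; the small dimensions below are arbitrary example values (the module's real output is the 4096-dimensional feature described above).

```python
import numpy as np

def fully_connected(x, W, b):
    # Y = W X + b
    return W @ x + b

W = np.array([[1., 0., 2.],
              [0., 1., 1.]])
b = np.array([1., -1.])
y = fully_connected(np.array([1., 2., 3.]), W, b)  # -> [8., 4.]
```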
7. The sign language recognition system based on spatiotemporal semantic features of claim 1, wherein the semantic feature mining module adopts a three-layer bidirectional long short-term memory network with memory cells to obtain the semantic association features in the space-time features; specifically comprising:
the bidirectional long short-term memory network module: the network consists of a forward long short-term memory network and a backward long short-term memory network stacked together, and comprises an input gate i_t, an output gate o_t, a forget gate f_t, and a cell state c_t; the long short-term memory network is mathematically formalized as follows:
i_t := sigm(W_xi·x_t + W_hi·h_{t-1})
f_t := sigm(W_xf·x_t + W_hf·h_{t-1})
o_t := sigm(W_xo·x_t + W_ho·h_{t-1})
c̃_t := tanh(W_xc·x_t + W_hc·h_{t-1})
c_t := f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t := o_t ⊙ tanh(c_t)
wherein "⊙" denotes element-wise multiplication, sigm is the sigmoid function, tanh is the hyperbolic tangent function, x_t is the input data, h_{t-1} is the previous hidden state, c̃_t is the cell state update value, and W_x and W_h are the corresponding weight matrices;
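One time step of these equations in numpy. Biases are omitted to match the formulas as written, and the all-zero demo weights are chosen only to make the result easy to verify by hand.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    i = sigm(W['xi'] @ x + W['hi'] @ h_prev)           # input gate
    f = sigm(W['xf'] @ x + W['hf'] @ h_prev)           # forget gate
    o = sigm(W['xo'] @ x + W['ho'] @ h_prev)           # output gate
    c_new = np.tanh(W['xc'] @ x + W['hc'] @ h_prev)    # cell update value
    c = f * c_prev + i * c_new                         # new cell state
    h = o * np.tanh(c)                                 # new hidden state
    return h, c

n = 2
W = {key: np.zeros((n, n))
     for key in ('xi', 'hi', 'xf', 'hf', 'xo', 'ho', 'xc', 'hc')}
h, c = lstm_step(np.zeros(n), np.zeros(n), np.ones(n), W)
# with all-zero weights every gate is sigm(0) = 0.5, so c = 0.5 * c_prev
```

A bidirectional layer simply runs this step forward and backward over the clip-feature sequence and concatenates the two hidden states at each position.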
a memory cell module: two adjacent long short-term memory network layers are connected end to end, and the hidden state h_t^(l) output by the previous layer is passed through a memory cell as the input of the hidden state of the next layer; the memory cell comprises a hyperbolic tangent function tanh and a fully connected layer fc, expressed by the formula:
x_t^(l+1) = fc(tanh(h_t^(l)));
an attention mechanism module: enables the decoding module to perceive the weight of each part of the input sequence on the current predicted value, expressed by the formula:
a_t = softmax(v·tanh(atten(H_{t-1}, X)))
wherein a_t is the attention vector representing the weights of the input sequence for the currently predicted word, v is a weight coefficient, and atten(·) is a fully connected layer.
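A numpy sketch of the attention weights; modeling atten(·) as one fully connected layer over the concatenated decoder state and each input feature is an assumption about the exact layout, which the claim does not pin down.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_weights(h_prev, X, Wa, v):
    # a_t = softmax(v · tanh(atten(H_{t-1}, X)))
    scores = np.array([v @ np.tanh(Wa @ np.concatenate([h_prev, x]))
                       for x in X])
    return softmax(scores)

h_prev = np.zeros(2)
X = [np.array([1., 0.]), np.array([0., 1.]), np.array([1., 1.])]
a = attention_weights(h_prev, X, np.zeros((3, 4)), np.ones(3))
# zero weights give equal scores, hence uniform attention over 3 inputs
```

The softmax guarantees the weights are positive and sum to one, so a_t can be read directly as how much each input clip feature should influence the next predicted word.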
8. The sign language recognition system based on spatiotemporal semantic features of claim 1, wherein the decoding module adopts a one-layer unidirectional long short-term memory network and a prediction module to predict on the semantic association features obtained by the semantic feature mining module; specifically comprising:
the long short-term memory network module: decodes the hidden state of the previous unit through the long short-term memory network, and calculates the current unit state from the previously output word vector;
a prediction module: predicts the current output from the current unit state, and presents the predicted sign language recognition result to the user as text information; the prediction module comprises a fully connected layer and a normalized exponential function, expressed by the formula:
p_t = softmax(W·h_t), where softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
wherein exp is the exponential function and W is the weight coefficient.
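The prediction step is a fully connected layer followed by a softmax over the vocabulary. A toy sketch follows; the vocabulary and weight values are invented for the example.

```python
import numpy as np

def predict_word(h, W, vocab):
    # fully connected layer, then the normalized exponential (softmax)
    logits = W @ h
    e = np.exp(logits - logits.max())
    probs = e / e.sum()
    return vocab[int(np.argmax(probs))], probs

vocab = ['hello', 'thanks', 'goodbye']
W = np.array([[0.1, 0.2],
              [1.5, 1.0],
              [0.0, 0.3]])
word, probs = predict_word(np.array([1., 1.]), W, vocab)  # -> 'thanks'
```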
9. A sign language recognition method based on space-time semantic features, characterized by comprising the following steps:
1) acquiring sign language video data;
2) performing center cropping on each frame image of the sign language video data, and adjusting the video frames to a uniform size;
3) normalizing each frame image of the uniformly sized sign language video data;
4) framing the normalized video data, dividing the video sequence into a series of video segments with a window size of 16 and a step length of 8;
5) passing the series of video segments through the space-time feature module to obtain a 4096-dimensional space-time feature vector;
6) performing semantic mining on the obtained 4096-dimensional space-time feature vector to obtain the semantic association features among the video segments;
7) performing sign language recognition on the obtained semantic association features through the decoding module, and outputting the sign language recognition result to the user.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010648991.XA CN111797777B (en) | 2020-07-07 | 2020-07-07 | Sign language recognition system and method based on space-time semantic features |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111797777A true CN111797777A (en) | 2020-10-20 |
CN111797777B CN111797777B (en) | 2023-10-17 |
Family
ID=72810405
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010648991.XA Active CN111797777B (en) | 2020-07-07 | 2020-07-07 | Sign language recognition system and method based on space-time semantic features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111797777B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112487939A (en) * | 2020-11-26 | 2021-03-12 | 深圳市热丽泰和生命科技有限公司 | Pure vision light weight sign language recognition system based on deep learning |
CN113869178A (en) * | 2021-09-18 | 2021-12-31 | 合肥工业大学 | Feature extraction system and video quality evaluation system based on space-time dimension |
CN114155562A (en) * | 2022-02-09 | 2022-03-08 | 北京金山数字娱乐科技有限公司 | Gesture recognition method and device |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120323521A1 (en) * | 2009-09-29 | 2012-12-20 | Commissariat A L'energie Atomique Et Aux Energies Al Ternatives | System and method for recognizing gestures |
CN105095866A (en) * | 2015-07-17 | 2015-11-25 | 重庆邮电大学 | Rapid behavior identification method and system |
CN105389539A (en) * | 2015-10-15 | 2016-03-09 | 电子科技大学 | Three-dimensional gesture estimation method and three-dimensional gesture estimation system based on depth data |
DE102016100075A1 (en) * | 2016-01-04 | 2017-07-06 | Volkswagen Aktiengesellschaft | Method for evaluating gestures |
CN107451552A (en) * | 2017-07-25 | 2017-12-08 | 北京联合大学 | A kind of gesture identification method based on 3D CNN and convolution LSTM |
CN108171198A (en) * | 2018-01-11 | 2018-06-15 | 合肥工业大学 | Continuous sign language video automatic translating method based on asymmetric multilayer LSTM |
CN110110602A (en) * | 2019-04-09 | 2019-08-09 | 南昌大学 | A kind of dynamic sign Language Recognition Method based on three-dimensional residual error neural network and video sequence |
CN110309761A (en) * | 2019-06-26 | 2019-10-08 | 深圳市微纳集成电路与系统应用研究院 | Continuity gesture identification method based on the Three dimensional convolution neural network with thresholding cycling element |
CN111091045A (en) * | 2019-10-25 | 2020-05-01 | 重庆邮电大学 | Sign language identification method based on space-time attention mechanism |
CN111126112A (en) * | 2018-10-31 | 2020-05-08 | 顺丰科技有限公司 | Candidate region determination method and device |
US20200184278A1 (en) * | 2014-03-18 | 2020-06-11 | Z Advanced Computing, Inc. | System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform |
CN111325099A (en) * | 2020-01-21 | 2020-06-23 | 南京邮电大学 | Sign language identification method and system based on double-current space-time diagram convolutional neural network |
CN111340006A (en) * | 2020-04-16 | 2020-06-26 | 深圳市康鸿泰科技有限公司 | Sign language identification method and system |
CN111339837A (en) * | 2020-02-08 | 2020-06-26 | 河北工业大学 | Continuous sign language recognition method |
CN111361700A (en) * | 2020-03-23 | 2020-07-03 | 南京畅淼科技有限责任公司 | Ship empty and heavy load identification method based on machine vision |
Also Published As
Publication number | Publication date |
---|---|
CN111797777B (en) | 2023-10-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111091045B (en) | Sign language identification method based on space-time attention mechanism | |
CN111797777B (en) | Sign language recognition system and method based on space-time semantic features | |
WO2021008320A1 (en) | Sign language recognition method and apparatus, computer-readable storage medium, and computer device | |
EP3989111A1 (en) | Video classification method and apparatus, model training method and apparatus, device and storage medium | |
CN109815826B (en) | Method and device for generating face attribute model | |
EP4099220A1 (en) | Processing apparatus, method and storage medium | |
CN109086706B (en) | Motion recognition method based on segmentation human body model applied to human-computer cooperation | |
CN111310676A (en) | Video motion recognition method based on CNN-LSTM and attention | |
CN112800903B (en) | Dynamic expression recognition method and system based on space-time diagram convolutional neural network | |
CN104361316B (en) | Dimension emotion recognition method based on multi-scale time sequence modeling | |
EP4276684A1 (en) | Capsule endoscope image recognition method based on deep learning, and device and medium | |
CN108875836B (en) | Simple-complex activity collaborative recognition method based on deep multitask learning | |
EP4006777A1 (en) | Image classification method and device | |
CN111832393A (en) | Video target detection method and device based on deep learning | |
CN114360067A (en) | Dynamic gesture recognition method based on deep learning | |
Alam et al. | Two dimensional convolutional neural network approach for real-time bangla sign language characters recognition and translation | |
Boukdir et al. | Isolated video-based Arabic sign language recognition using convolutional and recursive neural networks | |
CN114724224A (en) | Multi-mode emotion recognition method for medical care robot | |
CN115205336A (en) | Feature fusion target perception tracking method based on multilayer perceptron | |
Zatout et al. | Semantic scene synthesis: application to assistive systems | |
CN114724251A (en) | Old people behavior identification method based on skeleton sequence under infrared video | |
CN112668543B (en) | Isolated word sign language recognition method based on hand model perception | |
CN102663369B (en) | Human motion tracking method on basis of SURF (Speed Up Robust Feature) high efficiency matching kernel | |
CN116797799A (en) | Single-target tracking method and tracking system based on channel attention and space-time perception | |
Howell et al. | Active vision techniques for visually mediated interaction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||