CN111797777B - Sign language recognition system and method based on space-time semantic features - Google Patents


Info

Publication number
CN111797777B
CN111797777B (Application CN202010648991.XA)
Authority
CN
China
Prior art keywords
module
sign language
video
convolution
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010648991.XA
Other languages
Chinese (zh)
Other versions
CN111797777A (en)
Inventor
殷亚凤
甘世维
谢磊
陆桑璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202010648991.XA priority Critical patent/CN111797777B/en
Publication of CN111797777A publication Critical patent/CN111797777A/en
Application granted granted Critical
Publication of CN111797777B publication Critical patent/CN111797777B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/32 Normalisation of the pattern dimensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language

Abstract

The invention discloses a sign language recognition system and method based on space-time semantic features. The system comprises a video data acquisition module, a video data preprocessing module and a sign language recognition module. The video data acquisition module is used for acquiring sign language video data; the video data preprocessing module is used for preprocessing the video data; and the sign language recognition module is used for carrying out sign language recognition and outputting a prediction result. The invention bridges the semantic gap between sign language image data and text data, enables convenient communication with deaf-mute people, and uses a state-of-the-art neural network as the translation tool to ensure translation accuracy. In addition, the method can serve as a medium for human-computer interaction, controlling intelligent devices by analyzing continuous gestures of a user.

Description

Sign language recognition system and method based on space-time semantic features
Technical Field
The invention belongs to the technical field of sign language identification, and particularly relates to a sign language identification system and method based on space-time semantic features.
Background
According to World Health Organization (WHO) data, about 466 million people worldwide had hearing impairment in 2018, and the number is expected to exceed 900 million by 2050. People with hearing impairment and deaf-mute people use sign language as their communication medium; however, few ordinary people master sign language, so communication barriers exist between deaf-mute people and the general public.
Current solutions mainly include manual translation, sign language translation based on gloves of different colors, and translation based on smart watches. Manual translation is feasible only on certain specific and formal occasions, is difficult to scale to the general public, and it is impractical to expect most people to take sign language courses. Glove-based and smart-watch-based translation introduce an extra equipment burden, requiring the signer to wear dedicated devices, which greatly degrades the communication experience of deaf-mute people.
In view of this, the vision-based method provided by the invention can obtain sign language video through an ordinary surveillance camera, an external camera, the built-in camera of an intelligent device (such as a smartphone or smart glasses) or other camera modules, without requiring any custom equipment from the signer. In addition, the method can be applied to novel human-computer interaction scenarios, where instruction control of intelligent devices is completed through analysis of the user's gestures.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a sign language recognition system and method based on space-time semantic features, so as to solve the communication barrier between deaf-mute people and the general public. The invention facilitates communication between deaf-mute people and ordinary people, so that ordinary people can understand the meaning expressed in sign language without learning sign language.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the invention discloses a sign language recognition system based on space-time semantic features, which comprises: the system comprises a video data acquisition module, a video data preprocessing module and a sign language recognition module; wherein, the liquid crystal display device comprises a liquid crystal display device,
the video data acquisition module is used for acquiring sign language video data; data are collected through an ordinary surveillance camera, an external camera, the built-in camera of an intelligent device (such as a smartphone or smart glasses) or other camera modules;
the video preprocessing module comprises: the device comprises a video frame size adjusting module, a data normalization module and a video framing module;
the video frame size adjusting module is used for scaling the size of the acquired sign language video frame to a uniform size;
the data normalization module normalizes the pixel value of the video frame with the adjusted size from 0-255 to 0-1;
the video framing module divides sign language video data into video clips (clips) with fixed frame numbers;
the sign language identification module comprises: the device comprises a space-time feature extraction module, a semantic feature mining module and a decoding module;
the space-time feature extraction module is used for extracting a 4096-dimensional space-time feature from each video clip;
the semantic feature mining module is used for acquiring semantic association features in the space-time features;
and the decoding module is used for carrying out sign language identification according to the semantic association characteristics and outputting a sign language identification result to a user.
Further, the video frame resizing module comprises a center cropping module and a resizing module; the center cropping module crops the redundant blank regions of the video frames, and the resizing module adjusts the cropped images to a uniform size using the resize (reshape) function of the open-source computer vision library (OpenCV).
Further, the data normalization module normalizes all video frame pixel values to 0-1 by dividing by 255.
Further, the video framing module uses a sliding window algorithm to divide the video into fixed frame number video segments.
Further, the window size of the sliding window algorithm is 16, and the step size is 8.
Further, the spatio-temporal feature extraction module adopts a modified pseudo-3D residual network (P3D) model to extract each video clip as a 4096-dimensional spatio-temporal feature; specifically, the modified P3D model adopts a residual network (ResNet-50) as the basic framework, with its residual structures replaced by the following modules;
the space-time feature extraction module comprises: a residual module, a P3D-A module, a P3D-B module, a P3D-C module, and a series of 3D convolution modules, pooling modules and fully connected modules; wherein,
residual module: a depthwise convolution F_d (3×3 kernel) and two pointwise convolutions F_p (1×1 kernel) are used to implement the function of a two-dimensional convolution; the formula is as follows:
y = F_p(F_d(F_p(x))) + x
P3D-A module: a 2D spatial convolution F_s (1×3×3 kernel), a 1D temporal convolution F_t (3×1×1 kernel) and two pointwise convolutions F_p (1×1×1 kernel) are used instead of a 3D convolution; the formula is as follows:
y = F_p(F_t(F_s(F_p(x)))) + x
P3D-B module: composed of a 2D spatial convolution F_s (1×3×3 kernel), a 1D temporal convolution F_t (3×1×1 kernel) and two pointwise convolutions F_p (1×1×1 kernel); P3D-B arranges the 2D spatial convolution F_s and the 1D temporal convolution F_t in parallel, giving:
y = F_p(F_t(F_p(x)) + F_s(F_p(x))) + x
P3D-C module: composed of a 2D spatial convolution F_s (1×3×3 kernel), a 1D temporal convolution F_t (3×1×1 kernel) and two pointwise convolutions F_p (1×1×1 kernel); P3D-C processes the 2D spatial convolution F_s and the 1D temporal convolution F_t in series, similarly to the residual module, but additionally takes the output of F_s as a branch and adds it to the output of the 1D temporal convolution F_t as the input of the final pointwise convolution F_p; the formula is as follows:
y = F_p(F_s(F_p(x)) + F_t(F_s(F_p(x)))) + x;
3D convolution module: extracts motion information between successive frames; for an input feature map X, a 3D convolution with kernel K outputs a feature map Y,
where C, T, H and W denote, respectively, the number of channels, the sequence length, the height and the width of the input X;
pooling module: computes an output for the elements within a fixed-shape window (also called a pooling window) of the input data; the pooling layer that takes the maximum of the elements within the pooling window is called max pooling (Max Pool), and the pooling layer that takes the average of the elements within the pooling window is called average pooling (Avg Pool);
fully connected module: used for outputting the final feature vector; the formula is as follows:
Y=WX+b
wherein X is an input vector, Y is an output vector, W is a weight coefficient, and b is a bias value.
Further, the semantic feature mining module acquires the semantic association features in the space-time features using a three-layer bidirectional long short-term memory network (BiLSTM) with memory cells; it specifically comprises:
bidirectional long short-term memory (BiLSTM) module: consists of a forward LSTM and a backward LSTM stacked together, and contains an input gate i_t, an output gate o_t, a forget gate f_t and a cell state c_t; the mathematical formalization of the LSTM is expressed as follows:
i_t := sigm(W_xi x_t + W_hi h_{t-1})
f_t := sigm(W_xf x_t + W_hf h_{t-1})
o_t := sigm(W_xo x_t + W_ho h_{t-1})
c̃_t := tanh(W_xc x_t + W_hc h_{t-1})
c_t := f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t := o_t ⊙ tanh(c_t)
where ⊙ denotes element-wise multiplication, sigm is the sigmoid function, tanh is the hyperbolic tangent function, x_t is the input data, h_{t-1} is the hidden node of the previous step, c̃_t is the cell state update value, and W_x and W_h are the corresponding weights;
memory cell module: connects two adjacent LSTM layers end to end so that context information can be better mined; specifically, the hidden-state output of the upper layer is passed through the memory cell and used as the hidden-state input of the next layer; the memory cell consists of a hyperbolic tangent function (tanh) and a fully connected layer (fc), expressed as h_in^(l+1) = tanh(fc(h_out^(l)));
attention mechanism module: lets the decoding module perceive the weight of the input sequence on the current predicted value; the formula is expressed as follows:
a_t = softmax(v · tanh(atten(H_{t-1}, X)))
where a_t is the attention vector representing the weight of the input sequence on the current word prediction, v is a weight coefficient, and atten is a fully connected layer.
Further, the decoding module consists of a single-layer unidirectional long short-term memory network plus a prediction module, and predicts from the semantic association features acquired by the semantic feature mining module; it comprises:
LSTM module: decodes the hidden state of the previous unit through the long short-term memory network, and computes the current unit state from the previously output word vector;
prediction module: predicts the current output according to the current unit state, and displays the predicted sign language recognition result to the user in the form of text information; it consists of a fully connected layer and a normalized exponential function (softmax), expressed as p(y_t | z_t) = softmax(W z_t),
where exp denotes the exponential function used inside the softmax and W is a weight coefficient.
The invention discloses a sign language identification method based on space-time semantic features, which comprises the following steps:
1) Acquiring sign language video data;
2) Performing center cropping on each frame of the sign language video data and adjusting the video frames to a uniform size;
3) Normalizing each frame of the uniformly sized sign language video data;
4) Framing the normalized video data: the video sequence is divided into a series of video clips with a window size of 16 frames and a step size of 8;
5) Extracting each video clip into a 4096-dimensional spatio-temporal feature vector through the spatio-temporal feature extraction module;
6) Performing semantic mining on the obtained 4096-dimensional spatio-temporal feature vectors to obtain the semantic association features between video clips;
7) Carrying out sign language recognition through the decoding module according to the semantic association features, and outputting the sign language recognition result to the user.
The invention has the beneficial effects that:
the invention can collect sign language video through the common camera, and can facilitate communication between the deaf and dumb person and the ordinary person without the need of custom equipment for sign language expressive person, so that the ordinary person can understand the meaning expressed by sign language without learning sign language; the hardware burden in user communication is reduced, and more natural intelligent device communication physical examination is brought to the user.
In modeling sign language or gestures, the proposed method extracts multiple implicit features related to hand actions, hand motion trajectories and facial expressions, such as gesture shapes and the motion trajectories during gesture transitions, so that sign language is perceived from multiple dimensions and a more accurate recognition effect is achieved.
Drawings
FIG. 1 is a schematic diagram of a system module according to the present invention.
FIG. 2 is a schematic diagram of a sign language recognition model according to the present invention.
FIG. 3 is a schematic diagram of the space-time feature model of the present invention.
FIG. 4 is a schematic diagram showing the operation of the memory cell according to the present invention.
Detailed Description
The invention will be further described below with reference to the embodiments and the accompanying drawings, which are not intended to limit the scope of the invention.
Referring to fig. 1, a sign language recognition system based on spatiotemporal semantic features of the present invention includes: a video data acquisition module, a video data preprocessing module and a sign language recognition module; wherein,
the video data acquisition module is used for acquiring sign language video data through an ordinary surveillance camera, an external camera, the built-in camera of an intelligent device (such as a smartphone or smart glasses) or other camera modules;
the video preprocessing module comprises: the device comprises a video frame size adjusting module, a data normalization module and a video framing module;
the video frame size adjusting module is used for scaling the size of the acquired sign language video frame to a uniform size; it comprises the following steps: the device comprises a central clipping module and a size adjusting module, wherein the central clipping module is used for clipping redundant blank places of video frames, and the size adjusting module adjusts the clipped images to a uniform size by adopting a size adjusting (reshape) function of an open source computer vision library (opencv). Considering that the recorded video content is mainly in the central area, the central cropping is used to crop out the side ineffective parts of the image, and then the image is uniformly reduced to the size of 256,256,3 by using the resizing (reshape) function of the open source computer vision library (opencv).
The data normalization module normalizes the pixel values of the resized video frames from the range 0-255 to the range 0-1. Since non-normalized data cannot be used directly for prediction, maximum normalization is adopted: all frame pixel values are divided by 255 to obtain the normalized data.
the video framing module is used for dividing sign language video data into video clips (clips) with fixed frame numbers by using a sliding window algorithm; considering that the lengths of recorded videos may be inconsistent, a framing process needs to be performed on the video, the video data is divided into a series of video frames with a length of 16 frames and a step length of 8, and the video framing module uses a sliding window algorithm to divide the video into video clips (clips) with fixed frames.
Referring to fig. 2, the sign language recognition module includes: the device comprises a space-time feature extraction module, a semantic feature mining module and a decoding module;
the space-time feature extraction module is used for extracting a 4096-dimensional space-time feature from each video clip;
the semantic feature mining module is used for acquiring semantic association features in the space-time features;
and the decoding module is used for identifying sign language according to the semantic association characteristics and outputting corresponding text information to a user.
Referring to fig. 3, the spatio-temporal feature extraction module adopts a modified pseudo-3D residual network (P3D) model to extract each video clip as a 4096-dimensional spatio-temporal feature; specifically, the modified P3D model adopts a residual network (ResNet-50) as the basic framework, with its residual structures replaced by the following modules;
the space-time feature extraction module comprises: a residual module, a P3D-A module, a P3D-B module, a P3D-C module, and a series of 3D convolution modules, pooling modules and fully connected modules; wherein,
residual module: a depthwise convolution F_d (3×3 kernel) and two pointwise convolutions F_p (1×1 kernel) are used to implement the function of a two-dimensional convolution; the formula is as follows:
y = F_p(F_d(F_p(x))) + x
P3D-A module: a 2D spatial convolution F_s (1×3×3 kernel), a 1D temporal convolution F_t (3×1×1 kernel) and two pointwise convolutions F_p (1×1×1 kernel) are used instead of a 3D convolution; the formula is as follows:
y = F_p(F_t(F_s(F_p(x)))) + x
P3D-B module: composed of a 2D spatial convolution F_s (1×3×3 kernel), a 1D temporal convolution F_t (3×1×1 kernel) and two pointwise convolutions F_p (1×1×1 kernel); P3D-B arranges the 2D spatial convolution F_s and the 1D temporal convolution F_t in parallel, giving:
y = F_p(F_t(F_p(x)) + F_s(F_p(x))) + x
P3D-C module: composed of a 2D spatial convolution F_s (1×3×3 kernel), a 1D temporal convolution F_t (3×1×1 kernel) and two pointwise convolutions F_p (1×1×1 kernel); P3D-C processes the 2D spatial convolution F_s and the 1D temporal convolution F_t in series, similarly to the residual module, but additionally takes the output of F_s as a branch and adds it to the output of the 1D temporal convolution F_t as the input of the final pointwise convolution F_p; the formula is as follows:
y = F_p(F_s(F_p(x)) + F_t(F_s(F_p(x)))) + x;
3D convolution module: extracts motion information between successive frames; for an input feature map X, a 3D convolution with kernel K outputs a feature map Y,
where C, T, H and W denote, respectively, the number of channels, the sequence length, the height and the width of the input X;
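With K, C, T, H and W defined as above, a standard form of such a 3D convolution can be written as follows; the exact indexing convention is an assumption of this description rather than a quotation of the original formula:

Y(t, h, w) = \sum_{c=1}^{C} \sum_{i,j,k} K(c, i, j, k) \, X(c,\, t+i,\, h+j,\, w+k)

That is, each output value aggregates information over all input channels and over a local spatio-temporal neighbourhood of the clip.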
pooling module: the pooling module reduces information redundancy, improves the scale invariance and rotation invariance of the model, and prevents overfitting. Each time, it computes an output for the elements within a fixed-shape window (also called a pooling window) of the input data; the pooling layer that takes the maximum of the elements within the pooling window is called max pooling (Max Pool), and the pooling layer that takes the average of the elements within the pooling window is called average pooling (Avg Pool);
fully connected module: used for outputting the final feature vector; the formula is as follows:
Y=WX+b
wherein X is an input vector, Y is an output vector, W is a weight coefficient, and b is a bias value.
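To make the factorized residual structure above concrete, a minimal PyTorch sketch of a P3D-A style block is given below (pointwise F_p, spatial 1×3×3 convolution F_s, temporal 3×1×1 convolution F_t, pointwise F_p, plus the shortcut connection). The channel sizes, the omission of batch normalization and the example input shape are assumptions of this sketch rather than the exact configuration of the modified P3D model.

import torch
import torch.nn as nn

class P3DABlock(nn.Module):
    """P3D-A style residual block: F_p -> F_s (1x3x3) -> F_t (3x1x1) -> F_p, plus shortcut."""
    def __init__(self, channels, bottleneck):
        super().__init__()
        self.reduce = nn.Conv3d(channels, bottleneck, kernel_size=1)             # pointwise F_p
        self.spatial = nn.Conv3d(bottleneck, bottleneck, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1))                               # 2D spatial F_s
        self.temporal = nn.Conv3d(bottleneck, bottleneck, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))                              # 1D temporal F_t
        self.expand = nn.Conv3d(bottleneck, channels, kernel_size=1)              # pointwise F_p
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (batch, C, T, H, W)
        y = self.relu(self.reduce(x))
        y = self.relu(self.spatial(y))
        y = self.relu(self.temporal(y))
        y = self.expand(y)
        return self.relu(y + x)                 # y = F_p(F_t(F_s(F_p(x)))) + x

# feature map for a 16-frame clip: (batch, channels, T, H, W)
clip = torch.randn(1, 64, 16, 56, 56)
print(P3DABlock(64, 16)(clip).shape)            # torch.Size([1, 64, 16, 56, 56])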
Referring to fig. 4, the semantic feature mining module acquires the semantic association features in the space-time features using a three-layer bidirectional long short-term memory network (BiLSTM) with memory cells; it specifically comprises:
bidirectional long short-term memory (BiLSTM) module: consists of a forward LSTM and a backward LSTM stacked together, and contains an input gate i_t, an output gate o_t, a forget gate f_t and a cell state c_t; the mathematical formalization of the LSTM is expressed as follows:
i_t := sigm(W_xi x_t + W_hi h_{t-1})
f_t := sigm(W_xf x_t + W_hf h_{t-1})
o_t := sigm(W_xo x_t + W_ho h_{t-1})
c̃_t := tanh(W_xc x_t + W_hc h_{t-1})
c_t := f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t := o_t ⊙ tanh(c_t)
where ⊙ denotes element-wise multiplication, sigm is the sigmoid function, tanh is the hyperbolic tangent function, x_t is the input data, h_{t-1} is the hidden node of the previous step, c̃_t is the cell state update value, and W_x and W_h are the corresponding weights;
memory cell module: connects two adjacent LSTM layers end to end so that context information can be better mined; specifically, the hidden-state output of the upper layer is passed through the memory cell and used as the hidden-state input of the next layer; the memory cell consists of a hyperbolic tangent function (tanh) and a fully connected layer (fc), expressed as h_in^(l+1) = tanh(fc(h_out^(l)));
attention mechanism module: lets the decoding module perceive the weight of the input sequence on the current predicted value; the formula is expressed as follows:
a_t = softmax(v · tanh(atten(H_{t-1}, X)))
where a_t is the attention vector representing the weight of the input sequence on the current word prediction, v is a weight coefficient, and atten is a fully connected layer.
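The semantic mining stage described above (three BiLSTM layers joined by tanh+fc memory cells, followed by an additive attention over the layer outputs) can be sketched in PyTorch as follows; the hidden size, the placement of a memory cell after every layer and the exact attention wiring are assumptions of this sketch, not the exact configuration of the invention.

import torch
import torch.nn as nn

class SemanticMiner(nn.Module):
    """Three BiLSTM layers joined by memory cells (tanh + fully connected layer), with attention."""
    def __init__(self, feat_dim=4096, hidden=512, layers=3):
        super().__init__()
        self.lstms, self.memory_cells = nn.ModuleList(), nn.ModuleList()
        dim = feat_dim
        for _ in range(layers):
            self.lstms.append(nn.LSTM(dim, hidden, bidirectional=True, batch_first=True))
            self.memory_cells.append(nn.Linear(2 * hidden, 2 * hidden))  # fc inside the memory cell
            dim = 2 * hidden
        self.attn = nn.Linear(2 * hidden, 2 * hidden)    # atten(.)
        self.v = nn.Linear(2 * hidden, 1, bias=False)    # weight vector v

    def forward(self, x):                                # x: (batch, num_clips, 4096)
        for lstm, cell in zip(self.lstms, self.memory_cells):
            x, _ = lstm(x)
            x = torch.tanh(cell(x))                      # memory cell: h_next = tanh(fc(h_prev))
        scores = self.v(torch.tanh(self.attn(x)))        # a = softmax(v * tanh(atten(H)))
        weights = torch.softmax(scores, dim=1)
        return x, weights

feats = torch.randn(2, 10, 4096)                         # 10 clips of 4096-d spatio-temporal features
ctx, attn = SemanticMiner()(feats)
print(ctx.shape, attn.shape)                             # torch.Size([2, 10, 1024]) torch.Size([2, 10, 1])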
The decoding module consists of a single-layer unidirectional long short-term memory network plus a prediction module, and predicts from the semantic association features acquired by the semantic feature mining module; it comprises:
LSTM module: decodes the hidden state of the previous unit through the long short-term memory network, and computes the current unit state from the previously output word vector;
prediction module: predicts the current output according to the current unit state, and displays the predicted sign language recognition result to the user in the form of text information; it consists of a fully connected layer and a normalized exponential function (softmax), expressed as p(y_t | z_t) = softmax(W z_t),
where exp denotes the exponential function used inside the softmax, W is a weight coefficient, and the text corresponding to the maximum value of p(y_t | z_t) is the output result.
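Correspondingly, a hedged sketch of the decoding stage is given below: a single-layer unidirectional LSTM whose state is projected through a fully connected layer and a softmax to give p(y_t | z_t), after which the word with the maximum probability is taken as the output. The vocabulary size and the input dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Single-layer unidirectional LSTM plus a fully connected + softmax prediction head."""
    def __init__(self, ctx_dim=1024, hidden=512, vocab_size=1000):
        super().__init__()
        self.lstm = nn.LSTM(ctx_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab_size)         # W in p(y_t|z_t) = softmax(W z_t)

    def forward(self, context):                                 # context: (batch, steps, ctx_dim)
        z, _ = self.lstm(context)
        logits = self.classifier(z)
        return torch.softmax(logits, dim=-1)                    # per-step word probabilities

probs = Decoder()(torch.randn(2, 10, 1024))
prediction = probs.argmax(dim=-1)                               # word with the maximum probability
print(prediction.shape)                                         # torch.Size([2, 10])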
The invention discloses a sign language identification method based on space-time semantic features, which comprises the following steps:
1) Acquiring sign language video data;
2) Performing center cropping on each frame of the sign language video data and adjusting the video frames to a uniform size;
3) Normalizing each frame of the uniformly sized sign language video data;
4) Framing the normalized video data: the video sequence is divided into a series of video clips with a window size of 16 frames and a step size of 8;
5) Extracting each video clip into a 4096-dimensional spatio-temporal feature vector through the spatio-temporal feature extraction module;
6) Performing semantic mining on the obtained 4096-dimensional spatio-temporal feature vectors to obtain the semantic association features between video clips;
7) Carrying out sign language prediction through the decoding module according to the semantic association features, and outputting the corresponding text information to the user (an illustrative end-to-end sketch follows this list).
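Putting steps 1)-7) together, an end-to-end sketch is given below; it reuses the illustrative helpers from the earlier sketches (preprocess_frames, sliding_window_clips, a SemanticMiner-style miner and the Decoder) and treats the spatio-temporal extractor and the vocabulary as placeholders supplied by the caller, so it is an outline under those assumptions rather than the exact implementation of the invention.

import torch

def recognize_sign_language(frames, extractor, miner, decoder, vocab):
    """Illustrative pipeline: frames -> clips -> 4096-d features -> semantics -> text."""
    frames = preprocess_frames(frames)                        # steps 2)-3): crop, resize, normalize
    clips = sliding_window_clips(frames, window=16, step=8)   # step 4): fixed-length clips
    clips = torch.from_numpy(clips).permute(0, 4, 1, 2, 3)    # (num_clips, 3, 16, 256, 256)
    with torch.no_grad():
        feats = extractor(clips)                              # step 5): (num_clips, 4096) features
        ctx, _ = miner(feats.unsqueeze(0))                    # step 6): semantic association features
        probs = decoder(ctx)                                  # step 7): per-step word probabilities
    return [vocab[i] for i in probs.argmax(dim=-1).squeeze(0).tolist()]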
It should be noted that in practice the neural network model may vary, for example convolution layers may be added to or removed from individual modules, convolution-layer parameters may be modified, or the preprocessing procedure may be changed; as long as sign language recognition or gesture recognition is performed based on a neural network, such implementations or modifications should not be considered as going beyond the scope of the present invention.
The present invention has been described in terms of the preferred embodiments thereof, and it should be understood by those skilled in the art that various modifications can be made without departing from the principles of the invention, and such modifications should also be considered as being within the scope of the invention.

Claims (7)

1. A sign language recognition system based on spatiotemporal semantic features, comprising: the system comprises a video data acquisition module, a video data preprocessing module and a sign language recognition module;
the video data acquisition module is used for acquiring sign language video data;
the video preprocessing module comprises: the device comprises a video frame size adjusting module, a data normalization module and a video framing module;
the video frame size adjusting module is used for scaling the size of the acquired sign language video frame to a uniform size;
the data normalization module normalizes pixel values in the video frame after the size adjustment from 0-255 to a range of 0-1;
the video framing module divides sign language video data into video clips with fixed frame numbers;
the sign language identification module comprises: the device comprises a space-time feature extraction module, a semantic feature mining module and a decoding module;
the space-time feature extraction module is used for extracting a 4096-dimensional space-time feature from each video clip;
the semantic feature mining module is used for acquiring semantic association features in the space-time features;
the decoding module is used for carrying out sign language identification according to the semantic association characteristics and outputting a sign language identification result to a user;
the space-time feature extraction module adopts a modified pseudo-3D residual network model to extract each video clip as a 4096-dimensional space-time feature; specifically, the modified pseudo-3D residual network model adopts a residual network as the basic framework, with its residual structures replaced by the following modules;
the space-time feature extraction module comprises: a residual module, a P3D-A module, a P3D-B module, a P3D-C module, and a series of 3D convolution modules, pooling modules and fully connected modules;
residual module: a depthwise convolution F_d and two pointwise convolutions F_p are used to implement the function of a two-dimensional convolution; the formula is expressed as follows:
y = F_p(F_d(F_p(x))) + x;
P3D-A module: a 2D spatial convolution F_s, a 1D temporal convolution F_t and two pointwise convolutions F_p are used instead of a 3D convolution; the formula is expressed as follows:
y = F_p(F_t(F_s(F_p(x)))) + x;
P3D-B module: composed of a 2D spatial convolution F_s, a 1D temporal convolution F_t and two pointwise convolutions F_p; P3D-B arranges the 2D spatial convolution F_s and the 1D temporal convolution F_t in parallel, giving:
y = F_p(F_t(F_p(x)) + F_s(F_p(x))) + x;
P3D-C module: composed of a 2D spatial convolution F_s, a 1D temporal convolution F_t and two pointwise convolutions F_p; P3D-C processes the 2D spatial convolution F_s and the 1D temporal convolution F_t in series, similarly to the residual module, but additionally takes the output of F_s as a branch and adds it to the output of the 1D temporal convolution F_t as the input of the final pointwise convolution F_p; the formula is expressed as follows:
y = F_p(F_s(F_p(x)) + F_t(F_s(F_p(x)))) + x;
3D convolution module: extracts motion information between successive frames; for an input feature map X, a 3D convolution with kernel K outputs a feature map Y,
where C, T, H and W denote, respectively, the number of channels, the sequence length, the height and the width of the input X;
pooling module: calculates an output for the elements within a fixed-shape window of the input data; the pooling layer that takes the maximum of the elements within the pooling window is called max pooling, and the pooling layer that takes the average of the elements within the pooling window is called average pooling;
fully connected module: used for outputting the final feature vector; the formula is as follows:
Y=WX+b
wherein X is an input vector, Y is an output vector, W is a weight coefficient, and b is a bias value;
the semantic feature mining module acquires the semantic association features in the space-time features using a three-layer bidirectional long short-term memory network with memory cells; it specifically comprises:
bidirectional long short-term memory network module: consists of a forward LSTM and a backward LSTM stacked together, and contains an input gate i_t, an output gate o_t, a forget gate f_t and a cell state c_t; the mathematical formalization of the LSTM is expressed as follows:
i_t := sigm(W_xi x_t + W_hi h_{t-1})
f_t := sigm(W_xf x_t + W_hf h_{t-1})
o_t := sigm(W_xo x_t + W_ho h_{t-1})
c̃_t := tanh(W_xc x_t + W_hc h_{t-1})
c_t := f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t := o_t ⊙ tanh(c_t)
where ⊙ denotes element-wise multiplication, sigm is the sigmoid function, tanh is the hyperbolic tangent function, x_t is the input data, h_{t-1} is the hidden node of the previous step, c̃_t is the cell state update value, and W_x and W_h are the corresponding weights;
memory cell module: connects two adjacent LSTM layers end to end; the hidden-state output of the upper layer is passed through the memory cell and used as the hidden-state input of the next layer; the memory cell consists of a hyperbolic tangent function tanh and a fully connected layer fc, expressed as h_in^(l+1) = tanh(fc(h_out^(l)));
attention mechanism module: lets the decoding module perceive the weight of the input sequence on the current predicted value.
2. The spatiotemporal semantic feature based sign language recognition system of claim 1, wherein the video frame resizing module comprises: a center cropping module and a resizing module, wherein the center cropping module is used for cropping the redundant blank regions of the video frames, and the resizing module adjusts the cropped images to a uniform size using the resize function of an open-source computer vision library.
3. The spatiotemporal semantic feature based sign language recognition system of claim 1, wherein the data normalization module normalizes all video frame pixel values to 0-1 by dividing by 255.
4. The sign language recognition system based on spatiotemporal semantic features of claim 1, wherein the video framing module uses a sliding window algorithm to divide video into fixed frame number video segments.
5. The sign language recognition system based on the temporal and spatial semantic features according to claim 4, wherein the sliding window algorithm has a window size of 16 and a step size of 8.
6. The sign language recognition system based on space-time semantic features according to claim 1, wherein the decoding module adopts a single-layer unidirectional long short-term memory network plus a prediction module, and predicts from the semantic association features acquired by the semantic feature mining module; it specifically comprises:
long-term memory network module: decoding the implicit state of the last unit through the long-short-term memory network, and calculating the current unit state according to the last output word vector;
prediction module: predicts the current output according to the current unit state, and displays the predicted sign language recognition result to the user in the form of text information; it consists of a fully connected layer and a normalized exponential function, expressed as p(y_t | z_t) = softmax(W z_t),
where exp denotes the exponential function used inside the softmax and W is a weight coefficient.
7. A sign language recognition method based on spatiotemporal semantic features, based on the system of claim 1, comprising the steps of:
1) Acquiring sign language video data;
2) Performing center cropping on each frame of the sign language video data and adjusting the video frames to a uniform size;
3) Normalizing each frame of the uniformly sized sign language video data;
4) Framing the normalized video data: the video sequence is divided into a series of video clips with a window size of 16 frames and a step size of 8;
5) Extracting each video clip into a 4096-dimensional spatio-temporal feature vector through the spatio-temporal feature extraction module;
6) Performing semantic mining on the obtained 4096-dimensional spatio-temporal feature vectors to obtain the semantic association features between video clips;
7) Carrying out sign language recognition on the acquired semantic association features through the decoding module, and outputting the sign language recognition result to the user.
CN202010648991.XA 2020-07-07 2020-07-07 Sign language recognition system and method based on space-time semantic features Active CN111797777B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010648991.XA CN111797777B (en) 2020-07-07 2020-07-07 Sign language recognition system and method based on space-time semantic features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010648991.XA CN111797777B (en) 2020-07-07 2020-07-07 Sign language recognition system and method based on space-time semantic features

Publications (2)

Publication Number Publication Date
CN111797777A CN111797777A (en) 2020-10-20
CN111797777B true CN111797777B (en) 2023-10-17

Family

ID=72810405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010648991.XA Active CN111797777B (en) 2020-07-07 2020-07-07 Sign language recognition system and method based on space-time semantic features

Country Status (1)

Country Link
CN (1) CN111797777B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487939A (en) * 2020-11-26 2021-03-12 深圳市热丽泰和生命科技有限公司 Pure vision light weight sign language recognition system based on deep learning
CN113869178B (en) * 2021-09-18 2022-07-15 合肥工业大学 Feature extraction system and video quality evaluation system based on space-time dimension
CN114155562A (en) * 2022-02-09 2022-03-08 北京金山数字娱乐科技有限公司 Gesture recognition method and device

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095866A (en) * 2015-07-17 2015-11-25 重庆邮电大学 Rapid behavior identification method and system
CN105389539A (en) * 2015-10-15 2016-03-09 电子科技大学 Three-dimensional gesture estimation method and three-dimensional gesture estimation system based on depth data
DE102016100075A1 (en) * 2016-01-04 2017-07-06 Volkswagen Aktiengesellschaft Method for evaluating gestures
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN108171198A (en) * 2018-01-11 2018-06-15 合肥工业大学 Continuous sign language video automatic translating method based on asymmetric multilayer LSTM
CN110110602A (en) * 2019-04-09 2019-08-09 南昌大学 A kind of dynamic sign Language Recognition Method based on three-dimensional residual error neural network and video sequence
CN110309761A (en) * 2019-06-26 2019-10-08 深圳市微纳集成电路与系统应用研究院 Continuity gesture identification method based on the Three dimensional convolution neural network with thresholding cycling element
CN111091045A (en) * 2019-10-25 2020-05-01 重庆邮电大学 Sign language identification method based on space-time attention mechanism
CN111126112A (en) * 2018-10-31 2020-05-08 顺丰科技有限公司 Candidate region determination method and device
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network
CN111339837A (en) * 2020-02-08 2020-06-26 河北工业大学 Continuous sign language recognition method
CN111340006A (en) * 2020-04-16 2020-06-26 深圳市康鸿泰科技有限公司 Sign language identification method and system
CN111361700A (en) * 2020-03-23 2020-07-03 南京畅淼科技有限责任公司 Ship empty and heavy load identification method based on machine vision

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2950713A1 (en) * 2009-09-29 2011-04-01 Movea Sa SYSTEM AND METHOD FOR RECOGNIZING GESTURES
US11195057B2 (en) * 2014-03-18 2021-12-07 Z Advanced Computing, Inc. System and method for extremely efficient image and pattern recognition and artificial intelligence platform

Also Published As

Publication number Publication date
CN111797777A (en) 2020-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant