CN110110602A - Dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences - Google Patents

Dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences

Info

Publication number
CN110110602A
Authority
CN
China
Prior art keywords
layer
sign language
video sequence
dimensional
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910282569.4A
Other languages
Chinese (zh)
Inventor
闵卫东
廖艳秋
熊鹏文
韩清
张愚
徐剑强
邹松
熊辛
汪琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang University
Original Assignee
Nanchang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang University filed Critical Nanchang University
Priority to CN201910282569.4A priority Critical patent/CN110110602A/en
Publication of CN110110602A publication Critical patent/CN110110602A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence

Abstract

The present invention provides a dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences. The method proposes a new model, B3D ResNet, built on a three-dimensional residual neural network, and comprises the following steps: step 1, in each video frame, detecting the position of the hand using a Faster R-CNN model and segmenting the hand from the background; step 2, using the B3D ResNet model to extract spatio-temporal features of the gesture from the input video sequence and to analyse the resulting feature sequence; step 3, classifying the input video sequence so that the gesture can be identified, thereby effectively realising dynamic sign language recognition. By analysing the spatio-temporal features of the video sequence, the invention can extract effective spatio-temporal feature sequences of dynamic gestures and thus recognise different gestures, and it also performs well on complex or similar sign language. Experimental results on test data sets show that the invention can accurately and effectively distinguish different signs and pairs of similar gestures.

Description

Dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences
Technical field
The present invention relates to the technical field of sign language recognition, and specifically to a dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences.
Background technique
Sign language recognition is an effective technology for communication between deaf and hearing people, and with the continuing deepening of human-computer interaction research it has become a hot topic. In recent years, automatic sign language recognition systems, which convert gestures into text or speech, have created a new mode of human-computer interaction that can be realised through computer-aided techniques. There are already many successful applications in this area, such as sign language translation, sign language tutoring and special education, all of which help deaf people communicate fluently with others. On the other hand, a sign is generally composed of a series of actions and is a rapid movement with similar features, so static sign language recognition techniques struggle with the complexity and variability of sign language motion. Research on dynamic sign language recognition is therefore an effective way to solve these problems. Vision-based dynamic gesture recognition is flexible, scalable and low-cost, and is a current focus of gesture-interaction research. However, dynamic sign language recognition still faces challenges in handling complex finger movements against cluttered body backgrounds. Another difficulty is how to extract the most effective features from images or video sequences. In addition, selecting a suitable classifier is also a key factor in obtaining accurate recognition results.
To help deaf people communicate normally in daily life, more and more researchers are working to improve on these problems, and many results have been achieved in dynamic sign language recognition. Methods for dynamic sign language recognition fall mainly into two categories: recognition based on gesture shape and motion trajectory, and recognition based on sign language video sequences.
Traditional dynamic sign language recognition identifies gestures mainly from shape features and motion-trajectory features of the gesture, but these features cannot fully meet the requirements of practical dynamic sign language recognition. With the rapid development of deep learning theory, data-driven methods have shown outstanding advantages in object detection and gesture recognition. Unlike methods based on gesture shape and motion trajectory, recognition based on video sequences can make full use of temporal information. Compared with the whole scene, the hand is relatively small, so the useful spatial features of the sign movement can be drowned out by irrelevant information. Learning the spatial and temporal features of sign movements simultaneously is therefore an effective approach to dynamic sign language recognition.
Summary of the invention
The purpose of the present invention is to provide a dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences, so as to solve the problems mentioned in the background above.
To achieve the above object, the invention provides the following technical scheme: a dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences, wherein the method proposes a new model, B3D ResNet, based on a three-dimensional residual neural network, and comprises the following steps:
Step 1: in each video frame, detect the position of the hand using a Faster R-CNN model and segment the hand from the background;
Step 2: use the B3D ResNet model to extract spatio-temporal features of the gesture from the input video sequence and analyse the resulting feature sequence;
Step 3: classify the input video sequence so that the gesture is identified, effectively realising dynamic sign language recognition.
Further, the steps for detecting the position of the hand with the Faster R-CNN model are as follows:
(1) When an image sequence is fed into the convolutional neural network, feature maps are generated, and the region proposal network slides a network window of kernel size n × n over the feature maps;
(2) The region proposal network recommends candidate regions and outputs multiple candidate regions that satisfy the conditions;
(3) The region-of-interest pooling layer converts candidate regions of different sizes into fixed-length candidate regions and then outputs them;
(4) Classification and bounding-box regression are performed on each region of interest, outputting the class of each candidate region and its exact position in the image.
Further, the B3D ResNet model mainly comprises 17 convolutional layers, 2 bidirectional LSTM layers and 1 fully connected layer. The input layer takes eight 112 × 112 image frames centred on the current frame, fed through three channels as an L × H × W volume, where L, H and W are the temporal length, height and width. Three-dimensional convolution is then applied to each of the three channels with a kernel of size 7 × 7 × 3, i.e. 7 × 7 in the spatial dimensions and 3 in the temporal dimension. Down-sampling with a kernel of size 2 × 2 × 1 is applied to every feature map in the convolutional layer to reduce its dimensionality. The next convolutional stage C2_x is obtained by applying 3D convolutions with kernel size 3 × 3 × 3 on the three channels, and the subsequent stages C3_x, C4_x and C5_x perform the same operation. Shortcut connections are then inserted between every two convolutional layers, converting the network into its residual counterpart. The resulting feature vectors are fed to long short-term memory networks running in two directions. The hidden-state layer of each directional LSTM is combined with a fully connected layer and a softmax layer to obtain intermediate scores for each action. Finally, the scores of the two LSTMs are averaged to obtain the class-prediction score of the current sequence.
Further, the spatio-temporal feature extraction that the B3D ResNet model performs on the input video sequence comprises: first extracting the feature vectors of the input video sequence; by constructing three-dimensional convolutions, the feature maps in a convolutional layer are connected to multiple consecutive frames in the previous layer, thereby capturing motion information. The three-dimensional convolutional layer is designed around the three-dimensional convolution kernel, which extracts one type of feature from a cube of frames. For each element of any single network layer, the feature value at any position is given by:

v_{ij}^{xyt} = \tanh\left( z_{ij} + \sum_m \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} \sum_{d=0}^{D-1} w_{ijm}^{hwd} \, v_{(i-1)m}^{(x+h)(y+w)(t+d)} \right)   (1)

where tanh(·) is the hyperbolic tangent function, the parameters t and x index positions in the current layer, H, W and D are the height, width and temporal dimension of the three-dimensional convolution kernel, and z is the bias of the feature layer.
The present invention learns spatio-temporal features through shortcut connections using an additive residual function of the input. To use two-dimensional residual units to encode the three-dimensional structure of spatio-temporal video information, the basic residual unit is modified according to the design principle of the three-dimensional convolutional layer: the three-dimensional convolution uses the same 3 × 3 × 3 kernel size in each of the three channels. By connecting the residuals, the B3D ResNet model can be applied to the three-dimensional convolutional network and automatically extract spatio-temporal features from the input video sequence.
Further, the feature-sequence analysis that the B3D ResNet model performs on the input video sequence comprises: using a bidirectional long short-term memory unit, which contains six shared weights and integrates information from the future and the past, to make a prediction for each clip of the video sequence. In the bidirectional LSTM unit, the forward-propagation layer and the backward-propagation layer are connected to the output layer. Conceptually, the memory cell stores past context, the input and output gates allow context to be stored over long periods, and the memory in the cell can be cleared through the forget gate. Formally, given an input sequence x = {x_1, x_2, ..., x_t}, cell states c = {c_1, c_2, ..., c_t} and hidden states h = {h_1, h_2, ..., h_t}, the quantities i_t, f_t, o_t, c_t, g_t and h_t denote, respectively, the input gate, forget gate, output gate, memory-cell activation vector, candidate state function and hidden function. The equations of the bidirectional LSTM unit are as follows:
i_t = \sigma(w_{xi} x_t + w_{hi} h_{t-1} + b_i)   (2)
f_t = \sigma(w_{xf} x_t + w_{hf} h_{t-1} + b_f)   (3)
o_t = \sigma(w_{xo} x_t + w_{ho} h_{t-1} + b_o)   (4)
g_t = \tanh(w_{xc} x_t + w_{hc} h_{t-1} + b_c)   (5)
c_t = f_t \odot c_{t-1} + i_t \odot g_t   (6)
h_t = o_t \odot \tanh(c_t)   (7)
where tanh(·) is the hyperbolic tangent function. The forget gate determines when information should be removed from the memory cell, while the input gate determines when new information should be integrated into memory; the candidate function generates a set of candidate values g_t which, if the input gate allows, are added to the memory cell. In formula (6), the memory cell is updated based on the forget gate, the input gate and the new candidate values. In formula (7), the output gate controls the hidden state and the stored information; finally, the hidden state is expressed as the product of a function of the memory-cell state and the output gate.
Compared with the prior art, the beneficial effects of the present invention are:
The invention proposes a new model, B3D ResNet, for dynamic sign language recognition. By analysing the spatio-temporal features of video sequences, the model can extract effective spatio-temporal feature sequences of dynamic gestures and thus recognise different gestures, and it also performs well on complex or similar sign language. Experimental results on the test data sets DEVISIGN-D and SLR_Dataset show that the invention can accurately and effectively distinguish different signs and pairs of similar gestures. In addition, the invention makes full use of the spatio-temporal features of dynamic sign language, improving the accuracy and overall performance of dynamic sign language recognition.
Detailed description of the invention
Fig. 1 is the overall framework diagram of the invention;
Fig. 2 is the structure of the B3D ResNet model of the invention;
Fig. 3 is the three-dimensional residual structural unit of the invention;
Fig. 4 is the bidirectional long short-term memory network unit of the invention;
Fig. 5 compares the results of the invention with other methods;
Fig. 6 shows the hand localisation and segmentation results of the invention.
Specific embodiment
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. The specific embodiments described here only explain the technical solution of the present invention and do not limit it.
The present invention provides the following technical solution: a dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences, whose structural framework is shown in Fig. 1. The method proposes a new model, B3D ResNet, based on a three-dimensional residual neural network, and comprises the following steps:
Step 1: in each video frame, detect the position of the hand using a Faster R-CNN model and segment the hand from the background.
Step 2: use the B3D ResNet model to extract spatio-temporal features of the gesture from the input video sequence and analyse the resulting feature sequence.
Step 3: classify the input video sequence so that the gesture is identified, effectively realising dynamic sign language recognition.
Detecting the hand position is the most important step for temporal segmentation and the subsequent recognition module. To obtain accurate information about the hand position in an image, it is essential to choose a well-performing object-detection algorithm. Compared with SSD, YOLO and other methods, Faster R-CNN has higher precision and stronger robustness and is well suited to detecting relatively small objects.
As shown in the target-localisation module of Fig. 1, the steps of detecting the hand position with the Faster R-CNN model are as follows (a hedged code sketch of this pipeline follows the list):
(1) When an image sequence is fed into the convolutional neural network, feature maps are generated, and the region proposal network slides a network window of kernel size n × n over the feature maps.
(2) The region proposal network recommends candidate regions and outputs multiple candidate regions that satisfy the conditions.
(3) The region-of-interest pooling layer converts candidate regions of different sizes into fixed-length candidate regions and then outputs them.
(4) Classification and bounding-box regression are performed on each region of interest, outputting the class of each candidate region and its exact position in the image.
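The following minimal sketch illustrates this detection-and-segmentation step. It uses the off-the-shelf Faster R-CNN from torchvision as a stand-in for the patent's Caffe-trained detector; the hand class index, score threshold and two-class head are illustrative assumptions.

```python
from typing import Optional

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

HAND_CLASS = 1          # assumed label id of "hand" in a fine-tuned 2-class head
SCORE_THRESHOLD = 0.8   # assumed confidence cut-off

# Two classes: background + hand. In practice this detector would first be
# fine-tuned on hand-annotated sign language frames.
model = fasterrcnn_resnet50_fpn(num_classes=2)
model.eval()

def detect_and_crop_hand(frame: torch.Tensor) -> Optional[torch.Tensor]:
    """Return the highest-scoring hand crop from a (3, H, W) frame in [0, 1]."""
    with torch.no_grad():
        pred = model([frame])[0]   # dict with "boxes", "labels", "scores"
    keep = (pred["labels"] == HAND_CLASS) & (pred["scores"] > SCORE_THRESHOLD)
    if not keep.any():
        return None
    # Take the most confident detection; boxes are (x1, y1, x2, y2).
    box = pred["boxes"][keep][pred["scores"][keep].argmax()]
    x1, y1, x2, y2 = box.round().int().tolist()
    return frame[:, y1:y2, x1:x2]   # hand region with background removed
```

Cropping to the detected box is one simple way to realise the "segment the hand from the background" step; the patent does not prescribe a particular segmentation mechanism beyond the detector.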
Table 1: Detection results
As shown in Table 1, the Faster R-CNN model achieves high detection accuracy for the target, as reflected in the precision and recall parameters reported there. Therefore, detecting the hand with the Faster R-CNN model yields accurate position information.
The invention proposes the B3D ResNet model for recognising dynamic sign language from video sequences. Specifically, the model performs video-sequence feature extraction and learns long-term spatio-temporal features. For dynamic sign language recognition, different dynamic signs generally correspond to videos with different labels, so a gesture can be identified by classifying its label. By extracting the spatio-temporal features of the video and classifying the feature vectors, recognition of a wide variety of dynamic signs can be achieved. To further improve recognition accuracy, the feature sequence is additionally analysed by a bidirectional long short-term memory unit. The B3D ResNet model is described below.
Fig. 2 shows the detailed structure of the B3D ResNet model, which mainly comprises 17 convolutional layers, 2 bidirectional LSTM layers and 1 fully connected layer. The input layer takes eight 112 × 112 image frames centred on the current frame, fed through three channels as an L × H × W volume, where L, H and W are the temporal length, height and width. Three-dimensional convolution is then applied to each of the three channels with a kernel of size 7 × 7 × 3, i.e. 7 × 7 in the spatial dimensions and 3 in the temporal dimension. Down-sampling with a kernel of size 2 × 2 × 1 is applied to every feature map in the convolutional layer to reduce its dimensionality. The next convolutional stage C2_x is obtained by applying 3D convolutions with kernel size 3 × 3 × 3 on the three channels, and the subsequent stages C3_x, C4_x and C5_x perform the same operation. Shortcut connections are then inserted between every two convolutional layers, converting the network into its residual counterpart. The resulting feature vectors are fed to long short-term memory networks running in two directions. The hidden-state layer of each directional LSTM is combined with a fully connected layer and a softmax layer to obtain intermediate scores for each action. Finally, the scores of the two LSTMs are averaged to obtain the class-prediction score of the current sequence.
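A minimal sketch of this data flow is given below, written in PyTorch rather than the patent's Caffe. The 7 × 7 × 3 stem, 2 × 2 × 1 down-sampling, 3 × 3 × 3 residual stages, bidirectional LSTM and score averaging follow the description above; the stage widths, LSTM hidden size and the 1 × 1 × 1 projection shortcut are assumptions.

```python
import torch
import torch.nn as nn

class ResBlock3D(nn.Module):
    """Two 3x3x3 convolutions plus a shortcut connection (cf. Fig. 3)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
        )
        # 1x1x1 projection when channel counts differ, so the addition is valid.
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv3d(in_ch, out_ch, kernel_size=1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.skip(x) + self.body(x))  # additive residual

class B3DResNetSketch(nn.Module):
    def __init__(self, num_classes: int, hidden: int = 256):
        super().__init__()
        # Stem: 7x7 spatial / 3 temporal kernel, then 2x2x1 down-sampling.
        self.stem = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(3, 7, 7), padding=(1, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # Stages C2_x..C5_x; the depths and channel widths here are assumptions.
        self.stages = nn.Sequential(
            ResBlock3D(64, 64), ResBlock3D(64, 128),
            ResBlock3D(128, 256), ResBlock3D(256, 512),
        )
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep only the time axis
        self.bilstm = nn.LSTM(512, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 3 channels, L=8 frames, 112, 112)
        x = self.stages(self.stem(clip))             # (batch, 512, L, h, w)
        x = self.pool(x).flatten(2).transpose(1, 2)  # (batch, L, 512)
        out, _ = self.bilstm(x)                      # (batch, L, 2*hidden)
        return self.fc(out).mean(dim=1)              # average per-step class scores
```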
The spatio-temporal feature extraction that the B3D ResNet model performs on the input video sequence is as follows. The model first extracts the feature vectors of the input video sequence. For image-sequence recognition problems, spatial and temporal information is generally captured from the video sequence by three-dimensional convolution: by constructing three-dimensional convolutions, the feature maps in a convolutional layer are connected to multiple consecutive frames in the previous layer, so that motion information is captured. The three-dimensional convolutional layer is designed around the three-dimensional convolution kernel, which extracts one type of feature from a cube of frames. For each element of any single network layer, the feature value at any position is given by:

v_{ij}^{xyt} = \tanh\left( z_{ij} + \sum_m \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} \sum_{d=0}^{D-1} w_{ijm}^{hwd} \, v_{(i-1)m}^{(x+h)(y+w)(t+d)} \right)   (1)

where tanh(·) is the hyperbolic tangent function, the parameters t and x index positions in the current layer, H, W and D are the height, width and temporal dimension of the three-dimensional convolution kernel, and z is the bias of the feature layer.
Moreover, the more layers the three-dimensional convolutional network has, the stronger its learning ability, and residual connections are added to the three-dimensional convolutional network to simplify the training of deeper networks. Rather than directly learning unreferenced nonlinear functions, the invention uses an additive residual function of the input, which helps learn spatio-temporal features through shortcut connections. This three-dimensional residual structure is shown in Fig. 3. To use two-dimensional residual units to encode the three-dimensional structure of spatio-temporal video information, the basic residual unit is modified according to the design principle of the three-dimensional convolutional layer: the three-dimensional convolution uses the same 3 × 3 × 3 kernel size in each of the three channels. By connecting the residuals, the B3D ResNet model can be applied to the three-dimensional convolutional network and automatically extract spatio-temporal features from the input video sequence.
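One common way to realise this modification of a two-dimensional residual unit is to "inflate" each pretrained 3 × 3 kernel along the new temporal axis. The sketch below shows this under a replicate-and-rescale scheme, which is an assumption; the patent only states that the unit is modified to use 3 × 3 × 3 kernels.

```python
import torch
import torch.nn as nn

def inflate_conv2d_to_3d(conv2d: nn.Conv2d, time_ksize: int = 3) -> nn.Conv3d:
    """Turn a 2D convolution into a 3D one with a 3x3x3-style kernel."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_ksize, *conv2d.kernel_size),
        padding=(time_ksize // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # Copy the 2D kernel into every temporal slice and divide by the
        # temporal extent so the summed response matches the 2D response.
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, time_ksize, 1, 1)
        conv3d.weight.copy_(w / time_ksize)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Example: inflate one 3x3 convolution of a 2D residual unit into 3x3x3.
conv2d = nn.Conv2d(64, 64, kernel_size=3, padding=1)
conv3d = inflate_conv2d_to_3d(conv2d)
```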
The feature-sequence analysis that the B3D ResNet model performs on the input video sequence is as follows. The model uses a bidirectional long short-term memory unit, which contains six shared weights and integrates information from the future and the past, to make a prediction for each clip of the video sequence; its structure is shown in Fig. 4. In the bidirectional LSTM unit, the forward-propagation layer and the backward-propagation layer are connected to the output layer. Conceptually, the memory cell stores past context, the input and output gates allow context to be stored over long periods, and the memory in the cell can be cleared through the forget gate. Formally, given an input sequence x = {x_1, x_2, ..., x_t}, cell states c = {c_1, c_2, ..., c_t} and hidden states h = {h_1, h_2, ..., h_t}, the quantities i_t, f_t, o_t, c_t, g_t and h_t denote, respectively, the input gate, forget gate, output gate, memory-cell activation vector, candidate state function and hidden function. The equations of the bidirectional LSTM unit are as follows:
i_t = \sigma(w_{xi} x_t + w_{hi} h_{t-1} + b_i)   (2)
f_t = \sigma(w_{xf} x_t + w_{hf} h_{t-1} + b_f)   (3)
o_t = \sigma(w_{xo} x_t + w_{ho} h_{t-1} + b_o)   (4)
g_t = \tanh(w_{xc} x_t + w_{hc} h_{t-1} + b_c)   (5)
c_t = f_t \odot c_{t-1} + i_t \odot g_t   (6)
h_t = o_t \odot \tanh(c_t)   (7)
where tanh(·) is the hyperbolic tangent function. The forget gate determines when information should be removed from the memory cell, while the input gate determines when new information should be integrated into memory; the candidate function generates a set of candidate values g_t which, if the input gate allows, are added to the memory cell. In formula (6), the memory cell is updated based on the forget gate, the input gate and the new candidate values. In formula (7), the output gate controls the hidden state and the stored information; finally, the hidden state is expressed as the product of a function of the memory-cell state and the output gate.
From the formulas above it can be seen that the B3D ResNet model can exploit the full information of the input video. For dynamic sign language recognition, the B3D ResNet model therefore has a strong ability to capture the contextual information in a sequence.
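For concreteness, a direct NumPy transcription of equations (2)-(7) for a single LSTM step is sketched below; the weight and bias containers W and b are illustrative names, and a bidirectional layer would run this recurrence forwards and backwards over the sequence and combine the two hidden states.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of eqs. (2)-(7); W maps e.g. "xi" -> w_xi, b maps "i" -> b_i."""
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])  # (2) input gate
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])  # (3) forget gate
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])  # (4) output gate
    g_t = np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])  # (5) candidate values
    c_t = f_t * c_prev + i_t * g_t                            # (6) cell update
    h_t = o_t * np.tanh(c_t)                                  # (7) hidden state
    return h_t, c_t

# Tiny demo with input size 4 and hidden size 3.
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((3, 4 if k.startswith("x") else 3))
     for k in ("xi", "hi", "xf", "hf", "xo", "ho", "xc", "hc")}
b = {k: np.zeros(3) for k in "ifoc"}
h, c = lstm_step(rng.standard_normal(4), np.zeros(3), np.zeros(3), W, b)
```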
The present invention is evaluated on test data sets, namely the DEVISIGN-D data set and SLR_Dataset.
DEVISIGN-D is a Chinese sign language data set that provides researchers in the worldwide sign language recognition community with a large-vocabulary Chinese sign language corpus for training and evaluating their algorithms. It consists of 500 everyday words performed by 8 different signers. Four of the signers (2 male, 2 female) recorded the vocabulary once, and the other four (2 male, 2 female) recorded it twice, giving 6000 videos in total.
SLR_Dataset was collected by Huang et al. and released on their project page. A Microsoft Kinect camera was used to record the videos, providing RGB, depth and body-joint information; in the present invention only the RGB video information is used. SLR_Dataset contains 25,000 labelled video instances, recorded by 50 signers, each annotated by a professional Chinese sign language teacher.
The B3D ResNet model is implemented on the deep-learning platform Caffe, and the GPU used in the experiments is a Quadro P4000. When training the model, the batch size is set to 2, the base learning rate to 0.1 and the momentum parameter to 0.9. Because of the limited size of the data sets, two well-known strategies are adopted to avoid overfitting: data augmentation, in which the image sequences are randomly cropped, and batch normalisation, which reduces internal covariate shift, is applied to all convolutional layers and accelerates the training of the deep neural network.
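A hedged sketch of this training configuration, using PyTorch in place of the patent's Caffe and reusing the B3DResNetSketch module from the earlier sketch; the random tensors stand in for real, randomly cropped image sequences.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 8 clips of 8 frames at 112x112 with labels drawn from a
# 500-word vocabulary (the DEVISIGN-D vocabulary size).
clips = torch.randn(8, 3, 8, 112, 112)
labels = torch.randint(0, 500, (8,))
loader = DataLoader(TensorDataset(clips, labels), batch_size=2, shuffle=True)

model = B3DResNetSketch(num_classes=500)   # module from the sketch above
# Batch size 2, base learning rate 0.1, momentum 0.9, as stated in the text.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for clip_batch, label_batch in loader:
    optimizer.zero_grad()
    loss = criterion(model(clip_batch), label_batch)
    loss.backward()
    optimizer.step()
```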
After the experimental parameters are set, the B3D ResNet model is trained for dynamic sign language recognition: it extracts spatio-temporal features from the input video, analyses long-term temporal dynamics and predicts the label of the gesture sequence. Recognition accuracy is used as the criterion to assess the performance of the B3D ResNet model on dynamic sign language recognition. We compare the proposed B3D ResNet model with several conventional sequence action-recognition models on the DEVISIGN-D data set, namely Res3D, 2D-ResNet and AlexNet. The comparison of dynamic sign language recognition results is shown in Fig. 5. When these models are trained for about 20k iterations, the recognition accuracy reaches its maximum. The results show that the accuracies of Res3D, 2D-ResNet and AlexNet are 86.6%, 85% and 73.8% respectively, while our method reaches 89.9%, outperforming the other methods by at least 3.3%. The experiments therefore show that the B3D ResNet model gives the best dynamic sign language performance.
For dynamic sign language recognition based on video sequences, the key is to recognise the movement of the hand region; however, the hand occupies only a very small proportion of the whole image, so the large background area is redundant. By detecting the hand region and then segmenting the hand from the background, the invention reduces the computational load of the B3D ResNet model and thereby improves recognition accuracy. The experimental results are shown in Fig. 6. To verify this preprocessing, it is evaluated on the DEVISIGN-D data set and SLR_Dataset using two different training modes:
Mode 1: the hand region is first detected and segmented from the image sequence;
Mode 2: training is performed directly, without any preprocessing.
Table 2: Preprocessing comparison results
The experimental results are shown in Table 2. With preprocessing, verified on the DEVISIGN-D and SLR_Dataset data sets, our method proves genuinely effective: recognition accuracy improves by 46.1% and 36.7% respectively.
Table 3: Comparison on the DEVISIGN-D and SLR_Dataset data sets
The training results of the B3D ResNet model on the DEVISIGN-D and SLR_Dataset data sets are shown in Table 3. The data show that the B3D ResNet model achieves the highest recognition accuracy. As can be seen from Table 3, the complexity of the data sets differs, SLR_Dataset being the more challenging. Specifically, on DEVISIGN-D and SLR_Dataset the present invention scores 89.8% and 86.9% respectively, which is 29.5% and 30.3% higher than BLSTM-NN, 25.4% and 21.7% higher than HMM-DTC, 19% and 21.1% higher than DNN, and 11.5% and 13.4% higher than C3D. The comparison shows that the invention achieves state-of-the-art recognition accuracy for dynamic sign language on both test data sets.
The above only expresses preferred embodiments of the present invention, and although the description is relatively specific and detailed, it cannot therefore be construed as limiting the patent scope of the invention. It should be pointed out that those of ordinary skill in the art can make several variations, improvements and substitutions without departing from the inventive concept, and these all belong to the protection scope of the invention. The patent protection scope of the invention shall therefore be subject to the appended claims.

Claims (5)

1. A dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences, characterised in that the method proposes a new model, B3D ResNet, based on a three-dimensional residual neural network, and comprises the following steps:
Step 1: in each video frame, detect the position of the hand using a Faster R-CNN model and segment the hand from the background;
Step 2: use the B3D ResNet model to extract spatio-temporal features of the gesture from the input video sequence and analyse the resulting feature sequence;
Step 3: classify the input video sequence so that the gesture is identified, effectively realising dynamic sign language recognition.
2. The dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences according to claim 1, characterised in that the steps of detecting the position of the hand with the Faster R-CNN model are as follows:
(1) when an image sequence is fed into the convolutional neural network, feature maps are generated, and the region proposal network slides a network window of kernel size n × n over the feature maps;
(2) the region proposal network recommends candidate regions and outputs multiple candidate regions that satisfy the conditions;
(3) the region-of-interest pooling layer converts candidate regions of different sizes into fixed-length candidate regions and then outputs them;
(4) classification and bounding-box regression are performed on each region of interest, outputting the class of each candidate region and its exact position in the image.
3. The dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences according to claim 1, characterised in that the B3D ResNet model mainly comprises 17 convolutional layers, 2 bidirectional LSTM layers and 1 fully connected layer; the input layer takes eight 112 × 112 image frames centred on the current frame, fed through three channels as an L × H × W volume, where L, H and W are the temporal length, height and width; three-dimensional convolution is then applied to each of the three channels with a kernel of size 7 × 7 × 3, i.e. 7 × 7 in the spatial dimensions and 3 in the temporal dimension; down-sampling with a kernel of size 2 × 2 × 1 is applied to every feature map in the convolutional layer to reduce its dimensionality; the next convolutional stage C2_x is obtained by applying 3D convolutions with kernel size 3 × 3 × 3 on the three channels, and the subsequent stages C3_x, C4_x and C5_x perform the same operation; shortcut connections are then inserted between every two convolutional layers, converting the network into its residual counterpart; the resulting feature vectors are fed to long short-term memory networks running in two directions; the hidden-state layer of each directional LSTM is combined with a fully connected layer and a softmax layer to obtain intermediate scores for each action; finally, the scores of the two LSTMs are averaged to obtain the class-prediction score of the current sequence.
4. The dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences according to claim 1, characterised in that the spatio-temporal feature extraction that the B3D ResNet model performs on the input video sequence comprises: first extracting the feature vectors of the input video sequence; by constructing three-dimensional convolutions, the feature maps in a convolutional layer are connected to multiple consecutive frames in the previous layer, thereby capturing motion information; the three-dimensional convolutional layer is designed around the three-dimensional convolution kernel, which extracts one type of feature from a cube of frames; for each element of any single network layer, the feature value at any position is given by:

v_{ij}^{xyt} = \tanh\left( z_{ij} + \sum_m \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} \sum_{d=0}^{D-1} w_{ijm}^{hwd} \, v_{(i-1)m}^{(x+h)(y+w)(t+d)} \right)   (1)

where tanh(·) is the hyperbolic tangent function, the parameters t and x index positions in the current layer, H, W and D are the height, width and temporal dimension of the three-dimensional convolution kernel, and z is the bias of the feature layer;
spatio-temporal features are learned through shortcut connections using an additive residual function of the input; to use two-dimensional residual units to encode the three-dimensional structure of spatio-temporal video information, the basic residual unit is modified according to the design principle of the three-dimensional convolutional layer, with the three-dimensional convolution using the same 3 × 3 × 3 kernel size in each of the three channels; by connecting the residuals, the B3D ResNet model can be applied to the three-dimensional convolutional network and automatically extract spatio-temporal features from the input video sequence.
5. The dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences according to claim 1, characterised in that the feature-sequence analysis that the B3D ResNet model performs on the input video sequence comprises: using a bidirectional long short-term memory unit, which contains six shared weights and integrates information from the future and the past, to make a prediction for each clip of the video sequence; in the bidirectional LSTM unit, the forward-propagation layer and the backward-propagation layer are connected to the output layer; conceptually, the memory cell stores past context, the input and output gates allow context to be stored over long periods, and the memory in the cell can be cleared through the forget gate; formally, given an input sequence x = {x_1, x_2, ..., x_t}, cell states c = {c_1, c_2, ..., c_t} and hidden states h = {h_1, h_2, ..., h_t}, the quantities i_t, f_t, o_t, c_t, g_t and h_t denote, respectively, the input gate, forget gate, output gate, memory-cell activation vector, candidate state function and hidden function; the equations of the bidirectional LSTM unit are as follows:
i_t = \sigma(w_{xi} x_t + w_{hi} h_{t-1} + b_i)   (2)
f_t = \sigma(w_{xf} x_t + w_{hf} h_{t-1} + b_f)   (3)
o_t = \sigma(w_{xo} x_t + w_{ho} h_{t-1} + b_o)   (4)
g_t = \tanh(w_{xc} x_t + w_{hc} h_{t-1} + b_c)   (5)
c_t = f_t \odot c_{t-1} + i_t \odot g_t   (6)
h_t = o_t \odot \tanh(c_t)   (7)
where tanh(·) is the hyperbolic tangent function; the forget gate determines when information should be removed from the memory cell, while the input gate determines when new information should be integrated into memory; the candidate function generates a set of candidate values g_t which, if the input gate allows, are added to the memory cell; in formula (6), the memory cell is updated based on the forget gate, the input gate and the new candidate values; in formula (7), the output gate controls the hidden state and the stored information; finally, the hidden state is expressed as the product of a function of the memory-cell state and the output gate.
CN201910282569.4A 2019-04-09 2019-04-09 Dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences Pending CN110110602A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910282569.4A 2019-04-09 2019-04-09 Dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910282569.4A 2019-04-09 2019-04-09 Dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences

Publications (1)

Publication Number Publication Date
CN110110602A true CN110110602A (en) 2019-08-09

Family

ID=67483774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910282569.4A 2019-04-09 2019-04-09 Dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences Pending

Country Status (1)

Country Link
CN (1) CN110110602A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569823A (en) * 2019-09-18 2019-12-13 西安工业大学 sign language identification and skeleton generation method based on RNN
CN111091045A (en) * 2019-10-25 2020-05-01 重庆邮电大学 Sign language identification method based on space-time attention mechanism
CN111273779A (en) * 2020-02-20 2020-06-12 沈阳航空航天大学 Dynamic gesture recognition method based on adaptive spatial supervision
CN111339837A (en) * 2020-02-08 2020-06-26 河北工业大学 Continuous sign language recognition method
CN111797777A (en) * 2020-07-07 2020-10-20 南京大学 Sign language recognition system and method based on space-time semantic features
CN112487967A (en) * 2020-11-30 2021-03-12 电子科技大学 Scenic spot painting behavior identification method based on three-dimensional convolution network
CN112818914A (en) * 2021-02-24 2021-05-18 网易(杭州)网络有限公司 Video content classification method and device
CN113071438A (en) * 2020-01-06 2021-07-06 北京地平线机器人技术研发有限公司 Control instruction generation method and device, storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN107679491A (en) * 2017-09-29 2018-02-09 华中师范大学 A kind of 3D convolutional neural networks sign Language Recognition Methods for merging multi-modal data

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN107679491A (en) * 2017-09-29 2018-02-09 华中师范大学 A kind of 3D convolutional neural networks sign Language Recognition Methods for merging multi-modal data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liao Yanqiu et al.: "Dynamic Sign Language Recognition Based on Video Sequence With BLSTM-3D Residual Networks", IEEE Access *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569823A (en) * 2019-09-18 2019-12-13 西安工业大学 sign language identification and skeleton generation method based on RNN
CN111091045A (en) * 2019-10-25 2020-05-01 重庆邮电大学 Sign language identification method based on space-time attention mechanism
CN111091045B (en) * 2019-10-25 2022-08-23 重庆邮电大学 Sign language identification method based on space-time attention mechanism
CN113071438A (en) * 2020-01-06 2021-07-06 北京地平线机器人技术研发有限公司 Control instruction generation method and device, storage medium and electronic equipment
CN113071438B (en) * 2020-01-06 2023-03-24 北京地平线机器人技术研发有限公司 Control instruction generation method and device, storage medium and electronic equipment
CN111339837A (en) * 2020-02-08 2020-06-26 河北工业大学 Continuous sign language recognition method
CN111339837B (en) * 2020-02-08 2022-05-03 河北工业大学 Continuous sign language recognition method
CN111273779A (en) * 2020-02-20 2020-06-12 沈阳航空航天大学 Dynamic gesture recognition method based on adaptive spatial supervision
CN111273779B (en) * 2020-02-20 2023-09-19 沈阳航空航天大学 Dynamic gesture recognition method based on self-adaptive space supervision
CN111797777A (en) * 2020-07-07 2020-10-20 南京大学 Sign language recognition system and method based on space-time semantic features
CN111797777B (en) * 2020-07-07 2023-10-17 南京大学 Sign language recognition system and method based on space-time semantic features
CN112487967A (en) * 2020-11-30 2021-03-12 电子科技大学 Scenic spot painting behavior identification method based on three-dimensional convolution network
CN112818914A (en) * 2021-02-24 2021-05-18 网易(杭州)网络有限公司 Video content classification method and device
CN112818914B (en) * 2021-02-24 2023-08-18 网易(杭州)网络有限公司 Video content classification method and device

Similar Documents

Publication Publication Date Title
CN110110602A (en) A kind of dynamic sign Language Recognition Method based on three-dimensional residual error neural network and video sequence
Zhang et al. Dynamic hand gesture recognition based on short-term sampling neural networks
US20210326597A1 (en) Video processing method and apparatus, and electronic device and storage medium
CN108804530B (en) Subtitling areas of an image
Yang et al. Extraction of 2d motion trajectories and its application to hand gesture recognition
US20190362707A1 (en) Interactive method, interactive terminal, storage medium, and computer device
CN109961034A (en) Video object detection method based on convolution gating cycle neural unit
CN103268495B (en) Human body behavior modeling recognition methods based on priori knowledge cluster in computer system
CN110399850A (en) A kind of continuous sign language recognition method based on deep neural network
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN110717431A (en) Fine-grained visual question and answer method combined with multi-view attention mechanism
CN112784763A (en) Expression recognition method and system based on local and overall feature adaptive fusion
CN110096991A (en) A kind of sign Language Recognition Method based on convolutional neural networks
CN115966010A (en) Expression recognition method based on attention and multi-scale feature fusion
Rani et al. An effectual classical dance pose estimation and classification system employing convolution neural network–long shortterm memory (CNN-LSTM) network for video sequences
CN113378919B (en) Image description generation method for fusing visual sense and enhancing multilayer global features
Balachandar et al. Deep learning technique based visually impaired people using YOLO V3 framework mechanism
Mahyoub et al. Sign Language Recognition using Deep Learning
Cai et al. Performance analysis of distance teaching classroom based on machine learning and virtual reality
Wu Biomedical image segmentation and object detection using deep convolutional neural networks
Xuan DRN-LSTM: a deep residual network based on long short-term memory network for students behaviour recognition in education
He et al. An optimal 3D convolutional neural network based lipreading method
Shinde et al. Automatic Data Collection from Forms using Optical Character Recognition
Rawat et al. Indian Sign Language Recognition System for Interrogative Words Using Deep Learning
Zhang The Cognitive Transformation of Japanese Language Education by Artificial Intelligence Technology in the Wireless Network Environment

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190809)