CN110110602A - A dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences - Google Patents
A dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences
- Publication number
- CN110110602A (application CN201910282569.4A)
- Authority
- CN
- China
- Prior art keywords
- layer
- sign language
- video sequence
- dimensional
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/194—Segmentation; Edge detection involving foreground-background segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
Abstract
The present invention provides a dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences. The method proposes a new model, B3D ResNet, based on a three-dimensional residual neural network, and comprises the following steps: step 1, detecting the position of the hand in each video frame using a Faster R-CNN model and segmenting the hand from the background; step 2, extracting the spatiotemporal features of the gesture from the input video sequence with the B3D ResNet model and analyzing the resulting feature sequence; step 3, classifying the input video sequence to identify the gesture, thereby effectively realizing dynamic sign language recognition. By analyzing the spatiotemporal features of the video sequence, the present invention can extract effective spatiotemporal feature sequences of dynamic gestures, so as to distinguish different gestures, and it also performs well on complex or similar sign language recognition. Experimental results on the test data sets show that the present invention can accurately and effectively distinguish different sign language words and similar gesture pairs.
Description
Technical field
The present invention relates to the technical field of sign language recognition, and in particular to a dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences.
Background technique
Sign language recognition is an effective technology for communication between deaf-mute people and others. With the continuous deepening of human-computer interaction research, sign language recognition has become a hot topic. In recent years, automatic sign language recognition systems have created a new mode of human-computer interaction by converting gestures into text or speech, and this technology can be realized with computer assistance. At present, there are many successful applications in this field, such as language translation, sign language tutoring, and special education, all of which can help deaf-mute people communicate fluently with others. On the other hand, sign language is generally composed of a series of actions and consists of rapid movements with similar features. Static sign language recognition technology is therefore ill-suited to the complexity and variability of sign language movements, and research on dynamic sign language recognition is an effective way to solve these problems. Vision-based dynamic gesture recognition is flexible, scalable, and low-cost, and is a hot spot of current gesture-interaction research. However, dynamic sign language recognition still faces challenges in handling the complexity of finger movements against a body background. Another difficulty is how to extract the most effective features from images or video sequences. In addition, how to select a suitable classifier is also a key factor in obtaining accurate recognition results.
In order to help deaf-mute people communicate normally in daily life, more and more researchers are devoted to these problems, and many achievements have been made in dynamic sign language recognition. There are two main approaches: one is recognition based on gesture shape and motion trajectory, and the other is recognition based on sign language video sequences.
Traditional dynamic sign language recognition mainly identifies gestures using the shape and motion-trajectory features of the hand. However, these features cannot fully meet the requirements of practical dynamic sign language recognition. With the rapid development of deep learning, data-driven methods have shown outstanding performance in target detection and gesture recognition. Unlike methods based on gesture shape and motion trajectory, sign language recognition based on video sequences can make full use of temporal information. However, because the hand is small relative to the whole scene, the useful spatial features of the sign language movement can be obscured by irrelevant information. Learning the spatial and temporal features of sign language movements simultaneously is therefore an effective approach to dynamic sign language recognition.
Summary of the invention
The purpose of the present invention is to provide a dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences, so as to solve the problems mentioned in the background above.
To achieve the above object, the invention provides the following technical scheme: a dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences, in which a new model, B3D ResNet, based on a three-dimensional residual neural network is proposed, comprising the following steps:
Step 1: in each video frame, detect the position of the hand using a Faster R-CNN model and segment the hand from the background;
Step 2: use the B3D ResNet model to extract the spatiotemporal features of the gesture from the input video sequence and analyze the resulting feature sequence;
Step 3: classify the input video sequence to identify the gesture, thereby effectively realizing dynamic sign language recognition.
Further, the steps of detecting the position of the hand with the Faster R-CNN model are as follows:
(1) When an image sequence is input to the convolutional neural network, feature maps are generated, and the region proposal network slides a network window of kernel size n × n over the feature maps;
(2) The region proposal network recommends candidate regions and outputs multiple qualified candidates;
(3) The region-of-interest pooling layer converts candidate regions of different sizes into fixed-length representations and outputs them;
(4) Each region of interest is classified and its bounding box regressed, outputting the class of the candidate region and its exact position in the image.
Further, the B3D ResNet model mainly comprises 17 convolutional layers, 2 bidirectional LSTM layers, and 1 fully connected layer. The input layer takes eight image frames of size 112 × 112, centered on the current frame, fed through three channels as L × H × W, where L, H, and W are the temporal length, height, and width. A three-dimensional convolution with kernel size 7 × 7 × 3 (7 × 7 in the spatial dimensions, 3 in the temporal dimension) is then applied on each of the three channels. Down-sampling with kernel size 2 × 2 × 1 is applied to each feature map of the convolutional layer to reduce its dimensionality. The next convolutional layer, C2_x, is obtained by applying 3D convolutions with kernel size 3 × 3 × 3 on the three channels, and the subsequent layers C3_x, C4_x, and C5_x perform the same operation. Shortcut connections are then inserted between every two convolutional layers, converting the network into its residual counterpart. The resulting feature vectors are fed to a long short-term memory network that runs in two directions; the hidden-state layer of each direction is connected to a fully connected layer and a softmax layer to obtain intermediate scores for each action. Finally, the scores of the two long short-term memory networks are averaged to obtain the class prediction score of the current sequence.
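The tensor shapes implied by this description can be checked with a little arithmetic. The sketch below is illustrative only; the stride and padding values are assumptions, since the text does not state them:

```python
# Shape walkthrough for the input pipeline described above. The convolution
# strides and paddings are assumed (ResNet-style), not taken from the patent.
def conv3d_out(shape, kernel, stride=(1, 1, 1), pad=(0, 0, 0)):
    """Output (L, H, W) of a 3-D convolution, standard shape arithmetic."""
    return tuple((s + 2 * p - k) // st + 1
                 for s, k, st, p in zip(shape, kernel, stride, pad))

clip = (8, 112, 112)                       # L x H x W: eight 112x112 frames
after_c1 = conv3d_out(clip, (3, 7, 7),     # kernel 7x7 spatial, 3 temporal
                      stride=(1, 2, 2), pad=(1, 3, 3))
after_pool = conv3d_out(after_c1, (1, 2, 2),  # 2x2x1 down-sampling
                        stride=(1, 2, 2))
print(after_c1, after_pool)                # (8, 56, 56) (8, 28, 28)
```

Note that the 2 × 2 × 1 down-sampling halves only the spatial dimensions, so the temporal length of 8 frames survives into the deeper layers for the bidirectional LSTMs to analyze.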
Further, the spatiotemporal feature extraction that the B3D ResNet model performs on the input video sequence comprises: first extracting the feature vectors of the input video sequence; by constructing three-dimensional convolutions, each feature map in a convolutional layer is connected to multiple consecutive frames of the previous layer, thereby capturing motion information. The three-dimensional convolutional network layer is designed around three-dimensional convolution kernels, each of which extracts one type of feature from a cube of frames. The feature value at any position (p, q, r) of any single network layer is given by the following formula:
f_{pqr} = tanh(z + Σ_{h=0}^{H-1} Σ_{w=0}^{W-1} Σ_{d=0}^{D-1} t_{hwd} · x_{(p+h)(q+w)(r+d)})   (1)
where tanh(·) is the hyperbolic tangent function, the parameters t and x are the connection weights and inputs of the current layer, H, W, and D are the height, width, and temporal dimension of the three-dimensional convolution kernel, and z is the bias of the feature layer.
The present invention learns spatiotemporal features through shortcut connections using an additive residual function of the input. In order to use two-dimensional residual units to encode the three-dimensional architecture of spatiotemporal video information, the basic residual unit is modified according to the design principle of the three-dimensional convolutional network layer: the three-dimensional convolution uses the same kernel size of 3 × 3 × 3 in each of the three channels. By connecting the residuals, the B3D ResNet model can be applied to a three-dimensional convolutional network and automatically extracts spatiotemporal features from the input video sequence.
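A minimal numpy sketch of the two ideas above — a tanh three-dimensional convolution in the spirit of formula (1), and the additive shortcut of the residual unit — might look like this (single channel, "same" padding on the convolution path so the shortcut shapes match; all names are illustrative, not the patent's code):

```python
import numpy as np

def conv3d_tanh(x, t, z):
    """Single-channel 3-D convolution followed by tanh: t is the D x H x W
    kernel, z the bias, x the input volume (valid convolution)."""
    D, H, W = t.shape
    d, h, w = (s - k + 1 for s, k in zip(x.shape, t.shape))
    out = np.empty((d, h, w))
    for i in range(d):
        for j in range(h):
            for k in range(w):
                out[i, j, k] = np.tanh(z + np.sum(t * x[i:i+D, j:j+H, k:k+W]))
    return out

def residual_block(x, t, z):
    """y = F(x) + x: pad so F(x) has the shape of x, then add the shortcut."""
    pad = [(k // 2, k // 2) for k in t.shape]
    xp = np.pad(x, pad)
    return conv3d_tanh(xp, t, z) + x
```

With an odd kernel such as 3 × 3 × 3 the padded convolution preserves the input shape, so the identity shortcut can be added elementwise without any projection.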
Further, the feature-sequence analysis that the B3D ResNet model performs on the input video sequence comprises: using a bidirectional long short-term memory unit, which contains six shared weights and integrates information from the future and the past, to predict each clip of the video sequence. In the bidirectional long short-term memory unit, a forward propagation layer and a backward propagation layer are both connected to the output layer. Conceptually, the memory cell stores past context, and the input and output gates allow context to be stored for a long time; meanwhile, the memory in the cell can be cleared by the forget gate. Formally, given an input sequence x = {x_1, x_2, ..., x_t}, cell states c = {c_1, c_2, ..., c_t}, and hidden states h = {h_1, h_2, ..., h_t}, the quantities i_t, f_t, o_t, c_t, g_t, and h_t are respectively the input gate, forget gate, output gate, memory cell state, cell activation vector, and hidden state. The equations of the bidirectional long short-term memory unit are as follows:
i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)   (2)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)   (3)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)   (4)
g_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)   (5)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t   (6)
h_t = o_t ⊙ tanh(c_t)   (7)
where tanh(·) is the hyperbolic tangent function and σ(·) is the sigmoid function. The forget gate determines when information should be removed from the memory cell; the input gate determines when new information should be integrated into memory. The tanh layer generates a set of candidate values which, if the input gate allows, are added to the memory cell. Referring to formula (6), the memory cell is updated based on the forget gate, the input gate, and the new candidate values. In formula (7), the output gate controls the hidden state and the stored information; finally, the hidden state is expressed as the product of a function of the memory cell state and the output gate.
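Equations (2)-(7) translate almost line for line into code. The sketch below is a plain numpy illustration of one LSTM direction plus the bidirectional combination, with a single stacked weight matrix per direction; it is an assumed layout for illustration, not the patent's implementation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One step of equations (2)-(7). W_x: (4n, d), W_h: (4n, n), b: (4n,);
    rows are ordered [input gate, forget gate, output gate, candidate g]."""
    a = W_x @ x_t + W_h @ h_prev + b
    n = h_prev.size
    i, f, o = (sigmoid(a[k * n:(k + 1) * n]) for k in range(3))  # eqs (2)-(4)
    g = np.tanh(a[3 * n:])                                       # eq (5)
    c = f * c_prev + i * g                                       # eq (6)
    h = o * np.tanh(c)                                           # eq (7)
    return h, c

def bilstm(xs, params_fwd, params_bwd):
    """Run the sequence forward and backward and concatenate the hidden
    states, so each position sees both past and future context."""
    def run(seq, params):
        W_x, W_h, b = params
        n = W_h.shape[1]
        h, c = np.zeros(n), np.zeros(n)
        out = []
        for x_t in seq:
            h, c = lstm_step(x_t, h, c, W_x, W_h, b)
            out.append(h)
        return out
    fwd = run(xs, params_fwd)
    bwd = run(xs[::-1], params_bwd)[::-1]
    return [np.concatenate([f, b_]) for f, b_ in zip(fwd, bwd)]
```

The backward pass is simply the same cell run over the reversed sequence; reversing its outputs again aligns them with the forward pass before concatenation, which is the "information from future and past" integration described above.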
Compared with the prior art, the beneficial effects of the present invention are:
The invention proposes a new model, B3D ResNet, for dynamic sign language recognition. By analyzing the spatiotemporal features of the video sequence, the model can extract effective spatiotemporal feature sequences of dynamic gestures, so as to distinguish different gestures, and it also performs well on complex or similar sign language recognition. Experimental results on the DEVISIGN-D and SLR_Dataset test sets show that the present invention can accurately and effectively distinguish different sign language words and similar gesture pairs. In addition, the present invention makes full use of the spatiotemporal features of dynamic sign language, improving the accuracy and overall performance of dynamic sign language recognition.
Description of the drawings
Fig. 1 is the overall framework of the present invention;
Fig. 2 is the structure of the B3D ResNet model of the present invention;
Fig. 3 is the three-dimensional residual structural unit of the present invention;
Fig. 4 is the bidirectional long short-term memory network unit of the present invention;
Fig. 5 compares the results of the present invention with other methods;
Fig. 6 shows the hand localization and segmentation results of the present invention.
Specific embodiment
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. The specific embodiments described herein are only intended to explain the technical solution of the present invention and do not limit it.
The present invention provides a technical solution: a dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences, whose structural framework is shown in Fig. 1. The method proposes a new model, B3D ResNet, based on a three-dimensional residual neural network, comprising the following steps:
Step 1: in each video frame, detect the position of the hand using a Faster R-CNN model and segment the hand from the background.
Step 2: use the B3D ResNet model to extract the spatiotemporal features of the gesture from the input video sequence and analyze the resulting feature sequence.
Step 3: classify the input video sequence to identify the gesture, thereby effectively realizing dynamic sign language recognition.
Detecting the hand position is the most important step for temporal segmentation and the subsequent recognition module. In order to obtain accurate information about the hand position in the image, it is essential to select a target-detection algorithm with excellent performance. Compared with SSD, YOLO, and other methods, Faster R-CNN has higher precision and stronger robustness and is suitable for detecting relatively small objects.
As shown in the target localization module of Fig. 1, the steps of detecting the hand position with the Faster R-CNN model are as follows:
(1) When an image sequence is input to the convolutional neural network, feature maps are generated, and the region proposal network slides a network window of kernel size n × n over the feature maps.
(2) The region proposal network recommends candidate regions and outputs multiple qualified candidates.
(3) The region-of-interest pooling layer converts candidate regions of different sizes into fixed-length representations and outputs them.
(4) Each region of interest is classified and its bounding box regressed, outputting the class of the candidate region and its exact position in the image.
Table 1 Detection results
As shown in Table 1, the Faster R-CNN model has high detection accuracy for the target; this result is reflected in the precision metrics reported there. Therefore, detecting the hand with the Faster R-CNN model yields precise position information.
The invention proposes the B3D ResNet model for recognizing dynamic sign language based on video sequences. Specifically, the model can extract video sequence features and learn long-term spatiotemporal features. For dynamic sign language recognition, different dynamic sign language gestures generally correspond to videos with different labels, so a gesture can be identified by classifying its label. By extracting the spatiotemporal features of the video and classifying the feature vectors, various dynamic sign language gestures can be recognized well. In order to improve the recognition accuracy of dynamic sign language, the feature sequence is further analyzed by a bidirectional long short-term memory unit. The B3D ResNet model is described below.
Fig. 2 shows the detailed structure of the B3D ResNet model, which mainly comprises 17 convolutional layers, 2 bidirectional LSTM layers, and 1 fully connected layer. The input layer takes eight image frames of size 112 × 112, centered on the current frame, fed through three channels as L × H × W, where L, H, and W are the temporal length, height, and width. A three-dimensional convolution with kernel size 7 × 7 × 3 (7 × 7 in the spatial dimensions, 3 in the temporal dimension) is then applied on each of the three channels. Down-sampling with kernel size 2 × 2 × 1 is applied to each feature map of the convolutional layer to reduce its dimensionality. The next convolutional layer, C2_x, is obtained by applying 3D convolutions with kernel size 3 × 3 × 3 on the three channels, and the subsequent layers C3_x, C4_x, and C5_x perform the same operation. Shortcut connections are then inserted between every two convolutional layers, converting the network into its residual counterpart. The resulting feature vectors are fed to a long short-term memory network that runs in two directions; the hidden-state layer of each direction is connected to a fully connected layer and a softmax layer to obtain intermediate scores for each action. Finally, the scores of the two long short-term memory networks are averaged to obtain the class prediction score of the current sequence.
The spatiotemporal feature extraction that the B3D ResNet model performs on the input video sequence comprises: the model first extracts the feature vectors of the input video sequence. For image-sequence recognition problems, spatial and temporal information is generally captured from the video sequence by three-dimensional convolution. By constructing three-dimensional convolutions, each feature map in a convolutional layer is connected to multiple consecutive frames of the previous layer, thereby capturing motion information. The three-dimensional convolutional network layer is designed around three-dimensional convolution kernels, each of which extracts one type of feature from a cube of frames. The feature value at any position (p, q, r) of any single network layer is given by the following formula:
f_{pqr} = tanh(z + Σ_{h=0}^{H-1} Σ_{w=0}^{W-1} Σ_{d=0}^{D-1} t_{hwd} · x_{(p+h)(q+w)(r+d)})   (1)
where tanh(·) is the hyperbolic tangent function, the parameters t and x are the connection weights and inputs of the current layer, H, W, and D are the height, width, and temporal dimension of the three-dimensional convolution kernel, and z is the bias of the feature layer.
Moreover, the more layers the three-dimensional convolutional network has, the stronger its learning ability. In addition, residual connections are added to the three-dimensional convolutional network to simplify the training of deeper networks. Rather than directly learning unreferenced nonlinear functions, the present invention uses an additive residual function of the input, which helps to learn spatiotemporal features through shortcut connections. This three-dimensional residual structure is shown in Fig. 3. In order to use two-dimensional residual units to encode the three-dimensional architecture of spatiotemporal video information, the basic residual unit is modified according to the design principle of the three-dimensional convolutional network layer: the three-dimensional convolution uses the same kernel size of 3 × 3 × 3 in each of the three channels. By connecting the residuals, the B3D ResNet model can be applied to a three-dimensional convolutional network and automatically extracts spatiotemporal features from the input video sequence.
The feature-sequence analysis that the B3D ResNet model performs on the input video sequence comprises: the model uses a bidirectional long short-term memory unit, which contains six shared weights and integrates information from the future and the past, to predict each clip of the video sequence; its structure is shown in Fig. 4. In the bidirectional long short-term memory unit, a forward propagation layer and a backward propagation layer are both connected to the output layer. Conceptually, the memory cell stores past context, and the input and output gates allow context to be stored for a long time; meanwhile, the memory in the cell can be cleared by the forget gate. Formally, given an input sequence x = {x_1, x_2, ..., x_t}, cell states c = {c_1, c_2, ..., c_t}, and hidden states h = {h_1, h_2, ..., h_t}, the quantities i_t, f_t, o_t, c_t, g_t, and h_t are respectively the input gate, forget gate, output gate, memory cell state, cell activation vector, and hidden state. The equations of the bidirectional long short-term memory unit are as follows:
i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)   (2)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)   (3)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)   (4)
g_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)   (5)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t   (6)
h_t = o_t ⊙ tanh(c_t)   (7)
where tanh(·) is the hyperbolic tangent function and σ(·) is the sigmoid function. The forget gate determines when information should be removed from the memory cell; the input gate determines when new information should be integrated into memory. The tanh layer generates a set of candidate values which, if the input gate allows, are added to the memory cell. Referring to formula (6), the memory cell is updated based on the forget gate, the input gate, and the new candidate values. In formula (7), the output gate controls the hidden state and the stored information; finally, the hidden state is expressed as the product of a function of the memory cell state and the output gate.
From the above formulas it can be seen that the B3D ResNet model can capture the full information of the input video. For dynamic sign language recognition, the B3D ResNet model has a strong ability to capture the contextual information in the sequence.
The present invention is evaluated on test data sets, including the DEVISIGN-D data set and the SLR_Dataset data set.
The DEVISIGN-D data set is a Chinese sign language data set that provides researchers in the worldwide sign language recognition community with a large-vocabulary Chinese sign language corpus for training and evaluating their algorithms. It consists of 500 everyday words covering 8 different signers. Four of the signers (2 male and 2 female) recorded the vocabulary once, and the other four (2 male and 2 female) recorded it twice, for a total of 6000 videos.
The SLR_Dataset was collected by Huang et al. and released on their project page. A Microsoft Kinect camera was used to record the videos, providing RGB, depth, and body-joint information; in the present invention, only the RGB video information is used. The SLR_Dataset contains 25,000 labeled video instances recorded by 50 signers, and each video instance is annotated by a professional Chinese sign language teacher.
The B3D ResNet model is implemented on the deep learning platform Caffe, and the GPU used in the experiments is a Quadro P4000. When training the model, the batch size is set to 2, the base learning rate to 0.1, and the momentum parameter to 0.9. Because the size of the data sets is limited, the following strategies are adopted to avoid overfitting: one is the well-known method of data augmentation, in which the image sequences are randomly cropped; the other is batch normalization, which aims to reduce internal covariate shift and is applied to all convolutional layers, accelerating the training of the deep neural network.
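The two anti-overfitting strategies can be sketched as follows; the 112-pixel crop size, the frame layout, and the simple per-batch normalization shown here are assumptions for illustration, not details stated in the patent:

```python
import numpy as np

def random_crop(frames, size=112, rng=None):
    """Crop the same random window from every frame of a clip, as in the
    random-cropping augmentation described above (crop size is assumed).
    frames: (L, H, W, ...) array with frames along the first axis."""
    rng = rng or np.random.default_rng()
    L, H, W = frames.shape[:3]
    top = int(rng.integers(0, H - size + 1))
    left = int(rng.integers(0, W - size + 1))
    return frames[:, top:top + size, left:left + size]

def batch_norm(x, eps=1e-5):
    """Normalize a batch of activations to zero mean and unit variance per
    feature, the internal-covariate-shift reduction used during training."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```

Cropping the same window from every frame of a clip preserves the temporal coherence that the 3-D convolutions rely on, while still varying the spatial content between training epochs.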
After the experimental parameters are set, the B3D ResNet model is trained for dynamic sign language recognition: it extracts spatiotemporal features from the input video, analyzes long-term temporal dynamic features, and predicts the label of the gesture sequence. To assess the performance of the B3D ResNet model in dynamic sign language recognition, recognition accuracy is used as the criterion. We compare the proposed B3D ResNet model with several conventional sequence action-recognition models on the DEVISIGN-D data set, namely Res3D, 2D-ResNet, and AlexNet. The comparison of dynamic sign language recognition results is shown in Fig. 5. When these models have been trained for about 20k iterations, the recognition accuracy reaches its maximum. The results show that the accuracies of Res3D, 2D-ResNet, and AlexNet are 86.6%, 85%, and 73.8% respectively, while the accuracy of our method is 89.9%, better than the other methods by at least 3.3%. The experiments therefore show that the B3D ResNet model achieves the best dynamic sign language recognition performance.
For dynamic sign language gesture recognition based on video sequences, the key is to recognize the motion of the hand region. However, the hand region occupies only a very small proportion of the whole image, so the large background area is redundant. The present invention detects the hand region and then segments the hand from the background, which reduces the computation of the B3D ResNet model and thereby improves recognition accuracy. The experimental results are shown in Figure 6. To verify this method, the preprocessing procedure is evaluated on the DEVISIGN-D dataset and SLR_Dataset using two different training modes:
Mode 1: the hand region is first detected and segmented from the image sequence;
Mode 2: the model is trained directly, without any preprocessing.
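The saving from Mode 1 can be quantified with a small sketch: cropping a detected hand box discards most of the background pixels before the sequence reaches the recognition model. The frame size and box coordinates below are hypothetical:

```python
import numpy as np

def crop_hand(frame, box):
    # box = (x1, y1, x2, y2), e.g. from a hand detector such as Faster R-CNN
    x1, y1, x2, y2 = box
    return frame[y1:y2, x1:x2]

frame = np.zeros((240, 320, 3))              # one full video frame
hand = crop_hand(frame, (100, 60, 180, 160)) # a detected hand bounding box
ratio = (hand.shape[0] * hand.shape[1]) / (frame.shape[0] * frame.shape[1])
# only about a tenth of the original pixels remain to be processed
```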
Table 2 Comparison of preprocessing results
The experimental results are shown in Table 2. With preprocessing, verified on the DEVISIGN-D and SLR_Dataset datasets, our method is found to be effective: recognition accuracy improves by 46.1% and 36.7% respectively.
Table 3 Comparison on the DEVISIGN-D and SLR_Dataset datasets
The training results of the B3D ResNet model on the DEVISIGN-D and SLR_Dataset datasets are shown in Table 3. The data show that the B3D ResNet model achieves the highest recognition accuracy. As can be seen from Table 3, the complexity of the two datasets differs, SLR_Dataset being the more challenging. Specifically, on DEVISIGN-D and SLR_Dataset, the results of the present invention are 89.8% and 86.9% respectively, which is 29.5% and 30.3% higher than BLSTM-NN, 25.4% and 21.7% higher than HMM-DTC, 19% and 21.1% higher than DNN, and 11.5% and 13.4% higher than C3D. The comparison shows that the present invention achieves state-of-the-art recognition accuracy for dynamic sign language on both test datasets.
The above describes only preferred embodiments of the present invention, and although the description is specific and detailed, it shall not therefore be construed as limiting the patent scope of the present invention. It should be pointed out that those of ordinary skill in the art may make several variations, improvements and substitutions without departing from the concept of the present invention, and these fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.
Claims (5)
1. A dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences, characterized in that the method proposes a new model, B3D ResNet, based on a three-dimensional residual neural network, and comprises the following steps:
Step 1: in the video frames, detect the position of the hand using a Faster R-CNN model, and segment the hand from the background;
Step 2: use the B3D ResNet model to perform spatio-temporal feature extraction and feature-sequence analysis of the gesture on the input video sequence;
Step 3: classify the input video sequence, whereby the gesture can be identified and dynamic sign language recognition is effectively realized.
2. The dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences according to claim 1, characterized in that the steps for detecting the position of the hand with the Faster R-CNN model are as follows:
(1) when the image sequence is input into the convolutional neural network, a feature map is generated, and the region proposal network slides a network window with a kernel size of n × n over the feature map;
(2) the region proposal network recommends candidate regions and outputs multiple qualified candidate regions;
(3) the region-of-interest pooling layer converts candidate regions of different sizes into candidate regions of fixed length, and then outputs the fixed-length candidate regions;
(4) classification and bounding-box regression are performed on each region of interest, outputting the class of each candidate region and its exact position in the image.
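Step (3), mapping candidate regions of different sizes to a fixed length, can be sketched as a max-pooling grid over each region (a simplified RoI pooling; the 7 × 7 output size and the variable names are assumptions, not taken from the claim):

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=7):
    # feature_map: (H, W); roi = (y1, x1, y2, x2) in feature-map coordinates.
    # Divides the region into an out_size x out_size grid and max-pools each
    # cell, so every region yields the same fixed-length output.
    y1, x1, y2, x2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.max()
    return out

fm = np.arange(100.0).reshape(10, 10)
pooled_a = roi_pool(fm, (0, 0, 9, 6))    # a 9x6 candidate region
pooled_b = roi_pool(fm, (2, 2, 10, 10))  # an 8x8 candidate region
# both come out as 7x7, ready for the same fully connected classifier head
```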
3. The dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences according to claim 1, characterized in that the B3D ResNet model mainly comprises 17 convolutional layers, 2 bidirectional LSTM layers, and 1 fully connected layer. In the input layer, there are eight image frames of size 112 × 112 with three channels, centered on the current frame, and the input is given as L × H × W, where L, H and W are the temporal length, height and width. Then, a three-dimensional convolution with a kernel size of 7 × 7 × 3 is applied to the three channels respectively, where 7 × 7 lies in the spatial dimension and 3 in the temporal dimension. Downsampling with a kernel size of 2 × 2 × 1 acts on each feature map of the convolutional layer to reduce the feature map dimensions. The next convolutional layer, C2_x, is obtained by applying a 3D convolution with a kernel size of 3 × 3 × 3 on the three channels; the subsequent layers C3_x, C4_x and C5_x have the same operation. Afterwards, shortcut connections are inserted between every two layers of the convolutional neural network, converting the network into its corresponding residual version. The feature vector is then sent to long short-term memory networks running in two directions. The hidden-state layer of the long short-term memory network in each direction is connected to a combination of a fully connected layer and a softmax layer to obtain intermediate scores corresponding to each action. Finally, the scores of the two long short-term memory networks are averaged to obtain the class prediction score of the current sequence.
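The layer sizes in the architecture above can be checked with the standard per-dimension convolution output-size formula. The stride and padding values below are not stated in the claim; typical ResNet stem values are assumed:

```python
def conv_out(size, kernel, stride=1, pad=0):
    # Output length of a convolution along one dimension.
    return (size + 2 * pad - kernel) // stride + 1

# Spatial side of the first layer: 112 input, 7x7 kernel (stride 2, pad 3 assumed)
spatial = conv_out(112, kernel=7, stride=2, pad=3)     # 56
# Temporal side: 8 frames, kernel 3 (stride 1, pad 1 assumed) preserves length
temporal = conv_out(8, kernel=3, stride=1, pad=1)      # 8
# The 2x2x1 downsampling halves each spatial side and leaves time untouched
pooled = conv_out(spatial, kernel=2, stride=2, pad=0)  # 28
```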
4. The dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences according to claim 1, characterized in that the spatio-temporal feature extraction that the B3D ResNet model performs on the input video sequence comprises: first, the feature vector of the input video sequence is extracted; by constructing the three-dimensional convolution, the feature map in the convolutional layer is connected to multiple successive frames in the previous layer, and motion information is then captured. The three-dimensional convolutional network layer is designed using a three-dimensional convolution kernel, which can extract one type of feature from a cube of frames. For each element of any single network layer, the feature vector value at any position is given by the following formula:
v = tanh(Σ_{h=1..H} Σ_{w=1..W} Σ_{d=1..D} t_{hwd}·x_{hwd} + z) (1)
where tanh(·) is the hyperbolic tangent function, the parameters t and x are the connection weights and inputs of the current layer, H, W and D are the height, width and temporal dimension of the three-dimensional convolution kernel, and z is the bias of the feature layer.
The present invention learns spatio-temporal features through shortcut connections using an additive residual function of the input. In order to extend the two-dimensional residual unit to a three-dimensional architecture that encodes spatio-temporal video information, the basic residual unit is modified according to the design principle of the three-dimensional convolutional network layer: the three-dimensional convolution has the same kernel size of 3 × 3 × 3 in each of the three channels. By connecting the residuals, the B3D ResNet model can be applied to the three-dimensional convolutional network and automatically extract spatio-temporal features from the input video sequence.
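The value computation of a single 3D-convolution output unit can be sketched directly; the symbols follow the claim (t for connection weights, x for inputs, z for the bias), while the 3 × 3 × 3 kernel values themselves are arbitrary:

```python
import numpy as np

def conv3d_value(x, t, z):
    # Value of one output unit: tanh of the bias z plus the sum, over the
    # H x W x D kernel window, of connection weights t times inputs x.
    return np.tanh(z + np.sum(t * x))

H, W, D = 3, 3, 3              # kernel height, width and temporal depth
x = np.ones((H, W, D))         # a cube of inputs cut from successive frames
t = np.full((H, W, D), 0.01)   # connection weights of the current layer
v = conv3d_value(x, t, z=0.0)  # tanh(27 * 0.01) = tanh(0.27)
```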
5. The dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences according to claim 1, characterized in that the feature-sequence analysis that the B3D ResNet model performs on the gestures of the input video sequence comprises: using a bidirectional long short-term memory unit, which contains six shared weights and integrates information from the future and the past, to make a prediction for each slice of the video sequence. In the bidirectional long short-term memory unit, a forward propagation layer and a backward propagation layer are connected to the output layer. Conceptually, the memory cell stores past context, and the input and output gate units allow context to be stored over long periods; meanwhile, the memory in the cell can be cleared through the forget gate. Formally, given the input sequence x = {x_1, x_2, ..., x_t}, the cell state c = {c_1, c_2, ..., c_t} and the hidden state h = {h_1, h_2, ..., h_t}, the quantities i_t, f_t, o_t, c_t, g_t and h_t denote respectively the input gate, forget gate, output gate, memory cell, candidate activation vector and hidden state. The equations of the bidirectional long short-term memory unit are as follows:
i_t = σ(w_xi·x_t + w_hi·h_{t-1} + b_i) (2)
f_t = σ(w_xf·x_t + w_hf·h_{t-1} + b_f) (3)
o_t = σ(w_xo·x_t + w_ho·h_{t-1} + b_o) (4)
g_t = tanh(w_xc·x_t + w_hc·h_{t-1} + b_c) (5)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t (6)
h_t = o_t ⊙ tanh(c_t) (7)
where tanh(·) is the hyperbolic tangent function and σ(·) is the sigmoid function. The forget gate decides when information should be removed from the memory cell; the input gate decides when new information should be integrated into the memory; the tanh layer generates a set of candidate values which, if the input gate allows, are added to the memory cell. Referring to formula (6), the memory cell is updated based on the forget gate, the input gate and the new candidate values. In formula (7), the output gate controls the hidden state and the stored information; finally, the hidden state is expressed as the product of a function of the memory cell state and the output gate.
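Equations (2)-(7) can be sketched as a single NumPy step of one LSTM direction; the weight names (w['xi'] for w_xi, etc.) and the toy dimensions are assumptions. A bidirectional layer runs this once over the sequence and once over its reverse, then combines the two hidden states:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, w, b):
    # One application of equations (2)-(7).
    i_t = sigmoid(w['xi'] @ x_t + w['hi'] @ h_prev + b['i'])  # input gate   (2)
    f_t = sigmoid(w['xf'] @ x_t + w['hf'] @ h_prev + b['f'])  # forget gate  (3)
    o_t = sigmoid(w['xo'] @ x_t + w['ho'] @ h_prev + b['o'])  # output gate  (4)
    g_t = np.tanh(w['xc'] @ x_t + w['hc'] @ h_prev + b['c'])  # candidates   (5)
    c_t = f_t * c_prev + i_t * g_t                            # cell update  (6)
    h_t = o_t * np.tanh(c_t)                                  # hidden state (7)
    return h_t, c_t

n, m = 4, 3                                   # hidden size, input size (toy)
rng = np.random.default_rng(1)
w = {k: 0.1 * rng.standard_normal((n, m if k[0] == 'x' else n))
     for k in ('xi', 'hi', 'xf', 'hf', 'xo', 'ho', 'xc', 'hc')}
b = {k: np.zeros(n) for k in 'ifoc'}
h, c = np.zeros(n), np.zeros(n)
for x_t in rng.standard_normal((5, m)):       # run a short sequence forward
    h, c = lstm_step(x_t, h, c, w, b)
```

Because the output gate lies in (0, 1) and tanh is bounded, every component of the hidden state stays inside (-1, 1), which is exactly what equation (7) guarantees.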
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910282569.4A CN110110602A (en) | 2019-04-09 | 2019-04-09 | A kind of dynamic sign Language Recognition Method based on three-dimensional residual error neural network and video sequence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910282569.4A CN110110602A (en) | 2019-04-09 | 2019-04-09 | A kind of dynamic sign Language Recognition Method based on three-dimensional residual error neural network and video sequence |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110110602A true CN110110602A (en) | 2019-08-09 |
Family
ID=67483774
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910282569.4A Pending CN110110602A (en) | 2019-04-09 | 2019-04-09 | A kind of dynamic sign Language Recognition Method based on three-dimensional residual error neural network and video sequence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110110602A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110569823A (en) * | 2019-09-18 | 2019-12-13 | 西安工业大学 | sign language identification and skeleton generation method based on RNN |
CN111091045A (en) * | 2019-10-25 | 2020-05-01 | 重庆邮电大学 | Sign language identification method based on space-time attention mechanism |
CN111273779A (en) * | 2020-02-20 | 2020-06-12 | 沈阳航空航天大学 | Dynamic gesture recognition method based on adaptive spatial supervision |
CN111339837A (en) * | 2020-02-08 | 2020-06-26 | 河北工业大学 | Continuous sign language recognition method |
CN111797777A (en) * | 2020-07-07 | 2020-10-20 | 南京大学 | Sign language recognition system and method based on space-time semantic features |
CN112487967A (en) * | 2020-11-30 | 2021-03-12 | 电子科技大学 | Scenic spot painting behavior identification method based on three-dimensional convolution network |
CN112818914A (en) * | 2021-02-24 | 2021-05-18 | 网易(杭州)网络有限公司 | Video content classification method and device |
CN113071438A (en) * | 2020-01-06 | 2021-07-06 | 北京地平线机器人技术研发有限公司 | Control instruction generation method and device, storage medium and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107451552A (en) * | 2017-07-25 | 2017-12-08 | 北京联合大学 | A kind of gesture identification method based on 3D CNN and convolution LSTM |
CN107679491A (en) * | 2017-09-29 | 2018-02-09 | 华中师范大学 | A kind of 3D convolutional neural networks sign Language Recognition Methods for merging multi-modal data |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107451552A (en) * | 2017-07-25 | 2017-12-08 | 北京联合大学 | A kind of gesture identification method based on 3D CNN and convolution LSTM |
CN107679491A (en) * | 2017-09-29 | 2018-02-09 | 华中师范大学 | A kind of 3D convolutional neural networks sign Language Recognition Methods for merging multi-modal data |
Non-Patent Citations (1)
Title |
---|
Liao Yanqiu et al.: "Dynamic Sign Language Recognition Based on Video Sequence With BLSTM-3D Residual Networks", IEEE Access *
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110569823A (en) * | 2019-09-18 | 2019-12-13 | 西安工业大学 | sign language identification and skeleton generation method based on RNN |
CN111091045A (en) * | 2019-10-25 | 2020-05-01 | 重庆邮电大学 | Sign language identification method based on space-time attention mechanism |
CN111091045B (en) * | 2019-10-25 | 2022-08-23 | 重庆邮电大学 | Sign language identification method based on space-time attention mechanism |
CN113071438A (en) * | 2020-01-06 | 2021-07-06 | 北京地平线机器人技术研发有限公司 | Control instruction generation method and device, storage medium and electronic equipment |
CN113071438B (en) * | 2020-01-06 | 2023-03-24 | 北京地平线机器人技术研发有限公司 | Control instruction generation method and device, storage medium and electronic equipment |
CN111339837A (en) * | 2020-02-08 | 2020-06-26 | 河北工业大学 | Continuous sign language recognition method |
CN111339837B (en) * | 2020-02-08 | 2022-05-03 | 河北工业大学 | Continuous sign language recognition method |
CN111273779A (en) * | 2020-02-20 | 2020-06-12 | 沈阳航空航天大学 | Dynamic gesture recognition method based on adaptive spatial supervision |
CN111273779B (en) * | 2020-02-20 | 2023-09-19 | 沈阳航空航天大学 | Dynamic gesture recognition method based on self-adaptive space supervision |
CN111797777A (en) * | 2020-07-07 | 2020-10-20 | 南京大学 | Sign language recognition system and method based on space-time semantic features |
CN111797777B (en) * | 2020-07-07 | 2023-10-17 | 南京大学 | Sign language recognition system and method based on space-time semantic features |
CN112487967A (en) * | 2020-11-30 | 2021-03-12 | 电子科技大学 | Scenic spot painting behavior identification method based on three-dimensional convolution network |
CN112818914A (en) * | 2021-02-24 | 2021-05-18 | 网易(杭州)网络有限公司 | Video content classification method and device |
CN112818914B (en) * | 2021-02-24 | 2023-08-18 | 网易(杭州)网络有限公司 | Video content classification method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110110602A (en) | A kind of dynamic sign Language Recognition Method based on three-dimensional residual error neural network and video sequence | |
Zhang et al. | Dynamic hand gesture recognition based on short-term sampling neural networks | |
US20210326597A1 (en) | Video processing method and apparatus, and electronic device and storage medium | |
CN108804530B (en) | Subtitling areas of an image | |
Yang et al. | Extraction of 2d motion trajectories and its application to hand gesture recognition | |
US20190362707A1 (en) | Interactive method, interactive terminal, storage medium, and computer device | |
CN109961034A (en) | Video object detection method based on convolution gating cycle neural unit | |
CN103268495B (en) | Human body behavior modeling recognition methods based on priori knowledge cluster in computer system | |
CN110399850A (en) | A kind of continuous sign language recognition method based on deep neural network | |
CN111291556B (en) | Chinese entity relation extraction method based on character and word feature fusion of entity meaning item | |
CN110717431A (en) | Fine-grained visual question and answer method combined with multi-view attention mechanism | |
CN112784763A (en) | Expression recognition method and system based on local and overall feature adaptive fusion | |
CN110096991A (en) | A kind of sign Language Recognition Method based on convolutional neural networks | |
CN115966010A (en) | Expression recognition method based on attention and multi-scale feature fusion | |
Rani et al. | An effectual classical dance pose estimation and classification system employing convolution neural network–long shortterm memory (CNN-LSTM) network for video sequences | |
CN113378919B (en) | Image description generation method for fusing visual sense and enhancing multilayer global features | |
Balachandar et al. | Deep learning technique based visually impaired people using YOLO V3 framework mechanism | |
Mahyoub et al. | Sign Language Recognition using Deep Learning | |
Cai et al. | Performance analysis of distance teaching classroom based on machine learning and virtual reality | |
Wu | Biomedical image segmentation and object detection using deep convolutional neural networks | |
Xuan | DRN-LSTM: a deep residual network based on long short-term memory network for students behaviour recognition in education | |
He et al. | An optimal 3D convolutional neural network based lipreading method | |
Shinde et al. | Automatic Data Collection from Forms using Optical Character Recognition | |
Rawat et al. | Indian Sign Language Recognition System for Interrogative Words Using Deep Learning | |
Zhang | The Cognitive Transformation of Japanese Language Education by Artificial Intelligence Technology in the Wireless Network Environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190809 |