CN110110602A - A dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences - Google Patents
A dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences
- Publication number
- CN110110602A (application CN201910282569.4A)
- Authority
- CN
- China
- Prior art keywords
- layer
- sign language
- video sequence
- dimensional
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/194—Segmentation; Edge detection involving foreground-background segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
Abstract
The present invention provides a dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences. The method proposes a new model, B3D ResNet, based on a three-dimensional residual neural network, and comprises the following steps: step 1, detecting the position of the hand in each video frame using a Faster R-CNN model and segmenting the hand from the background; step 2, extracting the spatiotemporal features of the gesture from the input video sequence with the B3D ResNet model and analyzing the resulting feature sequence; step 3, classifying the input video sequence to identify the gesture, thereby effectively realizing dynamic sign language recognition. By analyzing the spatiotemporal features of the video sequence, the present invention can extract effective spatiotemporal feature sequences of dynamic gestures, so as to distinguish different gestures, and it also performs well on complex or similar sign language recognition. Experimental results on the test data sets show that the present invention can accurately and effectively distinguish different sign language words and similar gesture pairs.
Description
Technical field
The present invention relates to the technical field of sign language recognition, and in particular to a dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences.
Background technique
Sign language recognition is an effective technology for communication between deaf-mute people and others. With the continuous deepening of human-computer interaction research, sign language recognition has become a hot topic. In recent years, automatic sign language recognition systems have created a new mode of human-computer interaction by converting gestures into text or speech, and this technology can be realized with computer assistance. At present, there are many successful applications in this field, such as language translation, sign language tutoring, and special education, all of which can help deaf-mute people communicate fluently with others. On the other hand, sign language is generally composed of a series of actions and consists of rapid movements with similar features. Static sign language recognition technology is therefore ill-suited to the complexity and variability of sign language movements, and research on dynamic sign language recognition is an effective way to solve these problems. Vision-based dynamic gesture recognition is flexible, scalable, and low-cost, and is a hot spot of current gesture-interaction research. However, dynamic sign language recognition still faces challenges in handling the complexity of finger movements against a body background. Another difficulty is how to extract the most effective features from images or video sequences. In addition, how to select a suitable classifier is also a key factor in obtaining accurate recognition results.
In order to help deaf-mute people communicate normally in daily life, more and more researchers are devoted to these problems, and many achievements have been made in dynamic sign language recognition. There are two main approaches: one is recognition based on gesture shape and motion trajectory, and the other is recognition based on sign language video sequences.
Traditional dynamic sign language recognition mainly identifies gestures using the shape and motion-trajectory features of the hand. However, these features cannot fully meet the requirements of practical dynamic sign language recognition. With the rapid development of deep learning, data-driven methods have shown outstanding performance in target detection and gesture recognition. Unlike methods based on gesture shape and motion trajectory, sign language recognition based on video sequences can make full use of temporal information. However, because the hand is small relative to the whole scene, the useful spatial features of the sign language movement can be obscured by irrelevant information. Learning the spatial and temporal features of sign language movements simultaneously is therefore an effective approach to dynamic sign language recognition.
Summary of the invention
The purpose of the present invention is to provide a dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences, so as to solve the problems mentioned in the background above.
To achieve the above object, the invention provides the following technical scheme: a dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences, in which a new model, B3D ResNet, based on a three-dimensional residual neural network is proposed, comprising the following steps:
Step 1: in each video frame, detect the position of the hand using a Faster R-CNN model and segment the hand from the background;
Step 2: use the B3D ResNet model to extract the spatiotemporal features of the gesture from the input video sequence and analyze the resulting feature sequence;
Step 3: classify the input video sequence to identify the gesture, thereby effectively realizing dynamic sign language recognition.
Further, the steps of detecting the position of the hand with the Faster R-CNN model are as follows:
(1) When an image sequence is input to the convolutional neural network, feature maps are generated, and the region proposal network slides a network window of kernel size n × n over the feature maps;
(2) The region proposal network recommends candidate regions and outputs multiple qualified candidates;
(3) The region-of-interest pooling layer converts candidate regions of different sizes into fixed-length representations and outputs them;
(4) Each region of interest is classified and its bounding box regressed, outputting the class of the candidate region and its exact position in the image.
Further, the B3D ResNet model mainly comprises 17 convolutional layers, 2 bidirectional LSTM layers, and 1 fully connected layer. The input layer takes eight image frames of size 112 × 112, centered on the current frame, fed through three channels as L × H × W, where L, H, and W are the temporal length, height, and width. A three-dimensional convolution with kernel size 7 × 7 × 3 (7 × 7 in the spatial dimensions, 3 in the temporal dimension) is then applied on each of the three channels. Down-sampling with kernel size 2 × 2 × 1 is applied to each feature map of the convolutional layer to reduce its dimensionality. The next convolutional layer, C2_x, is obtained by applying 3D convolutions with kernel size 3 × 3 × 3 on the three channels, and the subsequent layers C3_x, C4_x, and C5_x perform the same operation. Shortcut connections are then inserted between every two convolutional layers, converting the network into its residual counterpart. The resulting feature vectors are fed to a long short-term memory network that runs in two directions; the hidden-state layer of each direction is connected to a fully connected layer and a softmax layer to obtain intermediate scores for each action. Finally, the scores of the two long short-term memory networks are averaged to obtain the class prediction score of the current sequence.
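The tensor shapes implied by this description can be checked with a little arithmetic. The sketch below is illustrative only; the stride and padding values are assumptions, since the text does not state them:

```python
# Shape walkthrough for the input pipeline described above. The convolution
# strides and paddings are assumed (ResNet-style), not taken from the patent.
def conv3d_out(shape, kernel, stride=(1, 1, 1), pad=(0, 0, 0)):
    """Output (L, H, W) of a 3-D convolution, standard shape arithmetic."""
    return tuple((s + 2 * p - k) // st + 1
                 for s, k, st, p in zip(shape, kernel, stride, pad))

clip = (8, 112, 112)                       # L x H x W: eight 112x112 frames
after_c1 = conv3d_out(clip, (3, 7, 7),     # kernel 7x7 spatial, 3 temporal
                      stride=(1, 2, 2), pad=(1, 3, 3))
after_pool = conv3d_out(after_c1, (1, 2, 2),  # 2x2x1 down-sampling
                        stride=(1, 2, 2))
print(after_c1, after_pool)                # (8, 56, 56) (8, 28, 28)
```

Note that the 2 × 2 × 1 down-sampling halves only the spatial dimensions, so the temporal length of 8 frames survives into the deeper layers for the bidirectional LSTMs to analyze.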
Further, the spatiotemporal feature extraction that the B3D ResNet model performs on the input video sequence comprises: first extracting the feature vectors of the input video sequence; by constructing three-dimensional convolutions, each feature map in a convolutional layer is connected to multiple consecutive frames of the previous layer, thereby capturing motion information. The three-dimensional convolutional network layer is designed around three-dimensional convolution kernels, each of which extracts one type of feature from a cube of frames. The feature value at any position (p, q, r) of any single network layer is given by the following formula:
f_{pqr} = tanh(z + Σ_{h=0}^{H-1} Σ_{w=0}^{W-1} Σ_{d=0}^{D-1} t_{hwd} · x_{(p+h)(q+w)(r+d)})   (1)
where tanh(·) is the hyperbolic tangent function, the parameters t and x are the connection weights and inputs of the current layer, H, W, and D are the height, width, and temporal dimension of the three-dimensional convolution kernel, and z is the bias of the feature layer.
The present invention learns spatiotemporal features through shortcut connections using an additive residual function of the input. In order to use two-dimensional residual units to encode the three-dimensional architecture of spatiotemporal video information, the basic residual unit is modified according to the design principle of the three-dimensional convolutional network layer: the three-dimensional convolution uses the same kernel size of 3 × 3 × 3 in each of the three channels. By connecting the residuals, the B3D ResNet model can be applied to a three-dimensional convolutional network and automatically extracts spatiotemporal features from the input video sequence.
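A minimal numpy sketch of the two ideas above — a tanh three-dimensional convolution in the spirit of formula (1), and the additive shortcut of the residual unit — might look like this (single channel, "same" padding on the convolution path so the shortcut shapes match; all names are illustrative, not the patent's code):

```python
import numpy as np

def conv3d_tanh(x, t, z):
    """Single-channel 3-D convolution followed by tanh: t is the D x H x W
    kernel, z the bias, x the input volume (valid convolution)."""
    D, H, W = t.shape
    d, h, w = (s - k + 1 for s, k in zip(x.shape, t.shape))
    out = np.empty((d, h, w))
    for i in range(d):
        for j in range(h):
            for k in range(w):
                out[i, j, k] = np.tanh(z + np.sum(t * x[i:i+D, j:j+H, k:k+W]))
    return out

def residual_block(x, t, z):
    """y = F(x) + x: pad so F(x) has the shape of x, then add the shortcut."""
    pad = [(k // 2, k // 2) for k in t.shape]
    xp = np.pad(x, pad)
    return conv3d_tanh(xp, t, z) + x
```

With an odd kernel such as 3 × 3 × 3 the padded convolution preserves the input shape, so the identity shortcut can be added elementwise without any projection.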
Further, the feature-sequence analysis that the B3D ResNet model performs on the input video sequence comprises: using a bidirectional long short-term memory unit, which contains six shared weights and integrates information from the future and the past, to predict each clip of the video sequence. In the bidirectional long short-term memory unit, a forward propagation layer and a backward propagation layer are both connected to the output layer. Conceptually, the memory cell stores past context, and the input and output gates allow context to be stored for a long time; meanwhile, the memory in the cell can be cleared by the forget gate. Formally, given an input sequence x = {x_1, x_2, ..., x_t}, cell states c = {c_1, c_2, ..., c_t}, and hidden states h = {h_1, h_2, ..., h_t}, the quantities i_t, f_t, o_t, c_t, g_t, and h_t are respectively the input gate, forget gate, output gate, memory cell state, cell activation vector, and hidden state. The equations of the bidirectional long short-term memory unit are as follows:
i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)   (2)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)   (3)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)   (4)
g_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)   (5)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t   (6)
h_t = o_t ⊙ tanh(c_t)   (7)
where tanh(·) is the hyperbolic tangent function and σ(·) is the sigmoid function. The forget gate determines when information should be removed from the memory cell; the input gate determines when new information should be integrated into memory. The tanh layer generates a set of candidate values which, if the input gate allows, are added to the memory cell. Referring to formula (6), the memory cell is updated based on the forget gate, the input gate, and the new candidate values. In formula (7), the output gate controls the hidden state and the stored information; finally, the hidden state is expressed as the product of a function of the memory cell state and the output gate.
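Equations (2)-(7) translate almost line for line into code. The sketch below is a plain numpy illustration of one LSTM direction plus the bidirectional combination, with a single stacked weight matrix per direction; it is an assumed layout for illustration, not the patent's implementation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One step of equations (2)-(7). W_x: (4n, d), W_h: (4n, n), b: (4n,);
    rows are ordered [input gate, forget gate, output gate, candidate g]."""
    a = W_x @ x_t + W_h @ h_prev + b
    n = h_prev.size
    i, f, o = (sigmoid(a[k * n:(k + 1) * n]) for k in range(3))  # eqs (2)-(4)
    g = np.tanh(a[3 * n:])                                       # eq (5)
    c = f * c_prev + i * g                                       # eq (6)
    h = o * np.tanh(c)                                           # eq (7)
    return h, c

def bilstm(xs, params_fwd, params_bwd):
    """Run the sequence forward and backward and concatenate the hidden
    states, so each position sees both past and future context."""
    def run(seq, params):
        W_x, W_h, b = params
        n = W_h.shape[1]
        h, c = np.zeros(n), np.zeros(n)
        out = []
        for x_t in seq:
            h, c = lstm_step(x_t, h, c, W_x, W_h, b)
            out.append(h)
        return out
    fwd = run(xs, params_fwd)
    bwd = run(xs[::-1], params_bwd)[::-1]
    return [np.concatenate([f, b_]) for f, b_ in zip(fwd, bwd)]
```

The backward pass is simply the same cell run over the reversed sequence; reversing its outputs again aligns them with the forward pass before concatenation, which is the "information from future and past" integration described above.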
Compared with the prior art, the beneficial effects of the present invention are:
The invention proposes a new model, B3D ResNet, for dynamic sign language recognition. By analyzing the spatiotemporal features of the video sequence, the model can extract effective spatiotemporal feature sequences of dynamic gestures, so as to distinguish different gestures, and it also performs well on complex or similar sign language recognition. Experimental results on the DEVISIGN-D and SLR_Dataset test sets show that the present invention can accurately and effectively distinguish different sign language words and similar gesture pairs. In addition, the present invention makes full use of the spatiotemporal features of dynamic sign language, improving the accuracy and overall performance of dynamic sign language recognition.
Description of the drawings
Fig. 1 is the overall framework of the present invention;
Fig. 2 is the structure of the B3D ResNet model of the present invention;
Fig. 3 is the three-dimensional residual structural unit of the present invention;
Fig. 4 is the bidirectional long short-term memory network unit of the present invention;
Fig. 5 compares the results of the present invention with other methods;
Fig. 6 shows the hand localization and segmentation results of the present invention.
Specific embodiment
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. The specific embodiments described herein are only intended to explain the technical solution of the present invention and do not limit it.
The present invention provides a technical solution: a dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences, whose structural framework is shown in Fig. 1. The method proposes a new model, B3D ResNet, based on a three-dimensional residual neural network, comprising the following steps:
Step 1: in each video frame, detect the position of the hand using a Faster R-CNN model and segment the hand from the background.
Step 2: use the B3D ResNet model to extract the spatiotemporal features of the gesture from the input video sequence and analyze the resulting feature sequence.
Step 3: classify the input video sequence to identify the gesture, thereby effectively realizing dynamic sign language recognition.
Detecting the hand position is the most important step for temporal segmentation and the subsequent recognition module. In order to obtain accurate information about the hand position in the image, it is essential to select a target-detection algorithm with excellent performance. Compared with SSD, YOLO, and other methods, Faster R-CNN has higher precision and stronger robustness and is suitable for detecting relatively small objects.
As shown in the target localization module of Fig. 1, the steps of detecting the hand position with the Faster R-CNN model are as follows:
(1) When an image sequence is input to the convolutional neural network, feature maps are generated, and the region proposal network slides a network window of kernel size n × n over the feature maps.
(2) The region proposal network recommends candidate regions and outputs multiple qualified candidates.
(3) The region-of-interest pooling layer converts candidate regions of different sizes into fixed-length representations and outputs them.
(4) Each region of interest is classified and its bounding box regressed, outputting the class of the candidate region and its exact position in the image.
Table 1 Detection results
As shown in Table 1, the Faster R-CNN model has high detection accuracy for the target; this result is reflected in the precision metrics reported there. Therefore, detecting the hand with the Faster R-CNN model yields precise position information.
The invention proposes the B3D ResNet model for recognizing dynamic sign language based on video sequences. Specifically, the model can extract video sequence features and learn long-term spatiotemporal features. For dynamic sign language recognition, different dynamic sign language gestures generally correspond to videos with different labels, so a gesture can be identified by classifying its label. By extracting the spatiotemporal features of the video and classifying the feature vectors, various dynamic sign language gestures can be recognized well. In order to improve the recognition accuracy of dynamic sign language, the feature sequence is further analyzed by a bidirectional long short-term memory unit. The B3D ResNet model is described below.
Fig. 2 shows the detailed structure of the B3D ResNet model, which mainly comprises 17 convolutional layers, 2 bidirectional LSTM layers, and 1 fully connected layer. The input layer takes eight image frames of size 112 × 112, centered on the current frame, fed through three channels as L × H × W, where L, H, and W are the temporal length, height, and width. A three-dimensional convolution with kernel size 7 × 7 × 3 (7 × 7 in the spatial dimensions, 3 in the temporal dimension) is then applied on each of the three channels. Down-sampling with kernel size 2 × 2 × 1 is applied to each feature map of the convolutional layer to reduce its dimensionality. The next convolutional layer, C2_x, is obtained by applying 3D convolutions with kernel size 3 × 3 × 3 on the three channels, and the subsequent layers C3_x, C4_x, and C5_x perform the same operation. Shortcut connections are then inserted between every two convolutional layers, converting the network into its residual counterpart. The resulting feature vectors are fed to a long short-term memory network that runs in two directions; the hidden-state layer of each direction is connected to a fully connected layer and a softmax layer to obtain intermediate scores for each action. Finally, the scores of the two long short-term memory networks are averaged to obtain the class prediction score of the current sequence.
The spatiotemporal feature extraction that the B3D ResNet model performs on the input video sequence comprises: the model first extracts the feature vectors of the input video sequence. For image-sequence recognition problems, spatial and temporal information is generally captured from the video sequence by three-dimensional convolution. By constructing three-dimensional convolutions, each feature map in a convolutional layer is connected to multiple consecutive frames of the previous layer, thereby capturing motion information. The three-dimensional convolutional network layer is designed around three-dimensional convolution kernels, each of which extracts one type of feature from a cube of frames. The feature value at any position (p, q, r) of any single network layer is given by the following formula:
f_{pqr} = tanh(z + Σ_{h=0}^{H-1} Σ_{w=0}^{W-1} Σ_{d=0}^{D-1} t_{hwd} · x_{(p+h)(q+w)(r+d)})   (1)
where tanh(·) is the hyperbolic tangent function, the parameters t and x are the connection weights and inputs of the current layer, H, W, and D are the height, width, and temporal dimension of the three-dimensional convolution kernel, and z is the bias of the feature layer.
Moreover, the more layers the three-dimensional convolutional network has, the stronger its learning ability. In addition, residual connections are added to the three-dimensional convolutional network to simplify the training of deeper networks. Rather than directly learning unreferenced nonlinear functions, the present invention uses an additive residual function of the input, which helps to learn spatiotemporal features through shortcut connections. This three-dimensional residual structure is shown in Fig. 3. In order to use two-dimensional residual units to encode the three-dimensional architecture of spatiotemporal video information, the basic residual unit is modified according to the design principle of the three-dimensional convolutional network layer: the three-dimensional convolution uses the same kernel size of 3 × 3 × 3 in each of the three channels. By connecting the residuals, the B3D ResNet model can be applied to a three-dimensional convolutional network and automatically extracts spatiotemporal features from the input video sequence.
The feature-sequence analysis that the B3D ResNet model performs on the input video sequence comprises: the model uses a bidirectional long short-term memory unit, which contains six shared weights and integrates information from the future and the past, to predict each clip of the video sequence; its structure is shown in Fig. 4. In the bidirectional long short-term memory unit, a forward propagation layer and a backward propagation layer are both connected to the output layer. Conceptually, the memory cell stores past context, and the input and output gates allow context to be stored for a long time; meanwhile, the memory in the cell can be cleared by the forget gate. Formally, given an input sequence x = {x_1, x_2, ..., x_t}, cell states c = {c_1, c_2, ..., c_t}, and hidden states h = {h_1, h_2, ..., h_t}, the quantities i_t, f_t, o_t, c_t, g_t, and h_t are respectively the input gate, forget gate, output gate, memory cell state, cell activation vector, and hidden state. The equations of the bidirectional long short-term memory unit are as follows:
i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)   (2)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)   (3)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)   (4)
g_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)   (5)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t   (6)
h_t = o_t ⊙ tanh(c_t)   (7)
where tanh(·) is the hyperbolic tangent function and σ(·) is the sigmoid function. The forget gate determines when information should be removed from the memory cell; the input gate determines when new information should be integrated into memory. The tanh layer generates a set of candidate values which, if the input gate allows, are added to the memory cell. Referring to formula (6), the memory cell is updated based on the forget gate, the input gate, and the new candidate values. In formula (7), the output gate controls the hidden state and the stored information; finally, the hidden state is expressed as the product of a function of the memory cell state and the output gate.
From the above formulas it can be seen that the B3D ResNet model can capture the full information of the input video. For dynamic sign language recognition, the B3D ResNet model has a strong ability to capture the contextual information in the sequence.
The present invention is evaluated on test data sets, including the DEVISIGN-D data set and the SLR_Dataset data set.
The DEVISIGN-D data set is a Chinese sign language data set that provides researchers in the worldwide sign language recognition community with a large-vocabulary Chinese sign language corpus for training and evaluating their algorithms. It consists of 500 everyday words covering 8 different signers. Four of the signers (2 male and 2 female) recorded the vocabulary once, and the other four (2 male and 2 female) recorded it twice, for a total of 6000 videos.
The SLR_Dataset was collected by Huang et al. and released on their project page. A Microsoft Kinect camera was used to record the videos, providing RGB, depth, and body-joint information; in the present invention, only the RGB video information is used. The SLR_Dataset contains 25,000 labeled video instances recorded by 50 signers, and each video instance is annotated by a professional Chinese sign language teacher.
The B3D ResNet model is implemented on the deep learning platform Caffe, and the GPU used in the experiments is a Quadro P4000. When training the model, the batch size is set to 2, the base learning rate to 0.1, and the momentum parameter to 0.9. Because the size of the data sets is limited, the following strategies are adopted to avoid overfitting: one is the well-known method of data augmentation, in which the image sequences are randomly cropped; the other is batch normalization, which aims to reduce internal covariate shift and is applied to all convolutional layers, accelerating the training of the deep neural network.
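The two anti-overfitting strategies can be sketched as follows; the 112-pixel crop size, the frame layout, and the simple per-batch normalization shown here are assumptions for illustration, not details stated in the patent:

```python
import numpy as np

def random_crop(frames, size=112, rng=None):
    """Crop the same random window from every frame of a clip, as in the
    random-cropping augmentation described above (crop size is assumed).
    frames: (L, H, W, ...) array with frames along the first axis."""
    rng = rng or np.random.default_rng()
    L, H, W = frames.shape[:3]
    top = int(rng.integers(0, H - size + 1))
    left = int(rng.integers(0, W - size + 1))
    return frames[:, top:top + size, left:left + size]

def batch_norm(x, eps=1e-5):
    """Normalize a batch of activations to zero mean and unit variance per
    feature, the internal-covariate-shift reduction used during training."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```

Cropping the same window from every frame of a clip preserves the temporal coherence that the 3-D convolutions rely on, while still varying the spatial content between training epochs.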
After the experimental parameters are set, the B3D ResNet model is trained for dynamic sign language recognition: it extracts spatiotemporal features from the input video, analyzes long-term temporal dynamic features, and predicts the label of the gesture sequence. To assess the performance of the B3D ResNet model in dynamic sign language recognition, recognition accuracy is used as the criterion. We compare the proposed B3D ResNet model with several conventional sequence action-recognition models on the DEVISIGN-D data set, namely Res3D, 2D-ResNet, and AlexNet. The comparison of dynamic sign language recognition results is shown in Fig. 5. When these models have been trained for about 20k iterations, the recognition accuracy reaches its maximum. The results show that the accuracies of Res3D, 2D-ResNet, and AlexNet are 86.6%, 85%, and 73.8% respectively, while the accuracy of our method is 89.9%, better than the other methods by at least 3.3%. The experiments therefore show that the B3D ResNet model achieves the best dynamic sign language recognition performance.
For dynamic sign language gesture recognition based on video sequences, the key is to recognize the motion of the hand region. However, the hand region occupies only a very small proportion of the whole image, so the large background area is redundant. The present invention detects the hand region and then segments the hand from the background, which reduces the computation of the B3D ResNet model and thereby improves recognition accuracy. The experimental results are shown in Figure 6. To verify this method, the preprocessing procedure is evaluated on the DEVISIGN-D dataset and SLR_Dataset using two different training modes:
Mode 1: the hand region is first detected and segmented from the image sequence;
Mode 2: the model is trained directly, without any preprocessing.
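The saving from Mode 1 can be quantified with a small sketch: cropping a detected hand box discards most of the background pixels before the sequence reaches the recognition model. The frame size and box coordinates below are hypothetical:

```python
import numpy as np

def crop_hand(frame, box):
    # box = (x1, y1, x2, y2), e.g. from a hand detector such as Faster R-CNN
    x1, y1, x2, y2 = box
    return frame[y1:y2, x1:x2]

frame = np.zeros((240, 320, 3))              # one full video frame
hand = crop_hand(frame, (100, 60, 180, 160)) # a detected hand bounding box
ratio = (hand.shape[0] * hand.shape[1]) / (frame.shape[0] * frame.shape[1])
# only about a tenth of the original pixels remain to be processed
```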
Table 2 Comparison of preprocessing results
The experimental results are shown in Table 2. With preprocessing, verified on the DEVISIGN-D and SLR_Dataset datasets, our method is found to be effective: recognition accuracy improves by 46.1% and 36.7% respectively.
Table 3 Comparison on the DEVISIGN-D and SLR_Dataset datasets
The training results of the B3D ResNet model on the DEVISIGN-D and SLR_Dataset datasets are shown in Table 3. The data show that the B3D ResNet model achieves the highest recognition accuracy. As can be seen from Table 3, the complexity of the two datasets differs, SLR_Dataset being the more challenging. Specifically, on DEVISIGN-D and SLR_Dataset, the results of the present invention are 89.8% and 86.9% respectively, which is 29.5% and 30.3% higher than BLSTM-NN, 25.4% and 21.7% higher than HMM-DTC, 19% and 21.1% higher than DNN, and 11.5% and 13.4% higher than C3D. The comparison shows that the present invention achieves state-of-the-art recognition accuracy for dynamic sign language on both test datasets.
The above describes only preferred embodiments of the present invention, and although the description is specific and detailed, it shall not therefore be construed as limiting the patent scope of the present invention. It should be pointed out that those of ordinary skill in the art may make several variations, improvements and substitutions without departing from the concept of the present invention, and these fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.
Claims (5)
1. A dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences, characterized in that the method proposes a new model, B3D ResNet, based on a three-dimensional residual neural network, and comprises the following steps:
Step 1: in the video frames, detect the position of the hand using a Faster R-CNN model, and segment the hand from the background;
Step 2: use the B3D ResNet model to perform spatio-temporal feature extraction and feature-sequence analysis of the gesture on the input video sequence;
Step 3: classify the input video sequence, whereby the gesture can be identified and dynamic sign language recognition is effectively realized.
2. The dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences according to claim 1, characterized in that the steps for detecting the position of the hand with the Faster R-CNN model are as follows:
(1) when the image sequence is input into the convolutional neural network, a feature map is generated, and the region proposal network slides a network window with a kernel size of n × n over the feature map;
(2) the region proposal network recommends candidate regions and outputs multiple qualified candidate regions;
(3) the region-of-interest pooling layer converts candidate regions of different sizes into candidate regions of fixed length, and then outputs the fixed-length candidate regions;
(4) classification and bounding-box regression are performed on each region of interest, outputting the class of each candidate region and its exact position in the image.
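Step (3), mapping candidate regions of different sizes to a fixed length, can be sketched as a max-pooling grid over each region (a simplified RoI pooling; the 7 × 7 output size and the variable names are assumptions, not taken from the claim):

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=7):
    # feature_map: (H, W); roi = (y1, x1, y2, x2) in feature-map coordinates.
    # Divides the region into an out_size x out_size grid and max-pools each
    # cell, so every region yields the same fixed-length output.
    y1, x1, y2, x2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.max()
    return out

fm = np.arange(100.0).reshape(10, 10)
pooled_a = roi_pool(fm, (0, 0, 9, 6))    # a 9x6 candidate region
pooled_b = roi_pool(fm, (2, 2, 10, 10))  # an 8x8 candidate region
# both come out as 7x7, ready for the same fully connected classifier head
```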
3. The dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences according to claim 1, characterized in that the B3D ResNet model mainly comprises 17 convolutional layers, 2 bidirectional LSTM layers, and 1 fully connected layer. In the input layer, there are eight image frames of size 112 × 112 with three channels, centered on the current frame, and the input is given as L × H × W, where L, H and W are the temporal length, height and width. Then, a three-dimensional convolution with a kernel size of 7 × 7 × 3 is applied to the three channels respectively, where 7 × 7 lies in the spatial dimension and 3 in the temporal dimension. Downsampling with a kernel size of 2 × 2 × 1 acts on each feature map of the convolutional layer to reduce the feature map dimensions. The next convolutional layer, C2_x, is obtained by applying a 3D convolution with a kernel size of 3 × 3 × 3 on the three channels; the subsequent layers C3_x, C4_x and C5_x have the same operation. Afterwards, shortcut connections are inserted between every two layers of the convolutional neural network, converting the network into its corresponding residual version. The feature vector is then sent to long short-term memory networks running in two directions. The hidden-state layer of the long short-term memory network in each direction is connected to a combination of a fully connected layer and a softmax layer to obtain intermediate scores corresponding to each action. Finally, the scores of the two long short-term memory networks are averaged to obtain the class prediction score of the current sequence.
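The layer sizes in the architecture above can be checked with the standard per-dimension convolution output-size formula. The stride and padding values below are not stated in the claim; typical ResNet stem values are assumed:

```python
def conv_out(size, kernel, stride=1, pad=0):
    # Output length of a convolution along one dimension.
    return (size + 2 * pad - kernel) // stride + 1

# Spatial side of the first layer: 112 input, 7x7 kernel (stride 2, pad 3 assumed)
spatial = conv_out(112, kernel=7, stride=2, pad=3)     # 56
# Temporal side: 8 frames, kernel 3 (stride 1, pad 1 assumed) preserves length
temporal = conv_out(8, kernel=3, stride=1, pad=1)      # 8
# The 2x2x1 downsampling halves each spatial side and leaves time untouched
pooled = conv_out(spatial, kernel=2, stride=2, pad=0)  # 28
```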
4. The dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences according to claim 1, characterized in that the spatio-temporal feature extraction that the B3D ResNet model performs on the input video sequence comprises: first, the feature vector of the input video sequence is extracted; by constructing the three-dimensional convolution, the feature map in the convolutional layer is connected to multiple successive frames in the previous layer, and motion information is then captured. The three-dimensional convolutional network layer is designed using a three-dimensional convolution kernel, which can extract one type of feature from a cube of frames. For each element of any single network layer, the feature vector value at any position is given by the following formula:
v = tanh(Σ_{h=1..H} Σ_{w=1..W} Σ_{d=1..D} t_{hwd}·x_{hwd} + z) (1)
where tanh(·) is the hyperbolic tangent function, the parameters t and x are the connection weights and inputs of the current layer, H, W and D are the height, width and temporal dimension of the three-dimensional convolution kernel, and z is the bias of the feature layer.
The present invention learns spatio-temporal features through shortcut connections using an additive residual function of the input. In order to extend the two-dimensional residual unit to a three-dimensional architecture that encodes spatio-temporal video information, the basic residual unit is modified according to the design principle of the three-dimensional convolutional network layer: the three-dimensional convolution has the same kernel size of 3 × 3 × 3 in each of the three channels. By connecting the residuals, the B3D ResNet model can be applied to the three-dimensional convolutional network and automatically extract spatio-temporal features from the input video sequence.
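The value computation of a single 3D-convolution output unit can be sketched directly; the symbols follow the claim (t for connection weights, x for inputs, z for the bias), while the 3 × 3 × 3 kernel values themselves are arbitrary:

```python
import numpy as np

def conv3d_value(x, t, z):
    # Value of one output unit: tanh of the bias z plus the sum, over the
    # H x W x D kernel window, of connection weights t times inputs x.
    return np.tanh(z + np.sum(t * x))

H, W, D = 3, 3, 3              # kernel height, width and temporal depth
x = np.ones((H, W, D))         # a cube of inputs cut from successive frames
t = np.full((H, W, D), 0.01)   # connection weights of the current layer
v = conv3d_value(x, t, z=0.0)  # tanh(27 * 0.01) = tanh(0.27)
```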
5. The dynamic sign language recognition method based on a three-dimensional residual neural network and video sequences according to claim 1, characterized in that the feature-sequence analysis that the B3D ResNet model performs on the gestures of the input video sequence comprises: using a bidirectional long short-term memory unit, which contains six shared weights and integrates information from the future and the past, to make a prediction for each slice of the video sequence. In the bidirectional long short-term memory unit, a forward propagation layer and a backward propagation layer are connected to the output layer. Conceptually, the memory cell stores past context, and the input and output gate units allow context to be stored over long periods; meanwhile, the memory in the cell can be cleared through the forget gate. Formally, given the input sequence x = {x_1, x_2, ..., x_t}, the cell state c = {c_1, c_2, ..., c_t} and the hidden state h = {h_1, h_2, ..., h_t}, the quantities i_t, f_t, o_t, c_t, g_t and h_t denote respectively the input gate, forget gate, output gate, memory cell, candidate activation vector and hidden state. The equations of the bidirectional long short-term memory unit are as follows:
i_t = σ(w_xi·x_t + w_hi·h_{t-1} + b_i) (2)
f_t = σ(w_xf·x_t + w_hf·h_{t-1} + b_f) (3)
o_t = σ(w_xo·x_t + w_ho·h_{t-1} + b_o) (4)
g_t = tanh(w_xc·x_t + w_hc·h_{t-1} + b_c) (5)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t (6)
h_t = o_t ⊙ tanh(c_t) (7)
where tanh(·) is the hyperbolic tangent function and σ(·) is the sigmoid function. The forget gate decides when information should be removed from the memory cell; the input gate decides when new information should be integrated into the memory; the tanh layer generates a set of candidate values which, if the input gate allows, are added to the memory cell. Referring to formula (6), the memory cell is updated based on the forget gate, the input gate and the new candidate values. In formula (7), the output gate controls the hidden state and the stored information; finally, the hidden state is expressed as the product of a function of the memory cell state and the output gate.
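Equations (2)-(7) can be sketched as a single NumPy step of one LSTM direction; the weight names (w['xi'] for w_xi, etc.) and the toy dimensions are assumptions. A bidirectional layer runs this once over the sequence and once over its reverse, then combines the two hidden states:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, w, b):
    # One application of equations (2)-(7).
    i_t = sigmoid(w['xi'] @ x_t + w['hi'] @ h_prev + b['i'])  # input gate   (2)
    f_t = sigmoid(w['xf'] @ x_t + w['hf'] @ h_prev + b['f'])  # forget gate  (3)
    o_t = sigmoid(w['xo'] @ x_t + w['ho'] @ h_prev + b['o'])  # output gate  (4)
    g_t = np.tanh(w['xc'] @ x_t + w['hc'] @ h_prev + b['c'])  # candidates   (5)
    c_t = f_t * c_prev + i_t * g_t                            # cell update  (6)
    h_t = o_t * np.tanh(c_t)                                  # hidden state (7)
    return h_t, c_t

n, m = 4, 3                                   # hidden size, input size (toy)
rng = np.random.default_rng(1)
w = {k: 0.1 * rng.standard_normal((n, m if k[0] == 'x' else n))
     for k in ('xi', 'hi', 'xf', 'hf', 'xo', 'ho', 'xc', 'hc')}
b = {k: np.zeros(n) for k in 'ifoc'}
h, c = np.zeros(n), np.zeros(n)
for x_t in rng.standard_normal((5, m)):       # run a short sequence forward
    h, c = lstm_step(x_t, h, c, w, b)
```

Because the output gate lies in (0, 1) and tanh is bounded, every component of the hidden state stays inside (-1, 1), which is exactly what equation (7) guarantees.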
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910282569.4A CN110110602A (en) | 2019-04-09 | 2019-04-09 | A kind of dynamic sign Language Recognition Method based on three-dimensional residual error neural network and video sequence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910282569.4A CN110110602A (en) | 2019-04-09 | 2019-04-09 | A kind of dynamic sign Language Recognition Method based on three-dimensional residual error neural network and video sequence |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110110602A true CN110110602A (en) | 2019-08-09 |
Family
ID=67483774
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910282569.4A Pending CN110110602A (en) | 2019-04-09 | 2019-04-09 | A kind of dynamic sign Language Recognition Method based on three-dimensional residual error neural network and video sequence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110110602A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110569823A (en) * | 2019-09-18 | 2019-12-13 | 西安工业大学 | sign language identification and skeleton generation method based on RNN |
CN111091045A (en) * | 2019-10-25 | 2020-05-01 | 重庆邮电大学 | Sign language identification method based on space-time attention mechanism |
CN111273779A (en) * | 2020-02-20 | 2020-06-12 | 沈阳航空航天大学 | Dynamic gesture recognition method based on adaptive spatial supervision |
CN111339837A (en) * | 2020-02-08 | 2020-06-26 | 河北工业大学 | Continuous sign language recognition method |
CN111797777A (en) * | 2020-07-07 | 2020-10-20 | 南京大学 | Sign language recognition system and method based on space-time semantic features |
CN112487967A (en) * | 2020-11-30 | 2021-03-12 | 电子科技大学 | Scenic spot painting behavior identification method based on three-dimensional convolution network |
CN112818914A (en) * | 2021-02-24 | 2021-05-18 | 网易(杭州)网络有限公司 | Video content classification method and device |
CN113071438A (en) * | 2020-01-06 | 2021-07-06 | 北京地平线机器人技术研发有限公司 | Control instruction generation method and device, storage medium and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107451552A (en) * | 2017-07-25 | 2017-12-08 | 北京联合大学 | A kind of gesture identification method based on 3D CNN and convolution LSTM |
CN107679491A (en) * | 2017-09-29 | 2018-02-09 | 华中师范大学 | A kind of 3D convolutional neural networks sign Language Recognition Methods for merging multi-modal data |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107451552A (en) * | 2017-07-25 | 2017-12-08 | 北京联合大学 | A kind of gesture identification method based on 3D CNN and convolution LSTM |
CN107679491A (en) * | 2017-09-29 | 2018-02-09 | 华中师范大学 | A kind of 3D convolutional neural networks sign Language Recognition Methods for merging multi-modal data |
Non-Patent Citations (1)
Title |
---|
Liao Yanqiu et al.: "Dynamic Sign Language Recognition Based on Video Sequence With BLSTM-3D Residual Networks", IEEE Access *
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110569823A (en) * | 2019-09-18 | 2019-12-13 | 西安工业大学 | sign language identification and skeleton generation method based on RNN |
CN111091045A (en) * | 2019-10-25 | 2020-05-01 | 重庆邮电大学 | Sign language identification method based on space-time attention mechanism |
CN111091045B (en) * | 2019-10-25 | 2022-08-23 | 重庆邮电大学 | Sign language identification method based on space-time attention mechanism |
CN113071438A (en) * | 2020-01-06 | 2021-07-06 | 北京地平线机器人技术研发有限公司 | Control instruction generation method and device, storage medium and electronic equipment |
CN113071438B (en) * | 2020-01-06 | 2023-03-24 | 北京地平线机器人技术研发有限公司 | Control instruction generation method and device, storage medium and electronic equipment |
CN111339837A (en) * | 2020-02-08 | 2020-06-26 | 河北工业大学 | Continuous sign language recognition method |
CN111339837B (en) * | 2020-02-08 | 2022-05-03 | 河北工业大学 | Continuous sign language recognition method |
CN111273779A (en) * | 2020-02-20 | 2020-06-12 | 沈阳航空航天大学 | Dynamic gesture recognition method based on adaptive spatial supervision |
CN111273779B (en) * | 2020-02-20 | 2023-09-19 | 沈阳航空航天大学 | Dynamic gesture recognition method based on self-adaptive space supervision |
CN111797777A (en) * | 2020-07-07 | 2020-10-20 | 南京大学 | Sign language recognition system and method based on space-time semantic features |
CN111797777B (en) * | 2020-07-07 | 2023-10-17 | 南京大学 | Sign language recognition system and method based on space-time semantic features |
CN112487967A (en) * | 2020-11-30 | 2021-03-12 | 电子科技大学 | Scenic spot painting behavior identification method based on three-dimensional convolution network |
CN112818914A (en) * | 2021-02-24 | 2021-05-18 | 网易(杭州)网络有限公司 | Video content classification method and device |
CN112818914B (en) * | 2021-02-24 | 2023-08-18 | 网易(杭州)网络有限公司 | Video content classification method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110110602A (en) | A kind of dynamic sign Language Recognition Method based on three-dimensional residual error neural network and video sequence | |
Zhang et al. | Dynamic hand gesture recognition based on short-term sampling neural networks | |
US20210326597A1 (en) | Video processing method and apparatus, and electronic device and storage medium | |
CN108804530B (en) | Subtitling areas of an image | |
Yang et al. | Extraction of 2d motion trajectories and its application to hand gesture recognition | |
US20190362707A1 (en) | Interactive method, interactive terminal, storage medium, and computer device | |
CN109961034A (en) | Video object detection method based on convolution gating cycle neural unit | |
CN103268495B (en) | Human body behavior modeling recognition methods based on priori knowledge cluster in computer system | |
CN110399850A (en) | A kind of continuous sign language recognition method based on deep neural network | |
CN111291556B (en) | Chinese entity relation extraction method based on character and word feature fusion of entity meaning item | |
CN110717431A (en) | Fine-grained visual question and answer method combined with multi-view attention mechanism | |
CN112784763A (en) | Expression recognition method and system based on local and overall feature adaptive fusion | |
CN110096991A (en) | A kind of sign Language Recognition Method based on convolutional neural networks | |
CN115966010A (en) | Expression recognition method based on attention and multi-scale feature fusion | |
Rani et al. | An effectual classical dance pose estimation and classification system employing convolution neural network–long shortterm memory (CNN-LSTM) network for video sequences | |
CN113378919B (en) | Image description generation method for fusing visual sense and enhancing multilayer global features | |
Balachandar et al. | Deep learning technique based visually impaired people using YOLO V3 framework mechanism | |
Mahyoub et al. | Sign Language Recognition using Deep Learning | |
Cai et al. | Performance analysis of distance teaching classroom based on machine learning and virtual reality | |
Wu | Biomedical image segmentation and object detection using deep convolutional neural networks | |
Xuan | DRN-LSTM: a deep residual network based on long short-term memory network for students behaviour recognition in education | |
He et al. | An optimal 3D convolutional neural network based lipreading method | |
Shinde et al. | Automatic Data Collection from Forms using Optical Character Recognition | |
Rawat et al. | Indian Sign Language Recognition System for Interrogative Words Using Deep Learning | |
Zhang | The Cognitive Transformation of Japanese Language Education by Artificial Intelligence Technology in the Wireless Network Environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190809 |