CN110175551A - Sign language recognition method - Google Patents
Sign language recognition method
- Publication number
- CN110175551A CN110175551A CN201910426216.7A CN201910426216A CN110175551A CN 110175551 A CN110175551 A CN 110175551A CN 201910426216 A CN201910426216 A CN 201910426216A CN 110175551 A CN110175551 A CN 110175551A
- Authority
- CN
- China
- Prior art keywords
- pooling
- sign language
- layer
- frame
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/107—Static hand or arm
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
Abstract
The invention discloses a sign language recognition method, comprising: performing a frequency-domain transform on the video sequence of a sign language video to obtain the phase information of its images; feeding the phase information and the video sequence into a C3D convolutional neural network for a first convolution and fusing the results to form feature information; and feeding the feature information into a deep convolutional neural network for a second convolution and pooling, executing an adaptive learning pooling algorithm during the pooling to filter out a target feature vector, which is fed into a fully connected layer to output the classification result. The present invention integrates the frequency-domain transform into a deep learning algorithm: the phase information extracted from the sign language video by the frequency-domain transform assists the RGB spatial information, and both are fed into the deep learning network to generate the sign language features, so the features thus obtained are more essential and accurate. By adding an adaptive learning pooling algorithm to the pooling layers of the 3D convolutional neural network model, more abstract, higher-level video features can be mined from the sign language video, yielding more accurate classification results.
Description
Technical field
The invention belongs to the technical field of video recognition and relates in particular to a method for sign language semantic recognition.
Background art
In today's era of rapid computer development, human-computer interaction technology has received extensive attention and achieved substantial research results; it mainly includes facial expression recognition, action recognition, and sign language recognition. Sign language is the main means of communication between deaf-mute people and the hearing. For hearing people who have never been trained in sign language, however, nothing beyond a few simple, commonsense gestures can be understood, so the true thoughts of deaf-mute people cannot be fundamentally grasped, and communication between the two groups is difficult. At the same time, sign language recognition can be applied to assist the education and teaching of disabled people, helping to ensure their normal life and study.
Traditional sign language recognition methods require the deaf-mute person to wear a data glove fitted with multiple sensors; the glove captures the wearer's limb movement trajectories, from which intelligible semantics are generated. At present, behavior recognition methods built on the most primitive 3D convolutional neural network model suffer from low sign language recognition accuracy on small datasets, heavy computation, a tendency to overfit, and poor generality.
Chinese invention patent application CN107506712A discloses a human behavior recognition method based on a 3D deep convolutional network. It improves the standard 3-dimensional convolutional network C3D and introduces multi-stage pooling so that features can be extracted from video clips of arbitrary resolution and duration to obtain the final classification result. However, the C3D network structure used in this method is relatively shallow, its recognition precision on large-scale datasets is low, and it has difficulty extracting optimal feature information.
Chinese invention patent application CN107679491A discloses a 3D convolutional neural network sign language recognition method that fuses multi-modal data: features are extracted from gesture infrared images and contour images along the spatial and temporal dimensions, and the outputs of two networks based on the different data formats are fused for the final sign language classification. However, the network input requires a somatosensory device to additionally capture the infrared and contour images, making the processing of the input data rather complicated, and the method recognizes fine-grained behaviors with larger ranges of fluctuation poorly.
Chinese invention patent application CN104281853A discloses a behavior recognition method based on a 3D convolutional neural network. Optical flow information is combined with the input as multi-channel data and fed into the network for separate feature extraction, a fully connected layer performs the final behavior classification, and the whole pipeline is divided into an offline training stage and an online recognition stage. This method can achieve online recognition, but its requirements on the dataset are excessive, the use of optical flow information makes computation complicated, and the recognition efficiency is not very high.
Summary of the invention
The purpose of the present invention is to provide a sign language recognition method intended to solve the problems of suboptimal feature extraction and low recognition accuracy in existing sign language recognition methods.
To solve the above technical problems, the present invention is achieved by the following scheme:
A sign language recognition method, comprising the following process:
forming a video sequence X from a sign language video;
performing frequency-domain-transform-based image processing on the video sequence X to extract phase information;
feeding the phase information and the video sequence X separately into a C3D convolutional neural network for a first convolution, and weighting and fusing the features obtained after the convolution to form fused feature information; and
feeding the fused feature information into a 3D ResNets deep convolutional neural network for a second convolution and pooling, executing an adaptive learning pooling algorithm during the pooling to filter out a target feature vector, and feeding it into the fully connected layer of the 3D ResNets deep convolutional neural network to output the classification result.
Compared with the prior art, the advantages and positive effects of the present invention are as follows. The sign language recognition method of the invention integrates a frequency-domain transform into a deep learning algorithm: the phase information in the sign language video is extracted by the frequency-domain transform and fed into the deep learning algorithm to generate feature information, and the feature information thus obtained is more essential and accurate. In addition, the invention improves the 3D convolutional neural network model by adding an adaptive learning pooling algorithm to the pooling layers of the network model; more abstract, higher-level video features can thereby be mined from the sign language video and more accurate classification results obtained, so that the accuracy of sign language recognition is significantly improved.
Other features and advantages of the present invention will become clearer after the detailed description of the embodiments of the invention is read in conjunction with the accompanying drawings.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. Evidently, the drawings in the following description show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an embodiment of the sign language recognition method proposed by the invention;
Fig. 2 is a structure diagram of an embodiment of the 3D ResNets deep convolutional neural network;
Fig. 3 is an example of dimensionality reduction of a feature matrix using the adaptive learning pooling algorithm.
Specific embodiments
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
The sign language recognition method of this embodiment mainly comprises two stages:
(1) A feature encoding stage based on the frequency-domain transform
The frequency-domain transform is combined with deep learning, and the phase information in the sign language video is extracted by the frequency-domain transform; then the phase information and the sign language video data are fed separately into a C3D convolutional neural network for a first convolution, and the features obtained after the convolution are weighted and fused to form fused feature information.
(2) A feature decoding stage based on the improved 3D ResNets deep convolutional neural network
The fused feature information formed in the first stage is sent into the improved deep convolutional neural network (3D ResNets), where convolution kernels of different scales perform a second convolution on the timing information at different timing positions; then the adaptive learning pooling algorithm proposed by this embodiment reduces the dimensionality of the feature matrices obtained from the second convolution and filters out more abstract, higher-level target feature vectors, which are fed into the fully connected layer to obtain more accurate classification results.
The detailed flow of the sign language recognition method of this embodiment is described below with reference to Fig. 1.
S1. Form a video sequence X from the sign language video.
This process may specifically be designed with the following steps (a code sketch of steps S101 to S104 follows after this step list):
S101. Cut the sign language video into frames.
The original RGB data of the sign language video is cut into N image frames, N preferably being greater than or equal to 34. Given the characteristics of the Chinese Sign Language dataset, where the sign language video corresponding to each semantic unit is short, cutting each sign language video into 34 frames is appropriate for that dataset.
S102. Preprocess the image frames.
Considering that in each sign language video the first and last few frames are usually frozen frames or background frames, a data preprocessing step is preferably performed after frame cutting in order to reduce the computation of subsequent steps, preliminarily screening out the useful image frames, also called key frames. As a preferred embodiment, among the N image frames generated by frame cutting, the first f frames and the last f frames are rejected as redundant frames, and only the middle image frames are retained as key frames; f ≤ 5 is preferred.
For the Chinese Sign Language dataset, the first 5 and last 5 of the 34 cut frames can be weeded out, retaining the middle 24 frames as key frames.
S103. Divide the key frames into n segments according to their timing.
As a preferred embodiment, n = 3; that is, the preprocessed key frames are divided into three segments in temporal order.
S104. Randomly select m consecutive image frames from each segment to form the video sequence X.
In this embodiment, 8 consecutive image frames are preferably selected at random from each segment, forming the video sequence X = (x₁, x₂, …, xₙ), where xᵢ denotes the m image frames of the i-th segment, i = 1, 2, …, n.
If the 34 image frames generated by frame cutting are not preprocessed to remove redundant frames, then 11 consecutive image frames can be selected at random from each segment to form the video sequence X.
Of course, when the number of image frames generated by frame cutting exceeds 34, when the number of key frames retained after removing redundant frames exceeds 24, or when the key frames are divided into fewer than 3 equal segments, more than 8 consecutive image frames can be selected at random from each segment to form the video sequence X.
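The following is a minimal sketch of steps S101 to S104, assuming OpenCV is available; the function name build_video_sequence is hypothetical, the frame counts follow the preferred embodiment (34 frames cut, first and last 5 dropped, 3 segments, 8 consecutive frames each), and the 256 × 256 frame size is an illustrative assumption.

```python
import random
import cv2
import numpy as np

def build_video_sequence(video_path, n_frames=34, f_drop=5, n_segments=3, m_frames=8):
    """Cut the video into frames (S101), drop redundant head/tail frames (S102),
    split the key frames into temporal segments (S103), and sample m consecutive
    frames from each segment (S104)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (256, 256)))    # frame size is an assumption
    cap.release()

    key_frames = frames[f_drop:len(frames) - f_drop]    # keep the middle 24 key frames
    seg_len = len(key_frames) // n_segments             # 3 segments of 8 frames each
    clips = []
    for i in range(n_segments):
        seg = key_frames[i * seg_len:(i + 1) * seg_len]
        start = random.randint(0, len(seg) - m_frames)  # random consecutive window
        clips.append(np.stack(seg[start:start + m_frames]))
    return clips    # X = (x1, ..., xn), each element of shape (m, 256, 256, 3)
```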
S2. Perform frequency-domain-transform-based image processing on the video sequence X to extract the image phase information.
Among the many frequency-domain transform algorithms, the Gabor transform, compared with the Fourier transform, has better locality and direction selectivity and a better anti-interference ability. Moreover, for the sign language recognition task, when the spatial position within a video frame changes, the amplitude of the Gabor feature varies relatively little, whereas the phase changes correspondingly, at a certain rate, as the position changes. Relative to the amplitude, therefore, the Gabor phase information better represents the abstract characteristics of the behavior itself and carries more meaning.
In summary, combining the characteristics of sign language video, this embodiment preferably uses the Gabor transform among the frequency-domain transforms to extract the phase information of the video sequence X, so that all the information of the signal can be provided as a whole while the severity of local signal change can be provided at any local time, optimizing the sign language behavior features. There are many methods for computing Gabor phase information, and in principle the combination of any of them with a deep learning network falls within the scope of the present invention; however, to reduce the number of data dimensions and the amount of computation, this embodiment preferably uses the Local Gabor Phase Difference Pattern (LGPDP) proposed in [Guo Y, Xu Z. Local Gabor Phase Difference Pattern for Face Recognition. The 19th International Conference on Pattern Recognition, IEEE, 2008: 1-4] to extract the phase information of the image frames after the Gabor transform. Of course, other improved algorithms based on LGPDP are equally applicable.
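As a hedged illustration of the Gabor phase computation, the sketch below filters a grayscale frame with a quadrature pair of Gabor kernels and takes the per-pixel phase. The kernel parameters are illustrative assumptions, and the raw phase map is a simplification: the embodiment itself prefers the LGPDP descriptor, which encodes local differences of such phases.

```python
import cv2
import numpy as np

def gabor_phase(gray, ksize=15, sigma=4.0, theta=0.0, lambd=10.0, gamma=0.5):
    """Per-pixel Gabor phase of a grayscale frame, obtained from a quadrature
    pair: psi=0 gives the real part, psi=pi/2 the imaginary part."""
    k_re = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma, psi=0)
    k_im = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma, psi=np.pi / 2)
    re = cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, k_re)
    im = cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, k_im)
    return np.arctan2(im, re)   # phase in [-pi, pi]
```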
S3. Feed the video sequence X and the extracted phase information separately into the C3D convolutional neural network for a first convolution.
In this embodiment, the video sequence X and the extracted phase information are preferably first fed into a conventional C3D convolutional neural network model for one round of convolution processing, generating the feature information after the first convolution.
S4. Weight and fuse the feature information obtained after the first convolution to form the fused feature information.
In this embodiment, a traditional weighted fusion algorithm can be used to weight and fuse the feature information processed by the C3D convolutional neural network, forming the fused feature matrix, as in the sketch below.
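A minimal sketch of this weighted fusion, assuming the two C3D branches produce feature tensors of identical shape; the fixed weight alpha is a hypothetical choice and could equally be a learnable parameter.

```python
import torch

def weighted_fusion(feat_rgb: torch.Tensor, feat_phase: torch.Tensor,
                    alpha: float = 0.6) -> torch.Tensor:
    """Fuse the RGB-branch and phase-branch features by a convex combination."""
    return alpha * feat_rgb + (1.0 - alpha) * feat_phase   # alpha is an assumed weight
```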
S5. Feed the fused feature information into the 3D ResNets deep convolutional neural network for a second convolution and pooling, so as to filter out the target feature vector.
To obtain more accurate video features, this embodiment improves the 3D ResNets deep convolutional neural network by introducing an adaptive learning pooling algorithm based on the weighted cross-covariance matrix, which reduces the dimensionality of the feature matrices obtained by convolution and filters out more abstract, higher-level target feature vectors.
As a preferred embodiment, a 19-layer 3D ResNets deep convolutional neural network is used, comprising 1 data input layer, 8 3D convolutional layers with kernels of different scales, 8 pooling layers, and two fully connected layers. As shown in Fig. 2, the 8 3D convolutional layers and the 8 pooling layers are preferably interleaved (a code sketch of this backbone follows below), wherein:
C1-C8 are the 8 3D convolutional layers. The kernel of each 3D convolutional layer is 3 × 3 × 3, and the number of kernels increases progressively from 64 to 512 so as to combine low-level features into more varied high-level features; after each convolutional layer, the features of the two information streams are fused.
S1-S8 are the 8 pooling layers, each of which performs dimensionality reduction using the adaptive learning pooling algorithm. The second pooling layer S2, the sixth pooling layer S6, the seventh pooling layer S7, and the eighth pooling layer S8 use a 2 × 2 × 2 window to downsample the temporal and spatial dimensions simultaneously; the other pooling layers S1, S3, S4, and S5 use a 1 × 2 × 2 window and downsample only in the spatial dimensions.
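A PyTorch sketch of this interleaved backbone is given below, under stated assumptions: the channel ramp is assumed to double stepwise from 64 to 512, standard max pooling stands in for the adaptive learning pooling (sketched in a later section), BN and ELU placements follow the preferences stated elsewhere in this embodiment, and the residual connections of 3D ResNets are omitted for brevity.

```python
import torch.nn as nn

def make_backbone() -> nn.Sequential:
    """Eight interleaved 3D conv + pooling stages, C1/S1 through C8/S8."""
    channels = [3, 64, 64, 128, 128, 256, 256, 512, 512]   # assumed 64 -> 512 ramp
    time_pool = {2, 6, 7, 8}                               # S2, S6, S7, S8: 2x2x2 windows
    layers = []
    for i in range(1, 9):
        layers += [
            nn.Conv3d(channels[i - 1], channels[i], kernel_size=3, padding=1),  # 3x3x3 kernels
            nn.BatchNorm3d(channels[i]),                   # BN after each 3D conv layer
            nn.ELU(inplace=True),                          # ELU activation
            nn.MaxPool3d((2, 2, 2) if i in time_pool else (1, 2, 2)),  # stand-in pooling
        ]
    return nn.Sequential(*layers)

# With four 2x temporal and eight 2x spatial halvings, a valid input needs at
# least 16 frames and 256 x 256 resolution, e.g. torch.randn(1, 3, 16, 256, 256).
```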
The 3D convolutional layers of this embodiment preferably use convolution kernels of different scales to perform the second convolution on the timing information at different timing positions, and then aggregate the convolution features of each timing position along the temporal dimension, reducing the computation of the network structure. As a preferred embodiment, a 1*1 convolution kernel can first be applied to the feature matrix delivered by the data input layer to reduce its dimensionality, which helps reduce the model parameters and normalizes the sizes of the different features. Then the timing information at different timing positions is convolved with kernels of different scales, for example 3*3 and 5*5 kernels selected to convolve the mid- and high-level features of the video; the convolution information of each timing position is then weighted and fused to form the aggregated feature matrix, which is fed into the pooling layer for adaptive feature pooling. A sketch of such a multi-scale block follows below.
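As a hedged sketch of this multi-scale step, the block below reduces channels with a 1×1×1 convolution and then fuses parallel 3×3×3 and 5×5×5 branches with a weighted sum; the branch widths and the learnable fusion weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """1x1x1 reduction followed by weighted fusion of multi-scale 3D convolutions."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.reduce = nn.Conv3d(in_ch, out_ch, kernel_size=1)       # 1x1 dimensionality reduction
        self.branch3 = nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1)
        self.branch5 = nn.Conv3d(out_ch, out_ch, kernel_size=5, padding=2)
        self.w = nn.Parameter(torch.tensor([0.5, 0.5]))             # assumed learnable fusion weights

    def forward(self, x):
        x = self.reduce(x)
        return self.w[0] * self.branch3(x) + self.w[1] * self.branch5(x)

# Usage: MultiScaleBlock(64, 128)(x) accepts x of shape (N, 64, T, H, W).
```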
This embodiment improves the pooling algorithm executed by each pooling layer, proposing an adaptive learning pooling algorithm. As shown in Fig. 3, the cross-covariance matrix of the aggregated feature matrix is first computed and then dimension-reduced, yielding the feature vector up to the current moment; the importance of each frame is then obtained, the pooled feature vector of each frame is scored, weights are assigned in order of importance, and the feature vector carrying the largest weight is chosen as the target feature vector.
The detailed flow of the adaptive learning pooling algorithm proposed by this embodiment, illustrated in the sketch after this step list, is as follows:
S501. From the feature matrix F_n obtained after the fusion following the 3D convolutional layers, compute the cross-covariance matrix Q_n of F_n.
S502. Apply a conventional pooling algorithm to the cross-covariance matrix Q_n to reduce its dimensionality, forming the reduced feature vector.
S503. Denote the reduced feature vector at frame t as x̂_t. The importance β_{t+1} of the reduced feature vector x̂_{t+1} at frame t+1 is calculated as:
β_{t+1} = f_p(φ(x_{t+1}))
where f_p is the prediction function in the perceptron algorithm and φ(x_{t+1}) denotes the reduced feature vectors of the video sequence X from frame 1 up to frame t+1.
S504. Calculate the weight ω of the feature vector at frame t+1; the weight ω is determined by the importance β_{t+1}, a higher importance yielding a higher weight.
S505. Repeat steps S503-S504 to calculate the weight of the feature vector at every frame.
S506. Sort the weights of the per-frame feature vectors calculated in step S505 from high to low; the higher the weight, the more useful information the frame contains.
S507. Select the feature vector with the largest weight as the target feature vector and feed it into the fully connected layer.
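The following is a minimal Python sketch of steps S501 to S507, under stated assumptions: the cross-covariance computation and pooled dimensionality reduction of S501-S502 are taken as already applied to the per-frame inputs, the perceptron f_p is a single linear unit, and the weight formula ω, not reproduced in the text above, is approximated by a softmax over the importance scores β.

```python
import torch
import torch.nn as nn

class AdaptiveLearningPool(nn.Module):
    """Scores per-frame feature vectors and keeps the one with the largest weight."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.f_p = nn.Linear(feat_dim, 1)   # perceptron prediction function f_p (assumed linear)

    def forward(self, feats: torch.Tensor):
        # feats: (T, C), per-frame vectors after the S501-S502 covariance
        # pooling and dimensionality reduction, abstracted away here.
        beta = self.f_p(feats).squeeze(-1)  # S503: importance score of each frame
        w = torch.softmax(beta, dim=0)      # S504-S505: weights from importance (assumed softmax)
        best = torch.argmax(w)              # S506: rank weights; S507: keep the maximum
        return feats[best], w

# Usage: pool = AdaptiveLearningPool(512); target, weights = pool(torch.randn(24, 512))
```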
In this embodiment, the data sent to each 3D convolutional layer is a feature matrix, and after the convolution and pooling are executed, one target feature vector is obtained through each pooling layer. The target feature vectors obtained through the pooling layers are each fed to the fully connected layer to obtain a more accurate classification result. To prevent problems such as gradient explosion or vanishing in the deep network, a BN layer is preferably added after each 3D convolutional layer, and a dropout operation is performed in each fully connected layer.
S6. Feed the filtered target feature vector into the fully connected layers to obtain the final classification result.
The 3D ResNets deep convolutional neural network of this embodiment preferably has two fully connected layers, as shown in Fig. 2, wherein:
FC1 is the first fully connected layer and preferably contains 512 neurons. The feature vector output by the eighth pooling layer S8 is connected to the 512 neurons of the FC1 layer and converted in this layer into a 512-dimensional feature vector. A dropout layer is used between the eighth pooling layer S8 and the first fully connected layer FC1, dropping part of the neural network units with probability 0.5, and a transfer learning algorithm is used to freeze part of the connections between the eighth pooling layer S8 and the first fully connected layer FC1 with probability 0.1.
FC2 is the second fully connected layer and also the dense output layer; it contains as many neurons as there are classes in the classification result, for example 6 neurons. Each neuron in the second fully connected layer FC2 is fully connected to the 512 neurons in the first fully connected layer FC1, and classification is finally performed via Softmax regression of the classifier, outputting the classification result of the sign language category.
As a preferred embodiment, in the 3D ResNets deep convolutional neural network, the 3D convolutional layers and the first fully connected layer FC1 preferably use ELU as the activation function to improve the performance of the deep network. The second fully connected layer FC2 preferably uses Softmax as the activation function, the optimization function is preferably SGD, and the loss function is preferably the sum of the multi-class cross-entropy function and the error of the adaptive learning pooling algorithm; that is, the loss function can be embodied as:
L(X, Y) = l_cro(x, y) + μ · l_B(τ)
where L(X, Y) is the loss function, l_cro(x, y) is the multi-class cross-entropy function, l_B(τ) is the error of the adaptive learning pooling algorithm, and μ is a hyperparameter. Since the loss function, the multi-class cross-entropy function, and the error of the pooling algorithm are prior art, the meanings of the relevant parameters in each function are known to those skilled in the art and are not described in detail in this embodiment.
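A minimal sketch of this loss follows, assuming the pooling error l_B(τ) is available as a scalar tensor; the value of the hyperparameter μ is an illustrative assumption, as the text does not specify it.

```python
import torch
import torch.nn as nn

def total_loss(logits: torch.Tensor, targets: torch.Tensor,
               pool_error: torch.Tensor, mu: float = 0.1) -> torch.Tensor:
    """L(X, Y) = l_cro(x, y) + mu * l_B(tau): multi-class cross-entropy
    plus the weighted error of the adaptive learning pooling algorithm."""
    l_cro = nn.functional.cross_entropy(logits, targets)   # l_cro(x, y)
    return l_cro + mu * pool_error                         # mu is an assumed value
```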
The classification result output by the fully connected layers of the 3D ResNets deep convolutional neural network is thereby the recognized meaning of the sign language.
The sign language recognition method of this embodiment can be divided into a training stage and a test stage. The training stage trains with the above steps S1-S6; before that, the weights of the whole network structure are first initialized, preferably by using the public benchmark behavior recognition dataset Kinetics to initialize the weights of the 3D ResNets deep convolutional neural network, so that the weight initialization is sufficiently adapted to the sign language recognition task. Then, during training, a transfer learning strategy is applied to the whole network structure: the convolutional layers are frozen and the last fully connected layer is trained continuously, making the final classification result more accurate. In addition, the initial learning rate is set to 0.001 and is gradually decreased at a rate of one tenth after each set interval of the iterative process, the changes to the learning rate stopping at iteration 2000; the accuracy gradually stabilizes before the whole network reaches about 2000 iterations. The momentum is set to 0.9, and after 30,000 iterations the last network model is loaded and the test stage begins.
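The schedule above can be sketched as follows, under stated assumptions: model.backbone and model.fc are hypothetical attribute names, and the decay milestones are assumed, since the text specifies only the initial rate of 0.001, the one-tenth decay, the stop at iteration 2000, and momentum 0.9.

```python
import torch

def make_optimizer(model):
    """Freeze the convolutional layers and train only the last fully connected
    layer with SGD, decaying the learning rate by 1/10 until iteration 2000."""
    for p in model.backbone.parameters():        # transfer learning: freeze conv layers
        p.requires_grad = False
    opt = torch.optim.SGD(model.fc.parameters(), lr=0.001, momentum=0.9)
    sched = torch.optim.lr_scheduler.MultiStepLR(
        opt, milestones=[500, 1000, 1500], gamma=0.1)   # assumed decay points; fixed after 2000
    return opt, sched
```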
In the test stage, the Chinese Sign Language dataset can be selected as the data source, and all test procedures are conducted on this dataset.
The sign language recognition method of the present invention integrates the frequency-domain transform into the deep learning algorithm: Gabor phase information, which has good recognition performance, assists the RGB spatial information of the sign language video, and combining the extracted phase information with the deep learning process yields more essential, accurate sign language behavior features. The improved 19-layer deep convolutional neural network mines more abstract, higher-level video features from the original video; capturing the video-level features of different timing positions with convolution kernels of different scales not only reduces computation but also makes full use of the raw information in the video, adapting better to sign language recognition against complex backgrounds. Finally, the adaptive learning pooling algorithm reduces the dimensionality of the feature matrices obtained by convolution, producing more accurate classification results and improving the accuracy of sign language recognition.
Of course, the above embodiments merely illustrate the technical solutions of the present invention and do not limit them. Although the invention has been explained in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and such modifications or replacements do not remove the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A sign language recognition method, characterized by comprising:
forming a video sequence X from a sign language video;
performing frequency-domain-transform-based image processing on the video sequence X to extract phase information;
feeding the phase information and the video sequence X separately into a C3D convolutional neural network for a first convolution, and weighting and fusing the features obtained after the convolution to form fused feature information; and
feeding the fused feature information into a 3D ResNets deep convolutional neural network for a second convolution and pooling, executing an adaptive learning pooling algorithm during the pooling to filter out a target feature vector, and feeding it into the fully connected layer of the 3D ResNets deep convolutional neural network to output a classification result.
2. The sign language recognition method according to claim 1, characterized in that the adaptive learning pooling algorithm comprises:
computing, from the feature matrix F_n generated after the second convolution, the cross-covariance matrix Q_n of F_n;
reducing the dimensionality of the cross-covariance matrix Q_n by pooling to form a reduced feature vector;
denoting the reduced feature vector at frame t as x̂_t, and calculating the importance β_{t+1} of the reduced feature vector x̂_{t+1} at frame t+1 as β_{t+1} = f_p(φ(x_{t+1})), where f_p is the prediction function in the perceptron algorithm and φ(x_{t+1}) denotes the reduced feature vectors of the video sequence X from frame 1 up to frame t+1;
calculating the weight ω of the feature vector at frame t+1, the weight ω being determined by the importance β_{t+1}; and
calculating the weight of the feature vector at every frame, and selecting the feature vector with the largest weight as the target feature vector.
3. The sign language recognition method according to claim 1, characterized in that forming the video sequence X comprises:
cutting the sign language video into frames;
dividing the image frames corresponding to the sign language video into n segments according to their timing; and
randomly selecting m consecutive image frames from each segment to form the video sequence X = (x₁, x₂, …, xₙ), where xᵢ denotes the m image frames of the i-th segment.
4. The sign language recognition method according to claim 3, characterized in that forming the video sequence X specifically comprises:
cutting each sign language video into N frames, N ≥ 34, rejecting the first f frames and the last f frames as redundant frames, and retaining the middle key frames, f ≤ 5;
dividing the middle key frames into three segments according to their timing; and
randomly selecting at least eight consecutive image frames from each segment to form the video sequence X.
5. The sign language recognition method according to claim 1, characterized in that, in the process of extracting phase information based on the frequency-domain transform, the phase information of the image frames is extracted using the Gabor transform.
6. The sign language recognition method according to any one of claims 1 to 5, characterized in that, in the 3D ResNets deep convolutional neural network, the 3D convolutional layers use convolution kernels of different scales to perform the second convolution on the timing information at different timing positions, and then aggregate the convolution features of each timing position along the temporal dimension; the feature matrix formed after the second convolution is fed into the pooling layers, where the adaptive learning pooling algorithm reduces its dimensionality to filter out the target feature vector.
7. The sign language recognition method according to claim 6, characterized in that the 3D ResNets deep convolutional neural network comprises 8 3D convolutional layers and 8 pooling layers, the 8 3D convolutional layers and the 8 pooling layers being interleaved, wherein:
the kernel of each 3D convolutional layer is 3 × 3 × 3, the number of kernels increases progressively from 64 to 512, and after each convolutional layer the features of the two information streams are fused; and
each pooling layer performs dimensionality reduction using the adaptive learning pooling algorithm, wherein the second, sixth, seventh, and eighth pooling layers use a 2 × 2 × 2 window to downsample the temporal and spatial dimensions simultaneously, while the other pooling layers use a 1 × 2 × 2 window and downsample only in the spatial dimensions.
8. The sign language recognition method according to claim 7, characterized in that a BN layer is separately added after each 3D convolutional layer.
9. The sign language recognition method according to claim 7, characterized in that the 3D ResNets deep convolutional neural network further comprises one data input layer and two fully connected layers, wherein:
the first fully connected layer contains 512 neurons, and the feature vector output by the eighth pooling layer is converted in this layer into a 512-dimensional feature vector; a dropout layer between the eighth pooling layer and the first fully connected layer drops part of the neural network units with probability 0.5, and a transfer learning algorithm freezes part of the connections between the eighth pooling layer and the first fully connected layer with probability 0.1; and
the second fully connected layer is a dense output layer containing as many neurons as there are classes in the classification result; each of its neurons is fully connected to the 512 neurons of the first fully connected layer, and classification is finally performed via the classifier to output the classification result of the sign language category.
10. The sign language recognition method according to claim 9, characterized in that the 3D convolutional layers and the first fully connected layer use ELU as the activation function, the second fully connected layer uses Softmax as the activation function, the optimization function is SGD, and the loss function is the sum of the multi-class cross-entropy function and the error of the adaptive learning pooling algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910426216.7A CN110175551B (en) | 2019-05-21 | 2019-05-21 | Sign language recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910426216.7A CN110175551B (en) | 2019-05-21 | 2019-05-21 | Sign language recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110175551A true CN110175551A (en) | 2019-08-27 |
CN110175551B CN110175551B (en) | 2023-01-10 |
Family
ID=67691821
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910426216.7A Active CN110175551B (en) | 2019-05-21 | 2019-05-21 | Sign language recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110175551B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5901246A (en) * | 1995-06-06 | 1999-05-04 | Hoffberg; Steven M. | Ergonomic man-machine interface incorporating adaptive pattern recognition based control system |
CN104376306A (en) * | 2014-11-19 | 2015-02-25 | 天津大学 | Optical fiber sensing system invasion identification and classification method and classifier based on filter bank |
CN105654037A (en) * | 2015-12-21 | 2016-06-08 | 浙江大学 | Myoelectric signal gesture recognition method based on depth learning and feature images |
CN107845390A (en) * | 2017-09-21 | 2018-03-27 | 太原理工大学 | A kind of Emotional speech recognition system based on PCNN sound spectrograph Fusion Features |
CN107767405A (en) * | 2017-09-29 | 2018-03-06 | 华中科技大学 | A kind of nuclear phase for merging convolutional neural networks closes filtered target tracking |
JP2019074478A (en) * | 2017-10-18 | 2019-05-16 | 沖電気工業株式会社 | Identification device, identification method and program |
CN109409276A (en) * | 2018-10-19 | 2019-03-01 | 大连理工大学 | A kind of stalwartness sign language feature extracting method |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111126493A (en) * | 2019-12-25 | 2020-05-08 | 东软睿驰汽车技术(沈阳)有限公司 | Deep learning model training method and device, electronic equipment and storage medium |
CN111126493B (en) * | 2019-12-25 | 2023-08-01 | 东软睿驰汽车技术(沈阳)有限公司 | Training method and device for deep learning model, electronic equipment and storage medium |
CN111339837B (en) * | 2020-02-08 | 2022-05-03 | 河北工业大学 | Continuous sign language recognition method |
CN111339837A (en) * | 2020-02-08 | 2020-06-26 | 河北工业大学 | Continuous sign language recognition method |
CN111310701A (en) * | 2020-02-27 | 2020-06-19 | 腾讯科技(深圳)有限公司 | Gesture recognition method, device, equipment and storage medium |
CN111310701B (en) * | 2020-02-27 | 2023-02-10 | 腾讯科技(深圳)有限公司 | Gesture recognition method, device, equipment and storage medium |
US11227151B2 (en) | 2020-03-05 | 2022-01-18 | King Fahd University Of Petroleum And Minerals | Methods and systems for computerized recognition of hand gestures |
CN111507275A (en) * | 2020-04-20 | 2020-08-07 | 北京理工大学 | Video data time sequence information extraction method and device based on deep learning |
CN111507275B (en) * | 2020-04-20 | 2023-10-10 | 北京理工大学 | Video data time sequence information extraction method and device based on deep learning |
CN112464816A (en) * | 2020-11-27 | 2021-03-09 | 南京特殊教育师范学院 | Local sign language identification method and device based on secondary transfer learning |
CN113378722A (en) * | 2021-06-11 | 2021-09-10 | 西安电子科技大学 | Behavior identification method and system based on 3D convolution and multilevel semantic information fusion |
CN113378722B (en) * | 2021-06-11 | 2023-04-07 | 西安电子科技大学 | Behavior identification method and system based on 3D convolution and multilevel semantic information fusion |
CN116343342A (en) * | 2023-05-30 | 2023-06-27 | 山东海量信息技术研究院 | Sign language recognition method, system, device, electronic equipment and readable storage medium |
CN116343342B (en) * | 2023-05-30 | 2023-08-04 | 山东海量信息技术研究院 | Sign language recognition method, system, device, electronic equipment and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110175551B (en) | 2023-01-10 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |