CN110175551A - Sign language recognition method - Google Patents
Sign language recognition method
- Publication number
- CN110175551A CN110175551A CN201910426216.7A CN201910426216A CN110175551A CN 110175551 A CN110175551 A CN 110175551A CN 201910426216 A CN201910426216 A CN 201910426216A CN 110175551 A CN110175551 A CN 110175551A
- Authority
- CN
- China
- Prior art keywords
- pooling
- sign language
- layer
- frame
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/107—Static hand or arm
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
Abstract
The invention discloses a sign language recognition method, comprising: performing a frequency-domain transform on the video sequence of a sign language video to obtain the phase information of its images; feeding the phase information and the video sequence into a C3D convolutional neural network for a first convolution and fusing the results to form feature information; and feeding the feature information into a deep convolutional neural network for a second convolution and pooling, executing an adaptive learning pooling algorithm during the pooling to filter out a target feature vector, which is fed into a fully connected layer to output the classification result. The present invention integrates the frequency-domain transform into a deep learning algorithm: the phase information extracted from the sign language video by the frequency-domain transform assists the RGB spatial information, and both are fed into the deep learning network to generate the sign language features, so the features thus obtained are more essential and accurate. By adding an adaptive learning pooling algorithm to the pooling layers of the 3D convolutional neural network model, more abstract, higher-level video features can be mined from the sign language video, yielding more accurate classification results.
Description
Technical field
The invention belongs to the technical field of video recognition and relates in particular to a method for sign language semantic recognition.
Background art
In today's era of rapid computer development, human-computer interaction technology has received extensive attention and achieved substantial research results; it mainly includes facial expression recognition, action recognition, and sign language recognition. Sign language is the main means of communication between deaf-mute people and the hearing. For hearing people who have never been trained in sign language, however, nothing beyond a few simple, commonsense gestures can be understood, so the true thoughts of deaf-mute people cannot be fundamentally grasped, and communication between the two groups is difficult. At the same time, sign language recognition can be applied to assist the education and teaching of disabled people, helping to ensure their normal life and study.
Traditional sign language recognition methods require the deaf-mute person to wear a data glove fitted with multiple sensors; the glove captures the wearer's limb movement trajectories, from which intelligible semantics are generated. At present, behavior recognition methods built on the most primitive 3D convolutional neural network model suffer from low sign language recognition accuracy on small datasets, heavy computation, a tendency to overfit, and poor generality.
Chinese invention patent application CN107506712A discloses a human behavior recognition method based on a 3D deep convolutional network. It improves the standard 3-dimensional convolutional network C3D and introduces multi-stage pooling so that features can be extracted from video clips of arbitrary resolution and duration to obtain the final classification result. However, the C3D network structure used in this method is relatively shallow, its recognition precision on large-scale datasets is low, and it has difficulty extracting optimal feature information.
Chinese invention patent application CN107679491A discloses a 3D convolutional neural network sign language recognition method that fuses multi-modal data: features are extracted from gesture infrared images and contour images along the spatial and temporal dimensions, and the outputs of two networks based on the different data formats are fused for the final sign language classification. However, the network input requires a somatosensory device to additionally capture the infrared and contour images, making the processing of the input data rather complicated, and the method recognizes fine-grained behaviors with larger ranges of fluctuation poorly.
Chinese invention patent application CN104281853A discloses a behavior recognition method based on a 3D convolutional neural network. Optical flow information is combined with the input as multi-channel data and fed into the network for separate feature extraction, a fully connected layer performs the final behavior classification, and the whole pipeline is divided into an offline training stage and an online recognition stage. This method can achieve online recognition, but its requirements on the dataset are excessive, the use of optical flow information makes computation complicated, and the recognition efficiency is not very high.
Summary of the invention
The purpose of the present invention is to provide a sign language recognition method intended to solve the problems of suboptimal feature extraction and low recognition accuracy in existing sign language recognition methods.
To solve the above technical problems, the present invention is achieved by the following scheme:
A sign language recognition method, comprising the following process:
forming a video sequence X from a sign language video;
performing frequency-domain-transform-based image processing on the video sequence X to extract phase information;
feeding the phase information and the video sequence X separately into a C3D convolutional neural network for a first convolution, and weighting and fusing the features obtained after the convolution to form fused feature information; and
feeding the fused feature information into a 3D ResNets deep convolutional neural network for a second convolution and pooling, executing an adaptive learning pooling algorithm during the pooling to filter out a target feature vector, and feeding it into the fully connected layer of the 3D ResNets deep convolutional neural network to output the classification result.
Compared with the prior art, the advantages and positive effects of the present invention are as follows. The sign language recognition method of the invention integrates a frequency-domain transform into a deep learning algorithm: the phase information in the sign language video is extracted by the frequency-domain transform and fed into the deep learning algorithm to generate feature information, and the feature information thus obtained is more essential and accurate. In addition, the invention improves the 3D convolutional neural network model by adding an adaptive learning pooling algorithm to the pooling layers of the network model; more abstract, higher-level video features can thereby be mined from the sign language video and more accurate classification results obtained, so that the accuracy of sign language recognition is significantly improved.
Other features and advantages of the present invention will become clearer after the detailed description of the embodiments of the invention is read in conjunction with the accompanying drawings.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. Evidently, the drawings in the following description show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an embodiment of the sign language recognition method proposed by the invention;
Fig. 2 is a structure diagram of an embodiment of the 3D ResNets deep convolutional neural network;
Fig. 3 is an example of dimensionality reduction of a feature matrix using the adaptive learning pooling algorithm.
Specific embodiments
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
The sign language recognition method of this embodiment mainly comprises two stages:
(1) A feature encoding stage based on the frequency-domain transform
The frequency-domain transform is combined with deep learning, and the phase information in the sign language video is extracted by the frequency-domain transform; then the phase information and the sign language video data are fed separately into a C3D convolutional neural network for a first convolution, and the features obtained after the convolution are weighted and fused to form fused feature information.
(2) A feature decoding stage based on the improved 3D ResNets deep convolutional neural network
The fused feature information formed in the first stage is sent into the improved deep convolutional neural network (3D ResNets), where convolution kernels of different scales perform a second convolution on the timing information at different timing positions; then the adaptive learning pooling algorithm proposed by this embodiment reduces the dimensionality of the feature matrices obtained from the second convolution and filters out more abstract, higher-level target feature vectors, which are fed into the fully connected layer to obtain more accurate classification results.
The detailed flow of the sign language recognition method of this embodiment is described below with reference to Fig. 1.
S1. Form a video sequence X from the sign language video.
This process may specifically be designed with the following steps (a code sketch of steps S101 to S104 follows after this step list):
S101. Cut the sign language video into frames.
The original RGB data of the sign language video is cut into N image frames, N preferably being greater than or equal to 34. Given the characteristics of the Chinese Sign Language dataset, where the sign language video corresponding to each semantic unit is short, cutting each sign language video into 34 frames is appropriate for that dataset.
S102. Preprocess the image frames.
Considering that in each sign language video the first and last few frames are usually frozen frames or background frames, a data preprocessing step is preferably performed after frame cutting in order to reduce the computation of subsequent steps, preliminarily screening out the useful image frames, also called key frames. As a preferred embodiment, among the N image frames generated by frame cutting, the first f frames and the last f frames are rejected as redundant frames, and only the middle image frames are retained as key frames; f ≤ 5 is preferred.
For the Chinese Sign Language dataset, the first 5 and last 5 of the 34 cut frames can be weeded out, retaining the middle 24 frames as key frames.
S103. Divide the key frames into n segments according to their timing.
As a preferred embodiment, n = 3; that is, the preprocessed key frames are divided into three segments in temporal order.
S104. Randomly select m consecutive image frames from each segment to form the video sequence X.
In this embodiment, 8 consecutive image frames are preferably selected at random from each segment, forming the video sequence X = (x₁, x₂, …, xₙ), where xᵢ denotes the m image frames of the i-th segment, i = 1, 2, …, n.
If the 34 image frames generated by frame cutting are not preprocessed to remove redundant frames, then 11 consecutive image frames can be selected at random from each segment to form the video sequence X.
Of course, when the number of image frames generated by frame cutting exceeds 34, when the number of key frames retained after removing redundant frames exceeds 24, or when the key frames are divided into fewer than 3 equal segments, more than 8 consecutive image frames can be selected at random from each segment to form the video sequence X.
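The following is a minimal sketch of steps S101 to S104, assuming OpenCV is available; the function name build_video_sequence is hypothetical, the frame counts follow the preferred embodiment (34 frames cut, first and last 5 dropped, 3 segments, 8 consecutive frames each), and the 256 × 256 frame size is an illustrative assumption.

```python
import random
import cv2
import numpy as np

def build_video_sequence(video_path, n_frames=34, f_drop=5, n_segments=3, m_frames=8):
    """Cut the video into frames (S101), drop redundant head/tail frames (S102),
    split the key frames into temporal segments (S103), and sample m consecutive
    frames from each segment (S104)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (256, 256)))    # frame size is an assumption
    cap.release()

    key_frames = frames[f_drop:len(frames) - f_drop]    # keep the middle 24 key frames
    seg_len = len(key_frames) // n_segments             # 3 segments of 8 frames each
    clips = []
    for i in range(n_segments):
        seg = key_frames[i * seg_len:(i + 1) * seg_len]
        start = random.randint(0, len(seg) - m_frames)  # random consecutive window
        clips.append(np.stack(seg[start:start + m_frames]))
    return clips    # X = (x1, ..., xn), each element of shape (m, 256, 256, 3)
```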
S2. Perform frequency-domain-transform-based image processing on the video sequence X to extract the image phase information.
Among the many frequency-domain transform algorithms, the Gabor transform, compared with the Fourier transform, has better locality and direction selectivity and a better anti-interference ability. Moreover, for the sign language recognition task, when the spatial position within a video frame changes, the amplitude of the Gabor feature varies relatively little, whereas the phase changes correspondingly, at a certain rate, as the position changes. Relative to the amplitude, therefore, the Gabor phase information better represents the abstract characteristics of the behavior itself and carries more meaning.
In summary, combining the characteristics of sign language video, this embodiment preferably uses the Gabor transform among the frequency-domain transforms to extract the phase information of the video sequence X, so that all the information of the signal can be provided as a whole while the severity of local signal change can be provided at any local time, optimizing the sign language behavior features. There are many methods for computing Gabor phase information, and in principle the combination of any of them with a deep learning network falls within the scope of the present invention; however, to reduce the number of data dimensions and the amount of computation, this embodiment preferably uses the Local Gabor Phase Difference Pattern (LGPDP) proposed in [Guo Y, Xu Z. Local Gabor Phase Difference Pattern for Face Recognition. The 19th International Conference on Pattern Recognition, IEEE, 2008: 1-4] to extract the phase information of the image frames after the Gabor transform. Of course, other improved algorithms based on LGPDP are equally applicable.
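As a hedged illustration of the Gabor phase computation, the sketch below filters a grayscale frame with a quadrature pair of Gabor kernels and takes the per-pixel phase. The kernel parameters are illustrative assumptions, and the raw phase map is a simplification: the embodiment itself prefers the LGPDP descriptor, which encodes local differences of such phases.

```python
import cv2
import numpy as np

def gabor_phase(gray, ksize=15, sigma=4.0, theta=0.0, lambd=10.0, gamma=0.5):
    """Per-pixel Gabor phase of a grayscale frame, obtained from a quadrature
    pair: psi=0 gives the real part, psi=pi/2 the imaginary part."""
    k_re = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma, psi=0)
    k_im = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma, psi=np.pi / 2)
    re = cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, k_re)
    im = cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, k_im)
    return np.arctan2(im, re)   # phase in [-pi, pi]
```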
S3. Feed the video sequence X and the extracted phase information separately into the C3D convolutional neural network for a first convolution.
In this embodiment, the video sequence X and the extracted phase information are preferably first fed into a conventional C3D convolutional neural network model for one round of convolution processing, generating the feature information after the first convolution.
S4. Weight and fuse the feature information obtained after the first convolution to form the fused feature information.
In this embodiment, a traditional weighted fusion algorithm can be used to weight and fuse the feature information processed by the C3D convolutional neural network, forming the fused feature matrix, as in the sketch below.
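A minimal sketch of this weighted fusion, assuming the two C3D branches produce feature tensors of identical shape; the fixed weight alpha is a hypothetical choice and could equally be a learnable parameter.

```python
import torch

def weighted_fusion(feat_rgb: torch.Tensor, feat_phase: torch.Tensor,
                    alpha: float = 0.6) -> torch.Tensor:
    """Fuse the RGB-branch and phase-branch features by a convex combination."""
    return alpha * feat_rgb + (1.0 - alpha) * feat_phase   # alpha is an assumed weight
```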
S5. Feed the fused feature information into the 3D ResNets deep convolutional neural network for a second convolution and pooling, so as to filter out the target feature vector.
To obtain more accurate video features, this embodiment improves the 3D ResNets deep convolutional neural network by introducing an adaptive learning pooling algorithm based on the weighted cross-covariance matrix, which reduces the dimensionality of the feature matrices obtained by convolution and filters out more abstract, higher-level target feature vectors.
As a preferred embodiment, a 19-layer 3D ResNets deep convolutional neural network is used, comprising 1 data input layer, 8 3D convolutional layers with kernels of different scales, 8 pooling layers, and two fully connected layers. As shown in Fig. 2, the 8 3D convolutional layers and the 8 pooling layers are preferably interleaved (a code sketch of this backbone follows below), wherein:
C1-C8 are the 8 3D convolutional layers. The kernel of each 3D convolutional layer is 3 × 3 × 3, and the number of kernels increases progressively from 64 to 512 so as to combine low-level features into more varied high-level features; after each convolutional layer, the features of the two information streams are fused.
S1-S8 are the 8 pooling layers, each of which performs dimensionality reduction using the adaptive learning pooling algorithm. The second pooling layer S2, the sixth pooling layer S6, the seventh pooling layer S7, and the eighth pooling layer S8 use a 2 × 2 × 2 window to downsample the temporal and spatial dimensions simultaneously; the other pooling layers S1, S3, S4, and S5 use a 1 × 2 × 2 window and downsample only in the spatial dimensions.
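A PyTorch sketch of this interleaved backbone is given below, under stated assumptions: the channel ramp is assumed to double stepwise from 64 to 512, standard max pooling stands in for the adaptive learning pooling (sketched in a later section), BN and ELU placements follow the preferences stated elsewhere in this embodiment, and the residual connections of 3D ResNets are omitted for brevity.

```python
import torch.nn as nn

def make_backbone() -> nn.Sequential:
    """Eight interleaved 3D conv + pooling stages, C1/S1 through C8/S8."""
    channels = [3, 64, 64, 128, 128, 256, 256, 512, 512]   # assumed 64 -> 512 ramp
    time_pool = {2, 6, 7, 8}                               # S2, S6, S7, S8: 2x2x2 windows
    layers = []
    for i in range(1, 9):
        layers += [
            nn.Conv3d(channels[i - 1], channels[i], kernel_size=3, padding=1),  # 3x3x3 kernels
            nn.BatchNorm3d(channels[i]),                   # BN after each 3D conv layer
            nn.ELU(inplace=True),                          # ELU activation
            nn.MaxPool3d((2, 2, 2) if i in time_pool else (1, 2, 2)),  # stand-in pooling
        ]
    return nn.Sequential(*layers)

# With four 2x temporal and eight 2x spatial halvings, a valid input needs at
# least 16 frames and 256 x 256 resolution, e.g. torch.randn(1, 3, 16, 256, 256).
```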
The 3D convolutional layers of this embodiment preferably use convolution kernels of different scales to perform the second convolution on the timing information at different timing positions, and then aggregate the convolution features of each timing position along the temporal dimension, reducing the computation of the network structure. As a preferred embodiment, a 1*1 convolution kernel can first be applied to the feature matrix delivered by the data input layer to reduce its dimensionality, which helps reduce the model parameters and normalizes the sizes of the different features. Then the timing information at different timing positions is convolved with kernels of different scales, for example 3*3 and 5*5 kernels selected to convolve the mid- and high-level features of the video; the convolution information of each timing position is then weighted and fused to form the aggregated feature matrix, which is fed into the pooling layer for adaptive feature pooling. A sketch of such a multi-scale block follows below.
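As a hedged sketch of this multi-scale step, the block below reduces channels with a 1×1×1 convolution and then fuses parallel 3×3×3 and 5×5×5 branches with a weighted sum; the branch widths and the learnable fusion weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """1x1x1 reduction followed by weighted fusion of multi-scale 3D convolutions."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.reduce = nn.Conv3d(in_ch, out_ch, kernel_size=1)       # 1x1 dimensionality reduction
        self.branch3 = nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1)
        self.branch5 = nn.Conv3d(out_ch, out_ch, kernel_size=5, padding=2)
        self.w = nn.Parameter(torch.tensor([0.5, 0.5]))             # assumed learnable fusion weights

    def forward(self, x):
        x = self.reduce(x)
        return self.w[0] * self.branch3(x) + self.w[1] * self.branch5(x)

# Usage: MultiScaleBlock(64, 128)(x) accepts x of shape (N, 64, T, H, W).
```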
This embodiment improves the pooling algorithm executed by each pooling layer, proposing an adaptive learning pooling algorithm. As shown in Fig. 3, the cross-covariance matrix of the aggregated feature matrix is first computed and then dimension-reduced, yielding the feature vector up to the current moment; the importance of each frame is then obtained, the pooled feature vector of each frame is scored, weights are assigned in order of importance, and the feature vector carrying the largest weight is chosen as the target feature vector.
The detailed flow of the adaptive learning pooling algorithm proposed by this embodiment, illustrated in the sketch after this step list, is as follows:
S501. From the feature matrix F_n obtained after the fusion following the 3D convolutional layers, compute the cross-covariance matrix Q_n of F_n.
S502. Apply a conventional pooling algorithm to the cross-covariance matrix Q_n to reduce its dimensionality, forming the reduced feature vector.
S503. Denote the reduced feature vector at frame t as x̂_t. The importance β_{t+1} of the reduced feature vector x̂_{t+1} at frame t+1 is calculated as:
β_{t+1} = f_p(φ(x_{t+1}))
where f_p is the prediction function in the perceptron algorithm and φ(x_{t+1}) denotes the reduced feature vectors of the video sequence X from frame 1 up to frame t+1.
S504. Calculate the weight ω of the feature vector at frame t+1; the weight ω is determined by the importance β_{t+1}, a higher importance yielding a higher weight.
S505. Repeat steps S503-S504 to calculate the weight of the feature vector at every frame.
S506. Sort the weights of the per-frame feature vectors calculated in step S505 from high to low; the higher the weight, the more useful information the frame contains.
S507. Select the feature vector with the largest weight as the target feature vector and feed it into the fully connected layer.
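The following is a minimal Python sketch of steps S501 to S507, under stated assumptions: the cross-covariance computation and pooled dimensionality reduction of S501-S502 are taken as already applied to the per-frame inputs, the perceptron f_p is a single linear unit, and the weight formula ω, not reproduced in the text above, is approximated by a softmax over the importance scores β.

```python
import torch
import torch.nn as nn

class AdaptiveLearningPool(nn.Module):
    """Scores per-frame feature vectors and keeps the one with the largest weight."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.f_p = nn.Linear(feat_dim, 1)   # perceptron prediction function f_p (assumed linear)

    def forward(self, feats: torch.Tensor):
        # feats: (T, C), per-frame vectors after the S501-S502 covariance
        # pooling and dimensionality reduction, abstracted away here.
        beta = self.f_p(feats).squeeze(-1)  # S503: importance score of each frame
        w = torch.softmax(beta, dim=0)      # S504-S505: weights from importance (assumed softmax)
        best = torch.argmax(w)              # S506: rank weights; S507: keep the maximum
        return feats[best], w

# Usage: pool = AdaptiveLearningPool(512); target, weights = pool(torch.randn(24, 512))
```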
In this embodiment, the data sent to each 3D convolutional layer is a feature matrix, and after the convolution and pooling are executed, one target feature vector is obtained through each pooling layer. The target feature vectors obtained through the pooling layers are each fed to the fully connected layer to obtain a more accurate classification result. To prevent problems such as gradient explosion or vanishing in the deep network, a BN layer is preferably added after each 3D convolutional layer, and a dropout operation is performed in each fully connected layer.
S6. Feed the filtered target feature vector into the fully connected layers to obtain the final classification result.
The 3D ResNets deep convolutional neural network of this embodiment preferably has two fully connected layers, as shown in Fig. 2, wherein:
FC1 is the first fully connected layer and preferably contains 512 neurons. The feature vector output by the eighth pooling layer S8 is connected to the 512 neurons of the FC1 layer and converted in this layer into a 512-dimensional feature vector. A dropout layer is used between the eighth pooling layer S8 and the first fully connected layer FC1, dropping part of the neural network units with probability 0.5, and a transfer learning algorithm is used to freeze part of the connections between the eighth pooling layer S8 and the first fully connected layer FC1 with probability 0.1.
FC2 is the second fully connected layer and also the dense output layer; it contains as many neurons as there are classes in the classification result, for example 6 neurons. Each neuron in the second fully connected layer FC2 is fully connected to the 512 neurons in the first fully connected layer FC1, and classification is finally performed via Softmax regression of the classifier, outputting the classification result of the sign language category.
As a preferred embodiment, in the 3D ResNets deep convolutional neural network, the 3D convolutional layers and the first fully connected layer FC1 preferably use ELU as the activation function to improve the performance of the deep network. The second fully connected layer FC2 preferably uses Softmax as the activation function, the optimization function is preferably SGD, and the loss function is preferably the sum of the multi-class cross-entropy function and the error of the adaptive learning pooling algorithm; that is, the loss function can be embodied as:
L(X, Y) = l_cro(x, y) + μ · l_B(τ)
where L(X, Y) is the loss function, l_cro(x, y) is the multi-class cross-entropy function, l_B(τ) is the error of the adaptive learning pooling algorithm, and μ is a hyperparameter. Since the loss function, the multi-class cross-entropy function, and the error of the pooling algorithm are prior art, the meanings of the relevant parameters in each function are known to those skilled in the art and are not described in detail in this embodiment.
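A minimal sketch of this loss follows, assuming the pooling error l_B(τ) is available as a scalar tensor; the value of the hyperparameter μ is an illustrative assumption, as the text does not specify it.

```python
import torch
import torch.nn as nn

def total_loss(logits: torch.Tensor, targets: torch.Tensor,
               pool_error: torch.Tensor, mu: float = 0.1) -> torch.Tensor:
    """L(X, Y) = l_cro(x, y) + mu * l_B(tau): multi-class cross-entropy
    plus the weighted error of the adaptive learning pooling algorithm."""
    l_cro = nn.functional.cross_entropy(logits, targets)   # l_cro(x, y)
    return l_cro + mu * pool_error                         # mu is an assumed value
```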
The classification result output by the fully connected layers of the 3D ResNets deep convolutional neural network is thereby the recognized meaning of the sign language.
The sign language recognition method of this embodiment can be divided into a training stage and a test stage. The training stage trains with the above steps S1-S6; before that, the weights of the whole network structure are first initialized, preferably by using the public benchmark behavior recognition dataset Kinetics to initialize the weights of the 3D ResNets deep convolutional neural network, so that the weight initialization is sufficiently adapted to the sign language recognition task. Then, during training, a transfer learning strategy is applied to the whole network structure: the convolutional layers are frozen and the last fully connected layer is trained continuously, making the final classification result more accurate. In addition, the initial learning rate is set to 0.001 and is gradually decreased at a rate of one tenth after each set interval of the iterative process, the changes to the learning rate stopping at iteration 2000; the accuracy gradually stabilizes before the whole network reaches about 2000 iterations. The momentum is set to 0.9, and after 30,000 iterations the last network model is loaded and the test stage begins.
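The schedule above can be sketched as follows, under stated assumptions: model.backbone and model.fc are hypothetical attribute names, and the decay milestones are assumed, since the text specifies only the initial rate of 0.001, the one-tenth decay, the stop at iteration 2000, and momentum 0.9.

```python
import torch

def make_optimizer(model):
    """Freeze the convolutional layers and train only the last fully connected
    layer with SGD, decaying the learning rate by 1/10 until iteration 2000."""
    for p in model.backbone.parameters():        # transfer learning: freeze conv layers
        p.requires_grad = False
    opt = torch.optim.SGD(model.fc.parameters(), lr=0.001, momentum=0.9)
    sched = torch.optim.lr_scheduler.MultiStepLR(
        opt, milestones=[500, 1000, 1500], gamma=0.1)   # assumed decay points; fixed after 2000
    return opt, sched
```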
In the test stage, the Chinese Sign Language dataset can be selected as the data source, and all test procedures are conducted on this dataset.
The sign language recognition method of the present invention integrates the frequency-domain transform into the deep learning algorithm: Gabor phase information, which has good recognition performance, assists the RGB spatial information of the sign language video, and combining the extracted phase information with the deep learning process yields more essential, accurate sign language behavior features. The improved 19-layer deep convolutional neural network mines more abstract, higher-level video features from the original video; capturing the video-level features of different timing positions with convolution kernels of different scales not only reduces computation but also makes full use of the raw information in the video, adapting better to sign language recognition against complex backgrounds. Finally, the adaptive learning pooling algorithm reduces the dimensionality of the feature matrices obtained by convolution, producing more accurate classification results and improving the accuracy of sign language recognition.
Of course, the above embodiments merely illustrate the technical solutions of the present invention and do not limit them. Although the invention has been explained in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and such modifications or replacements do not remove the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A sign language recognition method, characterized by comprising:
forming a video sequence X from a sign language video;
performing frequency-domain-transform-based image processing on the video sequence X to extract phase information;
feeding the phase information and the video sequence X separately into a C3D convolutional neural network for a first convolution, and weighting and fusing the features obtained after the convolution to form fused feature information; and
feeding the fused feature information into a 3D ResNets deep convolutional neural network for a second convolution and pooling, executing an adaptive learning pooling algorithm during the pooling to filter out a target feature vector, and feeding it into the fully connected layer of the 3D ResNets deep convolutional neural network to output a classification result.
2. The sign language recognition method according to claim 1, characterized in that the adaptive learning pooling algorithm comprises:
computing, from the feature matrix F_n generated after the second convolution, the cross-covariance matrix Q_n of F_n;
reducing the dimensionality of the cross-covariance matrix Q_n by pooling to form a reduced feature vector;
denoting the reduced feature vector at frame t as x̂_t, and calculating the importance β_{t+1} of the reduced feature vector x̂_{t+1} at frame t+1 as β_{t+1} = f_p(φ(x_{t+1})), where f_p is the prediction function in the perceptron algorithm and φ(x_{t+1}) denotes the reduced feature vectors of the video sequence X from frame 1 up to frame t+1;
calculating the weight ω of the feature vector at frame t+1, the weight ω being determined by the importance β_{t+1}; and
calculating the weight of the feature vector at every frame, and selecting the feature vector with the largest weight as the target feature vector.
3. The sign language recognition method according to claim 1, characterized in that forming the video sequence X comprises:
cutting the sign language video into frames;
dividing the image frames corresponding to the sign language video into n segments according to their timing; and
randomly selecting m consecutive image frames from each segment to form the video sequence X = (x₁, x₂, …, xₙ), where xᵢ denotes the m image frames of the i-th segment.
4. The sign language recognition method according to claim 3, characterized in that forming the video sequence X specifically comprises:
cutting each sign language video into N frames, N ≥ 34, rejecting the first f frames and the last f frames as redundant frames, and retaining the middle key frames, f ≤ 5;
dividing the middle key frames into three segments according to their timing; and
randomly selecting at least eight consecutive image frames from each segment to form the video sequence X.
5. The sign language recognition method according to claim 1, characterized in that, in the process of extracting phase information based on the frequency-domain transform, the phase information of the image frames is extracted using the Gabor transform.
6. The sign language recognition method according to any one of claims 1 to 5, characterized in that, in the 3D ResNets deep convolutional neural network, the 3D convolutional layers use convolution kernels of different scales to perform the second convolution on the timing information at different timing positions, and then aggregate the convolution features of each timing position along the temporal dimension; the feature matrix formed after the second convolution is fed into the pooling layers, where the adaptive learning pooling algorithm reduces its dimensionality to filter out the target feature vector.
7. The sign language recognition method according to claim 6, characterized in that the 3D ResNets deep convolutional neural network comprises 8 3D convolutional layers and 8 pooling layers, the 8 3D convolutional layers and the 8 pooling layers being interleaved, wherein:
the kernel of each 3D convolutional layer is 3 × 3 × 3, the number of kernels increases progressively from 64 to 512, and after each convolutional layer the features of the two information streams are fused; and
each pooling layer performs dimensionality reduction using the adaptive learning pooling algorithm, wherein the second, sixth, seventh, and eighth pooling layers use a 2 × 2 × 2 window to downsample the temporal and spatial dimensions simultaneously, while the other pooling layers use a 1 × 2 × 2 window and downsample only in the spatial dimensions.
8. The sign language recognition method according to claim 7, characterized in that a BN layer is separately added after each 3D convolutional layer.
9. The sign language recognition method according to claim 7, characterized in that the 3D ResNets deep convolutional neural network further comprises one data input layer and two fully connected layers, wherein:
the first fully connected layer contains 512 neurons, and the feature vector output by the eighth pooling layer is converted in this layer into a 512-dimensional feature vector; a dropout layer between the eighth pooling layer and the first fully connected layer drops part of the neural network units with probability 0.5, and a transfer learning algorithm freezes part of the connections between the eighth pooling layer and the first fully connected layer with probability 0.1; and
the second fully connected layer is a dense output layer containing as many neurons as there are classes in the classification result; each of its neurons is fully connected to the 512 neurons of the first fully connected layer, and classification is finally performed via the classifier to output the classification result of the sign language category.
10. The sign language recognition method according to claim 9, characterized in that the 3D convolutional layers and the first fully connected layer use ELU as the activation function, the second fully connected layer uses Softmax as the activation function, the optimization function is SGD, and the loss function is the sum of the multi-class cross-entropy function and the error of the adaptive learning pooling algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910426216.7A CN110175551B (en) | 2019-05-21 | 2019-05-21 | Sign language recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910426216.7A CN110175551B (en) | 2019-05-21 | 2019-05-21 | Sign language recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110175551A true CN110175551A (en) | 2019-08-27 |
CN110175551B CN110175551B (en) | 2023-01-10 |
Family
ID=67691821
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910426216.7A Active CN110175551B (en) | 2019-05-21 | 2019-05-21 | Sign language recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110175551B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5901246A (en) * | 1995-06-06 | 1999-05-04 | Hoffberg; Steven M. | Ergonomic man-machine interface incorporating adaptive pattern recognition based control system |
CN104376306A (en) * | 2014-11-19 | 2015-02-25 | 天津大学 | Optical fiber sensing system invasion identification and classification method and classifier based on filter bank |
CN105654037A (en) * | 2015-12-21 | 2016-06-08 | 浙江大学 | Myoelectric signal gesture recognition method based on depth learning and feature images |
CN107845390A (en) * | 2017-09-21 | 2018-03-27 | 太原理工大学 | A kind of Emotional speech recognition system based on PCNN sound spectrograph Fusion Features |
CN107767405A (en) * | 2017-09-29 | 2018-03-06 | 华中科技大学 | A kind of nuclear phase for merging convolutional neural networks closes filtered target tracking |
JP2019074478A (en) * | 2017-10-18 | 2019-05-16 | 沖電気工業株式会社 | Identification device, identification method and program |
CN109409276A (en) * | 2018-10-19 | 2019-03-01 | 大连理工大学 | A kind of stalwartness sign language feature extracting method |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111126493A (en) * | 2019-12-25 | 2020-05-08 | 东软睿驰汽车技术(沈阳)有限公司 | Deep learning model training method and device, electronic equipment and storage medium |
CN111126493B (en) * | 2019-12-25 | 2023-08-01 | 东软睿驰汽车技术(沈阳)有限公司 | Training method and device for deep learning model, electronic equipment and storage medium |
CN111339837B (en) * | 2020-02-08 | 2022-05-03 | 河北工业大学 | Continuous sign language recognition method |
CN111339837A (en) * | 2020-02-08 | 2020-06-26 | 河北工业大学 | Continuous sign language recognition method |
CN111310701A (en) * | 2020-02-27 | 2020-06-19 | 腾讯科技(深圳)有限公司 | Gesture recognition method, device, equipment and storage medium |
CN111310701B (en) * | 2020-02-27 | 2023-02-10 | 腾讯科技(深圳)有限公司 | Gesture recognition method, device, equipment and storage medium |
US11227151B2 (en) | 2020-03-05 | 2022-01-18 | King Fahd University Of Petroleum And Minerals | Methods and systems for computerized recognition of hand gestures |
CN111507275A (en) * | 2020-04-20 | 2020-08-07 | 北京理工大学 | Video data time sequence information extraction method and device based on deep learning |
CN111507275B (en) * | 2020-04-20 | 2023-10-10 | 北京理工大学 | Video data time sequence information extraction method and device based on deep learning |
CN112464816A (en) * | 2020-11-27 | 2021-03-09 | 南京特殊教育师范学院 | Local sign language identification method and device based on secondary transfer learning |
CN113378722A (en) * | 2021-06-11 | 2021-09-10 | 西安电子科技大学 | Behavior identification method and system based on 3D convolution and multilevel semantic information fusion |
CN113378722B (en) * | 2021-06-11 | 2023-04-07 | 西安电子科技大学 | Behavior identification method and system based on 3D convolution and multilevel semantic information fusion |
CN116343342A (en) * | 2023-05-30 | 2023-06-27 | 山东海量信息技术研究院 | Sign language recognition method, system, device, electronic equipment and readable storage medium |
CN116343342B (en) * | 2023-05-30 | 2023-08-04 | 山东海量信息技术研究院 | Sign language recognition method, system, device, electronic equipment and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110175551B (en) | 2023-01-10 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |