CN111104929B - Multi-modal dynamic gesture recognition method based on 3D convolution and SPP - Google Patents


Info

Publication number
CN111104929B
CN111104929B (application CN201911423353.1A)
Authority
CN
China
Prior art keywords
sequence sample
convolution
sequence
sample
optical flow
Prior art date
Legal status
Active
Application number
CN201911423353.1A
Other languages
Chinese (zh)
Other versions
CN111104929A (en
Inventor
彭永坚
汪壮雄
许冰媛
周智恒
彭明
朱湘军
Current Assignee
GUANGZHOU VIDEO-STAR ELECTRONICS CO LTD
South China University of Technology SCUT
Original Assignee
GUANGZHOU VIDEO-STAR ELECTRONICS CO LTD
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by GUANGZHOU VIDEO-STAR ELECTRONICS CO LTD and South China University of Technology SCUT
Priority to CN201911423353.1A
Publication of CN111104929A
Application granted
Publication of CN111104929B

Classifications

    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • G06T7/269: Analysis of motion using gradient-based methods
    • G06T2207/10016: Video; Image sequence
    • G06T2207/20081: Training; Learning
    • G06T2207/20084: Artificial neural networks [ANN]


Abstract

The invention discloses a multi-modal dynamic gesture recognition method based on 3D convolution and SPP, which comprises the following steps: data preprocessing, in which optical flow features and grayscale features are extracted from RGB video sequences to obtain optical flow sequence samples and grayscale sequence samples respectively, and each optical flow, grayscale and depth sequence sample is normalized to 32 frames, giving each sample a dimension of 32×112×112; data augmentation, in which the sequence sample dataset is enlarged by translation, flipping, noise addition and affine transformation; neural network training, in which the grayscale, optical flow and depth sequence samples are fed into the same network structure and three networks are trained separately to classify gestures; and model integration, in which the classification results of the three networks on a sequence sample are fused to obtain the final decision. The technical scheme of the invention improves the accuracy of gesture recognition.

Description

Multi-modal dynamic gesture recognition method based on 3D convolution and SPP
Technical Field
The invention relates to the technical field of image recognition, and in particular to a multi-modal dynamic gesture recognition method based on 3D convolution and SPP.
Background
Gestures are an important mode of human-computer interaction, and gesture recognition uses a computer to recognize the gestures a person makes. Gesture recognition comprises static gesture recognition and dynamic gesture recognition. Static gesture recognition focuses on the hand shape in a single frame of an image and is relatively simple; dynamic gesture recognition focuses not only on hand shape but also on the trajectory and shape changes of the gesture in the spatio-temporal dimensions. Because of the diversity and variability of dynamic gestures, their recognition accuracy is still low, making this a challenging research direction in the field of artificial intelligence.
With the development of deep learning, dynamic gesture recognition with deep convolutional neural networks has attracted increasing attention from researchers. However, when an ordinary 2D convolutional neural network processes a video image sequence, information about the target in the time dimension is easily lost, and the changes of the target across the spatio-temporal dimensions cannot be extracted effectively, which limits the recognition accuracy of the network. Feature learning in the spatio-temporal dimensions of video is therefore the key to recognizing human dynamic gestures.
Disclosure of Invention
In order to solve the above technical problems, an embodiment of the present invention provides a multi-modal dynamic gesture recognition method based on 3D convolution and SPP, comprising:
a data preprocessing step: extracting optical flow features and grayscale features from RGB video sequences to obtain optical flow sequence samples and grayscale sequence samples respectively, and normalizing each optical flow, grayscale and depth sequence sample to 32 frames, so that each sample has dimension 32×112×112;
a data augmentation step: enlarging the sequence sample dataset by translation, flipping, noise addition and affine transformation;
a neural network training step: feeding the grayscale, optical flow and depth sequence samples into the same network structure and training three networks separately to classify gestures;
and a model integration step: fusing the classification results of the three networks on a sequence sample to obtain the final classification result.
Preferably, the data preprocessing step comprises the following steps:
extracting optical flow features from the 1080 RGB video sequences in the SKIG dataset using the iDT algorithm, obtaining 1080 optical flow sequence samples;
converting each frame of the RGB video sequences to grayscale, obtaining 1080 grayscale sequence samples;
since different gesture sequence samples have different durations, normalizing each sequence sample to a fixed 32 frames by nearest-neighbour frame repetition or frame dropping, each frame having dimension 112×112, as the input to the neural network.
Preferably, the iDT algorithm is as follows:
the iDT algorithm assumes that the relationship between two adjacent frames can be described by a projective transformation matrix, the later frame being obtained by projectively transforming the earlier frame;
feature matching between adjacent frames combines SURF features with dense optical flow, and the projective transformation matrix is estimated with the RANSAC algorithm.
Preferably, the data augmentation step is as follows:
the optical flow, grayscale and depth sequence samples corresponding to the same gesture are transformed in the same way, the transformations comprising:
a translation operation: the pixel (x, y) on each channel of each sequence sample is translated by Δx units along the x-axis and Δy units along the y-axis, i.e. (x', y') = (x+Δx, y+Δy), where Δx is any integer in [−0.1×w, 0.1×w], Δy is any integer in [−0.1×h, 0.1×h], w is the width of each frame image, and h is the height of each frame image;
a flipping operation: the data of each channel of each sequence sample is mirrored horizontally and mirrored vertically;
a noise-addition operation: white Gaussian noise following a Gaussian distribution with mean 0 and variance 0.1 is added to the data of each channel of each sequence sample;
an affine transformation: the data of each channel of each sequence sample is rotated by a set angle from 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°.
Preferably, the neural network training step comprises the following steps:
the grayscale, optical flow and depth sequence samples corresponding to the same gesture are fed into the same network structure, and three neural networks are trained separately to classify gestures; specifically, the optical flow sequence samples train a first neural network, the grayscale sequence samples train a second neural network, and the depth sequence samples train a third neural network;
each neural network consists of a 3D convolutional neural network, an SPP layer and fully-connected layers: the 3D convolutional neural network extracts the spatio-temporal features of the gesture, the SPP layer then extracts global and local features, and two fully-connected layers plus softmax produce the gesture classification scores.
Preferably, the 3D convolutional neural network comprises 5 convolution layers;
each convolution layer comprises a convolution operation followed by a pooling operation; the convolution operations use 3×3×3 kernels with stride 1×1×1;
the first, second and third convolution operations comprise 64, 128 and 256 kernels respectively, each followed by a BN layer and a ReLU activation; the first pooling operation uses a 1×2×2 window with stride 2×2×2, and the second and third pooling operations use 2×2×2 windows with stride 2×2×2;
the fourth and fifth convolution operations each comprise 512 kernels; the fourth and fifth pooling operations use 2×2×2 windows with stride 2×1×1; and the first to fifth pooling operations all use mean pooling.
As a preferred scheme, the SPP network applies spatial pyramid pooling at different scales to the feature maps produced by the 3D convolutional neural network, yielding a (16+4+1)×512-dimensional feature vector; this vector is fed into two fully-connected layers of 1024 neurons each, and the result is passed to a softmax layer to obtain the scores of the 10 gesture classes.
Preferably, the model integration multiplies element-wise the gesture classification scores produced by the three networks for a sequence sample, and assigns the sample to the gesture class with the highest score.
Compared with the prior art, the invention has the following advantages and effects:
(1) The method performs data augmentation with translation, flipping, noise addition and affine transformation, which improves the generalization ability of the gesture classification model;
(2) The method feeds sequence samples into a 3D convolutional neural network to extract spatio-temporal features simultaneously, and uses the SPP network to extract local and global features, achieving dynamic gesture recognition with high accuracy;
(3) The method takes multi-modal sequence samples as input, trains three gesture classifiers separately, and improves the recognition accuracy of the gesture recognition system through model integration.
Drawings
FIG. 1 is a general flow chart of the disclosed multi-modal dynamic gesture recognition method based on 3D convolution and SPP;
FIG. 2 is a schematic diagram of the neural network structure in the disclosed multi-modal dynamic gesture recognition method based on 3D convolution and SPP.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art from the embodiments of the invention without inventive effort fall within the scope of the invention.
Embodiment one:
the dataset SKIG used in this embodiment contains 2160 gesture video sequences, of which there are 1080 RGB video sequences and 1080 depth video sequences, all captured simultaneously by the Kinect sensor, containing 10 types of gestures.
As shown in FIG. 1, the multi-modal dynamic gesture recognition method based on 3D convolution and SPP comprises, in order, the following steps: a data preprocessing step, a data augmentation step, a neural network training step and a model integration step.
In the data preprocessing step, optical flow features are extracted from the 1080 RGB video sequences in the SKIG dataset using the iDT algorithm, yielding 1080 optical flow sequence samples. Each frame of the RGB video sequences is converted to grayscale, yielding 1080 grayscale sequence samples. Different gesture sequence samples have different durations, so each sequence sample is normalized to a fixed 32 frames by nearest-neighbour frame repetition or frame dropping; each frame has dimension 112×112, so every sequence sample has dimension 32×112×112 and serves as the input to the neural network.
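A minimal NumPy sketch of this nearest-neighbour temporal normalization (the function name and the demo clip are illustrative, not from the patent):

```python
import numpy as np

def normalize_sequence(frames: np.ndarray, target_len: int = 32) -> np.ndarray:
    """Resample a (T, 112, 112) sequence to exactly target_len frames by
    nearest-neighbour index mapping: short clips repeat frames, long
    clips drop frames."""
    t = frames.shape[0]
    # Map each of the target positions to its nearest source frame index.
    idx = np.round(np.linspace(0, t - 1, target_len)).astype(int)
    return frames[idx]

# Example: a 47-frame grayscale clip becomes a 32x112x112 sample.
clip = np.random.rand(47, 112, 112).astype(np.float32)
assert normalize_sequence(clip).shape == (32, 112, 112)
```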
The iDT algorithm assumes that the relationship between two adjacent frames can be described by a projective transformation matrix, i.e. the later frame can be obtained by projectively transforming the earlier frame; this exploits the fact that the change between two adjacent frames is relatively small. Feature matching between adjacent frames combines SURF features with dense optical flow, and the projective transformation matrix is estimated with the RANSAC algorithm.
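A sketch of this matching-and-estimation step with OpenCV is given below. It is an illustration under assumptions, not the patent's implementation: ORB stands in for SURF (which requires the non-free opencv-contrib build), Farneback stands in for iDT's dense flow, and all parameter values are arbitrary.

```python
import cv2
import numpy as np

def estimate_projective_transform(prev_gray: np.ndarray,
                                  curr_gray: np.ndarray) -> np.ndarray:
    """Match keypoints between two adjacent frames and estimate the 3x3
    projective transformation (homography) with RANSAC."""
    detector = cv2.ORB_create(nfeatures=1000)   # stand-in for SURF
    kp1, des1 = detector.detectAndCompute(prev_gray, None)
    kp2, des2 = detector.detectAndCompute(curr_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H

def dense_flow(prev_gray: np.ndarray, curr_gray: np.ndarray) -> np.ndarray:
    """Dense optical flow between adjacent frames (Farneback variant)."""
    return cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```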
In the data augmentation step, the optical flow, grayscale and depth sequence samples corresponding to the same gesture are transformed in the same way to enlarge the sequence sample dataset. The transformations are as follows (a combined code sketch is given after this list):
the translation operation is as follows:
the pixel point (x, y) on each channel of each sequence sample is translated by Δx units along the x-axis and by Δy units along the y-axis, i.e., (x ', y') = (x+Δx, y+Δy). Wherein Deltax is any integer of [ -0.1 xw, 0.1 xw ], deltay is any integer of [ -0.1 xh, 0.1 xh ], w is the corresponding width of each frame image, and h is the corresponding length of each frame image.
The flipping operation is as follows:
the data of each channel of each sequence sample is mirrored horizontally and mirrored vertically.
The noise-addition operation is as follows:
white Gaussian noise following a Gaussian distribution with mean 0 and variance 0.1 is added to the data of each channel of each sequence sample.
The affine transformation operation is as follows:
the data of each channel of each sequence sample is rotated by a set angle from 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°.
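A NumPy/SciPy sketch of the four augmentations applied to one sequence sample; the function name and the random-parameter handling are illustrative assumptions, and in practice the same parameters must be reused for the optical flow, grayscale and depth samples of one gesture:

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augment(seq: np.ndarray, rng: np.random.Generator) -> list:
    """Return augmented copies of a (32, H, W) sequence sample."""
    _, h, w = seq.shape
    out = []
    # Translation: dx in [-0.1*w, 0.1*w], dy in [-0.1*h, 0.1*h].
    dx = int(rng.integers(int(-0.1 * w), int(0.1 * w) + 1))
    dy = int(rng.integers(int(-0.1 * h), int(0.1 * h) + 1))
    out.append(shift(seq, (0, dy, dx), order=0, mode='constant'))
    # Flips: horizontal and vertical mirror.
    out.append(seq[:, :, ::-1].copy())
    out.append(seq[:, ::-1, :].copy())
    # Additive white Gaussian noise with mean 0 and variance 0.1.
    out.append(seq + rng.normal(0.0, np.sqrt(0.1), seq.shape))
    # Rotations by the set angles (0 degrees is the identity, so skipped).
    for angle in (45, 90, 135, 180, 225, 270, 315):
        out.append(rotate(seq, angle, axes=(1, 2), reshape=False, order=1))
    return out

samples = augment(np.random.rand(32, 112, 112), np.random.default_rng(0))
```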
In the neural network training step, the grayscale, optical flow and depth sequence samples corresponding to the same gesture are fed into the same network structure, and three neural networks are trained separately to classify gestures. Specifically, the optical flow sequence samples train a first neural network, the grayscale sequence samples train a second neural network, and the depth sequence samples train a third neural network.
In the model integration step, the gesture classification scores produced by the three networks for a sequence sample are multiplied element-wise, and the sample is assigned to the gesture class with the highest fused score.
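In code, the integration step reduces to an element-wise product followed by an argmax; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def fuse_and_classify(p_flow: np.ndarray, p_gray: np.ndarray,
                      p_depth: np.ndarray) -> int:
    """Multiply the three networks' 10-way softmax scores element-wise
    and return the index of the gesture class with the highest score."""
    fused = p_flow * p_gray * p_depth
    return int(np.argmax(fused))
```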
As shown in FIG. 2, the neural network consists of a 3D convolutional neural network, an SPP layer and fully-connected layers: the 3D convolutional neural network extracts the spatio-temporal features of the gesture, the SPP layer then extracts global and local features, and two fully-connected layers plus softmax produce the gesture classification scores.
The 3D convolutional neural network comprises 5 convolution layers; each convolution layer consists of a convolution operation followed by a pooling operation, the convolution operations using 3×3×3 kernels with stride 1×1×1.
The first convolution operation C1, the second convolution operation C2 and the third convolution operation C3 comprise 64, 128 and 256 kernels respectively, each followed by a BN layer and a ReLU activation; the first pooling operation P1 uses a 1×2×2 window with stride 2×2×2, and the second and third pooling operations P2 and P3 use 2×2×2 windows with stride 2×2×2.
The fourth convolution operation C4 and the fifth convolution operation C5 each comprise 512 kernels; the fourth and fifth pooling operations P4 and P5 use 2×2×2 windows with stride 2×1×1; and the pooling operations P1 to P5 all use mean pooling.
The SPP network applies spatial pyramid pooling at different scales to the feature maps produced by the 3D convolutional neural network, yielding a (16+4+1)×512-dimensional feature vector. This vector is fed into two fully-connected layers of 1024 neurons each, and the result is passed to a softmax layer to obtain the scores of the 10 gesture classes.
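A PyTorch sketch of this architecture follows. It is a reconstruction under assumptions: the translated text is ambiguous about the third dimension of the pooling windows and strides, so the values below follow the A×B×B pattern used elsewhere in the document; single-channel input is assumed (an optical-flow network would use more channels); and the pyramid collapses time via adaptive pooling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv3DSPPNet(nn.Module):
    """5-layer 3D CNN + spatial pyramid pooling + 2 FC layers + softmax."""

    def __init__(self, num_classes: int = 10, in_channels: int = 1):
        super().__init__()
        chans = [in_channels, 64, 128, 256, 512, 512]
        # (pooling window, pooling stride) per layer; reconstructed values.
        pools = [((1, 2, 2), (2, 2, 2)),
                 ((2, 2, 2), (2, 2, 2)),
                 ((2, 2, 2), (2, 2, 2)),
                 ((2, 2, 2), (2, 1, 1)),
                 ((2, 2, 2), (2, 1, 1))]
        layers = []
        for (cin, cout), (win, stride) in zip(zip(chans, chans[1:]), pools):
            layers += [nn.Conv3d(cin, cout, kernel_size=3, stride=1, padding=1),
                       nn.BatchNorm3d(cout), nn.ReLU(inplace=True),
                       nn.AvgPool3d(win, stride)]      # mean pooling
        self.features = nn.Sequential(*layers)
        self.fc = nn.Sequential(
            nn.Linear((16 + 4 + 1) * 512, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes))

    def spp(self, x: torch.Tensor) -> torch.Tensor:
        # Pyramid of 4x4, 2x2 and 1x1 spatial bins with time collapsed,
        # giving (16 + 4 + 1) * 512 features per sample.
        feats = [F.adaptive_avg_pool3d(x, (1, k, k)).flatten(1)
                 for k in (4, 2, 1)]
        return torch.cat(feats, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)            # (N, 512, T', H', W')
        x = self.spp(x)                 # (N, 21 * 512)
        return F.softmax(self.fc(x), dim=1)

# One sample: batch 1, 1 channel, 32 frames of 112x112 -> 10 class scores.
scores = Conv3DSPPNet()(torch.randn(1, 1, 32, 112, 112))
assert scores.shape == (1, 10)
```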
In summary, this embodiment discloses a multi-modal dynamic gesture recognition method based on 3D convolution and SPP. Data augmentation by translation, flipping, noise addition and affine transformation improves the generalization ability of the gesture classification model. Feeding the sequence samples into a 3D convolutional neural network extracts spatio-temporal features simultaneously, and the SPP network extracts local and global features, achieving dynamic gesture recognition with high accuracy. In addition, the method takes multi-modal sequence samples as input, trains three gesture classifiers separately, and improves the recognition accuracy of the gesture recognition system through model integration.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention is an equivalent replacement and falls within the protection scope of the present invention.

Claims (5)

1. A multi-modal dynamic gesture recognition method based on 3D convolution and SPP, characterized by comprising the following steps:
a data preprocessing step: extracting optical flow features and grayscale features from RGB video sequences to obtain optical flow sequence samples and grayscale sequence samples respectively, and normalizing each optical flow, grayscale and depth sequence sample to 32 frames so that each sample has dimension 32×112×112, specifically: extracting optical flow features from the 1080 RGB video sequences in the SKIG dataset using the iDT algorithm to obtain 1080 optical flow sequence samples; converting each frame of the RGB video sequences to grayscale to obtain 1080 grayscale sequence samples; and, since different gesture sequence samples have different durations, normalizing each sequence sample to a fixed 32 frames by nearest-neighbour frame repetition or frame dropping, each frame having dimension 112×112, as the input to the neural network;
wherein the iDT algorithm is as follows: the iDT algorithm assumes that the relationship between two adjacent frames is described by a projective transformation matrix, the later frame being obtained by projectively transforming the earlier frame; feature matching between adjacent frames combines SURF features with dense optical flow, and the projective transformation matrix is estimated with the RANSAC algorithm;
a data augmentation step: enlarging the sequence sample dataset by translation, flipping, noise addition and affine transformation;
a neural network training step: feeding the grayscale, optical flow and depth sequence samples into the same network structure and training three networks separately to classify gestures, specifically: feeding the grayscale, optical flow and depth sequence samples corresponding to the same gesture into the same network structure and training three neural networks separately, the optical flow sequence samples training a first neural network, the grayscale sequence samples training a second neural network, and the depth sequence samples training a third neural network; each neural network consisting of a 3D convolutional neural network, an SPP layer and fully-connected layers, the 3D convolutional neural network extracting the spatio-temporal features of the gesture, the SPP layer then extracting global and local features, and two fully-connected layers plus softmax producing the gesture classification scores;
and a model integration step: fusing the classification results of the three networks on a sequence sample to obtain the final classification result.
2. The multi-modal dynamic gesture recognition method based on 3D convolution and SPP of claim 1, characterized in that the data augmentation step is performed as follows:
the optical flow, grayscale and depth sequence samples corresponding to the same gesture are transformed in the same way, the transformations comprising:
a translation operation: the pixel (x, y) on each channel of each sequence sample is translated by Δx units along the x-axis and Δy units along the y-axis, i.e. (x', y') = (x+Δx, y+Δy), wherein Δx is any integer in [−0.1×w, 0.1×w], Δy is any integer in [−0.1×h, 0.1×h], w is the width of each frame image, and h is the height of each frame image;
a flipping operation: the data of each channel of each sequence sample is mirrored horizontally and mirrored vertically;
a noise-addition operation: white Gaussian noise following a Gaussian distribution with mean 0 and variance 0.1 is added to the data of each channel of each sequence sample;
an affine transformation: the data of each channel of each sequence sample is rotated by a set angle from 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°.
3. The multi-modal dynamic gesture recognition method based on 3D convolution and SPP of claim 1, characterized in that the 3D convolutional neural network comprises 5 convolution layers;
each convolution layer comprises a convolution operation followed by a pooling operation, the convolution operations using 3×3×3 kernels with stride 1×1×1;
the first, second and third convolution operations comprise 64, 128 and 256 kernels respectively, each followed by a BN layer and a ReLU activation; the first pooling operation uses a 1×2×2 window with stride 2×2×2, and the second and third pooling operations use 2×2×2 windows with stride 2×2×2;
the fourth and fifth convolution operations each comprise 512 kernels; the fourth and fifth pooling operations use 2×2×2 windows with stride 2×1×1; and the first to fifth pooling operations all use mean pooling.
4. The multi-modal dynamic gesture recognition method based on 3D convolution and SPP of claim 1, characterized in that the SPP network applies spatial pyramid pooling at different scales to the feature maps produced by the 3D convolutional neural network to obtain a (16+4+1)×512-dimensional feature vector, the feature vector is fed into two fully-connected layers of 1024 neurons each, and the result is passed to a softmax layer to obtain the scores of the 10 gesture classes.
5. The multi-modal dynamic gesture recognition method based on 3D convolution and SPP of claim 1, characterized in that the model integration multiplies element-wise the gesture classification scores produced by the three networks for a sequence sample and assigns the sample to the gesture class with the highest score.
CN201911423353.1A 2019-12-31 2019-12-31 Multi-modal dynamic gesture recognition method based on 3D convolution and SPP Active CN111104929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911423353.1A CN111104929B (en) 2019-12-31 2019-12-31 Multi-modal dynamic gesture recognition method based on 3D convolution and SPP


Publications (2)

Publication Number | Publication Date
CN111104929A (en) | 2020-05-05
CN111104929B (en) | 2023-05-09

Family

ID=70426599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911423353.1A Active CN111104929B (en) 2019-12-31 2019-12-31 Multi-modal dynamic gesture recognition method based on 3D convolution and SPP

Country Status (1)

Country Link
CN (1) CN111104929B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239824B (en) * 2021-05-19 2024-04-05 北京工业大学 Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8873813B2 (en) * 2012-09-17 2014-10-28 Z Advanced Computing, Inc. Application of Z-webs and Z-factors to analytics, search engine, learning, recognition, natural language, and other utilities
US11074495B2 (en) * 2013-02-28 2021-07-27 Z Advanced Computing, Inc. (Zac) System and method for extremely efficient image and pattern recognition and artificial intelligence platform

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107180226A (en) * 2017-04-28 2017-09-19 华南理工大学 A kind of dynamic gesture identification method based on combination neural net
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN109871781A (en) * 2019-01-28 2019-06-11 山东大学 Dynamic gesture identification method and system based on multi-modal 3D convolutional neural networks
CN109919057A (en) * 2019-02-26 2019-06-21 北京理工大学 A kind of multi-modal fusion gesture identification method based on efficient convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
冯翔; 吴瀚; 司冰灵; 季超. Gesture image recognition method based on a convolutional neural network with an embedded-network fusion structure. Biomedical Engineering Research, 2019(04): 410-414, 425. *
曹钰. A review of deep convolutional neural networks based on region information. Electronics World, 2017(06): 32, 36. *

Also Published As

Publication number Publication date
CN111104929A (en) 2020-05-05

Similar Documents

Publication Publication Date Title
Taskiran et al. A real-time system for recognition of American sign language by using deep learning
CN107038448B (en) Target detection model construction method
Yeh et al. Multi-scale deep residual learning-based single image haze removal via image decomposition
CN108229490B (en) Key point detection method, neural network training method, device and electronic equipment
CN110796080B (en) Multi-pose pedestrian image synthesis algorithm based on generation countermeasure network
WO2019071433A1 (en) Method, system and apparatus for pattern recognition
CN110766020A (en) System and method for detecting and identifying multi-language natural scene text
Cao et al. Robust vehicle detection by combining deep features with exemplar classification
US20210256707A1 (en) Learning to Segment via Cut-and-Paste
Liang et al. MAFNet: Multi-style attention fusion network for salient object detection
CN110969089A (en) Lightweight face recognition system and recognition method under noise environment
CN114821764A (en) Gesture image recognition method and system based on KCF tracking detection
Avola et al. 3D hand pose and shape estimation from RGB images for keypoint-based hand gesture recognition
CN111260577B (en) Face image restoration system based on multi-guide image and self-adaptive feature fusion
Mali et al. Indian sign language recognition using SVM classifier
CN111104929B (en) Multi-modal dynamic gesture recognition method based on 3D convolution and SPP
Zhang et al. Infrared ship target segmentation based on adversarial domain adaptation
Rao et al. Sign Language Recognition using LSTM and Media Pipe
CN116935044A (en) Endoscopic polyp segmentation method with multi-scale guidance and multi-level supervision
Ma et al. LAYN: Lightweight Multi-Scale Attention YOLOv8 Network for Small Object Detection
Kralevska et al. Real-time Macedonian Sign Language Recognition System by using Transfer Learning
Assaleh et al. Recognition of handwritten Arabic alphabet via hand motion tracking
CN111160078B (en) Human interaction behavior recognition method, system and device based on video image
Kumar et al. Computer vision based Hand gesture recognition system
Deshpande et al. Sign Language Recognition System using CNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant