CN111104929A - Multi-modal dynamic gesture recognition method based on 3D convolution and spatial pyramid pooling (SPP) - Google Patents

Multi-modal dynamic gesture recognition method based on 3D convolution and spatial pyramid pooling (SPP)

Info

Publication number
CN111104929A
CN111104929A (application CN201911423353.1A)
Authority
CN
China
Prior art keywords
sequence
sequence sample
sample
optical flow
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911423353.1A
Other languages
Chinese (zh)
Other versions
CN111104929B (en)
Inventor
彭永坚
汪壮雄
许冰媛
周智恒
彭明
朱湘军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Video Star Intelligent Technology Co ltd
South China University of Technology SCUT
Original Assignee
Guangzhou Video Star Intelligent Technology Co ltd
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Video Star Intelligent Technology Co ltd, South China University of Technology SCUT filed Critical Guangzhou Video Star Intelligent Technology Co ltd
Priority to CN201911423353.1A priority Critical patent/CN111104929B/en
Publication of CN111104929A publication Critical patent/CN111104929A/en
Application granted granted Critical
Publication of CN111104929B publication Critical patent/CN111104929B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal dynamic gesture recognition method based on 3D convolution and SPP, which comprises the following steps: data preprocessing, in which optical flow features and grayscale features are extracted from the RGB video sequences to obtain optical flow sequence samples and grayscale sequence samples respectively, and each optical flow, grayscale and depth sequence sample is normalized to 32 frames, so that the dimensionality of each sample is 32 × 112 × 112; data enhancement, in which the sequence sample data set is enlarged through translation, flipping, noise addition and affine transformation; neural network training, in which the grayscale, optical flow and depth sequence samples are fed into the same network structure and three networks are trained separately to classify gestures; and model integration, in which the three networks' classification results for a sequence sample are fused to obtain the final decision. The technical scheme of the invention can improve the accuracy of gesture recognition.

Description

Multi-modal dynamic gesture recognition method based on 3D convolution and SPP
Technical Field
The invention relates to the technical field of image recognition, and in particular to a multi-modal dynamic gesture recognition method based on 3D convolution and SPP.
Background
Gestures are one of the important modes of human-computer interaction, and gesture recognition uses a computer to recognize the gesture actions a person makes. Gesture recognition comprises static gesture recognition and dynamic gesture recognition. Static gesture recognition focuses on the hand shape in a single frame and is relatively simple. Dynamic gesture recognition focuses not only on hand shape but also on the trajectory and shape changes of the gesture in the spatiotemporal dimensions. Because dynamic gestures are diverse and vary across people, their recognition accuracy is still low, making this a challenging research direction in the field of artificial intelligence.
With the development of deep learning, dynamic gesture recognition using deep convolutional neural networks has attracted the attention of researchers. However, when an ordinary 2D convolutional neural network processes a video image sequence, information about the target in the time dimension is easily lost, and the target's variation in the spatiotemporal dimensions cannot be extracted effectively, which harms the network's recognition accuracy. Feature learning over the video's spatiotemporal dimensions is therefore the key to dynamic gesture recognition.
Disclosure of Invention
In order to solve the above technical problem, an embodiment of the present invention provides a multi-modal dynamic gesture recognition method based on 3D convolution and SPP, comprising:
a data preprocessing step, in which optical flow features and grayscale features are extracted from the RGB video sequences to obtain optical flow sequence samples and grayscale sequence samples respectively, and each optical flow, grayscale and depth sequence sample is normalized to 32 frames, so that the dimensionality of each sample is 32 × 112 × 112;
a data enhancement step, in which the sequence sample data set is enlarged through translation, flipping, noise addition and affine transformation;
a neural network training step, in which the grayscale sequence samples, optical flow sequence samples and depth sequence samples are fed into the same network structure and three networks are trained separately to classify gestures;
and a model integration step, in which the three networks' classification results for a sequence sample are fused to obtain the final decision.
Preferably, the data preprocessing step proceeds as follows:
optical flow features are extracted from the 1080 RGB video sequences in the SKIG data set using the improved dense trajectories (iDT) algorithm, yielding 1080 optical flow sequence samples;
each frame of the RGB video sequences is converted to grayscale, yielding 1080 grayscale sequence samples;
because different gesture sequence samples have different durations, each sequence sample is normalized to a fixed 32 frames by repeating or discarding nearest-neighbour frames, and each frame is resized to 112 × 112 to serve as the input to the neural network.
Preferably, the iDT algorithm is as follows:
the iDT algorithm assumes that the relationship between two adjacent frames can be described by a projective transformation matrix, so that the later frame is obtained from the earlier frame by a projective transformation;
feature matching between two adjacent frames is performed using SURF features and dense optical flow, and the projective transformation matrix is estimated with the RANSAC algorithm.
Preferably, the data enhancement step process is as follows:
and carrying out transformation in the same way on the optical flow sequence sample, the gray level sequence sample and the depth sequence sample corresponding to the same gesture, wherein the transformation way comprises the following steps:
the shift operation is to shift the pixel point (x, y) on each channel of each sequence sample by Δ x units along the x-axis and Δ y units along the y-axis, i.e., (x ', y') (x + Δ x, y + Δ y). Wherein Δ x is any integer of [ -0.1 × w,0.1 × w ], Δ y is any integer of [ -0.1 × h,0.1 × h ], w is the corresponding width of each frame of image, and h is the corresponding length of each frame of image;
the turning operation comprises the following steps of carrying out mirror image horizontal turning and mirror image up-down turning on the data of each channel of each sequence sample;
adding white Gaussian noise to the data of each channel of each sequence sample, wherein the added noise follows Gaussian distribution with the mean value of 0 and the variance of 0.1;
the affine transformation operates by performing a set angular rotation of the data for each channel of each sequence sample, including 0 °, 45 °, 90 °, 135 °, 180 °, 225 °, 270 °, 315 °.
Preferably, the neural network training step process is as follows:
the grayscale sequence sample, optical flow sequence sample and depth sequence sample corresponding to the same gesture are fed into the same network structure, and three neural networks are trained separately to classify the gesture; specifically, a first neural network is trained on the optical flow sequence samples, a second neural network on the grayscale sequence samples, and a third neural network on the depth sequence samples;
the neural network consists of a 3D convolutional neural network, an SPP module and fully connected layers; the 3D convolutional neural network extracts the spatial and temporal features of the gesture simultaneously, the SPP module then extracts global and local features, and the gesture classification scores are obtained by passing the result through two fully connected layers and a softmax.
Preferably, the 3D convolutional neural network comprises 5 convolutional layers;
each convolutional layer comprises a convolution operation and a pooling operation; the convolution operations use 3 × 3 × 3 kernels with stride 1 × 1 × 1;
the first, second and third convolution operations comprise 64, 128 and 256 convolution kernels respectively, each followed by a BN layer and a ReLU activation function; the pooling window of the first pooling operation is 1 × 2 × 2 with stride 2 × 2 × 2, and the pooling windows of the second and third pooling operations are 2 × 2 × 2 with stride 2 × 2 × 2;
the fourth and fifth convolution operations both comprise 512 convolution kernels; the pooling windows of the fourth and fifth pooling operations are 2 × 2 × 2 with stride 2 × 1 × 1; the first through fifth pooling operations all use average (mean) pooling.
As a preferred scheme, the SPP network performs spatial pyramid pooling at different scales on the feature map produced by the 3D convolutional neural network to obtain a (16+4+1) × 512-dimensional feature vector; this vector is fed into two fully connected layers with 1024 neurons, and the result is input to the softmax layer to obtain the scores of the 10 gesture classes.
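The pyramid pooling step can be sketched as follows in PyTorch. The 4 × 4, 2 × 2 and 1 × 1 bin sizes are inferred from the (16+4+1) × 512 figure above, and averaging over the remaining time axis before pooling is an assumption of this sketch rather than something stated here.

```python
import torch
from torch import nn
import torch.nn.functional as F

class SpatialPyramidPool(nn.Module):
    """Pool the final 3D feature map on 4x4, 2x2 and 1x1 spatial grids."""

    def __init__(self, bins=(4, 2, 1)):
        super().__init__()
        self.bins = bins

    def forward(self, x):               # x: (B, 512, T, H, W)
        x = x.mean(dim=2)               # collapse the residual time axis (assumption)
        # average pooling at each pyramid scale, then flatten and concatenate
        feats = [F.adaptive_avg_pool2d(x, b).flatten(1) for b in self.bins]
        return torch.cat(feats, dim=1)  # (B, (16 + 4 + 1) * 512)
```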
As a preferred scheme, the model integration multiplies, class by class, the gesture classification scores produced by the three networks for a sequence sample and assigns the sample to the gesture class with the highest combined score.
Compared with the prior art, the invention has the following advantages and effects:
(1) the invention performs data augmentation using translation, flipping, noise addition and affine transformation, which improves the generalization ability of the gesture classification model;
(2) the invention feeds sequence samples into a 3D convolutional neural network to extract spatial and temporal features simultaneously, and uses the SPP network to extract local and global features, achieving high-accuracy dynamic gesture recognition;
(3) the method takes multi-modal sequence samples as input, trains three gesture classifiers separately, and improves the recognition accuracy of the gesture recognition system through model integration.
Drawings
FIG. 1 is a general flow diagram of the disclosed multi-modal dynamic gesture recognition method based on 3D convolution and SPP;
FIG. 2 is a schematic diagram of the neural network structure in the multi-modal dynamic gesture recognition method based on 3D convolution and SPP.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The first embodiment is as follows:
the data set, skeg, used in this embodiment contains 2160 gesture video sequences, among which 1080 RGB video sequences and 1080 depth video sequences, all of which are captured by the Kinect sensor at the same time, including 10 types of gestures.
As shown in FIG. 1, the multi-modal dynamic gesture recognition method based on 3D convolution and SPP comprises the following steps in order: data preprocessing, data enhancement, neural network training and model integration.
In the data preprocessing step, optical flow features are extracted from the 1080 RGB video sequences in the SKIG data set using the iDT algorithm, yielding 1080 optical flow sequence samples. Each frame of the RGB video sequences is converted to grayscale, yielding 1080 grayscale sequence samples. Because different gesture sequence samples have different durations, each sequence sample is normalized to a fixed 32 frames by repeating or discarding nearest-neighbour frames, and each frame is resized to 112 × 112, so that each sequence sample has dimensions 32 × 112 × 112 and serves as the input to the neural network.
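A minimal sketch of this preprocessing in Python with OpenCV and NumPy is given below. The nearest-neighbour resampling indices implement the repeat-or-discard rule; reading frames as BGR images (OpenCV's default) is an assumption of the sketch.

```python
import numpy as np
import cv2  # OpenCV, assumed available for graying and resizing

TARGET_FRAMES = 32
TARGET_SIZE = (112, 112)  # (width, height)

def normalize_sequence(frames):
    """Resample a list of RGB frames to 32 grayscale frames of 112 x 112.

    Nearest-neighbour resampling over the time axis repeats frames of
    short clips and drops frames of long clips, matching the
    repeat-or-discard rule described above.
    """
    t = len(frames)
    # nearest-neighbour indices into the original sequence
    idx = np.round(np.linspace(0, t - 1, TARGET_FRAMES)).astype(int)
    out = np.empty((TARGET_FRAMES, TARGET_SIZE[1], TARGET_SIZE[0]), dtype=np.float32)
    for i, j in enumerate(idx):
        gray = cv2.cvtColor(frames[j], cv2.COLOR_BGR2GRAY)
        out[i] = cv2.resize(gray, TARGET_SIZE)
    return out  # shape (32, 112, 112)
```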
The iDT algorithm assumes that the relationship between two adjacent frames can be described by a projective transformation matrix, so that the later frame can be obtained from the earlier frame by a projective transformation; this assumption is reasonable because the change between two adjacent frames is small. Feature matching between two adjacent frames is performed using SURF features and dense optical flow, and the projective transformation matrix is estimated with the RANSAC algorithm.
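The following sketch illustrates the idea of camera-motion compensation before computing dense optical flow. ORB keypoints and Farneback flow stand in for SURF and the full iDT pipeline (SURF is only available in opencv-contrib), so this should be read as an illustrative approximation rather than the exact iDT implementation.

```python
import cv2
import numpy as np

def compensated_flow(prev_gray, next_gray):
    """Rough sketch of camera-motion-compensated dense optical flow."""
    orb = cv2.ORB_create(1000)
    k1, d1 = orb.detectAndCompute(prev_gray, None)
    k2, d2 = orb.detectAndCompute(next_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # projective transformation estimated robustly with RANSAC
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    # warp the previous frame so that camera motion is cancelled out
    h, w = prev_gray.shape
    prev_warped = cv2.warpPerspective(prev_gray, H, (w, h))
    # dense Farneback flow on the stabilized pair
    return cv2.calcOpticalFlowFarneback(prev_warped, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```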
In the data enhancement step, the optical flow, grayscale and depth sequence samples corresponding to the same gesture are transformed in the same way to enlarge the sequence sample data set; the transformations are as follows (a code sketch follows the affine transformation description below):
the translation operation is as follows:
the pixel point (x, y) on each channel of each sequence sample is shifted by Δ x units along the x-axis and Δ y units along the y-axis, i.e., (x ', y') (x + Δ x, y + Δ y). Where Δ x is any integer of [ -0.1 × w,0.1 × w ], Δ y is any integer of [ -0.1 × h,0.1 × h ], w is the corresponding width of each frame of image, and h is the corresponding length of each frame of image.
The flipping operation is as follows:
The data of each channel of each sequence sample are mirrored horizontally and mirrored vertically.
The noise addition operation is as follows:
White Gaussian noise is added to the data of each channel of each sequence sample; the added noise follows a Gaussian distribution with mean 0 and variance 0.1.
The affine transformation operation is as follows:
The data of each channel of each sequence sample are rotated by a set angle chosen from 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°.
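A sketch of these augmentations, applied with identical parameters to the grayscale, optical flow and depth sequences of one gesture, is shown below (Python with OpenCV and NumPy). Each transform produces a separate augmented copy; the per-sequence array layout (T, H, W) and the independent noise draw per modality are assumptions of the sketch.

```python
import numpy as np
import cv2

ANGLES = (45, 90, 135, 180, 225, 270, 315)   # 0 deg corresponds to the original clip

def translate(seq, dx, dy):
    """Shift every frame of a (T, H, W) sequence by (dx, dy) pixels."""
    t, h, w = seq.shape
    m = np.float32([[1, 0, dx], [0, 1, dy]])
    return np.stack([cv2.warpAffine(f, m, (w, h)) for f in seq.astype(np.float32)])

def flip(seq, axis):
    """axis=1: horizontal mirror, axis=0: vertical mirror."""
    return np.stack([cv2.flip(f, axis) for f in seq.astype(np.float32)])

def add_noise(seq, rng):
    """Additive white Gaussian noise with mean 0 and variance 0.1."""
    return seq.astype(np.float32) + rng.normal(0.0, np.sqrt(0.1), seq.shape).astype(np.float32)

def rotate(seq, angle):
    """Rotate every frame about the image centre by a fixed angle."""
    t, h, w = seq.shape
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    return np.stack([cv2.warpAffine(f, m, (w, h)) for f in seq.astype(np.float32)])

def augment(gray, flow, depth, rng=None):
    """Yield augmented (gray, flow, depth) triples with shared parameters."""
    rng = rng or np.random.default_rng(0)
    t, h, w = gray.shape
    dx = int(rng.integers(int(-0.1 * w), int(0.1 * w) + 1))
    dy = int(rng.integers(int(-0.1 * h), int(0.1 * h) + 1))
    transforms = [lambda s: translate(s, dx, dy),
                  lambda s: flip(s, 1),
                  lambda s: flip(s, 0),
                  lambda s: add_noise(s, rng)]
    transforms += [lambda s, a=a: rotate(s, a) for a in ANGLES]
    for fn in transforms:
        yield fn(gray), fn(flow), fn(depth)
```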
In the neural network training step, the grayscale sequence sample, optical flow sequence sample and depth sequence sample corresponding to the same gesture are fed into the same network structure, and three neural networks are trained separately to classify the gesture. Specifically, a first neural network is trained on the optical flow sequence samples, a second neural network on the grayscale sequence samples, and a third neural network on the depth sequence samples.
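A sketch of the per-modality training loop in PyTorch is shown below. The optimizer, learning rate and epoch count are illustrative assumptions, and the network is assumed to return raw class logits (softmax being applied only at inference, as in the integration step that follows); the architecture itself is sketched after the network description below.

```python
import torch
from torch import nn, optim

def train_modality_net(make_net, loader, epochs=30, lr=1e-3, device="cpu"):
    """Train one classifier on a single modality (grayscale, flow or depth).

    `make_net` builds the shared 3D-CNN + SPP architecture; `loader`
    yields (clip, label) batches with clip shaped (B, 1, 32, 112, 112).
    Hyperparameters here are illustrative assumptions, not values from
    this description.
    """
    net = make_net().to(device)
    opt = optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for clip, label in loader:
            clip, label = clip.to(device), label.to(device)
            opt.zero_grad()
            loss = loss_fn(net(clip), label)
            loss.backward()
            opt.step()
    return net

# one network per modality, identical structure, separate weights:
# nets = [train_modality_net(make_net, ld) for ld in (flow_loader, gray_loader, depth_loader)]
```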
In the model integration step, the gesture classification scores produced by the three networks for a sequence sample are multiplied class by class, and the sample is assigned to the gesture class with the highest combined score.
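The score fusion itself can be sketched as follows, assuming each trained network returns logits for a single preprocessed clip of its own modality and that networks and clips live on the same device.

```python
import torch

def ensemble_predict(nets, clips, num_classes=10):
    """Multiply the per-class softmax scores of the three modality
    networks and return the index of the highest-scoring gesture class."""
    scores = torch.ones(1, num_classes)
    for net, clip in zip(nets, clips):            # e.g. (flow_net, gray_net, depth_net)
        net.eval()
        with torch.no_grad():
            scores = scores * torch.softmax(net(clip), dim=1).cpu()
    return int(scores.argmax(dim=1))
```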
As shown in FIG. 2, the neural network consists of a 3D convolutional neural network, an SPP module and fully connected layers; the 3D convolutional neural network extracts the spatial and temporal features of the gesture simultaneously, the SPP module then extracts global and local features, and the gesture classification scores are obtained by passing the result through two fully connected layers and a softmax.
The 3D convolutional neural network comprises 5 convolutional layers; each convolutional layer comprises a convolution operation and a pooling operation, and the convolution operations use 3 × 3 × 3 kernels with stride 1 × 1 × 1.
The first convolution operation C1, second convolution operation C2 and third convolution operation C3 comprise 64, 128 and 256 convolution kernels respectively, each followed by a BN layer and a ReLU activation function; the pooling window of the first pooling operation P1 is 1 × 2 × 2 with stride 2 × 2 × 2, and the pooling windows of the second pooling operation P2 and third pooling operation P3 are both 2 × 2 × 2 with stride 2 × 2 × 2;
the fourth convolution operation C4 and the fifth convolution operation C5 each contain 512 convolution kernels, and the pooling windows of the fourth pooling operation P4 and the fifth pooling operation P5 are 2 × 2 × 2 and the step size is 2 × 1 × 1, wherein the first pooling operation P1, the second pooling operation P2, the third pooling operation P3, the fourth pooling operation P4 and the fifth pooling operation P5 all employ a mean pooling method.
The SPP network performs spatial pyramid pooling at different scales on the feature map produced by the 3D convolutional neural network to obtain a (16+4+1) × 512-dimensional feature vector. This vector is fed into two fully connected layers with 1024 neurons, and the result is input into a softmax layer to obtain the scores of the 10 gesture classes.
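Putting the backbone, the pyramid pooling and the classifier head together, a runnable PyTorch sketch of the whole network is given below. The convolution padding, the absence of BN after C4/C5, the exact shape of the fully connected head and the 4 × 4 / 2 × 2 / 1 × 1 pyramid bins are assumptions made to fill gaps in the description; forward() returns logits, with softmax applied at inference.

```python
import torch
from torch import nn
import torch.nn.functional as F

def conv_block(c_in, c_out, pool_k, pool_s, use_bn=True):
    """3x3x3 conv (stride 1, padding 1 assumed), optional BN, ReLU, mean pooling."""
    layers = [nn.Conv3d(c_in, c_out, kernel_size=3, stride=1, padding=1)]
    if use_bn:
        layers.append(nn.BatchNorm3d(c_out))
    layers += [nn.ReLU(inplace=True), nn.AvgPool3d(kernel_size=pool_k, stride=pool_s)]
    return nn.Sequential(*layers)

class GestureNet3D(nn.Module):
    """Sketch of the 3D-CNN + SPP gesture classifier described above."""

    def __init__(self, in_channels=1, num_classes=10, bins=(4, 2, 1)):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(in_channels, 64, (1, 2, 2), (2, 2, 2)),                 # C1 / P1
            conv_block(64, 128, (2, 2, 2), (2, 2, 2)),                         # C2 / P2
            conv_block(128, 256, (2, 2, 2), (2, 2, 2)),                        # C3 / P3
            conv_block(256, 512, (2, 2, 2), (2, 1, 1), use_bn=False),          # C4 / P4
            conv_block(512, 512, (2, 2, 2), (2, 1, 1), use_bn=False),          # C5 / P5
        )
        self.bins = bins
        spp_dim = sum(b * b for b in bins) * 512   # (16 + 4 + 1) * 512
        self.fc1 = nn.Linear(spp_dim, 1024)
        self.fc2 = nn.Linear(1024, 1024)
        self.out = nn.Linear(1024, num_classes)

    def forward(self, x):                          # x: (B, 1, 32, 112, 112)
        x = self.features(x)                       # -> (B, 512, T', H', W')
        x = x.mean(dim=2)                          # collapse the residual time axis
        x = torch.cat([F.adaptive_avg_pool2d(x, b).flatten(1) for b in self.bins], dim=1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.out(x)                         # logits; softmax at inference

# quick shape check with one grayscale clip:
# logits = GestureNet3D()(torch.randn(1, 1, 32, 112, 112))   # -> torch.Size([1, 10])
```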
In summary, this embodiment discloses a multi-modal dynamic gesture recognition method based on 3D convolution and SPP. Data augmentation through translation, flipping, noise addition and affine transformation improves the generalization ability of the gesture classification model. The sequence samples are fed into a 3D convolutional neural network to extract spatial and temporal features simultaneously, and the SPP network extracts local and global features, achieving high-accuracy dynamic gesture recognition. In addition, the method takes multi-modal sequence samples as input, trains three gesture classifiers separately, and improves the recognition accuracy of the gesture recognition system through model integration.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (8)

1. A multi-modal dynamic gesture recognition method based on 3D convolution and SPP, comprising:
a data preprocessing step, in which optical flow features and grayscale features are extracted from the RGB video sequences to obtain optical flow sequence samples and grayscale sequence samples respectively, and each optical flow, grayscale and depth sequence sample is normalized to 32 frames, so that the dimensionality of each sample is 32 × 112 × 112;
a data enhancement step, in which the sequence sample data set is enlarged through translation, flipping, noise addition and affine transformation;
a neural network training step, in which the grayscale sequence samples, optical flow sequence samples and depth sequence samples are fed into the same network structure and three networks are trained separately to classify gestures;
and a model integration step, in which the three networks' classification results for a sequence sample are fused to obtain the final decision.
2. The method of claim 1, wherein the data preprocessing step is performed as follows:
optical flow features are extracted from the 1080 RGB video sequences in the SKIG data set using the iDT algorithm, yielding 1080 optical flow sequence samples;
each frame of the RGB video sequences is converted to grayscale, yielding 1080 grayscale sequence samples;
because different gesture sequence samples have different durations, each sequence sample is normalized to a fixed 32 frames by repeating or discarding nearest-neighbour frames, and each frame is resized to 112 × 112 to serve as the input to the neural network.
3. The method of claim 2, wherein the iDT algorithm is as follows:
the iDT algorithm assumes that the relationship between two adjacent frames can be described by a projective transformation matrix, so that the later frame is obtained from the earlier frame by a projective transformation;
feature matching between two adjacent frames is performed using SURF features and dense optical flow, and the projective transformation matrix is estimated with the RANSAC algorithm.
4. The method of claim 1, wherein the data enhancement step is performed by the following steps:
the optical flow, grayscale and depth sequence samples corresponding to the same gesture are transformed in the same way, where the transformations are as follows:
the translation operation shifts each pixel (x, y) on each channel of each sequence sample by Δx units along the x-axis and Δy units along the y-axis, i.e. (x', y') = (x + Δx, y + Δy), where Δx is any integer in [-0.1w, 0.1w], Δy is any integer in [-0.1h, 0.1h], w is the width of each frame and h is the height of each frame;
the flipping operation mirrors the data of each channel of each sequence sample horizontally and vertically;
the noise addition operation adds white Gaussian noise to the data of each channel of each sequence sample, where the added noise follows a Gaussian distribution with mean 0 and variance 0.1;
the affine transformation operation rotates the data of each channel of each sequence sample by a set angle chosen from 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°.
5. The method of claim 1, wherein the neural network training step is performed as follows:
the grayscale sequence sample, optical flow sequence sample and depth sequence sample corresponding to the same gesture are fed into the same network structure, and three neural networks are trained separately to classify the gesture; specifically, a first neural network is trained on the optical flow sequence samples, a second neural network on the grayscale sequence samples, and a third neural network on the depth sequence samples;
the neural network consists of a 3D convolutional neural network, an SPP module and fully connected layers; the 3D convolutional neural network extracts the spatial and temporal features of the gesture simultaneously, the SPP module then extracts global and local features, and the gesture classification scores are obtained by passing the result through two fully connected layers and a softmax.
6. The method of claim 5, wherein the 3D convolutional neural network comprises 5 convolutional layers;
each convolutional layer comprises a convolution operation and a pooling operation; the convolution operations use 3 × 3 × 3 kernels with stride 1 × 1 × 1;
the first, second and third convolution operations comprise 64, 128 and 256 convolution kernels respectively, each followed by a BN layer and a ReLU activation function; the pooling window of the first pooling operation is 1 × 2 × 2 with stride 2 × 2 × 2, and the pooling windows of the second and third pooling operations are 2 × 2 × 2 with stride 2 × 2 × 2;
the fourth and fifth convolution operations both comprise 512 convolution kernels; the pooling windows of the fourth and fifth pooling operations are 2 × 2 × 2 with stride 2 × 1 × 1; the first through fifth pooling operations all use average (mean) pooling.
7. The multi-modal dynamic gesture recognition method based on 3D convolution and SPP according to claim 5, wherein the SPP network performs spatial pyramid pooling at different scales on the feature map produced by the 3D convolutional neural network to obtain a (16+4+1) × 512-dimensional feature vector; the feature vector is fed into two fully connected layers with 1024 neurons, and the result is input into a softmax layer to obtain the scores of the 10 gesture classes.
8. The method of claim 1, wherein in the model integration the gesture classification scores produced by the three networks for a sequence sample are multiplied class by class, and the sample is assigned to the gesture class with the highest combined score.
CN201911423353.1A 2019-12-31 2019-12-31 Multi-mode dynamic gesture recognition method based on 3D convolution and SPP Active CN111104929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911423353.1A CN111104929B (en) 2019-12-31 2019-12-31 Multi-mode dynamic gesture recognition method based on 3D convolution and SPP

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911423353.1A CN111104929B (en) 2019-12-31 2019-12-31 Multi-mode dynamic gesture recognition method based on 3D convolution and SPP

Publications (2)

Publication Number Publication Date
CN111104929A true CN111104929A (en) 2020-05-05
CN111104929B CN111104929B (en) 2023-05-09

Family

ID=70426599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911423353.1A Active CN111104929B (en) 2019-12-31 2019-12-31 Multi-mode dynamic gesture recognition method based on 3D convolution and SPP

Country Status (1)

Country Link
CN (1) CN111104929B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140079297A1 (en) * 2012-09-17 2014-03-20 Saied Tadayon Application of Z-Webs and Z-factors to Analytics, Search Engine, Learning, Recognition, Natural Language, and Other Utilities
US20180204111A1 (en) * 2013-02-28 2018-07-19 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
CN107180226A (en) * 2017-04-28 2017-09-19 华南理工大学 A kind of dynamic gesture identification method based on combination neural net
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN109871781A (en) * 2019-01-28 2019-06-11 山东大学 Dynamic gesture identification method and system based on multi-modal 3D convolutional neural networks
CN109919057A (en) * 2019-02-26 2019-06-21 北京理工大学 A kind of multi-modal fusion gesture identification method based on efficient convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
冯翔; 吴瀚; 司冰灵; 季超: "基于嵌网融合结构的卷积神经网络手势图像识别方法" [Gesture image recognition method using a convolutional neural network with an embedded-net fusion structure] *
曹钰: "基于区域信息的深度卷积神经网络研究综述" [A survey of deep convolutional neural networks based on region information] *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239824A (en) * 2021-05-19 2021-08-10 北京工业大学 Dynamic gesture recognition method for multi-modal training single-modal test based on 3D-Ghost module
CN113239824B (en) * 2021-05-19 2024-04-05 北京工业大学 Dynamic gesture recognition method for multi-mode training single-mode test based on 3D-Ghost module
CN117711016A (en) * 2023-11-29 2024-03-15 亿慧云智能科技(深圳)股份有限公司 Gesture recognition method and system based on terminal equipment

Also Published As

Publication number Publication date
CN111104929B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN107038448B (en) Target detection model construction method
CN110796080B (en) Multi-pose pedestrian image synthesis algorithm based on generation countermeasure network
CN108229490B (en) Key point detection method, neural network training method, device and electronic equipment
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
US11755889B2 (en) Method, system and apparatus for pattern recognition
Yan et al. Combining the best of convolutional layers and recurrent layers: A hybrid network for semantic segmentation
Chadha et al. iSeeBetter: Spatio-temporal video super-resolution using recurrent generative back-projection networks
CN110969089A (en) Lightweight face recognition system and recognition method under noise environment
CN110399882A (en) A kind of character detecting method based on deformable convolutional neural networks
CN113591719A (en) Method and device for detecting text with any shape in natural scene and training method
CN111104929A (en) Multi-modal dynamic gesture recognition method based on 3D convolution and SPP
JP2017182438A (en) Image processing device, semiconductor device, image recognition device, mobile device, and image processing method
Tarchoun et al. Hand-Crafted Features vs Deep Learning for Pedestrian Detection in Moving Camera.
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN113763417A (en) Target tracking method based on twin network and residual error structure
CN115147932A (en) Static gesture recognition method and system based on deep learning
US11436432B2 (en) Method and apparatus for artificial neural network
CN116977200A (en) Processing method and device of video denoising model, computer equipment and storage medium
CN113628349B (en) AR navigation method, device and readable storage medium based on scene content adaptation
CN115410133A (en) Video dense prediction method and device
CN115115860A (en) Image feature point detection matching network based on deep learning
WO2020196917A1 (en) Image recognition device and image recognition program
Venkatesan et al. Advanced classification using genetic algorithm and image segmentation for Improved FD
Zhou et al. Attentive Multimodal Fusion for Optical and Scene Flow
US20240193866A1 (en) Methods and systems for 3d hand pose estimation from rgb images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant