CN113780140A - Gesture image segmentation and recognition method and device based on deep learning - Google Patents

Gesture image segmentation and recognition method and device based on deep learning

Info

Publication number
CN113780140A
CN113780140A (application CN202111016595.6A)
Authority
CN
China
Prior art keywords
gesture
convolution
convolution kernel
segmentation
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111016595.6A
Other languages
Chinese (zh)
Other versions
CN113780140B (en)
Inventor
崔振超
雷玉
齐静
杨文柱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University
Original Assignee
Hebei University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University filed Critical Hebei University
Priority to CN202111016595.6A priority Critical patent/CN113780140B/en
Publication of CN113780140A publication Critical patent/CN113780140A/en
Application granted granted Critical
Publication of CN113780140B publication Critical patent/CN113780140B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20112 - Image segmentation details
    • G06T2207/20132 - Image cropping
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30196 - Human being; Person
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a gesture image segmentation and recognition method and device based on deep learning. The method first preprocesses the gesture image so that the image size is fixed. Second, under a complex background, a dense segmentation network densely connects hole convolutions with different hole rates to acquire multi-scale gesture information over different receptive fields, improving the accuracy of feature expression. In addition, in order to fuse detail and spatial position information at different levels and improve the segmentation performance of the whole network, the dense segmentation network adopts an encoder-decoder structure, removes redundant background information, and achieves accurate segmentation of the gesture image. Finally, the mask image retaining only the gesture is input into a gesture recognition network and recognized with an improved algorithm. The invention can improve the segmentation performance of gesture images and thereby improve the gesture recognition rate.

Description

Gesture image segmentation and recognition method and device based on deep learning
Technical Field
The invention relates to the field of human-computer interaction and computer vision, in particular to a gesture image segmentation and recognition method and device based on deep learning.
Background
Gesture interaction based on gesture recognition is one of the basic interaction modes in the field of human-computer interaction and a key research direction of machine vision and computer applications. Gesture recognition is widely applied in fields such as unmanned aerial vehicle gimbal control, AR (Augmented Reality) and VR (Virtual Reality), and offers strong advantages in many settings, for example non-contact environments and environments that are noisy or must stay quiet, so increasing the robustness and performance of gesture recognition is of great importance.
At present, gesture interaction methods fall into two main types: those based on sensing devices and those based on vision. For gesture recognition based on sensing equipment, for example, Chinese patent application 201810542738.9 discloses a gesture recognition method and apparatus for improving the accuracy of gesture recognition and reducing misoperation. The method comprises the following steps: when a touch operation is detected, detecting the motion track of the contact point, where the motion track of the contact point represents the gesture controlling the terminal device; detecting the current moving speed of the contact point; and recognizing the gesture according to the current moving speed and the motion track. Chinese patent application 201510552869.1 discloses a 3D gesture recognition method comprising the steps of: S1, physical hardware acquires 3D coordinate data of the user gesture in real time; S2, the physical hardware preprocesses the acquired 3D coordinate data to form feedback data; S3, data processing software identifies the feedback data; and S4, the system outputs the data recognition result. This can effectively relieve the problems that gesture recognition must process a large amount of data, that the process is complex and that software processing efficiency is low, but it needs the support of additional equipment. Therefore, sensor-based gesture recognition requires expensive auxiliary equipment, the interaction mode is not friendly and natural enough, and it is difficult to meet the requirements of actual human-computer interaction.
For vision-based gesture recognition, as in document [1], Wei et al fuse the target detection model SSD into gesture segmentation, but the skin color probability map used there is thresholded, which causes loss of hand detail information. Chinese patent application 201910130815.4 proposes a gesture image segmentation and recognition method based on an improved capsule network, which adopts the improved capsule network in deep learning to detect the hand and generate a binary image for gesture recognition, comprising the following steps: shooting and collecting gesture images under a complex background; constructing and training a U-shaped residual capsule network to obtain a binarized gesture image; locating a rectangular bounding box around the gesture; and constructing and training an improved matrix capsule network to recognize the gesture image. However, with existing vision-based gesture recognition methods, under complex backgrounds and non-uniform illumination the networks converge slowly and the gesture recognition rate is not high.
Most current technical research on gesture recognition for actual human-computer interaction needs additional equipment support. In addition, due to the variability of gestures, the results of hand detection easily contain rich backgrounds, which interferes with gesture recognition and reduces interactivity. Therefore, how to develop an effective gesture recognition technology that is fast and not greatly affected by external illumination and environment is worth studying. A search of related technologies shows that no gesture recognition technology fully meeting these requirements has been found so far.
Disclosure of Invention
The invention aims to provide a gesture image segmentation and recognition method and device based on deep learning, and aims to solve the problem that the existing method is low in recognition rate of gesture images under complex backgrounds.
The invention is realized by the following steps: a gesture image segmentation and recognition method based on deep learning comprises the following steps:
a. carrying out size resetting operation on the input gesture image to fix the size of the image;
b. inputting the gesture image in the step a into the dense segmentation network, training the dense segmentation network, and obtaining a dense segmentation network model after training;
the dense segmentation network includes an encoder and a decoder; the encoder in turn comprises a deep convolutional neural network module and an improved cavity space pyramid pooling module;
the improved cavity space pyramid pooling module comprises a parallel mode and a cascade mode; in a parallel mode, carrying out feature coding on the input feature graph by using different void ratios to acquire multi-scale information of the gesture; in the cascade mode, each layer except the first layer and the second layer connects the output of the parallel mode in series with the output of the previous layer; then, deconvolution with different void ratios is adopted to be connected with the output of the parallel mode from bottom to top;
c. segmenting the gesture image by adopting a trained dense segmentation network model, and performing binarization processing on a segmentation result;
d. inputting the divided binary gesture images into a gesture recognition network, training the gesture recognition network by using the gesture images with different gesture shapes, and obtaining a gesture recognition network model after training;
e. and classifying the gestures in different shapes by adopting the trained gesture recognition network model to realize the recognition of the gesture images.
In step b, in the parallel mode, hole convolutions with hole rates {2^0, 2^1, 2^2, ..., 2^n} are used, i.e. n+1 hole convolutions in total, to perform multi-scale feature extraction on the feature map.
Taking n = 4, the output of the parallel mode is shown as follows:
o_i = H_{k, d_i}(x),  d_i = 2^i,  i = 0, 1, ..., 4
where x represents the input feature map, d represents the array of hole rates {2^0, 2^1, 2^2, ..., 2^4}, H_{k,d}(x) represents a hole convolution with a convolution kernel size of k and a hole rate of d, and o_i represents the outputs of the 5 parallel branches, which from top to bottom are o_0, o_1, o_2, o_3, o_4.
The output of the cascade mode is given by:
p_1 = H_{3,2}(o_1),  p_i = H_{3, 2^i}(o_i ⊕ p_{i-1}),  i = 2, 3
where p_i represents the output of the cascade mode, and ⊕ represents splicing features of different scales on the channel dimension.
Deconvolutions with different hole rates are adopted and connected with the outputs of the parallel mode from bottom to top; the specific formulas of the deconvolution are:
q_1 = DH_{3,8}(o_4 ⊕ p_3),  q_2 = DH_{3,4}(o_3 ⊕ q_1),  q_3 = DH_{3,2}(o_2 ⊕ q_2),  q_4 = DH_{3,2}(o_1 ⊕ q_3)
y = o_0 ⊕ q_4
In the formulas, q_j represents the output after deconvolution, y represents the output of the improved void space pyramid pooling module, and DH_{3,d} represents a deconvolution with a convolution kernel of 3 and a hole rate of d.
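As a concrete illustration of this parallel/cascade/deconvolution wiring, a minimal PyTorch sketch is given below. The 2048-channel input, the 256-channel branches, the 3 × 3 kernels of the parallel branches and the padding choices are assumptions made for the sketch, not values fixed by the text above.

```python
import torch
import torch.nn as nn

class IASPP(nn.Module):
    """Sketch of the improved hole (atrous) spatial pyramid pooling wiring."""
    def __init__(self, in_ch=2048, ch=256):
        super().__init__()
        rates = [1, 2, 4, 8, 16]                             # d = 2^0 ... 2^4
        # parallel branches: o_i = H_{k, d_i}(x)
        self.parallel = nn.ModuleList(
            [nn.Conv2d(in_ch, ch, 3, padding=r, dilation=r) for r in rates])
        # cascade branches: p_1 = H_{3,2}(o_1), p_2 = H_{3,4}(o_2 (+) p_1), p_3 = H_{3,8}(o_3 (+) p_2)
        self.c1 = nn.Conv2d(ch, ch, 3, padding=2, dilation=2)
        self.c2 = nn.Conv2d(2 * ch, ch, 3, padding=4, dilation=4)
        self.c3 = nn.Conv2d(2 * ch, ch, 3, padding=8, dilation=8)
        # bottom-up deconvolutions (TC1 ... TC4): q_1 ... q_4
        self.t1 = nn.ConvTranspose2d(2 * ch, ch, 3, padding=8, dilation=8)
        self.t2 = nn.ConvTranspose2d(2 * ch, ch, 3, padding=4, dilation=4)
        self.t3 = nn.ConvTranspose2d(2 * ch, ch, 3, padding=2, dilation=2)
        self.t4 = nn.ConvTranspose2d(2 * ch, ch, 3, padding=2, dilation=2)

    def forward(self, x):
        o = [branch(x) for branch in self.parallel]          # o_0 ... o_4
        p1 = self.c1(o[1])
        p2 = self.c2(torch.cat([o[2], p1], dim=1))
        p3 = self.c3(torch.cat([o[3], p2], dim=1))
        q1 = self.t1(torch.cat([o[4], p3], dim=1))
        q2 = self.t2(torch.cat([o[3], q1], dim=1))
        q3 = self.t3(torch.cat([o[2], q2], dim=1))
        q4 = self.t4(torch.cat([o[1], q3], dim=1))
        return torch.cat([o[0], q4], dim=1)                  # y = o_0 spliced with q_4
```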
The deep convolutional neural network module includes a 7 × 7 convolutional kernel, a 3 × 3 convolutional kernel, and 4 residual groups. The 4 residual groups are as follows: the first residual group has 3 residual blocks, each residual block has 3 layers, namely a convolution kernel of 1 × 1 × 64, a convolution kernel of 3 × 3 × 64 and a convolution kernel of 1 × 1 × 256, and the residual blocks have 9 layers in total, the void ratio d is 1, and the step length s is 2; the second residual group has 4 residual blocks, each residual block has 3 layers, which are respectively a convolution kernel of 1 × 1 × 128, a convolution kernel of 3 × 3 × 128 and a convolution kernel of 1 × 1 × 512, and the residual blocks have 12 layers, the void ratio d is 1, and the step length s is 1; the third residual group has 6 residual blocks, each residual block has 3 layers, which are respectively a convolution kernel of 1 × 1 × 256, a convolution kernel of 3 × 3 × 256, and a convolution kernel of 1 × 1 × 1024, and has 18 layers, the void ratio d is 2, and the step length s is 1; the fourth residual group has 3 residual blocks, each residual block has 3 layers, which are respectively a convolution kernel of 1 × 1 × 512, a convolution kernel of 3 × 3 × 512, and a convolution kernel of 1 × 1 × 2048, and has 9 layers, the void rate d is 4, and the step length s is 1.
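For illustration, one residual block of the kind listed above could be sketched as follows. The batch normalization and ReLU placement and the 1 × 1 projection shortcut are assumptions borrowed from common ResNet practice, not details spelled out in the text.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Sketch of a 3-layer residual block: 1x1 -> 3x3 (hole rate d, step length s) -> 1x1."""
    def __init__(self, in_ch, mid_ch, out_ch, d=1, s=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=s, padding=d, dilation=d, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.skip = (nn.Identity() if in_ch == out_ch and s == 1
                     else nn.Conv2d(in_ch, out_ch, 1, stride=s, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))

# E.g. the third residual group: 6 blocks of (1x1x256, 3x3x256, 1x1x1024), d = 2, s = 1;
# the first block widens the channels from 512 to 1024.
group3 = nn.Sequential(*[Bottleneck(512 if i == 0 else 1024, 256, 1024, d=2, s=1)
                         for i in range(6)])
```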
In step b, the specific decoding process of the decoder is as follows: performing characteristic splicing on the output result of the improved cavity space pyramid pooling module and the characteristic of the fourth residual error group subjected to 1 × 1 convolution operation on a channel, and performing first-time double upsampling on the spliced result; then, splicing the result of the first double upsampling and the characteristic of the first residual group after 1 × 1 convolution operation on a channel, and continuing to perform the second double upsampling; then, performing feature splicing on the result of the second-time double up-sampling and the features subjected to 7 × 7 convolution and 1 × 1 convolution operations on a channel, and continuing to perform third-time double up-sampling; finally, the results of the gesture segmentation are refined using a 3 × 3 convolution kernel, and a 1 × 1 convolution kernel in sequence.
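A minimal sketch of this decoding path is given below, assuming the encoder features arrive at 1/8, 1/4 and 1/2 of the input resolution; the 48-channel 1 × 1 reductions, the bilinear interpolation and the single-channel output are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Sketch of the decoder: three 2x upsamplings, each preceded by channel splicing
    with a 1x1-reduced encoder feature, then a 3x3 and a 1x1 refinement."""
    def __init__(self, y_ch=512, res4_ch=2048, res1_ch=256, stem_ch=64, out_ch=1):
        super().__init__()
        self.r4 = nn.Conv2d(res4_ch, 48, 1)   # 1x1 reduction of the fourth residual group
        self.r1 = nn.Conv2d(res1_ch, 48, 1)   # 1x1 reduction of the first residual group
        self.r0 = nn.Conv2d(stem_ch, 48, 1)   # 1x1 reduction of the 7x7-convolution feature
        self.head = nn.Sequential(
            nn.Conv2d(y_ch + 3 * 48, 256, 3, padding=1),
            nn.Conv2d(256, out_ch, 1))

    @staticmethod
    def up2(x):
        return F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, y, res4, res1, stem):
        x = self.up2(torch.cat([y, self.r4(res4)], dim=1))   # splice at 1/8, upsample to 1/4
        x = self.up2(torch.cat([x, self.r1(res1)], dim=1))   # splice at 1/4, upsample to 1/2
        x = self.up2(torch.cat([x, self.r0(stem)], dim=1))   # splice at 1/2, upsample to 1/1
        return self.head(x)                                  # refine the segmentation result
```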
In step d, the gesture recognition network comprises three convolution layers with an activation function ReLu for feature extraction, maximum pooling (MaxPooling), a full connection layer and a Softmax layer (a code sketch of this classifier is given after the training steps below);
the training gesture recognition network comprises the following steps:
performing a first set of convolution operations: performing 19 × 19 × 64 convolution once, then performing ReLu activation, and finally using a maximum pooling operation as a downsampling operation;
performing a second set of convolution operations: performing 17 × 17 × 128 convolution once, then performing ReLu activation, and finally using a maximum pooling operation as a downsampling operation;
a third set of convolution operations is performed: performing 15 × 15 × 128 convolution once, then performing ReLu activation, and finally using a maximum pooling operation as a downsampling operation;
and sequentially inputting the results of the third group of convolution operations into a Softmax layer and a full-connection layer, and outputting the final gesture classification result.
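A minimal sketch of this classifier, assuming a 1-channel 512 × 512 binarized input, 10 gesture classes, unit strides and no padding (none of which are fixed by the steps above):

```python
import torch.nn as nn

class GestureClassifier(nn.Module):
    """Sketch of the recognition network: three convolution blocks (19x19x64,
    17x17x128, 15x15x128), each followed by ReLu and max pooling, then a fully
    connected layer and a Softmax layer."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 19), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 17), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 15), nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(n_classes),   # full connection layer; input size inferred on first call
            nn.Softmax(dim=1))          # class probabilities over the gesture labels

    def forward(self, x):               # x: (N, 1, 512, 512) binarized gesture image
        return self.classifier(self.features(x))
```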
The gesture image segmentation and recognition device based on deep learning corresponding to the method comprises the following modules:
the gesture image acquisition module is connected with the preprocessing module and used for acquiring a color gesture image;
the preprocessing module is respectively connected with the gesture image acquisition module and the dense segmentation network training module and is used for cutting the color gesture image and providing an input image with a fixed size for the dense segmentation network training module;
the dense segmentation network training module is respectively connected with the preprocessing module and the binarization image acquisition module, trains a gesture segmentation model by using an input image output by the preprocessing module to obtain an optimized segmentation model and outputs a gesture segmentation result;
the binarization image acquisition module is respectively connected with the dense segmentation network training module and the gesture recognition model training module and is used for acquiring a binarization gesture image; and
and the gesture recognition model training module is connected with the binarization image obtaining module, trains a gesture recognition model by using the binarization gesture image to obtain an optimized gesture recognition model and outputs a gesture classification result.
Due to the variability of gestures, the results of hand detection are prone to generate rich backgrounds, thereby interfering with gesture recognition and reducing interactivity. Aiming at the problem, the invention provides a gesture image segmentation and recognition method based on deep learning, which is based on a dense segmentation network and an improved gesture recognition network, really realizes the fusion of local features and global features of a gesture, and enriches feature expression. The method has stronger robustness and can obtain higher recognition rate under the conditions of similar skin color, hand and face shielding, non-uniform illumination and the like.
The gesture image segmentation and recognition method based on deep learning provided by the invention has the advantages that:
for the problem that the gestures in the complex background have various scales, different void ratios are designed in parallel and cascade modes in the IASPP, and the void convolutions with different void ratios are stacked together, so that the gesture multi-scale information on different receptive fields is obtained, and the feature expression is enriched. Therefore, the IASPP combines global and high-level semantic features with local and detailed semantic features to filter redundant information in the background, and contributes to improving the segmentation precision.
The invention obtains more accurate gesture segmentation result by utilizing the encoder for acquiring high-level semantic information and the decoder for amplifying the image by utilizing the information in the encoding stage to recover the detail information of the image.
The overall performance of the invention is better than current mainstream algorithms, making it more suitable for human-computer interaction products. The improved gesture recognition network has the following advantage: compared with the original network method it effectively improves the gesture recognition rate, and compared with the traditional CNN method it achieves a better recognition effect on gesture images under different illumination.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a training diagram of a densely partitioned network in accordance with the present invention.
FIG. 3 is a training diagram of the gesture recognition network of the present invention.
Fig. 4 is a diagram of the IASPP framework in the present invention.
FIG. 5 is a diagram of the dense segmentation network framework in accordance with the present invention.
FIG. 6 is a diagram of the overall network framework of the present invention.
Fig. 7 and 8 are graphs comparing the segmentation effect of the present invention with other algorithms.
Detailed Description
The invention provides a gesture image segmentation and recognition method based on deep learning, which generally comprises the following 3 steps:
step 1: the gesture images in all complex backgrounds are resized (resize operation) so that the image sizes are fixed.
Step 2: and inputting the gesture image subjected to resize operation under the complex background into the dense segmentation network, so as to train the dense segmentation network, and outputting the trained dense segmentation network model. And finally, outputting a binarization gesture image by using the trained dense segmentation network model.
And step 3: and (3) inputting the gesture images divided in the step (2) into a gesture recognition network, training the gesture recognition network by using the gesture images with different gesture shapes, and outputting a trained gesture recognition network model. And classifying each different gesture by using the network model to realize the recognition of the gesture image.
Due to the variability of gestures, the results of hand detection easily contain rich backgrounds, which interferes with gesture recognition and reduces interactivity. Aiming at this problem, the invention provides a dense segmentation and gesture recognition strategy. Gesture segmentation can maximally remove the redundant information brought by the background and reduce the interference to the gesture recognition algorithm, thereby improving the accuracy of gesture recognition. In order to improve the accuracy of gesture segmentation, the invention provides an Improved Atrous Spatial Pyramid Pooling (IASPP) method, which combines a cascade mode and a parallel mode to extract features, thereby obtaining richer hand feature information.
And filtering redundant backgrounds by using the proposed dense segmentation network under a complex background, segmenting the gesture image, inputting the positioned gesture area into a gesture recognition network, and recognizing by adopting an improved algorithm. The invention improves the segmentation performance of the gesture image, thereby improving the recognition rate of the gesture image.
The dense segmentation network in step 2 mainly comprises three parts, in order: a Deep Convolutional Neural Network (DCNN) module, an improved hole space pyramid pooling (IASPP) module, and a decoder.
With reference to fig. 5, the input of the dense partition network in step 2 is a 512 × 512 × 3 RGB image, and the encoding portion is composed of DCNN and IASPP modules. Wherein, DCNN is a feature-extracted backbone network composed of 1 convolution kernel 7 × 7 (referred to as Conv in the figure), 1 convolution kernel 3 × 3 and 4 residual groups. As shown in table 1 below, the first residual group has 3 residual blocks, each of which has 3 layers of 1 × 1 × 64 convolution kernel, 3 × 3 × 64 convolution kernel, and 1 × 1 × 256 convolution kernel, respectively, and has 9 layers, the void rate d is 1, and the step length s is 2; the second residual group has 4 residual blocks, each residual block has 3 layers of convolution kernels of 1 × 1 × 128, 3 × 3 × 128, 1 × 1 × 512, and 12 layers, the void rate d is 1, and the step length s is 1; the third residual group has 6 residual blocks, each residual block has 3 layers of convolution kernels of 1 × 1 × 256, 3 × 3 × 256 and 1 × 1 × 1024, and has 18 layers, the void rate d is 2, and the step length s is 1; the fourth residual group has 3 residual blocks, each residual block has 3 layers of 1 × 1 × 512 convolution kernels, 3 × 3 × 512 convolution kernels, 1 × 1 × 2048 convolution kernels, and 9 layers in total, the void rate d is 4, and the step length s is 1.
TABLE 1 Deep Convolutional Neural Network (DCNN) parameter settings
Residual group | Residual blocks | Convolution kernels per block | Total layers | Hole rate d | Step length s
1 | 3 | 1×1×64, 3×3×64, 1×1×256 | 9 | 1 | 2
2 | 4 | 1×1×128, 3×3×128, 1×1×512 | 12 | 1 | 1
3 | 6 | 1×1×256, 3×3×256, 1×1×1024 | 18 | 2 | 1
4 | 3 | 1×1×512, 3×3×512, 1×1×2048 | 9 | 4 | 1
It is noted that in order for the Decoder (Decoder) to fuse more local detail information while reducing the amount of computation, a 1 × 1 convolution kernel is added after the output features of the 7 × 7 convolution kernel, the first residual group, and the fourth residual group in the DCNN. Finally, after the feature extraction of the RGB image by DCNN, the feature map finally output by the fourth residual group becomes 1/8 of the original image. The feature map of the fourth residual group output is used as the input of the IASPP module.
As shown in fig. 4, the design framework of the IASPP in the dense segmentation network combines both parallel and cascade modes. In the parallel mode, the invention uses hole convolutions with hole rates {2^0, 2^1, 2^2, ..., 2^n} to perform feature coding on the input feature map and acquire multi-scale information of the gesture. In the embodiment of the invention, n is set to 4, that is, a total of 5 hole convolutions perform multi-scale feature extraction on the feature map, generating richer feature expressions.
Taking n = 4, the output of the parallel mode is defined as formula (1):
o_i = H_{k, d_i}(x),  d_i = 2^i,  i = 0, 1, ..., 4    (1)
where x represents the input feature map, d represents the array of hole rates {2^0, 2^1, 2^2, ..., 2^4}, H_{k,d}(x) represents the hole convolution with a convolution kernel size of k and a hole rate of d, and o_i represents the outputs of the 5 parallel branches, which, as can be seen from fig. 4, are from top to bottom o_0, o_1, o_2, o_3, o_4.
In the cascade mode, each layer except the first and the second connects the output of the parallel mode in series with the output of the previous layer, so that gesture information is extracted in a denser manner and a better feature expression is generated. Specifically, the parallel-mode output o_1 is first further convolved with a hole convolution with k = 3 and d = 2, and the output is p_1. Then a hole convolution with k = 3 and d = 4 continues to extract features from o_2 and p_1 spliced on the channel dimension, and the output is p_2. Finally, a hole convolution with k = 3 and d = 8 continues to extract features from o_3 and p_2 spliced on the channel dimension, and the output is p_3.
The output of the cascade mode in the IASPP is defined as formula (2), in which ⊕ denotes splicing features of different scales on the channel dimension (Concat in fig. 4) and p_i represents the output of the cascade mode:
p_1 = H_{3,2}(o_1),  p_i = H_{3, 2^i}(o_i ⊕ p_{i-1}),  i = 2, 3    (2)
Since image segmentation is extremely sensitive to the spatial position information of pixels, in order to fuse more detailed information while restoring the image size, the invention designs deconvolutions (denoted by TC in fig. 4) with different hole rates, connected from bottom to top with the outputs of the parallel mode, to recover local features and make the image edges smoother. First, the deconvolution with k = 3 and d = 8 (i.e. TC1) restores the image size of the feature map obtained by splicing o_4 and p_3 on the channel dimension, and the output is q_1. Then the deconvolution with k = 3 and d = 4 (i.e. TC2) restores the image size of the feature map obtained by splicing o_3 and q_1 on the channel dimension, and the output is q_2. Next, the deconvolution with k = 3 and d = 2 (i.e. TC3) restores the image size of the feature map obtained by splicing o_2 and q_2 on the channel dimension, and the output is q_3. Then the deconvolution with k = 3 and d = 2 (i.e. TC4) restores the image size of the feature map obtained by splicing o_1 and q_3 on the channel dimension, and the output is q_4. The final output y of the IASPP is the feature obtained by splicing o_0 and q_4 on the channel dimension.
The above paragraph is formulated as follows:
q_1 = DH_{3,8}(o_4 ⊕ p_3),  q_2 = DH_{3,4}(o_3 ⊕ q_1),  q_3 = DH_{3,2}(o_2 ⊕ q_2),  q_4 = DH_{3,2}(o_1 ⊕ q_3)    (3)
y = o_0 ⊕ q_4    (4)
In the formulas, q_j represents the output after deconvolution, y represents the final output of the IASPP, and DH_{3,d} represents a deconvolution with a convolution kernel of 3 and a hole rate of d.
The output feature map of the fourth residual group in the DCNN is used as the input of the IASPP; the 2048-dimensional features output by the DCNN are feature-coded using hole convolutions with different hole rates, mining multi-scale context information while enriching the feature expression.
As shown in fig. 5, in order to recover more detailed features during decoding (Decoder), three scale features of a 7 × 7 convolution kernel, a first residual group, and a fourth residual group are selected in the DCNN. And three upsampling operations are used to resize the feature map, connecting it with the feature map from the encoding portion after each upsampling. The decoding process specifically comprises: firstly, performing feature splicing on the output result y of the IASPP and the feature of the fourth residual group after 1 × 1 convolution operation on a channel, and performing first double upsampling on the spliced result (denoted by Up in the figure); then, splicing the result of the first double upsampling and the characteristic of the first residual group after 1 × 1 convolution operation on a channel, and continuing to perform the second double upsampling; and then, performing feature splicing on the result of the second-time double up-sampling and the features after 7 × 7 convolution and 1 × 1 convolution operation on the channel, and continuing to perform third-time double up-sampling. Finally, a 3 × 3 convolution kernel, and a 1 × 1 convolution kernel are used in sequence to refine the result of the gesture segmentation.
In step 3, the information output by the dense segmentation network is input to the gesture recognition network model, where classification continues.
As shown in fig. 6, in the gesture recognition network model, a gesture classification network is formed by three convolutional layers, an activation function ReLu for feature extraction, a maximum pooling MaxPooling, a Softmax layer, and a full link layer. In the classification process, the output of the dense segmentation network model is randomly divided into a training set and a testing set, and then the training set and the testing set are used as input to be input into a gesture classification layer. In the gesture classification method, the operations performed sequentially include a first set of convolution operations (the first set of convolutions are performed once for 19 × 19 × 64 convolution, followed by ReLu activation, and finally maximum pooling operation is used as a down-sampling operation); a second set of convolution operations (the second set of convolutions are performed once with a 17 x 128 convolution, followed by ReLu activation, and finally with a maximum pooling operation as a downsampling operation); a third set of convolution operations (the third set of convolutions performs a 15 x 128 convolution, followed by ReLu activation, and finally a maximum pooling operation as a downsampling operation); and finally, sequentially inputting the results of the third group of convolution operations to a Softmax layer, and outputting the final gesture classification result by the full-connection layer.
With reference to fig. 2 and fig. 3, a gesture image segmentation and recognition apparatus based on deep learning corresponding to the above method includes the following modules:
and the gesture image acquisition module is connected with the first preprocessing module and used for acquiring the color gesture image.
And the first preprocessing module is respectively connected with the gesture image acquisition module and the dense segmentation network training module and is used for carrying out cutting operation on the color gesture image and providing an input image with a fixed size for the dense segmentation network training module.
And the dense segmentation network training module is respectively connected with the first preprocessing module and the gesture image segmentation module, and trains a gesture segmentation model by using the input image output by the first preprocessing module so as to obtain an optimized segmentation model.
And the gesture image segmentation module is respectively connected with the dense segmentation network training module and the image segmentation result output module and is used for segmenting the gesture through the optimized gesture segmentation model.
And the image segmentation result output module is connected with the gesture image segmentation module and is used for outputting the segmented gesture image.
The data processed by the first preprocessing module are divided into training data and testing data, the dense segmentation network training module trains a dense segmentation network model by using the training data, cross entropy loss calculation is carried out on segmentation images and real gesture segmentation labels to obtain integral loss of the segmentation network, and the loss is continuously reduced by using a back propagation idea so as to fit the segmentation model, and a stable segmentation model is obtained. And performing gesture image segmentation on the test data or other non-test data by adopting the optimized dense segmentation network model, and finally outputting a gesture image segmentation result by an image segmentation result output module.
The output in fig. 2 serves as the input in fig. 3, namely: and the gesture image segmentation result output by the image segmentation result output module enters a binarization image acquisition module, and a binarization gesture image is acquired by the binarization image acquisition module. Specifically, the binarization image acquisition is to feed the segmentation result into a sigmoid function to adjust the segmentation result to be in a range of 0-1, and a threshold-based method is used for obtaining a final binarization image. If the value is more than 0.5, the value is 1, otherwise, the value is 0.
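A minimal sketch of this binarization step, assuming the segmentation network emits a single-channel score map:

```python
import torch

def binarize(logits, threshold=0.5):
    """Send the segmentation result through a sigmoid so it lies in [0, 1],
    then threshold it: values greater than 0.5 become 1, the rest become 0."""
    probs = torch.sigmoid(logits)
    return (probs > threshold).float()
```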
The binarization image obtaining module is also connected with a second preprocessing module, and the second preprocessing module is used for cutting the binarization gesture image and providing an input image with a fixed size for the gesture recognition model training module.
And the gesture recognition model training module is connected with the second preprocessing module and is used for recognizing the fixed-size binarized gesture image. The process is as follows: in the gesture recognition model training module, a gesture recognition model is first constructed, consisting of three convolution layers (the first layer has 64 convolution kernels of size 19 × 19, the second layer has 128 convolution kernels of size 17 × 17, and the third layer has 128 convolution kernels of size 15 × 15 with a step length of 2), ReLu and MaxPooling layers for feature extraction, a full connection layer and a Softmax layer. Parameters are initialized and gesture model recognition is performed; cross entropy loss is calculated between the recognition result and the real label, and if the loss reaches the expectation, the gesture recognition model is obtained; otherwise the loss is continuously reduced using the back propagation idea, the parameters are updated, and gesture model recognition continues.
In the gesture recognition model training module, the output of the segmentation model is also randomly divided into a training set and a test set, and then input into the gesture recognition model as input.
In detail, as shown in fig. 1, the gesture image segmentation and recognition method based on deep learning provided by the present invention includes the following steps:
step 1: and inputting a color gesture image. The color gesture image input in the embodiment of the invention is selected from the common vision data set OUTHANDS and HGR data sets. The input color gesture image is used for making a foundation for subsequent training and verification of the network model.
Step 2: the input image is pre-processed so that the image reaches a fixed dimension.
In this step, the number of the preprocessed images in the outhand data set is 3000, wherein 2000 images are used as a training set and 1000 images are used as a verification set. The number of images after preprocessing of the HGR data set is 899, wherein 630 images are used as a training set, and 269 images are used as a verification set.
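For illustration, the fixed-size preprocessing could be written as follows, assuming torchvision is used and 512 × 512 is the fixed size (the file name is purely illustrative):

```python
from PIL import Image
from torchvision import transforms

# Resize every color gesture image to the fixed 512x512 input size of the
# dense segmentation network and convert it to a tensor.
preprocess = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
])
image = preprocess(Image.open("gesture_0001.png").convert("RGB")).unsqueeze(0)  # (1, 3, 512, 512)
```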
And step 3: and constructing a dense split network.
The dense segmentation network set in this step is specifically designed for gestures in a complex background. The structure of the neural network of this step is shown in fig. 5, and the structure of the IASPP module in this step is shown in fig. 4. The method comprises the following specific steps:
the training data in step 2 (here using only the pre-processed training set) is used as the input image for step 3. Firstly, two convolution operations are carried out on an input image, the sizes of convolution kernels used are 7 x 7 and 3 x 3 in sequence, and then the convolution kernels are sent into a first residual error group, a second residual error group, a third residual error group and a fourth residual error group in sequence. Finally, after the feature extraction of the RGB image by DCNN, the feature map finally output by the fourth residual group becomes 1/8 of the original image.
The output feature map of the fourth residual group is taken as the input of the IASPP module. After the feature map enters the IASPP module, convolution operations are first performed on the feature layers using convolution kernels with five different hole rates in the parallel mode; because convolution kernels with different hole rates are used, feature layers with different receptive field sizes are obtained, and multi-scale context information is mined while the feature expression is enriched. As shown in fig. 4, the outputs from top to bottom are o_0, o_1, o_2, o_3, o_4. In the cascade mode, the parallel-mode output o_1 is first further convolved with a hole convolution with k = 3 and d = 2, and the output is p_1. Then a hole convolution with k = 3 and d = 4 continues to extract features from o_2 and p_1 spliced on the channel dimension, and the output is p_2. Finally, a hole convolution with k = 3 and d = 8 continues to extract features from o_3 and p_2 spliced on the channel dimension, and the output is p_3.
The invention also designs deconvolutions with different hole rates, connected from bottom to top with the outputs of the parallel mode, to recover local features and make the image edges smoother. First, the deconvolution with k = 3 and d = 8 restores the image size of the feature map obtained by splicing o_4 and p_3 on the channel dimension, and the output is q_1. Then the deconvolution with k = 3 and d = 4 restores the image size of the feature map obtained by splicing o_3 and q_1 on the channel dimension, and the output is q_2. Next, the deconvolution with k = 3 and d = 2 restores the image size of the feature map obtained by splicing o_2 and q_2 on the channel dimension, and the output is q_3. Then the deconvolution with k = 3 and d = 2 restores the image size of the feature map obtained by splicing o_1 and q_3 on the channel dimension, and the output is q_4. The final output y of the IASPP module is the feature obtained by splicing o_0 and q_4 on the channel dimension.
For the decoder, three scale features of the 7 × 7 convolution kernel, the first residual group, and the fourth residual group are selected in order to recover more detailed features during decoding. And three upsampling operations are used for adjusting the size of the feature map, wherein the upsampling operation is to expand each layer of features in the feature layer to a corresponding dimension in a linear interpolation mode, and the layer number is unchanged. Finally, the results of the gesture segmentation are refined using 3 × 3 and 1 × 1 convolution kernels.
Step 4: fit and train the gesture segmentation model using gesture data to derive a stable segmentation model.
And sending the gesture image as input into a dense segmentation network to obtain a segmentation result, and performing cross entropy loss calculation with a real gesture segmentation label to obtain the overall loss of the dense segmentation network. And the loss is continuously reduced by using a back propagation idea so as to fit the segmentation model and obtain a stable dense segmentation model. Through the steps, a gesture segmentation model based on the convolutional neural network is finally obtained through training, and the gesture image can be segmented according to the segmentation model.
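The fitting loop described above could look roughly as follows; the binary form of the cross entropy loss, the Adam optimizer and the hyperparameters are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

def train_segmentation(model, loader, epochs=50, lr=1e-3, device="cuda"):
    """Minimal fitting loop: cross entropy between the predicted and the real
    gesture segmentation labels, reduced by back propagation."""
    model = model.to(device)
    criterion = nn.BCEWithLogitsLoss()          # binary cross entropy on hand vs. background
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, masks in loader:            # masks: (N, 1, H, W) with values in {0, 1}
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            logits = model(images)              # (N, 1, H, W) segmentation scores
            loss = criterion(logits, masks.float())
            loss.backward()                     # back propagation continuously reduces the loss
            optimizer.step()
    return model
```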
Step 5: the segmentation result obtained in step 4 is subjected to binarization processing.
Namely, the segmentation result is sent to a sigmoid function to be adjusted to be in the range of 0-1, and a final binary image is obtained by using a threshold-based method. If the value is more than 0.5, the value is 1, otherwise, the value is 0.
Step 6: constructing gesture recognition models
The model comprises three convolution layers, ReLu and MaxPooling for feature extraction, a full connection layer and a Softmax layer, together forming the gesture classification layer.
In the classification algorithm, firstly, the binarized image in step 5 is subjected to random cropping operation, the cropping proportion is 0.75-1 times of the original image (512 × 512), and then the image size is reset to 512 × 512 pixels and is input to the gesture classification layer as an input image.
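A sketch of this cropping step, assuming torchvision and reading the 0.75-1 proportion as a fraction of the image area:

```python
from torchvision import transforms

# Random crop of 0.75-1.0x of the 512x512 binarized gesture image, resized
# back to 512x512 before it enters the gesture classification layer.
augment = transforms.Compose([
    transforms.RandomResizedCrop(512, scale=(0.75, 1.0), ratio=(1.0, 1.0)),
    transforms.ToTensor(),
])
```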
The operations performed sequentially include a first set of convolution operations (the first set of convolutions are performed once for a 19 x 64 convolution, followed by ReLu activation, and finally a maximum pooling operation as a downsampling operation); a second set of convolution operations (the second set of convolutions are performed once with a 17 x 128 convolution, followed by ReLu activation, and finally with a maximum pooling operation as a downsampling operation); a third set of convolution operations (the third set of convolutions performs a 15 x 128 convolution, followed by ReLu activation, and finally a maximum pooling operation as a downsampling operation); and finally, sequentially inputting the results of the third group of convolution operations to a Softmax layer, and outputting the final gesture classification result by the full-connection layer. And training the gesture recognition model by using the classified cross entropy loss, adjusting network model parameters, and storing the model parameters after the training is finished.
Step 7: image classification.
After the training of the model is completed, for a test image, a gesture segmentation image is obtained through a dense segmentation network, and then the image subjected to binarization is sent to a gesture recognition model for final classification.
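Putting the trained pieces together, the inference pass described above could be sketched as follows; the model and variable names are illustrative only.

```python
import torch

@torch.no_grad()
def recognize(image, seg_model, cls_model, device="cuda"):
    """Test-time pipeline sketch: dense segmentation -> binarization -> classification.
    `image` is assumed to be a (1, 3, 512, 512) tensor."""
    seg_model.eval(); cls_model.eval()
    logits = seg_model(image.to(device))           # gesture segmentation scores
    mask = (torch.sigmoid(logits) > 0.5).float()   # binarized gesture image
    scores = cls_model(mask)                       # gesture class probabilities
    return scores.argmax(dim=1)                    # predicted gesture label
```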
In order to further prove the effectiveness of the combined dense segmentation and gesture classification model provided by the invention, gesture segmentation experiments are carried out on the public OUTHANDS and HGR data sets, and the method is compared with other deep-learning-based recognition algorithms on the NUS-II data set.
As shown in Table 2, the recognition precision of the dense segmentation and the gesture classification provided by the invention can reach 98.61%, which is improved by 3.99% compared with the gesture recognition algorithm, and the running time of the algorithm is not greatly increased while the algorithm is superior to other comparison algorithms. Therefore, the segmentation algorithm provided by the invention can maximally filter the interference information in the background and improve the accuracy of gesture recognition.
TABLE 2 recognition rates on OUTHANDS datasets
(Table data not reproduced.)
From table 3, it can be seen that the segmentation algorithm based on the dense segmentation network has great advantages in gesture segmentation, wherein the accuracy (Precision, Pr), Recall (Recall, Re), balance F score (F-score), and area under ROC curve (AUC) reach 0.9948, 0.9929, 0.9939, and 0.9982, respectively. These evaluation indexes are all higher than the comparison algorithm, which shows that the method of the present invention is superior to the comparison algorithm in all aspects.
TABLE 3 Comparison of the algorithm herein with machine learning methods on the HGR data set
(Table data not reproduced.)
In order to further prove that the dense segmentation and gesture recognition algorithm provided by the invention can improve the gesture recognition rate, the NUS-II data set is compared with other algorithms based on deep learning. The result is shown in table 4, and it can be known from table 4 that the gesture recognition rate of the method of the present invention can reach 98.63%, which is improved by 0.33% compared with the suboptimal algorithm. Therefore, the method and the device can enable the segmentation of the gesture and the background to be more accurate, and can further improve the gesture recognition rate.
TABLE 4 recognition Rate on NUS-II data set
(Table data not reproduced.)
FIGS. 7 and 8 show graphs comparing the results of the method of the present invention in segmenting and recognizing gestures with other methods. As can be seen from the figure, the method (corresponding to IASPP-ResNet) of the invention is closer to the real label (GT) than other methods, and the method of the invention is better than other methods.
The references referred to in the present description are as follows:
[1] WEI Bao-guo, XU Yong, LIU Jin-wei, ZHOU Jia-ming. Adaptive gesture segmentation based on SSD object detection[J]. Journal of Signal Processing, 2020, 36(07): 1038-1047. (in Chinese)
[2]Adithya V,Rajesh R.A deep convolutional neural network approach for static hand gesture recognition[J].Procedia Computer Science,2020,171:2353-2361.
[3]Zhang Q,Yang M,Kpalma K,et al.Segmentation of hand posture against complex backgrounds based on saliency and skin colour detection[J].IAENG International Journal of Computer Science,2018,45(3):435-444.
[4]J.Sun,T.Ji,S.Zhang,J.Yang,G.Ji.Research on the hand gesture recognition based on deep learning[A].2018 12th International Symposium on Antennas,Propagation and EM Theory(ISAPE)[C].Hangzhou,China:IEEE,2018.1-4.
[5] Arenas J O P, Moreno R J, et al. Convolutional neural network with a DAG architecture for control of a robotic arm by means of hand gestures[J]. Contemporary Engineering Sciences, 2018, 11(12): 547-557.
[6]Tan Y S,Lim K M,Tee C,et al.Convolutional neural network with spatial pyramid pooling for hand gesture recognition[J].Neural Computing and Applications,2020:1-13.

Claims (10)

1. A gesture image segmentation and recognition method based on deep learning is characterized by comprising the following steps:
a. carrying out size resetting operation on the input gesture image to fix the size of the image;
b. inputting the gesture image in the step a into the dense segmentation network, training the dense segmentation network, and obtaining a dense segmentation network model after training;
the dense segmentation network includes an encoder and a decoder; the encoder in turn comprises a deep convolutional neural network module and an improved cavity space pyramid pooling module;
the improved cavity space pyramid pooling module comprises a parallel mode and a cascade mode; in a parallel mode, carrying out feature coding on the input feature graph by using different void ratios to acquire multi-scale information of the gesture; in the cascade mode, each layer except the first layer and the second layer connects the output of the parallel mode in series with the output of the previous layer; then, deconvolution with different void ratios is adopted to be connected with the output of the parallel mode from bottom to top;
c. segmenting the gesture image by adopting a trained dense segmentation network model, and performing binarization processing on a segmentation result;
d. inputting the divided binary gesture images into a gesture recognition network, training the gesture recognition network by using the gesture images with different gesture shapes, and obtaining a gesture recognition network model after training;
e. and classifying the gestures in different shapes by adopting the trained gesture recognition network model to realize the recognition of the gesture images.
2. The method as claimed in claim 1, wherein in step b, the hole rates used in the parallel mode are {2^0, 2^1, 2^2, ..., 2^n}, i.e. a total of n+1 hole convolutions are used to perform multi-scale feature extraction on the feature map.
3. The method for segmenting and recognizing the gesture image based on the deep learning as claimed in claim 2, wherein n is 4, and the output of the parallel mode is shown as the following formula:
o_i = H_{k, d_i}(x),  d_i = 2^i,  i = 0, 1, ..., 4
where x represents the input feature map, d represents the array of hole rates {2^0, 2^1, 2^2, ..., 2^4}, H_{k,d}(x) represents a hole convolution with a convolution kernel size of k and a hole rate of d, and o_i represents the outputs of the 5 parallel branches, which from top to bottom are o_0, o_1, o_2, o_3, o_4;
the output of the cascade mode is given by:
p_1 = H_{3,2}(o_1),  p_i = H_{3, 2^i}(o_i ⊕ p_{i-1}),  i = 2, 3
where p_i represents the output of the cascade mode, and ⊕ represents splicing features of different scales on the channel dimension;
deconvolutions with different hole rates are adopted and connected with the outputs of the parallel mode from bottom to top, and the specific formulas of the deconvolution are:
q_1 = DH_{3,8}(o_4 ⊕ p_3),  q_2 = DH_{3,4}(o_3 ⊕ q_1),  q_3 = DH_{3,2}(o_2 ⊕ q_2),  q_4 = DH_{3,2}(o_1 ⊕ q_3)
y = o_0 ⊕ q_4
in the formulas, q_j represents the output after deconvolution, y represents the output of the improved cavity space pyramid pooling module, and DH_{3,d} represents a deconvolution with a convolution kernel of 3 and a hole rate of d.
4. The method as claimed in claim 1, wherein in step b, the deep convolutional neural network module comprises a 7 × 7 convolutional kernel, a 3 × 3 convolutional kernel and 4 residual error groups.
5. The method as claimed in claim 4, wherein the 4 residual error groups are respectively as follows: the first residual group has 3 residual blocks, each residual block has 3 layers, namely a convolution kernel of 1 × 1 × 64, a convolution kernel of 3 × 3 × 64 and a convolution kernel of 1 × 1 × 256, and the residual blocks have 9 layers in total, the void ratio d is 1, and the step length s is 2; the second residual group has 4 residual blocks, each residual block has 3 layers, which are respectively a convolution kernel of 1 × 1 × 128, a convolution kernel of 3 × 3 × 128 and a convolution kernel of 1 × 1 × 512, and the residual blocks have 12 layers, the void ratio d is 1, and the step length s is 1; the third residual group has 6 residual blocks, each residual block has 3 layers, which are respectively a convolution kernel of 1 × 1 × 256, a convolution kernel of 3 × 3 × 256, and a convolution kernel of 1 × 1 × 1024, and has 18 layers, the void ratio d is 2, and the step length s is 1; the fourth residual group has 3 residual blocks, each residual block has 3 layers, which are respectively a convolution kernel of 1 × 1 × 512, a convolution kernel of 3 × 3 × 512, and a convolution kernel of 1 × 1 × 2048, and has 9 layers, the void rate d is 4, and the step length s is 1.
6. The method as claimed in claim 5, wherein in step b, the decoder decodes the gesture image according to the following steps: performing characteristic splicing on the output result of the improved cavity space pyramid pooling module and the characteristic of the fourth residual error group subjected to 1 × 1 convolution operation on a channel, and performing first-time double upsampling on the spliced result; then, splicing the result of the first double upsampling and the characteristic of the first residual group after 1 × 1 convolution operation on a channel, and continuing to perform the second double upsampling; then, performing feature splicing on the result of the second-time double up-sampling and the features subjected to 7 × 7 convolution and 1 × 1 convolution operations on a channel, and continuing to perform third-time double up-sampling; finally, the results of the gesture segmentation are refined using a 3 × 3 convolution kernel, and a 1 × 1 convolution kernel in sequence.
7. The method for segmenting and recognizing the gesture image based on the deep learning of claim 1, wherein in the step d, the gesture recognition network comprises three convolution layers, an activation function ReLu for feature extraction, a maximum value pooling Max Pooling, a full connection layer and a Softmax layer;
the training gesture recognition network comprises the following steps:
performing a first set of convolution operations: performing 19 × 19 × 64 convolution once, then performing ReLu activation, and finally using a maximum pooling operation as a downsampling operation;
performing a second set of convolution operations: performing 17 × 17 × 128 convolution once, then performing ReLu activation, and finally using a maximum pooling operation as a downsampling operation;
a third set of convolution operations is performed: performing 15 × 15 × 128 convolution once, then performing ReLu activation, and finally using a maximum pooling operation as a downsampling operation;
and sequentially passing the result of the third group of convolution operations through the fully connected layer and the Softmax layer, and outputting the final gesture classification result.
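The following is a minimal PyTorch-style sketch of this recognition network. The single-channel input (a binarized gesture image), the absence of padding, the 2 × 2 pooling windows and the number of gesture classes are assumptions; only the three kernel sizes, the ReLU/max-pooling order and the fully connected plus Softmax head follow the claim.

import torch.nn as nn

class GestureClassifierSketch(nn.Module):
    """Sketch of the recognition network in claim 7: three conv + ReLU +
    max-pool stages with 19x19x64, 17x17x128 and 15x15x128 kernels, then a
    fully connected layer and Softmax. nn.LazyLinear infers the flattened
    feature size from the (assumed) input resolution at the first forward."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=19), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=17), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, kernel_size=15), nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(num_classes),
                                  nn.Softmax(dim=1))

    def forward(self, x):
        return self.head(self.features(x))

When training with a cross-entropy loss, the Softmax layer would normally be omitted from the network and applied implicitly by the loss; it is kept here only to mirror the claim wording.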
8. A gesture image segmentation and recognition device based on deep learning, characterized by comprising the following modules:
a gesture image acquisition module, connected with the preprocessing module and used for acquiring a color gesture image;
a preprocessing module, connected with the gesture image acquisition module and the dense segmentation network training module respectively, and used for cropping the color gesture image and providing an input image of fixed size to the dense segmentation network training module;
a dense segmentation network training module, connected with the preprocessing module and the binarized image acquisition module respectively, which trains a gesture segmentation model with the input image output by the preprocessing module to obtain an optimized segmentation model and outputs a gesture segmentation result;
a binarized image acquisition module, connected with the dense segmentation network training module and the gesture recognition model training module respectively, and used for acquiring a binarized gesture image; and
a gesture recognition model training module, connected with the binarized image acquisition module, which trains a gesture recognition model with the binarized gesture image to obtain an optimized gesture recognition model and outputs a gesture classification result;
wherein, in the dense segmentation network training module, the dense segmentation network comprises an encoder and a decoder; the encoder comprises a deep convolutional neural network module and an improved atrous spatial pyramid pooling module; the improved atrous spatial pyramid pooling module comprises a parallel mode and a cascade mode; in the parallel mode, the input feature map is encoded with different dilation rates to acquire multi-scale information of the gesture; in the cascade mode, each layer except the first and second layers concatenates the output of the parallel mode with the output of the previous layer; deconvolutions with different dilation rates then connect these layers with the outputs of the parallel mode from bottom to top.
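Expanding on the single cascade step sketched after claim 3, the following PyTorch-style sketch puts the parallel and cascade modes together. The dilation rates (1, 6, 12, 18), the channel widths and the final 1 × 1 fusion convolution are assumptions; only the parallel dilated branches and the cascade concatenation from the third layer onward follow the claim.

import torch
import torch.nn as nn

class ImprovedASPPSketch(nn.Module):
    """Sketch of the improved atrous spatial pyramid pooling module of claim 8.
    Parallel mode: dilated 3x3 convolutions with different rates. Cascade mode:
    from the third layer onward, the parallel output is concatenated with the
    previous cascade output and fused by a dilated 3x3 deconvolution."""

    def __init__(self, in_ch=2048, ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.parallel = nn.ModuleList(
            [nn.Conv2d(in_ch, ch, 3, padding=r, dilation=r) for r in rates])
        self.cascade = nn.ModuleList(
            [nn.ConvTranspose2d(2 * ch, ch, 3, padding=r, dilation=r)
             for r in rates[2:]])
        self.project = nn.Conv2d(len(rates) * ch, ch, 1)  # assumed 1x1 fusion

    def forward(self, x):
        p = [branch(x) for branch in self.parallel]    # parallel-mode outputs
        outs, prev = [p[0], p[1]], p[1]                # first two layers kept as-is
        for i, deconv in enumerate(self.cascade, start=2):
            prev = deconv(torch.cat([p[i], prev], dim=1))  # cascade-mode fusion
            outs.append(prev)
        return self.project(torch.cat(outs, dim=1))    # module output y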
9. The gesture image segmentation and recognition device based on deep learning as claimed in claim 8, wherein the deep convolutional neural network module comprises a 7 × 7 convolution kernel, a 3 × 3 convolution kernel and 4 residual groups; the 4 residual groups are as follows: the first residual group has 3 residual blocks, each with 3 layers (a 1 × 1 × 64 convolution kernel, a 3 × 3 × 64 convolution kernel and a 1 × 1 × 256 convolution kernel), 9 layers in total, with a dilation rate d of 1 and a stride s of 2; the second residual group has 4 residual blocks, each with 3 layers (a 1 × 1 × 128 convolution kernel, a 3 × 3 × 128 convolution kernel and a 1 × 1 × 512 convolution kernel), 12 layers in total, with a dilation rate d of 1 and a stride s of 1; the third residual group has 6 residual blocks, each with 3 layers (a 1 × 1 × 256 convolution kernel, a 3 × 3 × 256 convolution kernel and a 1 × 1 × 1024 convolution kernel), 18 layers in total, with a dilation rate d of 2 and a stride s of 1; the fourth residual group has 3 residual blocks, each with 3 layers (a 1 × 1 × 512 convolution kernel, a 3 × 3 × 512 convolution kernel and a 1 × 1 × 2048 convolution kernel), 9 layers in total, with a dilation rate d of 4 and a stride s of 1.
10. The gesture image segmentation and recognition device based on deep learning as claimed in claim 8, wherein the gesture recognition model training module utilizes a gesture recognition network consisting of three convolution layers with the ReLU activation function for feature extraction, max pooling (Max Pooling), a fully connected layer and a Softmax layer.
CN202111016595.6A 2021-08-31 2021-08-31 Gesture image segmentation and recognition method and device based on deep learning Active CN113780140B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111016595.6A CN113780140B (en) 2021-08-31 2021-08-31 Gesture image segmentation and recognition method and device based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111016595.6A CN113780140B (en) 2021-08-31 2021-08-31 Gesture image segmentation and recognition method and device based on deep learning

Publications (2)

Publication Number Publication Date
CN113780140A true CN113780140A (en) 2021-12-10
CN113780140B CN113780140B (en) 2023-08-04

Family

ID=78840393

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111016595.6A Active CN113780140B (en) 2021-08-31 2021-08-31 Gesture image segmentation and recognition method and device based on deep learning

Country Status (1)

Country Link
CN (1) CN113780140B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241245A (en) * 2021-12-23 2022-03-25 西南大学 Image classification system based on residual error capsule neural network
CN114241245B (en) * 2021-12-23 2024-05-31 西南大学 Image classification system based on residual capsule neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012194659A (en) * 2011-03-15 2012-10-11 Shinsedai Kk Gesture recognition device, gesture recognition method, and computer program
KR20180130869A (en) * 2017-05-30 2018-12-10 주식회사 케이티 CNN For Recognizing Hand Gesture, and Device control system by hand Gesture
CN108334814A (en) * 2018-01-11 2018-07-27 浙江工业大学 A kind of AR system gesture identification methods based on convolutional neural networks combination user's habituation behavioural analysis
CN110728682A (en) * 2019-09-09 2020-01-24 浙江科技学院 Semantic segmentation method based on residual pyramid pooling neural network
CN112950652A (en) * 2021-02-08 2021-06-11 深圳市优必选科技股份有限公司 Robot and hand image segmentation method and device thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Li Qian, et al.: "Temporal Segment Connection Network for Action Recognition", IEEE Access, vol. 8, pages 179118-179127, XP011812974, DOI: 10.1109/ACCESS.2020.3027386 *
Wang Long; Liu Hui; Wang Bin; Li Pengju: "Gesture recognition method combining a skin color model and a convolutional neural network", Computer Engineering and Applications, vol. 53, no. 6, pages 209-214 *

Also Published As

Publication number Publication date
CN113780140B (en) 2023-08-04

Similar Documents

Publication Publication Date Title
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
WO2023185243A1 (en) Expression recognition method based on attention-modulated contextual spatial information
Kowsalya et al. Recognition of Tamil handwritten character using modified neural network with aid of elephant herding optimization
CN109948453B (en) Multi-person attitude estimation method based on convolutional neural network
CN108171133B (en) Dynamic gesture recognition method based on characteristic covariance matrix
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN110909801B (en) Data classification method, system, medium and device based on convolutional neural network
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN113920516B (en) Calligraphy character skeleton matching method and system based on twin neural network
CN113159232A (en) Three-dimensional target classification and segmentation method
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
CN114529982A (en) Lightweight human body posture estimation method and system based on stream attention
CN112966574A (en) Human body three-dimensional key point prediction method and device and electronic equipment
CN110555383A (en) Gesture recognition method based on convolutional neural network and 3D estimation
CN108537109B (en) OpenPose-based monocular camera sign language identification method
CN112906520A (en) Gesture coding-based action recognition method and device
CN115205933A (en) Facial expression recognition method, device, equipment and readable storage medium
Kwolek et al. Recognition of JSL fingerspelling using deep convolutional neural networks
CN114937285A (en) Dynamic gesture recognition method, device, equipment and storage medium
CN110555406B (en) Video moving target identification method based on Haar-like characteristics and CNN matching
CN115171052B (en) Crowded crowd attitude estimation method based on high-resolution context network
CN116797640A (en) Depth and 3D key point estimation method for intelligent companion line inspection device
CN113780140B (en) Gesture image segmentation and recognition method and device based on deep learning
CN113673325B (en) Multi-feature character emotion recognition method
CN112580721B (en) Target key point detection method based on multi-resolution feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant