CN113780140A - Gesture image segmentation and recognition method and device based on deep learning - Google Patents

Gesture image segmentation and recognition method and device based on deep learning

Info

Publication number
CN113780140A
CN113780140A (application CN202111016595.6A)
Authority
CN
China
Prior art keywords
gesture
convolution
convolution kernel
segmentation
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111016595.6A
Other languages
Chinese (zh)
Other versions
CN113780140B (en)
Inventor
崔振超
雷玉
齐静
杨文柱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University
Original Assignee
Hebei University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University filed Critical Hebei University
Priority to CN202111016595.6A priority Critical patent/CN113780140B/en
Publication of CN113780140A publication Critical patent/CN113780140A/en
Application granted granted Critical
Publication of CN113780140B publication Critical patent/CN113780140B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20112 - Image segmentation details
    • G06T2207/20132 - Image cropping
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30196 - Human being; Person
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a gesture image segmentation and recognition method and device based on deep learning. The method first preprocesses the gesture image so that the image size is fixed. Second, under a complex background, a dense segmentation network densely connects hole convolutions with different hole rates to acquire multi-scale gesture information over different receptive fields, improving the accuracy of feature expression. In addition, in order to fuse detail and spatial position information at different levels and improve the segmentation performance of the whole network, the dense segmentation network adopts an encoder-decoder structure, removes redundant background information, and achieves accurate segmentation of the gesture image. Finally, the mask image retaining only the gesture is input into a gesture recognition network and recognized with an improved algorithm. The invention can improve the segmentation performance of gesture images and thereby improve the gesture recognition rate.

Description

Gesture image segmentation and recognition method and device based on deep learning
Technical Field
The invention relates to the field of human-computer interaction and computer vision, in particular to a gesture image segmentation and recognition method and device based on deep learning.
Background
Gesture interaction based on gesture recognition is one of the basic interaction modes in the field of human-computer interaction and a key research direction of machine vision and computer applications. Gesture recognition is widely applied in fields such as unmanned aerial vehicle gimbal control, AR (Augmented Reality) and VR (Virtual Reality), and offers strong advantages in many settings, for example non-contact environments and environments that are noisy or must stay quiet, so increasing the robustness and performance of gesture recognition is of great importance.
At present, gesture interaction methods fall into two main types: those based on sensing devices and those based on vision. For gesture recognition based on sensing equipment, for example, Chinese patent application 201810542738.9 discloses a gesture recognition method and apparatus for improving the accuracy of gesture recognition and reducing misoperation. The method comprises the following steps: when a touch operation is detected, detecting the motion track of the contact point, where the motion track of the contact point represents the gesture controlling the terminal device; detecting the current moving speed of the contact point; and recognizing the gesture according to the current moving speed and the motion track. Chinese patent application 201510552869.1 discloses a 3D gesture recognition method comprising the steps of: S1, physical hardware acquires 3D coordinate data of the user gesture in real time; S2, the physical hardware preprocesses the acquired 3D coordinate data to form feedback data; S3, data processing software identifies the feedback data; and S4, the system outputs the data recognition result. This can effectively relieve the problems that gesture recognition must process a large amount of data, that the process is complex and that software processing efficiency is low, but it needs the support of additional equipment. Therefore, sensor-based gesture recognition requires expensive auxiliary equipment, the interaction mode is not friendly and natural enough, and it is difficult to meet the requirements of actual human-computer interaction.
For vision-based gesture recognition, as in document [1], Wei et al fuse the target detection model SSD into gesture segmentation, but the skin color probability map used there is thresholded, which causes loss of hand detail information. Chinese patent application 201910130815.4 proposes a gesture image segmentation and recognition method based on an improved capsule network, which adopts the improved capsule network in deep learning to detect the hand and generate a binary image for gesture recognition, comprising the following steps: shooting and collecting gesture images under a complex background; constructing and training a U-shaped residual capsule network to obtain a binarized gesture image; locating a rectangular bounding box around the gesture; and constructing and training an improved matrix capsule network to recognize the gesture image. However, with existing vision-based gesture recognition methods, under complex backgrounds and non-uniform illumination the networks converge slowly and the gesture recognition rate is not high.
Most current technical research on gesture recognition for actual human-computer interaction needs additional equipment support. In addition, due to the variability of gestures, the results of hand detection easily contain rich backgrounds, which interferes with gesture recognition and reduces interactivity. Therefore, how to develop an effective gesture recognition technology that is fast and not greatly affected by external illumination and environment is worth studying. A search of related technologies shows that no gesture recognition technology fully meeting these requirements has been found so far.
Disclosure of Invention
The invention aims to provide a gesture image segmentation and recognition method and device based on deep learning, and aims to solve the problem that the existing method is low in recognition rate of gesture images under complex backgrounds.
The invention is realized by the following steps: a gesture image segmentation and recognition method based on deep learning comprises the following steps:
a. carrying out size resetting operation on the input gesture image to fix the size of the image;
b. inputting the gesture image in the step a into the dense segmentation network, training the dense segmentation network, and obtaining a dense segmentation network model after training;
the dense segmentation network includes an encoder and a decoder; the encoder in turn comprises a deep convolutional neural network module and an improved cavity space pyramid pooling module;
the improved cavity space pyramid pooling module comprises a parallel mode and a cascade mode; in a parallel mode, carrying out feature coding on the input feature graph by using different void ratios to acquire multi-scale information of the gesture; in the cascade mode, each layer except the first layer and the second layer connects the output of the parallel mode in series with the output of the previous layer; then, deconvolution with different void ratios is adopted to be connected with the output of the parallel mode from bottom to top;
c. segmenting the gesture image by adopting a trained dense segmentation network model, and performing binarization processing on a segmentation result;
d. inputting the divided binary gesture images into a gesture recognition network, training the gesture recognition network by using the gesture images with different gesture shapes, and obtaining a gesture recognition network model after training;
e. and classifying the gestures in different shapes by adopting the trained gesture recognition network model to realize the recognition of the gesture images.
In step b, in the parallel mode, hole convolutions with hole rates {2^0, 2^1, 2^2, ..., 2^n} are used, i.e. n+1 hole convolutions in total, to perform multi-scale feature extraction on the feature map.
Taking n = 4, the output of the parallel mode is shown as follows:
o_i = H_{k, d_i}(x),  d_i = 2^i,  i = 0, 1, ..., 4
where x represents the input feature map, d represents the array of hole rates {2^0, 2^1, 2^2, ..., 2^4}, H_{k,d}(x) represents a hole convolution with a convolution kernel size of k and a hole rate of d, and o_i represents the outputs of the 5 parallel branches, which from top to bottom are o_0, o_1, o_2, o_3, o_4.
The output of the cascade mode is given by:
p_1 = H_{3,2}(o_1),  p_i = H_{3, 2^i}(o_i ⊕ p_{i-1}),  i = 2, 3
where p_i represents the output of the cascade mode, and ⊕ represents splicing features of different scales on the channel dimension.
Deconvolutions with different hole rates are adopted and connected with the outputs of the parallel mode from bottom to top; the specific formulas of the deconvolution are:
q_1 = DH_{3,8}(o_4 ⊕ p_3),  q_2 = DH_{3,4}(o_3 ⊕ q_1),  q_3 = DH_{3,2}(o_2 ⊕ q_2),  q_4 = DH_{3,2}(o_1 ⊕ q_3)
y = o_0 ⊕ q_4
In the formulas, q_j represents the output after deconvolution, y represents the output of the improved void space pyramid pooling module, and DH_{3,d} represents a deconvolution with a convolution kernel of 3 and a hole rate of d.
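As a concrete illustration of this parallel/cascade/deconvolution wiring, a minimal PyTorch sketch is given below. The 2048-channel input, the 256-channel branches, the 3 × 3 kernels of the parallel branches and the padding choices are assumptions made for the sketch, not values fixed by the text above.

```python
import torch
import torch.nn as nn

class IASPP(nn.Module):
    """Sketch of the improved hole (atrous) spatial pyramid pooling wiring."""
    def __init__(self, in_ch=2048, ch=256):
        super().__init__()
        rates = [1, 2, 4, 8, 16]                             # d = 2^0 ... 2^4
        # parallel branches: o_i = H_{k, d_i}(x)
        self.parallel = nn.ModuleList(
            [nn.Conv2d(in_ch, ch, 3, padding=r, dilation=r) for r in rates])
        # cascade branches: p_1 = H_{3,2}(o_1), p_2 = H_{3,4}(o_2 (+) p_1), p_3 = H_{3,8}(o_3 (+) p_2)
        self.c1 = nn.Conv2d(ch, ch, 3, padding=2, dilation=2)
        self.c2 = nn.Conv2d(2 * ch, ch, 3, padding=4, dilation=4)
        self.c3 = nn.Conv2d(2 * ch, ch, 3, padding=8, dilation=8)
        # bottom-up deconvolutions (TC1 ... TC4): q_1 ... q_4
        self.t1 = nn.ConvTranspose2d(2 * ch, ch, 3, padding=8, dilation=8)
        self.t2 = nn.ConvTranspose2d(2 * ch, ch, 3, padding=4, dilation=4)
        self.t3 = nn.ConvTranspose2d(2 * ch, ch, 3, padding=2, dilation=2)
        self.t4 = nn.ConvTranspose2d(2 * ch, ch, 3, padding=2, dilation=2)

    def forward(self, x):
        o = [branch(x) for branch in self.parallel]          # o_0 ... o_4
        p1 = self.c1(o[1])
        p2 = self.c2(torch.cat([o[2], p1], dim=1))
        p3 = self.c3(torch.cat([o[3], p2], dim=1))
        q1 = self.t1(torch.cat([o[4], p3], dim=1))
        q2 = self.t2(torch.cat([o[3], q1], dim=1))
        q3 = self.t3(torch.cat([o[2], q2], dim=1))
        q4 = self.t4(torch.cat([o[1], q3], dim=1))
        return torch.cat([o[0], q4], dim=1)                  # y = o_0 spliced with q_4
```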
The deep convolutional neural network module includes a 7 × 7 convolutional kernel, a 3 × 3 convolutional kernel, and 4 residual groups. The 4 residual groups are as follows: the first residual group has 3 residual blocks, each residual block has 3 layers, namely a convolution kernel of 1 × 1 × 64, a convolution kernel of 3 × 3 × 64 and a convolution kernel of 1 × 1 × 256, and the residual blocks have 9 layers in total, the void ratio d is 1, and the step length s is 2; the second residual group has 4 residual blocks, each residual block has 3 layers, which are respectively a convolution kernel of 1 × 1 × 128, a convolution kernel of 3 × 3 × 128 and a convolution kernel of 1 × 1 × 512, and the residual blocks have 12 layers, the void ratio d is 1, and the step length s is 1; the third residual group has 6 residual blocks, each residual block has 3 layers, which are respectively a convolution kernel of 1 × 1 × 256, a convolution kernel of 3 × 3 × 256, and a convolution kernel of 1 × 1 × 1024, and has 18 layers, the void ratio d is 2, and the step length s is 1; the fourth residual group has 3 residual blocks, each residual block has 3 layers, which are respectively a convolution kernel of 1 × 1 × 512, a convolution kernel of 3 × 3 × 512, and a convolution kernel of 1 × 1 × 2048, and has 9 layers, the void rate d is 4, and the step length s is 1.
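For illustration, one residual block of the kind listed above could be sketched as follows. The batch normalization and ReLU placement and the 1 × 1 projection shortcut are assumptions borrowed from common ResNet practice, not details spelled out in the text.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Sketch of a 3-layer residual block: 1x1 -> 3x3 (hole rate d, step length s) -> 1x1."""
    def __init__(self, in_ch, mid_ch, out_ch, d=1, s=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=s, padding=d, dilation=d, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.skip = (nn.Identity() if in_ch == out_ch and s == 1
                     else nn.Conv2d(in_ch, out_ch, 1, stride=s, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))

# E.g. the third residual group: 6 blocks of (1x1x256, 3x3x256, 1x1x1024), d = 2, s = 1;
# the first block widens the channels from 512 to 1024.
group3 = nn.Sequential(*[Bottleneck(512 if i == 0 else 1024, 256, 1024, d=2, s=1)
                         for i in range(6)])
```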
In step b, the specific decoding process of the decoder is as follows: performing characteristic splicing on the output result of the improved cavity space pyramid pooling module and the characteristic of the fourth residual error group subjected to 1 × 1 convolution operation on a channel, and performing first-time double upsampling on the spliced result; then, splicing the result of the first double upsampling and the characteristic of the first residual group after 1 × 1 convolution operation on a channel, and continuing to perform the second double upsampling; then, performing feature splicing on the result of the second-time double up-sampling and the features subjected to 7 × 7 convolution and 1 × 1 convolution operations on a channel, and continuing to perform third-time double up-sampling; finally, the results of the gesture segmentation are refined using a 3 × 3 convolution kernel, and a 1 × 1 convolution kernel in sequence.
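A minimal sketch of this decoding path is given below, assuming the encoder features arrive at 1/8, 1/4 and 1/2 of the input resolution; the 48-channel 1 × 1 reductions, the bilinear interpolation and the single-channel output are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Sketch of the decoder: three 2x upsamplings, each preceded by channel splicing
    with a 1x1-reduced encoder feature, then a 3x3 and a 1x1 refinement."""
    def __init__(self, y_ch=512, res4_ch=2048, res1_ch=256, stem_ch=64, out_ch=1):
        super().__init__()
        self.r4 = nn.Conv2d(res4_ch, 48, 1)   # 1x1 reduction of the fourth residual group
        self.r1 = nn.Conv2d(res1_ch, 48, 1)   # 1x1 reduction of the first residual group
        self.r0 = nn.Conv2d(stem_ch, 48, 1)   # 1x1 reduction of the 7x7-convolution feature
        self.head = nn.Sequential(
            nn.Conv2d(y_ch + 3 * 48, 256, 3, padding=1),
            nn.Conv2d(256, out_ch, 1))

    @staticmethod
    def up2(x):
        return F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, y, res4, res1, stem):
        x = self.up2(torch.cat([y, self.r4(res4)], dim=1))   # splice at 1/8, upsample to 1/4
        x = self.up2(torch.cat([x, self.r1(res1)], dim=1))   # splice at 1/4, upsample to 1/2
        x = self.up2(torch.cat([x, self.r0(stem)], dim=1))   # splice at 1/2, upsample to 1/1
        return self.head(x)                                  # refine the segmentation result
```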
In step d, the gesture recognition network comprises three convolution layers with an activation function ReLu for feature extraction, maximum pooling (MaxPooling), a full connection layer and a Softmax layer (a code sketch of this classifier is given after the training steps below);
the training gesture recognition network comprises the following steps:
performing a first set of convolution operations: performing 19 × 19 × 64 convolution once, then performing ReLu activation, and finally using a maximum pooling operation as a downsampling operation;
performing a second set of convolution operations: performing 17 × 17 × 128 convolution once, then performing ReLu activation, and finally using a maximum pooling operation as a downsampling operation;
a third set of convolution operations is performed: performing 15 × 15 × 128 convolution once, then performing ReLu activation, and finally using a maximum pooling operation as a downsampling operation;
and sequentially inputting the results of the third group of convolution operations into a Softmax layer and a full-connection layer, and outputting the final gesture classification result.
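A minimal sketch of this classifier, assuming a 1-channel 512 × 512 binarized input, 10 gesture classes, unit strides and no padding (none of which are fixed by the steps above):

```python
import torch.nn as nn

class GestureClassifier(nn.Module):
    """Sketch of the recognition network: three convolution blocks (19x19x64,
    17x17x128, 15x15x128), each followed by ReLu and max pooling, then a fully
    connected layer and a Softmax layer."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 19), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 17), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 15), nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(n_classes),   # full connection layer; input size inferred on first call
            nn.Softmax(dim=1))          # class probabilities over the gesture labels

    def forward(self, x):               # x: (N, 1, 512, 512) binarized gesture image
        return self.classifier(self.features(x))
```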
The gesture image segmentation and recognition device based on deep learning corresponding to the method comprises the following modules:
the gesture image acquisition module is connected with the preprocessing module and used for acquiring a color gesture image;
the preprocessing module is respectively connected with the gesture image acquisition module and the dense segmentation network training module and is used for cutting the color gesture image and providing an input image with a fixed size for the dense segmentation network training module;
the dense segmentation network training module is respectively connected with the preprocessing module and the binarization image acquisition module, trains a gesture segmentation model by using an input image output by the preprocessing module to obtain an optimized segmentation model and outputs a gesture segmentation result;
the binarization image acquisition module is respectively connected with the dense segmentation network training module and the gesture recognition model training module and is used for acquiring a binarization gesture image; and
and the gesture recognition model training module is connected with the binarization image obtaining module, trains a gesture recognition model by using the binarization gesture image to obtain an optimized gesture recognition model and outputs a gesture classification result.
Due to the variability of gestures, the results of hand detection are prone to generate rich backgrounds, thereby interfering with gesture recognition and reducing interactivity. Aiming at the problem, the invention provides a gesture image segmentation and recognition method based on deep learning, which is based on a dense segmentation network and an improved gesture recognition network, really realizes the fusion of local features and global features of a gesture, and enriches feature expression. The method has stronger robustness and can obtain higher recognition rate under the conditions of similar skin color, hand and face shielding, non-uniform illumination and the like.
The gesture image segmentation and recognition method based on deep learning provided by the invention has the advantages that:
for the problem that the gestures in the complex background have various scales, different void ratios are designed in parallel and cascade modes in the IASPP, and the void convolutions with different void ratios are stacked together, so that the gesture multi-scale information on different receptive fields is obtained, and the feature expression is enriched. Therefore, the IASPP combines global and high-level semantic features with local and detailed semantic features to filter redundant information in the background, and contributes to improving the segmentation precision.
The invention obtains more accurate gesture segmentation result by utilizing the encoder for acquiring high-level semantic information and the decoder for amplifying the image by utilizing the information in the encoding stage to recover the detail information of the image.
The overall performance of the invention is better than current mainstream algorithms, making it more suitable for human-computer interaction products. The improved gesture recognition network has the following advantage: compared with the original network method it effectively improves the gesture recognition rate, and compared with the traditional CNN method it achieves a better recognition effect on gesture images under different illumination.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a training diagram of a densely partitioned network in accordance with the present invention.
FIG. 3 is a training diagram of the gesture recognition network of the present invention.
Fig. 4 is a diagram of the IASPP framework in the present invention.
FIG. 5 is a diagram of the dense segmentation network framework in accordance with the present invention.
FIG. 6 is a diagram of the overall network framework of the present invention.
Fig. 7 and 8 are graphs comparing the segmentation effect of the present invention with other algorithms.
Detailed Description
The invention provides a gesture image segmentation and recognition method based on deep learning, which generally comprises the following 3 steps:
step 1: the gesture images in all complex backgrounds are resized (resize operation) so that the image sizes are fixed.
Step 2: and inputting the gesture image subjected to resize operation under the complex background into the dense segmentation network, so as to train the dense segmentation network, and outputting the trained dense segmentation network model. And finally, outputting a binarization gesture image by using the trained dense segmentation network model.
And step 3: and (3) inputting the gesture images divided in the step (2) into a gesture recognition network, training the gesture recognition network by using the gesture images with different gesture shapes, and outputting a trained gesture recognition network model. And classifying each different gesture by using the network model to realize the recognition of the gesture image.
Due to the variability of gestures, the results of hand detection easily contain rich backgrounds, which interferes with gesture recognition and reduces interactivity. Aiming at this problem, the invention provides a dense segmentation and gesture recognition strategy. Gesture segmentation can maximally remove the redundant information brought by the background and reduce the interference to the gesture recognition algorithm, thereby improving the accuracy of gesture recognition. In order to improve the accuracy of gesture segmentation, the invention provides an Improved Atrous Spatial Pyramid Pooling (IASPP) method, which combines a cascade mode and a parallel mode to extract features, thereby obtaining richer hand feature information.
And filtering redundant backgrounds by using the proposed dense segmentation network under a complex background, segmenting the gesture image, inputting the positioned gesture area into a gesture recognition network, and recognizing by adopting an improved algorithm. The invention improves the segmentation performance of the gesture image, thereby improving the recognition rate of the gesture image.
The dense segmentation network in step 2 mainly comprises three parts, in order: a Deep Convolutional Neural Network (DCNN) module, an improved hole space pyramid pooling (IASPP) module, and a decoder.
With reference to fig. 5, the input of the dense partition network in step 2 is a 512 × 512 × 3 RGB image, and the encoding portion is composed of DCNN and IASPP modules. Wherein, DCNN is a feature-extracted backbone network composed of 1 convolution kernel 7 × 7 (referred to as Conv in the figure), 1 convolution kernel 3 × 3 and 4 residual groups. As shown in table 1 below, the first residual group has 3 residual blocks, each of which has 3 layers of 1 × 1 × 64 convolution kernel, 3 × 3 × 64 convolution kernel, and 1 × 1 × 256 convolution kernel, respectively, and has 9 layers, the void rate d is 1, and the step length s is 2; the second residual group has 4 residual blocks, each residual block has 3 layers of convolution kernels of 1 × 1 × 128, 3 × 3 × 128, 1 × 1 × 512, and 12 layers, the void rate d is 1, and the step length s is 1; the third residual group has 6 residual blocks, each residual block has 3 layers of convolution kernels of 1 × 1 × 256, 3 × 3 × 256 and 1 × 1 × 1024, and has 18 layers, the void rate d is 2, and the step length s is 1; the fourth residual group has 3 residual blocks, each residual block has 3 layers of 1 × 1 × 512 convolution kernels, 3 × 3 × 512 convolution kernels, 1 × 1 × 2048 convolution kernels, and 9 layers in total, the void rate d is 4, and the step length s is 1.
TABLE 1 Deep Convolutional Neural Network (DCNN) parameter settings
Residual group | Residual blocks | Convolution kernels per block | Total layers | Hole rate d | Step length s
1 | 3 | 1×1×64, 3×3×64, 1×1×256 | 9 | 1 | 2
2 | 4 | 1×1×128, 3×3×128, 1×1×512 | 12 | 1 | 1
3 | 6 | 1×1×256, 3×3×256, 1×1×1024 | 18 | 2 | 1
4 | 3 | 1×1×512, 3×3×512, 1×1×2048 | 9 | 4 | 1
It is noted that in order for the Decoder (Decoder) to fuse more local detail information while reducing the amount of computation, a 1 × 1 convolution kernel is added after the output features of the 7 × 7 convolution kernel, the first residual group, and the fourth residual group in the DCNN. Finally, after the feature extraction of the RGB image by DCNN, the feature map finally output by the fourth residual group becomes 1/8 of the original image. The feature map of the fourth residual group output is used as the input of the IASPP module.
As shown in fig. 4, the design framework of the IASPP in the dense segmentation network combines both parallel and cascade modes. In the parallel mode, the invention uses hole convolutions with hole rates {2^0, 2^1, 2^2, ..., 2^n} to perform feature coding on the input feature map and acquire multi-scale information of the gesture. In the embodiment of the invention, n is set to 4, that is, a total of 5 hole convolutions perform multi-scale feature extraction on the feature map, generating richer feature expressions.
Taking n = 4, the output of the parallel mode is defined as formula (1):
o_i = H_{k, d_i}(x),  d_i = 2^i,  i = 0, 1, ..., 4    (1)
where x represents the input feature map, d represents the array of hole rates {2^0, 2^1, 2^2, ..., 2^4}, H_{k,d}(x) represents the hole convolution with a convolution kernel size of k and a hole rate of d, and o_i represents the outputs of the 5 parallel branches, which, as can be seen from fig. 4, are from top to bottom o_0, o_1, o_2, o_3, o_4.
In the cascade mode, each layer except the first and the second connects the output of the parallel mode in series with the output of the previous layer, so that gesture information is extracted in a denser manner and a better feature expression is generated. Specifically, the parallel-mode output o_1 is first further convolved with a hole convolution with k = 3 and d = 2, and the output is p_1. Then a hole convolution with k = 3 and d = 4 continues to extract features from o_2 and p_1 spliced on the channel dimension, and the output is p_2. Finally, a hole convolution with k = 3 and d = 8 continues to extract features from o_3 and p_2 spliced on the channel dimension, and the output is p_3.
The output of the cascade mode in the IASPP is defined as formula (2), in which ⊕ denotes splicing features of different scales on the channel dimension (Concat in fig. 4) and p_i represents the output of the cascade mode:
p_1 = H_{3,2}(o_1),  p_i = H_{3, 2^i}(o_i ⊕ p_{i-1}),  i = 2, 3    (2)
Since image segmentation is extremely sensitive to the spatial position information of pixels, in order to fuse more detailed information while restoring the image size, the invention designs deconvolutions (denoted by TC in fig. 4) with different hole rates, connected from bottom to top with the outputs of the parallel mode, to recover local features and make the image edges smoother. First, the deconvolution with k = 3 and d = 8 (i.e. TC1) restores the image size of the feature map obtained by splicing o_4 and p_3 on the channel dimension, and the output is q_1. Then the deconvolution with k = 3 and d = 4 (i.e. TC2) restores the image size of the feature map obtained by splicing o_3 and q_1 on the channel dimension, and the output is q_2. Next, the deconvolution with k = 3 and d = 2 (i.e. TC3) restores the image size of the feature map obtained by splicing o_2 and q_2 on the channel dimension, and the output is q_3. Then the deconvolution with k = 3 and d = 2 (i.e. TC4) restores the image size of the feature map obtained by splicing o_1 and q_3 on the channel dimension, and the output is q_4. The final output y of the IASPP is the feature obtained by splicing o_0 and q_4 on the channel dimension.
The above paragraph is formulated as follows:
q_1 = DH_{3,8}(o_4 ⊕ p_3),  q_2 = DH_{3,4}(o_3 ⊕ q_1),  q_3 = DH_{3,2}(o_2 ⊕ q_2),  q_4 = DH_{3,2}(o_1 ⊕ q_3)    (3)
y = o_0 ⊕ q_4    (4)
In the formulas, q_j represents the output after deconvolution, y represents the final output of the IASPP, and DH_{3,d} represents a deconvolution with a convolution kernel of 3 and a hole rate of d.
The output feature map of the fourth residual group in the DCNN is used as the input of the IASPP; the 2048-dimensional features output by the DCNN are feature-coded using hole convolutions with different hole rates, mining multi-scale context information while enriching the feature expression.
As shown in fig. 5, in order to recover more detailed features during decoding (Decoder), three scale features of a 7 × 7 convolution kernel, a first residual group, and a fourth residual group are selected in the DCNN. And three upsampling operations are used to resize the feature map, connecting it with the feature map from the encoding portion after each upsampling. The decoding process specifically comprises: firstly, performing feature splicing on the output result y of the IASPP and the feature of the fourth residual group after 1 × 1 convolution operation on a channel, and performing first double upsampling on the spliced result (denoted by Up in the figure); then, splicing the result of the first double upsampling and the characteristic of the first residual group after 1 × 1 convolution operation on a channel, and continuing to perform the second double upsampling; and then, performing feature splicing on the result of the second-time double up-sampling and the features after 7 × 7 convolution and 1 × 1 convolution operation on the channel, and continuing to perform third-time double up-sampling. Finally, a 3 × 3 convolution kernel, and a 1 × 1 convolution kernel are used in sequence to refine the result of the gesture segmentation.
In step 3, the information output by the dense segmentation network is input to the gesture recognition network model, where classification continues.
As shown in fig. 6, in the gesture recognition network model, a gesture classification network is formed by three convolutional layers, an activation function ReLu for feature extraction, a maximum pooling MaxPooling, a Softmax layer, and a full link layer. In the classification process, the output of the dense segmentation network model is randomly divided into a training set and a testing set, and then the training set and the testing set are used as input to be input into a gesture classification layer. In the gesture classification method, the operations performed sequentially include a first set of convolution operations (the first set of convolutions are performed once for 19 × 19 × 64 convolution, followed by ReLu activation, and finally maximum pooling operation is used as a down-sampling operation); a second set of convolution operations (the second set of convolutions are performed once with a 17 x 128 convolution, followed by ReLu activation, and finally with a maximum pooling operation as a downsampling operation); a third set of convolution operations (the third set of convolutions performs a 15 x 128 convolution, followed by ReLu activation, and finally a maximum pooling operation as a downsampling operation); and finally, sequentially inputting the results of the third group of convolution operations to a Softmax layer, and outputting the final gesture classification result by the full-connection layer.
With reference to fig. 2 and fig. 3, a gesture image segmentation and recognition apparatus based on deep learning corresponding to the above method includes the following modules:
and the gesture image acquisition module is connected with the first preprocessing module and used for acquiring the color gesture image.
And the first preprocessing module is respectively connected with the gesture image acquisition module and the dense segmentation network training module and is used for carrying out cutting operation on the color gesture image and providing an input image with a fixed size for the dense segmentation network training module.
And the dense segmentation network training module is respectively connected with the first preprocessing module and the gesture image segmentation module, and trains a gesture segmentation model by using the input image output by the first preprocessing module so as to obtain an optimized segmentation model.
And the gesture image segmentation module is respectively connected with the dense segmentation network training module and the image segmentation result output module and is used for segmenting the gesture through the optimized gesture segmentation model.
And the image segmentation result output module is connected with the gesture image segmentation module and is used for outputting the segmented gesture image.
The data processed by the first preprocessing module are divided into training data and testing data, the dense segmentation network training module trains a dense segmentation network model by using the training data, cross entropy loss calculation is carried out on segmentation images and real gesture segmentation labels to obtain integral loss of the segmentation network, and the loss is continuously reduced by using a back propagation idea so as to fit the segmentation model, and a stable segmentation model is obtained. And performing gesture image segmentation on the test data or other non-test data by adopting the optimized dense segmentation network model, and finally outputting a gesture image segmentation result by an image segmentation result output module.
The output in fig. 2 serves as the input in fig. 3, namely: and the gesture image segmentation result output by the image segmentation result output module enters a binarization image acquisition module, and a binarization gesture image is acquired by the binarization image acquisition module. Specifically, the binarization image acquisition is to feed the segmentation result into a sigmoid function to adjust the segmentation result to be in a range of 0-1, and a threshold-based method is used for obtaining a final binarization image. If the value is more than 0.5, the value is 1, otherwise, the value is 0.
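A minimal sketch of this binarization step, assuming the segmentation network emits a single-channel score map:

```python
import torch

def binarize(logits, threshold=0.5):
    """Send the segmentation result through a sigmoid so it lies in [0, 1],
    then threshold it: values greater than 0.5 become 1, the rest become 0."""
    probs = torch.sigmoid(logits)
    return (probs > threshold).float()
```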
The binarization image obtaining module is also connected with a second preprocessing module, and the second preprocessing module is used for cutting the binarization gesture image and providing an input image with a fixed size for the gesture recognition model training module.
And the gesture recognition model training module is connected with the second preprocessing module and is used for recognizing the fixed-size binarized gesture image. The process is as follows: in the gesture recognition model training module, a gesture recognition model is first constructed, consisting of three convolution layers (the first layer has 64 convolution kernels of size 19 × 19, the second layer has 128 convolution kernels of size 17 × 17, and the third layer has 128 convolution kernels of size 15 × 15 with a step length of 2), ReLu and MaxPooling layers for feature extraction, a full connection layer and a Softmax layer. Parameters are initialized and gesture model recognition is performed; cross entropy loss is calculated between the recognition result and the real label, and if the loss reaches the expectation, the gesture recognition model is obtained; otherwise the loss is continuously reduced using the back propagation idea, the parameters are updated, and gesture model recognition continues.
In the gesture recognition model training module, the output of the segmentation model is also randomly divided into a training set and a test set, and then input into the gesture recognition model as input.
In detail, as shown in fig. 1, the gesture image segmentation and recognition method based on deep learning provided by the present invention includes the following steps:
step 1: and inputting a color gesture image. The color gesture image input in the embodiment of the invention is selected from the common vision data set OUTHANDS and HGR data sets. The input color gesture image is used for making a foundation for subsequent training and verification of the network model.
Step 2: the input image is pre-processed so that the image reaches a fixed dimension.
In this step, the number of the preprocessed images in the outhand data set is 3000, wherein 2000 images are used as a training set and 1000 images are used as a verification set. The number of images after preprocessing of the HGR data set is 899, wherein 630 images are used as a training set, and 269 images are used as a verification set.
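For illustration, the fixed-size preprocessing could be written as follows, assuming torchvision is used and 512 × 512 is the fixed size (the file name is purely illustrative):

```python
from PIL import Image
from torchvision import transforms

# Resize every color gesture image to the fixed 512x512 input size of the
# dense segmentation network and convert it to a tensor.
preprocess = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
])
image = preprocess(Image.open("gesture_0001.png").convert("RGB")).unsqueeze(0)  # (1, 3, 512, 512)
```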
And step 3: and constructing a dense split network.
The dense segmentation network set in this step is specifically designed for gestures in a complex background. The structure of the neural network of this step is shown in fig. 5, and the structure of the IASPP module in this step is shown in fig. 4. The method comprises the following specific steps:
the training data in step 2 (here using only the pre-processed training set) is used as the input image for step 3. Firstly, two convolution operations are carried out on an input image, the sizes of convolution kernels used are 7 x 7 and 3 x 3 in sequence, and then the convolution kernels are sent into a first residual error group, a second residual error group, a third residual error group and a fourth residual error group in sequence. Finally, after the feature extraction of the RGB image by DCNN, the feature map finally output by the fourth residual group becomes 1/8 of the original image.
The output feature map of the fourth residual group is taken as the input of the IASPP module. After the feature map enters the IASPP module, convolution operations are first performed on the feature layers using convolution kernels with five different hole rates in the parallel mode; because convolution kernels with different hole rates are used, feature layers with different receptive field sizes are obtained, and multi-scale context information is mined while the feature expression is enriched. As shown in fig. 4, the outputs from top to bottom are o_0, o_1, o_2, o_3, o_4. In the cascade mode, the parallel-mode output o_1 is first further convolved with a hole convolution with k = 3 and d = 2, and the output is p_1. Then a hole convolution with k = 3 and d = 4 continues to extract features from o_2 and p_1 spliced on the channel dimension, and the output is p_2. Finally, a hole convolution with k = 3 and d = 8 continues to extract features from o_3 and p_2 spliced on the channel dimension, and the output is p_3.
The invention also designs deconvolutions with different hole rates, connected from bottom to top with the outputs of the parallel mode, to recover local features and make the image edges smoother. First, the deconvolution with k = 3 and d = 8 restores the image size of the feature map obtained by splicing o_4 and p_3 on the channel dimension, and the output is q_1. Then the deconvolution with k = 3 and d = 4 restores the image size of the feature map obtained by splicing o_3 and q_1 on the channel dimension, and the output is q_2. Next, the deconvolution with k = 3 and d = 2 restores the image size of the feature map obtained by splicing o_2 and q_2 on the channel dimension, and the output is q_3. Then the deconvolution with k = 3 and d = 2 restores the image size of the feature map obtained by splicing o_1 and q_3 on the channel dimension, and the output is q_4. The final output y of the IASPP module is the feature obtained by splicing o_0 and q_4 on the channel dimension.
For the decoder, three scale features of the 7 × 7 convolution kernel, the first residual group, and the fourth residual group are selected in order to recover more detailed features during decoding. And three upsampling operations are used for adjusting the size of the feature map, wherein the upsampling operation is to expand each layer of features in the feature layer to a corresponding dimension in a linear interpolation mode, and the layer number is unchanged. Finally, the results of the gesture segmentation are refined using 3 × 3 and 1 × 1 convolution kernels.
Step 4: fit and train the gesture segmentation model using gesture data to derive a stable segmentation model.
And sending the gesture image as input into a dense segmentation network to obtain a segmentation result, and performing cross entropy loss calculation with a real gesture segmentation label to obtain the overall loss of the dense segmentation network. And the loss is continuously reduced by using a back propagation idea so as to fit the segmentation model and obtain a stable dense segmentation model. Through the steps, a gesture segmentation model based on the convolutional neural network is finally obtained through training, and the gesture image can be segmented according to the segmentation model.
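The fitting loop described above could look roughly as follows; the binary form of the cross entropy loss, the Adam optimizer and the hyperparameters are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

def train_segmentation(model, loader, epochs=50, lr=1e-3, device="cuda"):
    """Minimal fitting loop: cross entropy between the predicted and the real
    gesture segmentation labels, reduced by back propagation."""
    model = model.to(device)
    criterion = nn.BCEWithLogitsLoss()          # binary cross entropy on hand vs. background
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, masks in loader:            # masks: (N, 1, H, W) with values in {0, 1}
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            logits = model(images)              # (N, 1, H, W) segmentation scores
            loss = criterion(logits, masks.float())
            loss.backward()                     # back propagation continuously reduces the loss
            optimizer.step()
    return model
```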
Step 5: the segmentation result obtained in step 4 is subjected to binarization processing.
Namely, the segmentation result is sent to a sigmoid function to be adjusted to be in the range of 0-1, and a final binary image is obtained by using a threshold-based method. If the value is more than 0.5, the value is 1, otherwise, the value is 0.
Step 6: constructing gesture recognition models
The model comprises three convolution layers, ReLu and MaxPooling for feature extraction, a full connection layer and a Softmax layer, together forming the gesture classification layer.
In the classification algorithm, firstly, the binarized image in step 5 is subjected to random cropping operation, the cropping proportion is 0.75-1 times of the original image (512 × 512), and then the image size is reset to 512 × 512 pixels and is input to the gesture classification layer as an input image.
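A sketch of this cropping step, assuming torchvision and reading the 0.75-1 proportion as a fraction of the image area:

```python
from torchvision import transforms

# Random crop of 0.75-1.0x of the 512x512 binarized gesture image, resized
# back to 512x512 before it enters the gesture classification layer.
augment = transforms.Compose([
    transforms.RandomResizedCrop(512, scale=(0.75, 1.0), ratio=(1.0, 1.0)),
    transforms.ToTensor(),
])
```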
The operations performed sequentially include a first set of convolution operations (the first set of convolutions are performed once for a 19 x 64 convolution, followed by ReLu activation, and finally a maximum pooling operation as a downsampling operation); a second set of convolution operations (the second set of convolutions are performed once with a 17 x 128 convolution, followed by ReLu activation, and finally with a maximum pooling operation as a downsampling operation); a third set of convolution operations (the third set of convolutions performs a 15 x 128 convolution, followed by ReLu activation, and finally a maximum pooling operation as a downsampling operation); and finally, sequentially inputting the results of the third group of convolution operations to a Softmax layer, and outputting the final gesture classification result by the full-connection layer. And training the gesture recognition model by using the classified cross entropy loss, adjusting network model parameters, and storing the model parameters after the training is finished.
Step 7: image classification.
After the training of the model is completed, for a test image, a gesture segmentation image is obtained through a dense segmentation network, and then the image subjected to binarization is sent to a gesture recognition model for final classification.
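Putting the trained pieces together, the inference pass described above could be sketched as follows; the model and variable names are illustrative only.

```python
import torch

@torch.no_grad()
def recognize(image, seg_model, cls_model, device="cuda"):
    """Test-time pipeline sketch: dense segmentation -> binarization -> classification.
    `image` is assumed to be a (1, 3, 512, 512) tensor."""
    seg_model.eval(); cls_model.eval()
    logits = seg_model(image.to(device))           # gesture segmentation scores
    mask = (torch.sigmoid(logits) > 0.5).float()   # binarized gesture image
    scores = cls_model(mask)                       # gesture class probabilities
    return scores.argmax(dim=1)                    # predicted gesture label
```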
In order to further prove the effectiveness of the combined dense segmentation and gesture classification model provided by the invention, gesture segmentation experiments are carried out on the public OUTHANDS and HGR data sets, and the method is compared with other deep-learning-based recognition algorithms on the NUS-II data set.
As shown in Table 2, the recognition precision of the dense segmentation and the gesture classification provided by the invention can reach 98.61%, which is improved by 3.99% compared with the gesture recognition algorithm, and the running time of the algorithm is not greatly increased while the algorithm is superior to other comparison algorithms. Therefore, the segmentation algorithm provided by the invention can maximally filter the interference information in the background and improve the accuracy of gesture recognition.
TABLE 2 recognition rates on OUTHANDS datasets
(Table data not reproduced.)
From table 3, it can be seen that the segmentation algorithm based on the dense segmentation network has great advantages in gesture segmentation, wherein the accuracy (Precision, Pr), Recall (Recall, Re), balance F score (F-score), and area under ROC curve (AUC) reach 0.9948, 0.9929, 0.9939, and 0.9982, respectively. These evaluation indexes are all higher than the comparison algorithm, which shows that the method of the present invention is superior to the comparison algorithm in all aspects.
TABLE 3 Comparison of the algorithm herein with machine learning methods on the HGR data set
(Table data not reproduced.)
In order to further prove that the dense segmentation and gesture recognition algorithm provided by the invention can improve the gesture recognition rate, the NUS-II data set is compared with other algorithms based on deep learning. The result is shown in table 4, and it can be known from table 4 that the gesture recognition rate of the method of the present invention can reach 98.63%, which is improved by 0.33% compared with the suboptimal algorithm. Therefore, the method and the device can enable the segmentation of the gesture and the background to be more accurate, and can further improve the gesture recognition rate.
TABLE 4 recognition Rate on NUS-II data set
(Table data not reproduced.)
FIGS. 7 and 8 show graphs comparing the results of the method of the present invention in segmenting and recognizing gestures with other methods. As can be seen from the figure, the method (corresponding to IASPP-ResNet) of the invention is closer to the real label (GT) than other methods, and the method of the invention is better than other methods.
The references referred to in the present description are as follows:
[1] WEI Bao-guo, XU Yong, LIU Jin-wei, ZHOU Jia-ming. Adaptive gesture segmentation based on SSD object detection[J]. Journal of Signal Processing, 2020, 36(07): 1038-1047. (in Chinese)
[2]Adithya V,Rajesh R.A deep convolutional neural network approach for static hand gesture recognition[J].Procedia Computer Science,2020,171:2353-2361.
[3]Zhang Q,Yang M,Kpalma K,et al.Segmentation of hand posture against complex backgrounds based on saliency and skin colour detection[J].IAENG International Journal of Computer Science,2018,45(3):435-444.
[4]J.Sun,T.Ji,S.Zhang,J.Yang,G.Ji.Research on the hand gesture recognition based on deep learning[A].2018 12th International Symposium on Antennas,Propagation and EM Theory(ISAPE)[C].Hangzhou,China:IEEE,2018.1-4.
[5] Arenas J O P, Moreno R J, et al. Convolutional neural network with a DAG architecture for control of a robotic arm by means of hand gestures[J]. Contemporary Engineering Sciences, 2018, 11(12): 547-557.
[6]Tan Y S,Lim K M,Tee C,et al.Convolutional neural network with spatial pyramid pooling for hand gesture recognition[J].Neural Computing and Applications,2020:1-13.

Claims (10)

1. A gesture image segmentation and recognition method based on deep learning is characterized by comprising the following steps:
a. carrying out size resetting operation on the input gesture image to fix the size of the image;
b. inputting the gesture image in the step a into the dense segmentation network, training the dense segmentation network, and obtaining a dense segmentation network model after training;
the dense segmentation network includes an encoder and a decoder; the encoder in turn comprises a deep convolutional neural network module and an improved cavity space pyramid pooling module;
the improved cavity space pyramid pooling module comprises a parallel mode and a cascade mode; in a parallel mode, carrying out feature coding on the input feature graph by using different void ratios to acquire multi-scale information of the gesture; in the cascade mode, each layer except the first layer and the second layer connects the output of the parallel mode in series with the output of the previous layer; then, deconvolution with different void ratios is adopted to be connected with the output of the parallel mode from bottom to top;
c. segmenting the gesture image by adopting a trained dense segmentation network model, and performing binarization processing on a segmentation result;
d. inputting the divided binary gesture images into a gesture recognition network, training the gesture recognition network by using the gesture images with different gesture shapes, and obtaining a gesture recognition network model after training;
e. and classifying the gestures in different shapes by adopting the trained gesture recognition network model to realize the recognition of the gesture images.
2. The method as claimed in claim 1, wherein in step b, the hole rates used in the parallel mode are {2^0, 2^1, 2^2, ..., 2^n}, i.e. a total of n+1 hole convolutions are used to perform multi-scale feature extraction on the feature map.
3. The method for segmenting and recognizing the gesture image based on the deep learning as claimed in claim 2, wherein n is 4, and the output of the parallel mode is shown as the following formula:
o_i = H_{k, d_i}(x),  d_i = 2^i,  i = 0, 1, ..., 4
where x represents the input feature map, d represents the array of hole rates {2^0, 2^1, 2^2, ..., 2^4}, H_{k,d}(x) represents a hole convolution with a convolution kernel size of k and a hole rate of d, and o_i represents the outputs of the 5 parallel branches, which from top to bottom are o_0, o_1, o_2, o_3, o_4;
the output of the cascade mode is given by:
p_1 = H_{3,2}(o_1),  p_i = H_{3, 2^i}(o_i ⊕ p_{i-1}),  i = 2, 3
where p_i represents the output of the cascade mode, and ⊕ represents splicing features of different scales on the channel dimension;
deconvolutions with different hole rates are adopted and connected with the outputs of the parallel mode from bottom to top, and the specific formulas of the deconvolution are:
q_1 = DH_{3,8}(o_4 ⊕ p_3),  q_2 = DH_{3,4}(o_3 ⊕ q_1),  q_3 = DH_{3,2}(o_2 ⊕ q_2),  q_4 = DH_{3,2}(o_1 ⊕ q_3)
y = o_0 ⊕ q_4
in the formulas, q_j represents the output after deconvolution, y represents the output of the improved cavity space pyramid pooling module, and DH_{3,d} represents a deconvolution with a convolution kernel of 3 and a hole rate of d.
4. The method as claimed in claim 1, wherein in step b, the deep convolutional neural network module comprises a 7 × 7 convolutional kernel, a 3 × 3 convolutional kernel and 4 residual error groups.
5. The method as claimed in claim 4, wherein the 4 residual error groups are respectively as follows: the first residual group has 3 residual blocks, each residual block has 3 layers, namely a convolution kernel of 1 × 1 × 64, a convolution kernel of 3 × 3 × 64 and a convolution kernel of 1 × 1 × 256, and the residual blocks have 9 layers in total, the void ratio d is 1, and the step length s is 2; the second residual group has 4 residual blocks, each residual block has 3 layers, which are respectively a convolution kernel of 1 × 1 × 128, a convolution kernel of 3 × 3 × 128 and a convolution kernel of 1 × 1 × 512, and the residual blocks have 12 layers, the void ratio d is 1, and the step length s is 1; the third residual group has 6 residual blocks, each residual block has 3 layers, which are respectively a convolution kernel of 1 × 1 × 256, a convolution kernel of 3 × 3 × 256, and a convolution kernel of 1 × 1 × 1024, and has 18 layers, the void ratio d is 2, and the step length s is 1; the fourth residual group has 3 residual blocks, each residual block has 3 layers, which are respectively a convolution kernel of 1 × 1 × 512, a convolution kernel of 3 × 3 × 512, and a convolution kernel of 1 × 1 × 2048, and has 9 layers, the void rate d is 4, and the step length s is 1.
6. The method as claimed in claim 5, wherein in step b, the decoder decodes the gesture image according to the following steps: performing characteristic splicing on the output result of the improved cavity space pyramid pooling module and the characteristic of the fourth residual error group subjected to 1 × 1 convolution operation on a channel, and performing first-time double upsampling on the spliced result; then, splicing the result of the first double upsampling and the characteristic of the first residual group after 1 × 1 convolution operation on a channel, and continuing to perform the second double upsampling; then, performing feature splicing on the result of the second-time double up-sampling and the features subjected to 7 × 7 convolution and 1 × 1 convolution operations on a channel, and continuing to perform third-time double up-sampling; finally, the results of the gesture segmentation are refined using a 3 × 3 convolution kernel, and a 1 × 1 convolution kernel in sequence.
7. The method for segmenting and recognizing the gesture image based on the deep learning of claim 1, wherein in the step d, the gesture recognition network comprises three convolution layers, an activation function ReLu for feature extraction, a maximum value pooling Max Pooling, a full connection layer and a Softmax layer;
the training gesture recognition network comprises the following steps:
performing a first set of convolution operations: performing 19 × 19 × 64 convolution once, then performing ReLu activation, and finally using a maximum pooling operation as a downsampling operation;
performing a second set of convolution operations: performing 17 × 17 × 128 convolution once, then performing ReLu activation, and finally using a maximum pooling operation as a downsampling operation;
a third set of convolution operations is performed: performing 15 × 15 × 128 convolution once, then performing ReLu activation, and finally using a maximum pooling operation as a downsampling operation;
and sequentially passing the result of the third group of convolution operations through the fully connected layer and the Softmax layer, and outputting the final gesture classification result.
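The following is a minimal PyTorch-style sketch of this recognition network. The single-channel input (a binarized gesture image), the absence of padding, the 2 × 2 pooling windows and the number of gesture classes are assumptions; only the three kernel sizes, the ReLU/max-pooling order and the fully connected plus Softmax head follow the claim.

import torch.nn as nn

class GestureClassifierSketch(nn.Module):
    """Sketch of the recognition network in claim 7: three conv + ReLU +
    max-pool stages with 19x19x64, 17x17x128 and 15x15x128 kernels, then a
    fully connected layer and Softmax. nn.LazyLinear infers the flattened
    feature size from the (assumed) input resolution at the first forward."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=19), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=17), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, kernel_size=15), nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(num_classes),
                                  nn.Softmax(dim=1))

    def forward(self, x):
        return self.head(self.features(x))

When training with a cross-entropy loss, the Softmax layer would normally be omitted from the network and applied implicitly by the loss; it is kept here only to mirror the claim wording.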
8. A gesture image segmentation and recognition device based on deep learning, characterized by comprising the following modules:
a gesture image acquisition module, connected with the preprocessing module and used for acquiring a color gesture image;
a preprocessing module, connected with the gesture image acquisition module and the dense segmentation network training module respectively, and used for cropping the color gesture image and providing an input image of fixed size to the dense segmentation network training module;
a dense segmentation network training module, connected with the preprocessing module and the binarized image acquisition module respectively, which trains a gesture segmentation model with the input image output by the preprocessing module to obtain an optimized segmentation model and outputs a gesture segmentation result;
a binarized image acquisition module, connected with the dense segmentation network training module and the gesture recognition model training module respectively, and used for acquiring a binarized gesture image; and
a gesture recognition model training module, connected with the binarized image acquisition module, which trains a gesture recognition model with the binarized gesture image to obtain an optimized gesture recognition model and outputs a gesture classification result;
wherein, in the dense segmentation network training module, the dense segmentation network comprises an encoder and a decoder; the encoder comprises a deep convolutional neural network module and an improved atrous spatial pyramid pooling module; the improved atrous spatial pyramid pooling module comprises a parallel mode and a cascade mode; in the parallel mode, the input feature map is encoded with different dilation rates to acquire multi-scale information of the gesture; in the cascade mode, each layer except the first and second layers concatenates the output of the parallel mode with the output of the previous layer; deconvolutions with different dilation rates then connect these layers with the outputs of the parallel mode from bottom to top.
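Expanding on the single cascade step sketched after claim 3, the following PyTorch-style sketch puts the parallel and cascade modes together. The dilation rates (1, 6, 12, 18), the channel widths and the final 1 × 1 fusion convolution are assumptions; only the parallel dilated branches and the cascade concatenation from the third layer onward follow the claim.

import torch
import torch.nn as nn

class ImprovedASPPSketch(nn.Module):
    """Sketch of the improved atrous spatial pyramid pooling module of claim 8.
    Parallel mode: dilated 3x3 convolutions with different rates. Cascade mode:
    from the third layer onward, the parallel output is concatenated with the
    previous cascade output and fused by a dilated 3x3 deconvolution."""

    def __init__(self, in_ch=2048, ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.parallel = nn.ModuleList(
            [nn.Conv2d(in_ch, ch, 3, padding=r, dilation=r) for r in rates])
        self.cascade = nn.ModuleList(
            [nn.ConvTranspose2d(2 * ch, ch, 3, padding=r, dilation=r)
             for r in rates[2:]])
        self.project = nn.Conv2d(len(rates) * ch, ch, 1)  # assumed 1x1 fusion

    def forward(self, x):
        p = [branch(x) for branch in self.parallel]    # parallel-mode outputs
        outs, prev = [p[0], p[1]], p[1]                # first two layers kept as-is
        for i, deconv in enumerate(self.cascade, start=2):
            prev = deconv(torch.cat([p[i], prev], dim=1))  # cascade-mode fusion
            outs.append(prev)
        return self.project(torch.cat(outs, dim=1))    # module output y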
9. The gesture image segmentation and recognition device based on deep learning as claimed in claim 8, wherein the deep convolutional neural network module comprises a 7 × 7 convolution kernel, a 3 × 3 convolution kernel and 4 residual groups; the 4 residual groups are as follows: the first residual group has 3 residual blocks, each with 3 layers (a 1 × 1 × 64 convolution kernel, a 3 × 3 × 64 convolution kernel and a 1 × 1 × 256 convolution kernel), 9 layers in total, with a dilation rate d of 1 and a stride s of 2; the second residual group has 4 residual blocks, each with 3 layers (a 1 × 1 × 128 convolution kernel, a 3 × 3 × 128 convolution kernel and a 1 × 1 × 512 convolution kernel), 12 layers in total, with a dilation rate d of 1 and a stride s of 1; the third residual group has 6 residual blocks, each with 3 layers (a 1 × 1 × 256 convolution kernel, a 3 × 3 × 256 convolution kernel and a 1 × 1 × 1024 convolution kernel), 18 layers in total, with a dilation rate d of 2 and a stride s of 1; the fourth residual group has 3 residual blocks, each with 3 layers (a 1 × 1 × 512 convolution kernel, a 3 × 3 × 512 convolution kernel and a 1 × 1 × 2048 convolution kernel), 9 layers in total, with a dilation rate d of 4 and a stride s of 1.
10. The gesture image segmentation and recognition device based on deep learning as claimed in claim 8, wherein the gesture recognition model training module utilizes a gesture recognition network consisting of three convolution layers with the ReLU activation function for feature extraction, max pooling (Max Pooling), a fully connected layer and a Softmax layer.
CN202111016595.6A 2021-08-31 2021-08-31 Gesture image segmentation and recognition method and device based on deep learning Active CN113780140B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111016595.6A CN113780140B (en) 2021-08-31 2021-08-31 Gesture image segmentation and recognition method and device based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111016595.6A CN113780140B (en) 2021-08-31 2021-08-31 Gesture image segmentation and recognition method and device based on deep learning

Publications (2)

Publication Number Publication Date
CN113780140A true CN113780140A (en) 2021-12-10
CN113780140B CN113780140B (en) 2023-08-04

Family

ID=78840393

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111016595.6A Active CN113780140B (en) 2021-08-31 2021-08-31 Gesture image segmentation and recognition method and device based on deep learning

Country Status (1)

Country Link
CN (1) CN113780140B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241245A (en) * 2021-12-23 2022-03-25 西南大学 Image classification system based on residual error capsule neural network
CN114241245B (en) * 2021-12-23 2024-05-31 西南大学 Image classification system based on residual capsule neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012194659A (en) * 2011-03-15 2012-10-11 Shinsedai Kk Gesture recognition device, gesture recognition method, and computer program
KR20180130869A (en) * 2017-05-30 2018-12-10 주식회사 케이티 CNN For Recognizing Hand Gesture, and Device control system by hand Gesture
CN108334814A (en) * 2018-01-11 2018-07-27 浙江工业大学 A kind of AR system gesture identification methods based on convolutional neural networks combination user's habituation behavioural analysis
CN110728682A (en) * 2019-09-09 2020-01-24 浙江科技学院 Semantic segmentation method based on residual pyramid pooling neural network
CN112950652A (en) * 2021-02-08 2021-06-11 深圳市优必选科技股份有限公司 Robot and hand image segmentation method and device thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Li Qian, et al.: "Temporal Segment Connection Network for Action Recognition", IEEE Access, vol. 8, pages 179118-179127, XP011812974, DOI: 10.1109/ACCESS.2020.3027386 *
Wang Long; Liu Hui; Wang Bin; Li Pengju: "Gesture recognition method combining a skin color model and a convolutional neural network", Computer Engineering and Applications, vol. 53, no. 6, pages 209-214 *

Also Published As

Publication number Publication date
CN113780140B (en) 2023-08-04

Similar Documents

Publication Publication Date Title
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
WO2023185243A1 (en) Expression recognition method based on attention-modulated contextual spatial information
Kowsalya et al. Recognition of Tamil handwritten character using modified neural network with aid of elephant herding optimization
CN109948453B (en) Multi-person attitude estimation method based on convolutional neural network
CN108171133B (en) Dynamic gesture recognition method based on characteristic covariance matrix
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN110909801B (en) Data classification method, system, medium and device based on convolutional neural network
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN113920516B (en) Calligraphy character skeleton matching method and system based on twin neural network
CN113159232A (en) Three-dimensional target classification and segmentation method
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
CN114529982A (en) Lightweight human body posture estimation method and system based on stream attention
CN112966574A (en) Human body three-dimensional key point prediction method and device and electronic equipment
CN110555383A (en) Gesture recognition method based on convolutional neural network and 3D estimation
CN108537109B (en) OpenPose-based monocular camera sign language identification method
CN112906520A (en) Gesture coding-based action recognition method and device
CN115205933A (en) Facial expression recognition method, device, equipment and readable storage medium
Kwolek et al. Recognition of JSL fingerspelling using deep convolutional neural networks
CN114937285A (en) Dynamic gesture recognition method, device, equipment and storage medium
CN110555406B (en) Video moving target identification method based on Haar-like characteristics and CNN matching
CN115171052B (en) Crowded crowd attitude estimation method based on high-resolution context network
CN116797640A (en) Depth and 3D key point estimation method for intelligent companion line inspection device
CN113780140B (en) Gesture image segmentation and recognition method and device based on deep learning
CN113673325B (en) Multi-feature character emotion recognition method
CN112580721B (en) Target key point detection method based on multi-resolution feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant