CN113780140B - Gesture image segmentation and recognition method and device based on deep learning - Google Patents

Gesture image segmentation and recognition method and device based on deep learning

Info

Publication number
CN113780140B
CN113780140B CN202111016595.6A
Authority
CN
China
Prior art keywords
gesture
segmentation
convolution
convolution kernel
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111016595.6A
Other languages
Chinese (zh)
Other versions
CN113780140A (en)
Inventor
崔振超
雷玉
齐静
杨文柱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University
Original Assignee
Hebei University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University filed Critical Hebei University
Priority to CN202111016595.6A priority Critical patent/CN113780140B/en
Publication of CN113780140A publication Critical patent/CN113780140A/en
Application granted granted Critical
Publication of CN113780140B publication Critical patent/CN113780140B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20132Image cropping
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention provides a gesture image segmentation and recognition method and device based on deep learning. The method first preprocesses the gesture image to fix its size. Then, under a complex background, the dense segmentation network densely connects hole convolutions with different hole rates to acquire multi-scale gesture information over different receptive fields, improving the accuracy of the feature expression. In addition, in order to fuse detail and spatial position information from different levels and improve the segmentation performance of the whole network, the dense segmentation network adopts an encoder-decoder structure, removes redundant background information, and achieves accurate segmentation of the gesture image. Finally, the mask image in which only the gesture is retained is input into a gesture recognition network and recognized with an improved algorithm. The invention can improve the segmentation performance of gesture images and thereby improve their recognition rate.

Description

Gesture image segmentation and recognition method and device based on deep learning
Technical Field
The invention relates to the field of man-machine interaction and computer vision, in particular to a gesture image segmentation and recognition method and device based on deep learning.
Background
Gesture interaction based on gesture recognition is one of the basic interaction modes in the field of human-computer interaction, and is a key research direction in machine vision and computer applications. Gesture recognition is widely applied in fields such as unmanned aerial vehicle gimbals, AR (Augmented Reality) and VR (Virtual Reality), and has strong advantages in various environments, such as non-contact environments and noisy or quiet environments, so how to increase the robustness and performance of gesture recognition is of great importance.
Currently, gesture interaction methods fall into two main types: based on sensing equipment and based on vision. For gesture recognition based on sensing equipment, for example, Chinese patent application 201810542738.9 discloses a gesture recognition method and device for improving the accuracy of gesture recognition and reducing misoperation. The method comprises the following steps: when a touch operation is detected, detecting the motion track of the contact point, the motion track being used to represent a gesture for controlling the terminal equipment; detecting the current moving speed of the contact point; and recognizing the gesture according to the current moving speed and the motion track. Chinese patent application 201510552869.1 discloses a 3D gesture recognition method comprising the following steps: S1, physical hardware acquires 3D coordinate data of a user's gesture in real time; S2, the physical hardware preprocesses the acquired 3D coordinate data to form feedback data; S3, data processing software performs recognition processing on the feedback data; S4, the system outputs the recognition result. This method can effectively alleviate the problems that gesture recognition needs to process a large amount of data, that the process is complex and that the software processing efficiency is low, but it requires the support of additional equipment. Therefore, sensor-based gesture recognition needs expensive auxiliary equipment, the interaction mode is not friendly and natural enough, and it is difficult to meet the requirements of actual human-computer interaction.
For vision-based gesture recognition, as in document [1], Wei et al. fused the object detection model SSD (single shot multi-box detector) into gesture segmentation, but thresholding the skin-colour probability map there leads to loss of hand detail information. Chinese patent application 201910130815.4 proposes a gesture image segmentation and recognition method with an improved capsule network and algorithm, which detects hands and generates binarized images for gesture recognition using an improved capsule network in deep learning, comprising the steps of: shooting and collecting gesture images under a complex background; constructing and training a U-shaped residual capsule network to obtain a binarized gesture image; positioning a gesture rectangular bounding box; and constructing and training an improved matrix capsule network to realize the recognition of gesture images. However, the existing vision-based gesture recognition methods suffer from slow network convergence and a low gesture recognition rate under complex backgrounds and non-uniform illumination.
The current technical research is mostly applied to the gesture recognition aspect of actual human-computer interaction, and additional equipment support is needed. In addition, due to the variability of gestures, the result of hand detection is easy to generate rich background, so that gesture recognition is interfered, and interactivity is reduced. Therefore, how to develop the gesture recognition technology with high recognition speed and little influence by external light and environment is worth researching. Through related technology search, no gesture recognition technology which fully meets the requirements exists at present.
Disclosure of Invention
The invention aims to provide a gesture image segmentation and recognition method and device based on deep learning, to solve the problem that existing methods have a low gesture image recognition rate under complex backgrounds.
The invention is realized in the following way: a gesture image segmentation and recognition method based on deep learning comprises the following steps:
a. carrying out size resetting operation on an input gesture image to fix the size of the image;
b. inputting the gesture image in the step a into a dense segmentation network, training the dense segmentation network, and obtaining a dense segmentation network model after training;
the dense partition network includes an encoder and a decoder; the encoder also comprises a depth convolution neural network module and an improved cavity space pyramid pooling module;
the improved cavity space pyramid pooling module comprises two modes, namely parallel mode and cascade mode; in a parallel mode, performing feature coding on the input feature images by using different void ratios so as to acquire multi-scale information of gestures; in the cascade mode, each layer except the first layer and the second layer connects the output of the parallel mode in series with the output of the previous layer; then adopting deconvolution with different void ratios to connect with the output of the parallel mode from bottom to top;
c. dividing the gesture image by adopting a trained dense dividing network model, and carrying out binarization processing on a dividing result;
d. inputting the divided binarized gesture images into a gesture recognition network, training the gesture recognition network by using gesture images with different gesture shapes, and obtaining a gesture recognition network model after training;
e. and classifying gestures with different shapes by adopting the trained gesture recognition network model, so as to realize the recognition of gesture images.
In step b, in the parallel mode, hole convolutions with hole rates {2^0, 2^1, 2^2, ..., 2^n}, i.e. n+1 hole convolutions in total, are used to perform multi-scale feature extraction on the feature map.
Taking n=4, the output of the parallel mode is given by:

o_i = H_{k,d[i]}(x), i = 0, 1, ..., 4

where x denotes the input feature map, d denotes the array of hole rates {2^0, 2^1, 2^2, ..., 2^4}, H_{k,d}(x) denotes a hole convolution with a convolution kernel of size k and a hole rate d, and o_i denotes the outputs of the 5 parallel branches, from top to bottom o_0, o_1, o_2, o_3, o_4.
The output of the cascade mode is as follows:

p_1 = H_{3,2}(o_1), p_2 = H_{3,4}(o_2 ⊕ p_1), p_3 = H_{3,8}(o_3 ⊕ p_2)

where p_i denotes the output of the cascade mode, and ⊕ denotes that features of different scales are concatenated on the channel.
Deconvolutions with different hole rates are adopted and connected with the outputs of the parallel mode from bottom to top; the deconvolution is given by:

q_1 = DH_{3,8}(o_4 ⊕ p_3), q_2 = DH_{3,4}(o_3 ⊕ q_1), q_3 = DH_{3,2}(o_2 ⊕ q_2), q_4 = DH_{3,2}(o_1 ⊕ q_3), y = o_0 ⊕ q_4

where q_j denotes the output after deconvolution, y denotes the output of the improved hole space pyramid pooling module, and DH_{3,d[j]} denotes a deconvolution with a convolution kernel of 3 and a hole rate d[j].
The deep convolutional neural network module comprises a 7×7 convolution kernel, a 3×3 convolution kernel and 4 residual groups. The 4 residual groups are as follows: the first residual group has 3 residual blocks, each with 3 layers (a 1×1×64 convolution kernel, a 3×3×64 convolution kernel and a 1×1×256 convolution kernel), 9 layers in total, hole rate d=1 and stride s=2; the second residual group has 4 residual blocks, each with 3 layers (1×1×128, 3×3×128 and 1×1×512 convolution kernels), 12 layers in total, hole rate d=1 and stride s=1; the third residual group has 6 residual blocks, each with 3 layers (1×1×256, 3×3×256 and 1×1×1024 convolution kernels), 18 layers in total, hole rate d=2 and stride s=1; the fourth residual group has 3 residual blocks, each with 3 layers (1×1×512, 3×3×512 and 1×1×2048 convolution kernels), 9 layers in total, hole rate d=4 and stride s=1.
In step b, the specific decoding process of the decoder is as follows: performing characteristic splicing on the output result of the improved cavity space pyramid pooling module and the characteristic of the fourth residual group subjected to 1X 1 convolution operation on the channel, and performing first double up-sampling on the spliced result; then splicing the result of the first double up-sampling and the characteristic of the first residual group subjected to the 1X 1 convolution operation on the channel, and continuing to perform the second double up-sampling; then, the result of the second double up-sampling and the characteristics subjected to 7×7 convolution and 1×1 convolution operation are subjected to characteristic splicing on the channel, and the third double up-sampling is continued; finally, the results of the gesture segmentation are refined using a 3×3 convolution kernel, and a 1×1 convolution kernel in order.
In step d, the gesture recognition network comprises three convolution layers, an activation function ReLU and max pooling (MaxPooling) for feature extraction, a fully connected layer and a Softmax layer;
training the gesture recognition network comprises the steps of:
a first group of convolution operations is performed: a 19×19×64 convolution, followed by ReLU activation, and finally a max pooling operation is used as the downsampling operation;
a second group of convolution operations is performed: a 17×17×128 convolution, followed by ReLU activation, and finally a max pooling operation is used as the downsampling operation;
a third group of convolution operations is performed: a 15×15×128 convolution, followed by ReLU activation, and finally a max pooling operation is used as the downsampling operation;
and sequentially inputting the result of the third group of convolution operation to the Softmax layer and outputting the final gesture classification result by the full connection layer.
The gesture image segmentation and recognition device based on the deep learning corresponding to the method comprises the following modules:
the gesture image acquisition module is connected with the preprocessing module and is used for acquiring a color gesture image;
the preprocessing module is respectively connected with the gesture image acquisition module and the dense segmentation network training module and is used for cutting the color gesture image and providing an input image with a fixed size for the dense segmentation network training module;
the intensive segmentation network training module is respectively connected with the preprocessing module and the binarization image acquisition module, and trains a gesture segmentation model by utilizing the input image output by the preprocessing module so as to obtain an optimized segmentation model and output a gesture segmentation result;
the binarization image acquisition module is respectively connected with the dense segmentation network training module and the gesture recognition model training module and is used for acquiring a binarization gesture image; and
the gesture recognition model training module is connected with the binarization image acquisition module, and is used for training a gesture recognition model by using the binarization gesture image so as to obtain an optimized gesture recognition model and outputting a gesture classification result.
Because of the variability of gestures, the result of hand detection is easy to generate rich background, thereby disturbing gesture recognition and reducing interactivity. Aiming at the problem, the invention provides a gesture image segmentation and recognition method based on deep learning, which is based on a dense segmentation network and an improved gesture recognition network, so that fusion of local features and global features of gestures is truly realized, and feature expression is enriched. The invention has stronger robustness and can obtain higher recognition rate under the conditions of similar complexion, hand and face shielding, non-uniform illumination conditions and the like.
The gesture image segmentation and recognition method based on deep learning provided by the invention has the advantages that:
for the problem of multiple scales of gestures in a complex background, different void ratios are designed in parallel and cascade modes in IASPP, and the void convolutions with different void ratios are stacked together, so that gesture multi-scale information on different receptive fields is obtained, and feature expression is enriched. Therefore, the IASPP combines the global and advanced semantic features and the local and detailed semantic features together to filter redundant information in the background, thereby being beneficial to improving the segmentation precision.
The invention obtains more accurate gesture segmentation result by using the encoder for obtaining the advanced semantic information and the decoder for amplifying the image by using the information of the encoding stage to recover the detail information of the image.
The overall performance of the invention is better than that of the general mainstream algorithm, and the invention is more suitable for man-machine products. The improved gesture recognition network has the advantages that: compared with the prior network method, the gesture recognition rate can be effectively improved, and the gesture recognition effect is better than that of the traditional CNN method when the gesture images with different illumination are recognized.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a training diagram of a dense segmentation network in accordance with the present invention.
FIG. 3 is a training diagram of a gesture recognition network according to the present invention.
Fig. 4 is an IASPP framework diagram in accordance with the present invention.
Fig. 5 is a diagram of a densely partitioned network framework in accordance with the present invention.
Fig. 6 is a diagram of the overall network framework of the present invention.
Fig. 7 and 8 are graphs comparing the segmentation effect of the present invention with other algorithms.
Detailed Description
The gesture image segmentation and recognition method based on deep learning provided by the invention mainly comprises the following 3 steps:
step 1: and (3) carrying out the size resetting (size resetting operation) on all gesture images in the complex background, so that the image size is fixed.
Step 2: inputting gesture images subjected to the resolution operation under a complex background into the dense segmentation network, training the dense segmentation network, and outputting a trained dense segmentation network model. And finally, outputting a binarized gesture image by using the trained dense segmentation network model.
Step 3: inputting the gesture images separated in the step 2 into a gesture recognition network, training the gesture recognition network by using gesture images with different gesture shapes, and outputting a trained gesture recognition network model. And classifying each different gesture by using the network model, so as to realize the identification of gesture images.
Because of the variability of gestures, the result of hand detection easily contains rich background, which interferes with gesture recognition and reduces interactivity. Aiming at this problem, the invention proposes a strategy of dense segmentation plus gesture recognition. Gesture segmentation can remove the redundant information brought by the background to the greatest extent and reduce the interference to the gesture recognition algorithm, thereby improving the gesture recognition accuracy. In order to improve the accuracy of gesture segmentation, the invention proposes an improved atrous spatial pyramid pooling method (Improved Atrous Spatial Pyramid Pooling, IASPP), which combines a cascade mode and a parallel mode for feature extraction and obtains richer hand feature information.
And filtering out redundant backgrounds by using the proposed dense segmentation network under the complex background, segmenting out the gesture image, inputting the positioned gesture area into a gesture recognition network, and recognizing by adopting an improved algorithm. The invention improves the segmentation performance of the gesture image, thereby improving the recognition rate of the gesture image.
The dense segmentation network in step 2 is mainly composed of three parts, which are, in turn: a deep convolutional neural network (DCNN), an improved atrous spatial pyramid pooling (IASPP) module, and a decoder.
In connection with fig. 5, the input of the dense segmentation network in step 2 is a 512×512×3 RGB image, and the encoding part is composed of the DCNN and IASPP modules. The DCNN is the feature-extraction backbone, consisting of one 7×7 convolution kernel (denoted Conv in the figure), one 3×3 convolution kernel and 4 residual groups. As shown in Table 1 below, the first residual group has 3 residual blocks, each with 3 layers (1×1×64, 3×3×64 and 1×1×256 convolution kernels), 9 layers in total, hole rate d=1 and stride s=2; the second residual group has 4 residual blocks, each with 3 layers (1×1×128, 3×3×128 and 1×1×512 convolution kernels), 12 layers in total, hole rate d=1 and stride s=1; the third residual group has 6 residual blocks, each with 3 layers (1×1×256, 3×3×256 and 1×1×1024 convolution kernels), 18 layers in total, hole rate d=2 and stride s=1; the fourth residual group has 3 residual blocks, each with 3 layers (1×1×512, 3×3×512 and 1×1×2048 convolution kernels), 9 layers in total, hole rate d=4 and stride s=1.
TABLE 1 Deep Convolutional Neural Network (DCNN) parameter settings

Residual group | Residual blocks | Convolution kernels per block | Total layers | Hole rate d | Stride s
First  | 3 | 1×1×64, 3×3×64, 1×1×256    | 9  | 1 | 2
Second | 4 | 1×1×128, 3×3×128, 1×1×512  | 12 | 1 | 1
Third  | 6 | 1×1×256, 3×3×256, 1×1×1024 | 18 | 2 | 1
Fourth | 3 | 1×1×512, 3×3×512, 1×1×2048 | 9  | 4 | 1
It is noted that, in order for the Decoder to merge more local detail information while reducing the computational effort, the 7×7 convolution kernel in the DCNN and the output features of the first and fourth residual groups are each followed by a 1×1 convolution kernel. Finally, after the feature extraction of the DCNN, the feature map output by the fourth residual group becomes 1/8 of the original image. The feature map output by the fourth residual group is taken as the input of the IASPP module.
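For illustration, the bottleneck structure of these residual groups can be sketched as follows. This is a minimal PyTorch-style sketch under stated assumptions, not the patent's own code: the class name, the use of batch normalization and the placement of ReLU are assumptions, while the 1×1/3×3/1×1 kernel sizes, channel counts, hole rates and strides follow Table 1.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 -> 3x3 (dilated) -> 1x1 bottleneck, as described for the DCNN residual groups."""
    def __init__(self, in_ch, mid_ch, out_ch, dilation=1, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_ch)
        # The 3x3 convolution carries the residual group's hole rate d and stride s.
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=stride,
                               padding=dilation, dilation=dilation, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_ch)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut when the spatial size or channel count changes.
        self.down = None
        if stride != 1 or in_ch != out_ch:
            self.down = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        identity = x if self.down is None else self.down(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)

# Example: a block of the third residual group (1x1x256, 3x3x256 with d=2, 1x1x1024).
block = Bottleneck(in_ch=512, mid_ch=256, out_ch=1024, dilation=2, stride=1)
out = block(torch.randn(1, 512, 64, 64))   # spatial size preserved: (1, 1024, 64, 64)
```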
As shown in fig. 4, the design architecture of the IASPP in the dense segmentation network combines two modes, parallel and cascaded. In the parallel mode, the invention uses hole convolutions with hole rates {2^0, 2^1, 2^2, ..., 2^n} to perform feature coding on the input feature map and acquire multi-scale information of the gesture. In the embodiment of the invention, n=4 is set, i.e. 5 hole convolutions are included to extract multi-scale features of the feature map and generate richer feature expression.
Taking n=4, the output of the parallel mode is defined as equation (1):

o_i = H_{k,d[i]}(x), i = 0, 1, ..., 4    (1)

where x denotes the input feature map, d denotes the array of hole rates {2^0, 2^1, 2^2, ..., 2^4}, H_{k,d}(x) denotes a hole convolution with a convolution kernel of size k and a hole rate d, and o_i denotes the outputs of the 5 parallel modes, from top to bottom in FIG. 4: o_0, o_1, o_2, o_3, o_4.
In the cascade mode, each layer except the first and second connects the output of the parallel mode in series with the output of the previous layer, extracting gesture information in a denser manner and yielding better feature expression. Specifically, a hole convolution with k=3 and d=2 first continues feature extraction on the parallel-mode output o_1, giving the output p_1. Then a hole convolution with k=3 and d=4 continues feature extraction on the channel-wise concatenation of o_2 and p_1, giving the output p_2. Finally, a hole convolution with k=3 and d=8 continues feature extraction on the channel-wise concatenation of o_3 and p_2, giving the output p_3.
The output of the cascade mode in the IASPP is defined as equation (2):

p_1 = H_{3,2}(o_1), p_2 = H_{3,4}(o_2 ⊕ p_1), p_3 = H_{3,8}(o_3 ⊕ p_2)    (2)

where ⊕ denotes that features of different scales are concatenated on the channel (labelled Concat in FIG. 4, i.e. connected in series), and p_i denotes the output of the cascade mode.
Since image segmentation is extremely sensitive to the spatial position information of pixels, in order to restore the image size while fusing more detail information, the invention designs deconvolutions with different hole rates (denoted TC in FIG. 4), connected with the outputs of the parallel mode from bottom to top to restore local features and make the image edges smoother. First, the deconvolution with k=3 and d=8 (i.e. TC1) restores the feature map obtained by concatenating o_4 and p_3 on the channel, giving the output q_1. Then the deconvolution with k=3 and d=4 (i.e. TC2) restores the concatenation of o_3 and q_1, giving q_2. Next, the deconvolution with k=3 and d=2 (i.e. TC3) restores the concatenation of o_2 and q_2, giving q_3. Finally, the deconvolution with k=3 and d=2 (i.e. TC4) restores the concatenation of o_1 and q_3, giving q_4. The final IASPP output y is the concatenation of o_0 and q_4 on the channel.
The above is expressed by the following formulas:

q_1 = DH_{3,8}(o_4 ⊕ p_3), q_2 = DH_{3,4}(o_3 ⊕ q_1), q_3 = DH_{3,2}(o_2 ⊕ q_2), q_4 = DH_{3,2}(o_1 ⊕ q_3), y = o_0 ⊕ q_4

where q_j denotes the output after deconvolution, y denotes the final output of the IASPP, and DH_{3,d[j]} denotes a deconvolution with a convolution kernel of 3 and a hole rate d[j].
The invention takes the output feature map of the fourth residual group in the DCNN as the input of the IASPP, uses hole convolutions with different hole rates to perform feature coding on the 2048-dimensional features output by the DCNN, and extracts multi-scale context information while enriching the feature expression.
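Putting the parallel, cascade and deconvolution paths together, the IASPP module can be sketched as below. This is a hedged PyTorch-style sketch, not the patent's implementation: the per-branch channel width of 256, the 3×3 kernel for the parallel branches and the use of stride-1 transposed convolutions are assumptions; the hole rates, the concatenation pattern and the o/p/q/y data flow follow the formulas above.

```python
import torch
import torch.nn as nn

class IASPP(nn.Module):
    """Improved atrous spatial pyramid pooling: parallel dilated branches (o0..o4),
    a cascaded path (p1..p3) and bottom-up transposed convolutions (q1..q4)."""
    def __init__(self, in_ch=2048, branch_ch=256, rates=(1, 2, 4, 8, 16)):
        super().__init__()
        # Parallel mode: one dilated 3x3 convolution per hole rate 2^0 .. 2^4.
        self.parallel = nn.ModuleList(
            [nn.Conv2d(in_ch, branch_ch, 3, padding=r, dilation=r) for r in rates])
        # Cascade mode: rates 2, 4, 8 (inputs are channel concatenations).
        self.casc1 = nn.Conv2d(branch_ch, branch_ch, 3, padding=2, dilation=2)
        self.casc2 = nn.Conv2d(2 * branch_ch, branch_ch, 3, padding=4, dilation=4)
        self.casc3 = nn.Conv2d(2 * branch_ch, branch_ch, 3, padding=8, dilation=8)
        # Transposed convolutions TC1..TC4 with rates 8, 4, 2, 2 (stride 1 keeps the size).
        self.tc1 = nn.ConvTranspose2d(2 * branch_ch, branch_ch, 3, padding=8, dilation=8)
        self.tc2 = nn.ConvTranspose2d(2 * branch_ch, branch_ch, 3, padding=4, dilation=4)
        self.tc3 = nn.ConvTranspose2d(2 * branch_ch, branch_ch, 3, padding=2, dilation=2)
        self.tc4 = nn.ConvTranspose2d(2 * branch_ch, branch_ch, 3, padding=2, dilation=2)

    def forward(self, x):
        cat = lambda a, b: torch.cat([a, b], dim=1)        # channel-wise concatenation
        o = [branch(x) for branch in self.parallel]        # o0 .. o4
        p1 = self.casc1(o[1])
        p2 = self.casc2(cat(o[2], p1))
        p3 = self.casc3(cat(o[3], p2))
        q1 = self.tc1(cat(o[4], p3))
        q2 = self.tc2(cat(o[3], q1))
        q3 = self.tc3(cat(o[2], q2))
        q4 = self.tc4(cat(o[1], q3))
        return cat(o[0], q4)                               # output y

y = IASPP()(torch.randn(1, 2048, 64, 64))   # 512/8 = 64; y has 512 channels
```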
In connection with fig. 5, three scales of features (the 7×7 convolution kernel, the first residual group and the fourth residual group) are selected in the DCNN for recovering more detail features in the decoding (Decoder) process, and three upsampling operations are used to resize the feature map, which is concatenated with the feature map from the encoding part after each upsampling. The decoding process is specifically as follows: first, the output result y of the IASPP and the features of the fourth residual group after a 1×1 convolution operation are concatenated on the channel, and the concatenated result undergoes the first double upsampling (denoted Up in the figure); then the result of the first double upsampling and the features of the first residual group after a 1×1 convolution operation are concatenated on the channel, and the second double upsampling is performed; then the result of the second double upsampling and the features after the 7×7 convolution and a 1×1 convolution operation are concatenated on the channel, and the third double upsampling is performed. Finally, a 3×3 convolution kernel and a 1×1 convolution kernel are used in order to refine the result of the gesture segmentation.
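A corresponding decoder sketch is given below, again as an illustrative PyTorch-style approximation rather than the patent's implementation: the reduced channel width of the 1×1 skip convolutions (48), bilinear interpolation for the upsampling, the single-channel output and the assumed skip-feature resolutions are not specified by the patent; the three-stage concatenate-then-upsample flow and the final 3×3/1×1 refinement follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Fuse the IASPP output y with three encoder skips (4th residual group,
    1st residual group, 7x7 stem feature), each compressed by a 1x1 convolution,
    with a 2x upsampling after every fusion, then 3x3 and 1x1 refinement."""
    def __init__(self, y_ch=512, skip_chs=(2048, 256, 64), reduce_ch=48, num_classes=1):
        super().__init__()
        self.reduce = nn.ModuleList([nn.Conv2d(c, reduce_ch, 1) for c in skip_chs])
        self.refine3x3 = nn.Conv2d(y_ch + 3 * reduce_ch, 256, 3, padding=1)
        self.refine1x1 = nn.Conv2d(256, num_classes, 1)   # gesture/background logit map

    def forward(self, y, skips):
        x = y
        for red, skip in zip(self.reduce, skips):
            s = red(skip)
            # Match the skip to the current decoder resolution before concatenation
            # (an assumption; the patent only states concatenation on the channel).
            s = F.interpolate(s, size=x.shape[-2:], mode='bilinear', align_corners=False)
            x = torch.cat([x, s], dim=1)
            x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
        return self.refine1x1(self.refine3x3(x))

# Skip resolutions below are assumed for illustration only.
logits = Decoder()(torch.randn(1, 512, 64, 64),
                   [torch.randn(1, 2048, 64, 64),    # 4th residual group
                    torch.randn(1, 256, 128, 128),   # 1st residual group
                    torch.randn(1, 64, 256, 256)])   # 7x7 stem feature
# logits: (1, 1, 512, 512)
```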
In the step 3, the information from the dense segmentation network is input into the gesture recognition network model, and classification is continued.
As shown in fig. 6, in the gesture recognition network model, three convolution layers, an activation function ReLU and max pooling (MaxPooling) for feature extraction, a Softmax layer and a fully connected layer constitute the gesture classification network. In the classification process, the output of the dense segmentation network model is first randomly divided into a training set and a test set, and then fed into the gesture classification layers as input. The operations performed are, in order: a first group of convolution operations (one 19×19×64 convolution, followed by ReLU activation, with a max pooling operation finally used for downsampling); a second group of convolution operations (one 17×17×128 convolution, followed by ReLU activation, with a max pooling operation finally used for downsampling); a third group of convolution operations (one 15×15×128 convolution, followed by ReLU activation, with a max pooling operation finally used for downsampling); finally, the results of the third group of convolution operations are input to the Softmax layer, and the final gesture classification result is output by the fully connected layer.
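The classification network can be sketched as follows in the same PyTorch style. The 19×19×64, 17×17×128 and 15×15×128 convolution groups, ReLU, max pooling, the fully connected layer and the softmax come from the description; the single input channel (the binarized mask), the stride-2/padding settings, the global pooling before the fully connected layer, the number of gesture classes and the conventional FC-then-softmax ordering are assumptions.

```python
import torch
import torch.nn as nn

class GestureClassifier(nn.Module):
    """Three conv groups (19x19x64, 17x17x128, 15x15x128), each with ReLU and
    max pooling, followed by a fully connected layer; softmax gives class scores."""
    def __init__(self, in_ch=1, num_classes=10):
        super().__init__()
        def group(ci, co, k):
            return nn.Sequential(
                nn.Conv2d(ci, co, kernel_size=k, stride=2, padding=k // 2),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2))                 # max pooling as the downsampling step
        self.features = nn.Sequential(
            group(in_ch, 64, 19),                # first group
            group(64, 128, 17),                  # second group
            group(128, 128, 15))                 # third group
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.fc(x)                        # logits; softmax applied at inference

model = GestureClassifier()
probs = torch.softmax(model(torch.randn(1, 1, 512, 512)), dim=1)
```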
Referring to fig. 2 and fig. 3, a gesture image segmentation and recognition device based on deep learning corresponding to the above method includes the following modules:
the gesture image acquisition module is connected with the first preprocessing module and is used for acquiring a color gesture image.
The first preprocessing module is respectively connected with the gesture image acquisition module and the dense segmentation network training module and is used for cutting the color gesture image and providing an input image with a fixed size for the dense segmentation network training module.
The intensive segmentation network training module is respectively connected with the first preprocessing module and the gesture image segmentation module, and the gesture segmentation model is trained by utilizing the input image output by the first preprocessing module so as to obtain an optimized segmentation model.
And the gesture image segmentation module is respectively connected with the intensive segmentation network training module and the image segmentation result output module and is used for segmenting gestures through the optimized gesture segmentation model.
And the image segmentation result output module is connected with the gesture image segmentation module and is used for outputting the segmented gesture image.
The data processed by the first preprocessing module is divided into training data and test data, the dense segmentation network training module uses the training data to train a dense segmentation network model, so that the segmentation image and a real gesture segmentation label are subjected to cross entropy loss calculation to obtain the overall loss of the segmentation network, and the loss is continuously reduced by using a counter propagation idea so as to fit the segmentation model, and a stable segmentation model is obtained. And performing gesture image segmentation on the test data or other non-test data by adopting the optimized dense segmentation network model, and finally outputting a gesture image segmentation result by an image segmentation result output module.
The output in fig. 2 is taken as input in fig. 3, namely: the gesture image segmentation result output by the image segmentation result output module enters a binarization image acquisition module, and a binarization gesture image is acquired through the binarization image acquisition module. Specifically, the binarized image acquisition is to send the segmentation result into a sigmoid function to adjust it to a range of 0-1, and obtain the final binarized image using a threshold-based method. If greater than 0.5, it is 1, otherwise it is 0.
The binarization image acquisition module is also connected with a second preprocessing module, and the second preprocessing module is used for cutting the binarization gesture image and providing an input image with a fixed size for the gesture recognition model training module.
The gesture recognition model training module is connected with the second preprocessing module and is used for recognizing the fixed-size binarized gesture image. Specifically: in the gesture recognition model training module, a gesture recognition model is first constructed, consisting of three convolution layers (the first layer has 64 convolution kernels of size 19×19, the second layer has 128 convolution kernels of size 17×17, the third layer has 128 convolution kernels of size 15×15, all with stride 2), a ReLU layer and a MaxPooling layer for feature extraction, a fully connected layer and a Softmax layer. The parameters are initialized and gesture recognition is performed; the recognition result and the real label are used to compute the cross-entropy loss. If the loss reaches the expected value, the gesture recognition model is obtained; otherwise the loss is continuously reduced by back-propagation, the parameters are updated, and recognition continues.
In the gesture recognition model training module, the output of the segmentation model is also required to be randomly divided into a training set and a testing set, and then input into the gesture recognition model as input.
In detail, as shown in fig. 1, the gesture image segmentation and recognition method based on deep learning provided by the invention comprises the following steps:
step 1: a color gesture image is input. The color gesture image input in the embodiment of the invention is selected from the public vision data set OUTHANDS and the HGR data set. The input color gesture image is based on the subsequent training and verification of the network model.
Step 2: the input image is preprocessed so that the image reaches a fixed dimension.
The gesture image is adjusted (cut, resized) to 512×512 pixels, and in this step, the number of preprocessed images in the OUTHANDS dataset is 3000, wherein 2000 images are used as training sets and 1000 images are used as verification sets. The number of images after preprocessing of the HGR dataset was 899, with 630 images as the training set and 269 images as the validation set.
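As an illustration of this preprocessing step, a small sketch is shown below; the file handling and the random seed are illustrative assumptions, while the 512×512 size and the 2000/1000 OUTHANDS split come from the text.

```python
import random
from PIL import Image

def preprocess(path, size=512):
    """Resize a colour gesture image to the fixed 512x512 network input."""
    return Image.open(path).convert('RGB').resize((size, size), Image.BILINEAR)

def split_outhands(paths, n_train=2000, seed=0):
    """Illustrative 2000/1000 train/validation split of the preprocessed OUTHANDS images."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    return paths[:n_train], paths[n_train:]
```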
Step 3: a dense partition network is constructed.
The dense segmentation network set in this step is specifically designed for gestures in a complex context. The neural network structure of this step is shown in fig. 5, and the IASPP module structure of this step is shown in fig. 4. The method comprises the following steps:
the training data in step 2 (here only the preprocessed training set is used) is taken as the input image of step 3. The input image is firstly subjected to two convolution operations, the sizes of convolution kernels are 7 multiplied by 7 and 3 multiplied by 3 in sequence, and then the convolution kernels are sequentially fed into a first residual group, a second residual group, a third residual group and a fourth residual group. Finally, after the feature extraction of the DCNN, the feature map finally output by the fourth residual group becomes 1/8 of the original image.
After the output feature map of the fourth residual group is taken as the input of the IASPP module, it enters the IASPP module. In the parallel mode, the feature map is first convolved by five convolution kernels with different hole rates; because convolution kernels with different hole rates are used, feature layers of different receptive-field sizes are obtained, so multi-scale context information is mined while the feature expression is enriched. As shown in fig. 4, from top to bottom these are o_0, o_1, o_2, o_3, o_4. In the cascade mode, a hole convolution with k=3 and d=2 first continues feature extraction on the parallel-mode output o_1, giving the output p_1. Then a hole convolution with k=3 and d=4 continues feature extraction on the channel-wise concatenation of o_2 and p_1, giving p_2. Finally, a hole convolution with k=3 and d=8 continues feature extraction on the channel-wise concatenation of o_3 and p_2, giving p_3.
The invention also designs deconvolutions with different hole rates, connected with the outputs of the parallel mode from bottom to top to restore local features and make the image edges smoother. First, the deconvolution with k=3 and d=8 restores the feature map obtained by concatenating o_4 and p_3 on the channel, giving the output q_1. Then the deconvolution with k=3 and d=4 restores the concatenation of o_3 and q_1, giving q_2. Next, the deconvolution with k=3 and d=2 restores the concatenation of o_2 and q_2, giving q_3. Finally, the deconvolution with k=3 and d=2 restores the concatenation of o_1 and q_3, giving q_4. The output y of the final IASPP module is the concatenation of o_0 and q_4 on the channel.
For the decoder, three scale features of the 7 x 7 convolution kernel, the first set of residuals, the fourth set of residuals are chosen here in order to recover more detail features in the decoding process. And three upsampling operations are used to adjust the size of the feature map, wherein the upsampling operation is to expand each layer of features in the feature layer to the corresponding dimension in a linear interpolation mode, and the layer number is unchanged. Finally, 3×3 and 1×1 convolution kernels are used to refine the results of the gesture segmentation.
Step 4: fitting training is carried out on the gesture segmentation model by using gesture data to obtain a stable segmentation model
And sending the gesture image as input into a dense segmentation network to obtain a segmentation result, and carrying out cross entropy loss calculation with a real gesture segmentation label to obtain the overall loss of the dense segmentation network. And the back propagation thought is used for continuously reducing loss so as to fit the segmentation model, and a stable dense segmentation model is obtained. Through the step, a gesture segmentation model based on a convolutional neural network is finally obtained through training, and a gesture image can be segmented according to the segmentation model.
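A minimal training-loop sketch for this fitting step is given below. The optimiser, learning rate and epoch count are assumptions not stated in the patent; the binary (two-class) form of the cross-entropy loss is assumed here because the segmentation output is later passed through a sigmoid, and the back-propagation follows the description.

```python
import torch
import torch.nn as nn

def train_segmentation(model, loader, epochs=50, lr=1e-4, device='cuda'):
    """Fit the dense segmentation network against the real gesture segmentation labels."""
    model.to(device)
    criterion = nn.BCEWithLogitsLoss()                     # cross-entropy segmentation loss
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for image, mask in loader:                         # mask: 0/1 gesture label, (N, 1, H, W)
            image, mask = image.to(device), mask.to(device)
            loss = criterion(model(image), mask.float())   # compare with the real segmentation label
            optimizer.zero_grad()
            loss.backward()                                # back-propagation reduces the loss
            optimizer.step()
    return model
```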
Step 5: performing binarization processing on the segmentation result obtained in the step 4
I.e. the segmentation result is fed into a sigmoid function to be adjusted to within the range of 0-1 and a threshold-based method is used to obtain the final binarized image. If greater than 0.5, it is 1, otherwise it is 0.
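This thresholding can be written in one line, assuming the segmentation network emits a single-channel logit map; the 0.5 threshold comes from the text.

```python
import torch

def binarize(logits, threshold=0.5):
    """Sigmoid squashes the segmentation output into [0, 1]; values above the
    threshold become 1 (gesture), the rest 0 (background)."""
    return (torch.sigmoid(logits) > threshold).float()
```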
Step 6: construction of gesture recognition model
The model consists of three convolution layers, reLu and MaxPooling for feature extraction, a fully connected layer and a Softmax layer, which form the gesture classification layer.
In the classification algorithm, firstly, the binarized image in the step 5 is subjected to random clipping operation, the clipping proportion is 0.75-1 times of that of the original image (512×512), and then the image size is reset to 512×512 pixels and is input into the gesture classification layer as an input image.
The operations performed are, in order: a first group of convolution operations (one 19×19×64 convolution, followed by ReLU activation, with a max pooling operation finally used for downsampling); a second group of convolution operations (one 17×17×128 convolution, followed by ReLU activation, with a max pooling operation finally used for downsampling); a third group of convolution operations (one 15×15×128 convolution, followed by ReLU activation, with a max pooling operation finally used for downsampling); finally, the results of the third group of convolution operations are input to the Softmax layer, and the final gesture classification result is output by the fully connected layer. The gesture recognition model is trained with the categorical cross-entropy loss, the network model parameters are adjusted, and the model parameters are saved after training is completed.
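A sketch of this classification training step, including the 0.75-1x random crop followed by resizing back to 512×512, is shown below. The optimiser, learning rate, epoch count and the interpretation of the crop proportion as an area fraction are assumptions; only the crop bounds, the 512×512 size and the categorical cross-entropy loss come from the text.

```python
import torch
from torchvision import transforms

# Random crop of 0.75-1x of the 512x512 binarised image, resized back to 512x512;
# RandomResizedCrop accepts either a PIL image or an image tensor.
augment = transforms.RandomResizedCrop(512, scale=(0.75, 1.0), ratio=(1.0, 1.0))

def train_recognition(model, loader, epochs=30, lr=1e-4, device='cuda'):
    """Train the gesture classifier with the categorical cross-entropy loss."""
    model.to(device)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for image, label in loader:
            image, label = augment(image).to(device), label.to(device)
            loss = criterion(model(image), label)          # classified cross-entropy loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```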
Step 7: image classification
After training of the models is completed, for a test image, a gesture segmentation map is first acquired through the dense segmentation network, and the binarized image is then sent into the gesture recognition model for final classification.
In order to further prove the effectiveness of the intensive segmentation and gesture classification combined model, the embodiment of the invention performs a gesture segmentation experiment on an OUTHANDS and HGR public data set, and compares the gesture segmentation experiment with other recognition algorithms based on deep learning on a NUS-II data set.
As shown in Table 2, the recognition accuracy of the dense segmentation and gesture classification provided by the invention can reach 98.61%, which is improved by 3.99% compared with the gesture recognition algorithm, and the running time is not greatly increased while the dense segmentation and gesture classification is superior to other comparison algorithms. Therefore, the segmentation algorithm provided by the invention can greatly filter the interference information in the background and improve the accuracy of gesture recognition.
TABLE 2 identification rates on OUTHANDS dataset
From table 3, it can be seen that the segmentation algorithm based on the dense segmentation network has a great advantage in terms of gesture segmentation, wherein indexes such as accuracy (Pr), recall (Re), equilibrium F-score (F-score) and area under ROC curve (AUC) reach 0.9948, 0.9929, 0.9939 and 0.9982 respectively. These evaluation indexes are all higher than the comparison algorithm, which demonstrates that the method of the present invention is superior to the comparison algorithm in all aspects.
Table 3 comparison of the algorithm herein with the machine learning method under HGR dataset
In order to further prove that the dense segmentation and gesture recognition algorithm provided by the invention can improve the gesture recognition rate, the method is compared with other algorithms based on deep learning on the NUS-II data set. As shown in Table 4, it is clear from Table 4 that the gesture recognition rate of the method of the present invention can reach 98.63%, which is 0.33% higher than that of the suboptimal algorithm. Therefore, the method can enable the segmentation of the gesture and the background to be more accurate, and can further improve the gesture recognition rate.
TABLE 4 identification Rate on NUS-II dataset
Fig. 7 and 8 show a comparison of the results of the method of the present invention with other methods of segmenting and recognizing gestures. From the figure, it can be seen that the method of the present invention (corresponding IASPP-ResNet) is closer to the real tag (GT) than other methods, and it is seen that the method of the present invention is superior to other methods.
References referred to in the present specification are as follows:
[1] Wei Bao-guo, Xu Yong, Liu Jin-wei, Zhou Jia-ming. Adaptive gesture segmentation based on SSD object detection [J]. Journal of Signal Processing, 2020, 36(07): 1038-1047. (in Chinese)
[2] Adithya V, Rajesh R. A deep convolutional neural network approach for static hand gesture recognition [J]. Procedia Computer Science, 2020, 171: 2353-2361.
[3] Zhang Q, Yang M, Kpalma K, et al. Segmentation of hand posture against complex backgrounds based on saliency and skin colour detection [J]. IAENG International Journal of Computer Science, 2018, 45(3): 435-444.
[4] Sun J, Ji T, Zhang S, Yang J, Ji G. Research on the hand gesture recognition based on deep learning [A]. 2018 12th International Symposium on Antennas, Propagation and EM Theory (ISAPE) [C]. Hangzhou, China: IEEE, 2018: 1-4.
[5] Arenas J O P, Moreno R J, R D H. Convolutional neural network with a DAG architecture for control of a robotic arm by means of hand gestures [J]. Contemporary Engineering Sciences, 2018, 11(12): 547-557.
[6] Tan Y S, Lim K M, Tee C, et al. Convolutional neural network with spatial pyramid pooling for hand gesture recognition [J]. Neural Computing and Applications, 2020: 1-13.

Claims (10)

1. A gesture image segmentation and recognition method based on deep learning is characterized by comprising the following steps:
a. carrying out size resetting operation on an input gesture image to fix the size of the image;
b. inputting the gesture image in the step a into a dense segmentation network, training the dense segmentation network, and obtaining a dense segmentation network model after training;
the dense partition network includes an encoder and a decoder; the encoder also comprises a depth convolution neural network module and an improved cavity space pyramid pooling module;
the improved cavity space pyramid pooling module comprises two modes, namely parallel mode and cascade mode; in a parallel mode, performing feature coding on the input feature images by using different void ratios so as to acquire multi-scale information of gestures; in the cascade mode, each layer except the first layer and the second layer connects the output of the parallel mode in series with the output of the previous layer; then adopting deconvolution with different void ratios to connect with the output of the parallel mode from bottom to top;
c. dividing the gesture image by adopting a trained dense dividing network model, and carrying out binarization processing on a dividing result;
d. inputting the divided binarized gesture images into a gesture recognition network, training the gesture recognition network by using gesture images with different gesture shapes, and obtaining a gesture recognition network model after training;
e. and classifying gestures with different shapes by adopting the trained gesture recognition network model, so as to realize the recognition of gesture images.
2. The gesture image segmentation and recognition method based on deep learning as set forth in claim 1, wherein in the step b, in the parallel mode, hole convolutions with hole rates {2^0, 2^1, 2^2, ..., 2^n}, i.e. n+1 hole convolutions in total, are used to carry out multi-scale feature extraction on the feature map.
3. The gesture image segmentation and recognition method based on deep learning according to claim 2, wherein n=4 is taken, and the output of the parallel mode is as follows:

o_i = H_{k,d[i]}(x), i = 0, 1, ..., 4

where x denotes the input feature map, d denotes the array of hole rates {2^0, 2^1, 2^2, ..., 2^4}, H_{k,d}(x) denotes a hole convolution with a convolution kernel of size k and a hole rate d, and o_i denotes the outputs of the 5 parallel modes, from top to bottom o_0, o_1, o_2, o_3, o_4;
the output of the cascade mode is as follows:

p_1 = H_{3,2}(o_1), p_2 = H_{3,4}(o_2 ⊕ p_1), p_3 = H_{3,8}(o_3 ⊕ p_2)

where p_i denotes the output of the cascade mode, and ⊕ denotes that features of different scales are concatenated on the channel;
deconvolutions with different hole rates are adopted and connected with the outputs of the parallel mode from bottom to top; the deconvolution is given by:

q_1 = DH_{3,8}(o_4 ⊕ p_3), q_2 = DH_{3,4}(o_3 ⊕ q_1), q_3 = DH_{3,2}(o_2 ⊕ q_2), q_4 = DH_{3,2}(o_1 ⊕ q_3), y = o_0 ⊕ q_4

where q_j denotes the output after deconvolution, y denotes the output of the improved hole space pyramid pooling module, and DH_{3,d[j]} denotes a deconvolution with a convolution kernel of 3 and a hole rate d[j].
4. The method for segmentation and recognition of gesture images based on deep learning according to claim 1, wherein in the step b, the deep convolutional neural network module comprises a 7 x 7 convolution kernel, a 3 x 3 convolution kernel, and 4 residual groups.
5. The gesture image segmentation and recognition method based on deep learning according to claim 4, wherein the 4 residual groups are respectively as follows: the first residual group has 3 residual blocks, each with 3 layers (a 1×1×64 convolution kernel, a 3×3×64 convolution kernel and a 1×1×256 convolution kernel), 9 layers in total, hole rate d=1 and stride s=2; the second residual group has 4 residual blocks, each with 3 layers (1×1×128, 3×3×128 and 1×1×512 convolution kernels), 12 layers in total, hole rate d=1 and stride s=1; the third residual group has 6 residual blocks, each with 3 layers (1×1×256, 3×3×256 and 1×1×1024 convolution kernels), 18 layers in total, hole rate d=2 and stride s=1; the fourth residual group has 3 residual blocks, each with 3 layers (1×1×512, 3×3×512 and 1×1×2048 convolution kernels), 9 layers in total, hole rate d=4 and stride s=1.
6. The gesture image segmentation and recognition method based on deep learning of claim 5, wherein in step b, the specific decoding process of the decoder is as follows: performing characteristic splicing on the output result of the improved cavity space pyramid pooling module and the characteristic of the fourth residual group subjected to 1X 1 convolution operation on the channel, and performing first double up-sampling on the spliced result; then splicing the result of the first double up-sampling and the characteristic of the first residual group subjected to the 1X 1 convolution operation on the channel, and continuing to perform the second double up-sampling; then, the result of the second double up-sampling and the characteristics subjected to 7×7 convolution and 1×1 convolution operation are subjected to characteristic splicing on the channel, and the third double up-sampling is continued; finally, the results of the gesture segmentation are refined using a 3×3 convolution kernel, and a 1×1 convolution kernel in order.
7. The deep learning based gesture image segmentation and recognition method according to claim 1, wherein in step d, the gesture recognition network comprises three convolution layers, an activation function ReLU and max pooling (MaxPooling) for feature extraction, a fully connected layer and a Softmax layer;
training the gesture recognition network comprises the steps of:
a first group of convolution operations is performed: a 19×19×64 convolution is performed, followed by ReLU activation, and finally a max pooling operation is used as a downsampling operation;
a second group of convolution operations is performed: a 17×17×128 convolution is performed, followed by ReLU activation, and finally a max pooling operation is used as a downsampling operation;
a third group of convolution operations is performed: a 15×15×128 convolution is performed, followed by ReLU activation, and finally a max pooling operation is used as a downsampling operation;
and sequentially inputting the result of the third group of convolution operation to the Softmax layer and outputting the final gesture classification result by the full connection layer.
8. A gesture image segmentation and recognition device based on deep learning is characterized by comprising the following modules:
the gesture image acquisition module is connected with the preprocessing module and is used for acquiring a color gesture image;
the preprocessing module is respectively connected with the gesture image acquisition module and the dense segmentation network training module and is used for cutting the color gesture image and providing an input image with a fixed size for the dense segmentation network training module;
the intensive segmentation network training module is respectively connected with the preprocessing module and the binarization image acquisition module, and trains a gesture segmentation model by utilizing the input image output by the preprocessing module so as to obtain an optimized segmentation model and output a gesture segmentation result;
the binarization image acquisition module is respectively connected with the dense segmentation network training module and the gesture recognition model training module and is used for acquiring a binarized gesture image; and
the gesture recognition model training module is connected with the binarization image acquisition module and is used for training a gesture recognition model with the binarized gesture image, so as to obtain an optimized gesture recognition model and output a gesture classification result;
in the dense segmentation network training module, the dense segmentation network comprises an encoder and a decoder; the encoder further comprises a deep convolutional neural network module and an improved cavity space pyramid pooling module; the improved cavity space pyramid pooling module has two modes, a parallel mode and a cascade mode; in the parallel mode, the input feature maps are feature-encoded with different void ratios so as to acquire multi-scale information of the gesture; in the cascade mode, each layer except the first layer and the second layer connects the output of the parallel mode in series with the output of the previous layer; deconvolutions with different void ratios are then connected with the output of the parallel mode from bottom to top.
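One plausible reading of the parallel and cascade modes of the improved cavity space pyramid pooling module is sketched below. The dilation rates, channel widths, and the 1×1 fusion used for the series connection are assumptions; the bottom-up deconvolution branch mentioned in claim 8 is omitted for brevity.

```python
# Hedged sketch of the improved cavity space pyramid pooling module of claim 8.
import torch
import torch.nn as nn

class ImprovedASPP(nn.Module):
    def __init__(self, in_ch=2048, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        # Parallel mode: atrous convolutions with different void ratios.
        self.parallel = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r) for r in rates
        ])
        # Cascade mode: from the 3rd layer on, fuse the parallel output with the
        # previous layer's output (read here as channel concatenation + 1x1 conv).
        self.fuse = nn.ModuleList([
            nn.Conv2d(2 * out_ch, out_ch, kernel_size=1) for _ in rates[2:]
        ])
        self.project = nn.Conv2d(len(rates) * out_ch, out_ch, kernel_size=1)

    def forward(self, x):
        branches = [conv(x) for conv in self.parallel]
        outs = branches[:2]          # layers 1 and 2: parallel output only
        prev = branches[1]
        for branch, fuse in zip(branches[2:], self.fuse):
            prev = fuse(torch.cat([branch, prev], dim=1))
            outs.append(prev)
        return self.project(torch.cat(outs, dim=1))

# Usage sketch on a feature map from the fourth residual group.
out = ImprovedASPP()(torch.rand(1, 2048, 32, 32))
```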
9. The deep learning based gesture image segmentation and recognition apparatus of claim 8, wherein the deep convolutional neural network module comprises a 7×7 convolution kernel, a 3×3 convolution kernel, and 4 residual groups; the 4 residual groups are as follows: the first residual group has 3 residual blocks, each with 3 layers, namely a 1×1×64 convolution kernel, a 3×3×64 convolution kernel and a 1×1×256 convolution kernel, 9 layers in total, with void ratio d=1 and step size s=2; the second residual group has 4 residual blocks, each with 3 layers, namely a 1×1×128 convolution kernel, a 3×3×128 convolution kernel and a 1×1×512 convolution kernel, 12 layers in total, with void ratio d=1 and step size s=1; the third residual group has 6 residual blocks, each with 3 layers, namely a 1×1×256 convolution kernel, a 3×3×256 convolution kernel and a 1×1×1024 convolution kernel, 18 layers in total, with void ratio d=2 and step size s=1; the fourth residual group has 3 residual blocks, each with 3 layers, namely a 1×1×512 convolution kernel, a 3×3×512 convolution kernel and a 1×1×2048 convolution kernel, 9 layers in total, with void ratio d=4 and step size s=1.
10. The deep learning based gesture image segmentation and recognition apparatus according to claim 8, wherein the gesture recognition model training module utilizes a gesture recognition network consisting of three convolution layers with ReLU activation and max pooling (MaxPooling) for feature extraction, a fully connected layer and a Softmax layer.
CN202111016595.6A 2021-08-31 2021-08-31 Gesture image segmentation and recognition method and device based on deep learning Active CN113780140B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111016595.6A CN113780140B (en) 2021-08-31 2021-08-31 Gesture image segmentation and recognition method and device based on deep learning

Publications (2)

Publication Number Publication Date
CN113780140A (en) 2021-12-10
CN113780140B (en) 2023-08-04

Family

ID=78840393


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012194659A (en) * 2011-03-15 2012-10-11 Shinsedai Kk Gesture recognition device, gesture recognition method, and computer program
KR20180130869A (en) * 2017-05-30 2018-12-10 주식회사 케이티 CNN For Recognizing Hand Gesture, and Device control system by hand Gesture
CN108334814A (en) * 2018-01-11 2018-07-27 浙江工业大学 A kind of AR system gesture identification methods based on convolutional neural networks combination user's habituation behavioural analysis
CN110728682A (en) * 2019-09-09 2020-01-24 浙江科技学院 Semantic segmentation method based on residual pyramid pooling neural network
CN112950652A (en) * 2021-02-08 2021-06-11 深圳市优必选科技股份有限公司 Robot and hand image segmentation method and device thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Qian, et al. Temporal Segment Connection Network for Action Recognition. IEEE Access, 2020, vol. 8, pp. 179118-179127. *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant