CN112966672A - Gesture recognition method under complex background - Google Patents
Gesture recognition method under complex background
- Publication number
- CN112966672A (application number CN202110473809.6A)
- Authority
- CN
- China
- Prior art keywords
- gesture
- segmentation
- complex background
- network
- gesture recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/107—Static hand or arm
- G06V40/113—Recognition of static hand signs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
A gesture recognition method under a complex background uses a semantic segmentation network based on an encoder-decoder structure to extract features from a gesture picture data set containing complex backgrounds and to output a hand segmentation map; a two-channel classification network then extracts features from the hand segmentation map and the original gesture image data set to identify the gesture category. The invention adds multi-scale context information to the semantic segmentation network of the encoder-decoder structure, improving semantic segmentation performance, and introduces depthwise separable convolution into the segmentation network, which greatly reduces the computation cost and the model's hardware requirements, making the whole gesture recognition network lighter.
Description
Technical Field
The invention relates to a target segmentation recognition technology, in particular to a gesture recognition method under a complex background.
Background
Humans have communicated with sign language since ancient times; gestures are as old as human civilization itself. Gestures can express almost any word or meaning to be communicated, so even with established writing systems, people around the world continue to express themselves with gestures.
In recent years, with the development of machine vision, human-computer interaction has become ever more closely tied to daily life. Gestures are a common way for people to communicate; they are vital to natural communication between humans and machines and provide a more comfortable experience for operators. In particular, gestures can provide a more intuitive way to interact with a computer, which has drawn the attention of researchers.
Gestures are used to convey information, and gesture recognition has been an important research area of machine vision. Gesture recognition may provide services to a particular group, such as deaf or hearing impaired people. In addition, the method has wide application prospect in the fields of intelligent driving, machine control, virtual reality and the like.
In practical applications, gesture recognition is challenged by variations in gesture angle, size, skin color, and illumination intensity, as well as by the environment surrounding the gesture. The background of a gesture image can be divided into a simple background, which contains no noise, and a complex background, which does. In practical scenarios, a high-precision solution for gesture recognition in a complex background is still lacking, so achieving high-precision gesture recognition under a complex background has great practical significance.
Disclosure of Invention
The invention aims to provide a gesture recognition method under a complex background, which can accurately recognize the category of a gesture under the complex background and reduce the manual recognition cost.
In order to achieve the above object, the present invention provides a gesture recognition method under a complex background, comprising:
adopting a semantic segmentation network based on an encoding and decoding structure to extract the characteristics of a gesture picture data set containing a complex background and outputting a hand segmentation picture;
and performing feature extraction on the hand part segmentation image and the original gesture image data set based on a two-channel classification network to identify the gesture category.
The gesture picture data set containing the complex background meets preset experimental requirements, which include: each image of the data set has a corresponding ground-truth image, and each group of images is performed by a different subject; the images of the data set are acquired under very challenging conditions.
The semantic segmentation network based on the encoder-decoder structure comprises: a 3 × 3 convolutional layer, four bottleneck residual modules, an atrous spatial pyramid pooling (ASPP) module, and a decoder module;
the 3 × 3 convolutional layer, the four bottleneck residual modules and the ASPP module are connected in sequence;
the up-sampled features of the second bottleneck residual module are fused with the output features of the ASPP module, and the fused features serve as the input of the decoder module.
Each bottleneck residual module comprises three bottleneck residual units connected in sequence;
the second and third bottleneck residual modules apply a down-sampling operation to capture semantic information;
the features output by the second bottleneck residual module undergo an up-sampling operation to obtain shallow detail features;
the fourth bottleneck residual module applies atrous (dilated) convolutions of different rates to obtain more context information.
The bottleneck residual unit comprises: two 1 × 1 convolutional layers and a depthwise separable convolution structure;
the depthwise separable convolution structure comprises a channel-wise convolution (Depthwise Conv) and a point-wise convolution (1 × 1 Conv), each followed by a batch normalization operation and a ReLU activation function.
The ASPP module captures multi-scale semantic information through four parallel atrous convolutions and a global pooling operation; the features extracted by each parallel layer are fused together by a concatenation module to obtain deep semantic features.
The decoder module fuses the shallow detail features and the deep semantic features, refines the fused features through two convolutional layers, and finally outputs a hand segmentation map with clear contours through an up-sampling operation.
The two-channel classification network includes: two identical shallow convolutional neural networks, a cascade network layer and a classification network layer;
the hand part segmentation image and the original gesture image output by the semantic segmentation network are used as the input of two identical shallow layer convolutional neural networks of a double-channel classification network, the shape characteristic and the color characteristic of the hand part are obtained through the two parallel shallow layer convolutional neural networks, the extracted characteristics are fused together through a cascade network layer to be used as the input of a final classification network layer, and the final gesture recognition is realized through the classification network layer.
The loss of the semantic segmentation network is calculated with the binary cross-entropy formula:

$L_{seg} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$

where $N$ is the number of all samples, and $y_i$ and $p_i$ respectively represent the ground-truth label pixel values and the predicted probability map of the $i$th picture;
the loss of the two-channel classification network is calculated by adopting the following formula:
where N is the number of all samples, K represents the number of all gesture classes, yikRepresenting the true probability, p, that the ith sample belongs to the class jikRepresenting the prediction probability that the ith sample belongs to the class j.
The hand segmentation result is evaluated with preset evaluation criteria, including: mean intersection over union (mIoU), model size, and floating-point operations per second (FLOPS);
the mean intersection over union mIoU is defined as:

$mIoU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$

where $k+1$ represents the number of categories in the image (here two: hand region and non-hand region), and $p_{ij}$ represents the number of pixels of class $i$ predicted as class $j$;
the model size and the floating-point operations per second (FLOPS) are used to further evaluate the feasibility of the model;
evaluating the gesture recognition result by adopting a preset evaluation standard; the preset evaluation criteria include: accuracy Accuracy, macroscopic F1-fraction Macro-F1, model size ModelSize and floating point operation times per second FLOPS;
the Accuracy is defined as:
in the formula, TP represents the number of samples for which the true label is a positive example and is predicted as a positive example; TN represents the number of samples with the true label as a negative case and predicted as a negative case; total represents the number of all samples;
the macroscopic F1-Score Macro-F1 is defined as the average of the corresponding F1-scores (F1-Score) of all gesture categories:
in the formula, C represents all gesture categories, F1-ScoreiF1-score representing the ith gesture category.
The invention fuses shallow detail features and deep semantic features through a semantic segmentation network based on an encoder-decoder structure, correctly locating the hand region while segmenting hands with clear contours; a two-channel classification network separately extracts the features of the hand segmentation map and the original gesture image, and the fused features are classified to improve gesture recognition accuracy. Multi-scale context information added to the semantic segmentation network of the encoder-decoder structure improves segmentation performance, while depthwise separable convolution introduced into the segmentation network greatly reduces the computation cost and the model's hardware requirements, making the whole gesture recognition network lighter.
Drawings
Fig. 1 is a schematic general flow chart of a gesture recognition method under a complex background according to the present invention.
Fig. 2 is a schematic diagram of a network framework used in a gesture recognition method under a complex background according to the present invention.
Fig. 3 is a schematic diagram of a depth separable convolution module provided by the present invention.
Fig. 4 is a schematic diagram of a bottleneck residual unit with depth separable convolution according to the present invention.
Fig. 5 is a schematic diagram of the atrous spatial pyramid pooling (ASPP) module provided by the present invention.
FIG. 6 is a schematic diagram showing the comparison between the hand segmentation result provided by the present invention and other algorithm results.
Detailed Description
The preferred embodiment of the present invention will be described in detail below with reference to fig. 1 to 6.
As shown in fig. 1, the gesture recognition method in a complex background provided in this embodiment includes the following steps:
and step S1, collecting a data set for gesture recognition in a complex background.
Specifically, the collected image data set for gesture recognition under a complex background meets preset experimental requirements: each image of the data set has a corresponding ground-truth image, and each group of images is performed by a different subject; the images are acquired under very challenging conditions, such as variations in lighting, background objects close to skin tone, and occlusion by hands and faces of different shapes and sizes.
And step S2, extracting the features of the data set by adopting a semantic segmentation network based on an encoding and decoding structure, and outputting a hand segmentation graph.
As shown in fig. 2, the semantic segmentation network based on the encoder-decoder structure is part (a) of the figure and specifically includes: a 3 × 3 convolutional layer, four bottleneck residual modules, an atrous spatial pyramid pooling (ASPP) module, and a simple decoder module.
As shown in fig. 3, a depthwise separable convolution structure (DepS Conv) is applied in the semantic segmentation network based on the encoder-decoder structure to reduce the computation cost of the model and enable hand segmentation under a complex background with limited computing resources. The depthwise separable convolution structure consists of a channel-wise convolution (Depthwise Conv) and a point-wise convolution (1 × 1 Conv). Both convolutions are followed by a batch normalization operation and a ReLU activation function. Batch normalization speeds up network learning while reducing gradient vanishing.
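The cost saving from the depthwise separable structure can be checked with a quick parameter count; the layer sizes below are illustrative, not taken from the patent.

```python
# Parameter comparison between a standard convolution and a depthwise
# separable convolution (channel-wise + point-wise), as in the DepS Conv block.

def standard_conv_params(k, c_in, c_out):
    # A k x k standard convolution mixes space and channels in one step.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Channel-wise k x k convolution (one filter per input channel),
    # followed by a 1 x 1 point-wise convolution that mixes channels.
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 128          # illustrative layer sizes
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep, round(std / sep, 1))  # 73728 8768 8.4
```

For a 3 × 3 kernel the separable form costs roughly an order of magnitude less, which is why the patent reports a much smaller model size than standard-convolution baselines.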
The 3 × 3 convolutional layer and the four bottleneck residual modules are connected in sequence to form a residual network that extracts the feature information of the image. The specific structure is shown in table 1; ResBlock_1 denotes the first bottleneck residual module, and each bottleneck residual module is a cascade of three bottleneck residual units. The structure of the bottleneck residual unit is shown in fig. 4; it consists of two 1 × 1 convolutional layers and a depthwise separable convolution structure, where the 1 × 1 convolutional layers add nonlinearity to improve the expressive power of the network while also reducing dimensionality.
The second and third bottleneck residual modules apply a down-sampling operation to capture semantic information. Each residual unit of the last bottleneck residual module applies a different atrous (dilated) convolution rate to capture more context information.
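How different atrous rates enlarge the context seen by the last residual module can be sketched with the standard effective-kernel-size formula; the rates used below are illustrative, since the patent does not list its rates in the text.

```python
# Effective kernel size of an atrous (dilated) convolution:
#   k_eff = k + (k - 1) * (r - 1)
# where k is the kernel size and r the dilation rate. Larger rates widen the
# context seen by each output pixel at the same parameter cost.

def effective_kernel(k, r):
    return k + (k - 1) * (r - 1)

for rate in (1, 2, 4, 8):  # illustrative rates, not the patent's
    print(rate, effective_kernel(3, rate))
# 1 -> 3, 2 -> 5, 4 -> 9, 8 -> 17
```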
TABLE 1
As shown in fig. 5, the atrous spatial pyramid pooling (ASPP) module captures multi-scale semantic information through four parallel atrous convolutions and a global pooling operation (Image Pooling), and the features extracted by each parallel layer are fused together by a concatenation module. The global pooling operation obtains context information from a larger receptive field.
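At the shape level, the ASPP fusion amounts to concatenating the four atrous branches and the broadcast global-pooling branch along the channel axis; the sizes and random stand-in features below are illustrative, not the patent's.

```python
import numpy as np

# Shape-level sketch of ASPP fusion: four parallel atrous branches plus a
# global pooling branch each yield an (H, W, C) feature map; concatenation
# stacks them along the channel axis. Branch contents are random stand-ins.

H, W, C = 20, 20, 64
branches = [np.random.rand(H, W, C) for _ in range(4)]  # four atrous branches

# Global image pooling: average over H and W, then broadcast back to (H, W, C)
x = np.random.rand(H, W, C)
pooled = x.mean(axis=(0, 1), keepdims=True)   # (1, 1, C)
pooled = np.broadcast_to(pooled, (H, W, C))   # upsample by replication
branches.append(pooled)

fused = np.concatenate(branches, axis=-1)     # (H, W, 5 * C)
print(fused.shape)  # (20, 20, 320)
```

In the real network a 1 × 1 convolution would follow the concatenation to mix the fused channels; here only the shape bookkeeping is shown.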
As shown in fig. 2(a), the decoder module fuses the shallow detail features and the deep semantic features, refines the fused features through two convolutional layers, and finally outputs a hand segmentation map with clear contours through an up-sampling operation. The shallow detail features are obtained by up-sampling the features output by the second bottleneck residual module; the deep semantic features are the fused features of the atrous spatial pyramid pooling (ASPP) module.
And step S3, extracting the features of the hand segmentation maps and the original pictures with a two-channel classification network, and identifying the gesture category.
As shown in fig. 2(b), the two-channel classification network comprises two parallel shallow convolutional neural networks (CNNs), a concatenation layer, and a classification layer. The two parallel shallow CNNs extract the features of the hand segmentation map and the original gesture image respectively, the concatenation layer fuses the extracted features, and the classification layer performs the final gesture recognition.
The structure of each shallow CNN is shown in table 2; it consists of four 3 × 3 convolutional layers, four pooling layers, and two fully connected layers. The pooling layers mainly implement down-sampling to expand the receptive field; they also speed up network computation and reduce overfitting.
TABLE 2
| Network layer name | Output feature size | Network layer type |
| --- | --- | --- |
| Input | 320×320×3 | — |
| Conv2d_1 | 320×320×16 | convolution |
| Pooling2d_1 | 106×106×16 | max-pooling |
| Conv2d_2 | 106×106×32 | convolution |
| Pooling2d_2 | 35×35×32 | max-pooling |
| Conv2d_3 | 35×35×64 | convolution |
| Pooling2d_3 | 11×11×64 | max-pooling |
| Conv2d_4 | 9×9×128 | convolution |
| Pooling2d_4 | 128 | global average pooling |
| Dense_1 | 64 | fully connected |
| Dense_2 | 64 | fully connected |
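The spatial sizes in table 2 can be reproduced with the usual convolution/pooling shape formulas; the 3 × 3 window with stride 3 for max-pooling is inferred from the listed sizes (320 → 106 → 35 → 11), not stated in the patent.

```python
# Reproducing the spatial sizes of table 2. Convolutions use 'same' padding
# except the last (valid); the pooling window/stride of 3 is inferred from
# the listed sizes, since the patent does not state them.

def pool(size, window=3, stride=3):
    return (size - window) // stride + 1

def conv(size, kernel=3, padding="same"):
    return size if padding == "same" else size - kernel + 1

s1 = conv(320)                   # Conv2d_1: 320
s2 = pool(s1)                    # Pooling2d_1: 106
s3 = conv(s2)                    # Conv2d_2: 106
s4 = pool(s3)                    # Pooling2d_2: 35
s5 = conv(s4)                    # Conv2d_3: 35
s6 = pool(s5)                    # Pooling2d_3: 11
s7 = conv(s6, padding="valid")   # Conv2d_4: 9
print([s1, s2, s3, s4, s5, s6, s7])  # [320, 106, 106, 35, 35, 11, 9]
```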
In this embodiment, the semantic segmentation network based on the encoder-decoder structure and the two-channel classification network are trained with the TensorFlow framework on a GeForce RTX 3080 GPU server. The networks are trained from scratch without pre-trained weights; training pictures are resized to 320 × 320, and the data is augmented with operations such as horizontal/vertical flipping and scaling. All experiments are trained with the Adam optimizer, with an initial learning rate of 0.001, a weight decay of 0, and a batch size of 8.
In this embodiment, the loss of the semantic segmentation network is calculated with the binary cross-entropy formula:

$L_{seg} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$

where $N$ is the number of all samples, and $y_i$ and $p_i$ respectively represent the ground-truth label pixel values and the predicted probability map of the $i$th picture.
The loss of the two-channel classification network is calculated with the categorical cross-entropy formula:

$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik} \log p_{ik}$

where $N$ is the number of all samples, $K$ is the number of gesture classes, $y_{ik}$ represents the true probability that the $i$th sample belongs to class $k$, and $p_{ik}$ represents the predicted probability that the $i$th sample belongs to class $k$.
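Reading these losses as binary and categorical cross-entropy (an assumption — the patent's formula images are not reproduced in the text), a minimal NumPy sketch:

```python
import numpy as np

# Minimal sketch of the two losses, assuming binary cross-entropy for the
# segmentation network and categorical cross-entropy for the classification
# network. The clip avoids log(0) on saturated predictions.

def segmentation_loss(y, p, eps=1e-7):
    # y: ground-truth pixel labels in {0, 1}; p: predicted probability map
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def classification_loss(y, p, eps=1e-7):
    # y: one-hot true labels, shape (N, K); p: predicted class probabilities
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(np.sum(y * np.log(p), axis=1))

y_pix = np.array([1.0, 0.0, 1.0, 0.0])   # toy pixel labels
p_pix = np.array([0.9, 0.1, 0.8, 0.2])
print(round(float(segmentation_loss(y_pix, p_pix)), 4))  # 0.1643

y_cls = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])     # toy one-hot labels
p_cls = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(round(float(classification_loss(y_cls, p_cls)), 4))  # 0.2899
```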
Preferably, the hand segmentation result is evaluated with preset evaluation criteria, including: mean intersection over union (mIoU), model size, and floating-point operations per second (FLOPS).
The mean intersection over union (mIoU) is defined as:

$mIoU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$

where $k+1$ represents the number of categories in the image (here two: hand region and non-hand region), and $p_{ij}$ represents the number of pixels of class $i$ predicted as class $j$.
The model size and the number of floating-point operations per second (FLOPS) are used to further evaluate the feasibility of the model.
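The mIoU definition above can be computed directly from a pixel confusion matrix; the matrix values below are toy numbers for illustration.

```python
import numpy as np

# mIoU from a pixel confusion matrix M, where M[i, j] is the number of pixels
# of true class i predicted as class j. Per class,
#   IoU_i = M[i, i] / (row_i + col_i - M[i, i]),
# and mIoU averages IoU over the classes (here two: non-hand and hand).

def mean_iou(confusion):
    confusion = np.asarray(confusion, dtype=float)
    diag = np.diag(confusion)
    union = confusion.sum(axis=1) + confusion.sum(axis=0) - diag
    return float(np.mean(diag / union))

# Toy confusion matrix: rows = true class, cols = predicted class
M = [[40, 10],   # non-hand: 40 correct, 10 mispredicted as hand
     [5, 45]]    # hand: 45 correct, 5 mispredicted as non-hand
print(round(mean_iou(M), 4))  # 0.7386
```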
The gesture recognition result is evaluated with preset evaluation criteria, including: accuracy, macro F1-score (Macro-F1), model size, and floating-point operations per second (FLOPS).
The accuracy is defined as:

$Accuracy = \frac{TP + TN}{Total}$

where $TP$ represents the number of samples whose true label is positive and which are predicted positive; $TN$ represents the number of samples whose true label is negative and which are predicted negative; and $Total$ represents the number of all samples.
The macro F1-score (Macro-F1) is defined as the average of the F1-scores of all gesture categories:

$Macro\text{-}F1 = \frac{1}{C}\sum_{i=1}^{C} F1\text{-}Score_i$

where $C$ represents the number of gesture categories and $F1\text{-}Score_i$ is the F1-score of the $i$th gesture category.
As shown in table 3, the hand segmentation results of this embodiment under a complex background are compared with the indicators of other algorithms; the values of the best-performing method are bolded. All three selected evaluation indicators improve significantly, especially model size and floating-point operations per second. The hand segmentation performance of the proposed semantic segmentation network based on the encoder-decoder structure is superior to that of the other algorithms, while the model is very small and has low hardware requirements.
TABLE 3
As shown in fig. 6, the hand segmentation results under a complex background provided by this embodiment are compared with the results of other algorithms. The first and second columns show the original input images and the corresponding hand mask images respectively, the third column shows the results of the proposed algorithm, and the remaining columns show the comparison algorithms. The figure gives an intuitive view of the segmentation results: the hand segmentation method of this embodiment achieves a good segmentation effect even when the environment around the gesture is complex.
As shown in table 4, the gesture recognition results of this embodiment under a complex background are compared with the indicators of other algorithms; the values of the best-performing method are bolded. Performance improves significantly on all four selected evaluation indicators. The gesture recognition method under a complex background provided by the invention performs better than the other algorithms, while the model is very small and has low hardware requirements.
TABLE 4
| Method | Accuracy | Macro-F1 | Model size | FLOPS |
| --- | --- | --- | --- | --- |
| ResNet-101 | 0.8333 | 0.8375 | 162.81M | 85041593 |
| ShuffleNetV2 | 0.8617 | 0.8612 | 7.4M | 3826374 |
| MobileNetV3 | 0.8752 | 0.8758 | 11.64M | 6056813 |
| HGR-Net | 0.8713 | 0.8810 | 1.91M | 991530 |
| Ours | 0.9117 | 0.9114 | 1.85M | 950306 |
The recognition method provided by this embodiment fuses shallow detail features and deep semantic features through a semantic segmentation network based on an encoder-decoder structure, correctly locating the hand region while segmenting hands with clear contours; a two-channel classification network separately extracts the features of the hand segmentation map and the original gesture image, and the fused features are classified to improve gesture recognition accuracy.
In this embodiment, multi-scale context information is added to the semantic segmentation network of the encoder-decoder structure, improving segmentation performance; meanwhile, depthwise separable convolution is introduced into the segmentation network, greatly reducing the computation cost and the model's hardware requirements and making the whole gesture recognition network lighter.
In summary, the invention discloses a gesture recognition method under a complex background based on semantic segmentation and a two-channel classification network. In the semantic segmentation network, after a residual network extracts the feature map of the hand region, an atrous spatial pyramid pooling (ASPP) module and a decoder module are added to obtain a better hand segmentation map; a two-channel classification network is constructed, and the features extracted from the hand segmentation map and the original gesture image are fused, improving gesture recognition accuracy under a complex background. Compared with other algorithms, the proposed method maintains good performance for gesture recognition under a complex background, while the model is small and has low hardware requirements.
The invention fuses shallow detail features and deep semantic features through a semantic segmentation network based on an encoder-decoder structure, correctly locating the hand region while segmenting hands with clear contours; a two-channel classification network separately extracts the features of the hand segmentation map and the original gesture image, and the fused features are classified to improve gesture recognition accuracy. Multi-scale context information added to the semantic segmentation network of the encoder-decoder structure improves segmentation performance, while depthwise separable convolution introduced into the segmentation network greatly reduces the computation cost and the model's hardware requirements, making the whole gesture recognition network lighter.
It should be noted that, in the embodiments of the present invention, the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial", "circumferential", etc. indicate the orientation or positional relationship shown in the drawings, and are only for convenience of describing the embodiments, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.
Claims (10)
1. A gesture recognition method under a complex background, comprising:
adopting a semantic segmentation network based on an encoding and decoding structure to perform feature extraction on a gesture picture data set containing a complex background and to output a hand segmentation map;
and performing feature extraction on the hand segmentation map and the original gesture picture data set based on a two-channel classification network to recognize the gesture category.
2. The gesture recognition method under a complex background according to claim 1, wherein the gesture picture data set containing a complex background conforms to preset experiment requirements, and the preset experiment requirements comprise: each image of the data set is provided with a corresponding ground-truth image; each group of images is completed by a different subject; and the images of the data set are acquired under very challenging conditions.
3. The gesture recognition method under a complex background according to claim 2, wherein the semantic segmentation network based on the encoding and decoding structure comprises: a 3 × 3 convolutional layer, four bottleneck residual modules, an atrous spatial pyramid pooling (ASPP) module, and a decoder module;
the 3 × 3 convolutional layer, the four bottleneck residual modules, and the ASPP module are connected in sequence;
and the output of the second bottleneck residual module is up-sampled and fused with the output features of the ASPP module, and the fused features serve as the input of the decoder module.
4. The gesture recognition method under a complex background according to claim 3, wherein each bottleneck residual module comprises three bottleneck residual units connected in sequence;
the second and third bottleneck residual modules perform down-sampling operations to capture semantic information;
the features output by the second bottleneck residual module undergo an up-sampling operation to obtain shallow detail features;
and the fourth bottleneck residual module applies atrous (dilated) convolutions with different rates to obtain more context information.
5. The gesture recognition method under a complex background according to claim 4, wherein the bottleneck residual unit comprises: two 1 × 1 convolutional layers and a depthwise separable convolution structure;
the depthwise separable convolution structure comprises: a depthwise convolution (Depthwise Conv) and a pointwise convolution (1 × 1 Conv), each followed by a batch normalization (Batch Normalization) operation and a ReLU activation function.
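A depthwise separable convolution is used here as a lightweight substitute for a standard convolution. A minimal sketch of the parameter savings (the channel sizes 64 → 128 are illustrative assumptions, not values stated in the claims):

```python
def standard_conv_params(k, c_in, c_out):
    # A standard k x k convolution mixes space and channels in one step.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise step: one k x k filter per input channel (spatial only).
    depthwise = k * k * c_in
    # Pointwise step: a 1 x 1 convolution that mixes channels.
    pointwise = c_in * c_out
    return depthwise + pointwise

# Illustrative channel sizes (not specified in the claims).
std = standard_conv_params(3, 64, 128)        # 73728 parameters
sep = depthwise_separable_params(3, 64, 128)  # 576 + 8192 = 8768 parameters
print(std, sep, round(std / sep, 1))
```

For a 3 × 3 kernel this factorization cuts the parameter count by roughly 8–9×, which is why it appears in lightweight segmentation encoders.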
6. The gesture recognition method under a complex background according to claim 5, wherein the ASPP module captures multi-scale semantic information through four parallel atrous convolutions and one global pooling operation, and the features extracted by each parallel branch are merged together by a concatenation module to obtain deep semantic features.
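The multi-scale behavior of the parallel atrous branches comes from the way dilation enlarges the effective receptive field without adding parameters. A small sketch; the example rates (1, 6, 12, 18) are common DeepLab-style choices and an assumption, since the claims do not list the rates used:

```python
def dilated_receptive_field(k, d):
    # Effective receptive field of a k x k convolution with dilation rate d:
    # the kernel taps are spread d pixels apart.
    return k + (k - 1) * (d - 1)

# Assumed dilation rates for the four parallel ASPP branches.
for d in (1, 6, 12, 18):
    print(d, dilated_receptive_field(3, d))
```

Each branch thus "sees" a different spatial extent of the hand and its surroundings, and concatenating the branch outputs yields the multi-scale deep semantic features described in the claim.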
7. The gesture recognition method under a complex background according to claim 6, wherein the decoder module fuses the shallow detail features and the deep semantic features; the fused features are refined through two convolutional layers, and finally a hand segmentation map with a clear contour is output through an up-sampling operation.
8. The gesture recognition method under a complex background according to claim 7, wherein the two-channel classification network comprises: two identical shallow convolutional neural networks, a cascade network layer, and a classification network layer;
the hand segmentation map output by the semantic segmentation network and the original gesture picture are used as the inputs of the two identical shallow convolutional neural networks of the two-channel classification network; the shape features and color features of the hand are obtained through the two parallel shallow convolutional neural networks; the extracted features are fused together by the cascade network layer and used as the input of the final classification network layer; and the final gesture recognition is realized by the classification network layer.
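The cascade layer described above can be read as a plain channel-wise concatenation of the two branch outputs. A minimal numpy sketch; the per-branch feature size of 128 is an illustrative assumption, not a value from the claims:

```python
import numpy as np

# Hypothetical feature vectors from the two parallel shallow CNNs: one branch
# sees the hand segmentation map (shape cues), the other the original image
# (color cues). Batch size 1, 128 features per branch (both assumed).
shape_features = np.random.rand(1, 128)
color_features = np.random.rand(1, 128)

# The cascade (concatenation) layer joins the two along the feature axis,
# producing the input of the final classification layer.
fused = np.concatenate([shape_features, color_features], axis=1)
print(fused.shape)
```

The classifier then operates on the 256-dimensional fused vector, so shape and color evidence are weighed jointly rather than in separate pipelines.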
9. The gesture recognition method under a complex background according to claim 8, wherein the loss of the semantic segmentation network is calculated by the following formula:

Loss_seg = −(1/N) · Σ_{i=1}^{N} [ y_i · log(p_i) + (1 − y_i) · log(1 − p_i) ]

where N is the number of all samples, and y_i and p_i respectively denote the ground-truth label pixel values and the predicted probability map of the i-th picture;
the loss of the two-channel classification network is calculated by the following formula:

Loss_cls = −(1/N) · Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik · log(p_ik)

where N is the number of all samples, K denotes the number of all gesture classes, y_ik denotes the true probability that the i-th sample belongs to class k, and p_ik denotes the predicted probability that the i-th sample belongs to class k.
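Reading claim 9 as the standard binary and categorical cross-entropy losses (the formulas themselves were dropped from this copy of the claims, so this is a reconstruction under that reading), a minimal numpy sketch:

```python
import numpy as np

def segmentation_loss(y, p, eps=1e-7):
    # Binary cross-entropy between ground-truth mask values y and the
    # predicted probability map p, averaged over all samples/pixels.
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def classification_loss(y, p, eps=1e-7):
    # Categorical cross-entropy: y is one-hot with shape (N, K),
    # p holds predicted class probabilities with shape (N, K).
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(np.sum(y * np.log(p), axis=1))

# Toy values for illustration only.
y_mask = np.array([1.0, 0.0, 1.0])
p_mask = np.array([0.9, 0.1, 0.8])
print(segmentation_loss(y_mask, p_mask))

y_cls = np.array([[1.0, 0.0], [0.0, 1.0]])
p_cls = np.array([[0.7, 0.3], [0.2, 0.8]])
print(classification_loss(y_cls, p_cls))
```

Both losses shrink toward zero as the predicted probabilities approach the ground truth, which is the behavior the training procedure relies on.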
10. The gesture recognition method under a complex background according to claim 9, wherein the hand segmentation result is evaluated by preset evaluation criteria; the preset evaluation criteria comprise: mean intersection over union (mIoU), model size (Model Size), and floating-point operations per second (FLOPS);

the mean intersection over union mIoU is defined as:

mIoU = (1/(k+1)) · Σ_{i=0}^{k} p_ii / ( Σ_{j=0}^{k} p_ij + Σ_{j=0}^{k} p_ji − p_ii )

where k + 1 denotes the number of classes in the image (here two classes: the hand region and the non-hand region), and p_ij denotes the number of pixels of true class i predicted as class j in the image;
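A minimal pure-Python sketch of the mIoU definition above, for the two-class (hand / non-hand) case; the toy masks are illustrative:

```python
def mean_iou(pred, truth, num_classes=2):
    # p[i][j]: number of pixels whose true class is i and predicted class is j.
    p = [[0] * num_classes for _ in range(num_classes)]
    for t_row, pr_row in zip(truth, pred):
        for t, pr in zip(t_row, pr_row):
            p[t][pr] += 1
    ious = []
    for i in range(num_classes):
        inter = p[i][i]
        # union = row sum + column sum - diagonal (counted twice)
        union = sum(p[i]) + sum(row[i] for row in p) - p[i][i]
        ious.append(inter / union if union else 0.0)
    return sum(ious) / num_classes

# Toy 2 x 2 masks: 0 = non-hand, 1 = hand.
truth = [[0, 0], [1, 1]]
pred  = [[0, 1], [1, 1]]
print(mean_iou(pred, truth))  # IoU(0) = 1/2, IoU(1) = 2/3 -> mIoU ~ 0.583
```

Averaging the per-class IoU prevents the large non-hand background from dominating the score, which a plain pixel accuracy would allow.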
the model size (Model Size) and the floating-point operations per second (FLOPS) are used to further evaluate the feasibility of the model;
the gesture recognition result is evaluated by preset evaluation criteria; the preset evaluation criteria comprise: accuracy (Accuracy), macro F1-score (Macro-F1), model size (Model Size), and floating-point operations per second (FLOPS);

the accuracy Accuracy is defined as:

Accuracy = (TP + TN) / Total

where TP denotes the number of samples whose true label is a positive example and which are predicted as positive examples; TN denotes the number of samples whose true label is a negative example and which are predicted as negative examples; and Total denotes the number of all samples;
the macro F1-score Macro-F1 is defined as the average of the F1-scores corresponding to all gesture categories:

Macro-F1 = (1/C) · Σ_{i=1}^{C} F1-Score_i

where C denotes the number of all gesture categories, and F1-Score_i denotes the F1-score of the i-th gesture category.
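A minimal pure-Python sketch of the two classification metrics above, computing per-class F1 from precision and recall and averaging over categories; the toy labels are illustrative:

```python
def accuracy(y_true, y_pred):
    # (TP + TN) / Total reduces to the fraction of correct predictions
    # when every sample is counted once.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, classes):
    # Macro-F1: compute F1 per gesture category, then take the plain average,
    # so rare categories weigh as much as frequent ones.
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(classes)

# Toy labels over three hypothetical gesture categories.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred, classes=[0, 1, 2]))
```

Reporting both metrics guards against a classifier that scores well on accuracy merely by favoring the most frequent gesture class.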
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110473809.6A CN112966672B (en) | 2021-04-29 | 2021-04-29 | Gesture recognition method under complex background |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112966672A true CN112966672A (en) | 2021-06-15 |
CN112966672B CN112966672B (en) | 2024-04-05 |
Family
ID=76281236
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110473809.6A Active CN112966672B (en) | 2021-04-29 | 2021-04-29 | Gesture recognition method under complex background |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112966672B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109214250A (en) * | 2017-07-05 | 2019-01-15 | Central South University | Static gesture recognition method based on multi-scale convolutional neural networks |
CN110781895A (en) * | 2019-10-10 | 2020-02-11 | Hubei University of Technology | Image semantic segmentation method based on convolutional neural network |
WO2020215236A1 (en) * | 2019-04-24 | 2020-10-29 | Harbin Institute of Technology (Shenzhen) | Image semantic segmentation method and system |
CN112184635A (en) * | 2020-09-10 | 2021-01-05 | Shanghai SenseTime Intelligent Technology Co., Ltd. | Target detection method, device, storage medium and equipment |
Non-Patent Citations (2)
Title |
---|
Wang Jinhe; Su Cuili; Meng Fanyun; Che Zhilong; Tan Hao; Zhang Nan: "Stereo Matching Network Based on Asymmetric Spatial Pyramid Pooling", Computer Engineering, no. 07 *
Xing Yuquan; Pan Jinyi; Wang Wei; Liu Jianfeng: "Gesture Recognition Based on Semantic Segmentation and Transfer Learning", Computer Measurement & Control, no. 04 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113298080A (en) * | 2021-07-26 | 2021-08-24 | 城云科技(中国)有限公司 | Target detection enhancement model, target detection method, target detection device and electronic device |
CN113298080B (en) * | 2021-07-26 | 2021-11-05 | 城云科技(中国)有限公司 | Target detection enhancement model, target detection method, target detection device and electronic device |
Also Published As
Publication number | Publication date |
---|---|
CN112966672B (en) | 2024-04-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022227913A1 (en) | Double-feature fusion semantic segmentation system and method based on internet of things perception | |
CN113221639B (en) | Micro-expression recognition method for representative AU (AU) region extraction based on multi-task learning | |
CN111242288B (en) | Multi-scale parallel deep neural network model construction method for lesion image segmentation | |
CN109492529A (en) | A kind of Multi resolution feature extraction and the facial expression recognizing method of global characteristics fusion | |
Islalm et al. | Recognition bangla sign language using convolutional neural network | |
CN111091130A (en) | Real-time image semantic segmentation method and system based on lightweight convolutional neural network | |
CN107239733A (en) | Continuous hand-written character recognizing method and system | |
CN108804397A (en) | A method of the Chinese character style conversion based on a small amount of target font generates | |
CN105956560A (en) | Vehicle model identification method based on pooling multi-scale depth convolution characteristics | |
CN111340814A (en) | Multi-mode adaptive convolution-based RGB-D image semantic segmentation method | |
CN112163401B (en) | Compression and excitation-based Chinese character font generation method of GAN network | |
CN111652273B (en) | Deep learning-based RGB-D image classification method | |
CN113870335A (en) | Monocular depth estimation method based on multi-scale feature fusion | |
CN110517270B (en) | Indoor scene semantic segmentation method based on super-pixel depth network | |
CN110110724A (en) | The text authentication code recognition methods of function drive capsule neural network is squeezed based on exponential type | |
CN115862045B (en) | Case automatic identification method, system, equipment and storage medium based on image-text identification technology | |
CN108537109B (en) | OpenPose-based monocular camera sign language identification method | |
CN115966010A (en) | Expression recognition method based on attention and multi-scale feature fusion | |
CN113065426A (en) | Gesture image feature fusion method based on channel perception | |
CN116502181A (en) | Channel expansion and fusion-based cyclic capsule network multi-modal emotion recognition method | |
CN116129141A (en) | Medical data processing method, apparatus, device, medium and computer program product | |
CN106203448A (en) | A kind of scene classification method based on Nonlinear Scale Space Theory | |
CN112966672B (en) | Gesture recognition method under complex background | |
CN113378938B (en) | Edge transform graph neural network-based small sample image classification method and system | |
Zhang et al. | A simple and effective static gesture recognition method based on attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||