CN112966672B - Gesture recognition method under complex background - Google Patents

Gesture recognition method under complex background

Info

Publication number
CN112966672B
CN112966672B
Authority
CN
China
Prior art keywords
gesture
features
convolution
network
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110473809.6A
Other languages
Chinese (zh)
Other versions
CN112966672A (en)
Inventor
陈昆
周薇娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University filed Critical Shanghai Maritime University
Priority to CN202110473809.6A priority Critical patent/CN112966672B/en
Publication of CN112966672A publication Critical patent/CN112966672A/en
Application granted granted Critical
Publication of CN112966672B publication Critical patent/CN112966672B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm
    • G06V40/113 Recognition of static hand signs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The gesture recognition method under a complex background adopts a semantic segmentation network based on a coding and decoding structure to extract features from a gesture picture dataset containing complex backgrounds and to output a hand segmentation map; a dual-channel classification network then extracts features from the hand segmentation map and the original gesture picture dataset and identifies the gesture category. According to the invention, multi-scale context information is added to the semantic segmentation network of the coding and decoding structure, which improves the performance of semantic segmentation; at the same time, depth separable convolution is introduced into the segmentation network, which greatly reduces the computational cost, lowers the model's requirements on hardware equipment, and makes the whole gesture recognition network more lightweight.

Description

Gesture recognition method under complex background
Technical Field
The invention relates to a target segmentation recognition technology, in particular to a gesture recognition method under a complex background.
Background
Since ancient times, humans have communicated using sign language; gestures are as old as human civilization itself. Gestures are particularly useful for expressing any word or feeling that needs to be communicated, which is why, despite well-established writing systems, people around the world continue to express themselves with gestures.
In recent years, with the development of machine vision, human-computer interaction is more closely related to the daily life of people. Gestures are a common way for people to communicate, are critical to achieving natural communication between humans and machines, and provide a more comfortable experience for operators. In particular, gestures may be used to provide more intuitive interactions with a computer, which draws the attention of researchers.
Gesture recognition has been an important area of research for machine vision for conveying information. Gesture recognition may provide services to a particular group, such as the deaf or hearing impaired. In addition, the method has wide application prospect in the fields of intelligent driving, machine control, virtual reality and the like.
In practical applications, different angles, different sizes, skin colors, illumination intensities, and environments around the gestures present significant challenges for gesture recognition. The background of the gesture image can be classified into a simple background, which refers to a background that does not contain any noise, and a complex background, which refers to a background that contains noise. There is still a lack of high precision solutions for gesture recognition in complex contexts in real scenes. Therefore, the realization of high-precision recognition of gestures in a complex background has great practical significance.
Disclosure of Invention
The invention aims to provide a gesture recognition method under a complex background, which can accurately recognize the category of gestures under the complex background and reduce the manual recognition cost.
In order to achieve the above objective, the present invention provides a gesture recognition method under a complex background, comprising:
carrying out feature extraction on a gesture picture data set containing a complex background by adopting a semantic segmentation network based on a coding and decoding structure, and outputting a hand segmentation map;
and extracting features of the hand segmentation map and the original gesture picture dataset by adopting a network based on the double-channel classification, and identifying gesture categories.
The gesture picture data set of the complex background meets the preset experiment requirements, and the preset experiment requirements comprise: the images of the dataset all bear corresponding ground truth images, each set of images being completed by a different subject; images of the dataset are acquired in very challenging situations.
The semantic segmentation network based on the coding and decoding structure comprises: a 3 x 3 convolutional layer, four bottleneck residual modules, a hole space pooling pyramid ASPP, and a decoder module;
the 3X 3 convolution layer, the four bottleneck residual modules and the cavity space pooling pyramid ASPP are sequentially connected;
and the output of the second bottleneck residual error module is fused with the features output by the cavity space pooling pyramid ASPP through the features after upsampling, and the fused features are used as the input of the decoder module.
The bottleneck residual error module comprises three bottleneck residual error units, and each bottleneck residual error unit is connected in sequence;
the second bottleneck residual module and the third bottleneck residual module are used for downsampling operation to capture semantic information;
the characteristics output by the second bottleneck residual error module are subjected to upsampling operation to obtain shallow detail characteristics;
the fourth bottleneck residual module applies different sizes of hole convolutions to obtain more context information.
The bottleneck residual unit comprises: two 1×1 convolution layers and a depth separable convolution structure;
the depth separable convolution structure includes: a channel-by-channel convolution Depthwise Conv and a point-by-point convolution 1×1 Conv, both followed by a batch normalization operation Batch Normalization and a ReLU activation function.
The hole space pooling pyramid module ASPP captures multi-scale semantic information through four parallel hole convolutions and one global pooling operation, and the features extracted by each parallel layer are fused together through a cascade module to obtain deep semantic features.
The decoder module fuses the shallow detail features and the deep semantic features together; the fused features are refined through two convolution layers, and finally the hand segmentation map with clear contours is output through an upsampling operation.
The dual channel classification network comprises: two identical shallow convolutional neural networks, a cascade network layer and a classification network layer;
the hand segmentation graph and the original gesture image output by the semantic segmentation network are used as the input of two identical shallow convolutional neural networks of the two-channel classification network, the shape characteristics and the color characteristics of the hand are obtained through the two parallel shallow convolutional neural networks, the extracted characteristics are fused together through the cascade network layer to be used as the input of the final classification network layer, and the final gesture recognition is realized through the classification network layer.
The loss of the semantic segmentation network is calculated by adopting the following formula:

L_{seg} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i\log p_i + (1-y_i)\log(1-p_i)\bigr]

where N is the number of all samples, and y_i and p_i respectively represent the true label pixel value and the predicted probability map of the i-th picture.
The loss of the two-channel classification network is calculated by adopting the following formula:

L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\log p_{ik}

where N is the number of all samples, K is the number of all gesture categories, y_{ik} represents the true probability that the i-th sample belongs to category k, and p_{ik} represents the predicted probability that the i-th sample belongs to category k.
The hand segmentation result is evaluated by adopting preset evaluation criteria; the preset evaluation criteria include: mean intersection-over-union mIOU, Model Size, and floating-point operations per second FLOPS.
The mean intersection-over-union mIOU is defined as:

mIOU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k}p_{ij}+\sum_{j=0}^{k}p_{ji}-p_{ii}}

where k+1 represents the number of categories in the image, here two categories, the hand region and the non-hand region, and p_{ij} represents the number of pixels in the image for which category i is predicted as category j.
The model size ModelSize and the floating-point operations per second FLOPS are used to further evaluate the feasibility of the model.
The gesture recognition result is evaluated by adopting preset evaluation criteria; the preset evaluation criteria include: Accuracy, macro F1-score Macro-F1, model size ModelSize, and floating-point operations per second FLOPS.
The Accuracy is defined as:

Accuracy = \frac{TP+TN}{Total}

where TP represents the number of samples whose true label is a positive example and which are predicted as positive; TN represents the number of samples whose true label is negative and which are predicted as negative; Total represents the number of all samples.
The macro F1-score Macro-F1 is defined as the average of the F1-scores (F1-Score) over all gesture categories:

Macro\text{-}F1 = \frac{1}{C}\sum_{i=1}^{C}F1\text{-}Score_i

where C represents the number of all gesture categories, and F1-Score_i represents the F1-score of the i-th gesture category.
According to the invention, the shallow detail features and the deep semantic features are fused through the semantic segmentation network based on the coding and decoding structure, so that the method is suitable for correctly positioning the hand region and simultaneously segmenting the hand with clear outline; and the features of the hand segmentation map and the original gesture image are respectively extracted by adopting a two-channel classification network, and the fused features are classified and identified, so that the gesture identification precision is improved. According to the invention, multi-scale context information is added into the semantic segmentation network of the coding and decoding structure, so that the performance of semantic segmentation is improved, meanwhile, depth separable convolution is introduced into the segmentation network, the calculation cost is greatly reduced, the requirement of a model on hardware equipment is reduced, and the whole gesture recognition network is lighter.
Drawings
Fig. 1 is a general flow diagram of a gesture recognition method under a complex background provided by the present invention.
Fig. 2 is a schematic diagram of a network used in a gesture recognition method under a complex background according to the present invention.
Fig. 3 is a schematic diagram of a depth separable convolution module provided by the present invention.
Fig. 4 is a schematic diagram of a bottleneck residual unit with depth separable convolution provided by the present invention.
Fig. 5 is a schematic diagram of a cavity-space pooling pyramid (ASPP) provided by the invention.
Fig. 6 is a schematic diagram comparing the hand segmentation result provided by the present invention with other algorithm results.
Detailed Description
The following describes a preferred embodiment of the present invention with reference to fig. 1 to 6.
The embodiment provides a method for recognizing a gesture in a complex background, as shown in fig. 1, where the method for recognizing a gesture in a complex background provided in the embodiment includes the following steps:
step S1, collecting a data set for gesture recognition under a complex background.
Specifically, the collected image dataset for recognizing the gesture in the complex background meets a preset experiment requirement, and the preset experiment requirement comprises: each image of the data set to be identified carries a corresponding ground truth value image, and each group of images is completed by a different subject; the images of each of the data sets to be identified are acquired in very challenging situations, such as variations in illumination, objects in the background that are similar to skin colors, and mutual occlusion of the hands and faces of different shapes and sizes.
And S2, carrying out feature extraction on the data set by adopting a semantic segmentation network based on a coding and decoding structure, and outputting a hand segmentation map.
As shown in fig. 2, the semantic segmentation network based on the codec structure is part (a) of the figure, and specifically includes: a 3 x 3 convolutional layer, four bottleneck residual modules, a hole space pooling pyramid (ASPP), and a simple decoder module.
As shown in fig. 3, a depth separable convolution structure (DepS Conv) is applied in the semantic segmentation network based on the codec structure to reduce the computational cost of the model, so that hand segmentation in a complex background can be realized with limited computing resources. The depth separable convolution structure is composed of a channel-by-channel convolution (Depthwise Conv) and a point-by-point convolution (1×1 Conv). Both convolutions are followed by a batch normalization operation (Batch Normalization) and a ReLU activation function. The batch normalization operation helps accelerate network learning and reduce gradient vanishing.
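The depth separable convolution block described here can be sketched in TensorFlow/Keras roughly as follows; the 3×3 depthwise kernel, the configurable stride/dilation arguments, and the function name deps_conv are illustrative assumptions rather than details taken from this description.

```python
import tensorflow as tf
from tensorflow.keras import layers

def deps_conv(x, filters, stride=1, dilation=1):
    """Depthwise Conv -> BN -> ReLU, then point-wise 1x1 Conv -> BN -> ReLU."""
    x = layers.DepthwiseConv2D(3, strides=stride, dilation_rate=dilation,
                               padding="same", use_bias=False)(x)      # channel-by-channel convolution
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)   # point-by-point 1x1 convolution
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return x
```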
And the 3 multiplied by 3 convolution layer and the four bottleneck residual modules are sequentially connected to form a residual network so as to extract the characteristic information of the image. The specific structure is shown in table 1, and resblock_1 represents a first bottleneck residual module, and each bottleneck residual module is formed by cascading three bottleneck residual units. The bottleneck residual unit has a structure shown in fig. 4, and the structure consists of two 1×1 convolution layers and a depth separable convolution structure, wherein the 1×1 convolution layers have the function of adding nonlinearity to improve the expression capability of the network and can play a role of reducing the dimension.
The second and third bottleneck residual modules apply a downsampling operation to capture semantic information. Each residual unit of the last bottleneck residual module applies a different hole convolution to capture more context information.
TABLE 1
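A minimal sketch of the bottleneck residual unit of fig. 4 is given below; the channel-reduction ratio of 4 and the projection shortcut used when shapes differ are assumptions, and the middle depthwise convolution together with the final 1×1 point-wise convolution stands in for the DepS Conv block.

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_residual_unit(x, filters, stride=1, dilation=1, reduction=4):
    """1x1 conv (reduce) -> depthwise 3x3 conv -> 1x1 conv (restore), added to a shortcut."""
    shortcut = x
    y = layers.Conv2D(filters // reduction, 1, use_bias=False)(x)        # first 1x1 conv: reduce channels
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.DepthwiseConv2D(3, strides=stride, dilation_rate=dilation,
                               padding="same", use_bias=False)(y)        # channel-by-channel convolution
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 1, use_bias=False)(y)                     # second 1x1 conv: restore channels
    y = layers.BatchNormalization()(y)
    if stride != 1 or x.shape[-1] != filters:                            # project the shortcut when shapes differ
        shortcut = layers.Conv2D(filters, 1, strides=stride, use_bias=False)(x)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))
```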
As shown in fig. 5, the hole space pooling pyramid module (ASPP) captures multi-scale semantic information through four parallel hole convolutions and a global pooling operation (Image Pooling), and the features extracted by each parallel layer are fused together through a cascade module. The global pooling operation obtains context information over a larger receptive field.
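A rough sketch of such an ASPP module follows; the dilation rates (1, 6, 12, 18) and the 256-channel width follow common ASPP configurations and are assumptions, not values stated here. The sketch also assumes a statically known feature-map size for the image-pooling branch.

```python
import tensorflow as tf
from tensorflow.keras import layers

def aspp(x, filters=256, rates=(1, 6, 12, 18)):
    """Four parallel (dilated) convolutions plus image-level pooling, concatenated and fused."""
    branches = [layers.Conv2D(filters, 1 if r == 1 else 3, dilation_rate=r,
                              padding="same", activation="relu")(x)
                for r in rates]
    pooled = layers.GlobalAveragePooling2D(keepdims=True)(x)             # image pooling branch
    pooled = layers.Conv2D(filters, 1, activation="relu")(pooled)
    pooled = layers.UpSampling2D(size=(x.shape[1], x.shape[2]),
                                 interpolation="bilinear")(pooled)       # back to the feature-map size
    y = layers.Concatenate()(branches + [pooled])                        # cascade (concatenation) module
    return layers.Conv2D(filters, 1, activation="relu")(y)               # fuse into deep semantic features
```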
As shown in fig. 2 (a), the decoder module (Decoder) fuses the shallow detail features and the deep semantic features together; the fused features are refined through two convolution layers and finally upsampled to output a hand segmentation map with a clear outline. The shallow detail features are obtained by upsampling the features output by the second bottleneck residual module; the deep semantic features are the fused features of the hole space pooling pyramid (ASPP) module.
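The decoder fusion can be sketched as below; the upsampling factors, the 48/256 channel widths, and the softmax output are assumptions used only to illustrate the fuse-refine-upsample sequence, and the two input feature maps are assumed to arrive at matching spatial sizes after upsampling.

```python
import tensorflow as tf
from tensorflow.keras import layers

def decoder(deep_feats, shallow_feats, num_classes=2, up1=4, up2=4):
    """Fuse upsampled deep semantic features with shallow detail features,
    refine with two convolutions, and upsample to the hand segmentation map."""
    deep = layers.UpSampling2D(up1, interpolation="bilinear")(deep_feats)
    shallow = layers.Conv2D(48, 1, activation="relu")(shallow_feats)     # compress shallow detail features
    y = layers.Concatenate()([deep, shallow])                            # assumes matching spatial sizes
    y = layers.Conv2D(256, 3, padding="same", activation="relu")(y)      # first refining convolution
    y = layers.Conv2D(256, 3, padding="same", activation="relu")(y)      # second refining convolution
    y = layers.Conv2D(num_classes, 1, activation="softmax")(y)           # per-pixel hand / non-hand scores
    return layers.UpSampling2D(up2, interpolation="bilinear")(y)         # hand segmentation map
```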
And S3, carrying out feature extraction on the hand segmentation graph and the original graph by adopting a two-channel classification network, and identifying gesture types.
As shown in part (b) of fig. 2, the dual channel classification network comprises: two parallel shallow neural networks (CNNs), one cascade layer, one classification layer. The two parallel shallow convolutional neural networks respectively extract the features of the hand segmentation map and the original gesture image, the cascade network layer fuses the extracted features together, and final gesture recognition is realized through the classification network layer.
The structure of the shallow neural networks (CNNs) is shown in Table 2; each is composed of four 3×3 convolutional layers, four pooling layers, and two fully connected layers. The pooling layers mainly realize the downsampling operation and expand the receptive field; they also increase the network's computation speed and reduce overfitting.
TABLE 2
Network layer name Output feature size Network layer type
Input 320×320×3
Conv2d_1 320×320×16 convolution
Pooling2d_1 106×106×16 max-pooling
Conv2d_2 106×106×32 convolution
Pooling2d_2 35×35×32 max-pooling
Conv2d_3 35×35×64 convolution
Pooling2d_3 11×11×64 max-pooling
Conv2d_4 9×9×128 convolution
Pooling2d_4 128 global average pooling
Dense_1 64 fully connected
Dense_2 64 fully connected
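The two-channel classification network, with shallow branches roughly following Table 2, can be sketched as follows; the single-channel segmentation-map input, the pooling window of 3, and the number of gesture classes are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def shallow_cnn(inputs):
    """Shallow branch roughly per Table 2: four 3x3 convs with pooling, GAP, two dense layers."""
    x = inputs
    for filters in (16, 32, 64):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(3)(x)                      # 320 -> 106 -> 35 -> 11, as in Table 2
    x = layers.Conv2D(128, 3, activation="relu")(x)        # 11 -> 9, no padding
    x = layers.GlobalAveragePooling2D()(x)                 # 128-dim feature vector
    x = layers.Dense(64, activation="relu")(x)
    return layers.Dense(64, activation="relu")(x)

def dual_channel_classifier(num_classes=10):
    """Two parallel shallow branches (segmentation map and original image), concatenated, classified."""
    seg_in = layers.Input((320, 320, 1), name="hand_segmentation_map")   # shape features
    rgb_in = layers.Input((320, 320, 3), name="original_gesture_image")  # colour features
    feats = layers.Concatenate()([shallow_cnn(seg_in), shallow_cnn(rgb_in)])  # cascade network layer
    out = layers.Dense(num_classes, activation="softmax")(feats)              # classification network layer
    return tf.keras.Model([seg_in, rgb_in], out)
```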
In this embodiment, training of the semantic segmentation network based on the codec structure and of the dual-channel classification network is based on the TensorFlow framework, and the hardware is a server with a GeForce RTX 3080 GPU. Network training starts without pre-trained weights; the training pictures are resized to 320×320, and the data are augmented using horizontal/vertical flipping and scaling operations. All experiments are trained with the Adam optimizer, with an initial learning rate of 0.001, a weight decay (Decay) of 0, and a batch size of 8.
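The stated training configuration can be expressed roughly as below; the stand-in model, augmentation ranges, placeholder data, and epoch count are assumptions, while the optimizer, learning rate, batch size, and input size follow the settings above.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Stand-in model; in practice this would be the dual-channel gesture classifier described above.
model = tf.keras.Sequential([
    layers.Input((320, 320, 3)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),   # initial LR 0.001, decay 0
              loss="categorical_crossentropy", metrics=["accuracy"])

# Horizontal/vertical flips and scaling, as in the augmentation described above.
augmenter = tf.keras.preprocessing.image.ImageDataGenerator(
    horizontal_flip=True, vertical_flip=True, zoom_range=0.2)

x = np.random.rand(16, 320, 320, 3).astype("float32")                   # placeholder images
y = tf.keras.utils.to_categorical(np.random.randint(0, 10, 16), 10)     # placeholder labels
model.fit(augmenter.flow(x, y, batch_size=8), epochs=1)                 # batch size 8
```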
The loss of the semantic segmentation network is calculated according to the following formula:

L_{seg} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i\log p_i + (1-y_i)\log(1-p_i)\bigr]

where N is the number of all samples, and y_i and p_i respectively represent the true label pixel value and the predicted probability map of the i-th picture.
The loss of the two-channel classification network is calculated by adopting the following formula:

L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\log p_{ik}

where N is the number of all samples, K is the number of all gesture categories, y_{ik} represents the true probability that the i-th sample belongs to category k, and p_{ik} represents the predicted probability that the i-th sample belongs to category k.
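Reading the two losses as binary and categorical cross-entropy (an interpretation based on the variable definitions above rather than an explicit statement here), a minimal sketch is:

```python
import tensorflow as tf

def segmentation_loss(y_true, p_pred):
    """Pixel-wise binary cross-entropy between ground-truth masks and predicted probability maps."""
    return tf.keras.losses.BinaryCrossentropy()(y_true, p_pred)

def classification_loss(y_true_onehot, p_pred):
    """Categorical cross-entropy over the K gesture categories, averaged over the N samples."""
    return tf.keras.losses.CategoricalCrossentropy()(y_true_onehot, p_pred)

# Example: two samples, three gesture categories.
y = tf.constant([[1., 0., 0.], [0., 1., 0.]])
p = tf.constant([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])
print(float(classification_loss(y, p)))   # ~0.29
```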
Preferably, the hand segmentation result is evaluated by adopting preset evaluation criteria; the preset evaluation criteria include: mean intersection-over-union (mIOU), Model Size, and floating-point operations per second (FLOPS).
The mean intersection-over-union (mIOU) is defined as:

mIOU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k}p_{ij}+\sum_{j=0}^{k}p_{ji}-p_{ii}}

where k+1 represents the number of categories in the image, here two categories, a hand region and a non-hand region, and p_{ij} represents the number of pixels in the image for which category i is predicted as category j.
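The mIOU over the two classes (hand / non-hand) can be computed directly, as in the sketch below; the binary-label convention and the tiny example masks are illustrative.

```python
import numpy as np

def mean_iou(pred_mask, true_mask, num_classes=2):
    """Mean intersection-over-union, averaged over the hand and non-hand classes."""
    pred_mask, true_mask = np.asarray(pred_mask), np.asarray(true_mask)
    ious = []
    for c in range(num_classes):
        intersection = np.logical_and(pred_mask == c, true_mask == c).sum()
        union = np.logical_or(pred_mask == c, true_mask == c).sum()
        if union > 0:                       # skip classes absent from both masks
            ious.append(intersection / union)
    return float(np.mean(ious))

# Tiny 2x3 example (1 = hand, 0 = non-hand).
pred = [[0, 1, 1], [0, 0, 1]]
true = [[0, 1, 1], [0, 1, 1]]
print(mean_iou(pred, true))                 # ~0.708 (hand IoU 3/4, non-hand IoU 2/3)
```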
The model size (ModelSize) and floating point operations per second (FLOPS) were used to further evaluate the feasibility of the model.
The gesture recognition result is evaluated by adopting preset evaluation criteria; the preset evaluation criteria include: Accuracy, macro F1-score (Macro-F1), model size (ModelSize), and floating-point operations per second (FLOPS).
The Accuracy is defined as:

Accuracy = \frac{TP+TN}{Total}

where TP represents the number of samples whose true label is a positive example and which are predicted as positive; TN represents the number of samples whose true label is negative and which are predicted as negative; Total represents the number of all samples.
The macro F1-score (Macro-F1) is defined as the average of the F1-scores (F1-Score) over all gesture categories:

Macro\text{-}F1 = \frac{1}{C}\sum_{i=1}^{C}F1\text{-}Score_i

where C represents the number of all gesture categories, and F1-Score_i represents the F1-score of the i-th gesture category.
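The two recognition metrics can likewise be computed directly; the sketch below uses integer class labels and is an illustration rather than the exact evaluation code used here.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """(TP + TN) / Total, i.e. the fraction of samples whose predicted class matches the true class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def macro_f1(y_true, y_pred, num_classes):
    """Average of the per-category F1-scores over all gesture categories."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return float(np.mean(f1s))

y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]
print(accuracy(y_true, y_pred))        # 0.8
print(macro_f1(y_true, y_pred, 3))     # per-class F1 averaged over the 3 categories
```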
As shown in Table 3, the hand segmentation under a complex background provided by this embodiment is compared with other algorithms on each index. The values corresponding to the best-performing method in the table are shown in bold. As can easily be seen from the table, performance is clearly improved on the three selected evaluation indexes, especially on the two indexes of model size and floating-point operations per second. The hand segmentation performance of the semantic segmentation network based on the coding and decoding structure provided by the invention is superior to that of the other algorithms, while the model is very small and the requirements on hardware equipment are low.
TABLE 3
Fig. 6 compares the hand segmentation results under a complex background provided by this embodiment with the results of other algorithms. In the figure, the first and second columns show the original input image and the corresponding hand mask image, the third column shows the result of the algorithm proposed herein, and the remaining columns show the results of the comparison algorithms. The figure provides visual segmentation results, and it is easy to see that the hand segmentation method provided by this embodiment achieves a good segmentation effect even when the environment around the gesture is complex.
As shown in Table 4, the gesture recognition under a complex background provided by this embodiment is compared with other algorithms on each index. The values corresponding to the best-performing method in the table are shown in bold. As can be seen from the table, performance is significantly improved on all four selected evaluation indexes. The gesture recognition method under a complex background provided by the invention performs better than the other algorithms, while the model is very small and the requirements on hardware equipment are low.
TABLE 4
Method Accuracy Macro-F1 Model size FLOPS
ResNet-101 0.8333 0.8375 162.81M 85041593
ShuffleNetV2 0.8617 0.8612 7.4M 3826374
MobileNetV3 0.8752 0.8758 11.64M 6056813
HGR-Net 0.8713 0.8810 1.91M 991530
Ours 0.9117 0.9114 1.85M 950306
The recognition method provided by the embodiment fuses shallow detail features and deep semantic features through the semantic segmentation network based on the coding and decoding structure, and is suitable for correctly positioning the hand region and segmenting the hand with clear outline; and the features of the hand segmentation map and the original gesture image are respectively extracted by adopting a two-channel classification network, and the fused features are classified and identified, so that the gesture identification precision is improved.
According to the embodiment, multi-scale context information is added into the semantic segmentation network of the encoding and decoding structure, so that the performance of semantic segmentation is improved, meanwhile, depth separable convolution is introduced into the segmentation network, the calculation cost is greatly reduced, the requirement of a model on hardware equipment is reduced, and the whole gesture recognition network is lighter.
In summary, the invention discloses a gesture recognition method under a complex background based on semantic segmentation and a dual-channel classification network. After extracting a feature map of a hand region by using a residual error network, the semantic segmentation network adds a cavity space pooling pyramid (ASPP) and a decoder module to obtain a better hand segmentation effect map; the double-channel classification network is constructed, features extracted from the hand segmentation map and the original gesture image are fused, and the gesture recognition accuracy under the complex background is improved. The gesture recognition method under the complex background provided by the invention is compared with the results of other algorithms, and the results show that the gesture recognition method under the complex background can keep better performance. Meanwhile, the model is small, and the requirement on hardware equipment is low.
According to the invention, the shallow detail features and the deep semantic features are fused through the semantic segmentation network based on the coding and decoding structure, so that the method is suitable for correctly positioning the hand region and simultaneously segmenting the hand with clear outline; and the features of the hand segmentation map and the original gesture image are respectively extracted by adopting a two-channel classification network, and the fused features are classified and identified, so that the gesture identification precision is improved. According to the invention, multi-scale context information is added into the semantic segmentation network of the coding and decoding structure, so that the performance of semantic segmentation is improved, meanwhile, depth separable convolution is introduced into the segmentation network, the calculation cost is greatly reduced, the requirement of a model on hardware equipment is reduced, and the whole gesture recognition network is lighter.
It should be noted that, in the embodiments of the present invention, the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial", "circumferential", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the embodiments, and do not indicate or imply that the apparatus or element being referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communicated with the inside of two elements or the interaction relationship of the two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
While the present invention has been described in detail through the foregoing description of the preferred embodiment, it should be understood that the foregoing description is not to be considered as limiting the invention. Many modifications and substitutions of the present invention will become apparent to those of ordinary skill in the art upon reading the foregoing. Accordingly, the scope of the invention should be limited only by the attached claims.

Claims (1)

1. A method of gesture recognition in a complex context, comprising:
carrying out feature extraction on a gesture picture data set containing a complex background by adopting a semantic segmentation network based on a coding and decoding structure, and outputting a hand segmentation map;
the method comprises the steps of performing feature extraction on a hand segmentation map and an original gesture picture dataset based on a two-channel classification network, and identifying gesture categories;
the gesture picture data set of the complex background meets the preset experiment requirements, and the preset experiment requirements comprise: the images of the dataset all bear corresponding ground truth images, each set of images being completed by a different subject; images of the dataset are all acquired in very challenging situations;
the semantic segmentation network based on the coding and decoding structure comprises: a 3 x 3 convolutional layer, four bottleneck residual modules, a hole space pooling pyramid ASPP, and a decoder module;
the 3×3 convolution layer, the four bottleneck residual modules, and the hole space pooling pyramid ASPP are sequentially connected;
the features output by the second bottleneck residual module are upsampled and fused with the features output by the hole space pooling pyramid ASPP, and the fused features are used as the input of the decoder module;
each bottleneck residual module comprises three bottleneck residual units connected in sequence;
the second bottleneck residual module and the third bottleneck residual module apply a downsampling operation to capture semantic information;
the features output by the second bottleneck residual module are upsampled to obtain shallow detail features;
the fourth bottleneck residual module applies hole convolutions with different sizes to obtain more context information;
the bottleneck residual unit comprises: two 1×1 convolution layers and a depth separable convolution structure;
the depth separable convolution structure includes: a channel-by-channel convolution Depthwise Conv and a point-by-point convolution 1×1 Conv, both followed by a batch normalization operation Batch Normalization and a ReLU activation function;
the hole space pooling pyramid module ASPP captures multi-scale semantic information through four parallel hole convolutions and a global pooling operation, and the features extracted by each parallel layer are fused together through a cascade module to obtain deep semantic features;
the decoder module fuses the shallow detail features and the deep semantic features together; the fused features are refined through two convolution layers, and finally the hand segmentation map with clear contours is output through an upsampling operation;
the dual channel classification network comprises: two identical shallow convolutional neural networks, a cascade network layer and a classification network layer;
the hand segmentation graph and the original gesture image output by the semantic segmentation network are used as the input of two identical shallow convolutional neural networks of the two-channel classification network, the shape characteristics and the color characteristics of the hand are obtained through the two parallel shallow convolutional neural networks, the extracted characteristics are fused together through a cascade network layer to be used as the input of a final classification network layer, and the final gesture recognition is realized through the classification network layer;
the loss of the semantic segmentation network is calculated by adopting the following formula:
where N is the number of all samples, y i And p i Respectively representing a true label pixel value and a predicted probability map of an ith picture;
the loss of the two-channel classification network is calculated by adopting the following formula:
where N is the number of all samples, K is the number of all gesture categories, y ik Representing the true probability that the ith sample belongs to class j, p ik Representing the prediction probability that the ith sample belongs to class j;
the hand segmentation result is evaluated by adopting preset evaluation criteria, wherein the preset evaluation criteria include: mean intersection-over-union mIOU, Model Size, and floating-point operations per second FLOPS;
the mean intersection-over-union mIOU is defined as:

mIOU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k}p_{ij}+\sum_{j=0}^{k}p_{ji}-p_{ii}}

where k+1 represents the number of categories in the image, here two categories, the hand region and the non-hand region, and p_{ij} represents the number of pixels in the image for which category i is predicted as category j;
the model size ModelSize and the floating-point operations per second FLOPS are used to further evaluate the feasibility of the model;
the gesture recognition result is evaluated by adopting preset evaluation criteria, wherein the preset evaluation criteria include: Accuracy, macro F1-score Macro-F1, model size ModelSize, and floating-point operations per second FLOPS;
the Accuracy is defined as:

Accuracy = \frac{TP+TN}{Total}

where TP represents the number of samples whose true label is a positive example and which are predicted as positive; TN represents the number of samples whose true label is negative and which are predicted as negative; Total represents the number of all samples;
the macro F1-score Macro-F1 is defined as the average of the F1-scores F1-Score over all gesture categories:

Macro\text{-}F1 = \frac{1}{C}\sum_{i=1}^{C}F1\text{-}Score_i

where C represents the number of all gesture categories, and F1-Score_i represents the F1-score of the i-th gesture category.
CN202110473809.6A 2021-04-29 2021-04-29 Gesture recognition method under complex background Active CN112966672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110473809.6A CN112966672B (en) 2021-04-29 2021-04-29 Gesture recognition method under complex background

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110473809.6A CN112966672B (en) 2021-04-29 2021-04-29 Gesture recognition method under complex background

Publications (2)

Publication Number Publication Date
CN112966672A CN112966672A (en) 2021-06-15
CN112966672B true CN112966672B (en) 2024-04-05

Family

ID=76281236

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110473809.6A Active CN112966672B (en) 2021-04-29 2021-04-29 Gesture recognition method under complex background

Country Status (1)

Country Link
CN (1) CN112966672B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298080B (en) * 2021-07-26 2021-11-05 城云科技(中国)有限公司 Target detection enhancement model, target detection method, target detection device and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214250A (en) * 2017-07-05 2019-01-15 中南大学 A kind of static gesture identification method based on multiple dimensioned convolutional neural networks
CN110781895A (en) * 2019-10-10 2020-02-11 湖北工业大学 Image semantic segmentation method based on convolutional neural network
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN112184635A (en) * 2020-09-10 2021-01-05 上海商汤智能科技有限公司 Target detection method, device, storage medium and equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214250A (en) * 2017-07-05 2019-01-15 中南大学 A kind of static gesture identification method based on multiple dimensioned convolutional neural networks
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN110781895A (en) * 2019-10-10 2020-02-11 湖北工业大学 Image semantic segmentation method based on convolutional neural network
CN112184635A (en) * 2020-09-10 2021-01-05 上海商汤智能科技有限公司 Target detection method, device, storage medium and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王金鹤; 苏翠丽; 孟凡云; 车志龙; 谭浩; 张楠. Stereo matching network based on asymmetric spatial pyramid pooling. Computer Engineering. 2020, (07), full text. *
邢予权; 潘今一; 王伟; 刘建烽. Gesture recognition based on semantic segmentation and transfer learning. Computer Measurement & Control. 2020, (04), full text. *

Also Published As

Publication number Publication date
CN112966672A (en) 2021-06-15

Similar Documents

Publication Publication Date Title
CN107239733A (en) Continuous hand-written character recognizing method and system
CN111340814A (en) Multi-mode adaptive convolution-based RGB-D image semantic segmentation method
CN113642390B (en) Street view image semantic segmentation method based on local attention network
CN109961005A (en) A kind of dynamic gesture identification method and system based on two-dimensional convolution network
CN110188708A (en) A kind of facial expression recognizing method based on convolutional neural networks
Zhou et al. A lightweight hand gesture recognition in complex backgrounds
CN110110724A (en) The text authentication code recognition methods of function drive capsule neural network is squeezed based on exponential type
CN108537109B (en) OpenPose-based monocular camera sign language identification method
CN113920581A (en) Method for recognizing motion in video by using space-time convolution attention network
Neto et al. Sign language recognition based on 3d convolutional neural networks
Dar et al. Efficient-SwishNet based system for facial emotion recognition
CN115966010A (en) Expression recognition method based on attention and multi-scale feature fusion
Boukdir et al. Isolated video-based Arabic sign language recognition using convolutional and recursive neural networks
CN112200110A (en) Facial expression recognition method based on deep interference separation learning
CN112966672B (en) Gesture recognition method under complex background
CN116502181A (en) Channel expansion and fusion-based cyclic capsule network multi-modal emotion recognition method
CN106203448A (en) A kind of scene classification method based on Nonlinear Scale Space Theory
CN111813894A (en) Natural language emotion recognition method based on deep learning
Podder et al. Bangla sign language alphabet recognition using transfer learning based convolutional neural network
CN109558880B (en) Contour detection method based on visual integral and local feature fusion
Zhang et al. A simple and effective static gesture recognition method based on attention mechanism
Rawf et al. Effective Kurdish sign language detection and classification using convolutional neural networks
Han Residual learning based CNN for gesture recognition in robot interaction
CN109919057B (en) Multi-mode fusion gesture recognition method based on efficient convolutional neural network
CN114863572B (en) Myoelectric gesture recognition method of multi-channel heterogeneous sensor

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant