CN112966672A - Gesture recognition method under complex background - Google Patents
Gesture recognition method under complex background
- Publication number
- CN112966672A (application number CN202110473809.6A)
- Authority
- CN
- China
- Prior art keywords
- gesture
- segmentation
- complex background
- network
- gesture recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/107—Static hand or arm
- G06V40/113—Recognition of static hand signs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
A gesture recognition method under a complex background uses a semantic segmentation network based on an encoder-decoder structure to extract features from a gesture picture data set containing complex backgrounds and to output a hand segmentation map; a two-channel classification network then extracts features from the hand segmentation map and the original gesture image data set to identify the gesture category. The invention adds multi-scale context information to the semantic segmentation network of the encoder-decoder structure, improving semantic segmentation performance, and introduces depthwise separable convolution into the segmentation network, which greatly reduces the computation cost and the model's hardware requirements, making the whole gesture recognition network lighter.
Description
Technical Field
The invention relates to a target segmentation recognition technology, in particular to a gesture recognition method under a complex background.
Background
Humans have communicated with sign language since ancient times; gestures are as old as human civilization itself. Gestures can express almost any word or meaning to be communicated, so even with established writing systems, people around the world continue to express themselves with gestures.
In recent years, with the development of machine vision, human-computer interaction has become ever more closely tied to daily life. Gestures are a common way for people to communicate; they are vital to natural communication between humans and machines and provide a more comfortable experience for operators. In particular, gestures can provide a more intuitive way to interact with a computer, which has drawn the attention of researchers.
Gestures are used to convey information, and gesture recognition has been an important research area of machine vision. Gesture recognition may provide services to a particular group, such as deaf or hearing impaired people. In addition, the method has wide application prospect in the fields of intelligent driving, machine control, virtual reality and the like.
In practical applications, gesture recognition is challenged by variations in gesture angle, size, skin color, and illumination intensity, as well as by the environment surrounding the gesture. The background of a gesture image can be divided into a simple background, which contains no noise, and a complex background, which does. In practical scenarios, a high-precision solution for gesture recognition in a complex background is still lacking, so achieving high-precision gesture recognition under a complex background has great practical significance.
Disclosure of Invention
The invention aims to provide a gesture recognition method under a complex background, which can accurately recognize the category of a gesture under the complex background and reduce the manual recognition cost.
In order to achieve the above object, the present invention provides a gesture recognition method under a complex background, comprising:
adopting a semantic segmentation network based on an encoding and decoding structure to extract the characteristics of a gesture picture data set containing a complex background and outputting a hand segmentation picture;
and performing feature extraction on the hand part segmentation image and the original gesture image data set based on a two-channel classification network to identify the gesture category.
The gesture picture data set containing the complex background meets preset experimental requirements, which include: each image of the data set has a corresponding ground-truth image, and each group of images is performed by a different subject; the images of the data set are acquired under very challenging conditions.
The semantic segmentation network based on the encoder-decoder structure comprises: a 3 × 3 convolutional layer, four bottleneck residual modules, an atrous spatial pyramid pooling (ASPP) module, and a decoder module;
the 3 × 3 convolutional layer, the four bottleneck residual modules and the ASPP module are connected in sequence;
the up-sampled features of the second bottleneck residual module are fused with the output features of the ASPP module, and the fused features serve as the input of the decoder module.
Each bottleneck residual module comprises three bottleneck residual units connected in sequence;
the second and third bottleneck residual modules apply a down-sampling operation to capture semantic information;
the features output by the second bottleneck residual module undergo an up-sampling operation to obtain shallow detail features;
the fourth bottleneck residual module applies atrous (dilated) convolutions of different rates to obtain more context information.
The bottleneck residual unit comprises: two 1 × 1 convolutional layers and a depthwise separable convolution structure;
the depthwise separable convolution structure comprises a channel-wise convolution (Depthwise Conv) and a point-wise convolution (1 × 1 Conv), each followed by a batch normalization operation and a ReLU activation function.
The ASPP module captures multi-scale semantic information through four parallel atrous convolutions and a global pooling operation; the features extracted by each parallel layer are fused together by a concatenation module to obtain deep semantic features.
The decoder module fuses the shallow detail features and the deep semantic features, refines the fused features through two convolutional layers, and finally outputs a hand segmentation map with clear contours through an up-sampling operation.
The two-channel classification network includes: two identical shallow convolutional neural networks, a cascade network layer and a classification network layer;
the hand part segmentation image and the original gesture image output by the semantic segmentation network are used as the input of two identical shallow layer convolutional neural networks of a double-channel classification network, the shape characteristic and the color characteristic of the hand part are obtained through the two parallel shallow layer convolutional neural networks, the extracted characteristics are fused together through a cascade network layer to be used as the input of a final classification network layer, and the final gesture recognition is realized through the classification network layer.
The loss of the semantic segmentation network is calculated with the binary cross-entropy formula:

$L_{seg} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$

where $N$ is the number of all samples, and $y_i$ and $p_i$ respectively represent the ground-truth label pixel values and the predicted probability map of the $i$th picture;
the loss of the two-channel classification network is calculated by adopting the following formula:
where N is the number of all samples, K represents the number of all gesture classes, yikRepresenting the true probability, p, that the ith sample belongs to the class jikRepresenting the prediction probability that the ith sample belongs to the class j.
The hand segmentation result is evaluated with preset evaluation criteria, including: mean intersection over union (mIoU), model size, and floating-point operations per second (FLOPS);
the mean intersection over union mIoU is defined as:

$mIoU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$

where $k+1$ represents the number of categories in the image (here two: hand region and non-hand region), and $p_{ij}$ represents the number of pixels of class $i$ predicted as class $j$;
the model size and the floating-point operations per second (FLOPS) are used to further evaluate the feasibility of the model;
evaluating the gesture recognition result by adopting a preset evaluation standard; the preset evaluation criteria include: accuracy Accuracy, macroscopic F1-fraction Macro-F1, model size ModelSize and floating point operation times per second FLOPS;
the Accuracy is defined as:
in the formula, TP represents the number of samples for which the true label is a positive example and is predicted as a positive example; TN represents the number of samples with the true label as a negative case and predicted as a negative case; total represents the number of all samples;
the macroscopic F1-Score Macro-F1 is defined as the average of the corresponding F1-scores (F1-Score) of all gesture categories:
in the formula, C represents all gesture categories, F1-ScoreiF1-score representing the ith gesture category.
The invention fuses shallow detail features and deep semantic features through a semantic segmentation network based on an encoder-decoder structure, correctly locating the hand region while segmenting hands with clear contours; a two-channel classification network separately extracts the features of the hand segmentation map and the original gesture image, and the fused features are classified to improve gesture recognition accuracy. Multi-scale context information added to the semantic segmentation network of the encoder-decoder structure improves segmentation performance, while depthwise separable convolution introduced into the segmentation network greatly reduces the computation cost and the model's hardware requirements, making the whole gesture recognition network lighter.
Drawings
Fig. 1 is a schematic general flow chart of a gesture recognition method under a complex background according to the present invention.
Fig. 2 is a schematic diagram of a network framework used in a gesture recognition method under a complex background according to the present invention.
Fig. 3 is a schematic diagram of a depth separable convolution module provided by the present invention.
Fig. 4 is a schematic diagram of a bottleneck residual unit with depth separable convolution according to the present invention.
Fig. 5 is a schematic diagram of the atrous spatial pyramid pooling (ASPP) module provided by the present invention.
FIG. 6 is a schematic diagram showing the comparison between the hand segmentation result provided by the present invention and other algorithm results.
Detailed Description
The preferred embodiment of the present invention will be described in detail below with reference to fig. 1 to 6.
As shown in fig. 1, the gesture recognition method in a complex background provided in this embodiment includes the following steps:
and step S1, collecting a data set for gesture recognition in a complex background.
Specifically, the collected image data set for gesture recognition under a complex background meets preset experimental requirements: each image of the data set has a corresponding ground-truth image, and each group of images is performed by a different subject; the images are acquired under very challenging conditions, such as variations in lighting, background objects close to skin tone, and occlusion by hands and faces of different shapes and sizes.
And step S2, extracting the features of the data set by adopting a semantic segmentation network based on an encoding and decoding structure, and outputting a hand segmentation graph.
As shown in fig. 2, the semantic segmentation network based on the encoder-decoder structure is part (a) of the figure and specifically includes: a 3 × 3 convolutional layer, four bottleneck residual modules, an atrous spatial pyramid pooling (ASPP) module, and a simple decoder module.
As shown in fig. 3, a depthwise separable convolution structure (DepS Conv) is applied in the semantic segmentation network based on the encoder-decoder structure to reduce the computation cost of the model and enable hand segmentation under a complex background with limited computing resources. The depthwise separable convolution structure consists of a channel-wise convolution (Depthwise Conv) and a point-wise convolution (1 × 1 Conv). Both convolutions are followed by a batch normalization operation and a ReLU activation function. Batch normalization speeds up network learning while reducing gradient vanishing.
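The cost saving from the depthwise separable structure can be checked with a quick parameter count; the layer sizes below are illustrative, not taken from the patent.

```python
# Parameter comparison between a standard convolution and a depthwise
# separable convolution (channel-wise + point-wise), as in the DepS Conv block.

def standard_conv_params(k, c_in, c_out):
    # A k x k standard convolution mixes space and channels in one step.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Channel-wise k x k convolution (one filter per input channel),
    # followed by a 1 x 1 point-wise convolution that mixes channels.
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 128          # illustrative layer sizes
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep, round(std / sep, 1))  # 73728 8768 8.4
```

For a 3 × 3 kernel the separable form costs roughly an order of magnitude less, which is why the patent reports a much smaller model size than standard-convolution baselines.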
The 3 × 3 convolutional layer and the four bottleneck residual modules are connected in sequence to form a residual network that extracts the feature information of the image. The specific structure is shown in table 1; ResBlock_1 denotes the first bottleneck residual module, and each bottleneck residual module is a cascade of three bottleneck residual units. The structure of the bottleneck residual unit is shown in fig. 4; it consists of two 1 × 1 convolutional layers and a depthwise separable convolution structure, where the 1 × 1 convolutional layers add nonlinearity to improve the expressive power of the network while also reducing dimensionality.
The second and third bottleneck residual modules apply a down-sampling operation to capture semantic information. Each residual unit of the last bottleneck residual module applies a different atrous (dilated) convolution rate to capture more context information.
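How different atrous rates enlarge the context seen by the last residual module can be sketched with the standard effective-kernel-size formula; the rates used below are illustrative, since the patent does not list its rates in the text.

```python
# Effective kernel size of an atrous (dilated) convolution:
#   k_eff = k + (k - 1) * (r - 1)
# where k is the kernel size and r the dilation rate. Larger rates widen the
# context seen by each output pixel at the same parameter cost.

def effective_kernel(k, r):
    return k + (k - 1) * (r - 1)

for rate in (1, 2, 4, 8):  # illustrative rates, not the patent's
    print(rate, effective_kernel(3, rate))
# 1 -> 3, 2 -> 5, 4 -> 9, 8 -> 17
```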
TABLE 1
As shown in fig. 5, the atrous spatial pyramid pooling (ASPP) module captures multi-scale semantic information through four parallel atrous convolutions and a global pooling operation (Image Pooling), and the features extracted by each parallel layer are fused together by a concatenation module. The global pooling operation obtains context information from a larger receptive field.
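At the shape level, the ASPP fusion amounts to concatenating the four atrous branches and the broadcast global-pooling branch along the channel axis; the sizes and random stand-in features below are illustrative, not the patent's.

```python
import numpy as np

# Shape-level sketch of ASPP fusion: four parallel atrous branches plus a
# global pooling branch each yield an (H, W, C) feature map; concatenation
# stacks them along the channel axis. Branch contents are random stand-ins.

H, W, C = 20, 20, 64
branches = [np.random.rand(H, W, C) for _ in range(4)]  # four atrous branches

# Global image pooling: average over H and W, then broadcast back to (H, W, C)
x = np.random.rand(H, W, C)
pooled = x.mean(axis=(0, 1), keepdims=True)   # (1, 1, C)
pooled = np.broadcast_to(pooled, (H, W, C))   # upsample by replication
branches.append(pooled)

fused = np.concatenate(branches, axis=-1)     # (H, W, 5 * C)
print(fused.shape)  # (20, 20, 320)
```

In the real network a 1 × 1 convolution would follow the concatenation to mix the fused channels; here only the shape bookkeeping is shown.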
As shown in fig. 2(a), the decoder module fuses the shallow detail features and the deep semantic features, refines the fused features through two convolutional layers, and finally outputs a hand segmentation map with clear contours through an up-sampling operation. The shallow detail features are obtained by up-sampling the features output by the second bottleneck residual module; the deep semantic features are the fused features of the atrous spatial pyramid pooling (ASPP) module.
And step S3, extracting the features of the hand segmentation maps and the original pictures with a two-channel classification network, and identifying the gesture category.
As shown in fig. 2(b), the two-channel classification network comprises two parallel shallow convolutional neural networks (CNNs), a concatenation layer, and a classification layer. The two parallel shallow CNNs extract the features of the hand segmentation map and the original gesture image respectively, the concatenation layer fuses the extracted features, and the classification layer performs the final gesture recognition.
The structure of each shallow CNN is shown in table 2; it consists of four 3 × 3 convolutional layers, four pooling layers, and two fully connected layers. The pooling layers mainly implement down-sampling to expand the receptive field; they also speed up network computation and reduce overfitting.
TABLE 2
| Network layer name | Output feature size | Network layer type |
| --- | --- | --- |
| Input | 320×320×3 | — |
| Conv2d_1 | 320×320×16 | convolution |
| Pooling2d_1 | 106×106×16 | max-pooling |
| Conv2d_2 | 106×106×32 | convolution |
| Pooling2d_2 | 35×35×32 | max-pooling |
| Conv2d_3 | 35×35×64 | convolution |
| Pooling2d_3 | 11×11×64 | max-pooling |
| Conv2d_4 | 9×9×128 | convolution |
| Pooling2d_4 | 128 | global average pooling |
| Dense_1 | 64 | fully connected |
| Dense_2 | 64 | fully connected |
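The spatial sizes in table 2 can be reproduced with the usual convolution/pooling shape formulas; the 3 × 3 window with stride 3 for max-pooling is inferred from the listed sizes (320 → 106 → 35 → 11), not stated in the patent.

```python
# Reproducing the spatial sizes of table 2. Convolutions use 'same' padding
# except the last (valid); the pooling window/stride of 3 is inferred from
# the listed sizes, since the patent does not state them.

def pool(size, window=3, stride=3):
    return (size - window) // stride + 1

def conv(size, kernel=3, padding="same"):
    return size if padding == "same" else size - kernel + 1

s1 = conv(320)                   # Conv2d_1: 320
s2 = pool(s1)                    # Pooling2d_1: 106
s3 = conv(s2)                    # Conv2d_2: 106
s4 = pool(s3)                    # Pooling2d_2: 35
s5 = conv(s4)                    # Conv2d_3: 35
s6 = pool(s5)                    # Pooling2d_3: 11
s7 = conv(s6, padding="valid")   # Conv2d_4: 9
print([s1, s2, s3, s4, s5, s6, s7])  # [320, 106, 106, 35, 35, 11, 9]
```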
In this embodiment, the semantic segmentation network based on the encoder-decoder structure and the two-channel classification network are trained with the TensorFlow framework on a GeForce RTX 3080 GPU server. The networks are trained from scratch without pre-trained weights; training pictures are resized to 320 × 320, and the data is augmented with operations such as horizontal/vertical flipping and scaling. All experiments are trained with the Adam optimizer, with an initial learning rate of 0.001, a weight decay of 0, and a batch size of 8.
In this embodiment, the loss of the semantic segmentation network is calculated with the binary cross-entropy formula:

$L_{seg} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$

where $N$ is the number of all samples, and $y_i$ and $p_i$ respectively represent the ground-truth label pixel values and the predicted probability map of the $i$th picture.
The loss of the two-channel classification network is calculated with the categorical cross-entropy formula:

$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik} \log p_{ik}$

where $N$ is the number of all samples, $K$ is the number of gesture classes, $y_{ik}$ represents the true probability that the $i$th sample belongs to class $k$, and $p_{ik}$ represents the predicted probability that the $i$th sample belongs to class $k$.
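Reading these losses as binary and categorical cross-entropy (an assumption — the patent's formula images are not reproduced in the text), a minimal NumPy sketch:

```python
import numpy as np

# Minimal sketch of the two losses, assuming binary cross-entropy for the
# segmentation network and categorical cross-entropy for the classification
# network. The clip avoids log(0) on saturated predictions.

def segmentation_loss(y, p, eps=1e-7):
    # y: ground-truth pixel labels in {0, 1}; p: predicted probability map
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def classification_loss(y, p, eps=1e-7):
    # y: one-hot true labels, shape (N, K); p: predicted class probabilities
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(np.sum(y * np.log(p), axis=1))

y_pix = np.array([1.0, 0.0, 1.0, 0.0])   # toy pixel labels
p_pix = np.array([0.9, 0.1, 0.8, 0.2])
print(round(float(segmentation_loss(y_pix, p_pix)), 4))  # 0.1643

y_cls = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])     # toy one-hot labels
p_cls = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(round(float(classification_loss(y_cls, p_cls)), 4))  # 0.2899
```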
Preferably, the hand segmentation result is evaluated with preset evaluation criteria, including: mean intersection over union (mIoU), model size, and floating-point operations per second (FLOPS).
The mean intersection over union (mIoU) is defined as:

$mIoU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$

where $k+1$ represents the number of categories in the image (here two: hand region and non-hand region), and $p_{ij}$ represents the number of pixels of class $i$ predicted as class $j$.
The model size and the number of floating-point operations per second (FLOPS) are used to further evaluate the feasibility of the model.
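The mIoU definition above can be computed directly from a pixel confusion matrix; the matrix values below are toy numbers for illustration.

```python
import numpy as np

# mIoU from a pixel confusion matrix M, where M[i, j] is the number of pixels
# of true class i predicted as class j. Per class,
#   IoU_i = M[i, i] / (row_i + col_i - M[i, i]),
# and mIoU averages IoU over the classes (here two: non-hand and hand).

def mean_iou(confusion):
    confusion = np.asarray(confusion, dtype=float)
    diag = np.diag(confusion)
    union = confusion.sum(axis=1) + confusion.sum(axis=0) - diag
    return float(np.mean(diag / union))

# Toy confusion matrix: rows = true class, cols = predicted class
M = [[40, 10],   # non-hand: 40 correct, 10 mispredicted as hand
     [5, 45]]    # hand: 45 correct, 5 mispredicted as non-hand
print(round(mean_iou(M), 4))  # 0.7386
```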
The gesture recognition result is evaluated with preset evaluation criteria, including: accuracy, macro F1-score (Macro-F1), model size, and floating-point operations per second (FLOPS).
The accuracy is defined as:

$Accuracy = \frac{TP + TN}{Total}$

where $TP$ represents the number of samples whose true label is positive and which are predicted positive; $TN$ represents the number of samples whose true label is negative and which are predicted negative; and $Total$ represents the number of all samples.
The macro F1-score (Macro-F1) is defined as the average of the F1-scores of all gesture categories:

$Macro\text{-}F1 = \frac{1}{C}\sum_{i=1}^{C} F1\text{-}Score_i$

where $C$ represents the number of gesture categories and $F1\text{-}Score_i$ is the F1-score of the $i$th gesture category.
As shown in table 3, the hand segmentation results of this embodiment under a complex background are compared with the indicators of other algorithms; the values of the best-performing method are bolded. All three selected evaluation indicators improve significantly, especially model size and floating-point operations per second. The hand segmentation performance of the proposed semantic segmentation network based on the encoder-decoder structure is superior to that of the other algorithms, while the model is very small and has low hardware requirements.
TABLE 3
As shown in fig. 6, the hand segmentation results under a complex background provided by this embodiment are compared with the results of other algorithms. The first and second columns show the original input images and the corresponding hand mask images respectively, the third column shows the results of the proposed algorithm, and the remaining columns show the comparison algorithms. The figure gives an intuitive view of the segmentation results: the hand segmentation method of this embodiment achieves a good segmentation effect even when the environment around the gesture is complex.
As shown in table 4, the gesture recognition results of this embodiment under a complex background are compared with the indicators of other algorithms; the values of the best-performing method are bolded. Performance improves significantly on all four selected evaluation indicators. The gesture recognition method under a complex background provided by the invention performs better than the other algorithms, while the model is very small and has low hardware requirements.
TABLE 4
| Method | Accuracy | Macro-F1 | Model size | FLOPS |
| --- | --- | --- | --- | --- |
| ResNet-101 | 0.8333 | 0.8375 | 162.81M | 85041593 |
| ShuffleNetV2 | 0.8617 | 0.8612 | 7.4M | 3826374 |
| MobileNetV3 | 0.8752 | 0.8758 | 11.64M | 6056813 |
| HGR-Net | 0.8713 | 0.8810 | 1.91M | 991530 |
| Ours | 0.9117 | 0.9114 | 1.85M | 950306 |
The recognition method provided by this embodiment fuses shallow detail features and deep semantic features through a semantic segmentation network based on an encoder-decoder structure, correctly locating the hand region while segmenting hands with clear contours; a two-channel classification network separately extracts the features of the hand segmentation map and the original gesture image, and the fused features are classified to improve gesture recognition accuracy.
In this embodiment, multi-scale context information is added to the semantic segmentation network of the encoder-decoder structure, improving segmentation performance; meanwhile, depthwise separable convolution is introduced into the segmentation network, greatly reducing the computation cost and the model's hardware requirements and making the whole gesture recognition network lighter.
In summary, the invention discloses a gesture recognition method under a complex background based on semantic segmentation and a two-channel classification network. In the semantic segmentation network, after a residual network extracts the feature map of the hand region, an atrous spatial pyramid pooling (ASPP) module and a decoder module are added to obtain a better hand segmentation map; a two-channel classification network is constructed, and the features extracted from the hand segmentation map and the original gesture image are fused, improving gesture recognition accuracy under a complex background. Compared with other algorithms, the proposed method maintains good performance for gesture recognition under a complex background, while the model is small and has low hardware requirements.
The invention fuses shallow detail features and deep semantic features through a semantic segmentation network based on an encoder-decoder structure, correctly locating the hand region while segmenting hands with clear contours; a two-channel classification network separately extracts the features of the hand segmentation map and the original gesture image, and the fused features are classified to improve gesture recognition accuracy. Multi-scale context information added to the semantic segmentation network of the encoder-decoder structure improves segmentation performance, while depthwise separable convolution introduced into the segmentation network greatly reduces the computation cost and the model's hardware requirements, making the whole gesture recognition network lighter.
It should be noted that, in the embodiments of the present invention, the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial", "circumferential", etc. indicate the orientation or positional relationship shown in the drawings, and are only for convenience of describing the embodiments, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.
Claims (10)
1. A gesture recognition method under a complex background, comprising:
adopting a semantic segmentation network based on an encoding and decoding structure to perform feature extraction on a gesture picture data set containing a complex background and to output a hand segmentation map;
and performing feature extraction on the hand segmentation map and the original gesture picture data set based on a two-channel classification network to recognize the gesture category.
2. The gesture recognition method under a complex background according to claim 1, wherein the gesture picture data set containing a complex background conforms to preset experiment requirements, and the preset experiment requirements comprise: each image of the data set is provided with a corresponding ground-truth image; each group of images is completed by a different subject; and the images of the data set are acquired under very challenging conditions.
3. The gesture recognition method under a complex background according to claim 2, wherein the semantic segmentation network based on the encoding and decoding structure comprises: a 3 × 3 convolutional layer, four bottleneck residual modules, an atrous spatial pyramid pooling (ASPP) module, and a decoder module;
the 3 × 3 convolutional layer, the four bottleneck residual modules, and the ASPP module are connected in sequence;
and the output of the second bottleneck residual module is up-sampled and fused with the output features of the ASPP module, and the fused features serve as the input of the decoder module.
4. The gesture recognition method under a complex background according to claim 3, wherein each bottleneck residual module comprises three bottleneck residual units connected in sequence;
the second and third bottleneck residual modules perform down-sampling operations to capture semantic information;
the features output by the second bottleneck residual module undergo an up-sampling operation to obtain shallow detail features;
and the fourth bottleneck residual module applies atrous (dilated) convolutions with different rates to obtain more context information.
5. The gesture recognition method under a complex background according to claim 4, wherein the bottleneck residual unit comprises: two 1 × 1 convolutional layers and a depthwise separable convolution structure;
the depthwise separable convolution structure comprises: a depthwise convolution (Depthwise Conv) and a pointwise convolution (1 × 1 Conv), each followed by a batch normalization (Batch Normalization) operation and a ReLU activation function.
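A depthwise separable convolution is used here as a lightweight substitute for a standard convolution. A minimal sketch of the parameter savings (the channel sizes 64 → 128 are illustrative assumptions, not values stated in the claims):

```python
def standard_conv_params(k, c_in, c_out):
    # A standard k x k convolution mixes space and channels in one step.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise step: one k x k filter per input channel (spatial only).
    depthwise = k * k * c_in
    # Pointwise step: a 1 x 1 convolution that mixes channels.
    pointwise = c_in * c_out
    return depthwise + pointwise

# Illustrative channel sizes (not specified in the claims).
std = standard_conv_params(3, 64, 128)        # 73728 parameters
sep = depthwise_separable_params(3, 64, 128)  # 576 + 8192 = 8768 parameters
print(std, sep, round(std / sep, 1))
```

For a 3 × 3 kernel this factorization cuts the parameter count by roughly 8–9×, which is why it appears in lightweight segmentation encoders.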
6. The gesture recognition method under a complex background according to claim 5, wherein the ASPP module captures multi-scale semantic information through four parallel atrous convolutions and one global pooling operation, and the features extracted by each parallel branch are merged together by a concatenation module to obtain deep semantic features.
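The multi-scale behavior of the parallel atrous branches comes from the way dilation enlarges the effective receptive field without adding parameters. A small sketch; the example rates (1, 6, 12, 18) are common DeepLab-style choices and an assumption, since the claims do not list the rates used:

```python
def dilated_receptive_field(k, d):
    # Effective receptive field of a k x k convolution with dilation rate d:
    # the kernel taps are spread d pixels apart.
    return k + (k - 1) * (d - 1)

# Assumed dilation rates for the four parallel ASPP branches.
for d in (1, 6, 12, 18):
    print(d, dilated_receptive_field(3, d))
```

Each branch thus "sees" a different spatial extent of the hand and its surroundings, and concatenating the branch outputs yields the multi-scale deep semantic features described in the claim.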
7. The gesture recognition method under a complex background according to claim 6, wherein the decoder module fuses the shallow detail features and the deep semantic features; the fused features are refined through two convolutional layers, and finally a hand segmentation map with a clear contour is output through an up-sampling operation.
8. The gesture recognition method under a complex background according to claim 7, wherein the two-channel classification network comprises: two identical shallow convolutional neural networks, a cascade network layer, and a classification network layer;
the hand segmentation map output by the semantic segmentation network and the original gesture picture are used as the inputs of the two identical shallow convolutional neural networks of the two-channel classification network; the shape features and color features of the hand are obtained through the two parallel shallow convolutional neural networks; the extracted features are fused together by the cascade network layer and used as the input of the final classification network layer; and the final gesture recognition is realized by the classification network layer.
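The cascade layer described above can be read as a plain channel-wise concatenation of the two branch outputs. A minimal numpy sketch; the per-branch feature size of 128 is an illustrative assumption, not a value from the claims:

```python
import numpy as np

# Hypothetical feature vectors from the two parallel shallow CNNs: one branch
# sees the hand segmentation map (shape cues), the other the original image
# (color cues). Batch size 1, 128 features per branch (both assumed).
shape_features = np.random.rand(1, 128)
color_features = np.random.rand(1, 128)

# The cascade (concatenation) layer joins the two along the feature axis,
# producing the input of the final classification layer.
fused = np.concatenate([shape_features, color_features], axis=1)
print(fused.shape)
```

The classifier then operates on the 256-dimensional fused vector, so shape and color evidence are weighed jointly rather than in separate pipelines.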
9. The gesture recognition method under a complex background according to claim 8, wherein the loss of the semantic segmentation network is calculated by the following formula:

Loss_seg = −(1/N) · Σ_{i=1}^{N} [ y_i · log(p_i) + (1 − y_i) · log(1 − p_i) ]

where N is the number of all samples, and y_i and p_i respectively denote the ground-truth label pixel values and the predicted probability map of the i-th picture;
the loss of the two-channel classification network is calculated by the following formula:

Loss_cls = −(1/N) · Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik · log(p_ik)

where N is the number of all samples, K denotes the number of all gesture classes, y_ik denotes the true probability that the i-th sample belongs to class k, and p_ik denotes the predicted probability that the i-th sample belongs to class k.
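Reading claim 9 as the standard binary and categorical cross-entropy losses (the formulas themselves were dropped from this copy of the claims, so this is a reconstruction under that reading), a minimal numpy sketch:

```python
import numpy as np

def segmentation_loss(y, p, eps=1e-7):
    # Binary cross-entropy between ground-truth mask values y and the
    # predicted probability map p, averaged over all samples/pixels.
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def classification_loss(y, p, eps=1e-7):
    # Categorical cross-entropy: y is one-hot with shape (N, K),
    # p holds predicted class probabilities with shape (N, K).
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(np.sum(y * np.log(p), axis=1))

# Toy values for illustration only.
y_mask = np.array([1.0, 0.0, 1.0])
p_mask = np.array([0.9, 0.1, 0.8])
print(segmentation_loss(y_mask, p_mask))

y_cls = np.array([[1.0, 0.0], [0.0, 1.0]])
p_cls = np.array([[0.7, 0.3], [0.2, 0.8]])
print(classification_loss(y_cls, p_cls))
```

Both losses shrink toward zero as the predicted probabilities approach the ground truth, which is the behavior the training procedure relies on.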
10. The gesture recognition method under a complex background according to claim 9, wherein the hand segmentation result is evaluated by preset evaluation criteria; the preset evaluation criteria comprise: mean intersection over union (mIoU), model size (Model Size), and floating-point operations per second (FLOPS);

the mean intersection over union mIoU is defined as:

mIoU = (1/(k+1)) · Σ_{i=0}^{k} p_ii / ( Σ_{j=0}^{k} p_ij + Σ_{j=0}^{k} p_ji − p_ii )

where k + 1 denotes the number of classes in the image (here two classes: the hand region and the non-hand region), and p_ij denotes the number of pixels of true class i predicted as class j in the image;
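A minimal pure-Python sketch of the mIoU definition above, for the two-class (hand / non-hand) case; the toy masks are illustrative:

```python
def mean_iou(pred, truth, num_classes=2):
    # p[i][j]: number of pixels whose true class is i and predicted class is j.
    p = [[0] * num_classes for _ in range(num_classes)]
    for t_row, pr_row in zip(truth, pred):
        for t, pr in zip(t_row, pr_row):
            p[t][pr] += 1
    ious = []
    for i in range(num_classes):
        inter = p[i][i]
        # union = row sum + column sum - diagonal (counted twice)
        union = sum(p[i]) + sum(row[i] for row in p) - p[i][i]
        ious.append(inter / union if union else 0.0)
    return sum(ious) / num_classes

# Toy 2 x 2 masks: 0 = non-hand, 1 = hand.
truth = [[0, 0], [1, 1]]
pred  = [[0, 1], [1, 1]]
print(mean_iou(pred, truth))  # IoU(0) = 1/2, IoU(1) = 2/3 -> mIoU ~ 0.583
```

Averaging the per-class IoU prevents the large non-hand background from dominating the score, which a plain pixel accuracy would allow.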
the model size (Model Size) and the floating-point operations per second (FLOPS) are used to further evaluate the feasibility of the model;
the gesture recognition result is evaluated by preset evaluation criteria; the preset evaluation criteria comprise: accuracy (Accuracy), macro F1-score (Macro-F1), model size (Model Size), and floating-point operations per second (FLOPS);

the accuracy Accuracy is defined as:

Accuracy = (TP + TN) / Total

where TP denotes the number of samples whose true label is a positive example and which are predicted as positive examples; TN denotes the number of samples whose true label is a negative example and which are predicted as negative examples; and Total denotes the number of all samples;
the macro F1-score Macro-F1 is defined as the average of the F1-scores corresponding to all gesture categories:

Macro-F1 = (1/C) · Σ_{i=1}^{C} F1-Score_i

where C denotes the number of all gesture categories, and F1-Score_i denotes the F1-score of the i-th gesture category.
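A minimal pure-Python sketch of the two classification metrics above, computing per-class F1 from precision and recall and averaging over categories; the toy labels are illustrative:

```python
def accuracy(y_true, y_pred):
    # (TP + TN) / Total reduces to the fraction of correct predictions
    # when every sample is counted once.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, classes):
    # Macro-F1: compute F1 per gesture category, then take the plain average,
    # so rare categories weigh as much as frequent ones.
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(classes)

# Toy labels over three hypothetical gesture categories.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred, classes=[0, 1, 2]))
```

Reporting both metrics guards against a classifier that scores well on accuracy merely by favoring the most frequent gesture class.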
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110473809.6A CN112966672B (en) | 2021-04-29 | 2021-04-29 | Gesture recognition method under complex background |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112966672A true CN112966672A (en) | 2021-06-15 |
CN112966672B CN112966672B (en) | 2024-04-05 |
Family
ID=76281236
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110473809.6A Active CN112966672B (en) | 2021-04-29 | 2021-04-29 | Gesture recognition method under complex background |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112966672B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109214250A (en) * | 2017-07-05 | 2019-01-15 | Central South University | Static gesture recognition method based on multi-scale convolutional neural networks |
CN110781895A (en) * | 2019-10-10 | 2020-02-11 | Hubei University of Technology | Image semantic segmentation method based on convolutional neural network |
WO2020215236A1 (en) * | 2019-04-24 | 2020-10-29 | Harbin Institute of Technology (Shenzhen) | Image semantic segmentation method and system |
CN112184635A (en) * | 2020-09-10 | 2021-01-05 | Shanghai SenseTime Intelligent Technology Co., Ltd. | Target detection method, device, storage medium and equipment |
Non-Patent Citations (2)
Title |
---|
Wang Jinhe; Su Cuili; Meng Fanyun; Che Zhilong; Tan Hao; Zhang Nan: "Stereo Matching Network Based on Asymmetric Spatial Pyramid Pooling", Computer Engineering, no. 07 *
Xing Yuquan; Pan Jinyi; Wang Wei; Liu Jianfeng: "Gesture Recognition Based on Semantic Segmentation and Transfer Learning", Computer Measurement & Control, no. 04 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113298080A (en) * | 2021-07-26 | 2021-08-24 | 城云科技(中国)有限公司 | Target detection enhancement model, target detection method, target detection device and electronic device |
CN113298080B (en) * | 2021-07-26 | 2021-11-05 | 城云科技(中国)有限公司 | Target detection enhancement model, target detection method, target detection device and electronic device |
Also Published As
Publication number | Publication date |
---|---|
CN112966672B (en) | 2024-04-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022227913A1 (en) | Double-feature fusion semantic segmentation system and method based on internet of things perception | |
CN113221639B (en) | Micro-expression recognition method for representative AU (AU) region extraction based on multi-task learning | |
CN111242288B (en) | Multi-scale parallel deep neural network model construction method for lesion image segmentation | |
CN109492529A (en) | A kind of Multi resolution feature extraction and the facial expression recognizing method of global characteristics fusion | |
Islalm et al. | Recognition bangla sign language using convolutional neural network | |
CN111091130A (en) | Real-time image semantic segmentation method and system based on lightweight convolutional neural network | |
CN107239733A (en) | Continuous hand-written character recognizing method and system | |
CN108804397A (en) | A method of the Chinese character style conversion based on a small amount of target font generates | |
CN105956560A (en) | Vehicle model identification method based on pooling multi-scale depth convolution characteristics | |
CN111340814A (en) | Multi-mode adaptive convolution-based RGB-D image semantic segmentation method | |
CN112163401B (en) | Compression and excitation-based Chinese character font generation method of GAN network | |
CN111652273B (en) | Deep learning-based RGB-D image classification method | |
CN113870335A (en) | Monocular depth estimation method based on multi-scale feature fusion | |
CN110517270B (en) | Indoor scene semantic segmentation method based on super-pixel depth network | |
CN110110724A (en) | The text authentication code recognition methods of function drive capsule neural network is squeezed based on exponential type | |
CN115862045B (en) | Case automatic identification method, system, equipment and storage medium based on image-text identification technology | |
CN108537109B (en) | OpenPose-based monocular camera sign language identification method | |
CN115966010A (en) | Expression recognition method based on attention and multi-scale feature fusion | |
CN113065426A (en) | Gesture image feature fusion method based on channel perception | |
CN116502181A (en) | Channel expansion and fusion-based cyclic capsule network multi-modal emotion recognition method | |
CN116129141A (en) | Medical data processing method, apparatus, device, medium and computer program product | |
CN106203448A (en) | A kind of scene classification method based on Nonlinear Scale Space Theory | |
CN112966672B (en) | Gesture recognition method under complex background | |
CN113378938B (en) | Edge transform graph neural network-based small sample image classification method and system | |
Zhang et al. | A simple and effective static gesture recognition method based on attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||