CN112966672B - Gesture recognition method under complex background - Google Patents

Gesture recognition method under complex background

Info

Publication number
CN112966672B
CN112966672B
Authority
CN
China
Prior art keywords
gesture
features
convolution
network
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110473809.6A
Other languages
Chinese (zh)
Other versions
CN112966672A (en)
Inventor
陈昆
周薇娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University filed Critical Shanghai Maritime University
Priority to CN202110473809.6A priority Critical patent/CN112966672B/en
Publication of CN112966672A publication Critical patent/CN112966672A/en
Application granted granted Critical
Publication of CN112966672B publication Critical patent/CN112966672B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm
    • G06V40/113 Recognition of static hand signs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The gesture recognition method under a complex background adopts a semantic segmentation network based on a coding and decoding structure to extract features from a gesture picture dataset containing complex backgrounds and to output a hand segmentation map; a dual-channel classification network then extracts features from the hand segmentation map and the original gesture picture dataset and identifies the gesture category. According to the invention, multi-scale context information is added to the semantic segmentation network of the coding and decoding structure, which improves the performance of semantic segmentation; at the same time, depth separable convolution is introduced into the segmentation network, which greatly reduces the computational cost, lowers the model's requirements on hardware equipment, and makes the whole gesture recognition network more lightweight.

Description

Gesture recognition method under complex background
Technical Field
The invention relates to a target segmentation recognition technology, in particular to a gesture recognition method under a complex background.
Background
Since ancient times, humans have communicated using sign language; gestures are as old as human civilization itself. Gestures are particularly useful for expressing any word or feeling that needs to be communicated, which is why, despite well-established writing systems, people around the world continue to express themselves with gestures.
In recent years, with the development of machine vision, human-computer interaction is more closely related to the daily life of people. Gestures are a common way for people to communicate, are critical to achieving natural communication between humans and machines, and provide a more comfortable experience for operators. In particular, gestures may be used to provide more intuitive interactions with a computer, which draws the attention of researchers.
Gesture recognition has been an important area of research for machine vision for conveying information. Gesture recognition may provide services to a particular group, such as the deaf or hearing impaired. In addition, the method has wide application prospect in the fields of intelligent driving, machine control, virtual reality and the like.
In practical applications, different angles, different sizes, skin colors, illumination intensities, and environments around the gestures present significant challenges for gesture recognition. The background of the gesture image can be classified into a simple background, which refers to a background that does not contain any noise, and a complex background, which refers to a background that contains noise. There is still a lack of high precision solutions for gesture recognition in complex contexts in real scenes. Therefore, the realization of high-precision recognition of gestures in a complex background has great practical significance.
Disclosure of Invention
The invention aims to provide a gesture recognition method under a complex background, which can accurately recognize the category of gestures under the complex background and reduce the manual recognition cost.
In order to achieve the above objective, the present invention provides a gesture recognition method under a complex background, comprising:
carrying out feature extraction on a gesture picture data set containing a complex background by adopting a semantic segmentation network based on a coding and decoding structure, and outputting a hand segmentation map;
and extracting features of the hand segmentation map and the original gesture picture dataset by adopting a network based on the double-channel classification, and identifying gesture categories.
The gesture picture data set of the complex background meets the preset experiment requirements, and the preset experiment requirements comprise: the images of the dataset all bear corresponding ground truth images, each set of images being completed by a different subject; images of the dataset are acquired in very challenging situations.
The semantic segmentation network based on the coding and decoding structure comprises: a 3 x 3 convolutional layer, four bottleneck residual modules, a hole space pooling pyramid ASPP, and a decoder module;
the 3X 3 convolution layer, the four bottleneck residual modules and the cavity space pooling pyramid ASPP are sequentially connected;
and the output of the second bottleneck residual error module is fused with the features output by the cavity space pooling pyramid ASPP through the features after upsampling, and the fused features are used as the input of the decoder module.
The bottleneck residual error module comprises three bottleneck residual error units, and each bottleneck residual error unit is connected in sequence;
the second bottleneck residual module and the third bottleneck residual module are used for downsampling operation to capture semantic information;
the characteristics output by the second bottleneck residual error module are subjected to upsampling operation to obtain shallow detail characteristics;
the fourth bottleneck residual module applies different sizes of hole convolutions to obtain more context information.
The bottleneck residual unit comprises: two 1×1 convolution layers and a depth separable convolution structure;
the depth separable convolution structure includes: a channel-by-channel convolution Depthwise Conv and a point-by-point convolution 1×1 Conv, both followed by a batch normalization operation Batch Normalization and a ReLU activation function.
The hole space pooling pyramid module ASPP captures multi-scale semantic information through four parallel hole convolutions and one global pooling operation, and the features extracted by each parallel layer are fused together through a cascade module to obtain deep semantic features.
The decoder module fuses the shallow detail features and the deep semantic features together; the fused features are refined through two convolution layers, and finally the hand segmentation map with clear contours is output through an upsampling operation.
The dual channel classification network comprises: two identical shallow convolutional neural networks, a cascade network layer and a classification network layer;
the hand segmentation graph and the original gesture image output by the semantic segmentation network are used as the input of two identical shallow convolutional neural networks of the two-channel classification network, the shape characteristics and the color characteristics of the hand are obtained through the two parallel shallow convolutional neural networks, the extracted characteristics are fused together through the cascade network layer to be used as the input of the final classification network layer, and the final gesture recognition is realized through the classification network layer.
The loss of the semantic segmentation network is calculated by adopting the following formula:

L_{seg} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i\log p_i + (1-y_i)\log(1-p_i)\bigr]

where N is the number of all samples, and y_i and p_i respectively represent the true label pixel value and the predicted probability map of the i-th picture.
The loss of the two-channel classification network is calculated by adopting the following formula:

L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\log p_{ik}

where N is the number of all samples, K is the number of all gesture categories, y_{ik} represents the true probability that the i-th sample belongs to category k, and p_{ik} represents the predicted probability that the i-th sample belongs to category k.
The hand segmentation result is evaluated by adopting preset evaluation criteria; the preset evaluation criteria include: mean intersection-over-union mIOU, Model Size, and floating-point operations per second FLOPS.
The mean intersection-over-union mIOU is defined as:

mIOU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k}p_{ij}+\sum_{j=0}^{k}p_{ji}-p_{ii}}

where k+1 represents the number of categories in the image, here two categories, the hand region and the non-hand region, and p_{ij} represents the number of pixels in the image for which category i is predicted as category j.
The model size ModelSize and the floating-point operations per second FLOPS are used to further evaluate the feasibility of the model.
The gesture recognition result is evaluated by adopting preset evaluation criteria; the preset evaluation criteria include: Accuracy, macro F1-score Macro-F1, model size ModelSize, and floating-point operations per second FLOPS.
The Accuracy is defined as:

Accuracy = \frac{TP+TN}{Total}

where TP represents the number of samples whose true label is a positive example and which are predicted as positive; TN represents the number of samples whose true label is negative and which are predicted as negative; Total represents the number of all samples.
The macro F1-score Macro-F1 is defined as the average of the F1-scores (F1-Score) over all gesture categories:

Macro\text{-}F1 = \frac{1}{C}\sum_{i=1}^{C}F1\text{-}Score_i

where C represents the number of all gesture categories, and F1-Score_i represents the F1-score of the i-th gesture category.
According to the invention, the shallow detail features and the deep semantic features are fused through the semantic segmentation network based on the coding and decoding structure, so that the method is suitable for correctly positioning the hand region and simultaneously segmenting the hand with clear outline; and the features of the hand segmentation map and the original gesture image are respectively extracted by adopting a two-channel classification network, and the fused features are classified and identified, so that the gesture identification precision is improved. According to the invention, multi-scale context information is added into the semantic segmentation network of the coding and decoding structure, so that the performance of semantic segmentation is improved, meanwhile, depth separable convolution is introduced into the segmentation network, the calculation cost is greatly reduced, the requirement of a model on hardware equipment is reduced, and the whole gesture recognition network is lighter.
Drawings
Fig. 1 is a general flow diagram of a gesture recognition method under a complex background provided by the present invention.
Fig. 2 is a schematic diagram of a network used in a gesture recognition method under a complex background according to the present invention.
Fig. 3 is a schematic diagram of a depth separable convolution module provided by the present invention.
Fig. 4 is a schematic diagram of a bottleneck residual unit with depth separable convolution provided by the present invention.
Fig. 5 is a schematic diagram of a cavity-space pooling pyramid (ASPP) provided by the invention.
Fig. 6 is a schematic diagram comparing the hand segmentation result provided by the present invention with other algorithm results.
Detailed Description
The following describes a preferred embodiment of the present invention with reference to fig. 1 to 6.
The embodiment provides a method for recognizing a gesture in a complex background, as shown in fig. 1, where the method for recognizing a gesture in a complex background provided in the embodiment includes the following steps:
step S1, collecting a data set for gesture recognition under a complex background.
Specifically, the collected image dataset for recognizing the gesture in the complex background meets a preset experiment requirement, and the preset experiment requirement comprises: each image of the data set to be identified carries a corresponding ground truth value image, and each group of images is completed by a different subject; the images of each of the data sets to be identified are acquired in very challenging situations, such as variations in illumination, objects in the background that are similar to skin colors, and mutual occlusion of the hands and faces of different shapes and sizes.
And S2, carrying out feature extraction on the data set by adopting a semantic segmentation network based on a coding and decoding structure, and outputting a hand segmentation map.
As shown in fig. 2, the semantic segmentation network based on the codec structure is part (a) of the figure, and specifically includes: a 3 x 3 convolutional layer, four bottleneck residual modules, a hole space pooling pyramid (ASPP), and a simple decoder module.
As shown in fig. 3, a depth separable convolution structure (DepS Conv) is applied in the semantic segmentation network based on the codec structure to reduce the computational cost of the model, so that hand segmentation in a complex background can be realized with limited computing resources. The depth separable convolution structure is composed of a channel-by-channel convolution (Depthwise Conv) and a point-by-point convolution (1×1 Conv). Both convolutions are followed by a batch normalization operation (Batch Normalization) and a ReLU activation function. The batch normalization operation helps accelerate network learning and reduce gradient vanishing.
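The depth separable convolution block described here can be sketched in TensorFlow/Keras roughly as follows; the 3×3 depthwise kernel, the configurable stride/dilation arguments, and the function name deps_conv are illustrative assumptions rather than details taken from this description.

```python
import tensorflow as tf
from tensorflow.keras import layers

def deps_conv(x, filters, stride=1, dilation=1):
    """Depthwise Conv -> BN -> ReLU, then point-wise 1x1 Conv -> BN -> ReLU."""
    x = layers.DepthwiseConv2D(3, strides=stride, dilation_rate=dilation,
                               padding="same", use_bias=False)(x)      # channel-by-channel convolution
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)   # point-by-point 1x1 convolution
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return x
```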
And the 3 multiplied by 3 convolution layer and the four bottleneck residual modules are sequentially connected to form a residual network so as to extract the characteristic information of the image. The specific structure is shown in table 1, and resblock_1 represents a first bottleneck residual module, and each bottleneck residual module is formed by cascading three bottleneck residual units. The bottleneck residual unit has a structure shown in fig. 4, and the structure consists of two 1×1 convolution layers and a depth separable convolution structure, wherein the 1×1 convolution layers have the function of adding nonlinearity to improve the expression capability of the network and can play a role of reducing the dimension.
The second and third bottleneck residual modules apply a downsampling operation to capture semantic information. Each residual unit of the last bottleneck residual module applies a different hole convolution to capture more context information.
TABLE 1
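A minimal sketch of the bottleneck residual unit of fig. 4 is given below; the channel-reduction ratio of 4 and the projection shortcut used when shapes differ are assumptions, and the middle depthwise convolution together with the final 1×1 point-wise convolution stands in for the DepS Conv block.

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_residual_unit(x, filters, stride=1, dilation=1, reduction=4):
    """1x1 conv (reduce) -> depthwise 3x3 conv -> 1x1 conv (restore), added to a shortcut."""
    shortcut = x
    y = layers.Conv2D(filters // reduction, 1, use_bias=False)(x)        # first 1x1 conv: reduce channels
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.DepthwiseConv2D(3, strides=stride, dilation_rate=dilation,
                               padding="same", use_bias=False)(y)        # channel-by-channel convolution
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 1, use_bias=False)(y)                     # second 1x1 conv: restore channels
    y = layers.BatchNormalization()(y)
    if stride != 1 or x.shape[-1] != filters:                            # project the shortcut when shapes differ
        shortcut = layers.Conv2D(filters, 1, strides=stride, use_bias=False)(x)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))
```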
As shown in fig. 5, the hole space pooling pyramid module (ASPP) captures multi-scale semantic information through four parallel hole convolutions and a global pooling operation (Image Pooling), and the features extracted by each parallel layer are fused together through a cascade module. The global pooling operation obtains context information over a larger receptive field.
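A rough sketch of such an ASPP module follows; the dilation rates (1, 6, 12, 18) and the 256-channel width follow common ASPP configurations and are assumptions, not values stated here. The sketch also assumes a statically known feature-map size for the image-pooling branch.

```python
import tensorflow as tf
from tensorflow.keras import layers

def aspp(x, filters=256, rates=(1, 6, 12, 18)):
    """Four parallel (dilated) convolutions plus image-level pooling, concatenated and fused."""
    branches = [layers.Conv2D(filters, 1 if r == 1 else 3, dilation_rate=r,
                              padding="same", activation="relu")(x)
                for r in rates]
    pooled = layers.GlobalAveragePooling2D(keepdims=True)(x)             # image pooling branch
    pooled = layers.Conv2D(filters, 1, activation="relu")(pooled)
    pooled = layers.UpSampling2D(size=(x.shape[1], x.shape[2]),
                                 interpolation="bilinear")(pooled)       # back to the feature-map size
    y = layers.Concatenate()(branches + [pooled])                        # cascade (concatenation) module
    return layers.Conv2D(filters, 1, activation="relu")(y)               # fuse into deep semantic features
```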
As shown in fig. 2 (a), the decoder module (Decoder) fuses the shallow detail features and the deep semantic features together; the fused features are refined through two convolution layers and finally upsampled to output a hand segmentation map with a clear outline. The shallow detail features are obtained by upsampling the features output by the second bottleneck residual module; the deep semantic features are the fused features of the hole space pooling pyramid (ASPP) module.
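The decoder fusion can be sketched as below; the upsampling factors, the 48/256 channel widths, and the softmax output are assumptions used only to illustrate the fuse-refine-upsample sequence, and the two input feature maps are assumed to arrive at matching spatial sizes after upsampling.

```python
import tensorflow as tf
from tensorflow.keras import layers

def decoder(deep_feats, shallow_feats, num_classes=2, up1=4, up2=4):
    """Fuse upsampled deep semantic features with shallow detail features,
    refine with two convolutions, and upsample to the hand segmentation map."""
    deep = layers.UpSampling2D(up1, interpolation="bilinear")(deep_feats)
    shallow = layers.Conv2D(48, 1, activation="relu")(shallow_feats)     # compress shallow detail features
    y = layers.Concatenate()([deep, shallow])                            # assumes matching spatial sizes
    y = layers.Conv2D(256, 3, padding="same", activation="relu")(y)      # first refining convolution
    y = layers.Conv2D(256, 3, padding="same", activation="relu")(y)      # second refining convolution
    y = layers.Conv2D(num_classes, 1, activation="softmax")(y)           # per-pixel hand / non-hand scores
    return layers.UpSampling2D(up2, interpolation="bilinear")(y)         # hand segmentation map
```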
And S3, carrying out feature extraction on the hand segmentation graph and the original graph by adopting a two-channel classification network, and identifying gesture types.
As shown in part (b) of fig. 2, the dual channel classification network comprises: two parallel shallow neural networks (CNNs), one cascade layer, one classification layer. The two parallel shallow convolutional neural networks respectively extract the features of the hand segmentation map and the original gesture image, the cascade network layer fuses the extracted features together, and final gesture recognition is realized through the classification network layer.
The structure of the shallow neural networks (CNNs) is shown in Table 2; each is composed of four 3×3 convolutional layers, four pooling layers, and two fully connected layers. The pooling layers mainly realize the downsampling operation and expand the receptive field; they also increase the network's computation speed and reduce overfitting.
TABLE 2
Network layer name Output feature size Network layer type
Input 320×320×3
Conv2d_1 320×320×16 convolution
Pooling2d_1 106×106×16 max-pooling
Conv2d_2 106×106×32 convolution
Pooling2d_2 35×35×32 max-pooling
Conv2d_3 35×35×64 convolution
Pooling2d_3 11×11×64 max-pooling
Conv2d_4 9×9×128 convolution
Pooling2d_4 128 global average pooling
Dense_1 64 fully connected
Dense_2 64 fully connected
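The two-channel classification network, with shallow branches roughly following Table 2, can be sketched as follows; the single-channel segmentation-map input, the pooling window of 3, and the number of gesture classes are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def shallow_cnn(inputs):
    """Shallow branch roughly per Table 2: four 3x3 convs with pooling, GAP, two dense layers."""
    x = inputs
    for filters in (16, 32, 64):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(3)(x)                      # 320 -> 106 -> 35 -> 11, as in Table 2
    x = layers.Conv2D(128, 3, activation="relu")(x)        # 11 -> 9, no padding
    x = layers.GlobalAveragePooling2D()(x)                 # 128-dim feature vector
    x = layers.Dense(64, activation="relu")(x)
    return layers.Dense(64, activation="relu")(x)

def dual_channel_classifier(num_classes=10):
    """Two parallel shallow branches (segmentation map and original image), concatenated, classified."""
    seg_in = layers.Input((320, 320, 1), name="hand_segmentation_map")   # shape features
    rgb_in = layers.Input((320, 320, 3), name="original_gesture_image")  # colour features
    feats = layers.Concatenate()([shallow_cnn(seg_in), shallow_cnn(rgb_in)])  # cascade network layer
    out = layers.Dense(num_classes, activation="softmax")(feats)              # classification network layer
    return tf.keras.Model([seg_in, rgb_in], out)
```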
In this embodiment, training of the semantic segmentation network based on the codec structure and of the dual-channel classification network is based on the TensorFlow framework, and the hardware is a server with a GeForce RTX 3080 GPU. Network training starts without pre-trained weights; the training pictures are resized to 320×320, and the data are augmented using horizontal/vertical flipping and scaling operations. All experiments are trained with the Adam optimizer, with an initial learning rate of 0.001, a weight decay (Decay) of 0, and a batch size of 8.
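The stated training configuration can be expressed roughly as below; the stand-in model, augmentation ranges, placeholder data, and epoch count are assumptions, while the optimizer, learning rate, batch size, and input size follow the settings above.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Stand-in model; in practice this would be the dual-channel gesture classifier described above.
model = tf.keras.Sequential([
    layers.Input((320, 320, 3)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),   # initial LR 0.001, decay 0
              loss="categorical_crossentropy", metrics=["accuracy"])

# Horizontal/vertical flips and scaling, as in the augmentation described above.
augmenter = tf.keras.preprocessing.image.ImageDataGenerator(
    horizontal_flip=True, vertical_flip=True, zoom_range=0.2)

x = np.random.rand(16, 320, 320, 3).astype("float32")                   # placeholder images
y = tf.keras.utils.to_categorical(np.random.randint(0, 10, 16), 10)     # placeholder labels
model.fit(augmenter.flow(x, y, batch_size=8), epochs=1)                 # batch size 8
```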
The loss of the semantic segmentation network is calculated according to the following formula:

L_{seg} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i\log p_i + (1-y_i)\log(1-p_i)\bigr]

where N is the number of all samples, and y_i and p_i respectively represent the true label pixel value and the predicted probability map of the i-th picture.
The loss of the two-channel classification network is calculated by adopting the following formula:

L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\log p_{ik}

where N is the number of all samples, K is the number of all gesture categories, y_{ik} represents the true probability that the i-th sample belongs to category k, and p_{ik} represents the predicted probability that the i-th sample belongs to category k.
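Reading the two losses as binary and categorical cross-entropy (an interpretation based on the variable definitions above rather than an explicit statement here), a minimal sketch is:

```python
import tensorflow as tf

def segmentation_loss(y_true, p_pred):
    """Pixel-wise binary cross-entropy between ground-truth masks and predicted probability maps."""
    return tf.keras.losses.BinaryCrossentropy()(y_true, p_pred)

def classification_loss(y_true_onehot, p_pred):
    """Categorical cross-entropy over the K gesture categories, averaged over the N samples."""
    return tf.keras.losses.CategoricalCrossentropy()(y_true_onehot, p_pred)

# Example: two samples, three gesture categories.
y = tf.constant([[1., 0., 0.], [0., 1., 0.]])
p = tf.constant([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])
print(float(classification_loss(y, p)))   # ~0.29
```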
Preferably, the hand segmentation result is evaluated by adopting preset evaluation criteria; the preset evaluation criteria include: mean intersection-over-union (mIOU), Model Size, and floating-point operations per second (FLOPS).
The mean intersection-over-union (mIOU) is defined as:

mIOU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k}p_{ij}+\sum_{j=0}^{k}p_{ji}-p_{ii}}

where k+1 represents the number of categories in the image, here two categories, a hand region and a non-hand region, and p_{ij} represents the number of pixels in the image for which category i is predicted as category j.
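The mIOU over the two classes (hand / non-hand) can be computed directly, as in the sketch below; the binary-label convention and the tiny example masks are illustrative.

```python
import numpy as np

def mean_iou(pred_mask, true_mask, num_classes=2):
    """Mean intersection-over-union, averaged over the hand and non-hand classes."""
    pred_mask, true_mask = np.asarray(pred_mask), np.asarray(true_mask)
    ious = []
    for c in range(num_classes):
        intersection = np.logical_and(pred_mask == c, true_mask == c).sum()
        union = np.logical_or(pred_mask == c, true_mask == c).sum()
        if union > 0:                       # skip classes absent from both masks
            ious.append(intersection / union)
    return float(np.mean(ious))

# Tiny 2x3 example (1 = hand, 0 = non-hand).
pred = [[0, 1, 1], [0, 0, 1]]
true = [[0, 1, 1], [0, 1, 1]]
print(mean_iou(pred, true))                 # ~0.708 (hand IoU 3/4, non-hand IoU 2/3)
```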
The model size (ModelSize) and floating point operations per second (FLOPS) were used to further evaluate the feasibility of the model.
The gesture recognition result is evaluated by adopting preset evaluation criteria; the preset evaluation criteria include: Accuracy, macro F1-score (Macro-F1), model size (ModelSize), and floating-point operations per second (FLOPS).
The Accuracy is defined as:

Accuracy = \frac{TP+TN}{Total}

where TP represents the number of samples whose true label is a positive example and which are predicted as positive; TN represents the number of samples whose true label is negative and which are predicted as negative; Total represents the number of all samples.
The macro F1-score (Macro-F1) is defined as the average of the F1-scores (F1-Score) over all gesture categories:

Macro\text{-}F1 = \frac{1}{C}\sum_{i=1}^{C}F1\text{-}Score_i

where C represents the number of all gesture categories, and F1-Score_i represents the F1-score of the i-th gesture category.
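The two recognition metrics can likewise be computed directly; the sketch below uses integer class labels and is an illustration rather than the exact evaluation code used here.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """(TP + TN) / Total, i.e. the fraction of samples whose predicted class matches the true class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def macro_f1(y_true, y_pred, num_classes):
    """Average of the per-category F1-scores over all gesture categories."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return float(np.mean(f1s))

y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]
print(accuracy(y_true, y_pred))        # 0.8
print(macro_f1(y_true, y_pred, 3))     # per-class F1 averaged over the 3 categories
```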
As shown in Table 3, the hand segmentation under a complex background provided by this embodiment is compared with other algorithms on each index. The values corresponding to the best-performing method in the table are shown in bold. As can easily be seen from the table, performance is clearly improved on the three selected evaluation indexes, especially on the two indexes of model size and floating-point operations per second. The hand segmentation performance of the semantic segmentation network based on the coding and decoding structure provided by the invention is superior to that of the other algorithms, while the model is very small and the requirements on hardware equipment are low.
TABLE 3
Fig. 6 compares the hand segmentation results under a complex background provided by this embodiment with the results of other algorithms. In the figure, the first and second columns show the original input image and the corresponding hand mask image, the third column shows the result of the algorithm proposed herein, and the remaining columns show the results of the comparison algorithms. The figure provides visual segmentation results, and it is easy to see that the hand segmentation method provided by this embodiment achieves a good segmentation effect even when the environment around the gesture is complex.
As shown in Table 4, the gesture recognition under a complex background provided by this embodiment is compared with other algorithms on each index. The values corresponding to the best-performing method in the table are shown in bold. As can be seen from the table, performance is significantly improved on all four selected evaluation indexes. The gesture recognition method under a complex background provided by the invention performs better than the other algorithms, while the model is very small and the requirements on hardware equipment are low.
TABLE 4
Method Accuracy Macro-F1 Model size FLOPS
ResNet-101 0.8333 0.8375 162.81M 85041593
ShuffleNetV2 0.8617 0.8612 7.4M 3826374
MobileNetV3 0.8752 0.8758 11.64M 6056813
HGR-Net 0.8713 0.8810 1.91M 991530
Ours 0.9117 0.9114 1.85M 950306
The recognition method provided by the embodiment fuses shallow detail features and deep semantic features through the semantic segmentation network based on the coding and decoding structure, and is suitable for correctly positioning the hand region and segmenting the hand with clear outline; and the features of the hand segmentation map and the original gesture image are respectively extracted by adopting a two-channel classification network, and the fused features are classified and identified, so that the gesture identification precision is improved.
According to the embodiment, multi-scale context information is added into the semantic segmentation network of the encoding and decoding structure, so that the performance of semantic segmentation is improved, meanwhile, depth separable convolution is introduced into the segmentation network, the calculation cost is greatly reduced, the requirement of a model on hardware equipment is reduced, and the whole gesture recognition network is lighter.
In summary, the invention discloses a gesture recognition method under a complex background based on semantic segmentation and a dual-channel classification network. After extracting a feature map of a hand region by using a residual error network, the semantic segmentation network adds a cavity space pooling pyramid (ASPP) and a decoder module to obtain a better hand segmentation effect map; the double-channel classification network is constructed, features extracted from the hand segmentation map and the original gesture image are fused, and the gesture recognition accuracy under the complex background is improved. The gesture recognition method under the complex background provided by the invention is compared with the results of other algorithms, and the results show that the gesture recognition method under the complex background can keep better performance. Meanwhile, the model is small, and the requirement on hardware equipment is low.
According to the invention, the shallow detail features and the deep semantic features are fused through the semantic segmentation network based on the coding and decoding structure, so that the method is suitable for correctly positioning the hand region and simultaneously segmenting the hand with clear outline; and the features of the hand segmentation map and the original gesture image are respectively extracted by adopting a two-channel classification network, and the fused features are classified and identified, so that the gesture identification precision is improved. According to the invention, multi-scale context information is added into the semantic segmentation network of the coding and decoding structure, so that the performance of semantic segmentation is improved, meanwhile, depth separable convolution is introduced into the segmentation network, the calculation cost is greatly reduced, the requirement of a model on hardware equipment is reduced, and the whole gesture recognition network is lighter.
It should be noted that, in the embodiments of the present invention, the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial", "circumferential", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the embodiments, and do not indicate or imply that the apparatus or element being referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communicated with the inside of two elements or the interaction relationship of the two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
While the present invention has been described in detail through the foregoing description of the preferred embodiment, it should be understood that the foregoing description is not to be considered as limiting the invention. Many modifications and substitutions of the present invention will become apparent to those of ordinary skill in the art upon reading the foregoing. Accordingly, the scope of the invention should be limited only by the attached claims.

Claims (1)

1. A method of gesture recognition in a complex context, comprising:
carrying out feature extraction on a gesture picture data set containing a complex background by adopting a semantic segmentation network based on a coding and decoding structure, and outputting a hand segmentation map;
the method comprises the steps of performing feature extraction on a hand segmentation map and an original gesture picture dataset based on a two-channel classification network, and identifying gesture categories;
the gesture picture data set of the complex background meets the preset experiment requirements, and the preset experiment requirements comprise: the images of the dataset all bear corresponding ground truth images, each set of images being completed by a different subject; images of the dataset are all acquired in very challenging situations;
the semantic segmentation network based on the coding and decoding structure comprises: a 3 x 3 convolutional layer, four bottleneck residual modules, a hole space pooling pyramid ASPP, and a decoder module;
the 3×3 convolution layer, the four bottleneck residual modules, and the hole space pooling pyramid ASPP are sequentially connected;
the features output by the second bottleneck residual module are upsampled and fused with the features output by the hole space pooling pyramid ASPP, and the fused features are used as the input of the decoder module;
each bottleneck residual module comprises three bottleneck residual units connected in sequence;
the second bottleneck residual module and the third bottleneck residual module apply a downsampling operation to capture semantic information;
the features output by the second bottleneck residual module are upsampled to obtain shallow detail features;
the fourth bottleneck residual module applies hole convolutions with different sizes to obtain more context information;
the bottleneck residual unit comprises: two 1×1 convolution layers and a depth separable convolution structure;
the depth separable convolution structure includes: a channel-by-channel convolution Depthwise Conv and a point-by-point convolution 1×1 Conv, both followed by a batch normalization operation Batch Normalization and a ReLU activation function;
the hole space pooling pyramid module ASPP captures multi-scale semantic information through four parallel hole convolutions and a global pooling operation, and the features extracted by each parallel layer are fused together through a cascade module to obtain deep semantic features;
the decoder module fuses the shallow detail features and the deep semantic features together; the fused features are refined through two convolution layers, and finally the hand segmentation map with clear contours is output through an upsampling operation;
the dual channel classification network comprises: two identical shallow convolutional neural networks, a cascade network layer and a classification network layer;
the hand segmentation graph and the original gesture image output by the semantic segmentation network are used as the input of two identical shallow convolutional neural networks of the two-channel classification network, the shape characteristics and the color characteristics of the hand are obtained through the two parallel shallow convolutional neural networks, the extracted characteristics are fused together through a cascade network layer to be used as the input of a final classification network layer, and the final gesture recognition is realized through the classification network layer;
the loss of the semantic segmentation network is calculated by adopting the following formula:
where N is the number of all samples, y i And p i Respectively representing a true label pixel value and a predicted probability map of an ith picture;
the loss of the two-channel classification network is calculated by adopting the following formula:
where N is the number of all samples, K is the number of all gesture categories, y ik Representing the true probability that the ith sample belongs to class j, p ik Representing the prediction probability that the ith sample belongs to class j;
the hand segmentation result is evaluated by adopting preset evaluation criteria, wherein the preset evaluation criteria include: mean intersection-over-union mIOU, Model Size, and floating-point operations per second FLOPS;
the mean intersection-over-union mIOU is defined as:

mIOU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k}p_{ij}+\sum_{j=0}^{k}p_{ji}-p_{ii}}

where k+1 represents the number of categories in the image, here two categories, the hand region and the non-hand region, and p_{ij} represents the number of pixels in the image for which category i is predicted as category j;
the model size ModelSize and the floating-point operations per second FLOPS are used to further evaluate the feasibility of the model;
the gesture recognition result is evaluated by adopting preset evaluation criteria, wherein the preset evaluation criteria include: Accuracy, macro F1-score Macro-F1, model size ModelSize, and floating-point operations per second FLOPS;
the Accuracy is defined as:

Accuracy = \frac{TP+TN}{Total}

where TP represents the number of samples whose true label is a positive example and which are predicted as positive; TN represents the number of samples whose true label is negative and which are predicted as negative; Total represents the number of all samples;
the macro F1-score Macro-F1 is defined as the average of the F1-scores F1-Score over all gesture categories:

Macro\text{-}F1 = \frac{1}{C}\sum_{i=1}^{C}F1\text{-}Score_i

where C represents the number of all gesture categories, and F1-Score_i represents the F1-score of the i-th gesture category.
CN202110473809.6A 2021-04-29 2021-04-29 Gesture recognition method under complex background Active CN112966672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110473809.6A CN112966672B (en) 2021-04-29 2021-04-29 Gesture recognition method under complex background

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110473809.6A CN112966672B (en) 2021-04-29 2021-04-29 Gesture recognition method under complex background

Publications (2)

Publication Number Publication Date
CN112966672A CN112966672A (en) 2021-06-15
CN112966672B true CN112966672B (en) 2024-04-05

Family

ID=76281236

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110473809.6A Active CN112966672B (en) 2021-04-29 2021-04-29 Gesture recognition method under complex background

Country Status (1)

Country Link
CN (1) CN112966672B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298080B (en) * 2021-07-26 2021-11-05 城云科技(中国)有限公司 Target detection enhancement model, target detection method, target detection device and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214250A (en) * 2017-07-05 2019-01-15 中南大学 A kind of static gesture identification method based on multiple dimensioned convolutional neural networks
CN110781895A (en) * 2019-10-10 2020-02-11 湖北工业大学 Image semantic segmentation method based on convolutional neural network
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN112184635A (en) * 2020-09-10 2021-01-05 上海商汤智能科技有限公司 Target detection method, device, storage medium and equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214250A (en) * 2017-07-05 2019-01-15 中南大学 A kind of static gesture identification method based on multiple dimensioned convolutional neural networks
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN110781895A (en) * 2019-10-10 2020-02-11 湖北工业大学 Image semantic segmentation method based on convolutional neural network
CN112184635A (en) * 2020-09-10 2021-01-05 上海商汤智能科技有限公司 Target detection method, device, storage medium and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王金鹤; 苏翠丽; 孟凡云; 车志龙; 谭浩; 张楠. Stereo matching network based on asymmetric spatial pyramid pooling. Computer Engineering. 2020, (07), full text. *
邢予权; 潘今一; 王伟; 刘建烽. Gesture recognition based on semantic segmentation and transfer learning. Computer Measurement & Control. 2020, (04), full text. *

Also Published As

Publication number Publication date
CN112966672A (en) 2021-06-15

Similar Documents

Publication Publication Date Title
CN107239733A (en) Continuous hand-written character recognizing method and system
CN111340814A (en) Multi-mode adaptive convolution-based RGB-D image semantic segmentation method
CN113642390B (en) Street view image semantic segmentation method based on local attention network
CN109961005A (en) A kind of dynamic gesture identification method and system based on two-dimensional convolution network
CN110188708A (en) A kind of facial expression recognizing method based on convolutional neural networks
Zhou et al. A lightweight hand gesture recognition in complex backgrounds
CN110110724A (en) The text authentication code recognition methods of function drive capsule neural network is squeezed based on exponential type
CN108537109B (en) OpenPose-based monocular camera sign language identification method
CN113920581A (en) Method for recognizing motion in video by using space-time convolution attention network
Neto et al. Sign language recognition based on 3d convolutional neural networks
Dar et al. Efficient-SwishNet based system for facial emotion recognition
CN115966010A (en) Expression recognition method based on attention and multi-scale feature fusion
Boukdir et al. Isolated video-based Arabic sign language recognition using convolutional and recursive neural networks
CN112200110A (en) Facial expression recognition method based on deep interference separation learning
CN112966672B (en) Gesture recognition method under complex background
CN116502181A (en) Channel expansion and fusion-based cyclic capsule network multi-modal emotion recognition method
CN106203448A (en) A kind of scene classification method based on Nonlinear Scale Space Theory
CN111813894A (en) Natural language emotion recognition method based on deep learning
Podder et al. Bangla sign language alphabet recognition using transfer learning based convolutional neural network
CN109558880B (en) Contour detection method based on visual integral and local feature fusion
Zhang et al. A simple and effective static gesture recognition method based on attention mechanism
Rawf et al. Effective Kurdish sign language detection and classification using convolutional neural networks
Han Residual learning based CNN for gesture recognition in robot interaction
CN109919057B (en) Multi-mode fusion gesture recognition method based on efficient convolutional neural network
CN114863572B (en) Myoelectric gesture recognition method of multi-channel heterogeneous sensor

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant