CN112966672A - Gesture recognition method under complex background - Google Patents

Gesture recognition method under complex background

Info

Publication number
CN112966672A
CN112966672A (application CN202110473809.6A)
Authority
CN
China
Prior art keywords
gesture
segmentation
complex background
network
gesture recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110473809.6A
Other languages
Chinese (zh)
Other versions
CN112966672B (en)
Inventor
陈昆
周薇娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University filed Critical Shanghai Maritime University
Priority to CN202110473809.6A priority Critical patent/CN112966672B/en
Publication of CN112966672A publication Critical patent/CN112966672A/en
Application granted granted Critical
Publication of CN112966672B publication Critical patent/CN112966672B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • G06V40/113Recognition of static hand signs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A gesture recognition method for complex backgrounds uses a semantic segmentation network with an encoder-decoder structure to extract features from a gesture image dataset containing complex backgrounds and to output a hand segmentation map; a two-channel classification network then extracts features from the hand segmentation map and the original gesture images to identify the gesture category. The invention adds multi-scale context information to the encoder-decoder segmentation network, which improves semantic segmentation performance, and introduces depthwise separable convolutions into the segmentation network, which greatly reduces the computational cost and the model's hardware requirements and makes the whole gesture recognition network more lightweight.

Description

Gesture recognition method under complex background
Technical Field
The invention relates to a target segmentation recognition technology, in particular to a gesture recognition method under a complex background.
Background
Since ancient times, humans have communicated using sign language; gestures are as old as human civilization itself. Gestures are particularly useful for expressing words or meanings to be communicated, so even with established writing conventions, people around the world continue to express themselves with gestures.
In recent years, with the development of machine vision, human-computer interaction has become more closely tied to people's daily lives. Gestures are a common way for people to communicate; they are vital to natural communication between humans and machines and provide a more comfortable experience for operators. In particular, gestures can offer a more intuitive way to interact with a computer, which has attracted the attention of researchers.
Gestures are used to convey information, and gesture recognition has long been an important research area in machine vision. Gesture recognition can provide services to specific groups, such as deaf or hearing-impaired people, and it also has broad application prospects in fields such as intelligent driving, machine control and virtual reality.
In practical applications, gesture recognition is challenged by different viewing angles, different hand sizes, skin colors, illumination intensities and the environment surrounding the gesture. The background of a gesture image can be divided into a simple background, meaning one that contains no noise, and a complex background, meaning one that does. In practical scenarios, a high-precision solution for gesture recognition against a complex background is still lacking, so achieving high-precision gesture recognition under a complex background is of great practical significance.
Disclosure of Invention
The invention aims to provide a gesture recognition method for complex backgrounds that can accurately recognize the category of a gesture in a complex background and reduce the cost of manual recognition.
To achieve the above object, the present invention provides a gesture recognition method under a complex background, comprising:
using a semantic segmentation network based on an encoder-decoder structure to extract features from a gesture image dataset containing complex backgrounds and to output a hand segmentation map; and
extracting features from the hand segmentation map and the original gesture image dataset with a two-channel classification network to identify the gesture category.
The gesture image dataset containing complex backgrounds meets preset experimental requirements: each image of the dataset has a corresponding ground-truth image, each group of images is performed by a different subject, and the images are acquired in very challenging situations.
The semantic segmentation network based on the encoder-decoder structure comprises: a 3 × 3 convolutional layer, four bottleneck residual modules, an atrous spatial pyramid pooling (ASPP) module, and a decoder module.
The 3 × 3 convolutional layer, the four bottleneck residual modules and the ASPP module are connected in sequence.
The features output by the second bottleneck residual module are upsampled and fused with the output features of the ASPP module, and the fused features serve as the input of the decoder module.
Each bottleneck residual module comprises three bottleneck residual units connected in sequence.
The second and third bottleneck residual modules perform downsampling operations to capture semantic information.
The features output by the second bottleneck residual module are upsampled to obtain shallow detail features.
The fourth bottleneck residual module applies atrous convolutions with different dilation rates to obtain more context information.
Each bottleneck residual unit comprises two 1 × 1 convolutional layers and a depthwise separable convolution structure.
The depthwise separable convolution structure comprises a channel-wise convolution (Depthwise Conv) and a point-wise convolution (1 × 1 Conv), each followed by a batch normalization operation and a ReLU activation function.
The ASPP module captures multi-scale semantic information through four parallel atrous convolutions and a global pooling operation, and the features extracted by each parallel branch are fused by a concatenation module to obtain deep semantic features.
The decoder module fuses the shallow detail features and the deep semantic features, refines the fused features through two convolutional layers, and finally outputs a hand segmentation map with clear contours through an upsampling operation.
The two-channel classification network comprises two identical shallow convolutional neural networks, a concatenation layer and a classification layer.
The hand segmentation map output by the semantic segmentation network and the original gesture image are used as the inputs of the two identical shallow convolutional neural networks of the two-channel classification network; the shape features and color features of the hand are obtained through the two parallel shallow convolutional neural networks; the extracted features are fused by the concatenation layer and used as the input of the final classification layer, which performs the final gesture recognition.
The loss of the semantic segmentation network is calculated with the binary cross-entropy:
L_{seg} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log p_i + (1-y_i)\log(1-p_i) \,\right]
where N is the number of samples, and y_i and p_i denote the ground-truth label pixel values and the predicted probability map of the i-th picture, respectively;
the loss of the two-channel classification network is calculated by adopting the following formula:
Figure BDA0003046598170000032
where N is the number of all samples, K represents the number of all gesture classes, yikRepresenting the true probability, p, that the ith sample belongs to the class jikRepresenting the prediction probability that the ith sample belongs to the class j.
The hand segmentation results are evaluated with preset evaluation criteria, which include the mean intersection-over-union (mIoU), the model size and the number of floating-point operations per second (FLOPS).
The mean intersection-over-union (mIoU) is defined as:
mIoU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}
where k + 1 is the number of categories in the image (here two: hand region and non-hand region) and p_{ij} is the number of pixels of class i predicted as class j.
The model size and the number of floating-point operations per second (FLOPS) are used to further evaluate the feasibility of the model.
The gesture recognition results are evaluated with preset evaluation criteria, which include Accuracy, the macro F1-score (Macro-F1), the model size and FLOPS.
Accuracy is defined as:
Accuracy = \frac{TP + TN}{Total}
where TP is the number of samples whose true label is positive and which are predicted as positive, TN is the number of samples whose true label is negative and which are predicted as negative, and Total is the number of all samples.
The macro F1-score (Macro-F1) is defined as the average of the F1-scores of all gesture categories:
Macro\text{-}F1 = \frac{1}{C}\sum_{i=1}^{C} F1\text{-}Score_i
where C is the number of gesture categories and F1-Score_i is the F1-score of the i-th gesture category.
By fusing shallow detail features and deep semantic features through the encoder-decoder semantic segmentation network, the invention correctly locates the hand region while segmenting hands with clear contours; a two-channel classification network extracts features from the hand segmentation map and the original gesture image separately and classifies the fused features, which improves gesture recognition accuracy. The invention adds multi-scale context information to the encoder-decoder segmentation network, improving semantic segmentation performance, and introduces depthwise separable convolutions into the segmentation network, greatly reducing the computational cost and the model's hardware requirements and making the whole gesture recognition network more lightweight.
Drawings
Fig. 1 is a schematic general flow chart of a gesture recognition method under a complex background according to the present invention.
Fig. 2 is a schematic diagram of a network framework used in a gesture recognition method under a complex background according to the present invention.
Fig. 3 is a schematic diagram of the depthwise separable convolution module provided by the present invention.
Fig. 4 is a schematic diagram of a bottleneck residual unit with depthwise separable convolution according to the present invention.
Fig. 5 is a schematic diagram of the atrous spatial pyramid pooling (ASPP) module provided by the present invention.
Fig. 6 is a schematic diagram comparing the hand segmentation results of the present invention with those of other algorithms.
Detailed Description
The preferred embodiment of the present invention will be described in detail below with reference to fig. 1 to 6.
As shown in fig. 1, the gesture recognition method in a complex background provided in this embodiment includes the following steps:
and step S1, collecting a data set for gesture recognition in a complex background.
Specifically, the collected image dataset identifying the gesture under the complex background meets a preset experimental requirement, and the preset experimental requirement comprises: each image of the data set to be identified is provided with a corresponding ground surface value image, and each group of images are completed by different subjects; the images of each of the datasets to be identified are acquired in very challenging situations, such as variations in lighting, the inclusion of objects in the background that are close to skin tones, and the occlusion of hands and faces of different shapes and sizes.
Step S2: extract features from the dataset with the semantic segmentation network based on the encoder-decoder structure and output a hand segmentation map.
As shown in fig. 2, the semantic segmentation network based on the encoder-decoder structure is part (a) of the figure and specifically includes: a 3 × 3 convolutional layer, four bottleneck residual modules, an atrous spatial pyramid pooling (ASPP) module, and a simple decoder module.
As shown in fig. 3, a depthwise separable convolution structure (DepS Conv) is applied in the encoder-decoder semantic segmentation network to reduce the computational cost of the model and make hand segmentation in complex backgrounds feasible with limited computing resources. The depthwise separable convolution structure consists of a channel-wise convolution (Depthwise Conv) and a point-wise convolution (1 × 1 Conv), each followed by a batch normalization operation and a ReLU activation function. Batch normalization helps speed up network training while mitigating vanishing gradients.
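A minimal TensorFlow/Keras sketch of such a depthwise separable convolution block is shown below; the 3 × 3 kernel size and the helper name deps_conv are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def deps_conv(x, out_channels, stride=1, dilation=1):
    """Depthwise separable convolution (DepS Conv): a channel-wise 3x3 convolution
    followed by a point-wise 1x1 convolution, each with batch normalization and ReLU."""
    x = layers.DepthwiseConv2D(3, strides=stride, dilation_rate=dilation,
                               padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(out_channels, 1, use_bias=False)(x)   # point-wise convolution
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return x
```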
The 3 × 3 convolutional layer and the four bottleneck residual modules are connected in sequence to form a residual network that extracts the feature information of the image. The specific structure is shown in Table 1, where ResBlock_1 denotes the first bottleneck residual module; each bottleneck residual module is a cascade of three bottleneck residual units. The structure of the bottleneck residual unit is shown in fig. 4: it consists of two 1 × 1 convolutional layers and a depthwise separable convolution structure, where the 1 × 1 convolutional layers add nonlinearity to improve the expressive power of the network and also reduce the channel dimension.
The second and third bottleneck residual modules apply a downsampling operation to capture semantic information. Each residual unit of the last bottleneck residual module applies an atrous convolution with a different dilation rate to capture more context information.
TABLE 1 (layer-by-layer configuration of the 3 × 3 convolutional layer and the four bottleneck residual modules; reproduced as an image in the original publication)
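Assuming the unit follows the usual bottleneck pattern (1 × 1 reduce, depthwise separable convolution, 1 × 1 expand, plus a shortcut), a sketch reusing deps_conv from above could look as follows; the channel widths and reduction ratio are assumptions, since Table 1 is not reproduced here.

```python
def bottleneck_residual_unit(x, out_channels, stride=1, dilation=1, reduce_ratio=4):
    """Bottleneck residual unit (Fig. 4): 1x1 conv (reduce) -> DepS Conv -> 1x1 conv
    (expand), added to an identity or projected shortcut."""
    shortcut = x
    mid = max(out_channels // reduce_ratio, 8)

    y = layers.Conv2D(mid, 1, use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = deps_conv(y, mid, stride=stride, dilation=dilation)
    y = layers.Conv2D(out_channels, 1, use_bias=False)(y)
    y = layers.BatchNormalization()(y)

    # Project the shortcut when the spatial size or channel count changes.
    if stride != 1 or x.shape[-1] != out_channels:
        shortcut = layers.Conv2D(out_channels, 1, strides=stride, use_bias=False)(x)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))
```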
As shown in fig. 5, the atrous spatial pyramid pooling (ASPP) module captures multi-scale semantic information through four parallel atrous convolutions and a global pooling operation (Image Pooling), and the features extracted by each parallel branch are fused by a concatenation module. The global pooling operation provides context information from a larger receptive field.
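A sketch of such an ASPP module is given below; the dilation rates (1, 6, 12, 18) and the 256 output channels follow common DeepLab settings and are assumptions, as the patent does not list them. A fixed input resolution is assumed so that the feature-map size is known for the image-pooling branch.

```python
def aspp(x, out_channels=256, rates=(1, 6, 12, 18)):
    """ASPP: four parallel atrous convolutions plus one global image-pooling branch,
    concatenated and fused by a 1x1 convolution."""
    branches = []
    for r in rates:
        b = layers.Conv2D(out_channels, 3, dilation_rate=r, padding="same", use_bias=False)(x)
        b = layers.BatchNormalization()(b)
        branches.append(layers.ReLU()(b))

    # Image-pooling branch: global average pool, 1x1 conv, upsample back to the map size.
    h, w = x.shape[1], x.shape[2]
    p = layers.GlobalAveragePooling2D(keepdims=True)(x)
    p = layers.Conv2D(out_channels, 1, use_bias=False)(p)
    p = layers.ReLU()(layers.BatchNormalization()(p))
    branches.append(layers.UpSampling2D(size=(h, w), interpolation="bilinear")(p))

    y = layers.Concatenate()(branches)
    y = layers.Conv2D(out_channels, 1, use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(y)
```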
As shown in fig. 2(a), the decoder module (Decoder) fuses the shallow detail features and the deep semantic features, refines the fused features through two convolutional layers, and finally outputs a hand segmentation map with clear contours through an upsampling operation. The shallow detail features are obtained by upsampling the features output by the second bottleneck residual module; the deep semantic features are the fused features of the atrous spatial pyramid pooling (ASPP) module.
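A corresponding decoder sketch is shown below; the number of refinement channels (256) and the final upsampling factor are assumptions, chosen only so that the pieces fit together at a fixed input resolution.

```python
def decoder(deep, shallow, num_classes=2, final_upsample=4):
    """Decoder: bring the deep ASPP features to the resolution of the shallow detail
    features, concatenate them, refine with two 3x3 convolutions, and upsample to a
    full-resolution hand segmentation map."""
    scale = shallow.shape[1] // deep.shape[1]
    if scale > 1:
        deep = layers.UpSampling2D(size=scale, interpolation="bilinear")(deep)

    y = layers.Concatenate()([deep, shallow])
    for _ in range(2):                      # two refinement convolutions
        y = layers.Conv2D(256, 3, padding="same", use_bias=False)(y)
        y = layers.BatchNormalization()(y)
        y = layers.ReLU()(y)

    y = layers.Conv2D(num_classes, 1)(y)    # per-pixel class logits (hand / non-hand)
    y = layers.UpSampling2D(size=final_upsample, interpolation="bilinear")(y)
    return layers.Softmax()(y)
```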
Step S3: extract features from the hand segmentation maps and the original images with a two-channel classification network and identify the gesture category.
As shown in fig. 2(b), the two-channel classification network comprises two parallel shallow convolutional neural networks (CNNs), a concatenation layer and a classification layer. The two parallel shallow convolutional neural networks extract features from the hand segmentation map and the original gesture image, respectively; the concatenation layer fuses the extracted features, and the classification layer performs the final gesture recognition.
The structure of the shallow convolutional neural networks (CNNs) is shown in Table 2; each consists of four 3 × 3 convolutional layers, four pooling layers and two fully connected layers. The pooling layers mainly perform downsampling and enlarge the receptive field; they also speed up network computation and reduce overfitting.
TABLE 2
Network layer name   Output feature size   Network layer type
Input                320×320×3             —
Conv2d_1             320×320×16            convolution
Pooling2d_1          106×106×16            max-pooling
Conv2d_2             106×106×32            convolution
Pooling2d_2          35×35×32              max-pooling
Conv2d_3             35×35×64              convolution
Pooling2d_3          11×11×64              max-pooling
Conv2d_4             9×9×128               convolution
Pooling2d_4          128                   global average pooling
Dense_1              64                    fully connected
Dense_2              64                    fully connected
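Following Table 2, a sketch of one shallow CNN stream and of the two-channel classifier built from two such streams is given below; the ReLU activations, the pooling window of 3 and the 3-channel segmentation-map input are assumptions consistent with the listed output sizes.

```python
def shallow_cnn(inputs):
    """One shallow CNN stream (Table 2): four 3x3 convolutions, four pooling layers
    and two fully connected layers, yielding a 64-dimensional feature vector."""
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D(pool_size=3)(x)                            # 320 -> 106
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=3)(x)                            # 106 -> 35
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=3)(x)                            # 35 -> 11
    x = layers.Conv2D(128, 3, padding="valid", activation="relu")(x)   # 11 -> 9
    x = layers.GlobalAveragePooling2D()(x)                             # -> 128
    x = layers.Dense(64, activation="relu")(x)
    return layers.Dense(64, activation="relu")(x)

def two_channel_classifier(num_classes):
    """Two-channel classification network: one stream takes the hand segmentation map,
    the other the original gesture image; their features are concatenated and classified."""
    seg_in = tf.keras.Input((320, 320, 3))    # hand segmentation map
    img_in = tf.keras.Input((320, 320, 3))    # original gesture image
    fused = layers.Concatenate()([shallow_cnn(seg_in), shallow_cnn(img_in)])
    out = layers.Dense(num_classes, activation="softmax")(fused)
    return tf.keras.Model([seg_in, img_in], out)
```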
In this embodiment, the encoder-decoder semantic segmentation network and the two-channel classification network are trained with the TensorFlow framework on a GeForce RTX 3080 GPU server. The networks are trained from scratch without pretrained weights; the training images are resized to 320 × 320 and augmented with operations such as horizontal/vertical flipping and scaling. All experiments are trained with the Adam optimizer, with an initial learning rate of 0.001, a weight decay of 0, and a batch size (Batch_size) of 8.
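Under the stated settings, the training configuration might look like the sketch below; the dataset train_ds (image pairs already resized to 320 × 320 and augmented with joint flips/scaling), the class count and the epoch count are assumptions.

```python
model = two_channel_classifier(num_classes=10)   # class count is an assumption

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # decay kept at 0 (default)
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(train_ds.batch(8), epochs=100)   # batch size 8; epoch count is an assumption
```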
In this embodiment, the loss of the semantic segmentation network is calculated with the binary cross-entropy:
L_{seg} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log p_i + (1-y_i)\log(1-p_i) \,\right]
where N is the number of samples, and y_i and p_i denote the ground-truth label pixel values and the predicted probability map of the i-th picture, respectively.
The loss of the two-channel classification network is calculated with the categorical cross-entropy:
L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\log p_{ik}
where N is the number of samples, K is the number of gesture classes, y_{ik} is the true probability that the i-th sample belongs to class k, and p_{ik} is the predicted probability that the i-th sample belongs to class k.
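Both losses can be written directly in TensorFlow as in the sketch below; the small epsilon for numerical stability is an implementation detail rather than part of the formulas.

```python
import tensorflow as tf

def segmentation_loss(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy between ground-truth masks y_i and predicted
    hand-probability maps p_i, averaged over all samples and pixels."""
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    return -tf.reduce_mean(y_true * tf.math.log(y_pred)
                           + (1.0 - y_true) * tf.math.log(1.0 - y_pred))

def classification_loss(y_true, y_pred, eps=1e-7):
    """Categorical cross-entropy over K gesture classes; y_true is one-hot (y_ik)
    and y_pred holds the predicted class probabilities (p_ik)."""
    y_pred = tf.clip_by_value(y_pred, eps, 1.0)
    return -tf.reduce_mean(tf.reduce_sum(y_true * tf.math.log(y_pred), axis=-1))
```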
Preferably, the hand segmentation results are evaluated with preset evaluation criteria, which include the mean intersection-over-union (mIoU), the model size and the number of floating-point operations per second (FLOPS).
The mean intersection-over-union (mIoU) is defined as:
mIoU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}
where k + 1 is the number of categories in the image (here two: hand region and non-hand region) and p_{ij} is the number of pixels of class i predicted as class j.
The model size and the number of floating-point operations per second (FLOPS) are used to further evaluate the feasibility of the model.
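A straightforward NumPy computation of the mIoU from predicted and ground-truth label maps is sketched below.

```python
import numpy as np

def mean_iou(y_true, y_pred, num_classes=2):
    """Mean intersection-over-union from the confusion matrix, where entry (i, j)
    counts pixels of class i predicted as class j (here: non-hand / hand)."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true.ravel(), y_pred.ravel()):
        conf[t, p] += 1
    ious = []
    for i in range(num_classes):
        inter = conf[i, i]
        union = conf[i, :].sum() + conf[:, i].sum() - inter
        ious.append(inter / union if union > 0 else 0.0)
    return float(np.mean(ious))
```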
The gesture recognition results are evaluated with preset evaluation criteria, which include Accuracy, the macro F1-score (Macro-F1), the model size and FLOPS.
Accuracy is defined as:
Accuracy = \frac{TP + TN}{Total}
where TP is the number of samples whose true label is positive and which are predicted as positive, TN is the number of samples whose true label is negative and which are predicted as negative, and Total is the number of all samples.
The macro F1-score (Macro-F1) is defined as the average of the F1-scores of all gesture categories:
Macro\text{-}F1 = \frac{1}{C}\sum_{i=1}^{C} F1\text{-}Score_i
where C is the number of gesture categories and F1-Score_i is the F1-score of the i-th gesture category.
Table 3 compares the hand segmentation results of this embodiment in complex backgrounds with those of other algorithms under each metric; the values of the best-performing method are shown in bold in the original table. All three evaluation metrics show a clear improvement, especially the model size and the number of floating-point operations per second. The hand segmentation performance of the proposed encoder-decoder semantic segmentation network is therefore superior to that of the other algorithms, while the model is very small and places low demands on hardware.
TABLE 3 (hand segmentation comparison in terms of mIoU, model size and FLOPS; reproduced as an image in the original publication)
Fig. 6 compares the hand segmentation results of this embodiment in complex backgrounds with those of other algorithms. The first and second columns show the original input images and the corresponding hand mask images, respectively; the third column shows the results of the proposed algorithm, and the remaining columns show the results of the comparison algorithms. The figure provides an intuitive view of the segmentation results: the hand segmentation method of this embodiment achieves a good segmentation effect even when the environment around the gesture is complex.
Table 4 compares the gesture recognition results of this embodiment in complex backgrounds with those of other algorithms under each metric; the values of the best-performing method are shown in bold in the original table. Performance improves clearly on all four evaluation metrics. The proposed gesture recognition method for complex backgrounds thus performs better than the other algorithms, while the model is very small and places low demands on hardware.
TABLE 4
Method         Accuracy   Macro-F1   Model size   FLOPS
ResNet-101     0.8333     0.8375     162.81 M     85041593
ShuffleNetV2   0.8617     0.8612     7.4 M        3826374
MobileNetV3    0.8752     0.8758     11.64 M      6056813
HGR-Net        0.8713     0.8810     1.91 M       991530
Ours           0.9117     0.9114     1.85 M       950306
The recognition method of this embodiment fuses shallow detail features and deep semantic features through the encoder-decoder semantic segmentation network, so that the hand region is located correctly while hands are segmented with clear contours; a two-channel classification network extracts features from the hand segmentation map and the original gesture image separately and classifies the fused features, improving gesture recognition accuracy.
In this embodiment, multi-scale context information is added to the encoder-decoder semantic segmentation network, improving semantic segmentation performance; at the same time, depthwise separable convolutions are introduced into the segmentation network, greatly reducing the computational cost and the model's hardware requirements and making the whole gesture recognition network more lightweight.
In summary, the invention discloses a gesture recognition method for complex backgrounds based on semantic segmentation and a two-channel classification network. In the semantic segmentation network, after a residual network extracts the feature map of the hand region, an atrous spatial pyramid pooling (ASPP) module and a decoder module are added to obtain a better hand segmentation map; a two-channel classification network is then constructed to fuse the features extracted from the hand segmentation map and the original gesture image, improving gesture recognition accuracy in complex backgrounds. Compared with other algorithms, the proposed method maintains better performance for gesture recognition in complex backgrounds, while the model is small and places low demands on hardware.
The invention fuses shallow detail features and deep semantic features through the encoder-decoder semantic segmentation network, correctly locating the hand region while segmenting hands with clear contours; a two-channel classification network extracts features from the hand segmentation map and the original gesture image separately and classifies the fused features, improving gesture recognition accuracy. By adding multi-scale context information to the encoder-decoder segmentation network, semantic segmentation performance is improved; by introducing depthwise separable convolutions into the segmentation network, the computational cost and the model's hardware requirements are greatly reduced and the whole gesture recognition network becomes more lightweight.
It should be noted that, in the embodiments of the present invention, the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial", "circumferential", etc. indicate the orientation or positional relationship shown in the drawings, and are only for convenience of describing the embodiments, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (10)

1. A gesture recognition method under a complex background, comprising:
using a semantic segmentation network based on an encoder-decoder structure to extract features from a gesture image dataset containing complex backgrounds and to output a hand segmentation map; and
extracting features from the hand segmentation map and the original gesture image dataset with a two-channel classification network to identify the gesture category.
2. The gesture recognition method under a complex background according to claim 1, wherein the gesture image dataset with complex backgrounds meets preset experimental requirements, the preset experimental requirements comprising: each image of the dataset has a corresponding ground-truth image, each group of images is performed by a different subject, and the images of the dataset are acquired in very challenging situations.
3. The gesture recognition method under a complex background according to claim 2, wherein the semantic segmentation network based on the encoder-decoder structure comprises: a 3 × 3 convolutional layer, four bottleneck residual modules, an atrous spatial pyramid pooling (ASPP) module, and a decoder module;
the 3 × 3 convolutional layer, the four bottleneck residual modules and the ASPP module are connected in sequence; and
the features output by the second bottleneck residual module are upsampled and fused with the output features of the ASPP module, and the fused features serve as the input of the decoder module.
4. The gesture recognition method under a complex background according to claim 3, wherein each bottleneck residual module comprises three bottleneck residual units connected in sequence;
the second and third bottleneck residual modules perform downsampling operations to capture semantic information;
the features output by the second bottleneck residual module are upsampled to obtain shallow detail features; and
the fourth bottleneck residual module applies atrous convolutions with different dilation rates to obtain more context information.
5. The gesture recognition method under a complex background according to claim 4, wherein each bottleneck residual unit comprises two 1 × 1 convolutional layers and a depthwise separable convolution structure;
the depthwise separable convolution structure comprises a channel-wise convolution (Depthwise Conv) and a point-wise convolution (1 × 1 Conv), each followed by a batch normalization operation and a ReLU activation function.
6. The gesture recognition method under a complex background according to claim 5, wherein the atrous spatial pyramid pooling (ASPP) module captures multi-scale semantic information through four parallel atrous convolutions and one global pooling operation, and the features extracted by each parallel branch are fused by a concatenation module to obtain deep semantic features.
7. The gesture recognition method under a complex background according to claim 6, wherein the decoder module fuses the shallow detail features and the deep semantic features, the fused features are refined through two convolutional layers, and finally a hand segmentation map with clear contours is output through an upsampling operation.
8. The gesture recognition method under a complex background according to claim 7, wherein the two-channel classification network comprises: two identical shallow convolutional neural networks, a concatenation layer and a classification layer;
the hand segmentation map output by the semantic segmentation network and the original gesture image are used as the inputs of the two identical shallow convolutional neural networks of the two-channel classification network, the shape features and color features of the hand are obtained through the two parallel shallow convolutional neural networks, the extracted features are fused by the concatenation layer and used as the input of the final classification layer, and the final gesture recognition is performed by the classification layer.
9. The gesture recognition method under a complex background according to claim 8, wherein the loss of the semantic segmentation network is calculated with the binary cross-entropy:
L_{seg} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log p_i + (1-y_i)\log(1-p_i) \,\right]
where N is the number of samples, and y_i and p_i denote the ground-truth label pixel values and the predicted probability map of the i-th picture, respectively; and
the loss of the two-channel classification network is calculated with the categorical cross-entropy:
L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\log p_{ik}
where N is the number of samples, K is the number of gesture classes, y_{ik} is the true probability that the i-th sample belongs to class k, and p_{ik} is the predicted probability that the i-th sample belongs to class k.
10. The gesture recognition method under a complex background according to claim 9, wherein the hand segmentation results are evaluated with preset evaluation criteria comprising the mean intersection-over-union (mIoU), the model size and the number of floating-point operations per second (FLOPS);
the mean intersection-over-union (mIoU) is defined as:
mIoU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}
where k + 1 is the number of categories in the image (here two: hand region and non-hand region) and p_{ij} is the number of pixels of class i predicted as class j;
the model size and the number of floating-point operations per second (FLOPS) are used to further evaluate the feasibility of the model;
the gesture recognition results are evaluated with preset evaluation criteria comprising Accuracy, the macro F1-score (Macro-F1), the model size and FLOPS;
Accuracy is defined as:
Accuracy = \frac{TP + TN}{Total}
where TP is the number of samples whose true label is positive and which are predicted as positive, TN is the number of samples whose true label is negative and which are predicted as negative, and Total is the number of all samples; and
the macro F1-score (Macro-F1) is defined as the average of the F1-scores of all gesture categories:
Macro\text{-}F1 = \frac{1}{C}\sum_{i=1}^{C} F1\text{-}Score_i
where C is the number of gesture categories and F1-Score_i is the F1-score of the i-th gesture category.
CN202110473809.6A 2021-04-29 2021-04-29 Gesture recognition method under complex background Active CN112966672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110473809.6A CN112966672B (en) 2021-04-29 2021-04-29 Gesture recognition method under complex background

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110473809.6A CN112966672B (en) 2021-04-29 2021-04-29 Gesture recognition method under complex background

Publications (2)

Publication Number Publication Date
CN112966672A true CN112966672A (en) 2021-06-15
CN112966672B CN112966672B (en) 2024-04-05

Family

ID=76281236

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110473809.6A Active CN112966672B (en) 2021-04-29 2021-04-29 Gesture recognition method under complex background

Country Status (1)

Country Link
CN (1) CN112966672B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298080A (en) * 2021-07-26 2021-08-24 城云科技(中国)有限公司 Target detection enhancement model, target detection method, target detection device and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214250A (en) * 2017-07-05 2019-01-15 中南大学 A kind of static gesture identification method based on multiple dimensioned convolutional neural networks
CN110781895A (en) * 2019-10-10 2020-02-11 湖北工业大学 Image semantic segmentation method based on convolutional neural network
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN112184635A (en) * 2020-09-10 2021-01-05 上海商汤智能科技有限公司 Target detection method, device, storage medium and equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214250A (en) * 2017-07-05 2019-01-15 中南大学 A kind of static gesture identification method based on multiple dimensioned convolutional neural networks
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN110781895A (en) * 2019-10-10 2020-02-11 湖北工业大学 Image semantic segmentation method based on convolutional neural network
CN112184635A (en) * 2020-09-10 2021-01-05 上海商汤智能科技有限公司 Target detection method, device, storage medium and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王金鹤; 苏翠丽; 孟凡云; 车志龙; 谭浩; 张楠: "Stereo Matching Network Based on Asymmetric Spatial Pyramid Pooling", Computer Engineering (计算机工程), no. 07 *
邢予权; 潘今一; 王伟; 刘建烽: "Gesture Recognition Based on Semantic Segmentation and Transfer Learning", Computer Measurement & Control (计算机测量与控制), no. 04 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298080A (en) * 2021-07-26 2021-08-24 城云科技(中国)有限公司 Target detection enhancement model, target detection method, target detection device and electronic device
CN113298080B (en) * 2021-07-26 2021-11-05 城云科技(中国)有限公司 Target detection enhancement model, target detection method, target detection device and electronic device

Also Published As

Publication number Publication date
CN112966672B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
WO2022227913A1 (en) Double-feature fusion semantic segmentation system and method based on internet of things perception
CN113221639B (en) Micro-expression recognition method for representative AU (AU) region extraction based on multi-task learning
CN111242288B (en) Multi-scale parallel deep neural network model construction method for lesion image segmentation
CN109492529A (en) A kind of Multi resolution feature extraction and the facial expression recognizing method of global characteristics fusion
Islalm et al. Recognition bangla sign language using convolutional neural network
CN111091130A (en) Real-time image semantic segmentation method and system based on lightweight convolutional neural network
CN107239733A (en) Continuous hand-written character recognizing method and system
CN108804397A (en) A method of the Chinese character style conversion based on a small amount of target font generates
CN105956560A (en) Vehicle model identification method based on pooling multi-scale depth convolution characteristics
CN111340814A (en) Multi-mode adaptive convolution-based RGB-D image semantic segmentation method
CN112163401B (en) Compression and excitation-based Chinese character font generation method of GAN network
CN111652273B (en) Deep learning-based RGB-D image classification method
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN110517270B (en) Indoor scene semantic segmentation method based on super-pixel depth network
CN110110724A (en) The text authentication code recognition methods of function drive capsule neural network is squeezed based on exponential type
CN115862045B (en) Case automatic identification method, system, equipment and storage medium based on image-text identification technology
CN108537109B (en) OpenPose-based monocular camera sign language identification method
CN115966010A (en) Expression recognition method based on attention and multi-scale feature fusion
CN113065426A (en) Gesture image feature fusion method based on channel perception
CN116502181A (en) Channel expansion and fusion-based cyclic capsule network multi-modal emotion recognition method
CN116129141A (en) Medical data processing method, apparatus, device, medium and computer program product
CN106203448A (en) A kind of scene classification method based on Nonlinear Scale Space Theory
CN112966672B (en) Gesture recognition method under complex background
CN113378938B (en) Edge transform graph neural network-based small sample image classification method and system
Zhang et al. A simple and effective static gesture recognition method based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant