CN112668543B - Isolated word sign language recognition method based on hand model perception - Google Patents

Isolated word sign language recognition method based on hand model perception

Info

Publication number
CN112668543B
Authority
CN
China
Prior art keywords
hand
sequence
model
joint point
sign language
Prior art date
Legal status: Active
Application number
CN202110016997.XA
Other languages
Chinese (zh)
Other versions
CN112668543A (en)
Inventor
李厚强 (Houqiang Li)
周文罡 (Wengang Zhou)
胡鹤臻 (Hezhen Hu)
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China (USTC)
Priority to CN202110016997.XA
Publication of CN112668543A
Application granted
Publication of CN112668543B


Abstract

The invention discloses an isolated word sign language recognition method based on hand model perception, which comprises the following steps: a hand sequence cropped from a sign language video is first converted by a visual encoder into a latent semantic representation containing the hand state; a hand model-aware decoder, working in a model-aware manner, then maps this latent semantic representation into a three-dimensional hand mesh and obtains the position of each hand joint point; finally, an inference module refines the result to obtain a spatio-temporal representation of each hand joint point, which is classified to recognize the word corresponding to the hand sequence. The method fuses model-driven and data-driven approaches, introduces a hand-shape prior, improves the recognition accuracy of the system, and can visualize the intermediate result (namely the three-dimensional hand mesh), enhancing the interpretability of the framework.

Description

Isolated word sign language recognition method based on hand model perception
Technical Field
The invention relates to the technical field of sign language recognition, in particular to an isolated word sign language recognition method based on hand model perception.
Background
According to 2020 data from the World Health Organization (WHO), about 466 million people worldwide have disabling hearing loss, accounting for about 5% of the global population. Among the hearing-impaired population, the most common communication medium is sign language. Sign language is a visual language with its own unique linguistic characteristics. It conveys semantic information mainly through manual features (hand shape, hand movement, position, etc.), assisted by fine-grained non-manual features (facial expression, lip shape, etc.).
Sign language recognition has been developed and widely studied to bridge the communication gap between hearing and deaf people. It converts an input sign language video into the corresponding text by computer algorithms. Isolated word sign language recognition is a basic task in this field: an input sign language video is recognized as the word it corresponds to. The general recognition process is to first extract a representation from the input sign language video, then transform the representation into a probability vector, and take the category with the maximum probability as the recognition result.
The hand plays a dominant role in sign language expression, yet it occupies only a small spatial area and is highly articulated. Compared with the body and face, the hand has a more uniform appearance and fewer locally discriminative features. In sign language video, the hand often suffers from motion blur and self-occlusion, and the background is complex.
Early work often employed manually designed features to describe gestures. With the development of deep learning and hardware computing power in recent years, sign language recognition systems based on deep learning have gradually become dominant. A typical pipeline extracts a representation with a convolutional neural network (CNN), converts it into a probability vector through a fully connected layer and a Softmax layer, and takes the category with the maximum probability as the recognition result. More recently, some work has cropped the hand out as an additional auxiliary branch and achieved some performance gains. These deep-learning-based methods all follow a data-driven paradigm in which features are learned under the supervision of video category labels. However, purely data-driven sign language recognition has the following problems: limited interpretability, and a tendency to overfit with limited training data. Since labeling sign language data requires professional knowledge, existing sign language datasets have far fewer samples per category than action recognition datasets, so the recognition accuracy of existing schemes still needs to be improved.
Disclosure of Invention
The invention aims to provide an isolated word sign language recognition method based on hand model perception, which can improve the recognition accuracy of a system and enhance the interpretability of a recognition framework.
The purpose of the invention is realized by the following technical scheme:
a sign language recognition method for isolated words based on hand model perception comprises the following steps:
for a hand sequence cropped from a sign language video, converting the hand sequence into a latent semantic representation containing the hand state through a visual encoder; then, working in a model-aware manner through a hand model-aware decoder, mapping the latent semantic representation containing the hand state into a three-dimensional hand mesh and obtaining the position of each hand joint point; and finally, refining the result through an inference module to obtain the spatio-temporal representation of each hand joint point, and classifying it to recognize the word corresponding to the hand sequence.
It can be seen from the technical scheme provided by the invention that the method fuses model-driven and data-driven approaches, introduces a hand-shape prior to improve the recognition accuracy of the system, and can visualize the intermediate result (namely the three-dimensional hand mesh), enhancing the interpretability of the framework.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a frame diagram of an isolated word sign language recognition method based on hand model perception according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
To address the technical problems in the prior art, an embodiment of the present invention provides an isolated word sign language recognition method based on hand model perception, which fuses model-driven and data-driven approaches, introduces a hand-shape prior, improves the recognition accuracy of the system and enhances its interpretability. Fig. 1 shows a framework diagram of the method. The main recognition process is as follows: for a hand sequence cropped from a sign language video, the hand sequence is converted into a latent semantic representation containing the hand state through a visual encoder; then, a hand model-aware decoder, working in a model-aware manner, maps the latent semantic representation containing the hand state into a three-dimensional hand mesh and obtains the position of each hand joint point; finally, an inference module refines the result to obtain the spatio-temporal representation of each hand joint point, which is classified to recognize the word corresponding to the hand sequence.
For ease of understanding, the various parts of the recognition framework and the corresponding training and testing processes are described in detail below in conjunction with the framework diagram shown in fig. 1.
First, the framework structure.
1. Visual Encoder (Visual Encoder).
In the embodiment of the invention, the input of the visual encoder is a hand sequence of T frames cropped from the sign language video, denoted V′ = {v_t, t = 1, …, T}.
The visual encoder converts the hand sequence V′ into a latent semantic representation:
(θ_t, β_t, c_t) = E(v_t), t = 1, …, T
where E(·) denotes the visual encoder; v_t is the hand image at time t and T is the length of the hand sequence; θ and β describe the hand state and are the representations of hand pose and hand shape, respectively; c_r, c_o and c_s are the camera parameters c indicating rotation, translation and scale, respectively.
In the embodiment of the invention, the hand sequence V′ is an RGB hand video sequence; it can be cropped from the sign language video in a conventional manner, and the datasets used in both the training and testing stages consist of hand sequences cropped from sign language videos.
Illustratively, the visual encoder may be implemented by appending a fully-connected layer to the end of a ResNet.
Illustratively, the latent semantic representation of each frame is a real-valued vector (θ_t, β_t, c_t) ∈ ℝ^d, where ℝ denotes the set of real numbers and d is the total dimensionality of the pose, shape and camera parameters.
2. Hand Model-aware Decoder (Model-aware Decoder).
The hand model-aware decoder maps the latent semantic feature vector to a compact pose representation in a model-aware manner. Through the hand prior encoded in advance, it constrains the distribution of possible gestures and implicitly filters out implausible poses during the mapping. As a result, it produces more compact and more reliable hand poses and reduces the optimization difficulty for the downstream inference module.
In the embodiment of the present invention, the hand model-aware decoder is a statistical module; for example, the differentiable MANO hand model can be used as the hand model-aware decoder.
The hand model-aware decoder is learned in advance from a large number of high-quality hand scans, yielding a hand template T̄; in this way, the hand prior is encoded. At the same time, it establishes a compact mapping to describe the hand, i.e. from a low-dimensional semantic vector (the latent semantic feature vector) to a high-dimensional triangular hand mesh (containing 778 vertices and 1,538 faces).
The mapping process of the hand model-aware decoder is expressed as:
M(β, θ) = W(T(β, θ), J(β), θ, W′)
T(β, θ) = T̄ + B_S(β) + B_P(θ)
where T(β, θ) is the corrected hand template obtained by adding to the pre-learned template T̄ the offsets produced by the blend functions B_S(·) and B_P(·) from the shape representation β and the pose representation θ; W′ is the blend weight; J(β) gives the positions of the hand joints as a function of the hand shape, provided by the hand model-aware decoder; W(·) is the skeletal skinning animation (linear blend skinning) function; and M(β, θ) is the three-dimensional hand mesh (3D Mesh).
Meanwhile, the more compact three-dimensional hand joint (3D Joint) positions can be obtained by linear interpolation of the relevant mesh vertices. Since the MANO hand model provides only 16 hand joint points, 5 fingertip points are additionally extracted from the three-dimensional hand mesh, giving 21 hand joint points in total.
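For illustration, the sketch below walks through the two steps just described: correcting the learned template with shape and pose blend offsets, T(β, θ) = T̄ + B_S(β) + B_P(θ), and appending 5 fingertip vertices to the 16 MANO joints to obtain 21 joint points. It is a toy sketch: the blend bases, joint regressor and fingertip vertex indices are random or hypothetical stand-ins that a real system would load from a pre-trained MANO model, and the skinning step W(·) is omitted.

```python
import torch

V, F, J = 778, 1538, 16                # MANO mesh: 778 vertices, 1538 faces, 16 joints
POSE_D, SHAPE_D = 48, 10               # assumed parameter sizes

T_bar   = torch.randn(V, 3)            # learned hand template (placeholder values)
S_basis = torch.randn(SHAPE_D, V, 3)   # shape blend basis B_S (placeholder)
P_basis = torch.randn(POSE_D, V, 3)    # pose blend basis B_P (placeholder)
J_reg   = torch.rand(J, V)             # joint regressor giving J(beta) from the mesh
TIP_IDS = [745, 317, 444, 556, 673]    # fingertip vertex indices (hypothetical)

def decode(beta, theta):
    # corrected template: T(beta, theta) = T_bar + B_S(beta) + B_P(theta)
    t_corr = T_bar + torch.einsum('s,svk->vk', beta, S_basis) \
                   + torch.einsum('p,pvk->vk', theta, P_basis)
    # (the skinning function W(T, J, theta, W') would deform t_corr here)
    mesh = t_corr                                          # stand-in for M(beta, theta)
    joints16 = J_reg @ mesh                                # (16, 3) regressed joints
    joints21 = torch.cat([joints16, mesh[TIP_IDS]], dim=0) # add 5 fingertips -> 21 joints
    return mesh, joints21

mesh, joints = decode(torch.zeros(SHAPE_D), torch.zeros(POSE_D))
print(mesh.shape, joints.shape)        # torch.Size([778, 3]) torch.Size([21, 3])
```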
The hand model-aware decoder can also expose its intermediate result, i.e. the reconstructed three-dimensional hand mesh, which enhances the interpretability of the framework.
3. Inference Module (Inference Module).
The three-dimensional pose sequence predicted by the hand model-aware decoder (consisting of the three-dimensional hand joint positions in the T hand images) may still contain unsatisfactory results. The inference module is used to further optimize the spatio-temporal representation of the hand pose. Through adaptive attention computation, the inference module captures the most critical cues and performs video-level classification.
A hand pose sequence is structured data that exhibits natural physical connections between joints, which allows it to be naturally organized as a spatio-temporal graph. In the embodiment of the invention, a graph convolutional network (GCN), which has proven to process graph-structured data efficiently, is used, followed by a classification output layer that performs video-level classification.
Let the hand joint position sequence output by the hand model-aware decoder be J_3D. The corresponding undirected spatio-temporal graph G(V, E) is defined by a point set V and an edge set E: the point set V contains all hand joint positions, and the edge set E contains intra-frame and inter-frame connections, namely the physical connections between hand joints and the connections of the same joint along time. The adjacency matrix A obtained from the edge set E and the identity matrix I are used in the graph convolutional layers, and the graph convolution is expressed as:
f_out = Σ_k W_k f_in (D_k^(−1/2) T_k D_k^(−1/2)), with T_k = A_k ⊙ M and D_k^(mm) = Σ_n T_k^(mn)
where k indexes the group to which a neighborhood node belongs and W_k is the weight of the corresponding convolution kernel; the matrix A + I is decomposed into k sub-matrices, i.e.
A + I = Σ_k A_k
and each sub-matrix A_k represents one of the decomposed connection sets; T_k is an intermediate variable used to compute the matrix D, M is an attention weight, the matrix D is used for normalization, m and n are the row and column indices of the matrix D, and ⊙ is the Hadamard product. By propagating information about the hand joints along the edges, a spatio-temporal representation of each hand joint point is obtained (containing not only position information but also certain semantic information). The Hadamard product with the learnable attention weight M, initialized to all ones, is applied to A_k to help the network capture discriminative cues.
In the embodiment of the invention, after several stacked graph convolutional layers, the classification output layer performs the classification, thereby recognizing the word corresponding to the hand sequence.
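For illustration, one such spatial graph-convolution layer can be sketched as below: each partition sub-matrix A_k is modulated by a learnable attention mask M (initialized to all ones) via the Hadamard product, symmetrically normalized, and used to aggregate joint features with a kernel weight W_k. Temporal edges would be handled by a temporal convolution stacked after this layer; the partition strategy, feature sizes and class name are assumptions.

```python
import torch
import torch.nn as nn

class HandGraphConv(nn.Module):
    """One spatial graph-convolution layer over the N hand joints."""
    def __init__(self, in_ch, out_ch, A_parts):            # A_parts: (K, N, N) sub-matrices A_k
        super().__init__()
        self.register_buffer('A', A_parts.float())
        self.M = nn.Parameter(torch.ones_like(A_parts, dtype=torch.float32))  # attention, init to all ones
        self.convs = nn.ModuleList([nn.Linear(in_ch, out_ch) for _ in range(A_parts.size(0))])

    def forward(self, x):                                   # x: (T, N, in_ch) per-frame joint features
        out = 0
        for k, conv in enumerate(self.convs):
            Tk = self.A[k] * self.M[k]                      # Hadamard product A_k ⊙ M
            deg = Tk.sum(dim=-1).clamp(min=1e-6)            # row sums -> diagonal normalization matrix D
            Dn = torch.diag(deg.pow(-0.5))
            out = out + (Dn @ Tk @ Dn) @ conv(x)            # normalized aggregation with kernel weight W_k
        return torch.relu(out)

# usage: 21 joints, a toy single-partition adjacency with self-loops only
N = 21
layer = HandGraphConv(3, 64, torch.eye(N).unsqueeze(0))
feats = layer(torch.randn(16, N, 3))                        # (T=16, 21, 64)
```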
Second, model training.
In the embodiment of the invention, the visual encoder, the hand model-aware decoder and the inference module together form the recognition model. Since sign language datasets provide no hand pose annotations, in the training stage (Training Stage), in addition to the cross-entropy classification loss L_cls (Classification Loss), corresponding loss functions (weakly supervised losses based on the spatial and temporal relations of the intermediate hand poses) are designed for the output of each stage to guide the learning of the intermediate pose representation. In the training stage, the overall loss function of the recognition model is expressed as:
L = L_cls + λ_spa·L_spa + λ_tem·L_tem + λ_reg·L_reg
where L_cls is the cross-entropy classification loss of the inference module; L_spa and L_tem are the spatial and temporal consistency losses on the hand joint positions obtained by the hand model-aware decoder; L_reg is the regularization loss on the hand state in the latent semantic representation obtained by the visual encoder; and λ_spa, λ_tem and λ_reg are the weighting factors of the corresponding losses.
Based on this total loss function, the parameters of the recognition model can be trained in a conventional manner.
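For illustration, the combination of the four terms can be written as the sketch below; the weighting factor values shown are placeholders, since the embodiment does not fix them here.

```python
# placeholder weighting factors; actual values are hyper-parameters to be tuned
LAMBDA_SPA, LAMBDA_TEM, LAMBDA_REG = 1.0, 1.0, 0.1

def total_loss(l_cls, l_spa, l_tem, l_reg):
    """Overall training objective: classification loss plus weighted weakly supervised terms."""
    return l_cls + LAMBDA_SPA * l_spa + LAMBDA_TEM * l_tem + LAMBDA_REG * l_reg
```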
1. Regularization Loss (Regularization Loss).
To ensure that the hand model works properly and generates plausible hand meshes, a regularization loss is used to constrain the magnitudes of some of the latent features. The regularization loss L_reg is expressed as:
L_reg = ||θ||₂² + w_β·||β||₂²
where w_β is a weighting factor, and θ and β represent the hand state, i.e. the representations of hand pose and shape respectively.
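For illustration, a sketch of this term under the squared-norm form given above, averaged over the frames of the sequence; the value of w_β and the averaging are assumptions.

```python
import torch

def regularization_loss(theta, beta, w_beta=0.1):
    # theta: (T, pose_dim), beta: (T, shape_dim) predicted per-frame hand states
    return (theta ** 2).sum(dim=-1).mean() + w_beta * (beta ** 2).sum(dim=-1).mean()
```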
2. Spatial Consistency Loss (Spatial Consistency Loss).
In the embodiment of the invention, based on a weak-perspective camera model, the camera parameters output by the visual encoder are used to map the three-dimensional pose sequence predicted by the hand model-aware decoder to the two-dimensional space. The mapping process is expressed as:
Ĵ_2D = c_s·Π(R(c_r)·J_3D) + c_o
where Π(·) denotes the orthogonal projection, R(c_r) is the rotation determined by the camera rotation parameter c_r, and Ĵ_2D is the position sequence obtained by mapping the hand joint position sequence J_3D output by the hand model-aware decoder to the two-dimensional space using the camera parameters.
Meanwhile, a two-dimensional hand joint position sequence J_2D (2D Joints) is extracted in advance from the hand sequence by a two-dimensional hand pose detector (2D Hand Pose Detector) and used as a pseudo label, constraining the mapped result Ĵ_2D to stay consistent with it.
The spatial consistency loss L_spa is expressed as:
L_spa = (1 / (N·T)) Σ_{t=1..T} Σ_{j=1..N} 1[c(t, j) ≥ ε]·||Ĵ_2D(t, j) − J_2D(t, j)||₂²
where N is the total number of hand joint points (e.g., N = 21); T is the length of the hand sequence; (t, j) denotes the j-th hand joint point at time t; c(t, j) is the confidence of the j-th hand joint position at time t extracted in advance, and the corresponding term participates in the spatial consistency loss L_spa only if c(t, j) is greater than or equal to the threshold ε, otherwise it is not counted; and 1[·] denotes the indicator function.
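For illustration, the sketch below follows the two steps above: a weak-perspective projection of the predicted three-dimensional joints using the encoder's camera output, then a confidence-masked distance to the detector's two-dimensional joints. Keeping only scale and translation in the projection (treating rotation as absorbed in the predicted pose) and normalizing by the number of confident joints are assumptions for illustration.

```python
import torch

def spatial_consistency_loss(j3d, j2d, conf, cam_s, cam_o, eps=0.5):
    # j3d: (T, N, 3) predicted joints; j2d: (T, N, 2) detector pseudo labels
    # conf: (T, N) detector confidences; cam_s: (T, 1) scale; cam_o: (T, 2) translation
    proj = cam_s.unsqueeze(-1) * j3d[..., :2] + cam_o.unsqueeze(1)   # weak-perspective projection
    mask = (conf >= eps).float()                                     # indicator 1[c(t, j) >= eps]
    err = ((proj - j2d) ** 2).sum(dim=-1)                            # squared 2-D distance per joint
    return (mask * err).sum() / mask.sum().clamp(min=1.0)
```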
3. Temporal Consistency Loss (Temporal Consistency Loss).
To avoid jittery predictions, the temporal consistency of the predicted three-dimensional joint points is further constrained. During signing, different hand joints usually move at different speeds: joints closer to the palm usually move more slowly. The hand joint points are therefore divided into three groups {S_i, i = 0, 1, 2}, corresponding to the palm, middle and terminal joint sets, respectively.
The temporal consistency loss L_tem is expressed as:
L_tem = Σ_{i=0..2} α_i Σ_{j∈S_i} Σ_{t=1..T−1} ||J_3D(t+1, j) − J_3D(t, j)||₂²
where J_3D is the hand joint position sequence output by the hand model-aware decoder, (t, j) denotes the j-th hand joint point at time t, S_i is a set of hand joint points, and α_i is the penalty weight predefined for the set S_i; sets with slower motion are given larger penalty weights.
Third, testing.
The main flow of the testing stage (Testing Stage) is the same as that of the training stage; the main difference is that the testing stage needs neither the camera parameters nor the computation of the various losses. The main flow of the testing stage is as follows: the cropped hand video sequence is input; the latent semantic representation of the hand state is obtained through the visual encoder; the corresponding three-dimensional hand mesh is obtained through the hand model-aware decoder; and finally the inference module refines the result to obtain the spatio-temporal representation of each hand joint point, on which video-level classification is performed to output the corresponding word.
As shown on the right side of fig. 1, for the input hand images the classification output layer of the inference module produces probabilities for the different words, and the category with the maximum probability is selected.
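For illustration, the test-time flow can be tied together as in the sketch below, reusing the illustrative components from the previous sketches (VisualEncoder, the decode() stand-in, HandGraphConv); the pooling strategy, classifier head and vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

def recognize(frames, encoder, decode_fn, gcn, classifier):
    # frames: (T, 3, H, W) cropped hand images from one sign video
    theta, beta, cam = encoder(frames)                        # latent semantic representation
    joints = torch.stack([decode_fn(b, t)[1]                  # (T, 21, 3) 3-D joints per frame
                          for b, t in zip(beta, theta)])
    feats = gcn(joints)                                       # (T, 21, C) spatio-temporal features
    logits = classifier(feats.mean(dim=(0, 1)))               # pool over time and joints
    return logits.softmax(dim=-1).argmax().item()             # index of the recognized word

# usage with the sketches above and an assumed 100-word vocabulary
enc, gcn = VisualEncoder(), HandGraphConv(3, 64, torch.eye(21).unsqueeze(0))
cls_head = nn.Linear(64, 100)
word_id = recognize(torch.randn(16, 3, 224, 224), enc, decode, gcn, cls_head)
```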
According to the scheme of the embodiment of the invention, model-driven and data-driven approaches can be fused, a hand-shape prior is introduced, the recognition accuracy of the system is improved, the intermediate result can be visualized, and the interpretability of the framework is enhanced.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to perform all or part of the above described functions.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (7)

1. An isolated word sign language recognition method based on hand model perception, characterized by comprising the following steps:
for a hand sequence cropped from a sign language video, converting the hand sequence into a latent semantic representation containing the hand state through a visual encoder; then, working in a model-aware manner through a hand model-aware decoder, mapping the latent semantic representation containing the hand state into a three-dimensional hand mesh and obtaining the position of each hand joint point; and finally, refining through an inference module to obtain the spatio-temporal representation of each hand joint point, and classifying to recognize the word corresponding to the hand sequence;
the visual encoder, the hand model-aware decoder and the inference module together form a recognition model, and in the training stage the overall loss function of the recognition model is expressed as:
L = L_cls + λ_spa·L_spa + λ_tem·L_tem + λ_reg·L_reg
where L_cls is the cross-entropy classification loss of the inference module; L_spa and L_tem are the spatial and temporal consistency losses on the hand joint positions obtained by the hand model-aware decoder; L_reg is the regularization loss on the hand state in the latent semantic representation obtained by the visual encoder; and λ_spa, λ_tem and λ_reg are the weighting factors of the corresponding losses.
2. The isolated word sign language recognition method based on hand model perception according to claim 1, wherein the input of the visual encoder is a hand sequence of T frames cropped from the sign language video, denoted V′ = {v_t, t = 1, …, T};
the visual encoder converts the hand sequence V′ into the latent semantic representation:
(θ_t, β_t, c_t) = E(v_t), t = 1, …, T
where E(·) denotes the visual encoder; v_t is the hand image at time t and T is the length of the hand sequence; θ and β describe the hand state and are the representations of hand pose and hand shape, respectively; c_r, c_o and c_s are the camera parameters indicating rotation, translation and scale, respectively.
3. The isolated word sign language recognition method based on hand model perception according to claim 1, wherein
the hand model-aware decoder is a statistical module learned in advance from hand scan data, and its mapping process is expressed as:
M(β, θ) = W(T(β, θ), J(β), θ, W′)
T(β, θ) = T̄ + B_S(β) + B_P(θ)
where T(β, θ) is the corrected hand template obtained by adding to the pre-learned hand template T̄ the offsets produced by the blend functions B_S(·) and B_P(·) from the pose and shape representations θ and β, which represent the hand state; W′ is the blend weight; W(·) is the skeletal skinning animation function; M(β, θ) is the three-dimensional hand mesh; J(β) is the representation of the hand shape comprising a plurality of hand joints provided by the hand model-aware decoder;
and the hand joint positions are obtained from the three-dimensional hand mesh M(β, θ), the hand joint points comprising a plurality of hand joints and 5 fingertip points.
4. The isolated word sign language recognition method based on hand model perception according to claim 1, wherein the inference module comprises graph convolutional neural network layers and a classification output layer;
recording the position sequence of hand joint points output by the hand model sensing decoder as
J_3D; the corresponding undirected spatio-temporal graph G(V, E) is defined by a point set V and an edge set E, wherein the point set V contains all hand joint positions, and the edge set E contains intra-frame and inter-frame connections, namely the physical connections between the hand joints and the connections of the same joint along time; the adjacency matrix A obtained from the edge set E and the identity matrix I are used in the graph convolutional neural network layers, and the graph convolution process is expressed as:
f_out = Σ_k W_k f_in (D_k^(−1/2) T_k D_k^(−1/2)), with T_k = A_k ⊙ M and D_k^(mm) = Σ_n T_k^(mn)
where k is the group to which a neighborhood node belongs and W_k is the weight of the convolution kernel; the matrix A + I is decomposed into k sub-matrices, i.e.
A + I = Σ_k A_k
each sub-matrix A_k representing one of the decomposed connection sets; T_k is an intermediate variable used to compute the matrix D, M is a weight, the matrix D is used for normalization, m and n are the row and column indices of the matrix D, and ⊙ is the Hadamard product;
the information of the hand joint points is transmitted along the edges, so that the spatio-temporal representation of each hand joint point is obtained;
after a plurality of stacked graph convolutional neural network layers, the classification output layer performs classification, thereby recognizing the word corresponding to the hand sequence.
5. The isolated word sign language recognition method based on hand model perception according to claim 1, wherein the regularization loss L_reg is expressed as:
L_reg = ||θ||₂² + w_β·||β||₂²
where w_β is a weighting factor, and θ and β represent the hand state, being the representations of hand pose and shape respectively.
6. The isolated word sign language recognition method based on hand model perception according to claim 1, wherein the spatial consistency loss L_spa is expressed as:
L_spa = (1 / (N·T)) Σ_{t=1..T} Σ_{j=1..N} 1[c(t, j) ≥ ε]·||Ĵ_2D(t, j) − J_2D(t, j)||₂²
where N is the total number of hand joint points; T is the length of the hand sequence; Ĵ_2D is the position sequence obtained by mapping the hand joint position sequence J_3D output by the hand model-aware decoder to the two-dimensional space using the camera parameters; J_2D is the two-dimensional hand joint position sequence extracted in advance from the hand sequence and used as a pseudo label; (t, j) denotes the j-th hand joint point at time t; c(t, j) is the confidence of the j-th hand joint position at time t extracted in advance, and the corresponding term participates in the computation of the spatial consistency loss L_spa only if c(t, j) is greater than or equal to the threshold ε; 1[·] denotes the indicator function;
the process of mapping the hand joint positions J_3D output by the hand model-aware decoder to the two-dimensional space using the camera parameters is expressed as:
Ĵ_2D = c_s·Π(R(c_r)·J_3D) + c_o
where Π(·) denotes the orthogonal projection, R(c_r) is the rotation determined by c_r, and c_r, c_o and c_s are the camera parameters indicating rotation, translation and scale, respectively.
7. The isolated word sign language recognition method based on hand model perception according to claim 1, wherein the temporal consistency loss L_tem is expressed as:
L_tem = Σ_{i=0..2} α_i Σ_{j∈S_i} Σ_{t=1..T−1} ||J_3D(t+1, j) − J_3D(t, j)||₂²
where J_3D is the hand joint position sequence output by the hand model-aware decoder; (t, j) denotes the j-th hand joint point at time t; S_i is a set of hand joint points, {S_i, i = 0, 1, 2} corresponding to the palm, middle and terminal joint sets respectively; and α_i is the penalty weight predefined for the set S_i.
CN202110016997.XA 2021-01-07 2021-01-07 Isolated word sign language recognition method based on hand model perception Active CN112668543B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110016997.XA CN112668543B (en) 2021-01-07 2021-01-07 Isolated word sign language recognition method based on hand model perception

Publications (2)

Publication Number Publication Date
CN112668543A (en) 2021-04-16
CN112668543B (en) 2022-07-15

Family

ID=75413421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110016997.XA Active CN112668543B (en) 2021-01-07 2021-01-07 Isolated word sign language recognition method based on hand model perception

Country Status (1)

Country Link
CN (1) CN112668543B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239834B (en) * 2021-05-20 2022-07-15 中国科学技术大学 Sign language recognition system capable of pre-training sign model perception representation
CN113239835B (en) * 2021-05-20 2022-07-15 中国科学技术大学 Model-aware gesture migration method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7418390B1 (en) * 2000-11-20 2008-08-26 Yahoo! Inc. Multi-language system for online communications
CN111145865A (en) * 2019-12-26 2020-05-12 中国科学院合肥物质科学研究院 Vision-based hand fine motion training guidance system and method
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network
CN111832468A (en) * 2020-07-09 2020-10-27 平安科技(深圳)有限公司 Gesture recognition method and device based on biological recognition, computer equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A novel chipless RFID-based stretchable and wearable hand gesture sensor; Taoran Le et al.; 2015 European Microwave Conference (EuMC); 2015-12-03; pp. 371-374 *
Improvement and implementation of a Kinect-based dynamic gesture recognition algorithm; Li Guoyou et al.; High Technology Letters; 2019-09-30; Vol. 29, No. 9; pp. 841-851 *

Also Published As

Publication number Publication date
CN112668543A (en) 2021-04-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant