CN112668543B - Isolated word sign language recognition method based on hand model perception - Google Patents
Isolated word sign language recognition method based on hand model perception

- Publication number: CN112668543B (application CN202110016997.XA)
- Authority: CN (China)
- Prior art keywords: hand, sequence, model, joint point, sign language
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses an isolated word sign language recognition method based on hand model perception, which comprises the following steps: a hand sequence cropped from a sign language video is converted by a visual encoder into a latent semantic representation containing the hand state; a hand model-aware decoder, working in a model-aware manner, then maps this latent semantic representation to a three-dimensional hand mesh and obtains the position of each hand joint point; finally, an inference module refines the result to obtain a spatio-temporal representation of each hand joint point, which is classified to recognize the vocabulary word corresponding to the hand sequence. The method fuses model-driven and data-driven approaches, introduces a hand-shape prior to improve recognition accuracy, and can visualize the intermediate result (namely the three-dimensional hand mesh), enhancing the interpretability of the framework.
Description
Technical Field
The invention relates to the technical field of sign language recognition, in particular to an isolated word sign language recognition method based on hand model perception.
Background
According to 2020 data from the World Health Organization (WHO), about 466 million people worldwide have disabling hearing loss, accounting for roughly 5% of the global population. Within the hearing-impaired population, the most common communication medium is sign language. Sign language is a visual language with its own unique linguistic characteristics: semantic information is conveyed mainly through manual features (hand shape, hand movement, position, etc.), assisted by fine-grained non-manual features (facial expression, lip shape, etc.).
Sign language recognition was developed, and has been widely studied, to bridge the communication gap between hearing and deaf people. It converts an input sign language video into the corresponding text by computer algorithms. Isolated word sign language recognition is a basic task in this field: an input sign language video is recognized as the single word it corresponds to. The general recognition process is to first extract a representation from the input video, then transform that representation into a probability vector, and take the category with the maximum probability as the recognition result.
The hand plays a dominant role in sign language expression, yet it occupies only a small spatial region and is highly articulated. Compared with the body and face, the hand has a more uniform appearance and fewer locally discriminable features. In sign language video, the hand often exhibits motion blur and self-occlusion, against complex backgrounds.
Early work often employed manually designed features to describe gestures. With the development of deep learning and hardware computing power in recent years, sign language recognition systems based on deep learning have gradually become dominant. These methods extract a representation with a convolutional neural network (CNN), convert it into a probability vector through a fully-connected layer and a Softmax layer, and take the category with the maximum probability as the recognition result. Recently, some work has cropped the hand out as an additional auxiliary branch and achieved some performance gains. All of these deep-learning-based methods follow a data-driven paradigm in which features are learned under the supervision of video category labels alone. However, purely data-driven sign language recognition has the following problems: limited interpretability, and a tendency to overfit on limited training data. Since labeling sign language data requires professional knowledge, existing sign language datasets have far fewer samples per category than action recognition datasets, so the recognition accuracy of existing schemes still needs improvement.
Disclosure of Invention
The invention aims to provide an isolated word sign language recognition method based on hand model perception, which can improve the recognition accuracy of a system and enhance the interpretability of a recognition framework.
The purpose of the invention is realized by the following technical scheme:
a sign language recognition method for isolated words based on hand model perception comprises the following steps:
for a hand sequence cropped from a sign language video, converting the hand sequence into a latent semantic representation containing the hand state through a visual encoder; then, through a hand model-aware decoder working in a model-aware manner, mapping the latent semantic representation into a three-dimensional hand mesh and obtaining the position of each hand joint point; and finally, refining through an inference module to obtain the spatio-temporal representation of each hand joint point, and classifying to recognize the vocabulary word corresponding to the hand sequence.
As can be seen from the technical scheme provided by the invention, the method fuses model-driven and data-driven approaches, introduces a hand-shape prior to improve the recognition accuracy of the system, and can visualize the intermediate result (namely the three-dimensional hand mesh), enhancing the interpretability of the framework.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a frame diagram of an isolated word sign language recognition method based on hand model perception according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Aiming at the technical problems in the prior art, the embodiment of the invention provides an isolated word sign language recognition method based on hand model perception, which fuses model-driven and data-driven approaches, introduces a hand-model prior, improves the recognition accuracy of the system and enhances its interpretability. Fig. 1 shows a frame diagram of the method. The main recognition process is as follows: a hand sequence cropped from a sign language video is converted by a visual encoder into a latent semantic representation containing the hand state; a hand model-aware decoder, working in a model-aware manner, then maps this representation to a three-dimensional hand mesh and obtains the position of each hand joint point; finally, an inference module refines the result to obtain a spatio-temporal representation of each hand joint point, which is classified to recognize the vocabulary word corresponding to the hand sequence.
For ease of understanding, the various parts of the recognition framework and the corresponding training and testing processes are described in detail below in conjunction with the framework diagram shown in fig. 1.
Firstly, a frame structure.
1. Visual Encoder (Visual Encoder).
In the embodiment of the invention, the input to the visual encoder is a hand sequence V' = {v_t}_{t=1}^T containing T frames cropped from the sign language video. The visual encoder converts the hand sequence V' into a latent semantic representation:

    (θ, β, c_r, c_o, c_s) = E(V')

where E(·) denotes the visual encoder; v_t denotes the hand image at time t, and T is the length of the hand sequence; θ and β represent the hand state, being the representations of hand pose and hand shape respectively; c_r, c_o and c_s are camera parameters c indicating rotation, translation and scaling respectively.
In the embodiment of the invention, the hand sequence V' is an RGB hand sequence; cropping it from the sign language video can be done in a conventional manner, and the datasets used in the training and testing stages are hand sequences cropped from sign language videos.
Illustratively, the visual encoder may be implemented by appending a fully-connected layer to the end of a ResNet.
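As a hedged sketch (not part of the patent text; all layer sizes and parameter dimensions below are illustrative assumptions), the encoder's final fully-connected head could split a per-frame backbone feature into pose, shape and camera components:

```python
import numpy as np

# Assumed sizes: a MANO-style pose vector, a 10-dim shape vector, and
# weak-perspective camera parameters. The patent does not fix these numbers.
POSE_DIM, SHAPE_DIM, CAM_DIM = 48, 10, 4
FEAT_DIM = 512  # e.g. a ResNet-18/34 global feature

rng = np.random.default_rng(0)
W = rng.standard_normal((FEAT_DIM, POSE_DIM + SHAPE_DIM + CAM_DIM)) * 0.01
b = np.zeros(POSE_DIM + SHAPE_DIM + CAM_DIM)

def encode_frame(feat):
    """Map one frame's backbone feature to (theta, beta, cam)."""
    out = feat @ W + b
    theta = out[:POSE_DIM]                          # hand pose representation
    beta = out[POSE_DIM:POSE_DIM + SHAPE_DIM]       # hand shape representation
    cam = out[POSE_DIM + SHAPE_DIM:]                # camera parameters c
    return theta, beta, cam

feats = rng.standard_normal((16, FEAT_DIM))         # a T = 16 frame clip
states = [encode_frame(f) for f in feats]
```

In a real system the per-frame feature would come from the CNN backbone, and the head would be trained end-to-end with the rest of the model.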
2. Hand Model-aware Decoder (Model-aware Decoder).
The hand model-aware decoder maps the latent semantic feature vector to a compact pose representation in a model-aware manner. It constrains the distribution of possible gestures through a pre-encoded hand prior, implicitly filtering out implausible poses during the mapping. In this way it generates more concise and highly reliable hand poses, reducing the optimization difficulty for the downstream inference module.
In the embodiment of the present invention, the hand model aware decoder is a statistical module, and for example, a differentiable MANO hand model can be used as the hand model aware decoder.
The hand model-aware decoder is learned in advance from a large number of high-quality hand scans, from which a hand template T̄ is obtained; in this way, the hand prior is encoded. At the same time, a compact mapping is established to describe the hand, namely from a low-dimensional semantic vector (the latent semantic feature vector) to a high-dimensional triangular hand mesh (containing 778 vertices and 1,538 faces).
The mapping process of the hand model aware decoder is represented as:
M(β,θ)=W(T(β,θ),J(β),θ,W′)
where T(β, θ) is the corrected hand template obtained by applying the blend functions B_S(·) and B_P(·) to the learned template T̄ according to the hand pose and shape representations θ and β; W′ is the blend weights; J(β) is the representation of the hand shape comprising a plurality of hand joints, provided by the hand model-aware decoder; W(·) is the skeletal skinning animation (linear blend skinning) function; and M(β, θ) is the resulting three-dimensional hand Mesh (3D Mesh).
Meanwhile, the more compact three-dimensional hand Joint point (3D Joint) positions can be obtained by linear interpolation of the relevant mesh vertices. Since the MANO hand model provides only 16 hand joint points, the 5 fingertips are additionally extracted from the three-dimensional hand mesh, giving 21 hand joint points in total.
The hand model-aware decoder can also expose its intermediate result, i.e. the reconstructed three-dimensional hand mesh, enhancing the interpretability of the framework.
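To make the joint extraction concrete, here is a minimal sketch of obtaining 21 joints from a 778-vertex mesh by linear interpolation plus fingertip selection. The regressor weights and fingertip vertex indices below are placeholders, not MANO's real values:

```python
import numpy as np

NUM_VERTS, NUM_MANO_JOINTS = 778, 16
rng = np.random.default_rng(0)

# Hypothetical joint regressor: each of the 16 MANO joints is a sparse convex
# combination (linear interpolation) of mesh vertices; rows sum to 1.
J_reg = rng.random((NUM_MANO_JOINTS, NUM_VERTS))
J_reg /= J_reg.sum(axis=1, keepdims=True)

# Hypothetical fingertip vertex indices (placeholders, not MANO's real ones).
TIP_IDX = [744, 320, 443, 555, 672]

def joints_from_mesh(verts):
    """Return 21 joints: 16 regressed MANO joints + 5 fingertip vertices."""
    joints16 = J_reg @ verts             # (16, 3)
    tips = verts[TIP_IDX]                # (5, 3) fingertip vertices
    return np.vstack([joints16, tips])   # (21, 3)

mesh = rng.standard_normal((NUM_VERTS, 3))  # a stand-in for M(beta, theta)
joints21 = joints_from_mesh(mesh)
```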
3. Inference Module (Inference Module).
The three-dimensional pose sequence predicted by the hand model-aware decoder (consisting of the three-dimensional hand joint point positions in the T hand images) may still contain some unsatisfactory results. The inference module is used to further optimize the spatio-temporal representation of the hand poses; through adaptive attention computation, it grasps the most critical cues and performs video-level classification.
A hand pose sequence is structured data with natural physical connections between the joints, which allows it to be naturally organized into a spatio-temporal graph. In embodiments of the invention, a graph convolutional network (GCN), which has proven to process graph-structured data efficiently, is used, followed by a classification output layer for video-level classification.
Denote the hand joint point position sequence output by the hand model-aware decoder as J_3D. The corresponding undirected spatio-temporal graph G(V, E) is defined by a vertex set V and an edge set E, where V contains all hand joint point positions, and E contains the intra-frame and inter-frame connections, namely the physical connections between hand joint points and the connections of the same joint point across time. The adjacency matrix Ā obtained from the edge set E, together with the identity matrix I, is used in the graph convolutional layers; the graph convolution is expressed as:

    F_out = Σ_k D_k^{-1/2} T_k D_k^{-1/2} F_in W_k,  with  T_k = A_k ⊙ M,  D_k^{mm} = Σ_n T_k^{mn}

where k indexes the group to which a neighborhood node belongs; W_k is the convolution kernel weight; Ā + I is decomposed into k sub-matrices, i.e. Ā + I = Σ_k A_k, with each sub-matrix A_k representing one decomposed set of connections; T_k is an intermediate variable used to compute the matrix D_k; M is a weight; the matrix D_k is used for normalization, with m and n as its row and column indices; and ⊙ is the Hadamard product symbol. Information about the hand joint points is propagated along the edges, yielding a spatio-temporal representation of each hand joint point (containing not only position information but also a degree of semantic information). Further, the learnable attention weight M, initialized to all ones, is applied to A_k via the Hadamard product to help the network capture discriminative cues.
In the embodiment of the invention, after several stacked graph convolutional layers, the classification output layer performs classification, thereby recognizing the vocabulary word corresponding to the hand sequence.
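A minimal sketch of one such graph-convolution layer, following the partitioned-adjacency form described above. The chain graph, the partition into "self" and "neighbor" sub-matrices, and the kernel sizes are illustrative assumptions:

```python
import numpy as np

def gcn_layer(F_in, A_subs, W_subs, M):
    """One partitioned graph-convolution layer.

    F_in:   (N, C_in) node features for the N joints.
    A_subs: list of (N, N) sub-matrices A_k with (A_bar + I) = sum of A_k.
    W_subs: list of (C_in, C_out) kernel weights W_k, one per partition.
    M:      (N, N) attention mask applied via the Hadamard product.
    """
    out = None
    for A_k, W_k in zip(A_subs, W_subs):
        T_k = A_k * M                                  # T_k = A_k (Hadamard) M
        deg = T_k.sum(axis=1)                          # degrees -> matrix D_k
        d_is = 1.0 / np.sqrt(np.maximum(deg, 1e-8))    # D_k^(-1/2), guarded
        A_norm = T_k * d_is[:, None] * d_is[None, :]   # symmetric normalization
        term = A_norm @ F_in @ W_k
        out = term if out is None else out + term
    return out

# Tiny example: 4 joints in a chain 0-1-2-3, partitioned into a "self"
# sub-matrix (identity I) and a "neighbor" sub-matrix (adjacency A_bar).
N, C_in, C_out = 4, 3, 2
A_bar = np.zeros((N, N))
for a, b in [(0, 1), (1, 2), (2, 3)]:
    A_bar[a, b] = A_bar[b, a] = 1.0
A_subs = [np.eye(N), A_bar]
rng = np.random.default_rng(0)
W_subs = [rng.standard_normal((C_in, C_out)) * 0.1 for _ in A_subs]
M = np.ones((N, N))            # attention mask, initialized to all ones
F_out = gcn_layer(rng.standard_normal((N, C_in)), A_subs, W_subs, M)
```

In the full model this layer would be stacked several times over the spatio-temporal joint graph before the classification output layer.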
Secondly, training a model.
In the embodiment of the invention, the visual encoder, the hand model-aware decoder and the inference module together constitute the recognition model. Since sign language datasets carry no hand pose annotation, in the Training Stage, besides the cross-entropy classification loss L_cls (Classification Loss), a corresponding loss function (a weakly supervised loss based on the spatial and temporal relations of the intermediate hand poses) is designed for the output of each stage to guide the learning of the intermediate pose representation. In the training stage, the overall loss function of the recognition model is expressed as:

    L = L_cls + λ_spa·L_spa + λ_tem·L_tem + λ_reg·L_reg

where L_cls denotes the cross-entropy classification loss of the inference module; L_spa and L_tem denote the spatial and temporal consistency losses of the hand joint point positions obtained by the hand model-aware decoder; L_reg denotes the regularization loss of the hand state in the latent semantic representation obtained by the visual encoder; λ_spa, λ_tem and λ_reg are the weighting factors of the corresponding losses.
During training, the parameters of the recognition model can be optimized in a conventional manner based on this overall loss function.
1. Regularization Loss (Regularization Loss).
To ensure that the hand model works properly and generates plausible hand meshes, a regularization loss is used to further constrain the magnitudes of some hidden features. The regularization loss L_reg is expressed as:

    L_reg = ||θ||₂² + w_β·||β||₂²

where w_β is a weighting factor, and θ and β are the hand pose and shape representations.
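A small numerical sketch of the regularization loss and the overall weighted objective. The weight values are assumptions; the patent leaves them unspecified:

```python
import numpy as np

def reg_loss(theta, beta, w_beta=0.1):
    """L_reg = ||theta||^2 + w_beta * ||beta||^2 (the w_beta value is assumed)."""
    return float(np.sum(theta ** 2) + w_beta * np.sum(beta ** 2))

def total_loss(l_cls, l_spa, l_tem, l_reg,
               lam_spa=1.0, lam_tem=1.0, lam_reg=0.1):  # assumed weights
    """Overall objective: L = L_cls + lam_spa*L_spa + lam_tem*L_tem + lam_reg*L_reg."""
    return l_cls + lam_spa * l_spa + lam_tem * l_tem + lam_reg * l_reg
```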
2. Spatial Consistency Loss (Spatial Consistency Loss).
In the embodiment of the invention, based on a weak-perspective camera model and the camera parameters output by the visual encoder, the three-dimensional pose sequence predicted by the hand model-aware decoder is mapped to two-dimensional space. The mapping process is expressed as:

    Ĵ_2D = c_s·Π(c_r·J_3D) + c_o

where Π(·) denotes orthogonal projection, and Ĵ_2D is the sequence obtained by mapping the hand joint point position sequence J_3D output by the hand model-aware decoder to two-dimensional space using the camera parameters.

Meanwhile, a two-dimensional position sequence J_2D of hand joint points (2D Joints), extracted in advance from the hand sequence by a two-dimensional pose detector (2D Hand Pose Detector), is used as a pseudo-label to constrain the mapping result Ĵ_2D to remain consistent with it:

    L_spa = Σ_{t=1}^{T} Σ_{j=1}^{N} 1(c(t,j) ≥ ε)·||Ĵ_2D(t,j) − J_2D(t,j)||₂²

where N is the total number of hand joint points (e.g. N = 21); T is the length of the hand sequence; (t, j) denotes the j-th hand joint point at time t; c(t, j) is the confidence of the pre-extracted position of the j-th joint at time t, and if c(t, j) is greater than or equal to the threshold ε, the joint participates in the spatial consistency loss L_spa, otherwise it does not; 1(·) denotes the indicator function.
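A hedged sketch of the weak-perspective projection and the confidence-masked spatial consistency loss. The identity rotation, threshold value and toy coordinates are illustrative assumptions:

```python
import numpy as np

def project(J3d, c_s, R, c_o):
    """Weak-perspective projection: rotate by R (from c_r), orthographically
    drop the depth axis, then scale by c_s and translate by c_o."""
    return (J3d @ R.T)[..., :2] * c_s + c_o

def spatial_loss(J2d_hat, J2d, conf, eps=0.5):
    """Sum of squared 2D errors over joints whose detector confidence >= eps.

    J2d_hat, J2d: (T, N, 2); conf: (T, N). eps = 0.5 is an assumed threshold.
    """
    mask = (conf >= eps).astype(float)[..., None]   # indicator 1(c >= eps)
    return float(np.sum(mask * (J2d_hat - J2d) ** 2))

# Toy example: T = 1 frame, N = 2 joints.
J3d = np.array([[[0.0, 0.0, 5.0], [1.0, 2.0, 3.0]]])
J2d_hat = project(J3d, c_s=1.0, R=np.eye(3), c_o=np.zeros(2))
J2d = np.zeros((1, 2, 2))          # pseudo-labels from the 2D detector
conf = np.array([[1.0, 0.2]])      # joint 1 is low-confidence, so ignored
```

With the low-confidence joint masked out the loss is zero; with full confidence, joint 1's offset of (1, 2) contributes 1 + 4 = 5.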
3. Loss of Temporal Consistency (Temporal Consistency Loss).
To avoid prediction jitter, the temporal consistency of the predicted three-dimensional joint points is further constrained. During signing, different hand joint points usually move at different speeds: joints closer to the palm usually move more slowly. The hand joint points are therefore divided into three groups {S_i | i = 0, 1, 2}, corresponding to the palm, middle and terminal joint sets respectively. The temporal consistency loss is expressed as:

    L_tem = Σ_i α_i Σ_{j ∈ S_i} Σ_{t=1}^{T−1} ||J_3D(t+1, j) − J_3D(t, j)||₂²

where J_3D is the hand joint point position sequence output by the hand model-aware decoder; (t, j) denotes the j-th hand joint point at time t; S_i is a set of hand joint points; and α_i is the penalty weight predefined for set S_i — sets with slower motion are given larger penalty weights.
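A minimal sketch of the grouped temporal consistency loss. The specific joint grouping and penalty weights α_i below are assumptions — the patent predefines them but does not list their values:

```python
import numpy as np

# Assumed grouping of the 21 joints into palm / middle / terminal sets,
# and assumed penalty weights: slower-moving sets are penalized more.
GROUPS = [list(range(0, 6)), list(range(6, 13)), list(range(13, 21))]
ALPHAS = [4.0, 2.0, 1.0]

def temporal_loss(J3d_seq, groups=GROUPS, alphas=ALPHAS):
    """Weighted sum of squared frame-to-frame joint displacements.

    J3d_seq: (T, 21, 3) sequence of predicted 3D joint positions.
    """
    loss = 0.0
    for S_i, a_i in zip(groups, alphas):
        diff = J3d_seq[1:, S_i] - J3d_seq[:-1, S_i]   # (T-1, |S_i|, 3)
        loss += a_i * float(np.sum(diff ** 2))
    return loss
```

A static sequence incurs zero loss; moving a palm-group joint by one unit per frame over three frames incurs 4.0 × 2 = 8.0 under the assumed weights.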
Thirdly, testing.
The main flow of the Testing Stage is the same as that of the training stage; the main difference is that the testing stage needs neither the camera parameters nor the computation of the various losses. The test flow is as follows: the cropped hand video sequence is input; the visual encoder produces the latent semantic representation of the hand state; the hand model-aware decoder produces the corresponding three-dimensional hand mesh; and finally the inference module refines the result to obtain the spatio-temporal representation of each hand joint point, from which video-level classification outputs the corresponding vocabulary word.
As shown on the right side of fig. 1, for a given hand sequence, the classification output layer of the inference module yields the probabilities of the different words, and the category with the maximum probability is selected.
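A toy sketch of this final classification step. The vocabulary and weights are hypothetical; in the real system the clip-level feature would come from the stacked GCN layers:

```python
import numpy as np

VOCAB = ["apple", "thanks", "friend"]   # hypothetical 3-word vocabulary

def softmax(z):
    e = np.exp(z - np.max(z))           # shift for numerical stability
    return e / e.sum()

def classify(clip_feat, W_cls, b_cls, vocab=VOCAB):
    """Video-level classification: linear output layer + softmax + argmax."""
    probs = softmax(clip_feat @ W_cls + b_cls)
    return vocab[int(np.argmax(probs))], probs

# Hand-picked weights so that class 1 ("thanks") wins for this feature.
feat = np.array([1.0, 0.0])
W_cls = np.array([[0.0, 5.0, 0.0],
                  [0.0, 0.0, 0.0]])
word, probs = classify(feat, W_cls, np.zeros(3))
```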
According to the scheme of the embodiment of the invention, model-driven and data-driven approaches can be fused, a hand-shape prior is introduced, the recognition accuracy of the system is improved, the intermediate result can be visualized, and the interpretability of the framework is enhanced.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to perform all or part of the above described functions.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.
Claims (7)
1. An isolated word sign language recognition method based on hand model perception, characterized by comprising the following steps:
for a hand sequence cropped from a sign language video, converting the hand sequence into a latent semantic representation containing the hand state through a visual encoder; then, through a hand model-aware decoder working in a model-aware manner, mapping the latent semantic representation into a three-dimensional hand mesh and obtaining the position of each hand joint point; and finally, refining through an inference module to obtain the spatio-temporal representation of each hand joint point, and classifying to recognize the vocabulary word corresponding to the hand sequence;
the visual encoder, the hand model perception decoder and the reasoning module are used as a recognition model, and in the training stage, the total loss function of the recognition model is expressed as follows:
wherein, the first and the second end of the pipe are connected with each other,represents the cross-entropy classification penalty of the inference module,andrepresenting the loss of spatial and temporal consistency of the hand joint point positions obtained by the hand model aware decoder,regularization loss of a hand state in latent semantic representation obtained by a visual encoder; lambda [ alpha ]spa、λtemAnd lambdaregRespectively, the weighting factors for the corresponding losses.
2. The isolated word sign language recognition method based on hand model perception according to claim 1, characterized in that the input to the visual encoder is a hand sequence V' = {v_t}_{t=1}^T containing T frames cropped from the sign language video, and the visual encoder converts the hand sequence V' into the latent semantic representation:

    (θ, β, c_r, c_o, c_s) = E(V')

where E(·) denotes the visual encoder; v_t denotes the hand image at time t, and T is the length of the hand sequence; θ and β represent the hand state, being the representations of hand pose and shape respectively; c_r, c_o and c_s are camera parameters indicating rotation, translation and scaling respectively.
3. The isolated word sign language recognition method based on hand model perception according to claim 1, characterized in that
the hand model-aware decoder is a statistical model learned in advance from hand scan data, and its mapping process is expressed as:
M(β,θ)=W(T(β,θ),J(β),θ,W′)
where T(β, θ) is the corrected template obtained by applying the blend functions B_S(·) and B_P(·) to the previously learned hand template T̄ according to the hand pose and shape representations θ and β, which represent the hand state; W′ is the blend weights; W(·) is the skeletal skinning animation function; M(β, θ) is the three-dimensional hand mesh; J(β) is the representation of the hand shape comprising a plurality of hand joints, provided by the hand model-aware decoder;
and the hand joint point positions are obtained through the three-dimensional hand mesh M(β, θ), the hand joint points comprising a plurality of hand joints and 5 fingertip points.
4. The isolated word sign language recognition method based on hand model perception according to claim 1, characterized in that the inference module comprises graph convolutional layers and a classification output layer;
recording the position sequence of hand joint points output by the hand model sensing decoder asThe corresponding undirected space-time diagram G (V, E) is defined by a point set V and an edge set E, wherein the point set V comprises all hand joint point positions, and the edge set E comprises intra-frame and inter-frame connections, namely, the physical connection of the hand joint points and the connection of the same joint point along time; adjacency matrix obtained from edge set EAnd the identity matrix I are used for the graph convolution neural network layer, and the process of graph convolution is expressed as follows:
where k is the group to which the neighborhood node belongs, WkIs the weight of the convolution kernel and is,is decomposed into k sub-matrices, i.e.:each sub-matrix AkRepresenting the connection after disassembly, TkIs an intermediate variable used for calculating a matrix D, M is weight, the matrix D is used for normalization, M and n are row and column numbers of the matrix D,is a Hadamard product symbol;
the information of the hand joint points is transmitted between the edges, so that the space-time representation of each hand joint point is obtained;
after the neural network layers are convolved by a plurality of stacked graphs, the graphs are classified by a classification output layer, so that words corresponding to the hand sequences are recognized.
5. The isolated word sign language recognition method based on hand model perception according to claim 1, characterized in that the regularization loss L_reg is expressed as:

    L_reg = ||θ||₂² + w_β·||β||₂²

where w_β is a weighting factor, and θ and β represent the hand state, being the representations of hand pose and shape respectively.
6. The isolated word sign language recognition method based on hand model perception according to claim 1, characterized in that the spatial consistency loss L_spa is expressed as:

    L_spa = Σ_{t=1}^{T} Σ_{j=1}^{N} 1(c(t,j) ≥ ε)·||Ĵ_2D(t,j) − J_2D(t,j)||₂²

where N is the total number of hand joint points; T is the length of the hand sequence; Ĵ_2D is the sequence obtained by mapping the hand joint point position sequence J_3D output by the hand model-aware decoder to two-dimensional space using the camera parameters; J_2D is the two-dimensional hand joint point sequence extracted in advance from the hand sequence and used as the pseudo-label; (t, j) denotes the j-th hand joint point at time t; c(t, j) is the confidence of the pre-extracted position of the j-th joint at time t, and if c(t, j) is greater than or equal to the threshold ε, the joint participates in the computation of the spatial consistency loss L_spa; 1(·) denotes the indicator function;

the process of mapping the hand joint point positions J_3D output by the hand model-aware decoder to two-dimensional space using the camera parameters is expressed as:

    Ĵ_2D = c_s·Π(c_r·J_3D) + c_o

where Π(·) denotes orthogonal projection, and c_r, c_o and c_s are camera parameters indicating rotation, translation and scaling respectively.
7. The isolated word sign language recognition method based on hand model perception according to claim 1, characterized in that the temporal consistency loss L_tem is expressed as:

    L_tem = Σ_i α_i Σ_{j ∈ S_i} Σ_{t=1}^{T−1} ||J_3D(t+1, j) − J_3D(t, j)||₂²

where J_3D is the hand joint point position sequence output by the hand model-aware decoder; (t, j) denotes the j-th hand joint point at time t; S_i is a set of hand joint points, with {S_i | i = 0, 1, 2} corresponding to the palm, middle and terminal joint sets respectively; and α_i is the penalty weight predefined for set S_i.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110016997.XA CN112668543B (en) | 2021-01-07 | 2021-01-07 | Isolated word sign language recognition method based on hand model perception |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112668543A CN112668543A (en) | 2021-04-16 |
CN112668543B true CN112668543B (en) | 2022-07-15 |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113239834B (en) * | 2021-05-20 | 2022-07-15 | 中国科学技术大学 | Sign language recognition system capable of pre-training sign model perception representation |
CN113239835B (en) * | 2021-05-20 | 2022-07-15 | 中国科学技术大学 | Model-aware gesture migration method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7418390B1 (en) * | 2000-11-20 | 2008-08-26 | Yahoo! Inc. | Multi-language system for online communications |
CN111145865A (en) * | 2019-12-26 | 2020-05-12 | 中国科学院合肥物质科学研究院 | Vision-based hand fine motion training guidance system and method |
CN111325099A (en) * | 2020-01-21 | 2020-06-23 | 南京邮电大学 | Sign language identification method and system based on double-current space-time diagram convolutional neural network |
CN111832468A (en) * | 2020-07-09 | 2020-10-27 | 平安科技(深圳)有限公司 | Gesture recognition method and device based on biological recognition, computer equipment and medium |
Non-Patent Citations (2)
Title |
---|
A novel chipless RFID-based stretchable and wearable hand gesture sensor; Taoran Le et al.; 2015 European Microwave Conference (EuMC); 2015-12-03; pp. 371-374 *
Improvement and implementation of a Kinect-based dynamic gesture recognition algorithm (基于Kinect的动态手势识别算法改进与实现); Li Guoyou et al.; High Technology Letters (《高技术通讯》); September 2019; vol. 29, no. 9; pp. 841-851 *
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |