CN112668543B - Isolated word sign language recognition method based on hand model perception - Google Patents

Isolated word sign language recognition method based on hand model perception

Info

Publication number
CN112668543B
Authority
CN
China
Prior art keywords
hand
sequence
model
joint point
sign language
Prior art date
Legal status: Active
Application number
CN202110016997.XA
Other languages
Chinese (zh)
Other versions
CN112668543A (en)
Inventor
李厚强 (Houqiang Li)
周文罡 (Wengang Zhou)
胡鹤臻 (Hezhen Hu)
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China (USTC)
Priority to CN202110016997.XA
Publication of CN112668543A
Application granted
Publication of CN112668543B


Abstract

The invention discloses an isolated word sign language recognition method based on hand model perception, which comprises the following steps: a hand sequence cropped from a sign language video is first converted by a visual encoder into a latent semantic representation containing the hand state; a hand model-aware decoder, working in a model-aware manner, then maps this latent semantic representation into a three-dimensional hand mesh and obtains the position of each hand joint point; finally, an inference module refines the result to obtain a spatio-temporal representation of each hand joint point, which is classified to recognize the word corresponding to the hand sequence. The method fuses model-driven and data-driven approaches, introduces a hand-shape prior, improves the recognition accuracy of the system, and can visualize the intermediate result (namely the three-dimensional hand mesh), enhancing the interpretability of the framework.

Description

Isolated word sign language recognition method based on hand model perception
Technical Field
The invention relates to the technical field of sign language recognition, in particular to an isolated word sign language recognition method based on hand model perception.
Background
According to 2020 data from the World Health Organization (WHO), about 466 million people worldwide have disabling hearing loss, accounting for about 5% of the global population. Among the hearing-impaired population, the most common communication medium is sign language. Sign language is a visual language with its own unique linguistic characteristics. It conveys semantic information mainly through manual features (hand shape, hand movement, position, etc.), assisted by fine-grained non-manual features (facial expression, lip shape, etc.).
Sign language recognition has been developed and widely studied to bridge the communication gap between hearing and deaf people. It converts an input sign language video into the corresponding text by computer algorithms. Isolated word sign language recognition is a basic task in this field: an input sign language video is recognized as the word it corresponds to. The general recognition process is to first extract a representation from the input sign language video, then transform the representation into a probability vector, and take the category with the maximum probability as the recognition result.
The hand plays a dominant role in sign language expression, yet it occupies only a small spatial area and is highly articulated. Compared with the body and face, the hand has a more uniform appearance and fewer locally discriminative features. In sign language video, the hand often suffers from motion blur and self-occlusion, and the background is complex.
Early work often employed manually designed features to describe gestures. With the development of deep learning and hardware computing power in recent years, sign language recognition systems based on deep learning have gradually become dominant. A typical pipeline extracts a representation with a convolutional neural network (CNN), converts it into a probability vector through a fully connected layer and a Softmax layer, and takes the category with the maximum probability as the recognition result. More recently, some work has cropped the hand out as an additional auxiliary branch and achieved some performance gains. These deep-learning-based methods all follow a data-driven paradigm in which features are learned under the supervision of video category labels. However, purely data-driven sign language recognition has the following problems: limited interpretability, and a tendency to overfit with limited training data. Since labeling sign language data requires professional knowledge, existing sign language datasets have far fewer samples per category than action recognition datasets, so the recognition accuracy of existing schemes still needs to be improved.
Disclosure of Invention
The invention aims to provide an isolated word sign language recognition method based on hand model perception, which can improve the recognition accuracy of a system and enhance the interpretability of a recognition framework.
The purpose of the invention is realized by the following technical scheme:
a sign language recognition method for isolated words based on hand model perception comprises the following steps:
for a hand sequence cropped from a sign language video, converting the hand sequence into a latent semantic representation containing the hand state through a visual encoder; then, working in a model-aware manner through a hand model-aware decoder, mapping the latent semantic representation containing the hand state into a three-dimensional hand mesh and obtaining the position of each hand joint point; and finally, refining the result through an inference module to obtain the spatio-temporal representation of each hand joint point, and classifying it to recognize the word corresponding to the hand sequence.
It can be seen from the technical scheme provided by the invention that the method fuses model-driven and data-driven approaches, introduces a hand-shape prior to improve the recognition accuracy of the system, and can visualize the intermediate result (namely the three-dimensional hand mesh), enhancing the interpretability of the framework.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a frame diagram of an isolated word sign language recognition method based on hand model perception according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
To address the technical problems in the prior art, an embodiment of the present invention provides an isolated word sign language recognition method based on hand model perception, which fuses model-driven and data-driven approaches, introduces a hand-shape prior, improves the recognition accuracy of the system and enhances its interpretability. Fig. 1 shows a framework diagram of the method. The main recognition process is as follows: for a hand sequence cropped from a sign language video, the hand sequence is converted into a latent semantic representation containing the hand state through a visual encoder; then, a hand model-aware decoder, working in a model-aware manner, maps the latent semantic representation containing the hand state into a three-dimensional hand mesh and obtains the position of each hand joint point; finally, an inference module refines the result to obtain the spatio-temporal representation of each hand joint point, which is classified to recognize the word corresponding to the hand sequence.
For ease of understanding, the various parts of the recognition framework and the corresponding training and testing processes are described in detail below in conjunction with the framework diagram shown in fig. 1.
First, the framework structure.
1. Visual Encoder (Visual Encoder).
In the embodiment of the invention, the input of the visual encoder is a hand sequence of T frames cropped from the sign language video, denoted V′ = {v_t, t = 1, …, T}.
The visual encoder converts the hand sequence V′ into a latent semantic representation:
(θ_t, β_t, c_t) = E(v_t), t = 1, …, T
where E(·) denotes the visual encoder; v_t is the hand image at time t and T is the length of the hand sequence; θ and β describe the hand state and are the representations of hand pose and hand shape, respectively; c_r, c_o and c_s are the camera parameters c indicating rotation, translation and scale, respectively.
In the embodiment of the invention, the hand sequence V′ is an RGB hand video sequence; it can be cropped from the sign language video in a conventional manner, and the datasets used in both the training and testing stages consist of hand sequences cropped from sign language videos.
Illustratively, the visual encoder may be implemented by appending a fully-connected layer to the end of a ResNet.
Illustratively, the latent semantic representation of each frame is a real-valued vector (θ_t, β_t, c_t) ∈ ℝ^d, where ℝ denotes the set of real numbers and d is the total dimensionality of the pose, shape and camera parameters.
2. Hand Model-aware Decoder (Model-aware Decoder).
The hand model-aware decoder maps the latent semantic feature vector to a compact pose representation in a model-aware manner. Through the hand prior encoded in advance, it constrains the distribution of possible gestures and implicitly filters out implausible poses during the mapping. As a result, it produces more compact and more reliable hand poses and reduces the optimization difficulty for the downstream inference module.
In the embodiment of the present invention, the hand model-aware decoder is a statistical module; for example, the differentiable MANO hand model can be used as the hand model-aware decoder.
The hand model-aware decoder is learned in advance from a large number of high-quality hand scans, yielding a hand template T̄; in this way, the hand prior is encoded. At the same time, it establishes a compact mapping to describe the hand, i.e. from a low-dimensional semantic vector (the latent semantic feature vector) to a high-dimensional triangular hand mesh (containing 778 vertices and 1,538 faces).
The mapping process of the hand model-aware decoder is expressed as:
M(β, θ) = W(T(β, θ), J(β), θ, W′)
T(β, θ) = T̄ + B_S(β) + B_P(θ)
where T(β, θ) is the corrected hand template obtained by adding to the pre-learned template T̄ the offsets produced by the blend functions B_S(·) and B_P(·) from the shape representation β and the pose representation θ; W′ is the blend weight; J(β) gives the positions of the hand joints as a function of the hand shape, provided by the hand model-aware decoder; W(·) is the skeletal skinning animation (linear blend skinning) function; and M(β, θ) is the three-dimensional hand mesh (3D Mesh).
Meanwhile, the more compact three-dimensional hand joint (3D Joint) positions can be obtained by linear interpolation of the relevant mesh vertices. Since the MANO hand model provides only 16 hand joint points, 5 fingertip points are additionally extracted from the three-dimensional hand mesh, giving 21 hand joint points in total.
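For illustration, the sketch below walks through the two steps just described: correcting the learned template with shape and pose blend offsets, T(β, θ) = T̄ + B_S(β) + B_P(θ), and appending 5 fingertip vertices to the 16 MANO joints to obtain 21 joint points. It is a toy sketch: the blend bases, joint regressor and fingertip vertex indices are random or hypothetical stand-ins that a real system would load from a pre-trained MANO model, and the skinning step W(·) is omitted.

```python
import torch

V, F, J = 778, 1538, 16                # MANO mesh: 778 vertices, 1538 faces, 16 joints
POSE_D, SHAPE_D = 48, 10               # assumed parameter sizes

T_bar   = torch.randn(V, 3)            # learned hand template (placeholder values)
S_basis = torch.randn(SHAPE_D, V, 3)   # shape blend basis B_S (placeholder)
P_basis = torch.randn(POSE_D, V, 3)    # pose blend basis B_P (placeholder)
J_reg   = torch.rand(J, V)             # joint regressor giving J(beta) from the mesh
TIP_IDS = [745, 317, 444, 556, 673]    # fingertip vertex indices (hypothetical)

def decode(beta, theta):
    # corrected template: T(beta, theta) = T_bar + B_S(beta) + B_P(theta)
    t_corr = T_bar + torch.einsum('s,svk->vk', beta, S_basis) \
                   + torch.einsum('p,pvk->vk', theta, P_basis)
    # (the skinning function W(T, J, theta, W') would deform t_corr here)
    mesh = t_corr                                          # stand-in for M(beta, theta)
    joints16 = J_reg @ mesh                                # (16, 3) regressed joints
    joints21 = torch.cat([joints16, mesh[TIP_IDS]], dim=0) # add 5 fingertips -> 21 joints
    return mesh, joints21

mesh, joints = decode(torch.zeros(SHAPE_D), torch.zeros(POSE_D))
print(mesh.shape, joints.shape)        # torch.Size([778, 3]) torch.Size([21, 3])
```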
The hand model-aware decoder can also expose its intermediate result, i.e. the reconstructed three-dimensional hand mesh, which enhances the interpretability of the framework.
3. Inference Module (Inference Module).
The three-dimensional pose sequence predicted by the hand model-aware decoder (consisting of the three-dimensional hand joint positions in the T hand images) may still contain unsatisfactory results. The inference module is used to further optimize the spatio-temporal representation of the hand pose. Through adaptive attention computation, the inference module captures the most critical cues and performs video-level classification.
A hand pose sequence is structured data that exhibits natural physical connections between joints, which allows it to be naturally organized as a spatio-temporal graph. In the embodiment of the invention, a graph convolutional network (GCN), which has proven to process graph-structured data efficiently, is used, followed by a classification output layer that performs video-level classification.
Let the hand joint position sequence output by the hand model-aware decoder be J_3D. The corresponding undirected spatio-temporal graph G(V, E) is defined by a point set V and an edge set E: the point set V contains all hand joint positions, and the edge set E contains intra-frame and inter-frame connections, namely the physical connections between hand joints and the connections of the same joint along time. The adjacency matrix A obtained from the edge set E and the identity matrix I are used in the graph convolutional layers, and the graph convolution is expressed as:
f_out = Σ_k W_k f_in (D_k^(−1/2) T_k D_k^(−1/2)), with T_k = A_k ⊙ M and D_k^(mm) = Σ_n T_k^(mn)
where k indexes the group to which a neighborhood node belongs and W_k is the weight of the corresponding convolution kernel; the matrix A + I is decomposed into k sub-matrices, i.e.
A + I = Σ_k A_k
and each sub-matrix A_k represents one of the decomposed connection sets; T_k is an intermediate variable used to compute the matrix D, M is an attention weight, the matrix D is used for normalization, m and n are the row and column indices of the matrix D, and ⊙ is the Hadamard product. By propagating information about the hand joints along the edges, a spatio-temporal representation of each hand joint point is obtained (containing not only position information but also certain semantic information). The Hadamard product with the learnable attention weight M, initialized to all ones, is applied to A_k to help the network capture discriminative cues.
In the embodiment of the invention, after several stacked graph convolutional layers, the classification output layer performs the classification, thereby recognizing the word corresponding to the hand sequence.
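For illustration, one such spatial graph-convolution layer can be sketched as below: each partition sub-matrix A_k is modulated by a learnable attention mask M (initialized to all ones) via the Hadamard product, symmetrically normalized, and used to aggregate joint features with a kernel weight W_k. Temporal edges would be handled by a temporal convolution stacked after this layer; the partition strategy, feature sizes and class name are assumptions.

```python
import torch
import torch.nn as nn

class HandGraphConv(nn.Module):
    """One spatial graph-convolution layer over the N hand joints."""
    def __init__(self, in_ch, out_ch, A_parts):            # A_parts: (K, N, N) sub-matrices A_k
        super().__init__()
        self.register_buffer('A', A_parts.float())
        self.M = nn.Parameter(torch.ones_like(A_parts, dtype=torch.float32))  # attention, init to all ones
        self.convs = nn.ModuleList([nn.Linear(in_ch, out_ch) for _ in range(A_parts.size(0))])

    def forward(self, x):                                   # x: (T, N, in_ch) per-frame joint features
        out = 0
        for k, conv in enumerate(self.convs):
            Tk = self.A[k] * self.M[k]                      # Hadamard product A_k ⊙ M
            deg = Tk.sum(dim=-1).clamp(min=1e-6)            # row sums -> diagonal normalization matrix D
            Dn = torch.diag(deg.pow(-0.5))
            out = out + (Dn @ Tk @ Dn) @ conv(x)            # normalized aggregation with kernel weight W_k
        return torch.relu(out)

# usage: 21 joints, a toy single-partition adjacency with self-loops only
N = 21
layer = HandGraphConv(3, 64, torch.eye(N).unsqueeze(0))
feats = layer(torch.randn(16, N, 3))                        # (T=16, 21, 64)
```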
Second, model training.
In the embodiment of the invention, the visual encoder, the hand model-aware decoder and the inference module together form the recognition model. Since sign language datasets provide no hand pose annotations, in the training stage (Training Stage), in addition to the cross-entropy classification loss L_cls (Classification Loss), corresponding loss functions (weakly supervised losses based on the spatial and temporal relations of the intermediate hand poses) are designed for the output of each stage to guide the learning of the intermediate pose representation. In the training stage, the overall loss function of the recognition model is expressed as:
L = L_cls + λ_spa·L_spa + λ_tem·L_tem + λ_reg·L_reg
where L_cls is the cross-entropy classification loss of the inference module; L_spa and L_tem are the spatial and temporal consistency losses on the hand joint positions obtained by the hand model-aware decoder; L_reg is the regularization loss on the hand state in the latent semantic representation obtained by the visual encoder; and λ_spa, λ_tem and λ_reg are the weighting factors of the corresponding losses.
Based on this total loss function, the parameters of the recognition model can be trained in a conventional manner.
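For illustration, the combination of the four terms can be written as the sketch below; the weighting factor values shown are placeholders, since the embodiment does not fix them here.

```python
# placeholder weighting factors; actual values are hyper-parameters to be tuned
LAMBDA_SPA, LAMBDA_TEM, LAMBDA_REG = 1.0, 1.0, 0.1

def total_loss(l_cls, l_spa, l_tem, l_reg):
    """Overall training objective: classification loss plus weighted weakly supervised terms."""
    return l_cls + LAMBDA_SPA * l_spa + LAMBDA_TEM * l_tem + LAMBDA_REG * l_reg
```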
1. Regularization Loss (Regularization Loss).
To ensure that the hand model works properly and generates plausible hand meshes, a regularization loss is used to constrain the magnitudes of some of the latent features. The regularization loss L_reg is expressed as:
L_reg = ||θ||₂² + w_β·||β||₂²
where w_β is a weighting factor, and θ and β represent the hand state, i.e. the representations of hand pose and shape respectively.
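For illustration, a sketch of this term under the squared-norm form given above, averaged over the frames of the sequence; the value of w_β and the averaging are assumptions.

```python
import torch

def regularization_loss(theta, beta, w_beta=0.1):
    # theta: (T, pose_dim), beta: (T, shape_dim) predicted per-frame hand states
    return (theta ** 2).sum(dim=-1).mean() + w_beta * (beta ** 2).sum(dim=-1).mean()
```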
2. Spatial Consistency Loss (Spatial Consistency Loss).
In the embodiment of the invention, based on a weak-perspective camera model, the camera parameters output by the visual encoder are used to map the three-dimensional pose sequence predicted by the hand model-aware decoder to the two-dimensional space. The mapping process is expressed as:
Ĵ_2D = c_s·Π(R(c_r)·J_3D) + c_o
where Π(·) denotes the orthogonal projection, R(c_r) is the rotation determined by the camera rotation parameter c_r, and Ĵ_2D is the position sequence obtained by mapping the hand joint position sequence J_3D output by the hand model-aware decoder to the two-dimensional space using the camera parameters.
Meanwhile, a two-dimensional hand joint position sequence J_2D (2D Joints) is extracted in advance from the hand sequence by a two-dimensional hand pose detector (2D Hand Pose Detector) and used as a pseudo label, constraining the mapped result Ĵ_2D to stay consistent with it.
The spatial consistency loss L_spa is expressed as:
L_spa = (1 / (N·T)) Σ_{t=1..T} Σ_{j=1..N} 1[c(t, j) ≥ ε]·||Ĵ_2D(t, j) − J_2D(t, j)||₂²
where N is the total number of hand joint points (e.g., N = 21); T is the length of the hand sequence; (t, j) denotes the j-th hand joint point at time t; c(t, j) is the confidence of the j-th hand joint position at time t extracted in advance, and the corresponding term participates in the spatial consistency loss L_spa only if c(t, j) is greater than or equal to the threshold ε, otherwise it is not counted; and 1[·] denotes the indicator function.
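For illustration, the sketch below follows the two steps above: a weak-perspective projection of the predicted three-dimensional joints using the encoder's camera output, then a confidence-masked distance to the detector's two-dimensional joints. Keeping only scale and translation in the projection (treating rotation as absorbed in the predicted pose) and normalizing by the number of confident joints are assumptions for illustration.

```python
import torch

def spatial_consistency_loss(j3d, j2d, conf, cam_s, cam_o, eps=0.5):
    # j3d: (T, N, 3) predicted joints; j2d: (T, N, 2) detector pseudo labels
    # conf: (T, N) detector confidences; cam_s: (T, 1) scale; cam_o: (T, 2) translation
    proj = cam_s.unsqueeze(-1) * j3d[..., :2] + cam_o.unsqueeze(1)   # weak-perspective projection
    mask = (conf >= eps).float()                                     # indicator 1[c(t, j) >= eps]
    err = ((proj - j2d) ** 2).sum(dim=-1)                            # squared 2-D distance per joint
    return (mask * err).sum() / mask.sum().clamp(min=1.0)
```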
3. Temporal Consistency Loss (Temporal Consistency Loss).
To avoid jittery predictions, the temporal consistency of the predicted three-dimensional joint points is further constrained. During signing, different hand joints usually move at different speeds: joints closer to the palm usually move more slowly. The hand joint points are therefore divided into three groups {S_i, i = 0, 1, 2}, corresponding to the palm, middle and terminal joint sets, respectively.
The temporal consistency loss L_tem is expressed as:
L_tem = Σ_{i=0..2} α_i Σ_{j∈S_i} Σ_{t=1..T−1} ||J_3D(t+1, j) − J_3D(t, j)||₂²
where J_3D is the hand joint position sequence output by the hand model-aware decoder, (t, j) denotes the j-th hand joint point at time t, S_i is a set of hand joint points, and α_i is the penalty weight predefined for the set S_i; sets with slower motion are given larger penalty weights.
Third, testing.
The main flow of the testing stage (Testing Stage) is the same as that of the training stage; the main difference is that the testing stage needs neither the camera parameters nor the computation of the various losses. The main flow of the testing stage is as follows: the cropped hand video sequence is input; the latent semantic representation of the hand state is obtained through the visual encoder; the corresponding three-dimensional hand mesh is obtained through the hand model-aware decoder; and finally the inference module refines the result to obtain the spatio-temporal representation of each hand joint point, on which video-level classification is performed to output the corresponding word.
As shown on the right side of fig. 1, for the input hand images the classification output layer of the inference module produces probabilities for the different words, and the category with the maximum probability is selected.
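For illustration, the test-time flow can be tied together as in the sketch below, reusing the illustrative components from the previous sketches (VisualEncoder, the decode() stand-in, HandGraphConv); the pooling strategy, classifier head and vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

def recognize(frames, encoder, decode_fn, gcn, classifier):
    # frames: (T, 3, H, W) cropped hand images from one sign video
    theta, beta, cam = encoder(frames)                        # latent semantic representation
    joints = torch.stack([decode_fn(b, t)[1]                  # (T, 21, 3) 3-D joints per frame
                          for b, t in zip(beta, theta)])
    feats = gcn(joints)                                       # (T, 21, C) spatio-temporal features
    logits = classifier(feats.mean(dim=(0, 1)))               # pool over time and joints
    return logits.softmax(dim=-1).argmax().item()             # index of the recognized word

# usage with the sketches above and an assumed 100-word vocabulary
enc, gcn = VisualEncoder(), HandGraphConv(3, 64, torch.eye(21).unsqueeze(0))
cls_head = nn.Linear(64, 100)
word_id = recognize(torch.randn(16, 3, 224, 224), enc, decode, gcn, cls_head)
```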
According to the scheme of the embodiment of the invention, model-driven and data-driven approaches can be fused, a hand-shape prior is introduced, the recognition accuracy of the system is improved, the intermediate result can be visualized, and the interpretability of the framework is enhanced.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to perform all or part of the above described functions.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (7)

1. An isolated word sign language recognition method based on hand model perception, characterized by comprising the following steps:
for a hand sequence cropped from a sign language video, converting the hand sequence into a latent semantic representation containing the hand state through a visual encoder; then, working in a model-aware manner through a hand model-aware decoder, mapping the latent semantic representation containing the hand state into a three-dimensional hand mesh and obtaining the position of each hand joint point; and finally, refining through an inference module to obtain the spatio-temporal representation of each hand joint point, and classifying to recognize the word corresponding to the hand sequence;
the visual encoder, the hand model-aware decoder and the inference module together form a recognition model, and in the training stage the overall loss function of the recognition model is expressed as:
L = L_cls + λ_spa·L_spa + λ_tem·L_tem + λ_reg·L_reg
where L_cls is the cross-entropy classification loss of the inference module; L_spa and L_tem are the spatial and temporal consistency losses on the hand joint positions obtained by the hand model-aware decoder; L_reg is the regularization loss on the hand state in the latent semantic representation obtained by the visual encoder; and λ_spa, λ_tem and λ_reg are the weighting factors of the corresponding losses.
2. The isolated word sign language recognition method based on hand model perception according to claim 1, wherein the input of the visual encoder is a hand sequence of T frames cropped from the sign language video, denoted V′ = {v_t, t = 1, …, T};
the visual encoder converts the hand sequence V′ into the latent semantic representation:
(θ_t, β_t, c_t) = E(v_t), t = 1, …, T
where E(·) denotes the visual encoder; v_t is the hand image at time t and T is the length of the hand sequence; θ and β describe the hand state and are the representations of hand pose and hand shape, respectively; c_r, c_o and c_s are the camera parameters indicating rotation, translation and scale, respectively.
3. The isolated word sign language recognition method based on hand model perception according to claim 1, wherein
the hand model-aware decoder is a statistical module learned in advance from hand scan data, and its mapping process is expressed as:
M(β, θ) = W(T(β, θ), J(β), θ, W′)
T(β, θ) = T̄ + B_S(β) + B_P(θ)
where T(β, θ) is the corrected hand template obtained by adding to the pre-learned hand template T̄ the offsets produced by the blend functions B_S(·) and B_P(·) from the pose and shape representations θ and β, which represent the hand state; W′ is the blend weight; W(·) is the skeletal skinning animation function; M(β, θ) is the three-dimensional hand mesh; J(β) is the representation of the hand shape comprising a plurality of hand joints provided by the hand model-aware decoder;
and the hand joint positions are obtained from the three-dimensional hand mesh M(β, θ), the hand joint points comprising a plurality of hand joints and 5 fingertip points.
4. The isolated word sign language recognition method based on hand model perception according to claim 1, wherein the inference module comprises graph convolutional neural network layers and a classification output layer;
recording the position sequence of hand joint points output by the hand model sensing decoder as
J_3D; the corresponding undirected spatio-temporal graph G(V, E) is defined by a point set V and an edge set E, wherein the point set V contains all hand joint positions, and the edge set E contains intra-frame and inter-frame connections, namely the physical connections between the hand joints and the connections of the same joint along time; the adjacency matrix A obtained from the edge set E and the identity matrix I are used in the graph convolutional neural network layers, and the graph convolution process is expressed as:
f_out = Σ_k W_k f_in (D_k^(−1/2) T_k D_k^(−1/2)), with T_k = A_k ⊙ M and D_k^(mm) = Σ_n T_k^(mn)
where k is the group to which a neighborhood node belongs and W_k is the weight of the convolution kernel; the matrix A + I is decomposed into k sub-matrices, i.e.
A + I = Σ_k A_k
each sub-matrix A_k representing one of the decomposed connection sets; T_k is an intermediate variable used to compute the matrix D, M is a weight, the matrix D is used for normalization, m and n are the row and column indices of the matrix D, and ⊙ is the Hadamard product;
the information of the hand joint points is transmitted along the edges, so that the spatio-temporal representation of each hand joint point is obtained;
after a plurality of stacked graph convolutional neural network layers, the classification output layer performs classification, thereby recognizing the word corresponding to the hand sequence.
5. The isolated word sign language recognition method based on hand model perception according to claim 1, wherein the regularization loss L_reg is expressed as:
L_reg = ||θ||₂² + w_β·||β||₂²
where w_β is a weighting factor, and θ and β represent the hand state, being the representations of hand pose and shape respectively.
6. The isolated word sign language recognition method based on hand model perception according to claim 1, wherein the spatial consistency loss L_spa is expressed as:
L_spa = (1 / (N·T)) Σ_{t=1..T} Σ_{j=1..N} 1[c(t, j) ≥ ε]·||Ĵ_2D(t, j) − J_2D(t, j)||₂²
where N is the total number of hand joint points; T is the length of the hand sequence; Ĵ_2D is the position sequence obtained by mapping the hand joint position sequence J_3D output by the hand model-aware decoder to the two-dimensional space using the camera parameters; J_2D is the two-dimensional hand joint position sequence extracted in advance from the hand sequence and used as a pseudo label; (t, j) denotes the j-th hand joint point at time t; c(t, j) is the confidence of the j-th hand joint position at time t extracted in advance, and the corresponding term participates in the computation of the spatial consistency loss L_spa only if c(t, j) is greater than or equal to the threshold ε; 1[·] denotes the indicator function;
the process of mapping the hand joint positions J_3D output by the hand model-aware decoder to the two-dimensional space using the camera parameters is expressed as:
Ĵ_2D = c_s·Π(R(c_r)·J_3D) + c_o
where Π(·) denotes the orthogonal projection, R(c_r) is the rotation determined by c_r, and c_r, c_o and c_s are the camera parameters indicating rotation, translation and scale, respectively.
7. The isolated word sign language recognition method based on hand model perception according to claim 1, wherein the temporal consistency loss L_tem is expressed as:
L_tem = Σ_{i=0..2} α_i Σ_{j∈S_i} Σ_{t=1..T−1} ||J_3D(t+1, j) − J_3D(t, j)||₂²
where J_3D is the hand joint position sequence output by the hand model-aware decoder; (t, j) denotes the j-th hand joint point at time t; S_i is a set of hand joint points, {S_i, i = 0, 1, 2} corresponding to the palm, middle and terminal joint sets respectively; and α_i is the penalty weight predefined for the set S_i.
CN202110016997.XA 2021-01-07 2021-01-07 Isolated word sign language recognition method based on hand model perception Active CN112668543B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110016997.XA CN112668543B (en) 2021-01-07 2021-01-07 Isolated word sign language recognition method based on hand model perception

Publications (2)

Publication Number Publication Date
CN112668543A (en) 2021-04-16
CN112668543B (en) 2022-07-15

Family

ID=75413421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110016997.XA Active CN112668543B (en) 2021-01-07 2021-01-07 Isolated word sign language recognition method based on hand model perception

Country Status (1)

Country Link
CN (1) CN112668543B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239834B (en) * 2021-05-20 2022-07-15 中国科学技术大学 Sign language recognition system capable of pre-training sign model perception representation
CN113239835B (en) * 2021-05-20 2022-07-15 中国科学技术大学 Model-aware gesture migration method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7418390B1 (en) * 2000-11-20 2008-08-26 Yahoo! Inc. Multi-language system for online communications
CN111145865A (en) * 2019-12-26 2020-05-12 中国科学院合肥物质科学研究院 Vision-based hand fine motion training guidance system and method
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network
CN111832468A (en) * 2020-07-09 2020-10-27 平安科技(深圳)有限公司 Gesture recognition method and device based on biological recognition, computer equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A novel chipless RFID-based stretchable and wearable hand gesture sensor; Taoran Le et al.; 2015 European Microwave Conference (EuMC); 2015-12-03; pp. 371-374 *
Improvement and implementation of a Kinect-based dynamic gesture recognition algorithm; Li Guoyou et al.; High Technology Letters; 2019-09-30; Vol. 29, No. 9; pp. 841-851 *

Also Published As

Publication number Publication date
CN112668543A (en) 2021-04-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant