CN112668543A - Isolated word sign language recognition method based on hand model perception - Google Patents
Isolated word sign language recognition method based on hand model perception
- Publication number
- CN112668543A (application CN202110016997.XA)
- Authority
- CN
- China
- Prior art keywords
- hand
- sequence
- model
- joint point
- sign language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Image Analysis (AREA)
Abstract
The invention discloses an isolated word sign language recognition method based on hand model perception, which comprises the following steps: a hand sequence cropped from a sign language video is converted by a visual encoder into a latent semantic representation containing the hand state; a hand-model-aware decoder, working in a model-aware manner, then maps the latent semantic representation to a three-dimensional hand mesh and obtains the position of each hand joint; finally, an inference module refines the result into a spatio-temporal representation of each hand joint, which is classified to recognize the word corresponding to the hand sequence. The method fuses model-driven and data-driven approaches and introduces a hand-shape prior, improving the recognition accuracy of the system; it can also visualize the intermediate result (the three-dimensional hand mesh), enhancing the interpretability of the framework.
Description
Technical Field
The invention relates to the technical field of sign language recognition, and in particular to an isolated word sign language recognition method based on hand model perception.
Background
According to 2020 data from the World Health Organization (WHO), about 466 million people worldwide have disabling hearing loss, accounting for over 5% of the global population. Within the hearing-impaired population, the most common communication medium is sign language. Sign language is a visual language with its own linguistic characteristics. It conveys semantic information mainly through manual features (hand shape, hand movement, position, etc.), assisted by fine-grained non-manual features (facial expression, lip patterns, etc.).
To bridge the communication gap between hearing and deaf people, sign language recognition has been proposed and widely studied. It converts an input sign language video into the corresponding text by computer algorithms. Isolated word sign language recognition is the basic task among these: it recognizes an input sign language video as the single word the video depicts. The general recognition process first extracts a representation from the input sign language video, then transforms the representation into a probability vector, and takes the category with the maximum probability as the recognition result.
The hand plays a dominant role in sign language expression, yet occupies only a small spatial region and is highly articulated. Compared with the body and face, hands have similar appearance across signers and fewer locally discriminative features. In sign language video, hands often exhibit motion blur and self-occlusion, against complex backgrounds.
Early work typically employed hand-crafted features to describe gestures. With recent advances in deep learning and hardware computing power, deep-learning-based sign language recognition systems have gradually become dominant. A typical pipeline extracts a representation with a convolutional neural network (CNN), converts it into a probability vector through a fully connected layer and a Softmax layer, and takes the category with the maximum probability as the recognition result. Some recent work crops out the hands as an additional auxiliary branch and achieves some performance gains. These deep-learning-based methods all follow a data-driven paradigm in which features are learned under the supervision of video category labels. However, purely data-driven sign language recognition has two problems: limited interpretability, and easy overfitting on limited training data. Since labeling sign language data requires professional knowledge, existing sign language datasets have fewer samples per category than action recognition datasets, so the recognition accuracy of existing schemes still leaves room for improvement.
Disclosure of Invention
The invention aims to provide an isolated word sign language recognition method based on hand model perception, which can improve the recognition accuracy of a system and enhance the interpretability of a recognition framework.
The purpose of the invention is realized by the following technical scheme:
An isolated word sign language recognition method based on hand model perception comprises the following steps:
for a hand sequence cropped from a sign language video, converting the hand sequence into a latent semantic representation containing the hand state through a visual encoder; then, working in a model-aware manner through a hand-model-aware decoder, mapping the latent semantic representation to a three-dimensional hand mesh and obtaining the position of each hand joint; and finally, refining through an inference module to obtain the spatio-temporal representation of each hand joint, and classifying to recognize the word corresponding to the hand sequence.
As can be seen from the above technical scheme, the invention fuses model-driven and data-driven approaches and introduces a hand-shape prior, improving the recognition accuracy of the system; the intermediate result (the three-dimensional hand mesh) can be visualized, enhancing the interpretability of the framework.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a frame diagram of an isolated word sign language recognition method based on hand model perception according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Aiming at the technical problems in the prior art, the embodiment of the invention provides an isolated word sign language recognition method based on hand model perception, which fuses model-driven and data-driven approaches, introduces a hand model prior, improves the recognition accuracy of the system and enhances its interpretability. Fig. 1 shows the framework of the method. The main recognition process is as follows: a hand sequence cropped from the sign language video is converted by a visual encoder into a latent semantic representation containing the hand state; then, working in a model-aware manner, a hand-model-aware decoder maps the latent semantic representation to a three-dimensional hand mesh and obtains the position of each hand joint; finally, an inference module refines the result into a spatio-temporal representation of each hand joint, which is classified to recognize the word corresponding to the hand sequence. A compact sketch of this pipeline is given below.
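The three-stage flow can be sketched in code as follows; the module interfaces and tensor shapes are illustrative assumptions, not specifics of the embodiment:

```python
import torch

def recognize(hand_seq, encoder, mano_decoder, inference_module):
    """Hand-model-aware recognition pipeline (illustrative sketch).

    hand_seq: (T, 3, H, W) tensor of RGB hand crops from a sign video.
    """
    # 1) Visual encoder: per-frame latent hand state (pose theta, shape beta)
    #    plus weak-perspective camera parameters c = (c_r, c_o, c_s).
    theta, beta, cam = encoder(hand_seq)          # e.g. (T, 48), (T, 10), (T, 4)
    # 2) Model-aware decoder: latent state -> 3D hand mesh and joints; the
    #    MANO prior implicitly filters out implausible poses.
    mesh, joints3d = mano_decoder(theta, beta)    # (T, 778, 3), (T, 21, 3)
    # 3) Inference module: spatio-temporal graph reasoning over the joint
    #    sequence, followed by video-level classification.
    logits = inference_module(joints3d)           # (num_words,)
    return int(logits.argmax())                   # index of the recognized word
```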
For ease of understanding, the various components of the recognition framework and the corresponding training and testing process are described in detail below in conjunction with the framework diagram shown in FIG. 1.
Firstly, the framework structure.
1. Visual Encoder.
In the embodiment of the invention, the input of the visual encoder is a hand sequence of T frames cropped from the sign language video, V′ = {v_t}_{t=1}^T. The visual encoder converts the hand sequence V′ into a latent semantic representation, denoted as:

{(θ_t, β_t, c_t)}_{t=1}^T = E(V′)

where E(·) denotes the visual encoder; v_t is the hand image at time t, and T is the length of the hand sequence; θ and β represent the hand state, namely the representations of hand pose and hand shape respectively; c_r, c_o and c_s are the components of the camera parameter c, indicating rotation, translation and scale respectively.
In the embodiment of the invention, the hand sequence V′ is an RGB hand sequence; it can be cropped from the sign language video in a conventional manner, and the datasets involved in the training and testing stages are likewise hand sequences cropped from sign language videos.
Illustratively, the visual encoder may be implemented by appending fully connected layers to the end of a ResNet.
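A minimal sketch of such an encoder, assuming a ResNet-18 backbone and illustrative output dimensions (48-d pose, 10-d shape, 4 camera parameters, as is common in MANO-based pipelines; these dimensions are assumptions, not specified by the embodiment):

```python
import torch
import torch.nn as nn
import torchvision

class VisualEncoder(nn.Module):
    """ResNet backbone plus a fully connected head predicting the per-frame
    hand state (theta, beta) and camera parameters (c_r, c_o, c_s)."""

    def __init__(self, pose_dim=48, shape_dim=10, cam_dim=4):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()              # keep the 512-d pooled feature
        self.backbone = backbone
        self.head = nn.Linear(512, pose_dim + shape_dim + cam_dim)
        self.dims = (pose_dim, shape_dim, cam_dim)

    def forward(self, frames):                   # frames: (T, 3, H, W)
        feat = self.backbone(frames)             # (T, 512)
        theta, beta, cam = torch.split(self.head(feat), self.dims, dim=-1)
        return theta, beta, cam
```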
2. Hand Model-aware Decoder.
The hand-model-aware decoder implements the mapping from the latent semantic feature vector to a compact pose representation in a model-aware manner. It constrains the distribution of plausible poses through a pre-encoded hand prior, implicitly filtering out unreasonable poses during the mapping. As a result, it generates more compact and highly reliable hand poses, which reduces the optimization difficulty for the downstream inference module.
In the embodiment of the present invention, the hand-model-aware decoder is a statistical hand model; for example, the differentiable MANO hand model can be used as the hand-model-aware decoder.
The hand-model-aware decoder is learned in advance from a large number of high-quality hand scans, from which a hand template T̄ is obtained; in this way the hand prior is encoded. At the same time, it establishes a compact mapping for describing hands, from the low-dimensional semantic vector (the latent feature vector) to a high-dimensional triangular hand mesh (containing 778 vertices and 1,538 faces).
The mapping process of the hand model aware decoder is represented as:
M(β,θ)=W(T(β,θ),J(β),θ,W′)
where T(β, θ) = T̄ + B_S(β) + B_P(θ) denotes the hand template T̄ corrected by the blend functions B_S(·) and B_P(·), driven by the hand shape and pose representations β and θ; W′ is the blend weight; J(β) gives the locations of the hand joints provided by the hand-model-aware decoder for shape β; W(·) denotes the linear blend skinning algorithm; and M(β, θ) is the resulting three-dimensional hand mesh (3D Mesh).
Meanwhile, the more compact three-dimensional hand joint (3D Joint) positions can be read off by linear interpolation of the relevant mesh vertices. Since the MANO hand model provides only 16 hand joints, 5 fingertips are additionally extracted from the three-dimensional hand mesh, yielding 21 hand joints in total.
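A sketch of this extraction, assuming a differentiable MANO layer whose forward pass returns the 778-vertex mesh together with the 16 model joints; the fingertip vertex indices below are illustrative placeholders whose exact values depend on the mesh topology used:

```python
import torch

# Vertex indices of the five fingertips on the 778-vertex MANO mesh
# (thumb..pinky). Assumed values -- verify against the actual mesh topology.
FINGERTIP_VERTS = [745, 317, 444, 556, 673]

def extract_21_joints(mano_layer, theta, beta):
    """Map the latent hand state to a 3D mesh and 21 joint positions.

    mano_layer is assumed to be a differentiable MANO model returning
    (vertices: (T, 778, 3), joints: (T, 16, 3)).
    """
    verts, joints16 = mano_layer(theta, beta)
    tips = verts[:, FINGERTIP_VERTS, :]                 # (T, 5, 3) fingertips
    return verts, torch.cat([joints16, tips], dim=1)    # mesh, (T, 21, 3) joints
```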
The hand-model-aware decoder can also expose its intermediate result, i.e. the reconstructed three-dimensional hand mesh, which enhances the interpretability of the framework.
3. Inference Module.
The three-dimensional pose sequence (consisting of the three-dimensional hand joint positions in the T hand images) predicted by the hand-model-aware decoder may still contain some unsatisfactory results. The inference module is used to further refine the spatio-temporal representation of the hand pose. Through adaptive attention computation, the inference module captures the most discriminative cues and performs video-level classification.
A hand pose sequence is structured data with natural physical connections between the joints, so it can be naturally organized as a spatio-temporal graph. In embodiments of the invention, a graph convolutional network (GCN), which has proven effective at processing graph-structured data, is used, after which video-level classification is performed by a classification output layer.
Denote the hand joint position sequence output by the hand-model-aware decoder as J_3D. The corresponding undirected spatio-temporal graph G(V, E) is defined by a node set V and an edge set E, where the node set V contains all hand joint positions and the edge set E contains intra-frame and inter-frame connections, i.e. the physical connections between hand joints within a frame and the connections of the same joint across time. The adjacency matrix Ā obtained from the edge set E, together with the identity matrix I, is used in the graph convolutional layers, and the graph convolution is expressed as:

f_out = Σ_k W_k f_in (T_k ⊙ M),  T_k = D_k^(−1/2) A_k D_k^(−1/2)

where k indexes the group to which a neighborhood node belongs, and W_k is the convolution kernel weight; Ā + I is decomposed into K sub-matrices, i.e. Ā + I = Σ_k A_k, each sub-matrix A_k representing one group of connections after decomposition; T_k is an intermediate variable computed from the normalization matrix D_k, whose entries satisfy D_k^(mm) = Σ_n A_k^(mn) (m and n index the rows and columns of D_k); M is a learnable attention weight; and ⊙ denotes the Hadamard product. Information about the hand joints is propagated along the edges, yielding the spatio-temporal representation of each hand joint (containing not only position information but also certain semantic information). Further, taking the Hadamard product of A_k with the attention weight M, which is learnable and initialized to all ones, helps the network capture discriminative cues.
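A sketch of one graph convolutional layer implementing the decomposed-adjacency formulation above; the tensor shapes and degree-normalization details are assumptions consistent with common ST-GCN practice:

```python
import torch
import torch.nn as nn

class HandGraphConv(nn.Module):
    """One spatial graph convolution: f_out = sum_k W_k f_in (T_k ⊙ M),
    where T_k = D_k^(-1/2) A_k D_k^(-1/2) normalizes the k-th sub-adjacency."""

    def __init__(self, adj_subsets, in_ch, out_ch):
        super().__init__()
        # adj_subsets: (K, N, N) decomposition of A_bar + I into K groups.
        norm = []
        for A_k in adj_subsets:
            d = A_k.sum(dim=1).clamp(min=1e-6)           # node degrees D_k^(mm)
            D_inv_sqrt = torch.diag(d.pow(-0.5))
            norm.append(D_inv_sqrt @ A_k @ D_inv_sqrt)   # T_k
        self.register_buffer('Tk', torch.stack(norm))    # (K, N, N)
        # Learnable attention mask M, initialized to all ones (Hadamard factor).
        self.M = nn.Parameter(torch.ones_like(self.Tk))
        self.K, self.out_ch = len(norm), out_ch
        self.conv = nn.Conv2d(in_ch, out_ch * self.K, kernel_size=1)  # W_k

    def forward(self, x):                                # x: (B, C, T, N)
        B, _, T, N = x.shape
        x = self.conv(x).view(B, self.K, self.out_ch, T, N)
        # Propagate joint information along masked, normalized edges.
        return torch.einsum('bkctn,knm->bctm', x, self.Tk * self.M)
```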
In the embodiment of the invention, after several stacked graph convolutional layers, a classification output layer performs the classification, thereby recognizing the word corresponding to the hand sequence.
Secondly, model training.
In the embodiment of the invention, the visual encoder, the hand-model-aware decoder and the inference module together form the recognition model. Since sign language datasets carry no hand pose annotations, in the training stage (Training Stage), besides the cross-entropy classification loss L_cls (Classification Loss), corresponding loss functions (weakly supervised losses based on the spatial and temporal relations of the intermediate hand poses) are designed on the outputs of each stage to guide the learning of the intermediate pose representation. In the training stage, the overall loss function of the recognition model is expressed as:

L = L_cls + λ_spa·L_spa + λ_tem·L_tem + λ_reg·L_reg

where L_cls denotes the cross-entropy classification loss of the inference module; L_spa and L_tem denote the spatial and temporal consistency losses on the hand joint positions obtained by the hand-model-aware decoder; L_reg denotes the regularization loss on the hand state in the latent semantic representation obtained by the visual encoder; and λ_spa, λ_tem and λ_reg are the weighting factors of the corresponding losses.
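A sketch of this overall objective; the λ and w_β values are illustrative placeholders, and spatial_consistency_loss / temporal_consistency_loss refer to the sketches given with the corresponding losses below:

```python
import torch
import torch.nn.functional as F

def total_loss(logits, labels, j2d_proj, j2d_pseudo, conf, j3d, theta, beta,
               lambda_spa=1.0, lambda_tem=1.0, lambda_reg=0.01, w_beta=10.0):
    """L = L_cls + lambda_spa*L_spa + lambda_tem*L_tem + lambda_reg*L_reg.
    All weight values here are illustrative, not taken from the patent."""
    l_cls = F.cross_entropy(logits, labels)                        # L_cls
    l_spa = spatial_consistency_loss(j2d_proj, j2d_pseudo, conf)   # sketch below
    l_tem = temporal_consistency_loss(j3d)                         # sketch below
    # Regularization of the latent hand state: L_reg = ||theta||^2 + w_beta*||beta||^2.
    l_reg = theta.pow(2).sum(dim=-1).mean() + w_beta * beta.pow(2).sum(dim=-1).mean()
    return l_cls + lambda_spa * l_spa + lambda_tem * l_tem + lambda_reg * l_reg
```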
The training process itself can be carried out in a conventional manner based on the total loss function and the parameters of the recognition model.
1. Regularization Loss.
To ensure that the hand model works properly and generates plausible hand meshes, a regularization loss is used to further constrain the magnitudes of some of the latent features. The regularization loss L_reg is expressed as:

L_reg = ‖θ‖₂² + w_β·‖β‖₂²

where w_β is a weighting factor.
2. Spatial Consistency Loss.
In the embodiment of the invention, based on a weak-perspective camera model, the three-dimensional pose sequence predicted by the hand-model-aware decoder is mapped to two-dimensional space using the camera parameters output by the visual encoder. The mapping process is expressed as:

Ĵ_2D = c_s · Π(c_r J_3D) + c_o

where Π(·) denotes orthographic projection, and Ĵ_2D is the sequence of two-dimensional positions obtained by mapping the hand joint positions J_3D output by the hand-model-aware decoder with the camera parameters.

Meanwhile, a two-dimensional hand joint position sequence J_2D (2D Joints) is extracted in advance from the hand sequence by a two-dimensional pose detector (2D Hand Pose Detector) and used as a pseudo label, pulling the mapped result Ĵ_2D towards consistency with it. The spatial consistency loss L_spa is expressed as:

L_spa = (1/(N·T)) Σ_{t=1}^{T} Σ_{j=1}^{N} 1(c(t,j) ≥ ε) · ‖Ĵ_2D(t,j) − J_2D(t,j)‖₂²

where N is the total number of hand joints (e.g. N = 21); T is the length of the hand sequence; (t, j) indexes the j-th hand joint at time t; c(t, j) is the confidence of the pre-extracted j-th hand joint position at time t — if c(t, j) is greater than or equal to the threshold ε, that joint position participates in the computation of the spatial consistency loss L_spa, and otherwise it does not; and 1(·) denotes the indicator function.
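A sketch of this confidence-gated spatial loss, assuming the projected joints Ĵ_2D and the detector pseudo-labels J_2D are given as (T, N, 2) tensors with per-joint confidences of shape (T, N); the threshold value is an assumption:

```python
import torch

def spatial_consistency_loss(j2d_proj, j2d_pseudo, conf, eps=0.5):
    """L_spa: squared distance between projected 3D joints and 2D pseudo-labels,
    counting only detections with confidence c(t, j) >= eps (eps assumed)."""
    mask = (conf >= eps).float()                       # indicator 1(c(t,j) >= eps)
    err = (j2d_proj - j2d_pseudo).pow(2).sum(dim=-1)   # (T, N) squared 2D errors
    return (mask * err).sum() / err.numel()            # normalize by N*T
```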
3. Temporal Consistency Loss.
To avoid prediction jitter, the temporal consistency of the predicted three-dimensional joints is further constrained. During signing, different hand joints usually move at different speeds: joints closer to the palm usually move more slowly. The hand joints are therefore divided into three groups {S_i | i = 0, 1, 2}, corresponding to the palm, middle and end joint sets respectively. The temporal consistency loss L_tem is expressed as:

L_tem = Σ_{i=0}^{2} α_i Σ_{t=1}^{T−1} Σ_{j∈S_i} ‖J_3D(t+1, j) − J_3D(t, j)‖₂²

where J_3D denotes the hand joint position sequence output by the hand-model-aware decoder; (t, j) indexes the j-th hand joint at time t; S_i is a hand joint set; and α_i is the penalty weight predefined for set S_i — sets with slower motion are given larger penalty weights.
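A sketch of this grouped temporal smoothness term; the joint grouping and α_i values are illustrative assumptions (the embodiment specifies only that slower groups receive larger weights):

```python
import torch

def temporal_consistency_loss(j3d, groups=None, alphas=(4.0, 2.0, 1.0)):
    """L_tem: penalize frame-to-frame displacement of predicted 3D joints
    (T, N, 3), with larger weights for slower groups (palm > middle > end)."""
    if groups is None:
        # Illustrative partition of the 21 joints into palm / middle / end sets.
        groups = [list(range(0, 6)), list(range(6, 14)), list(range(14, 21))]
    vel = j3d[1:] - j3d[:-1]                           # (T-1, N, 3) displacements
    loss = j3d.new_zeros(())
    for a, idx in zip(alphas, groups):
        loss = loss + a * vel[:, idx].pow(2).sum(dim=-1).mean()
    return loss
```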
Thirdly, testing.
The testing stage (Testing Stage) follows the same main flow as the training stage; the main difference is that the testing stage needs neither the camera parameters nor the loss computation. The main flow of the testing stage is: input the cropped hand video sequence, obtain the latent semantic representation of the hand state through the visual encoder, obtain the corresponding three-dimensional hand mesh through the hand-model-aware decoder, and finally refine through the inference module to obtain the spatio-temporal representation of each hand joint, on which video-level classification is performed to output the corresponding word.
As shown on the right side of Fig. 1, for an input hand sequence the classification output layer of the inference module yields the probabilities of the candidate words, and the category with the maximum probability is selected.
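At test time this amounts to a softmax over the inference module's logits followed by an argmax, e.g.:

```python
import torch

def classify(logits, vocab):
    """Select the word with maximum probability (right side of Fig. 1)."""
    probs = torch.softmax(logits, dim=-1)
    idx = int(probs.argmax())
    return vocab[idx], float(probs[idx])
```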
According to the scheme of the embodiment of the invention, model-driven and data-driven approaches can be fused and a hand-shape prior introduced, improving the recognition accuracy of the system; the intermediate result can be visualized, enhancing the interpretability of the framework.
Through the above description of the embodiments, it will be clear to those skilled in the art that the above embodiments can be implemented by software, or by software plus a necessary general hardware platform. Based on this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (a CD-ROM, USB flash drive, removable hard disk, etc.) and includes several instructions for enabling a computing device (a personal computer, server, network device, etc.) to execute the methods of the embodiments of the present invention.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to perform all or part of the above described functions.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (8)
1. An isolated word sign language recognition method based on hand model perception, characterized by comprising the following steps:
for a hand sequence cropped from a sign language video, converting the hand sequence into a latent semantic representation containing the hand state through a visual encoder; then, working in a model-aware manner through a hand-model-aware decoder, mapping the latent semantic representation to a three-dimensional hand mesh and obtaining the position of each hand joint; and finally, refining through an inference module to obtain the spatio-temporal representation of each hand joint, and classifying to recognize the word corresponding to the hand sequence.
2. The isolated word sign language recognition method based on hand model perception according to claim 1, wherein the input of the visual encoder is a hand sequence V′ = {v_t}_{t=1}^T cropped from the sign language video, and the visual encoder converts the hand sequence V′ into a latent semantic representation, denoted as:

{(θ_t, β_t, c_t)}_{t=1}^T = E(V′)

where E(·) denotes the visual encoder; v_t is the hand image at time t, and T is the length of the hand sequence; θ and β represent the hand state, namely the representations of hand pose and hand shape respectively; c_r, c_o and c_s are camera parameters indicating rotation, translation and scale respectively.
3. The isolated word sign language recognition method based on hand model perception according to claim 1, wherein
the hand-model-aware decoder is a statistical model, learned in advance from hand scan data, and its mapping process is expressed as:
M(β,θ)=W(T(β,θ),J(β),θ,W′)
where T(β, θ) = T̄ + B_S(β) + B_P(θ) denotes the pre-learned hand template T̄ corrected by the blend functions B_S(·) and B_P(·), driven by the hand pose and shape representations θ and β, which represent the hand state; W′ is the blend weight; W(·) denotes the linear blend skinning algorithm; M(β, θ) is the three-dimensional hand mesh; J(β) gives the locations of the hand joints provided by the hand-model-aware decoder;
and the hand joint positions, comprising a plurality of hand joints and 5 fingertip points, are obtained from the three-dimensional hand mesh M(β, θ).
4. The isolated word sign language recognition method based on hand model perception according to claim 1, wherein the inference module comprises graph convolutional network layers and a classification output layer;
the hand joint position sequence output by the hand-model-aware decoder is denoted as J_3D; the corresponding undirected spatio-temporal graph G(V, E) is defined by a node set V and an edge set E, where the node set V contains all hand joint positions and the edge set E contains intra-frame and inter-frame connections, i.e. the physical connections between hand joints within a frame and the connections of the same joint across time; the adjacency matrix Ā obtained from the edge set E, together with the identity matrix I, is used in the graph convolutional layers, and the graph convolution is expressed as:

f_out = Σ_k W_k f_in (T_k ⊙ M),  T_k = D_k^(−1/2) A_k D_k^(−1/2)

where k indexes the group to which a neighborhood node belongs, and W_k is the convolution kernel weight; Ā + I is decomposed into K sub-matrices, i.e. Ā + I = Σ_k A_k, each sub-matrix A_k representing one group of connections after decomposition; T_k is an intermediate variable computed from the normalization matrix D_k, whose entries satisfy D_k^(mm) = Σ_n A_k^(mn), m and n being the row and column indices of D_k; M is a learnable attention weight; and ⊙ denotes the Hadamard product;
information about the hand joints is propagated along the edges, thereby obtaining the spatio-temporal representation of each hand joint;
and after several stacked graph convolutional layers, the classification output layer performs classification, thereby recognizing the word corresponding to the hand sequence.
5. The isolated word sign language recognition method based on hand model perception according to any one of claims 1 to 4, wherein the visual encoder, the hand-model-aware decoder and the inference module together form a recognition model, and in the training stage the overall loss function of the recognition model is expressed as:

L = L_cls + λ_spa·L_spa + λ_tem·L_tem + λ_reg·L_reg

where L_cls denotes the cross-entropy classification loss of the inference module; L_spa and L_tem denote the spatial and temporal consistency losses on the hand joint positions obtained by the hand-model-aware decoder; L_reg denotes the regularization loss on the hand state in the latent semantic representation obtained by the visual encoder; and λ_spa, λ_tem and λ_reg are the weighting factors of the corresponding losses.
7. The isolated word sign language recognition method based on hand model perception according to claim 5, wherein the spatial consistency loss L_spa is expressed as:

L_spa = (1/(N·T)) Σ_{t=1}^{T} Σ_{j=1}^{N} 1(c(t,j) ≥ ε) · ‖Ĵ_2D(t,j) − J_2D(t,j)‖₂²

where N is the total number of hand joints; T is the length of the hand sequence; Ĵ_2D is the sequence of positions obtained by mapping the hand joint positions J_3D output by the hand-model-aware decoder to two-dimensional space using the camera parameters; J_2D is the two-dimensional hand joint sequence extracted in advance from the hand sequence and used as a pseudo label; (t, j) indexes the j-th hand joint at time t; c(t, j) is the confidence of the pre-extracted j-th hand joint position at time t, and if c(t, j) is greater than or equal to the threshold ε, that joint position participates in the computation of the spatial consistency loss L_spa, otherwise it does not; 1(·) denotes the indicator function;

and the process of mapping the hand joint positions J_3D output by the hand-model-aware decoder to two-dimensional space using the camera parameters is expressed as:

Ĵ_2D = c_s · Π(c_r J_3D) + c_o

where Π(·) denotes orthographic projection, and c_r, c_o and c_s are camera parameters indicating rotation, translation and scale respectively.
8. The isolated word sign language recognition method based on hand model perception according to claim 5, wherein the temporal consistency loss L_tem is expressed as:

L_tem = Σ_{i=0}^{2} α_i Σ_{t=1}^{T−1} Σ_{j∈S_i} ‖J_3D(t+1, j) − J_3D(t, j)‖₂²

where J_3D denotes the hand joint position sequence output by the hand-model-aware decoder; (t, j) indexes the j-th hand joint at time t; S_i is a hand joint set, and {S_i | i = 0, 1, 2} corresponds to the palm, middle and end joint sets respectively; α_i is the penalty weight predefined for set S_i.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110016997.XA CN112668543B (en) | 2021-01-07 | 2021-01-07 | Isolated word sign language recognition method based on hand model perception |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110016997.XA CN112668543B (en) | 2021-01-07 | 2021-01-07 | Isolated word sign language recognition method based on hand model perception |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112668543A (en) | 2021-04-16
CN112668543B CN112668543B (en) | 2022-07-15 |
Family
ID=75413421
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110016997.XA Active CN112668543B (en) | 2021-01-07 | 2021-01-07 | Isolated word sign language recognition method based on hand model perception |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112668543B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113239835A (en) * | 2021-05-20 | 2021-08-10 | 中国科学技术大学 | Model-aware gesture migration method |
CN113239834A (en) * | 2021-05-20 | 2021-08-10 | 中国科学技术大学 | Sign language recognition system capable of pre-training sign model perception representation |
2021-01-07: Application CN202110016997.XA filed in China; granted as CN112668543B (status: Active)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7418390B1 (en) * | 2000-11-20 | 2008-08-26 | Yahoo! Inc. | Multi-language system for online communications |
CN111145865A (en) * | 2019-12-26 | 2020-05-12 | 中国科学院合肥物质科学研究院 | Vision-based hand fine motion training guidance system and method |
CN111325099A (en) * | 2020-01-21 | 2020-06-23 | 南京邮电大学 | Sign language identification method and system based on double-current space-time diagram convolutional neural network |
CN111832468A (en) * | 2020-07-09 | 2020-10-27 | 平安科技(深圳)有限公司 | Gesture recognition method and device based on biological recognition, computer equipment and medium |
Non-Patent Citations (2)
Title |
---|
TAORAN LE et al.: "A novel chipless RFID-based stretchable and wearable hand gesture sensor", 2015 European Microwave Conference (EuMC) *
LI Guoyou et al.: "Improvement and implementation of a Kinect-based dynamic gesture recognition algorithm", High Technology Letters *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113239835A (en) * | 2021-05-20 | 2021-08-10 | 中国科学技术大学 | Model-aware gesture migration method |
CN113239834A (en) * | 2021-05-20 | 2021-08-10 | 中国科学技术大学 | Sign language recognition system capable of pre-training sign model perception representation |
CN113239835B (en) * | 2021-05-20 | 2022-07-15 | 中国科学技术大学 | Model-aware gesture migration method |
CN113239834B (en) * | 2021-05-20 | 2022-07-15 | 中国科学技术大学 | Sign language recognition system capable of pre-training sign model perception representation |
Also Published As
Publication number | Publication date |
---|---|
CN112668543B (en) | 2022-07-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Liu et al. | Towards natural and accurate future motion prediction of humans and animals | |
Zellinger et al. | Robust unsupervised domain adaptation for neural networks via moment alignment | |
Niculae et al. | A regularized framework for sparse and structured neural attention | |
CN107578014B (en) | Information processing apparatus and method | |
CN109948475B (en) | Human body action recognition method based on skeleton features and deep learning | |
CN110378208B (en) | Behavior identification method based on deep residual error network | |
CN111368993A (en) | Data processing method and related equipment | |
CN111539941B (en) | Parkinson's disease leg flexibility task evaluation method and system, storage medium and terminal | |
CN112561064A (en) | Knowledge base completion method based on OWKBC model | |
CN112668543B (en) | Isolated word sign language recognition method based on hand model perception | |
CN110321805B (en) | Dynamic expression recognition method based on time sequence relation reasoning | |
CN110968235B (en) | Signal processing device and related product | |
Halvardsson et al. | Interpretation of swedish sign language using convolutional neural networks and transfer learning | |
Irfan et al. | Enhancing learning classifier systems through convolutional autoencoder to classify underwater images | |
CN113780059A (en) | Continuous sign language identification method based on multiple feature points | |
Kwolek et al. | Recognition of JSL fingerspelling using deep convolutional neural networks | |
CN113436224B (en) | Intelligent image clipping method and device based on explicit composition rule modeling | |
CN114241606A (en) | Character interaction detection method based on adaptive set learning prediction | |
Jiang et al. | Cross-level reinforced attention network for person re-identification | |
CN114882493A (en) | Three-dimensional hand posture estimation and recognition method based on image sequence | |
CN109409246B (en) | Sparse coding-based accelerated robust feature bimodal gesture intention understanding method | |
Chen et al. | MSTP-net: Multiscale spatio-temporal parallel networks for human motion prediction | |
CN114581829A (en) | Continuous sign language identification method based on reinforcement learning, electronic equipment and storage medium | |
CN110555401B (en) | Self-adaptive emotion expression system and method based on expression recognition | |
Yu et al. | Multi-activity 3D human motion recognition and tracking in composite motion model with synthesized transition bridges |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |