CN111652164B - Isolated word sign language recognition method and system based on global-local feature enhancement - Google Patents

Isolated word sign language recognition method and system based on global-local feature enhancement

Info

Publication number
CN111652164B
CN111652164B CN202010513333.XA
Authority
CN
China
Prior art keywords
global
features
local
sign language
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010513333.XA
Other languages
Chinese (zh)
Other versions
CN111652164A (en)
Inventor
李厚强
周文罡
胡鹤臻
蒲俊福
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202010513333.XA priority Critical patent/CN111652164B/en
Publication of CN111652164A publication Critical patent/CN111652164A/en
Application granted granted Critical
Publication of CN111652164B publication Critical patent/CN111652164B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/254: Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256: Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70: Multimodal biometrics, e.g. combining information from different biometric modalities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an isolated word sign language recognition method and system based on global-local feature enhancement. The method comprises the following steps: acquiring a sign language video to be recognized, and performing feature extraction on the sign language video through a shared convolution layer to obtain a feature map; capturing context information of the feature map to obtain global features; capturing fine-grained information of the feature map to obtain local features; and performing collaborative learning based on the global features and the local features to obtain the recognition result of the sign language video to be recognized. The invention takes global and local features into account simultaneously and adaptively enhances each of them; meanwhile, the learning of the two kinds of features promotes each other, which further improves the accuracy of sign language recognition.

Description

Isolated word sign language recognition method and system based on global-local feature enhancement
Technical Field
The invention relates to the technical field of sign language recognition, in particular to a sign language recognition method and system for isolated words based on global-local feature enhancement.
Background
According to the second national sample survey of disabled people, the number of people with hearing disabilities in China is as high as 27.8 million. Among hearing-impaired people, the most common medium of communication is sign language. As a visual language, sign language has its own unique linguistic characteristics: semantic information is expressed mainly through context-dependent manual features (hand shape, hand movement, position, etc.), assisted by fine-grained non-manual features (facial expression, lip shape, etc.).
To facilitate communication between hearing and deaf people, sign language recognition has emerged and been widely studied. It converts an input sign language video into corresponding text or speech output through computer algorithms, and involves fields such as multi-modal human-computer interaction, computer vision, and natural language processing.
Isolated word sign language recognition means that, given a video of a single sign language word as input, the system recognizes the word corresponding to that video. It can be viewed as a fine-grained classification problem. Accurate discrimination of isolated sign language words depends not only on manual features; fine-grained non-manual features also play an important role. There exist confusing isolated words with different meanings that share the same manual features but differ in non-manual features; in Chinese sign language, for example, the words "if" and "fake" are distinguished only by mouth movement. This ambiguity poses a significant challenge to accurate isolated word recognition. The recognition process of the whole system first extracts a representation of the input sign language video, then transforms it into a probability vector, and takes the category with the maximum probability as the final recognition result. With the development of deep learning and hardware computing power in recent years, deep-learning-based isolated word sign language recognition systems have become dominant: a representation is extracted by a convolutional neural network (CNN), converted into a probability vector through a fully connected layer and a Softmax layer, and the category corresponding to the maximum probability is taken as the recognition result.
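For illustration only, the following is a minimal sketch of this generic pipeline, assuming a PyTorch implementation; the backbone choice, feature dimension and number of classes are illustrative placeholders rather than a configuration disclosed by the invention.

import torch
import torch.nn as nn

class IsolatedSignClassifier(nn.Module):
    """Generic pipeline: CNN representation -> fully connected layer -> Softmax -> argmax."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 512, num_classes: int = 500):
        super().__init__()
        self.backbone = backbone              # convolutional feature extractor (CNN)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels, time, height, width)
        feat = self.backbone(video)           # assumed to return a (batch, feat_dim, T', H', W') map
        feat = feat.mean(dim=[2, 3, 4])       # pool to a (batch, feat_dim) representation
        return self.fc(feat).softmax(dim=-1)  # probability vector over sign-word categories

# Usage (hypothetical backbone): probs = model(video); word = probs.argmax(dim=-1)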
Therefore, the feature extraction step is crucial in isolated word sign language recognition. Traditional recognition methods fall into two types. The first extracts features directly from the global view; however, sign language contains fine-grained local cues, and this type of method lacks attention to such cues, which leads to misclassification. The second extracts local hand features as an aid, but it still cannot adaptively attend to the fine-grained non-manual features that distinguish confusing words.
These two shortcomings are the main problems in the prior art. Therefore, how to take global and local features into account simultaneously, adaptively enhance each of them, and make the learning of the two kinds of features promote each other so as to further improve the accuracy of sign language recognition is a problem that urgently needs to be solved.
Disclosure of Invention
In view of the above, the invention provides an isolated word sign language recognition method based on global-local feature enhancement, which takes global and local features into account simultaneously and adaptively enhances each of them; meanwhile, the learning of the two kinds of features promotes each other, so that the accuracy of sign language recognition is further improved.
The invention provides an isolated word sign language recognition method based on global-local feature enhancement, which comprises the following steps:
acquiring a sign language video to be recognized;
carrying out feature extraction on the sign language video to be recognized through a shared convolution layer to obtain a feature map;
capturing context information of the feature map to obtain global features;
capturing fine-grained information of the feature map to obtain local features;
and performing collaborative learning based on the global features and the local features to obtain the recognition result of the sign language video to be recognized.
Preferably, the capturing context information of the feature map to obtain a global feature includes:
generating a feature A, a feature B and a feature C which have the same shape as the feature map through independent convolution layers for the feature map X;
defining an enhanced feature map E based on the features A and B;
and aggregating the features from the features C by using the enhanced feature map E, and forming global features with the feature map X.
Preferably, the capturing fine-grained information of the feature map to obtain local features includes:
calculating integrals of the saliency map along the X axis and the Y axis and normalizing them to obtain distribution functions with respect to the X axis and the Y axis;
and sampling the feature map based on the inverse functions of the distribution functions to obtain the sampled local features.
Preferably, the saliency map is generated from high-order global features via trilinear attention.
Preferably, the obtaining the recognition result of the sign language video to be recognized through collaborative learning based on the global features and the local features includes:
and performing collaborative learning based on the global features and the local features, and taking the category with the highest prediction probability as the recognition result of the sign language video to be recognized.
An isolated word sign language recognition system based on global-local feature enhancement, comprising:
the acquisition module is used for acquiring a sign language video to be recognized;
the feature extraction module is used for performing feature extraction on the sign language video to be recognized through the shared convolution layer to obtain a feature map;
the global enhancement module is used for capturing the context information of the feature map to obtain global features;
the local enhancement module is used for capturing fine-grained information of the feature map to obtain local features;
and the collaborative learning module is used for carrying out collaborative learning based on the global features and the local features to obtain the recognition result of the sign language video to be recognized.
Preferably, the global enhancement module is specifically configured to:
generating a feature A, a feature B and a feature C which have the same shape as the feature map through independent convolution layers for the feature map X;
defining an enhanced feature map E based on the features A and B;
and aggregating the features from the features C by using the enhanced feature map E, and forming global features with the feature map X.
Preferably, the local enhancement module is specifically configured to:
calculating integrals of the saliency map along the X axis and the Y axis and normalizing them to obtain distribution functions with respect to the X axis and the Y axis;
and sampling the feature map based on the inverse functions of the distribution functions to obtain the sampled local features.
Preferably, the saliency map is generated from high-order global features via trilinear attention.
Preferably, the collaborative learning module is specifically configured to:
and performing collaborative learning based on the global features and the local features, and taking the category with the highest prediction probability as the recognition result of the sign language video to be recognized.
In summary, the invention discloses an isolated word sign language recognition method based on global-local feature enhancement. When an isolated sign language word needs to be recognized, a sign language video to be recognized is first acquired, and feature extraction is then performed on it through a shared convolution layer to obtain a feature map; context information of the feature map is captured to obtain global features; fine-grained information of the feature map is captured to obtain local features; and collaborative learning is performed based on the global features and the local features to obtain the recognition result of the sign language video to be recognized. The invention takes global and local features into account simultaneously and adaptively enhances each of them; meanwhile, the learning of the two kinds of features promotes each other, which further improves the accuracy of sign language recognition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of an embodiment of a sign language recognition method for isolated words based on global-local feature enhancement according to the present invention;
FIG. 2 is a schematic structural diagram of an embodiment of an isolated word sign language recognition system based on global-local feature enhancement according to the present invention;
FIG. 3 is a schematic diagram of the global-local feature enhancement-based isolated word sign language recognition system according to the present invention;
FIG. 4 is a schematic diagram of the operation of the global enhancement module disclosed in the present invention;
FIG. 5 is a schematic diagram of the operation of the local enhancement module disclosed in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
As shown in fig. 1, which is a flowchart of a method of an embodiment of a global-local feature enhancement-based isolated word sign language recognition method disclosed by the present invention, the method may include:
s101, acquiring a sign language video to be identified;
when isolated word sign language needs to be recognized, firstly, a sign language video to be recognized is obtained.
S102, extracting the features of the sign language video to be recognized through a shared convolution layer to obtain a feature map;
Then, as shown in FIG. 3, the sign language video to be recognized is fed into a shared convolution layer to extract features and obtain a feature map. The network is then divided into two branches: a global branch for capturing context information and a local branch for capturing fine-grained cues.
S103, capturing context information of the feature map to obtain global features;
the global enhancement module of the global branch is intended to capture motion cues containing context information, and filter out irrelevant interference information, such as the clothing and background of the speaker.
Specifically, as shown in FIGS. 3 and 4, for a feature map X ∈ R^(C×T×H×W), three feature maps with the same shape as X are first generated by passing X through independent convolution layers; they are denoted as A, B, and C, respectively. Each of them is then reshaped to C × N, where N = T × H × W. Matrix multiplication is then used to compute the similarity between voxels, and normalization is performed with the Softmax function. This operation defines the enhanced feature map E ∈ R^(N×N) as follows:
E_{ij} = \frac{\exp(A_i \cdot B_j)}{\sum_{k=1}^{N} \exp(A_i \cdot B_k)}
where A_i and B_j denote the feature vectors at the i-th and j-th locations, respectively, and E_ij denotes the influence of the j-th position on the i-th position. Next, the features from C are weighted and aggregated using the enhanced feature map E:
G_i = \sum_{j=1}^{N} E_{ij} C_j + X_i
In this way, each voxel is enhanced with long-range dependencies: discriminative contextual cues reinforce each other while irrelevant information is suppressed.
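A minimal sketch of this global enhancement step is given below, assuming a PyTorch implementation; the 1×1×1 convolutions producing A, B and C and the axis over which Softmax is applied are assumptions consistent with how E is used for aggregation, not the exact configuration of the invention.

import torch
import torch.nn as nn

class GlobalEnhancement(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # independent convolution layers producing A, B and C with the same shape as X
        self.conv_a = nn.Conv3d(channels, channels, kernel_size=1)
        self.conv_b = nn.Conv3d(channels, channels, kernel_size=1)
        self.conv_c = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, C, T, H, W); N = T*H*W voxels
        b, c, t, h, w = x.shape
        n = t * h * w
        a = self.conv_a(x).view(b, c, n)
        bm = self.conv_b(x).view(b, c, n)
        cm = self.conv_c(x).view(b, c, n)
        # pairwise voxel similarity A_i . B_j, normalized with Softmax -> E of shape (b, N, N)
        e = torch.softmax(torch.bmm(a.transpose(1, 2), bm), dim=-1)
        # weighted aggregation of C by E, plus a residual connection with X (the global feature)
        out = torch.bmm(cm, e.transpose(1, 2)).view(b, c, t, h, w)
        return out + x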
S104, capturing fine-grained information of the feature map to obtain local features;
Besides complex motion cues carrying context information, sign language videos also contain some fine-grained cues. These cues typically include lip shape, gaze, facial expression, or combinations of them. They occupy a small spatial extent and vary over time in the video, so they easily vanish under repeated convolution and pooling operations. To retain such fine-grained cues, a local enhancement module based on adaptive sampling is introduced, which adaptively samples the input feature map X ∈ R^(C×T×H×W).
Specifically, as shown in FIGS. 3 and 5, taking the feature map X_t at a certain time t as an example, the local enhancement module uses a saliency map S ∈ R^(H×W) as a guide: regions with larger saliency values are sampled more densely. The saliency map is first integrated along the x and y axes and normalized:
F_x(k_x) = \frac{\sum_{i=1}^{k_x} \sum_{j=1}^{H} S_{j,i}}{\sum_{i=1}^{W} \sum_{j=1}^{H} S_{j,i}}

F_y(k_y) = \frac{\sum_{j=1}^{k_y} \sum_{i=1}^{W} S_{j,i}}{\sum_{j=1}^{H} \sum_{i=1}^{W} S_{j,i}}
where k_x ∈ [1, W] and k_y ∈ [1, H]. Distribution functions with respect to the x and y axes are thus obtained, and sampling is performed according to their inverse functions, as follows:
O_{j,i} = X_t\left(F_y^{-1}(j / H),\; F_x^{-1}(i / W)\right)
where O is a feature map sampled at time t.
The saliency map S is generated via trilinear attention from the high-order features of the global branch:
S_t = \mathrm{mean}\left(\mathrm{softmax}(Y_t Y_t^{\top}) Y_t\right)
where Y_t ∈ R^(C2×N2) denotes the reshaped high-order features from the global branch at time t, and mean(·) denotes averaging along the channel dimension, which yields a more robust saliency map. S is further reshaped and upsampled to the same spatial size as X_t. By resampling the original feature map, fine-grained cues are located and emphasized, and are therefore more easily captured by subsequent convolution operations.
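The following is a minimal sketch of this saliency-guided adaptive sampling, assuming a PyTorch implementation; the trilinear-attention saliency, the cumulative-sum "integral", and the grid_sample-based resampling are one plausible reading of the description, and h2, w2 (the spatial size of the saliency map before upsampling) are illustrative parameters.

import torch
import torch.nn.functional as F

def local_enhancement(x_t: torch.Tensor, y_t: torch.Tensor, h2: int, w2: int) -> torch.Tensor:
    # x_t: (C, H, W) feature map at time t; y_t: (C2, N2) reshaped high-order global features
    c, h, w = x_t.shape
    # trilinear attention softmax(Y Y^T) Y, then channel-wise mean -> saliency map S
    s = torch.softmax(y_t @ y_t.t(), dim=-1) @ y_t
    s = s.mean(dim=0).view(1, 1, h2, w2)
    s = F.interpolate(s, size=(h, w), mode='bilinear', align_corners=False).view(h, w)
    s = s.clamp(min=1e-6)                          # keep the distribution functions increasing
    # "integrals" of S along the x and y axes, normalized to distribution functions F_x, F_y
    fx = s.sum(dim=0).cumsum(0); fx = fx / fx[-1]  # (W,)
    fy = s.sum(dim=1).cumsum(0); fy = fy / fy[-1]  # (H,)
    # inverse-transform sampling: regions with larger saliency are sampled more densely
    ix = torch.searchsorted(fx, torch.linspace(0.0, 1.0, w)).clamp(max=w - 1).float()
    iy = torch.searchsorted(fy, torch.linspace(0.0, 1.0, h)).clamp(max=h - 1).float()
    gx = ix / (w - 1) * 2 - 1                      # normalize to [-1, 1] for grid_sample
    gy = iy / (h - 1) * 2 - 1
    grid_y, grid_x = torch.meshgrid(gy, gx, indexing='ij')
    grid = torch.stack((grid_x, grid_y), dim=-1).unsqueeze(0)   # (1, H, W, 2) in (x, y) order
    out = F.grid_sample(x_t.unsqueeze(0), grid, align_corners=True)
    return out.squeeze(0)                          # (C, H, W) resampled local features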
And S105, performing collaborative learning based on the global features and the local features to obtain the recognition result of the sign language video to be recognized.
As shown in FIG. 3, the outputs z_global and z_local of the global branch for capturing context information and the local branch for capturing fine-grained cues are both supervised with cross-entropy losses, denoted Λ_global and Λ_local, respectively. To further promote the learning of the two branches, a mutual-promotion loss Λ_mu is used to constrain the relationship between them. The learning of the whole framework is supervised by the sum of the above loss functions:
Λ = Λ_global + Λ_local + Λ_mu
In the testing stage, the results of the global branch and the local branch are fused, and the category with the highest prediction probability is taken as the prediction result.
Specifically, the global enhancement module and the local enhancement module enhance key features in the sign language video from two complementary angles: one focuses on capturing long-range context-dependent information, while the other emphasizes discriminative fine-grained cues. However, the local enhancement module inevitably causes some loss of global information, and it is difficult for it to feed information back to the global enhancement module. To this end, the invention uses a collaborative learning module so that the two modules are optimized in a collaborative manner. Let the probability distributions predicted by the two branches be p_1 and p_2; the degree of match between the two branches is calculated with the Kullback-Leibler (KL) divergence as follows:
D_{KL}(p_2 \| p_1) = \sum_{m=1}^{M} p_2^{m} \log \frac{p_2^{m}}{p_1^{m}}

D_{KL}(p_1 \| p_2) = \sum_{m=1}^{M} p_1^{m} \log \frac{p_1^{m}}{p_2^{m}}
where M represents the total number of categories. The sum of the above two KL divergences is taken as the collaborative learning loss function:
Λ_mu = D_KL(p_2 || p_1) + D_KL(p_1 || p_2).
although these two branches emphasize global and local features, respectively, their goal is correct recognition of the same sign language video. By taking the distribution predicted by the other party as a reference, a connection between the two branches is established. The local branch can implicitly influence the sampling process. Meanwhile, the fine-grained clues emphasized after correction can be better classified and identified together with the global branch.
In summary, the invention takes global and local features into account simultaneously and adaptively enhances each of them; meanwhile, the learning of the two kinds of features promotes each other, so that the accuracy of sign language recognition is further improved.
As shown in fig. 2, which is a schematic structural diagram of an embodiment of an isolated word sign language recognition system based on global-local feature enhancement disclosed in the present invention, the system may include:
an obtaining module 201, configured to acquire a sign language video to be recognized;
When an isolated sign language word needs to be recognized, a sign language video to be recognized is obtained first.
The feature extraction module 202 is configured to perform feature extraction on the sign language video to be recognized through a shared convolution layer to obtain a feature map;
Then, the sign language video to be recognized is fed into the shared convolution layer to extract features and obtain a feature map. The network is then divided into two branches: a global branch for capturing context information and a local branch for capturing fine-grained cues.
The global enhancing module 203 is used for capturing context information of the feature map to obtain global features;
the global enhancement module of the global branch is intended to capture motion cues containing context information, and filter out irrelevant interference information, such as the clothing and background of the speaker.
Specifically, as shown in FIGS. 3 and 4, for a feature map X ∈ R^(C×T×H×W), three feature maps with the same shape as X are first generated by passing X through independent convolution layers; they are denoted as A, B, and C, respectively. Each of them is then reshaped to C × N, where N = T × H × W. Matrix multiplication is then used to compute the similarity between voxels, and normalization is performed with the Softmax function. This operation defines the enhanced feature map E ∈ R^(N×N) as follows:
E_{ij} = \frac{\exp(A_i \cdot B_j)}{\sum_{k=1}^{N} \exp(A_i \cdot B_k)}
where A_i and B_j denote the feature vectors at the i-th and j-th locations, respectively, and E_ij denotes the influence of the j-th position on the i-th position. Next, the features from C are weighted and aggregated using the enhanced feature map E:
G_i = \sum_{j=1}^{N} E_{ij} C_j + X_i
In this way, each voxel is enhanced with long-range dependencies: discriminative contextual cues reinforce each other while irrelevant information is suppressed.
The local enhancement module 204 is configured to capture fine-grained information of the feature map to obtain a local feature;
Besides complex motion cues carrying context information, sign language videos also contain some fine-grained cues. These cues typically include lip shape, gaze, facial expression, or combinations of them. They occupy a small spatial extent and vary over time in the video, so they easily vanish under repeated convolution and pooling operations. To retain such fine-grained cues, a local enhancement module based on adaptive sampling is introduced, which adaptively samples the input feature map X ∈ R^(C×T×H×W).
Specifically, as shown in FIGS. 3 and 5, taking the feature map X_t at a certain time t as an example, the local enhancement module uses a saliency map S ∈ R^(H×W) as a guide: regions with larger saliency values are sampled more densely. The saliency map is first integrated along the x and y axes and normalized:
F_x(k_x) = \frac{\sum_{i=1}^{k_x} \sum_{j=1}^{H} S_{j,i}}{\sum_{i=1}^{W} \sum_{j=1}^{H} S_{j,i}}

F_y(k_y) = \frac{\sum_{j=1}^{k_y} \sum_{i=1}^{W} S_{j,i}}{\sum_{j=1}^{H} \sum_{i=1}^{W} S_{j,i}}
where k_x ∈ [1, W] and k_y ∈ [1, H]. Distribution functions with respect to the x and y axes are thus obtained, and sampling is performed according to their inverse functions, as follows:
O_{j,i} = X_t\left(F_y^{-1}(j / H),\; F_x^{-1}(i / W)\right)
where O is a feature map sampled at time t.
The saliency map S is generated via trilinear attention from the high-order features of the global branch:
S_t = \mathrm{mean}\left(\mathrm{softmax}(Y_t Y_t^{\top}) Y_t\right)
where Y_t ∈ R^(C2×N2) denotes the reshaped high-order features from the global branch at time t, and mean(·) denotes averaging along the channel dimension, which yields a more robust saliency map. S is further reshaped and upsampled to the same spatial size as X_t. By resampling the original feature map, fine-grained cues are located and emphasized, and are therefore more easily captured by subsequent convolution operations.
And the collaborative learning module 205 is configured to perform collaborative learning based on the global features and the local features to obtain a recognition result of the sign language video to be recognized.
As shown in FIG. 3, the outputs z_global and z_local of the global branch for capturing context information and the local branch for capturing fine-grained cues are both supervised with cross-entropy losses, denoted Λ_global and Λ_local, respectively. To further promote the learning of the two branches, a mutual-promotion loss Λ_mu is used to constrain the relationship between them. The learning of the whole framework is supervised by the sum of the above loss functions:
Λ = Λ_global + Λ_local + Λ_mu
In the testing stage, the results of the global branch and the local branch are fused, and the category with the highest prediction probability is taken as the prediction result.
Specifically, the global enhancement module and the local enhancement module enhance key features in the sign language video from two complementary angles: one focuses on capturing long-range context-dependent information, while the other emphasizes discriminative fine-grained cues. However, the local enhancement module inevitably causes some loss of global information, and it is difficult for it to feed information back to the global enhancement module. To this end, the invention uses a collaborative learning module so that the two modules are optimized in a collaborative manner. Let the probability distributions predicted by the two branches be p_1 and p_2; the degree of match between the two branches is calculated with the Kullback-Leibler (KL) divergence as follows:
D_{KL}(p_2 \| p_1) = \sum_{m=1}^{M} p_2^{m} \log \frac{p_2^{m}}{p_1^{m}}

D_{KL}(p_1 \| p_2) = \sum_{m=1}^{M} p_1^{m} \log \frac{p_1^{m}}{p_2^{m}}
where M represents the total number of categories. The sum of the above two KL divergences is taken as the collaborative learning loss function:
Λ_mu = D_KL(p_2 || p_1) + D_KL(p_1 || p_2).
although these two branches emphasize global and local features, respectively, their goal is correct recognition of the same sign language video. By taking the distribution predicted by the other party as a reference, a connection between the two branches is established. The local branch can implicitly influence the sampling process. Meanwhile, the fine-grained clues emphasized after correction can be better classified and identified together with the global branch.
In summary, the invention takes global and local features into account simultaneously and adaptively enhances each of them; meanwhile, the learning of the two kinds of features promotes each other, so that the accuracy of sign language recognition is further improved.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A sign language recognition method for isolated words based on global-local feature enhancement is characterized by comprising the following steps:
acquiring a sign language video to be recognized, wherein the sign language video comprises isolated sign language words;
performing feature extraction on the sign language video to be recognized through a shared convolution layer to obtain a feature map;
capturing context information of the feature map to obtain global features;
capturing fine-grained information of the feature map to obtain local features;
performing collaborative learning based on the global features and the local features, supervised by cross-entropy loss functions and a mutual-promotion loss function, to obtain the recognition result of the sign language video to be recognized, wherein the mutual-promotion loss function Λ_mu is equal to D_KL(p_2 || p_1) + D_KL(p_1 || p_2), where p_1 and p_2 are the probability distributions predicted by the global branch and the local branch respectively, and D_KL(p_2 || p_1) and D_KL(p_1 || p_2) are the degrees of match between the global branch and the local branch, each calculated using the Kullback-Leibler (KL) divergence; wherein the outputs z_global and z_local of the global branch for capturing context information and the local branch for capturing fine-grained cues are both supervised with cross-entropy loss functions, denoted Λ_global and Λ_local respectively; the collaborative learning supervised by the cross-entropy loss functions and the mutual-promotion loss function comprises: performing supervision and collaborative learning using the function Λ = Λ_global + Λ_local + Λ_mu;
the capturing the context information of the feature map to obtain the global features comprises:
generating a feature A, a feature B and a feature C which have the same shape as the feature map through independent convolution layers for the feature map X;
defining an enhanced feature map E based on the features A and B;
aggregating features from the features C by using the enhanced feature map E, and forming global features with the feature map X;
capturing fine-grained information of the feature map to obtain local features, wherein the capturing comprises:
calculating integrals of the saliency map along the X axis and the Y axis and normalizing the integrals to obtain distribution functions of the saliency map about the X axis and the Y axis;
and performing adaptive sampling on the feature map based on the inverse functions of the distribution functions to obtain the sampled local features.
2. The method of claim 1, wherein the saliency map is generated from high-order global features via trilinear attention.
3. The method according to claim 1, wherein the performing collaborative learning based on the global features and the local features to obtain the recognition result of the sign language video to be recognized comprises:
and performing collaborative learning based on the global features and the local features, and taking the category with the highest prediction probability as the recognition result of the sign language video to be recognized.
4. A system for isolated word sign language recognition based on global-local feature enhancement, comprising:
the acquisition module is used for acquiring a sign language video to be recognized, wherein the sign language video comprises isolated words of sign language;
the feature extraction module is used for performing feature extraction on the sign language video to be recognized through a shared convolution layer to obtain a feature map;
the global enhancement module is used for capturing the context information of the feature map to obtain global features;
the local enhancement module is used for capturing fine-grained information of the feature map to obtain local features;
a collaborative learning module, configured to perform collaborative learning based on the global features and the local features, supervised by cross-entropy loss functions and a mutual-promotion loss function, to obtain the recognition result of the sign language video to be recognized, wherein the mutual-promotion loss function is equal to D_KL(p_2 || p_1) + D_KL(p_1 || p_2), where p_1 and p_2 are the probability distributions predicted by the global branch and the local branch respectively, and D_KL(p_2 || p_1) and D_KL(p_1 || p_2) are the degrees of match between the global branch and the local branch, each calculated using the Kullback-Leibler (KL) divergence; wherein the outputs z_global and z_local of the global branch for capturing context information and the local branch for capturing fine-grained cues are both supervised with cross-entropy loss functions, denoted Λ_global and Λ_local respectively; the collaborative learning supervised by the cross-entropy loss functions and the mutual-promotion loss function comprises: performing supervision and collaborative learning using the function Λ = Λ_global + Λ_local + Λ_mu;
the global enhancement module is specifically configured to:
generating a feature A, a feature B and a feature C which have the same shape as the feature map through independent convolution layers for the feature map X;
defining an enhanced feature map E based on the features A and B;
aggregating features from the features C by using the enhanced feature map E, and forming global features with the feature map X;
the local enhancement module is specifically configured to:
calculating integrals of the saliency map along the X axis and the Y axis and normalizing the integrals to obtain distribution functions of the saliency map about the X axis and the Y axis;
and performing adaptive sampling on the feature map based on the inverse functions of the distribution functions to obtain the sampled local features.
5. The system of claim 4, wherein the saliency map is generated from high-order global features via trilinear attention.
6. The system of claim 4, wherein the collaborative learning module is specifically configured to:
and performing collaborative learning based on the global features and the local features, and taking the category with the highest prediction probability as the recognition result of the sign language video to be recognized.
CN202010513333.XA 2020-06-08 2020-06-08 Isolated word sign language recognition method and system based on global-local feature enhancement Active CN111652164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010513333.XA CN111652164B (en) 2020-06-08 2020-06-08 Isolated word sign language recognition method and system based on global-local feature enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010513333.XA CN111652164B (en) 2020-06-08 2020-06-08 Isolated word sign language recognition method and system based on global-local feature enhancement

Publications (2)

Publication Number Publication Date
CN111652164A CN111652164A (en) 2020-09-11
CN111652164B true CN111652164B (en) 2022-07-15

Family

ID=72347283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010513333.XA Active CN111652164B (en) 2020-06-08 2020-06-08 Isolated word sign language recognition method and system based on global-local feature enhancement

Country Status (1)

Country Link
CN (1) CN111652164B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861707A (en) * 2021-02-03 2021-05-28 重庆市风景园林科学研究院 Harmful organism visual identification method, device, equipment and readable storage medium
CN114898143B (en) * 2022-04-19 2024-07-05 天津大学 Global and local visual feature-based collaborative classification method, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348420A (en) * 2019-07-18 2019-10-18 腾讯科技(深圳)有限公司 Sign Language Recognition Method, device, computer readable storage medium and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679042B2 (en) * 2018-10-09 2020-06-09 Irene Rogan Shaffer Method and apparatus to accurately interpret facial expressions in American Sign Language

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348420A (en) * 2019-07-18 2019-10-18 腾讯科技(深圳)有限公司 Sign Language Recognition Method, device, computer readable storage medium and computer equipment

Also Published As

Publication number Publication date
CN111652164A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
WO2015180368A1 (en) Variable factor decomposition method for semi-supervised speech features
CN111652164B (en) Isolated word sign language recognition method and system based on global-local feature enhancement
CN111108508B (en) Face emotion recognition method, intelligent device and computer readable storage medium
CN115311463B (en) Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system
CN112651940A (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
Ahammad et al. Recognizing Bengali sign language gestures for digits in real time using convolutional neural network
CN113935435A (en) Multi-modal emotion recognition method based on space-time feature fusion
CN115797827A (en) ViT human body behavior identification method based on double-current network architecture
CN114821735A (en) Intelligent storage cabinet based on face recognition and voice recognition
Le et al. Multi visual and textual embedding on visual question answering for blind people
CN113936317A (en) Priori knowledge-based facial expression recognition method
Liu et al. Discriminative Feature Representation Based on Cascaded Attention Network with Adversarial Joint Loss for Speech Emotion Recognition.
CN111191035B (en) Method and device for recognizing lung cancer clinical database text entity
Gupta et al. Sign Language Converter Using Hand Gestures
CN111639537A (en) Face action unit identification method and device, electronic equipment and storage medium
CN113420783B (en) Intelligent man-machine interaction method and device based on image-text matching
CN111292741B (en) Intelligent voice interaction robot
CN114881668A (en) Multi-mode-based deception detection method
Monica et al. Recognition of medicine using cnn for visually impaired
CN114373212A (en) Face recognition model construction method, face recognition method and related equipment
Shane et al. Sign Language Detection Using Faster RCNN Resnet
CN112183213A (en) Facial expression recognition method based on Intra-Class Gap GAN
CN111554269A (en) Voice number taking method, system and storage medium
Pao et al. Audio-visual speech recognition with weighted KNN-based classification in mandarin database
Gedaragoda et al. “Hand Model”–A Static Sinhala Sign Language Translation Using Media-Pipe and SVM Compared with Hybrid Model of KNN, SVM and Random Forest Algorithms

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant