CN111913575B - Method for recognizing sign language words

Method for recognizing sign language words

Info

Publication number
CN111913575B
Authority
CN
China
Prior art keywords
word
sign language
words
hand
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010721233.6A
Other languages
Chinese (zh)
Other versions
CN111913575A (en)
Inventor
王青山
郑志文
朱钰
王琦
张江涛
胡汇源
丁景宏
马晓迪
王鑫炎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN202010721233.6A priority Critical patent/CN111913575B/en
Publication of CN111913575A publication Critical patent/CN111913575A/en
Application granted granted Critical
Publication of CN111913575B publication Critical patent/CN111913575B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/014Hand-worn input/output arrangements, e.g. data gloves
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/02Preprocessing
    • G06F2218/04Denoising
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/12Classification; Matching

Abstract

The invention discloses a method for recognizing sign language words. The signals generated while a subject performs sign language gestures are collected with a sensor bracelet. A Butterworth filter first removes the high-frequency noise introduced by the equipment from the acquired gesture signals. In the feature-extraction stage, a new DTW feature marks the signal changes and fluctuations that occur during each sign language action. In parallel, a word embedding operation is applied to a predetermined number of common sign language words to obtain high-dimensional sign language word vectors, and an improved t-SNE algorithm, combined with a sign language corpus, reduces each high-dimensional word vector to the semantic features of the sign language words. Each sign language word is then recognized with a model designed around an LSTM neural network, thereby solving the problem that existing sign language gesture recognition cannot recognize sign language at large scale.

Description

Method for recognizing sign language words
Technical Field
The invention relates to the technical field of intelligent recognition, and in particular to a method for recognizing sign language words.
Background
In recent years, with the development of 5G technology and the popularization of mobile sensor devices, the use of wearable sensor devices to assist communication for deaf-mute people has attracted attention. Using a wearable sensor to recognize the sign language of deaf-mute people is essentially a form of human action recognition. Existing research on action recognition falls mainly into three categories: video-based, radio-frequency-based, and sensor-based.
Video-based methods depend on capture equipment such as cameras; they often raise serious privacy concerns, are limited by lighting and environment, and are highly intrusive. Radio-frequency-based action recognition typically uses existing WiFi devices to collect the Channel State Information (CSI) changes caused by a person's movements and infer the action performed; its recognition accuracy still needs improvement, it can often recognize only simple, large-amplitude actions, and it is strongly affected by the environment.
Disclosure of Invention
The invention provides a method for recognizing sign language words, aiming to solve the problem that existing sign language gesture recognition cannot recognize sign language at large scale.
The application provides a method for recognizing sign language words, which comprises the following steps:
having a plurality of subjects wear sensor bracelets on their arms and perform, respectively, the sign language gestures of a predetermined number of sign language words, and collecting and recording, through the sensor bracelets, the sensor signal corresponding to each sign language gesture;
filtering the sensor signal;
extracting a characteristic value of each sensor signal by adopting a dynamic time warping (DTW) algorithm;
performing a word embedding operation on the semantic information of each sign language word to generate a word vector corresponding to the sign language word;
performing dimension-reduction processing on the word vector to generate a sign language semantic feature vector corresponding to the word vector;
inputting the characteristic values and the sign language semantic feature vectors into a preset long short-term memory neural network, and training through the preset long short-term memory neural network to obtain a classification recognition model;
and recognizing the sign language gestures through the classification recognition model, thereby recognizing the predetermined number of sign language words.
The embodiment of the invention collects the signals generated while a subject performs sign language gestures using a bracelet equipped with an IMU signal sensor and an sEMG signal sensor, worn on the subject's dominant hand. A Butterworth filter first removes the high-frequency noise introduced by the equipment from the acquired gesture signals. In the feature-extraction stage, a new DTW feature marks the signal changes and fluctuations during each sign language action. In parallel, a word embedding operation is applied to a predetermined number of common sign language words to obtain high-dimensional word vectors, and an improved t-SNE algorithm, combined with a sign language corpus, reduces each high-dimensional word vector to the semantic features of the sign language words. Each sign language word is then recognized with a model designed around an LSTM neural network, thereby solving the problem that existing sign language gesture recognition cannot recognize sign language at large scale.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a method for recognizing sign language words according to an embodiment of the present invention;
FIG. 2 is a schematic block diagram of the method for recognizing sign language words according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the continuous bag-of-words model of the method for recognizing sign language words according to an embodiment of the present invention;
FIG. 4 is a structure diagram of the LSTM model of the method for recognizing sign language words according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating the initial state of a subject performing the gesture of the sign language word "hello" according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating the first state of a subject performing the gesture of the sign language word "hello" according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating the second state of a subject performing the gesture of the sign language word "hello" according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating how the three indicators, i.e., accuracy, recall, and F1-score, vary with the number of sign language words according to an embodiment of the present invention;
fig. 9 is a schematic diagram illustrating the variation in recognition accuracy across different individuals according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, fig. 1 is a schematic flowchart of a method for recognizing sign language words according to an embodiment of the present application. As shown in fig. 1, the method for recognizing sign language words includes steps S101 to S107.
Step S101: have a plurality of subjects wear the sensor bracelet on the arm and perform, respectively, the sign language gesture of each of a predetermined number of sign language words, and collect and record, through the sensor bracelet, the sensor signal corresponding to each sign language gesture.
Specifically, as shown in fig. 2, the user performs the sign language gesture motions, and the IMU and sEMG signals generated by the sensor bracelet are transmitted to the host computer via Bluetooth.
Step S102: filter the sensor signal.
Specifically, the sign language gesture signal acquired in step S101 is denoised to remove the background noise introduced by the equipment. The normalized Butterworth filter is defined in the frequency domain as follows:
$$\left|H_n(j\omega)\right|^2 = \frac{1}{1 + \left(\omega/\omega_c\right)^{2n}}$$
where n denotes the order (number of stages) of the Butterworth filter, j is the imaginary unit (j² = −1), ω is the angular frequency of the signal, and ω_c is the cut-off angular frequency. In the present invention, taking 200 sign language words as an example, the bracelet acquires data at a sampling rate F_s of 200 samples/second, and during a gesture the finger-motion frequency is about f = 15 Hz; the cut-off frequency of the Butterworth filter, normalized to the Nyquist frequency F_s/2, is therefore set as
$$\omega_c = \frac{f}{F_s/2} = \frac{2 \times 15}{200} = 0.15.$$
Each subject performs a sign language gesture for 2 seconds, and 200 data points are collected. The signals are 18-dimensional in total, comprising 8-dimensional sEMG signals and 10-dimensional IMU signals.
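For illustration, the filtering step can be sketched with SciPy's standard Butterworth design. This is a minimal sketch, not the patent's implementation: the filter order, the zero-phase filtfilt call, and the stand-in array are assumptions, while the 200 samples/second rate, the 15 Hz cutoff, and the 18 signal dimensions follow the text above.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 200        # bracelet sampling rate, samples/second (from the text)
F_CUT = 15      # finger-motion frequency used as cutoff, Hz (from the text)
ORDER = 4       # filter order n -- an assumption; the text leaves n open

# Design a low-pass Butterworth filter; SciPy expects the cutoff
# normalized to the Nyquist frequency FS / 2.
b, a = butter(ORDER, F_CUT / (FS / 2), btype="low")

def denoise(recording: np.ndarray) -> np.ndarray:
    """Remove high-frequency device noise from every signal channel."""
    # filtfilt runs the filter forward and backward: zero phase distortion.
    return filtfilt(b, a, recording, axis=0)

recording = np.random.randn(200, 18)   # stand-in: one 2 s gesture, 18 channels
clean = denoise(recording)
```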
Step S103: extract the characteristic value of each sensor signal by adopting the dynamic time warping (DTW) algorithm.
Specifically, DTW is used to quantify the change in each line (channel) of data. Since each dimension of the signal changes correspondingly while the subject performs the gesture motion, the DTW algorithm can, taking the first line of data of the gesture signal as a reference, align the points of two adjacent lines of data and obtain their distance relatively accurately.
Next, the DTW computation is described. Assume two adjacent lines of data are the sequences A and B, of lengths n and m respectively:
$$A = a_1, a_2, a_3, \ldots, a_i, \ldots, a_n,$$
$$B = b_1, b_2, b_3, \ldots, b_j, \ldots, b_m,$$
where a_i denotes the amplitude of the i-th frame of sequence A and b_j the amplitude of the j-th frame of sequence B.
The DTW value between a_i and b_j is defined as:
$$DTW(i, j) = d(a_i, b_j) + \min\{DTW(i-1, j-1),\; DTW(i-1, j),\; DTW(i, j-1)\},$$
where d(a_i, b_j) denotes the Euclidean distance between a_i and b_j. The two sequences are matched from the starting point (0, 0); on reaching the end point (n, m), the similarity of the two adjacent lines of data is obtained as DTW(n, m).
The invention then extracts common statistical feature values, namely the maximum, minimum, standard deviation, skewness, and kurtosis, from each line of data of each sign language word. Together with the DTW feature extracted above, these feature values constitute the motion features of the sign language words.
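As a concrete reading of this step, the sketch below implements the DTW recurrence defined above between each channel and the first (reference) channel and appends the five statistical features. Treating the first channel as the reference and using the absolute difference as the 1-D Euclidean distance are interpretations of the text, not a verbatim implementation.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def dtw(a: np.ndarray, b: np.ndarray) -> float:
    """DTW(n, m) between two 1-D sequences, following the recurrence above."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])   # Euclidean distance in one dimension
            D[i, j] = d + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return float(D[n, m])

def motion_features(recording: np.ndarray) -> np.ndarray:
    """Per-channel features: DTW against the reference channel plus the
    maximum, minimum, standard deviation, skewness, and kurtosis."""
    ref = recording[:, 0]                  # first line of data as the reference
    feats = []
    for ch in range(recording.shape[1]):
        col = recording[:, ch]
        feats += [dtw(ref, col), col.max(), col.min(),
                  col.std(), skew(col), kurtosis(col)]
    return np.asarray(feats)

features = motion_features(np.random.randn(200, 18))   # stand-in recording
```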
Step S104: perform a word embedding operation on the semantic information of each sign language word to generate the word vector corresponding to the sign language word.
Specifically, taking 200 sign language words as an example, the invention defines vector representations of the 200 common sign language words (which may be, for example, common Chinese sign language words) through a word embedding operation. Concretely, a continuous bag-of-words (CBOW) model is applied to the 200 sign language words over the sign language corpus. The CBOW model predicts the current word from its existing context. Fig. 3 shows a schematic diagram of the CBOW model: assuming the context is "Today weather" and the current word is "good", Σg(·) represents the sum of the embeddings of "Today" and "weather". The model is essentially a binary-classification neural network. Let the context be h, the real target vocabulary corresponding to the context be w_t, the noise vocabulary be w̃, and the noise distribution be P_noise. The optimization function of the neural network is then:
$$J_\theta = \log Q_\theta(D=1 \mid w_t, h) + k\,\mathbb{E}_{\tilde{w} \sim P_{noise}}\!\left[\log Q_\theta(D=0 \mid \tilde{w}, h)\right],$$
where k is the number of noise words sampled, D = 1 indicates the real target vocabulary w_t, and Q_θ(D = 1 | w_t, h) is the probability, obtained by logistic regression, that w_t corresponds to context h. Training the model with this binary loss function, the values of the hidden layer are taken as the embedded representation of the corresponding word:
$$V = \{v_i \mid 1 \le i \le 200\},$$
where V denotes the group of word vectors of the 200 sign language words and v_i the word vector of each sign language word:
$$v_i = \{v_{i1}, v_{i2}, \ldots, v_{i,200}\},$$
i.e., the i-th sign language word v_i is represented as a 200-dimensional vector.
Step S105: perform dimension-reduction processing on the word vectors to generate the sign language semantic feature vector corresponding to each word vector.
Specifically, a semantics-based t-distributed stochastic neighbor embedding (t-SNE) algorithm is designed, and the high-dimensional word vectors obtained in step S104 are reduced in dimension to obtain the sign language semantic feature representations. First, the high-dimensional word vectors are ordered from high to low by their frequency in daily sign language use, and a weight is assigned to each word to improve the accuracy of the dimension reduction: each word vector v_i is given a weight λ_i based on p_i, the frequency with which the sign language word appears in the corpus. The conditional probability p_{j|i} that word vector v_i takes v_j as its neighbor is defined as:
$$p_{j|i} = \frac{\exp\!\left(-\lVert v_i - v_j\rVert_2^{2} / 2\sigma_i^{2}\right)}{\sum_{k \neq i}\exp\!\left(-\lVert v_i - v_k\rVert_2^{2} / 2\sigma_i^{2}\right)},$$
where σ_i² is the variance of the Gaussian distribution centered on v_i and ‖·‖₂ is the two-norm. The low-dimensional word vectors corresponding to the high-dimensional word vectors v_i and v_j are ỹ_i and ỹ_j respectively, and the corresponding low-dimensional space obeys a t-distribution. Likewise, with ỹ_j as a neighbor of ỹ_i, the conditional probability is:
$$q_{j|i} = \frac{\left(1 + \lVert \tilde{y}_i - \tilde{y}_j\rVert_2^{2}\right)^{-1}}{\sum_{k \neq i}\left(1 + \lVert \tilde{y}_i - \tilde{y}_k\rVert_2^{2}\right)^{-1}},$$
and, analogously, the joint probability distribution of ỹ_i and ỹ_j is defined as:
$$q_{ij} = \frac{\left(1 + \lVert \tilde{y}_i - \tilde{y}_j\rVert_2^{2}\right)^{-1}}{\sum_{k \neq l}\left(1 + \lVert \tilde{y}_k - \tilde{y}_l\rVert_2^{2}\right)^{-1}}.$$
The invention uses the Kullback-Leibler (KL) divergence as the optimization target:
$$C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}},$$
whose gradient is:
$$\frac{\partial C}{\partial \tilde{y}_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(\tilde{y}_i - \tilde{y}_j\right)\left(1 + \lVert \tilde{y}_i - \tilde{y}_j\rVert_2^{2}\right)^{-1}.$$
then training the model by using a stochastic gradient descent algorithm, and carrying out t-SNE reduction on the word vectors of 200 common symbol words to two dimensions. The reduced symbolic words form a semantic space in which each point represents a symbolic word.
Step S106: input the characteristic values and the sign language semantic feature vectors into a preset long short-term memory neural network, and train through the preset long short-term memory neural network to obtain a classification recognition model.
Specifically, an LSTM recurrent neural network is adopted: the motion-signal features of each sign language word extracted above and the semantic features of each sign language word are input into the LSTM network, which is trained to obtain the classification recognition model. The input vector consists of the motion features and semantic features of the sign language words. For example, 70% of the feature data may be used as training data and 30% as test data. Labels are set before training: each piece of sign language gesture feature and semantic feature data is given a sequence number according to the sign language word it corresponds to. The model is then trained for the preset number of training iterations to obtain the trained model, after which the test data are fed into the model to obtain the probability of each predicted value. The LSTM model is shown in fig. 4.
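A minimal sketch of the training step with a Keras LSTM follows. The 70/30 split, the integer labels set before training, and softmax classification over the sign language words follow the text; the layer width, the sequence shaping of the feature vector, the epoch count, and the stand-in data are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

NUM_WORDS = 200                                   # sign language word classes
X = np.random.randn(4000, 10, 11)                 # stand-in (samples, steps, features)
y = np.random.randint(0, NUM_WORDS, size=4000)    # labels set before training

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

model = Sequential([
    LSTM(128, input_shape=X.shape[1:]),           # hidden width is an assumption
    Dense(NUM_WORDS, activation="softmax"),       # one output per sign language word
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=20, batch_size=32)

probs = model.predict(X_test)   # probability of each predicted sign language word
```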
Step S107: recognize the sign language gestures through the classification recognition model, thereby recognizing the predetermined number of sign language words.
Specifically, the bracelet equipped with an IMU signal sensor and an sEMG signal sensor collects the signals generated while the subject performs sign language gestures, with the bracelet worn on the subject's dominant hand. The Butterworth filter removes the high-frequency noise caused by the equipment, the new DTW feature marks the signal changes and fluctuations during each sign language action, word embedding of the predetermined number of common sign language words yields high-dimensional word vectors, and the improved t-SNE algorithm, combined with the sign language corpus, reduces each high-dimensional word vector to the semantic features of the sign language words. Each sign language word is then recognized by the LSTM-based classification model, solving the problem that existing sign language gesture recognition cannot recognize sign language at large scale.
In the experiment, we had 10 subjects (5 men and 5 women) aged between 18 and 25. After 8 hours of sign language training before the experiment, they could accurately perform all 200 sign language words in the sign language corpus. During the experiment, each subject wore the bracelet on the dominant hand (i.e., right or left hand) and made gestures in front of a PC. As shown in figs. 5-7, the subject performs the gesture of the sign language word "hello": fig. 5 shows the subject in the initial state, fig. 6 in the first state, and fig. 7 in the second state of the gesture.
As described above, the experiment feeds the motion features and semantic features of the sign language words to the LSTM neural network for training. The recognition performance of the model is judged with three indicators: accuracy, recall, and F1-score; the experimental results are shown in fig. 8. They show that, across various numbers of sign language words, all three indicators remain above 95%.
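The three indicators can be computed with scikit-learn's standard metrics, as in the self-contained sketch below. The simulated labels are stand-ins, and macro-averaging the per-class recall and F1 scores over the 200 word classes is an assumption about how the scores were aggregated.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 200, size=1200)            # stand-in ground-truth labels
y_pred = y_true.copy()
flip = rng.random(1200) < 0.05                      # simulate ~5% recognition errors
y_pred[flip] = rng.integers(0, 200, size=int(flip.sum()))

# Macro-averaging weights all 200 sign language word classes equally.
print("accuracy:", accuracy_score(y_true, y_pred))
print("recall:  ", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F1-score:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```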
FIG. 9 shows the change in recognition accuracy of the model across different individuals. The horizontal axis represents the 10 different subjects, and the vertical axis the recognition accuracy. The accuracy obtained differs because each person has different signing habits. The experimental results show that the recognition accuracy exceeds 95% for all 10 subjects.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. A method for recognizing sign language words, comprising:
having a plurality of subjects wear sensor bracelets on their arms and perform, respectively, the sign language gestures of a predetermined number of sign language words, and collecting and recording, through the sensor bracelets, the sensor signal corresponding to each sign language gesture;
filtering the sensor signal;
extracting a characteristic value of each sensor signal by adopting a dynamic time warping (DTW) algorithm;
performing a word embedding operation on the semantic information of each sign language word to generate a word vector corresponding to the sign language word;
performing dimension-reduction processing on the word vector to generate a sign language semantic feature vector corresponding to the word vector;
inputting the characteristic values and the sign language semantic feature vectors into a preset long short-term memory neural network, and training through the preset long short-term memory neural network to obtain a classification recognition model;
and recognizing the sign language gestures through the classification recognition model, thereby recognizing the predetermined number of sign language words.
2. The method for recognizing sign language words according to claim 1, wherein collecting and recording the sensor signal corresponding to each sign language gesture through the sensor bracelet comprises:
collecting, through the sensor bracelet, the IMU signal and sEMG signal generated by each sign language gesture, and transmitting them to a processor for recording and storage.
3. The method for recognizing sign language words according to claim 2, wherein filtering the sensor signal comprises filtering the sensor signal through a normalized Butterworth filter, whose expression is:
$$\left|H_n(j\omega)\right|^2 = \frac{1}{1 + \left(\omega/\omega_c\right)^{2n}},$$
where n denotes the order of the Butterworth filter, j is the imaginary unit, ω is the angular frequency of the signal, and ω_c is the cut-off angular frequency.
4. The method according to claim 3, wherein the characteristic values comprise DTW feature values and common feature values, the common feature values comprising the maximum, minimum, standard deviation, skewness, and kurtosis of the sensor signal data.
5. The method according to claim 4, wherein performing a word embedding operation on the semantics of each sign language word comprises performing the word embedding operation on the predetermined number of sign language words through a continuous bag-of-words model,
the optimization function of the continuous bag-of-words model being:
$$J_\theta = \log Q_\theta(D=1 \mid w_t, h) + k\,\mathbb{E}_{\tilde{w} \sim P_{noise}}\!\left[\log Q_\theta(D=0 \mid \tilde{w}, h)\right],$$
where h is the context, w_t is the real target vocabulary corresponding to the context, w̃ is the noise vocabulary, P_noise is the noise distribution, k is the number of noise words sampled, and θ represents the weight parameters of the neural network; D = 1 indicates the real target vocabulary w_t, and Q_θ(D = 1 | w_t, h) is the probability, obtained by logistic regression, that w_t corresponds to context h; training the model with this binary loss function yields the values of the hidden layer as the word-embedded representation of the corresponding sign language word:
$$V = \{v_i \mid 1 \le i \le z\},$$
where z denotes the number of sign language words, V denotes the group of word vectors of the z sign language words, and v_i denotes the word vector of each sign language word:
$$v_i = \{v_{i1}, v_{i2}, \ldots, v_{iz}\},$$
i.e., the i-th sign language word v_i is represented as a z-dimensional vector.
6. The method according to claim 5, wherein performing dimension-reduction processing on the word vectors to generate the sign language semantic feature vector corresponding to each word vector comprises:
arranging the word vectors from high to low according to their frequency in daily sign language use, and assigning each word a weight to improve the accuracy of the dimension reduction, each word vector v_i being given a weight λ_i based on p_i, the frequency with which the sign language word vector appears in the corpus;
defining the conditional probability p_{j|i} that word vector v_i takes v_j as its neighbor as:
$$p_{j|i} = \frac{\exp\!\left(-\lVert v_i - v_j\rVert_2^{2} / 2\sigma_i^{2}\right)}{\sum_{k \neq i}\exp\!\left(-\lVert v_i - v_k\rVert_2^{2} / 2\sigma_i^{2}\right)},$$
where σ_i² is the variance of the Gaussian distribution centered on v_i and ‖·‖₂ is the two-norm; the low-dimensional word vectors corresponding to the high-dimensional word vectors v_i and v_j are ỹ_i and ỹ_j respectively, and the corresponding low-dimensional space obeys a t-distribution;
taking ỹ_j as a neighbor of ỹ_i, the conditional probability is:
$$q_{j|i} = \frac{\left(1 + \lVert \tilde{y}_i - \tilde{y}_j\rVert_2^{2}\right)^{-1}}{\sum_{k \neq i}\left(1 + \lVert \tilde{y}_i - \tilde{y}_k\rVert_2^{2}\right)^{-1}};$$
defining the joint probability distribution of ỹ_i and ỹ_j as:
$$q_{ij} = \frac{\left(1 + \lVert \tilde{y}_i - \tilde{y}_j\rVert_2^{2}\right)^{-1}}{\sum_{k \neq l}\left(1 + \lVert \tilde{y}_k - \tilde{y}_l\rVert_2^{2}\right)^{-1}},$$
where ỹ_k and ỹ_l respectively denote the k-th and l-th low-dimensional word vectors in the sign language lexicon, and v_k and v_l denote the corresponding high-dimensional word vectors;
using the Kullback-Leibler divergence distance as the optimization target:
$$C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}},$$
whose gradient is:
$$\frac{\partial C}{\partial \tilde{y}_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(\tilde{y}_i - \tilde{y}_j\right)\left(1 + \lVert \tilde{y}_i - \tilde{y}_j\rVert_2^{2}\right)^{-1},$$
where ỹ_i and ỹ_j denote the low-dimensional embeddings of sign language words v_i and v_j together with their corresponding weight coefficients; and training the model with a stochastic gradient descent algorithm, reducing the word vectors of the z common sign language words to two dimensions through the t-distributed stochastic neighbor embedding operation, the reduced sign language words forming a semantic space in which each point represents one sign language word.
7. The method according to claim 6, wherein inputting the characteristic values of the sensor signals and the sign language semantic feature vectors into a preset long short-term memory neural network and training through the preset long short-term memory neural network to obtain the classification recognition model comprises:
training the characteristic values of the sensor signals and the sign language semantic feature vectors with the preset long short-term memory neural network, using a preset percentage of the feature data as training data and the remaining feature data as test data, setting labels before training, and training the classification recognition model for a preset number of training iterations to finally obtain the trained model.
8. The method according to claim 7, wherein setting labels before training comprises: labeling the input training data before training, i.e., setting the sequence number of each piece of sign language gesture feature and sign language semantic feature data according to the sign language word corresponding to that training data.
CN202010721233.6A 2020-07-24 2020-07-24 Method for recognizing sign language words Active CN111913575B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010721233.6A CN111913575B (en) Method for recognizing sign language words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010721233.6A CN111913575B (en) Method for recognizing sign language words

Publications (2)

Publication Number Publication Date
CN111913575A CN111913575A (en) 2020-11-10
CN111913575B (en) 2021-06-11

Family

ID=73281416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010721233.6A Active CN111913575B (en) 2020-07-24 2020-07-24 Method for recognizing hand-language words

Country Status (1)

Country Link
CN (1) CN111913575B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114115531B (en) * 2021-11-11 2022-09-30 合肥工业大学 End-to-end sign language recognition method based on attention mechanism
CN116386149B (en) * 2023-06-05 2023-08-22 果不其然无障碍科技(苏州)有限公司 Sign language information processing method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956529A (en) * 2016-04-25 2016-09-21 福州大学 Chinese sign language identification method based on LSTM type RNN
JP2018124801A (en) * 2017-02-01 2018-08-09 株式会社エクスビジョン Gesture recognition device and gesture recognition program
CN109063615A (en) * 2018-07-20 2018-12-21 中国科学技术大学 A kind of sign Language Recognition Method and system
CN109754113A (en) * 2018-11-29 2019-05-14 南京邮电大学 Load forecasting method based on dynamic time warping Yu length time memory
KR20190098806A (en) * 2018-01-31 2019-08-23 계명대학교 산학협력단 A smart hand device for gesture recognition and control method thereof
CN110414468A (en) * 2019-08-05 2019-11-05 合肥工业大学 Based on the auth method of hand signal under WiFi environment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180260531A1 (en) * 2017-03-10 2018-09-13 Microsoft Technology Licensing, Llc Training random decision trees for sensor data processing
CN108766434B (en) * 2018-05-11 2022-01-04 东北大学 Sign language recognition and translation system and method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956529A (en) * 2016-04-25 2016-09-21 福州大学 Chinese sign language identification method based on LSTM type RNN
JP2018124801A (en) * 2017-02-01 2018-08-09 株式会社エクスビジョン Gesture recognition device and gesture recognition program
KR20190098806A (en) * 2018-01-31 2019-08-23 계명대학교 산학협력단 A smart hand device for gesture recognition and control method thereof
CN109063615A (en) * 2018-07-20 2018-12-21 中国科学技术大学 A kind of sign Language Recognition Method and system
CN109754113A (en) * 2018-11-29 2019-05-14 南京邮电大学 Load forecasting method based on dynamic time warping Yu length time memory
CN110414468A (en) * 2019-08-05 2019-11-05 合肥工业大学 Based on the auth method of hand signal under WiFi environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on a sensor-based gesture recognition scheme and its implementation in wearable devices; 唐继玉 (Tang Jiyu); China Master's Theses Full-text Database; 2016-12-31; 1-61 *

Also Published As

Publication number Publication date
CN111913575A (en) 2020-11-10

Similar Documents

Publication Publication Date Title
CN108899050B (en) Voice signal analysis subsystem based on multi-modal emotion recognition system
Alzohairi et al. Image based Arabic sign language recognition system
Karnati et al. LieNet: a deep convolution neural network framework for detecting deception
Yanay et al. Air-writing recognition using smart-bands
CN109886068B (en) Motion data-based action behavior identification method
CN111913575B (en) Method for recognizing hand-language words
CN104484644A (en) Gesture identification method and device
Shin et al. Korean sign language recognition using EMG and IMU sensors based on group-dependent NN models
Sharma et al. Trbaggboost: An ensemble-based transfer learning method applied to Indian Sign Language recognition
Xu et al. Intelligent emotion detection method based on deep learning in medical and health data
CN109902554A (en) A kind of recognition methods of the sign language based on commercial Wi-Fi
Futane et al. Video gestures identification and recognition using Fourier descriptor and general fuzzy minmax neural network for subset of Indian sign language
CN114384999B (en) User-independent myoelectric gesture recognition system based on self-adaptive learning
Zhang et al. Hand gesture recognition with SURF-BOF based on Gray threshold segmentation
CN111914724B (en) Continuous Chinese sign language identification method and system based on sliding window segmentation
Aksoy et al. Detection of Turkish sign language using deep learning and image processing methods
CN109766559B (en) Sign language recognition translation system and recognition method thereof
Singh et al. A reliable and efficient machine learning pipeline for american sign language gesture recognition using EMG sensors
Surekha et al. Hand Gesture Recognition and voice, text conversion using
Wang et al. Handwriting recognition under natural writing habits based on a low-cost inertial sensor
Li et al. Cross-people mobile-phone based airwriting character recognition
Viswanathan et al. Recent developments in Indian sign language recognition: an analysis
Robert et al. A review on computational methods based automated sign language recognition system for hearing and speech impaired community
CN114764580A (en) Real-time human body gesture recognition method based on no-wearing equipment
CN114863572A (en) Myoelectric gesture recognition method of multi-channel heterogeneous sensor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant