CN114999461A - Silent voice decoding method based on facial neck surface myoelectricity - Google Patents
- Publication number
- CN114999461A (application CN202210598661.3A)
- Authority
- CN
- China
- Prior art keywords
- syllable
- batch
- phrase
- data
- signal window
- Prior art date
- 2022-05-30
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/16—Speech classification or search using artificial neural networks
- G10L19/167—Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of analysis window
- G10L2015/027—Syllables being the recognition units
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a silent speech decoding method based on surface electromyography of the face and neck, which decodes unvoiced speech content by processing the surface electromyographic signals collected from the relevant muscle activity while the user silently mouths words. The method comprises the following steps: 1. collecting surface electromyographic signals from the user to form a training data set; 2. segmenting the data to obtain a syllable-labeled training data set; 3. performing data augmentation; 4. extracting features from the augmented training data set; 5. constructing a deep neural network that captures spatio-temporal information; 6. constructing a statistical language model to predict the phrase the user is continuously reading. The invention recognizes speech content from the finer-grained structures that make up a speech sequence; it not only achieves high-performance silent speech recognition, but also helps relate surface myoelectric activity to the meaning of the corresponding speech, providing a new approach to silent speech recognition.
Description
Technical Field
The invention belongs to the fields of biological signal processing, machine learning and intelligent control, and particularly relates to a silent speech decoding method based on surface myoelectricity of the face and neck.
Background
Voice is an essential, effective and convenient mode of communication in daily human life. Over the past decades, speech-related human-computer interaction technology, represented by Automatic Speech Recognition (ASR), has developed rapidly and achieved very high performance in general scenarios. However, because ASR depends on voiced speech, its disadvantages are very apparent: it cannot work reliably against a high-noise background, it cannot satisfy the demand for private interaction, and people with voice disorders cannot rely on ASR for daily communication.
To overcome the above disadvantages, researchers have explored non-acoustic speech recognition methods. During human speech and silent articulation, the speech-related muscle groups of the face and neck are activated, producing bioelectric signals called surface electromyograms (sEMG). Silent Speech Recognition (SSR) based on sEMG has therefore become an important supplement to ASR in certain special scenarios, and sEMG-based SSR techniques have made steady progress over the decades. Early SSR mainly used classical pattern classification methods such as support vector machines and conjugate gradient networks, recording the sEMG of the subject's face and neck with discrete electrodes over a small number of channels and recognizing corpora with a limited number of words. Later research tended to recognize lexical corpora using Hidden Markov Models (HMMs), which characterize the temporal structure of sEMG. With the development of data acquisition technology, high-density (HD) electrode arrays have been designed to simultaneously record surface myoelectric signals from a large number of channels over a relatively large area of a target muscle or muscle group. High-density surface electromyography (HD-sEMG) arrays help capture valuable spatial information and characterize the heterogeneity of muscle activity, thereby improving the performance of electromyographic pattern recognition.
While the above studies demonstrate that pattern classification techniques can achieve satisfactory SSR performance, some deficiencies remain: 1) pattern classification methods simply map sEMG pattern features to a phrase or word, omitting temporally associated semantic information; 2) the performance of classification techniques is limited by the vocabulary size of the corpus; 3) common pattern classification techniques mainly recognize isolated words and cannot achieve natural, coherent silent speech interaction.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and provides a silent speech decoding method based on surface electromyography of the face and neck, so that the finer-grained structure of a speech sequence can be recognized and the speech content understood, the recognition of phrases with similar pronunciation is improved, and accurate and natural silent speech interaction is ultimately realized.
To achieve the above aim, the invention adopts the following technical solution:
The invention provides a silent speech decoding method based on surface myoelectricity of the face and neck, which is characterized by comprising the following steps:
Step 1, construct an instruction set P = {p_1, …, p_n, …, p_N} containing N Chinese phrases, where p_n denotes the n-th Chinese phrase in the instruction set P and the N Chinese phrases contain L classes of syllables;
Collect, with a high-density electrode array, the surface electromyographic signals generated by the facial muscles and neck muscles while the user silently reads the Chinese phrases, and label the resting signal segments and the phrase-related surface electromyographic signal segments with a double-threshold detection method based on short-time energy and zero-crossing rate, thereby forming labeled phrase signal segments that constitute a training phrase data set S_p;
Step 2, segment the training phrase data set S_p with a series of temporally overlapping signal windows to obtain M signal window samples; divide each phrase signal segment evenly according to the number of syllables it contains, and assign a fine-grained syllable label to each signal window sample according to the syllable sequence of its phrase, thereby obtaining one batch of training data consisting of M syllable-labeled signal window samples;
Step 3, change the segmentation timing so as to shift the window boundary of each signal window, and repeat the processing of Step 2, thereby obtaining K batches of syllable-labeled training data S_origin = {S^1, …, S^k, …, S^K}, where S^k = {(x_m^k, y_m^k) | m = 1, …, M} denotes the k-th batch of syllable-labeled training data, x_m^k denotes the m-th signal window sample of the k-th batch, and y_m^k denotes the corresponding syllable label, represented by one-hot coding with size [1, L]; S_origin contains M × K signal window samples in total;
Step 4, extract the myoelectric features of the training data set S_origin:
Step 4.1, split each signal window sample into consecutive non-overlapping frames to obtain d frames of signal window data;
Step 4.2, according to the relative positions of the signal channels of the high-density electrode array, rearrange the surface electromyographic signals collected by the high-density electrode array into a surface electromyographic data matrix over the two-dimensional electrode channel grid, whose size is denoted [e, g];
Step 4.3, extract c myoelectric features from each frame of signal window data to obtain a three-dimensional myoelectric feature map for each frame, and thereby obtain the three-dimensional myoelectric feature map set of all signal window samples S_input = {(X_m^k, y_m^k)}, where X_m^k denotes the d-frame three-dimensional myoelectric feature map of the m-th signal window sample of the k-th batch, with size [d, e, g, c], and y_m^k denotes the syllable label of the m-th signal window sample of the k-th batch;
Step 5, construct a deep neural network that captures spatio-temporal information, comprising: A dilated convolution blocks wrapped in time-distributed layers, a flattening layer, A bidirectional gated recurrent unit (BiGRU) blocks and A fully connected layers; the three-dimensional myoelectric feature map set S_input is fed into the deep neural network in K batches;
Step 5.1, each a-th dilated convolution block comprises a dilated convolution layer, a batch normalization layer and a Dropout layer; the a-th dilated convolution layer uses H_a two-dimensional convolution kernels of size h × h and a Tanh activation function;
When a = 1, the k-th batch of three-dimensional myoelectric feature maps is fed into the a-th dilated convolution block for processing, which outputs the a-th feature map set of the k-th batch {F_m^{k,a}}, where F_m^{k,a} denotes the feature map output for the m-th signal window sample X_m^k of the k-th batch, with size [d, e, g, H_a];
When a = 2, 3, …, A, the (a-1)-th feature map set of the k-th batch {F_m^{k,a-1}} is fed into the a-th dilated convolution block for processing, which outputs the a-th feature map set of the k-th batch {F_m^{k,a}}, so that the A-th dilated convolution block outputs the final feature map set {F_m^{k,A}};
Step 5.2, after the feature maps {F_m^{k,A}} are processed by the flattening layer, the flattened feature set of the k-th batch {V_m^k} is obtained, where V_m^k denotes the feature map output after F_m^{k,A} passes through the flattening layer, with size [d, e × g × H_A];
Step 5.3, each a-th BiGRU block comprises a bidirectional gated recurrent unit layer with a ReLU activation function and a Dropout layer; the hidden nodes of every bidirectional gated recurrent unit layer have dimension b;
When a = 1, the flattened feature set of the k-th batch {V_m^k} is fed into the a-th BiGRU block for processing, which outputs the a-th gated feature set of the k-th batch {G_m^{k,a}}, where G_m^{k,a} denotes the gated features output after V_m^k is processed by the a-th BiGRU block, with size [d, 2 × b];
When a = 2, 3, …, A-1, the (a-1)-th gated feature set of the k-th batch {G_m^{k,a-1}} is fed into the a-th BiGRU block for processing, which outputs the a-th gated feature set of the k-th batch {G_m^{k,a}}, so that the (A-1)-th BiGRU block outputs the (A-1)-th gated feature set of the k-th batch with size [d, 2 × b];
When a = A, the (A-1)-th gated feature set of the k-th batch {G_m^{k,A-1}} is fed into the A-th BiGRU block for processing, which outputs only the last time step as the A-th gated feature set of the k-th batch {G_m^{k,A}}, with size [1, 2 × b];
Step 5.4, the activation functions of the first A-1 fully connected layers are Tanh, each followed by one Dropout layer, and the activation function of the A-th fully connected layer is softmax;
After the gated feature set {G_m^{k,A}} output by the A-th BiGRU block is processed by the A fully connected layers in turn, the scoring matrix of the syllable decision sequence Q^k = [q_1^k, …, q_m^k, …, q_M^k] is output, where q_m^k = [q_{m,1}^k, …, q_{m,j}^k, …, q_{m,L}^k] gives the probabilities that the m-th signal window sample x_m^k of the k-th batch is predicted as each of the L syllables, and q_{m,j}^k is the probability that the m-th sample of the k-th batch is predicted as the j-th syllable class;
Step 5.5, establish the cross-entropy loss function Loss by formula (1):
    Loss = - Σ_{k=1}^{K} Σ_{m=1}^{M} Σ_{j=1}^{L} y_{m,j}^k · log(q_{m,j}^k)    (1)
where y_{m,j}^k is the value at the j-th position of the syllable label y_m^k of the m-th signal window sample of the k-th batch;
Step 5.6, train the neural network:
Update the weight parameters of the deep neural network with an Adam optimizer, set a maximum number of iterations step and a dynamically varying learning rate lr, and stop training when the loss function Loss reaches its minimum or the iteration count equals step, thereby obtaining the optimal syllable classification model;
Step 6, construct a statistical language model from the instruction set P of Chinese phrases and use it to post-process the results of the optimal syllable classifier:
Step 6.1, establish the many-to-one mapping θ from syllable label sequences to Chinese phrases;
Step 6.2, process a Chinese phrase p' to be decoded according to the procedure of Step 2 to obtain U signal window samples to be decoded; process the U signal window samples to be decoded according to the procedure of Step 4 to obtain the three-dimensional myoelectric feature maps to be decoded;
Step 6.3, feed the three-dimensional myoelectric feature maps to be decoded into the optimal syllable classification model, which outputs the scoring matrix of the syllable label sequence of the Chinese phrase p', O = [o_1, …, o_u, …, o_U], where o_u denotes the score probability vector of the u-th syllable of the Chinese phrase p' and U denotes the length of the syllable sequence;
Step 6.4, set the search depth of each syllable to depth and process O with a multi-beam search algorithm, obtaining depth^U syllable label sequences and their depth^U scores;
Step 6.5, judge whether any of the depth^U syllable label sequences matches the many-to-one mapping θ; if the match succeeds, select from the matching syllable label sequences the phrase p̂ corresponding to the sequence with the highest score and output it; otherwise, execute Step 6.6;
Step 6.6, take the syllable with the highest score probability in the score vector o_u of the u-th syllable of the Chinese phrase p', denoted ĉ_u, thereby obtaining the syllable decision sequence ĉ = (ĉ_1, …, ĉ_U); select from the many-to-one mapping θ the phrase p̂ whose syllable label sequence has the minimum edit distance to ĉ, as the decoding result of the Chinese phrase p'.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention segments the raw phrase sEMG data to obtain fine-grained syllable-level surface electromyographic data, establishes a dilated convolution bidirectional gated recurrent unit network (DC-BiGRU) as the classifier, and further provides a statistical language model that captures semantic information and refines and corrects the output of the trained classifier, yielding accurate predictions of phrase sequences; through this decoding framework, accurate and natural silent speech recognition is realized.
2. The automatic labeling method for the training data simplifies the preparation of the training set; meanwhile, the invention provides a data augmentation method based on adjusting the windowing boundaries, which effectively alleviates overfitting of the deep network and improves the performance of silent speech recognition.
3. Working at the fine-grained syllable level, the invention provides a statistical language model based on multi-beam search and edit distance, which exploits the semantic and temporal correlation of the phrase set, helps to understand the meaning of phrases, and improves the recognition of phrases with similar pronunciation.
4. The invention achieves high-performance, natural and continuous silent speech recognition under the data processing requirements of a real-time system, which facilitates the practical application of the method in fields such as myoelectric control.
Drawings
FIG. 1 is a flow chart of the silent speech decoding method based on surface myoelectricity of the face and neck according to the present invention;
FIG. 2 shows the set of Chinese phrases used in the present invention;
FIG. 3 illustrates the shape parameters and placement positions of the face and neck high-density electrode arrays used in the present invention;
FIG. 4 is a schematic diagram of the data segmentation, automatic data labeling and data augmentation methods adopted by the present invention;
FIG. 5 is a schematic illustration of the spatial position distribution and stitching result of the high-density electrode arrays according to the present invention;
FIG. 6 is a schematic diagram of the structure of the classification network based on the dilated convolution bidirectional gated recurrent unit (DC-BiGRU) used in the present invention;
FIG. 7 is a graph of the average phrase recognition rates and standard deviations obtained by the present invention;
FIG. 8a is a schematic diagram of the confusion matrix of DC-BiGRU phrase classification obtained by the present invention;
FIG. 8b is a schematic diagram of the confusion matrix of the proposed DCBiMEP decoding method obtained by the present invention.
Detailed Description
In this embodiment, a silent speech decoding method based on surface myoelectricity of the face and neck extracts temporally associated semantic information with a statistical language model, which not only improves the recognition of phrases with similar pronunciation but also helps relate sEMG activity to the meaning of the corresponding phrases, offering a new approach to silent speech recognition. Specifically, as shown in FIG. 1, the method comprises the following steps:
Step 1, construct an instruction set P = {p_1, …, p_n, …, p_N} containing N Chinese phrases, where p_n denotes the n-th Chinese phrase in the instruction set P and the N phrases contain L classes of syllables. As shown in FIG. 2, the Chinese phrase vocabulary consists of N = 30 phrases covering 79 Chinese syllables plus 1 resting syllable, so L = 80;
Surface electromyographic signals generated by the facial and neck muscles while the user silently reads the Chinese phrases are collected with a high-density electrode array, and the resting signal segments and the phrase-related surface electromyographic signal segments are labeled with a double-threshold detection method based on short-time energy and zero-crossing rate, forming labeled phrase signal segments that constitute the training phrase data set S_p. In this example, 8 healthy subjects (7 males and 1 female) aged 21-26 years, with no hearing or speech impairment, took part in the data collection experiment. Each subject was informed in detail of every experimental procedure and its specific requirements.
The shape parameters and placement positions of the high-density electrode arrays are shown in FIG. 3. Four high-density flexible electrode arrays are used in total, placed as two symmetric pairs on the left and right sides of the face and of the neck. Illustratively, each of the two facial electrode arrays has 16 channels, an electrode diameter of 5 mm, and electrode spacings of 10 mm, 15 mm and 18 mm; each of the two neck electrode arrays also has 16 channels, an electrode diameter of 5 mm, and an electrode spacing of 18 mm. Together, the face and neck electrode arrays comprise 64 channels. In addition, one electrode is attached behind each of the left and right ears, serving as the reference electrode and the ground electrode;
Before the electrode arrays were applied, the target muscles of the subject's face and neck were scrubbed with an alcohol cotton pad to remove skin keratin, and a suitable amount of conductive gel was applied to the electrode probes to reduce skin impedance. Illustratively, the facial electrode arrays collect sEMG from facial muscles such as the zygomaticus, the masseter and the lower labial muscles, and the neck electrode arrays collect sEMG from neck muscles such as the omohyoid, the sternohyoid and the platysma. During collection, the subjects silently articulated each phrase at a uniform speed with medium effort; each phrase was repeated 20 times as one trial, with an interval t between repetitions, illustratively set to t = 3 s. Within each trial, activities unrelated to the collection task, such as swallowing saliva and coughing, were not allowed. To avoid muscle fatigue, there was a rest period T between trials; illustratively, T = 30 s;
sEMG activity detection is then performed on the raw data; the result is shown in part (a) of FIG. 4. First, the short-time energy and zero-crossing rate of the resting-state baseline are computed as the initial energy and initial zero-crossing rate, denoted E_i and C_i; the short-time energy and zero-crossing rate of the raw data are then computed over short-time segments of length S_length. The method requires three thresholds: the first two are high and low thresholds on the short-time energy, denoted E_h and E_l, used for the initial decision on the onset and offset positions; the third is a threshold SC on the short-time zero-crossing rate. Illustratively, S_length = 64 ms, E_h = 8 × E_i, E_l = 3 × E_i and SC = 3 × C_i. This yields the labels of the resting signal segments and of the signal segment corresponding to each phrase;
Step 2, the training phrase data set S_p is segmented by a series of temporally overlapping signal windows to obtain M signal window samples. As shown in part (b) of FIG. 4, the sliding window length for data segmentation is W_length and the overlap ratio is Overlap; illustratively, W_length = 1000 ms and Overlap = 50%. Each phrase signal segment is divided evenly according to the number of syllables it contains, and each signal window sample is then given a fine-grained syllable label according to the syllable sequence of its phrase, with the result shown in part (b) of FIG. 4; according to the time span of each signal window sample under the different syllable labels, the window is labeled with the corresponding syllable, yielding a syllable-labeled training data set;
Step 3, the segmentation timing of the signal windows is changed to shift the window boundaries, and the processing of Step 2 is repeated, thereby obtaining K batches of syllable-labeled training data S_origin = {S^1, …, S^k, …, S^K}, where S^k = {(x_m^k, y_m^k) | m = 1, …, M} denotes the k-th batch, x_m^k the m-th signal window sample of the k-th batch, and y_m^k the corresponding one-hot syllable label of size [1, L]; S_origin contains M × K signal window samples in total. In this embodiment, as shown in part (c) of FIG. 4, the initial position of each data segmentation is shifted backward by Δ/5 relative to the previous batch, and the K batches of labeled signal window samples are obtained with the syllable labeling method above, where M = 327, K = 5 and Δ = 500 ms;
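A minimal sketch of the sliding-window segmentation and the boundary-shift augmentation follows; the sampling rate fs, the window-center labeling rule and the helper names are illustrative assumptions:

```python
import numpy as np

def window_and_label(sig, syllables, fs, w_ms=1000, overlap=0.5, shift_ms=0.0):
    """Slice one phrase signal (channels x samples) into overlapping windows and
    label each window with the syllable whose evenly divided time span contains
    the window center (an assumed labeling rule matching FIG. 4)."""
    w = int(fs * w_ms / 1000)
    hop = int(w * (1 - overlap))
    spans = np.linspace(0, sig.shape[1], len(syllables) + 1)  # even syllable division
    samples = []
    start = int(fs * shift_ms / 1000)
    while start + w <= sig.shape[1]:
        center = start + w // 2
        idx = min(int(np.searchsorted(spans, center, side="right")) - 1,
                  len(syllables) - 1)
        samples.append((sig[:, start:start + w], syllables[idx]))
        start += hop
    return samples

def augmented_batches(sig, syllables, fs, K=5, delta_ms=500.0):
    """Boundary-shift data augmentation: batch k starts k * delta/K later (delta/5 here)."""
    return [window_and_label(sig, syllables, fs, shift_ms=k * delta_ms / K)
            for k in range(K)]
```

With K = 5 and Δ = 500 ms this reproduces the five boundary-shifted batches of part (c) of FIG. 4.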
Step 4, extract the myoelectric features of the training data set S_origin:
Step 4.1, each signal window sample is split into consecutive non-overlapping frames to obtain d frames of signal window data; in this embodiment, the frame length of the consecutive non-overlapping frames is F_length = 40 ms;
Step 4.2, as shown in FIG. 5, the surface electromyographic signals acquired by the high-density electrode array are rearranged, according to the relative positions of the signal channels, into a surface electromyographic data matrix over the two-dimensional electrode channel grid, whose size is denoted [e, g]; in this embodiment, e = 8 and g = 8;
Step 4.3, c myoelectric features are extracted from each frame of signal window data to obtain a three-dimensional myoelectric feature map per frame, and thereby the three-dimensional feature map set of all signal window samples S_input = {(X_m^k, y_m^k)}, where X_m^k denotes the d-frame three-dimensional feature map of the m-th signal window sample of the k-th batch, with size [d, e, g, c], and y_m^k the corresponding syllable label. In this embodiment, c = 4; the four extracted myoelectric time-domain features are mean absolute value (MAV), waveform length (WL), zero crossings (ZC) and slope sign changes (SSC); the number of frames per signal window feature map is d = 25, and the feature map size is [25, 8, 8, 4]. The database S_input formed by the feature maps of all signal window samples is finally used as the input of the neural network.
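The four time-domain features have standard definitions; the sketch below, with an assumed row-major channel-to-grid lookup GRID (the true mapping follows the physical layout in FIG. 5), assembles the [d, e, g, c] feature map of one signal window:

```python
import numpy as np

def td_features(frame):
    """MAV, WL, ZC, SSC of one 1-D frame (standard time-domain sEMG definitions)."""
    mav = np.mean(np.abs(frame))
    wl = np.sum(np.abs(np.diff(frame)))
    zc = np.sum(frame[:-1] * frame[1:] < 0)
    ssc = np.sum((frame[1:-1] - frame[:-2]) * (frame[1:-1] - frame[2:]) > 0)
    return np.array([mav, wl, zc, ssc], dtype=np.float32)

def feature_map(window, grid, fs, f_ms=40):
    """window: (n_channels, n_samples) sEMG; grid: (e, g) array of channel indices.
    Returns the [d, e, g, c] feature map described in Step 4."""
    f_len = int(fs * f_ms / 1000)
    d = window.shape[1] // f_len
    e, g = grid.shape
    out = np.zeros((d, e, g, 4), dtype=np.float32)
    for t in range(d):
        seg = window[:, t * f_len:(t + 1) * f_len]
        for i in range(e):
            for j in range(g):
                out[t, i, j] = td_features(seg[grid[i, j]])
    return out

# Illustrative grid: channels 0..63 laid out row-major on the 8x8 electrode map
GRID = np.arange(64).reshape(8, 8)
```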
Step 5, construct the deep neural network that captures spatio-temporal information, comprising: A dilated convolution blocks wrapped in time-distributed layers, a flattening layer, A bidirectional gated recurrent unit (BiGRU) blocks and A fully connected layers; the three-dimensional myoelectric feature map set S_input is fed into the deep neural network in K batches. As shown in FIG. 6, the network consists of the time-distributed dilated convolution blocks, the flattening layer, the BiGRU blocks and the fully connected layers; in this embodiment, A = 2;
Step 5.1, each a-th dilated convolution block comprises a dilated convolution layer, a batch normalization layer and a Dropout layer; the a-th dilated convolution layer uses H_a two-dimensional convolution kernels of size h × h and a Tanh activation function;
When a = 1, the k-th batch of three-dimensional feature maps is fed into the a-th dilated convolution block for processing, which outputs the a-th feature map set of the k-th batch {F_m^{k,a}}, where F_m^{k,a} denotes the feature map output for the m-th signal window sample X_m^k of the k-th batch, with size [d, e, g, H_a];
When a = 2, 3, …, A, the (a-1)-th feature map set of the k-th batch {F_m^{k,a-1}} is fed into the a-th dilated convolution block for processing, which outputs the a-th feature map set of the k-th batch {F_m^{k,a}}, so that the A-th dilated convolution block outputs the final feature map set {F_m^{k,A}}. In this embodiment, the first dilated convolution layer consists of H_1 = 32 filters of size 3 × 3 with dilation factor 1, the second dilated convolution layer consists of H_2 = 8 filters of size 3 × 3 with dilation factor 3, and both Dropout layers use ratio 0.5; F_m^{k,1} has size [25, 8, 8, 32] and F_m^{k,2} has size [25, 8, 8, 8];
Step 5.2, after the feature maps {F_m^{k,A}} are processed by the flattening layer, the flattened feature set of the k-th batch {V_m^k} is obtained, where V_m^k denotes the feature map output after F_m^{k,A} is flattened, with size [d, e × g × H_A]; in this embodiment, V_m^k has size [25, 512];
Step 5.3, each a-th BiGRU block comprises a bidirectional gated recurrent unit layer with a ReLU activation function and a Dropout layer; the hidden nodes of every bidirectional gated recurrent unit layer have dimension b;
When a = 1, the flattened feature set of the k-th batch {V_m^k} is fed into the a-th BiGRU block for processing, which outputs the a-th gated feature set of the k-th batch {G_m^{k,a}}, where G_m^{k,a} denotes the gated features output after V_m^k is processed by the a-th BiGRU block, with size [d, 2 × b];
When a = 2, 3, …, A-1, the (a-1)-th gated feature set of the k-th batch {G_m^{k,a-1}} is fed into the a-th BiGRU block for processing, which outputs the a-th gated feature set of the k-th batch {G_m^{k,a}}, so that the (A-1)-th BiGRU block outputs the (A-1)-th gated feature set of the k-th batch with size [d, 2 × b];
When a = A, the (A-1)-th gated feature set of the k-th batch {G_m^{k,A-1}} is fed into the A-th BiGRU block for processing, which outputs only the last time step as the A-th gated feature set of the k-th batch {G_m^{k,A}}, with size [1, 2 × b]. In this embodiment, each BiGRU block comprises 1 bidirectional gated recurrent unit layer with a ReLU activation function and 1 Dropout layer; the hidden node dimension of both BiGRU layers is b = 64 and the Dropout ratio is 0.4; G_m^{k,1} has size [25, 128] and G_m^{k,2} has size [1, 128];
Step 5.4, the activation functions of the first A-1 fully connected layers are Tanh, each followed by one Dropout layer, and the activation function of the A-th fully connected layer is softmax;
After the gated feature set {G_m^{k,A}} output by the A-th BiGRU block is processed by the A fully connected layers in turn, the scoring matrix of the syllable decision sequence Q^k = [q_1^k, …, q_m^k, …, q_M^k] is output, where q_m^k = [q_{m,1}^k, …, q_{m,j}^k, …, q_{m,L}^k] gives the probabilities that the m-th signal window sample x_m^k of the k-th batch is predicted as each of the L syllables, and q_{m,j}^k is the probability that the m-th sample of the k-th batch is predicted as the j-th syllable class. In this embodiment, the 1st fully connected layer uses Tanh with a hidden dimension of 200 and is followed by 1 Dropout layer with ratio 0.2, and the 2nd fully connected layer has a hidden dimension of 80;
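Assembling the embodiment's hyperparameters, a minimal tf.keras sketch of the DC-BiGRU classifier might look as follows; the TimeDistributed wrapping, the "same" padding and other unstated details are assumptions rather than the patent's exact network:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_dc_bigru(d=25, e=8, g=8, c=4, n_classes=80):
    """DC-BiGRU sketch: dilated conv blocks -> per-frame flatten -> BiGRU -> dense."""
    inp = layers.Input(shape=(d, e, g, c))
    x = inp
    for filters, dilation in [(32, 1), (8, 3)]:            # H_1=32, H_2=8; dilation 1 and 3
        x = layers.TimeDistributed(
            layers.Conv2D(filters, 3, dilation_rate=dilation,
                          padding="same", activation="tanh"))(x)
        x = layers.TimeDistributed(layers.BatchNormalization())(x)
        x = layers.Dropout(0.5)(x)
    x = layers.TimeDistributed(layers.Flatten())(x)        # [d, e*g*H_A] = [25, 512]
    x = layers.Bidirectional(layers.GRU(64, activation="relu",
                                        return_sequences=True))(x)
    x = layers.Dropout(0.4)(x)
    x = layers.Bidirectional(layers.GRU(64, activation="relu"))(x)  # last step only
    x = layers.Dropout(0.4)(x)
    x = layers.Dense(200, activation="tanh")(x)
    x = layers.Dropout(0.2)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)
```

Calling model.summary() confirms the per-frame flattened size of 512 and the final 80-way softmax.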
Step 5.5, establish the cross-entropy loss function Loss by formula (1):
    Loss = - Σ_{k=1}^{K} Σ_{m=1}^{M} Σ_{j=1}^{L} y_{m,j}^k · log(q_{m,j}^k)    (1)
where y_{m,j}^k is the value at the j-th position of the syllable label y_m^k of the m-th signal window sample of the k-th batch. In this embodiment, the one-hot coding length is 80, with only one position equal to 1 and the rest 0; each batch contains M samples, and the loss function is obtained as the cross entropy summed over the samples of the K batches;
Step 5.6, train the neural network:
Update the weight parameters of the deep neural network with an Adam optimizer, set a maximum number of iterations step and a dynamically varying learning rate lr, and stop training when the loss function Loss reaches its minimum or the iteration count equals step, thereby obtaining the optimal syllable classification model. In this embodiment, step = 300, the initial learning rate is lr = 0.01, and the learning rate becomes lr = 0.1 × lr every 100 iterations.
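A matching training sketch under the stated schedule (X_train and Y_train are assumed arrays of feature maps and one-hot labels; Keras epochs stand in for the patent's iterations):

```python
import tensorflow as tf

def lr_schedule(epoch, lr):
    # multiply lr by 0.1 every 100 iterations, per the embodiment (step = 300, lr0 = 0.01)
    return lr * 0.1 if epoch > 0 and epoch % 100 == 0 else lr

model = build_dc_bigru()  # from the sketch above
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])
# X_train: assumed shape [n_samples, 25, 8, 8, 4]; Y_train: one-hot labels of length 80
model.fit(X_train, Y_train, epochs=300,
          callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```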
Step 6, construct a statistical language model from the instruction set P of Chinese phrases and use it to post-process the results of the optimal syllable classifier:
Step 6.1, establish the many-to-one mapping θ from syllable label sequences to Chinese phrases. In this embodiment, different subjects speak at different rates, and it is difficult to guarantee the same rate across silent repetitions of the same phrase, so the number of signal window samples of the same phrase varies; the mapping from syllable label sequences to phrases is therefore many-to-one.
Step 6.2, process a Chinese phrase p' to be decoded according to the procedure of Step 2 to obtain U signal window samples to be decoded; process the U signal window samples to be decoded according to the procedure of Step 4 to obtain the three-dimensional myoelectric feature maps to be decoded;
Step 6.3, feed the three-dimensional myoelectric feature maps to be decoded into the optimal syllable classification model, which outputs the scoring matrix of the syllable label sequence of the Chinese phrase p', O = [o_1, …, o_u, …, o_U], where o_u denotes the score probability vector of the u-th syllable of the Chinese phrase p' and U denotes the length of the syllable sequence;
Step 6.4, set the search depth of each syllable to depth and process O with a multi-beam search algorithm, obtaining depth^U syllable label sequences and their depth^U scores;
Step 6.5, judge whether any of the depth^U syllable label sequences matches the many-to-one mapping θ; if the match succeeds, select from the matching syllable label sequences the phrase p̂ corresponding to the sequence with the highest score and output it; otherwise, execute Step 6.6;
Step 6.6, take the syllable with the highest score probability in the score vector o_u of the u-th syllable of the Chinese phrase p', denoted ĉ_u, thereby obtaining the syllable decision sequence ĉ = (ĉ_1, …, ĉ_U); select from the many-to-one mapping θ the phrase p̂ whose syllable label sequence has the minimum edit distance to ĉ, as the decoding result of the Chinese phrase p'.
In this embodiment, depth = 5. The phrase p̂ corresponding to the highest-scoring syllable label sequence among those successfully matched with the many-to-one mapping θ is selected by formula (2):
    p̂ = argmax_{phrase} { Score(phrase) }    (2)
where Score(phrase) denotes the score of a successfully matched phrase, the maximization returns the phrase with the maximum score, and phrase denotes a phrase in the phrase instruction set P. The syllable decision sequence ĉ is obtained by formula (3):
    ĉ_u = argmax_j { o_u(j) },  u = 1, …, U    (3)
where argmax{·} returns the syllable with the highest score in each syllable score vector.
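A minimal sketch of this language-model post-processing, where θ is assumed to be a dict from syllable-index tuples to phrase strings, the summed per-syllable score is an assumed scoring rule, and itertools.product enumerates the depth^U candidates as a brute-force stand-in for the multi-beam search:

```python
import itertools
import numpy as np

def edit_distance(a, b):
    """Levenshtein distance between two syllable sequences."""
    dp = np.arange(len(b) + 1)
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return int(dp[-1])

def decode_phrase(scores, theta, depth=5):
    """scores: (U, L) syllable score matrix; theta: {syllable tuple: phrase}."""
    top = np.argsort(scores, axis=1)[:, -depth:]         # top-depth syllables per step
    best, best_score = None, -np.inf
    for cand in itertools.product(*top):                 # depth**U candidate sequences
        cand = tuple(int(c) for c in cand)
        if cand in theta:                                # Step 6.5: matched in theta
            s = sum(scores[u, c] for u, c in enumerate(cand))
            if s > best_score:
                best, best_score = theta[cand], s
    if best is not None:
        return best
    c_hat = tuple(int(c) for c in np.argmax(scores, axis=1))  # Step 6.6 fallback
    return min(theta.items(), key=lambda kv: edit_distance(c_hat, kv[0]))[1]
```

With depth = 5 and short command phrases, enumerating the 5^U candidates remains tractable for a real-time system.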
In this embodiment, to quantitatively evaluate the invention, the proposed decoding method, denoted DCBiMEP, is compared with conventional classification methods. In the comparison experiment, four common phrase classification methods are compared with DCBiMEP: HMM, a dilated convolutional neural network (DCNN), a bidirectional gated recurrent unit network (BiGRU), and DC-BiGRU. The data preparation for the four methods is as follows: sEMG activity detection is performed on the raw myoelectric data, the sEMG activity data of each phrase is extracted and given the corresponding phrase label, and features are extracted from the labeled phrase data to obtain feature data for all phrases. In addition, to verify the effectiveness of data augmentation, the method is evaluated with and without augmentation, denoted AUG-DCBiMEP and DCBiMEP respectively. FIG. 7 shows the phrase recognition accuracy (PRA) of the above six methods on data from the 8 subjects. The PRAs of the four conventional phrase classification methods are (82.74 ± 7.48)%, (83.06 ± 7.31)%, (87.92 ± 5.82)% and (90.49 ± 5.47)% respectively, showing that DC-BiGRU, which captures spatio-temporal information, performs best among them. The PRA of the proposed DCBiMEP is (97.27 ± 1.44)%, clearly superior to the four comparison methods. AUG-DCBiMEP improves PRA by a further 0.91% to (98.18 ± 1.44)%, demonstrating the effectiveness of the data augmentation.
FIGS. 8a and 8b show the phrase recognition confusion matrices on subject 2's data for DC-BiGRU, the best of the four comparison methods, and for the method of the present invention. Clearly, DC-BiGRU does not recognize phrases with similar pronunciation, such as "slow down" vs. "speed up" and "turn left" vs. "turn right", as well as the method of the present invention does.
Combining the above comparative experiments and recognition results, the following conclusions can be drawn: 1) the proposed decoding method can efficiently recognize phrases with similar pronunciation and improves the performance of the silent speech system; 2) the data augmentation method that adjusts the window boundaries further improves performance over the base method; 3) the statistical language model effectively exploits the semantic and temporal correlation of the phrases, helps to understand their meaning, and achieves high-accuracy, natural and continuous silent speech interaction.
Claims (1)
1. A silent speech decoding method based on surface myoelectricity of the face and neck, characterized by comprising the following steps:
Step 1, construct an instruction set P = {p_1, …, p_n, …, p_N} containing N Chinese phrases, where p_n denotes the n-th Chinese phrase in the instruction set P and the N Chinese phrases contain L classes of syllables;
Collect, with a high-density electrode array, the surface electromyographic signals generated by the facial muscles and neck muscles while the user silently reads the Chinese phrases, and label the resting signal segments and the phrase-related surface electromyographic signal segments with a double-threshold detection method based on short-time energy and zero-crossing rate, thereby forming labeled phrase signal segments that constitute a training phrase data set S_p;
Step 2, segment the training phrase data set S_p with a series of temporally overlapping signal windows to obtain M signal window samples; divide each phrase signal segment evenly according to the number of syllables it contains, and assign a fine-grained syllable label to each signal window sample according to the syllable sequence of its phrase, thereby obtaining one batch of training data consisting of M syllable-labeled signal window samples;
Step 3, change the segmentation timing so as to shift the window boundary of each signal window, and repeat the processing of Step 2, thereby obtaining K batches of syllable-labeled training data S_origin = {S^1, …, S^k, …, S^K}, where S^k = {(x_m^k, y_m^k) | m = 1, …, M} denotes the k-th batch of syllable-labeled training data, x_m^k denotes the m-th signal window sample of the k-th batch, and y_m^k denotes the corresponding syllable label, represented by one-hot coding with size [1, L]; S_origin contains M × K signal window samples in total;
Step 4, extract the myoelectric features of the training data set S_origin:
Step 4.1, split each signal window sample into consecutive non-overlapping frames to obtain d frames of signal window data;
Step 4.2, according to the relative positions of the signal channels of the high-density electrode array, rearrange the surface electromyographic signals acquired by the high-density electrode array into a surface electromyographic data matrix over the two-dimensional electrode channel grid, whose size is denoted [e, g];
Step 4.3, extract c myoelectric features from each frame of signal window data to obtain a three-dimensional myoelectric feature map for each frame, and thereby obtain the three-dimensional myoelectric feature map set of all signal window samples S_input = {(X_m^k, y_m^k)}, where X_m^k denotes the d-frame three-dimensional myoelectric feature map of the m-th signal window sample of the k-th batch, with size [d, e, g, c], and y_m^k denotes the syllable label of the m-th signal window sample of the k-th batch;
Step 5, construct a deep neural network that captures spatio-temporal information, comprising: A dilated convolution blocks wrapped in time-distributed layers, a flattening layer, A bidirectional gated recurrent unit (BiGRU) blocks and A fully connected layers; the three-dimensional myoelectric feature map set S_input is fed into the deep neural network in K batches;
Step 5.1, each a-th dilated convolution block comprises a dilated convolution layer, a batch normalization layer and a Dropout layer; the a-th dilated convolution layer uses H_a two-dimensional convolution kernels of size h × h and a Tanh activation function;
When a = 1, the k-th batch of three-dimensional myoelectric feature maps is fed into the a-th dilated convolution block for processing, which outputs the a-th feature map set of the k-th batch {F_m^{k,a}}, where F_m^{k,a} denotes the feature map output for the m-th signal window sample X_m^k of the k-th batch, with size [d, e, g, H_a];
When a = 2, 3, …, A, the (a-1)-th feature map set of the k-th batch {F_m^{k,a-1}} is fed into the a-th dilated convolution block for processing, which outputs the a-th feature map set of the k-th batch {F_m^{k,a}}, so that the A-th dilated convolution block outputs the final feature map set {F_m^{k,A}};
Step 5.2, after the feature maps {F_m^{k,A}} are processed by the flattening layer, the flattened feature set of the k-th batch {V_m^k} is obtained, where V_m^k denotes the feature map output after F_m^{k,A} passes through the flattening layer, with size [d, e × g × H_A];
Step 5.3, each a-th BiGRU block comprises a bidirectional gated recurrent unit layer with a ReLU activation function and a Dropout layer; the hidden nodes of every bidirectional gated recurrent unit layer have dimension b;
When a = 1, the flattened feature set of the k-th batch {V_m^k} is fed into the a-th BiGRU block for processing, which outputs the a-th gated feature set of the k-th batch {G_m^{k,a}}, where G_m^{k,a} denotes the gated features output after V_m^k is processed by the a-th BiGRU block, with size [d, 2 × b];
When a = 2, 3, …, A-1, the (a-1)-th gated feature set of the k-th batch {G_m^{k,a-1}} is fed into the a-th BiGRU block for processing, which outputs the a-th gated feature set of the k-th batch {G_m^{k,a}}, so that the (A-1)-th BiGRU block outputs the (A-1)-th gated feature set of the k-th batch with size [d, 2 × b];
When a = A, the (A-1)-th gated feature set of the k-th batch {G_m^{k,A-1}} is fed into the A-th BiGRU block for processing, which outputs only the last time step as the A-th gated feature set of the k-th batch {G_m^{k,A}}, with size [1, 2 × b];
Step 5.4, the activation functions of the first A-1 fully connected layers are Tanh, each followed by one Dropout layer, and the activation function of the A-th fully connected layer is softmax;
After the gated feature set {G_m^{k,A}} output by the A-th BiGRU block is processed by the A fully connected layers in turn, the scoring matrix of the syllable decision sequence Q^k = [q_1^k, …, q_m^k, …, q_M^k] is output, where q_m^k = [q_{m,1}^k, …, q_{m,j}^k, …, q_{m,L}^k] gives the probabilities that the m-th signal window sample x_m^k of the k-th batch is predicted as each of the L syllables, and q_{m,j}^k is the probability that the m-th sample of the k-th batch is predicted as the j-th syllable class;
Step 5.5, establish the cross-entropy loss function Loss by formula (1):
    Loss = - Σ_{k=1}^{K} Σ_{m=1}^{M} Σ_{j=1}^{L} y_{m,j}^k · log(q_{m,j}^k)    (1)
where y_{m,j}^k is the value at the j-th position of the syllable label y_m^k of the m-th signal window sample of the k-th batch;
Step 5.6, train the neural network:
Update the weight parameters of the deep neural network with an Adam optimizer, set a maximum number of iterations step and a dynamically varying learning rate lr, and stop training when the loss function Loss reaches its minimum or the iteration count equals step, thereby obtaining the optimal syllable classification model;
Step 6, construct a statistical language model from the instruction set P of Chinese phrases and use it to post-process the results of the optimal syllable classifier:
Step 6.1, establish the many-to-one mapping θ from syllable label sequences to Chinese phrases;
Step 6.2, process a Chinese phrase p' to be decoded according to the procedure of Step 2 to obtain U signal window samples to be decoded; process the U signal window samples to be decoded according to the procedure of Step 4 to obtain the three-dimensional myoelectric feature maps to be decoded;
Step 6.3, feed the three-dimensional myoelectric feature maps to be decoded into the optimal syllable classification model, which outputs the scoring matrix of the syllable label sequence of the Chinese phrase p', O = [o_1, …, o_u, …, o_U], where o_u denotes the score probability vector of the u-th syllable of the Chinese phrase p' and U denotes the length of the syllable sequence;
Step 6.4, set the search depth of each syllable to depth and process O with a multi-beam search algorithm, obtaining depth^U syllable label sequences and their depth^U scores;
Step 6.5, judge whether any of the depth^U syllable label sequences matches the many-to-one mapping θ; if the match succeeds, select from the matching syllable label sequences the phrase p̂ corresponding to the sequence with the highest score and output it; otherwise, execute Step 6.6;
Step 6.6, take the syllable with the highest score probability in the score vector o_u of the u-th syllable of the Chinese phrase p', denoted ĉ_u, thereby obtaining the syllable decision sequence ĉ = (ĉ_1, …, ĉ_U); select from the many-to-one mapping θ the phrase p̂ whose syllable label sequence has the minimum edit distance to ĉ, as the decoding result of the Chinese phrase p'.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210598661.3A CN114999461B (en) | 2022-05-30 | 2022-05-30 | Silent voice decoding method based on surface myoelectricity of face and neck |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210598661.3A CN114999461B (en) | 2022-05-30 | 2022-05-30 | Silent voice decoding method based on surface myoelectricity of face and neck |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114999461A true CN114999461A (en) | 2022-09-02 |
CN114999461B CN114999461B (en) | 2024-05-07 |
Family
ID=83028992
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210598661.3A Active CN114999461B (en) | 2022-05-30 | 2022-05-30 | Silent voice decoding method based on surface myoelectricity of face and neck |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114999461B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20170095603A (en) * | 2016-02-15 | 2017-08-23 | 인하대학교 산학협력단 | A monophthong recognition method based on facial surface EMG signals by optimizing muscle mixing |
CN107545888A (en) * | 2016-06-24 | 2018-01-05 | 常州诗雅智能科技有限公司 | A kind of pharyngeal cavity electronic larynx voice communication system automatically adjusted and method |
CN112151030A (en) * | 2020-09-07 | 2020-12-29 | 中国人民解放军军事科学院国防科技创新研究院 | Multi-mode-based complex scene voice recognition method and device |
CN113288183A (en) * | 2021-05-20 | 2021-08-24 | 中国科学技术大学 | Silent voice recognition method based on facial neck surface myoelectricity |
Non-Patent Citations (1)
Title |
---|
WANG Xu; JIA Xueqin; LI Jinghong; YANG Dan: "Silent speech signal recognition based on optimized myoelectric features", Journal of Northeastern University (Natural Science), No. 10, 28 October 2006 (2006-10-28) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117084872A (en) * | 2023-09-07 | 2023-11-21 | 中国科学院苏州生物医学工程技术研究所 | Walking aid control method, system and medium based on neck myoelectricity and walking aid |
CN117084872B (en) * | 2023-09-07 | 2024-05-03 | 中国科学院苏州生物医学工程技术研究所 | Walking aid control method, system and medium based on neck myoelectricity and walking aid |
Also Published As
Publication number | Publication date |
---|---|
CN114999461B (en) | 2024-05-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Schultz et al. | Modeling coarticulation in EMG-based continuous speech recognition | |
CN113288183B (en) | Silent voice recognition method based on facial neck surface myoelectricity | |
CN102982809B (en) | Conversion method for sound of speaker | |
CN109935243A (en) | Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model | |
CN110516696A (en) | It is a kind of that emotion identification method is merged based on the adaptive weighting bimodal of voice and expression | |
CN111462769B (en) | End-to-end accent conversion method | |
Peters | Dimensions of perception for consonants | |
CN107256392A (en) | A kind of comprehensive Emotion identification method of joint image, voice | |
CN107221318A (en) | Oral English Practice pronunciation methods of marking and system | |
CN103366618A (en) | Scene device for Chinese learning training based on artificial intelligence and virtual reality | |
CN109727608A (en) | A kind of ill voice appraisal procedure based on Chinese speech | |
CN110211594A (en) | A kind of method for distinguishing speek person based on twin network model and KNN algorithm | |
CN109841231A (en) | A kind of early stage AD speech auxiliary screening system for standard Chinese | |
CN114999461B (en) | Silent voice decoding method based on surface myoelectricity of face and neck | |
CN108766462B (en) | Voice signal feature learning method based on Mel frequency spectrum first-order derivative | |
CN110348482A (en) | A kind of speech emotion recognition system based on depth model integrated architecture | |
Wand | Advancing electromyographic continuous speech recognition: Signal preprocessing and modeling | |
Pillai et al. | A deep learning based evaluation of articulation disorder and learning assistive system for autistic children | |
CN114863912B (en) | Silent voice decoding method based on surface electromyographic signals | |
Harrington et al. | A physiological analysis of high front, tense-lax vowel pairs in Standard Austrian and Standard German | |
JP5030150B2 (en) | Voice recognition device using myoelectric signal | |
CN114999468A (en) | Speech feature-based speech recognition algorithm and device for aphasia patients | |
JP4110247B2 (en) | Artificial vocalization device using biological signals | |
Karjo | Phonetic and phonotactic analysis of Manggarai language | |
Räsänen | Speech segmentation and clustering methods for a new speech recognition architecture |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |