CN114999461B - Silent voice decoding method based on surface myoelectricity of face and neck

Publication number: CN114999461B (granted publication of CN114999461A)
Application number: CN202210598661.3A
Authority: CN (China)
Legal status: Active
Other languages: Chinese (zh)
Prior art keywords: syllable, phrase, signal window, data, gating
Inventors: 张旭 (Zhang Xu), 何运宝 (He Yunbao), 陈希 (Chen Xi), 陈香 (Chen Xiang), 陈勋 (Chen Xun)
Assignee: University of Science and Technology of China (USTC)

Classifications

    • G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L19/167 — Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
    • G10L25/45 — Speech or voice analysis techniques characterised by the type of analysis window
    • G10L2015/027 — Syllables being the recognition units
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a silent voice decoding method based on the surface myoelectricity of the face and neck, which decodes unspoken voice content by processing surface electromyographic signals collected from the relevant muscles while the user silently reads, and comprises the following steps: 1. collecting surface electromyographic signals of a user to form a training data set; 2. performing data segmentation to obtain a training data set with syllable labels; 3. performing data enhancement; 4. extracting features from the enhanced training data set; 5. constructing a deep neural network for describing spatio-temporal information; 6. constructing a statistical language model to obtain predictions of the phrases the user silently reads in sequence. The invention recognizes voice content from the finer-granularity structures that make up a speech sequence, can realize high-performance silent speech recognition, helps in understanding the speech meaning corresponding to surface myoelectric activity, and provides a new approach for silent speech recognition.

Description

Silent voice decoding method based on surface myoelectricity of face and neck
Technical Field
The invention belongs to the fields of biological signal processing, machine learning and intelligent control, and particularly relates to a face-neck surface myoelectricity-based silent voice decoding method.
Background
Speech is an effective and convenient way of communication that is indispensable in daily human life. Over the past few decades, speech-related human-computer interaction technologies, typified by automatic speech recognition (ASR), have evolved rapidly and shown very high performance in general scenarios. However, because ASR depends on an audible speech signal, its disadvantages are very pronounced: effective operation cannot be guaranteed against a high-noise background, the requirements of private interaction cannot be met, and people with speech impairments cannot rely on ASR for daily communication.
To overcome the above drawbacks, researchers have explored non-acoustic speech recognition methods. During speaking and silent reading, the pronunciation-related muscle groups of the face and neck are activated, generating bioelectric signals called surface electromyography (sEMG). Thus, sEMG-based silent speech recognition (SSR) has become an important complement to ASR in some special scenarios. SSR technology based on sEMG has been developed for decades, with some progress. Early SSR mainly used classical pattern classification methods such as support vector machines and conjugate gradient networks; sEMG of the subject's face and neck was recorded by discrete electrodes with a small number of channels, and a corpus with a limited number of words was identified. Later studies tended to identify larger-vocabulary corpora using hidden Markov models (HMM), which characterize the timing information of sEMG. With the development of data acquisition technology, high-density (HD) electrode arrays have been designed to simultaneously record many channels of surface electromyographic signals from a target muscle or muscle group over a relatively large area. The use of high-density surface electromyographic (HD-sEMG) arrays helps capture valuable spatial information that characterizes the heterogeneity of muscle activity, and thus improves the performance of electromyographic pattern recognition.
While the above studies demonstrate the usability of pattern classification techniques in achieving satisfactory SSR performance, there are some drawbacks. For example: 1) pattern classification methods establish a simple mapping between sEMG pattern features and phrases or words, ignoring temporally associated semantic information; 2) the performance of classification techniques is limited by the number of vocabulary words in the corpus; 3) common pattern classification techniques are mainly used to identify isolated words and cannot realize natural and coherent silent voice interaction.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, and provides a silent voice decoding method based on the surface myoelectricity of the face and neck, so as to recognize the finer-granularity structures of a speech sequence and understand voice content, thereby improving the recognition performance for phrases with similar pronunciation and finally realizing accurate and natural silent voice interaction.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
the invention relates to a face-neck surface myoelectricity-based silent voice decoding method which is characterized by comprising the following steps:
Step one, constructing an instruction set P = {p_1, …, p_n, …, p_N} containing N Chinese phrases, wherein p_n represents the nth Chinese phrase in the instruction set P, and the N Chinese phrases contain L syllables altogether;
Collecting the surface electromyographic signals generated by the face and neck muscles when a user silently reads the Chinese phrases by using a high-density electrode array, and labeling the resting signal segments and the surface electromyographic signal segments corresponding to the phrases by using a dual-threshold detection method based on short-time energy and zero-crossing rate, so as to form labeled phrase signal segments constituting a training phrase data set S_p;
Dividing the training phrase data set S_p with a series of temporally overlapping signal windows to obtain M signal window samples, equally dividing each phrase signal segment according to the number of syllables it contains, and giving each signal window sample a fine-granularity syllable label by combining the syllable sequence of each phrase signal segment, thereby obtaining one batch of training data consisting of M signal window samples with syllable labels;
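The windowing and equal-division labelling described above can be sketched in Python; a minimal sketch in which the window length, overlap and the centre-based label assignment are illustrative simplifications of labelling by dominant time span:

```python
def window_and_label(segment, syllables, w_length=1000, overlap=0.5):
    """Split one phrase signal segment into overlapping windows and
    attach a syllable label to each. The segment is divided equally
    among its syllables; here each window takes the label of the
    syllable region containing its centre (a simplification)."""
    step = int(w_length * (1 - overlap))
    n = len(segment)
    syl_len = n / len(syllables)          # equal division per syllable
    samples = []
    for start in range(0, n - w_length + 1, step):
        centre = start + w_length // 2
        idx = min(int(centre / syl_len), len(syllables) - 1)
        samples.append((segment[start:start + w_length], syllables[idx]))
    return samples
```

For a 4000-sample segment of the two-syllable phrase "ni hao", this yields seven 50%-overlapped windows, the first three labelled "ni" and the rest "hao".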
Step three, changing the segmentation start time of the signal windows to adjust the window segmentation boundary of each signal window, and processing according to the process of step two to obtain K batches of training data sets with syllable labels S_origin = {S_1, …, S_k, …, S_K}, wherein S_k = {(x_m^k, y_m^k) | m = 1, …, M} represents the kth training data set with syllable labels, x_m^k represents the mth signal window sample of the kth batch of data, and y_m^k represents the corresponding syllable label, represented by one-hot coding; the size of y_m^k is [1, L]; S_origin contains M×K signal window samples;
Step four, extracting myoelectric characteristics of the training data set S origin:
Step 4.1, carrying out segmentation processing on each signal window sample by using continuous non-overlapping frames to obtain signal window data of d frames;
Step 4.2, converting the surface myoelectric signals acquired by the high-density electrode array into a surface myoelectric data matrix of the two-dimensional electrode channel array according to the relative positions of the signal channels of the high-density electrode array, wherein the size of the surface myoelectric data matrix is marked as [ e, g ];
Step 4.3, extracting c myoelectric features from the signal window data of each frame, so as to obtain a three-dimensional myoelectric feature map for each frame, and thereby the three-dimensional myoelectric feature map set of all signal window samples S_input = {(X_m^k, y_m^k) | k = 1, …, K; m = 1, …, M}, wherein X_m^k represents the d-frame three-dimensional myoelectric feature map of the mth signal window sample of the kth batch of data, the size of X_m^k is denoted [d, e, g, c], and y_m^k represents the syllable label of the mth signal window sample of the kth batch of data;
Step five, constructing a deep neural network for describing spatio-temporal information, which comprises: A dilated convolution blocks containing time-distributed layers, a flattening layer, A bidirectional gated recurrent unit blocks and fully connected layers, and inputting the three-dimensional myoelectric feature map set S_input into the deep neural network in K batches;
Step 5.1, any ath dilated convolution block comprises a dilated convolution layer, a batch normalization layer and a Dropout layer; the ath dilated convolution layer adopts H_a two-dimensional convolution kernels of size h×h and a Tanh activation function;
When a = 1, the kth batch of the three-dimensional myoelectric feature map set is input into the ath dilated convolution block for processing, and the ath feature map set of the kth batch is output as F^{k,a} = {f_m^{k,a} | m = 1, …, M}, wherein f_m^{k,a} represents the feature map output for the mth signal window sample x_m^k in the kth three-dimensional myoelectric feature map set, with size [d, e, g, H_a];
When a = 2, 3, …, A, the (a-1)th feature map set F^{k,a-1} of the kth batch is input into the ath dilated convolution block for processing, and the ath feature map set F^{k,a} of the kth batch is output; thus the final feature map set F^{k,A} is output by the Ath dilated convolution block;
Step 5.2, after the feature map set F^{k,A} is processed by the flattening layer, the kth flattened feature set V^k = {v_m^k | m = 1, …, M} is obtained, wherein v_m^k represents the feature map output after the mth feature map f_m^{k,A} of the kth batch passes through the flattening layer, with dimension [d, e×g×H_A];
Step 5.3, any ath bidirectional gated recurrent unit block comprises a bidirectional gated recurrent unit layer adopting a ReLU activation function and a Dropout layer, wherein the hidden-node dimension of each bidirectional gated recurrent unit layer is b;
When a = 1, the flattened feature set V^k of the kth batch is input into the ath bidirectional gated recurrent unit block for processing, and the ath gating feature set G^{k,a} = {g_m^{k,a} | m = 1, …, M} of the kth batch is output, wherein g_m^{k,a} represents the gating feature output after the feature map v_m^k is processed by the ath bidirectional gated recurrent unit block; the size of g_m^{k,a} is [d, 2×b];
When a = 2, 3, …, A-1, the (a-1)th gating feature set G^{k,a-1} of the kth batch is input into the ath bidirectional gated recurrent unit block for processing, and the ath gating feature set G^{k,a} of the kth batch is output; thus the (A-1)th gating feature set G^{k,A-1}, of size [d, 2×b], is output by the (A-1)th bidirectional gated recurrent unit block;
When a = A, the (A-1)th gating feature set G^{k,A-1} of the kth batch is input into the Ath bidirectional gated recurrent unit block for processing, and the Ath gating feature set G^{k,A} of the kth batch, of size [1, 2×b], is output;
Step 5.4, the activation function of the first A-1 fully connected layers adopts Tanh, each followed by a Dropout layer, while the activation function of the Ath fully connected layer is softmax;
The gating feature set G^{k,A} output by the Ath bidirectional gated recurrent unit block is processed by the A fully connected layers in sequence, and the score matrix of syllable decision sequences Q^k = {q_m^k | m = 1, …, M} is output, wherein q_m^k = [q_{m,1}^k, …, q_{m,j}^k, …, q_{m,L}^k] represents the probabilities that the mth signal window sample x_m^k of the kth batch of data is predicted as each of the L syllables, and q_{m,j}^k represents the probability that the mth sample x_m^k of the kth batch of data is predicted as the jth syllable;
Step 5.5, establishing a cross-entropy Loss function Loss by formula (1):
Loss = -(1/(M×K)) × Σ_{k=1}^{K} Σ_{m=1}^{M} Σ_{j=1}^{L} y_{m,j}^k log(q_{m,j}^k)    (1)
In formula (1), y_{m,j}^k is the value at the jth position of the syllable label y_m^k corresponding to the mth signal window sample x_m^k of the kth batch of data;
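Formula (1) is the cross-entropy averaged over all window samples; a minimal numeric sketch (the toy labels and probabilities below are illustrative, not from the patent):

```python
import math

def cross_entropy_loss(Y, Q):
    """Mean cross-entropy over all window samples.
    Y: one-hot syllable labels, shape [samples][L]
    Q: predicted syllable probabilities, same shape."""
    total = 0.0
    for y, q in zip(Y, Q):
        # only the labelled position j (y_j = 1) contributes -log(q_j)
        total -= sum(yj * math.log(qj) for yj, qj in zip(y, q) if yj > 0)
    return total / len(Y)

# Two samples, L = 3 syllables (toy numbers)
Y = [[1, 0, 0], [0, 1, 0]]
Q = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = cross_entropy_loss(Y, Q)   # = -(log 0.7 + log 0.8) / 2 ≈ 0.29
```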
Step 5.6, training a neural network:
Updating the weight parameters of the deep neural network with an Adam optimizer, setting the maximum number of iterations step, dynamically changing the network learning rate lr, and stopping training when the Loss function Loss reaches its minimum or the number of iterations equals step, thereby obtaining the optimal syllable classification model;
Step six, constructing a statistical language model according to the instruction set P of Chinese phrases, so as to post-process the output of the optimal syllable classification model:
Step 6.1, establishing a many-to-one mapping relation theta between syllable tag sequences and Chinese phrases;
Step 6.2, processing a Chinese phrase p' to be decoded according to the process of the step two to obtain U signal window samples to be decoded with syllable labels; processing the U signal window samples to be decoded according to the process of the fourth step to obtain a three-dimensional myoelectricity characteristic atlas to be decoded;
Step 6.3, inputting the three-dimensional myoelectric feature map set to be decoded into the optimal syllable classification model, and outputting the score matrix of the syllable label sequence of the Chinese phrase p′, Q′ = {q′_1, …, q′_u, …, q′_U}, wherein q′_u represents the score probability matrix of the uth syllable of the Chinese phrase p′, and U represents the length of the syllable sequence;
Step 6.4, letting the search depth of each syllable be depth, and processing Q′ with a multi-beam search algorithm to obtain depth^U syllable label sequences and depth^U scores;
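One plausible reading of the multi-beam search in step 6.4 is to keep the depth highest-scoring syllables at each of the U positions and enumerate their combinations; a Python sketch under that assumption (the patent's exact search variant may differ):

```python
from itertools import product

def multi_beam_search(score_matrix, depth=2):
    """For each of the U syllable positions keep the `depth`
    highest-probability syllable labels, then enumerate all
    depth**U label sequences with their cumulative (product)
    scores, sorted best-first."""
    per_pos = []
    for probs in score_matrix:           # probs: length-L probability vector
        top = sorted(range(len(probs)),
                     key=lambda j: probs[j], reverse=True)[:depth]
        per_pos.append([(j, probs[j]) for j in top])
    sequences = []
    for combo in product(*per_pos):      # depth**U combinations
        labels = tuple(j for j, _ in combo)
        score = 1.0
        for _, p in combo:
            score *= p
        sequences.append((labels, score))
    sequences.sort(key=lambda t: t[1], reverse=True)
    return sequences
```

For U = 2 positions and depth = 2 this produces 2² = 4 candidate sequences, the best being the per-position argmax sequence.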
Step 6.5, judging whether any of the depth^U syllable label sequences successfully matches the many-to-one mapping relation theta; if so, selecting from the matched syllable label sequences the phrase corresponding to the one with the highest score and outputting it; otherwise, executing step 6.6;
Step 6.6, from the score matrix q′_u of the uth syllable of the Chinese phrase p′, selecting the syllable with the highest score probability as the uth decision, thereby obtaining the syllable decision sequence; then selecting, in the many-to-one mapping relation theta, the phrase with the minimum edit distance to the syllable decision sequence as the decoding result of the Chinese phrase p′.
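The fallback of step 6.6 reduces to nearest-neighbour matching of the argmax syllable sequence under edit distance; a minimal sketch (the mapping contents below are illustrative):

```python
def edit_distance(a, b):
    """Levenshtein distance between two syllable label sequences,
    using a single rolling row of the DP table."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def nearest_phrase(decision_seq, mapping):
    """mapping: syllable label sequence (tuple) -> phrase (many-to-one).
    Return the phrase whose label sequence is closest to the
    syllable decision sequence in edit distance."""
    best_seq = min(mapping, key=lambda s: edit_distance(decision_seq, s))
    return mapping[best_seq]
```

E.g. with candidate sequences (1, 2, 3) and (1, 4, 3), the decision sequence (1, 2, 4) decodes to the phrase of (1, 2, 3), which is one edit away.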
Compared with the prior art, the invention has the beneficial effects that:
1. According to the invention, data segmentation is carried out on the original phrase sEMG data to obtain fine-granularity syllable surface myoelectric data, a dilated-convolution bidirectional gated recurrent unit neural network (DC-BiGRU) is established as the classifier, and a statistical language model describing semantic information is further provided to extract and correct the output of the trained classifier, obtaining accurate predictions of phrase sequences; accurate and natural silent speech recognition is thereby realized through this decoding framework.
2. The automatic labeling method for training data helps simplify the preparation of training data; meanwhile, the invention provides a data enhancement method based on adjusting window boundaries, which effectively relieves the over-fitting of the deep network and improves the performance of silent speech recognition.
3. The invention conducts recognition at the fine-granularity syllable level and provides a statistical language model based on multi-beam search and edit distance, which utilizes the semantic timing-association information of the phrase set to help understand the meaning of phrases and improve the recognition performance for phrases with similar pronunciation.
4. The invention meets the data processing requirements of a real-time system and realizes high-performance, natural and coherent silent speech recognition, which is beneficial to the practical application of the method in fields such as myoelectric control.
Drawings
FIG. 1 is a flow chart of a silent speech decoding method based on face-to-neck surface myoelectricity according to the present invention;
FIG. 2 is a set of Chinese pronunciation phrases in accordance with the present invention;
FIG. 3 is an explanatory diagram of shape parameters and placement positions of the face-neck high density electrode array employed in the present invention;
FIG. 4 is a schematic diagram of a data segmentation, automatic labeling and data enhancement method employed in the present invention;
FIG. 5 is a schematic diagram of spatial position distribution and splice results in a high density electrode array according to the present invention;
FIG. 6 is a schematic diagram of the classification network based on dilated convolution and bidirectional gated recurrent units (DC-BiGRU) used in the present invention;
FIG. 7 is a graph showing average phrase recognition rate and standard deviation obtained by the present invention;
FIG. 8a is a schematic diagram of a confusion matrix based on the DC-BiGRU phrase classification, which is obtained by the present invention;
fig. 8b is a schematic diagram of a confusion matrix based on the DCBiMEP decoding method according to the present invention.
Detailed Description
In this embodiment, a silent voice decoding method based on the surface myoelectricity of the face and neck extracts timing-related semantic information using a statistical language model; this not only improves the recognition performance for phrases with similar pronunciation, but also helps in understanding the phrase meaning corresponding to sEMG activity, providing a new approach for silent speech recognition. Specifically, as shown in fig. 1, the method includes the following steps:
Step one, constructing an instruction set P = {p_1, …, p_n, …, p_N} containing N Chinese phrases, wherein p_n represents the nth Chinese phrase in the instruction set P, and the N Chinese phrases contain L syllables altogether; as shown in fig. 2, the Chinese pronunciation vocabulary set is composed of N = 30 phrases, comprising 79 Chinese syllables and 1 resting syllable, so L = 80;
The surface electromyographic signals generated by the face and neck muscles when the Chinese phrases are silently read by a user are collected using a high-density electrode array, and the resting signal segments and the surface electromyographic signal segments corresponding to the phrases are labeled by a dual-threshold detection method based on short-time energy and zero-crossing rate, forming the labeled phrase signal segments that constitute the training phrase data set S_p. In this example, 8 healthy subjects (7 male, 1 female) aged 21-26 years, with no hearing or language disorder, were recruited to participate in the data collection experiment. Each subject was explicitly informed of the experimental procedures and specific requirements.
The shape parameters and placement locations of the high-density electrode arrays are shown in fig. 3. Four high-density flexible electrode arrays are used, placed symmetrically in pairs on the left and right sides of the face and neck. By way of example, the two facial electrode arrays have 16 channels each, an electrode diameter of 5 mm, and electrode spacings of 10 mm, 15 mm and 18 mm; the two neck electrode arrays have 16 channels each, an electrode diameter of 5 mm, and an electrode spacing of 18 mm. The face and neck electrode arrays together form a 64-channel array. In addition, one electrode is attached behind each of the left and right ears, serving as the reference electrode and the ground electrode;
Prior to attaching the electrode arrays, the target muscles of the subject's face and neck are scrubbed with an alcohol cotton pad to remove skin keratin, and an appropriate amount of conductive gel is applied to the electrode probes to reduce skin impedance. Illustratively, the facial electrode arrays collect sEMG from facial muscles such as the zygomatic, masseter and depressor labii muscles, and the neck electrode arrays collect sEMG from neck muscles such as the omohyoid, sternohyoid and platysma. During collection, the subjects silently express each phrase at a moderate speed, repeating each phrase 20 times, each repetition constituting one trial; the interval between repetitions of a phrase is t, set to 3 s. During each trial, behaviors not relevant to the acquisition task, such as swallowing saliva and coughing, were not allowed. To avoid muscle fatigue in the subject, there is a rest time of T between trials; an exemplary T is 30 s;
The result of sEMG activity detection on the raw data is shown in part a of fig. 4. In this embodiment, sEMG activity detection is performed on the raw data by a dual-threshold detection method based on short-time energy and zero-crossing rate, so as to obtain the resting signal segment and the signal segment label corresponding to each phrase. First, the short-time energy and zero-crossing rate of the resting-state baseline are calculated as the initial energy and initial zero-crossing rate, denoted E_i and C_i; a short-time window of length S_length is then used to calculate the short-time energy and zero-crossing rate of the raw data. The method requires three thresholds: the first two, denoted E_h and E_l, are the high and low thresholds on the short-time energy value, used for the initial judgment of the onset and offset positions; the third is a threshold SC on the short-time zero-crossing rate. Illustratively, S_length is set to 64 ms, E_h to 8×E_i, E_l to 3×E_i, and SC to 3×C_i; the resting signal segment and the labels of the signal segments corresponding to each phrase are thereby obtained;
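The dual-threshold decision above can be sketched as follows; a simplified Python sketch in which the frame-by-frame onset/offset logic is reduced to a per-frame activity flag (threshold multiples follow the embodiment; the frame length is taken in samples rather than milliseconds):

```python
def detect_activity(signal, e_i, c_i, s_length=64):
    """Per-frame dual-threshold sEMG activity detection (sketch).
    e_i, c_i: baseline short-time energy and zero-crossing rate;
    thresholds E_h = 8*e_i, E_l = 3*e_i, SC = 3*c_i as in the text."""
    e_h, e_l, sc = 8 * e_i, 3 * e_i, 3 * c_i
    flags = []
    for f in range(0, len(signal) - s_length + 1, s_length):
        seg = signal[f:f + s_length]
        energy = sum(x * x for x in seg)                     # short-time energy
        zcr = sum(1 for i in range(len(seg) - 1)
                  if seg[i] * seg[i + 1] < 0)                # zero crossings
        # active if energy clears the high threshold, or clears the
        # low threshold while the zero-crossing rate is also elevated
        flags.append(energy > e_h or (energy > e_l and zcr > sc))
    return flags
```

On a toy signal of two resting frames, two high-amplitude frames and two resting frames, only the middle frames are flagged active.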
Step two, segmenting the training phrase data set S_p with a series of temporally overlapping signal windows to obtain M signal window samples; as shown in part b of fig. 4, the sliding window length for data division is W_length and the overlap ratio is Overlap; exemplary W_length is set to 1000 ms and Overlap to 50%. Each phrase signal segment is divided equally according to the number of syllables it contains, and each signal window sample is given a fine-granularity syllable label by combining the syllable sequence of each phrase signal segment, with the result shown in part b of fig. 4; the corresponding syllable label is assigned to each signal window according to its time span under the different syllable labels, thereby obtaining the training data set with syllable labels;
Step three, changing the segmentation start time of the signal windows to adjust the window segmentation boundary of each signal window, and processing according to the process of step two to obtain K batches of training data sets with syllable labels S_origin = {S_1, …, S_k, …, S_K}, wherein S_k = {(x_m^k, y_m^k) | m = 1, …, M} represents the kth training data set with syllable labels, x_m^k represents the mth signal window sample of the kth batch of data, and y_m^k represents the corresponding syllable label, represented by one-hot coding with size [1, L]; S_origin contains M×K signal window samples. In this embodiment, as shown in part c of fig. 4, the initial position of each batch's segmentation is shifted back by Δ/5 relative to the previous batch, and K batches of labeled signal window samples are obtained by the above syllable labeling method, where M = 327, K = 5, and Δ is set to 500 ms;
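The boundary-shifting data enhancement can be sketched as below; window length, step and shift values follow the embodiment (in samples rather than milliseconds), while the helper names are illustrative:

```python
def split_windows(segment, w_length=1000, step=500):
    """Plain sliding-window split with 50% overlap by default."""
    return [segment[s:s + w_length]
            for s in range(0, len(segment) - w_length + 1, step)]

def augment_batches(segment, K=5, delta=500, w_length=1000, step=500):
    """Data enhancement by adjusting window boundaries: batch k shifts
    the segmentation start back by k*(delta/K) before re-windowing, so
    the same recording yields K batches of windows with different
    boundaries (delta = 500, K = 5 per the embodiment)."""
    shift = delta // K
    return [split_windows(segment[k * shift:], w_length, step)
            for k in range(K)]
```

A 6000-sample segment yields 5 batches; the unshifted batch has 11 windows, and each shifted batch starts delta/K samples later.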
Step four, extracting myoelectric characteristics of the training data set S origin:
Step 4.1, carrying out segmentation processing on each signal window sample by using continuous non-overlapping frames to obtain signal window data of d frames; in the present embodiment, the frame length of the frames that are consecutive and non-overlapping is f_length=40 ms;
Step 4.2, as shown in fig. 5, according to the relative positions of the signal channels of the high-density electrode array, the surface myoelectric signals acquired by the high-density electrode array are converted into a surface myoelectric data matrix of a two-dimensional electrode channel array, whose size is denoted [e, g]; in this embodiment, e = 8, g = 8;
Step 4.3, extracting c myoelectric features from the signal window data of each frame, so as to obtain a three-dimensional myoelectric feature map for each frame, and thereby the three-dimensional myoelectric feature map set of all signal window samples S_input, wherein X_m^k represents the d-frame three-dimensional myoelectric feature map of the mth signal window sample of the kth batch of data, with size denoted [d, e, g, c], and y_m^k represents the syllable label of the mth signal window sample of the kth batch of data. In this embodiment, c = 4; the 4 extracted myoelectric time-domain features are the mean absolute value (MAV), waveform length (WL), number of zero-crossing points (ZC) and number of slope sign changes (SSC); the number of frames of each signal window sample's feature map is d = 25 and the size of the feature map is [25, 8, 8, 4]; finally the database S_input formed by the feature maps of all signal window samples is obtained as the input of the neural network.
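The four time-domain features and their arrangement on the e×g electrode grid can be sketched as follows; the simple row-major channel-to-grid mapping used here is an assumption (the actual layout is given by fig. 5):

```python
def frame_features(frame):
    """MAV, WL, ZC and SSC of one frame of a single channel."""
    n = len(frame)
    mav = sum(abs(x) for x in frame) / n                       # mean absolute value
    wl = sum(abs(frame[i + 1] - frame[i]) for i in range(n - 1))  # waveform length
    zc = sum(1 for i in range(n - 1) if frame[i] * frame[i + 1] < 0)  # zero crossings
    diffs = [frame[i + 1] - frame[i] for i in range(n - 1)]
    ssc = sum(1 for i in range(len(diffs) - 1)
              if diffs[i] * diffs[i + 1] < 0)                  # slope sign changes
    return [mav, wl, zc, ssc]

def feature_map(window, d=25, e=8, g=8):
    """window: list of samples, each a list of e*g channel values.
    Returns a [d][e][g][4] nested list: for each of d frames, the four
    features of every channel placed on the e x g electrode grid
    (channel ch mapped row-major to row ch // g, column ch % g)."""
    f_len = len(window) // d
    out = []
    for t in range(d):
        frame = window[t * f_len:(t + 1) * f_len]
        grid = [[None] * g for _ in range(e)]
        for ch in range(e * g):
            channel = [sample[ch] for sample in frame]
            grid[ch // g][ch % g] = frame_features(channel)
        out.append(grid)
    return out
```

For a constant 64-channel window, MAV is the constant amplitude and WL, ZC and SSC are all zero.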
Step five, constructing a deep neural network for describing spatio-temporal information, which comprises: A dilated convolution blocks containing time-distributed layers, a flattening layer, A bidirectional gated recurrent unit blocks and fully connected layers, and inputting the three-dimensional myoelectric feature map set S_input into the deep neural network in K batches; as shown in fig. 6, the deep neural network for describing spatio-temporal information is composed of these dilated convolution blocks, the flattening layer, the bidirectional gated recurrent unit blocks and the fully connected layers; in this embodiment, A = 2;
Step 5.1, any ath dilated convolution block comprises a dilated convolution layer, a batch normalization layer and a Dropout layer; the ath dilated convolution layer adopts H_a two-dimensional convolution kernels of size h×h and a Tanh activation function;
when a=1, inputting the three-dimensional myoelectricity characteristic atlas of the kth batch into the a expansion convolution block for processing, and outputting the a characteristic atlas of the kth batch as follows Represents the m-th signal window sample/>, in the k-th three-dimensional myoelectric feature map setThe output characteristic diagram has the dimensions of [ d, e, g, H a ];
when a=2, 3, …, A, the (a-1)-th feature atlas of the k-th batch is input into the a-th dilated convolution block for processing, and the a-th feature atlas of the k-th batch is output; the A-th dilated convolution block thus outputs the final feature atlas; in this embodiment, the first dilated convolution layer consists of H_1=32 filters of size 3×3 with dilation rate 1, the second dilated convolution layer consists of H_2=8 filters of size 3×3 with dilation rate 3, and the ratio of the two Dropout layers is 0.5; the output of the first block has size [25,8,8,32] and that of the second block has size [25,8,8,8];
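For intuition, a same-padded two-dimensional dilated convolution like the ones in the two blocks above (rates 1 and 3) can be sketched in plain numpy; the single-channel, bias-free form below is a simplification of the actual multi-filter layers.

```python
import numpy as np

def dilated_conv2d(x, kernel, rate=1):
    """'Same'-padded single-channel 2D dilated convolution (sketch).

    x: (H, W) input; kernel: (kh, kw) weights; rate: dilation rate.
    The effective kernel footprint grows to (kh-1)*rate + 1 per axis,
    enlarging the receptive field without extra parameters.
    """
    kh, kw = kernel.shape
    eff_h, eff_w = (kh - 1) * rate + 1, (kw - 1) * rate + 1
    ph, pw = eff_h // 2, eff_w // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))      # zero padding keeps [H, W]
    H, W = x.shape
    out = np.zeros((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    # sample the input at dilated offsets
                    acc += kernel[a, b] * xp[i + a * rate, j + b * rate]
            out[i, j] = acc
    return out
```

With rate 1 this is an ordinary 3×3 convolution; with rate 3 the same 9 weights cover a 7×7 neighbourhood of the 8×8 electrode grid.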
step 5.2, after the feature atlas is processed by the flattening layer, the k-th flattened feature set is obtained, in which the feature map output for the m-th sample of the k-th batch after the flattening layer has size [d, e×g×H_A]; in this embodiment, the size is [25,512];
Step 5.3, any a-th bidirectional gated recurrent unit block comprises a bidirectional gated recurrent unit layer adopting a ReLU activation function and a Dropout layer, wherein the hidden node dimension of the bidirectional gated recurrent unit layer is b;
when a=1, the flattened feature set of the k-th batch is input into the a-th bidirectional gated recurrent unit block for processing, and the a-th gating feature set of the k-th batch is output, in which the gating feature output for each feature map after processing by the a-th bidirectional gated recurrent unit block has size [d, 2×b];
when a=2, 3, …, A-1, the (a-1)-th gating feature set of the k-th batch is input into the a-th bidirectional gated recurrent unit block for processing, and the a-th gating feature set of the k-th batch is output; the (A-1)-th bidirectional gated recurrent unit block thus outputs the (A-1)-th gating feature set of the k-th batch, with size [d, 2×b];
when a=A, the (A-1)-th gating feature set of the k-th batch is input into the a-th bidirectional gated recurrent unit block for processing, and the a-th gating feature set of the k-th batch is output, with size [1, 2×b]; in this embodiment, each bidirectional gated recurrent unit block comprises 1 bidirectional gated recurrent unit layer adopting a ReLU activation function and 1 Dropout layer, the hidden node dimension of the two bidirectional gated recurrent units is b=64, and the Dropout ratio is 0.4; the first gating feature set has size [25,128] and the second has size [1,128];
step 5.4, the activation functions of the first A-1 fully connected layers adopt Tanh, each followed by one Dropout layer, and the activation function of the A-th fully connected layer is softmax;
the gating feature set output by the A-th bidirectional gated recurrent unit block is processed by the A fully connected layers in sequence, and the score matrix P_k^m of the syllable decision sequence is output, where P_k^m gives the probabilities that the m-th signal window sample of the k-th batch of data is predicted as each of the L syllables, and P_{k,m,j} denotes the probability that the m-th sample of the k-th batch of data is predicted by the network as the j-th syllable; in this embodiment, the activation function of the 1st fully connected layer adopts Tanh with a hidden node dimension of 200, followed by 1 Dropout layer with ratio 0.2, and the hidden node dimension of the 2nd fully connected layer is 80;
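The dimension bookkeeping of steps 5.1 to 5.4 can be traced end to end; the sketch below hardcodes the embodiment's hyperparameters (d=25, an 8×8 grid, c=4 features, filter counts 32 and 8, b=64, fully connected widths 200 and 80) as assumptions.

```python
def network_shapes(d=25, e=8, g=8, c=4, H=(32, 8), b=64, fc=(200, 80)):
    """Trace tensor shapes through the described architecture (sketch).

    Same-padded dilated conv blocks keep [d, e, g] and change only the
    channel count; the flatten layer merges the spatial axes; each BiGRU
    doubles the hidden size b; the last BiGRU keeps only the final step.
    """
    shape = [d, e, g, c]
    trace = [tuple(shape)]
    for h in H:                       # A dilated convolution blocks
        shape = [d, e, g, h]
        trace.append(tuple(shape))
    shape = [d, e * g * shape[-1]]    # flatten: [25, 512]
    trace.append(tuple(shape))
    shape = [d, 2 * b]                # BiGRU 1, full sequence: [25, 128]
    trace.append(tuple(shape))
    shape = [1, 2 * b]                # BiGRU 2, last step only: [1, 128]
    trace.append(tuple(shape))
    for width in fc:                  # fully connected layers
        shape = [1, width]
        trace.append(tuple(shape))
    return trace
```

The final (1, 80) row matches the 80-syllable softmax output described above.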
step 5.5, establishing a cross entropy Loss function Loss by using the formula (1):

Loss = -Σ_{k=1}^{K} Σ_{m=1}^{M} Σ_{j=1}^{L} y_{k,m,j} log(P_{k,m,j})  (1)
In the formula (1), y_{k,m,j} is the value at the j-th position of the one-hot syllable label y_k^m corresponding to the m-th signal window sample x_k^m of the k-th batch of data; in this embodiment, the one-hot code length is 80, only one position has the value 1 and the rest are 0, each batch contains M samples, and the loss function is obtained by weighted summation of the sample cross entropies of the K batches;
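A numerical reading of the loss: for a softmax output row and a one-hot label, the per-sample cross entropy is minus the log of the probability at the labeled position. This is a minimal sketch; the epsilon guard and the averaging convention over samples are implementation assumptions, not details given in the patent.

```python
import numpy as np

def one_hot_cross_entropy(probs, labels):
    """Mean cross entropy over a batch of one-hot labelled samples.

    probs:  (M, L) softmax outputs, rows sum to 1.
    labels: (M, L) one-hot rows.
    """
    eps = 1e-12  # guard against log(0)
    return -np.mean(np.sum(labels * np.log(probs + eps), axis=1))
```

For example, a sample whose labeled syllable receives probability 0.5 contributes -log(0.5) ≈ 0.693 to the loss.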
Step 5.6, training a neural network:
Updating the weight parameters of the deep neural network by adopting an Adam optimizer, setting the maximum number of iterations step, and dynamically changing the network learning rate lr; training stops when the Loss function Loss reaches its minimum or the number of iterations equals step, so as to obtain the optimal syllable classification model; in this embodiment, step=300, the initial learning rate is lr=0.01, and the learning rate becomes lr=0.1×lr every 100 iterations.
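The embodiment's staircase schedule (initial lr=0.01, multiplied by 0.1 every 100 iterations) can be written as a simple function of the iteration count:

```python
def learning_rate(iteration, lr0=0.01, decay=0.1, every=100):
    """Staircase learning-rate decay as described in the embodiment:
    multiply the initial rate lr0 by `decay` once per `every` iterations."""
    return lr0 * decay ** (iteration // every)
```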
Step six, constructing a statistical language model according to the instruction set P of Chinese phrases, so as to post-process the results of the optimal syllable classifier:
Step 6.1, establishing a many-to-one mapping relation θ between syllable label sequences and Chinese phrases; in this embodiment, the speech rates of different subjects differ to a certain extent, and it is difficult to ensure the same speech rate when the same phrase is repeatedly silently read, so the number of signal window samples of the same phrase varies; the mapping from syllable label sequences to phrases is therefore many-to-one.
Step 6.2, processing a Chinese phrase p' to be decoded according to the process of the step two to obtain U signal window samples to be decoded with syllable labels; processing the U signal window samples to be decoded according to the process of the fourth step to obtain a three-dimensional myoelectricity characteristic atlas to be decoded;
Step 6.3, inputting the three-dimensional myoelectric feature atlas to be decoded into the optimal syllable classification model, and outputting the scoring matrix of the syllable label sequence of the Chinese phrase p', in which the u-th element is the scoring probability matrix of the u-th syllable of the Chinese phrase p', and U denotes the length of the syllable sequence;
Step 6.4, letting the search depth of each syllable be depth, and processing the scoring matrix with a multi-beam search algorithm to obtain depth candidate syllable label sequences of length U and their depth scores;
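Step 6.4's multi-beam search over the per-syllable score matrices can be sketched as follows; summing log-probabilities as the sequence score is an assumption here, since the patent does not spell out the scoring rule.

```python
import numpy as np

def multi_beam_search(scores, depth=5):
    """Keep the `depth` best syllable label sequences for a U x L score
    matrix (rows: syllable positions, columns: per-syllable probabilities).

    Returns a list of (label sequence, log score) pairs, best first.
    """
    beams = [((), 0.0)]
    for row in scores:
        logp = np.log(np.asarray(row, dtype=float) + 1e-12)
        # extend every surviving hypothesis by every possible syllable
        candidates = [(seq + (j,), s + logp[j])
                      for seq, s in beams for j in range(len(row))]
        candidates.sort(key=lambda t: t[1], reverse=True)
        beams = candidates[:depth]    # prune to the beam width
    return beams
```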
Step 6.5, judging whether the depth candidate syllable label sequences are successfully matched with the many-to-one mapping relation θ; if so, selecting and outputting the phrase corresponding to the successfully matched syllable label sequence with the highest score among them; otherwise, executing step 6.6;
Step 6.6, selecting the syllable with the highest score probability from the scoring matrix of the u-th syllable of the Chinese phrase p', thereby obtaining the syllable decision sequence; the phrase in the many-to-one mapping relation θ with the minimum edit distance to the syllable decision sequence is then selected as the decoding result of the Chinese phrase p'.
In this embodiment, depth is set to 5, and the phrase corresponding to the syllable label sequence with the highest score is selected by formula (2) from the label sequences successfully matched with the many-to-one mapping relation θ. In formula (2), the score of each successfully matched phrase is evaluated, and max{·} returns the phrase in the phrase instruction set P corresponding to the maximum score. The syllable decision sequence is obtained by formula (3); in formula (3), argmax{·} returns the syllable with the highest score for each syllable scoring matrix.
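The fallback of step 6.6, argmax syllables followed by nearest phrase under edit distance over the mapping θ, can be sketched as below; representing θ as a dictionary from syllable label tuples to phrases is an illustrative assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences (single-row DP)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def decode_phrase(decision_seq, mapping):
    """Return the phrase for an argmax syllable decision sequence.

    mapping: dict from syllable label tuples to phrases (the relation θ).
    Exact matches are returned directly; otherwise the phrase whose
    syllable sequence has the minimum edit distance is chosen.
    """
    key = tuple(decision_seq)
    if key in mapping:
        return mapping[key]
    return min(mapping.items(),
               key=lambda kv: edit_distance(key, kv[0]))[1]
```

A single-substitution error in the decision sequence is thus still mapped back to the intended phrase, which is how the language model absorbs classifier mistakes on similar syllables.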
In this embodiment, in order to quantitatively evaluate the effect of the present invention, the decoding method of the present invention, denoted DCBiMEP, is compared with conventional classification methods. In the comparison experiment, four common phrase classification methods are adopted and compared with DCBiMEP of the invention; the four methods are denoted HMM, dilated convolutional neural network (DCNN), bidirectional gated recurrent unit (BiGRU), and DC-BiGRU, respectively. The data preparation process of the four methods is as follows: sEMG activity detection is performed on the raw myoelectric data, the sEMG activity data corresponding to each phrase is extracted and labeled with the corresponding phrase label, and feature extraction is performed on the labeled phrase data to obtain the feature data of all phrases. In addition, in order to verify the effectiveness of the data enhancement for the method of the present invention, two variants are derived according to whether data enhancement is performed, denoted DCBiMEP and AUG-DCBiMEP, the latter representing the method of the present invention after data enhancement. FIG. 7 shows the phrase recognition accuracy (PRA) of the above 6 methods on 8 subjects; the PRA of the 4 conventional phrase classification methods is (82.74±7.48)%, (83.06±7.31)%, (87.92±5.82)% and (90.49±5.47)%, respectively, showing that DC-BiGRU, which describes spatio-temporal information, performs best among them. The PRA of the method DCBiMEP of the invention is (97.27±1.44)%, significantly better than the 4 comparison phrase classification methods.
The PRA of AUG-DCBiMEP rises by 0.91% over DCBiMEP to (98.18±1.44)%, demonstrating the effectiveness of the proposed data enhancement for the method of the present invention.
Figures 8a and 8b show the phrase recognition confusion matrices on the data of subject 2 for DC-BiGRU, the best performing of the 4 comparison classification methods, and for the method of the invention. It is apparent that DC-BiGRU does not perform as well as the method of the present invention in recognizing similarly pronounced phrases such as "slow down" and "speed up", or "turn left" and "turn right".
Combining the comparison experiments and the recognition results, the following conclusions can be drawn: 1) the decoding method provided by the invention can efficiently recognize phrases with similar pronunciation and improves the performance of the silent speech system; 2) the data enhancement method of adjusting window boundaries can further improve performance over the original method; 3) the statistical language model effectively utilizes the semantic and temporal association information of phrases, is beneficial to understanding the meaning of phrases, and realizes high-precision, natural and continuous silent speech interaction.

Claims (1)

1. A face-neck surface myoelectricity-based silent speech decoding method, comprising the steps of:
Step one, constructing an instruction set P={p_1,…,p_n,…,p_N} containing N Chinese phrases, wherein p_n represents the n-th Chinese phrase in the instruction set P, and the N Chinese phrases contain L syllables in total;
Step two, collecting, with a high-density electrode array, the surface electromyographic signals generated by the face and neck muscles when a user silently reads the Chinese phrases, and labeling the resting signal segments and the surface electromyographic signal segments corresponding to the phrases by a double-threshold detection method based on short-time energy and zero-crossing rate, thereby forming labeled phrase signal segments and a training phrase dataset S_p;
dividing the training phrase dataset S_p with a series of temporally overlapping signal windows to obtain M signal window samples, equally dividing each phrase signal segment according to the number of syllables it contains, and performing fine-grained syllable labeling on each signal window sample by combining the syllable order of each phrase signal segment, thereby obtaining one batch of training data consisting of M signal window samples with syllable labels;
step three, changing the segmentation time of the signal windows to adjust the window segmentation boundary of each signal window, and processing according to the process of step two to obtain K batches of training datasets with syllable labels S_origin={S_1,…,S_k,…,S_K}, wherein S_k represents the k-th training dataset with syllable labels, x_k^m represents the m-th signal window sample of the k-th batch of data, y_k^m represents the corresponding syllable label expressed by one-hot coding, and the size of y_k^m is [1, L]; S_origin contains M×K signal window samples;
Step four, extracting myoelectric characteristics of the training data set S origin:
Step 4.1, carrying out segmentation processing on each signal window sample by using continuous non-overlapping frames to obtain signal window data of d frames;
Step 4.2, converting the surface myoelectric signals acquired by the high-density electrode array into a surface myoelectric data matrix of the two-dimensional electrode channel array according to the relative positions of the signal channels of the high-density electrode array, wherein the size of the surface myoelectric data matrix is marked as [ e, g ];
step 4.3, extracting c myoelectric features from the signal window data of each frame, so as to obtain a three-dimensional myoelectric feature map of each frame, and thereby the three-dimensional myoelectric feature atlas S_input of all signal window samples, wherein F_k^m represents the d-frame three-dimensional myoelectric feature map of the m-th signal window sample of the k-th batch of data, the size of F_k^m is denoted [d, e, g, c], and y_k^m represents the syllable label of the m-th signal window sample of the k-th batch of data;
step five, constructing a deep neural network describing spatio-temporal information, which comprises: A dilated convolution blocks containing time-distributed layers, a flattening layer, A bidirectional gated recurrent unit blocks and A fully connected layers, and inputting the three-dimensional myoelectric feature atlas S_input into the deep neural network in K batches;
step 5.1, any a-th dilated convolution block comprises a dilated convolution layer, a batch normalization layer and a Dropout layer; the a-th dilated convolution layer adopts H_a two-dimensional convolution kernels of size h×h and a Tanh activation function;
when a=1, the three-dimensional myoelectric feature atlas of the k-th batch is input into the a-th dilated convolution block for processing, and the a-th feature atlas of the k-th batch is output, wherein the feature map output for the m-th signal window sample of the k-th batch has size [d, e, g, H_a];
when a=2, 3, …, A, the (a-1)-th feature atlas of the k-th batch is input into the a-th dilated convolution block for processing, and the a-th feature atlas of the k-th batch is output; the A-th dilated convolution block thus outputs the final feature atlas;
step 5.2, after the feature atlas is processed by the flattening layer, the k-th flattened feature set is obtained, wherein the feature map output for the m-th sample of the k-th batch after the flattening layer has size [d, e×g×H_A];
step 5.3, any a-th bidirectional gated recurrent unit block comprises a bidirectional gated recurrent unit layer adopting a ReLU activation function and a Dropout layer, wherein the hidden node dimension of the bidirectional gated recurrent unit layer is b;
when a=1, the flattened feature set of the k-th batch is input into the a-th bidirectional gated recurrent unit block for processing, and the a-th gating feature set of the k-th batch is output, wherein the gating feature output for each feature map after processing by the a-th bidirectional gated recurrent unit block has size [d, 2×b];
when a=2, 3, …, A-1, the (a-1)-th gating feature set of the k-th batch is input into the a-th bidirectional gated recurrent unit block for processing, and the a-th gating feature set of the k-th batch is output; the (A-1)-th bidirectional gated recurrent unit block thus outputs the (A-1)-th gating feature set of the k-th batch, with size [d, 2×b];
when a=A, the (A-1)-th gating feature set of the k-th batch is input into the a-th bidirectional gated recurrent unit block for processing, and the a-th gating feature set of the k-th batch is output, with size [1, 2×b];
step 5.4, the activation functions of the first A-1 fully connected layers adopt Tanh, each followed by one Dropout layer, and the activation function of the A-th fully connected layer is softmax;
the gating feature set output by the A-th bidirectional gated recurrent unit block is processed by the A fully connected layers in sequence, and the score matrix P_k^m of the syllable decision sequence is output, wherein P_k^m gives the probabilities that the m-th signal window sample of the k-th batch of data is predicted as each of the L syllables, and P_{k,m,j} represents the probability that the m-th sample of the k-th batch of data is predicted as the j-th syllable;
step 5.5, establishing a cross entropy Loss function Loss by using the formula (1):

Loss = -Σ_{k=1}^{K} Σ_{m=1}^{M} Σ_{j=1}^{L} y_{k,m,j} log(P_{k,m,j})  (1)

in the formula (1), y_{k,m,j} is the value at the j-th position of the one-hot syllable label y_k^m corresponding to the m-th signal window sample x_k^m of the k-th batch of data;
Step 5.6, training a neural network:
updating the weight parameters of the deep neural network by adopting an Adam optimizer, setting the maximum number of iterations step, and dynamically changing the network learning rate lr; training stops when the Loss function Loss reaches its minimum or the number of iterations equals step, thereby obtaining the optimal syllable classification model;
step six, constructing a statistical language model according to the instruction set P of Chinese phrases, so as to post-process the results of the optimal syllable classifier:
Step 6.1, establishing a many-to-one mapping relation theta between syllable tag sequences and Chinese phrases;
Step 6.2, processing a Chinese phrase p' to be decoded according to the process of the step two to obtain U signal window samples to be decoded with syllable labels; processing the U signal window samples to be decoded according to the process of the fourth step to obtain a three-dimensional myoelectricity characteristic atlas to be decoded;
step 6.3, inputting the three-dimensional myoelectric feature atlas to be decoded into the optimal syllable classification model, and outputting the scoring matrix of the syllable label sequence of the Chinese phrase p', wherein the u-th element is the scoring probability matrix of the u-th syllable of the Chinese phrase p', and U represents the length of the syllable sequence;
step 6.4, letting the search depth of each syllable be depth, and processing the scoring matrix with a multi-beam search algorithm to obtain depth candidate syllable label sequences of length U and their depth scores;
step 6.5, judging whether the depth candidate syllable label sequences are successfully matched with the many-to-one mapping relation θ; if so, selecting and outputting the phrase corresponding to the successfully matched syllable label sequence with the highest score among them; otherwise, executing step 6.6;
step 6.6, selecting the syllable with the highest score probability from the scoring matrix of the u-th syllable of the Chinese phrase p', thereby obtaining the syllable decision sequence; the phrase in the many-to-one mapping relation θ with the minimum edit distance to the syllable decision sequence is then selected as the decoding result of the Chinese phrase p'.
CN202210598661.3A 2022-05-30 2022-05-30 Silent voice decoding method based on surface myoelectricity of face and neck Active CN114999461B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210598661.3A CN114999461B (en) 2022-05-30 2022-05-30 Silent voice decoding method based on surface myoelectricity of face and neck


Publications (2)

Publication Number Publication Date
CN114999461A CN114999461A (en) 2022-09-02
CN114999461B true CN114999461B (en) 2024-05-07

Family

ID=83028992


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117084872B (en) * 2023-09-07 2024-05-03 中国科学院苏州生物医学工程技术研究所 Walking aid control method, system and medium based on neck myoelectricity and walking aid

Citations (4)

Publication number Priority date Publication date Assignee Title
KR20170095603A (en) * 2016-02-15 2017-08-23 인하대학교 산학협력단 A monophthong recognition method based on facial surface EMG signals by optimizing muscle mixing
CN107545888A (en) * 2016-06-24 2018-01-05 常州诗雅智能科技有限公司 A kind of pharyngeal cavity electronic larynx voice communication system automatically adjusted and method
CN112151030A (en) * 2020-09-07 2020-12-29 中国人民解放军军事科学院国防科技创新研究院 Multi-mode-based complex scene voice recognition method and device
CN113288183A (en) * 2021-05-20 2021-08-24 中国科学技术大学 Silent voice recognition method based on facial neck surface myoelectricity


Non-Patent Citations (1)

Title
Silent speech signal recognition based on optimized myoelectric features; Wang Xu, Jia Xueqin, Li Jinghong, Yang Dan; Journal of Northeastern University (Natural Science); 2006-10-28 (No. 10); full text *


Similar Documents

Publication Publication Date Title
CN113288183B (en) Silent voice recognition method based on facial neck surface myoelectricity
CN107256392A (en) A kind of comprehensive Emotion identification method of joint image, voice
CN110059575A (en) A kind of augmentative communication system based on the identification of surface myoelectric lip reading
CN112861604B (en) Myoelectric action recognition and control method irrelevant to user
CN109935243A (en) Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN111832416A (en) Motor imagery electroencephalogram signal identification method based on enhanced convolutional neural network
CN114999461B (en) Silent voice decoding method based on surface myoelectricity of face and neck
Cheng et al. Emotion recognition algorithm based on convolution neural network
CN109272986A (en) A kind of dog sound sensibility classification method based on artificial neural network
Ma et al. Silent speech recognition based on surface electromyography
CN110348482B (en) Speech emotion recognition system based on depth model integrated architecture
CN108509869A (en) Feature set based on OpenBCI optimizes on-line training method
Wand Advancing electromyographic continuous speech recognition: Signal preprocessing and modeling
CN114403878B (en) Voice fatigue detection method based on deep learning
CN114129138B (en) Automatic sleep staging method based on time sequence multi-scale mixed attention model
Harrington et al. A physiological analysis of high front, tense-lax vowel pairs in Standard Austrian and Standard German
CN112733721B (en) Surface electromyographic signal classification method based on capsule network
CN114863912A (en) Silent voice decoding method based on surface electromyogram signals
CN113887339A (en) Silent voice recognition system and method fusing surface electromyogram signal and lip image
Zhang et al. EMG-based cross-subject silent speech recognition using conditional domain adversarial network
Srisuwan et al. Three steps of neuron network classification for EMG-based Thai tones speech recognition
Bush Vowel articulation and laryngeal control in the speech of the deaf
CN111783669A (en) Surface electromyographic signal classification and identification method for individual user
Peng et al. Speech emotion recognition of merged features based on improved convolutional neural network
CN115919313B (en) Facial myoelectricity emotion recognition method based on space-time characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant