WO2025259567A1 - Personalized dysarthric speech recognition - Google Patents

Personalized dysarthric speech recognition

Info

Publication number
WO2025259567A1
WO2025259567A1 (PCT/US2025/032777)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
dysarthric
speaker
model
values
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2025/032777
Other languages
French (fr)
Inventor
Jun Wang
Beiming Cao
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Texas System
University of Texas at Austin
Original Assignee
University of Texas System
University of Texas at Austin
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Texas System, University of Texas at Austin filed Critical University of Texas System
Publication of WO2025259567A1 publication Critical patent/WO2025259567A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice

Definitions

  • Dysarthria is a collection of motor speech disorders resulting from neurological injury to the motor speech system. People with dysarthria cannot control their motor subsystems, including respiration, phonation, resonance, articulation, and prosody. Speech in people with dysarthria is typically characterized by poor articulation, breathy voice, and monotonic intonation. Often, speech intelligibility is reduced in proportion to the severity of dysarthria.
  • An exemplary system and method are disclosed employing an adapted hidden Markov model-based speech recognition system, developed from a lengthy and diverse dysarthric speech dataset. Due to the difficulties in collecting large dysarthric speech datasets for training, prior Markov and artificial intelligence (AI) based systems have produced unsatisfactory results. To date, there has been no showing that Markov model-based speech recognition or AI could produce a commercially acceptable speech recognition system for dysarthria.
  • Dysarthria speech training data comprising normal speech (e.g., 161-200 words/minute), moderate Dysarthric speech (e.g., 121-160 words/minute), severe Dysarthric speech (e.g., fewer than 120 words/minute), or a combination thereof, from one or more patients with various voice pitches and accents, having at least 1000 words and 4500 phonemes (e.g., preferably 20 hours of Dysarthria-speech training data having 40,000 words and 187,000 phonemes).
  • the study demonstrated that the lengthy and diverse training dataset for Dysarthria speech can achieve higher word accuracy than human listening.
  • the study implemented speaker-dependent speech recognition experiments with an adapted hidden Markov model to reflect the speaker’s (lower) speaking rate on 21 hours of dysarthric speech recordings from a single speaker.
  • a baseline speaker-independent experiment was implemented by adding 100 hours of typical speech data from the LibriSpeech dataset to the training set.
  • the results showed the efficacy of both the speaker-dependent design and the adaptation of the hidden Markov model.
  • an 86.6% word accuracy was achieved. This is significantly higher than human listening (20.9%) and the baseline speaker-independent, non-personalized design (7.3%), observed as comparisons in the study. Severe dysarthric speech is typically considered unintelligible speech.
  • a system for recognizing a speech of a single speaker comprising: a processor; a memory having instructions stored thereon, wherein execution of the instructions by the processor causes the processor to: receive audio data (e.g., recording or audio stream in a controller) having low-intelligibility, severe dysarthric speech; determine, via a trained model, a plurality of estimated speech values for the severe, low-intelligibility dysarthric speech of the audio data, wherein the trained model was configured using a training dysarthric speech dataset acquired from one or more speakers of at least 1000 words and 4500 phonemes; and output the determined plurality of estimated speech values, wherein the determined plurality of estimated speech values is used for control or voice synthesis.
  • the training dysarthric speech dataset has at least 40,000 words and 187,000 phonemes.
  • the trained model comprises a trained hidden Markov model.
  • the trained model comprises a transformer-based model (i) trained on a healthy speech dataset (e.g., acquired from one or more healthy speakers of at least 10,000 hours of speech) and (ii) finetuned on a smaller dysarthric speech dataset (e.g., acquired from one or more dysarthric speakers of less than 1 hour of speech).
  • the determined plurality of estimated speech values is used for health status monitoring and dysarthria condition analysis.
  • the training dysarthric speech data comprises a non-dysarthric (i.e., normal) speech having 161 to 200 words per minute, a moderate dysarthric speech having 121 to 160 words per minute, a severe dysarthric speech having less than 120 words per minute, or a combination thereof.
  • the audio data is of the single speaker, and wherein the estimated speech values are for the single speaker.
  • the audio data is of a speaker different from the single speaker, and the estimated speech values are for the different speaker (e.g., wherein the different speaker has dysarthria).
  • the trained model was characterized as having an accuracy for the speech recognition greater than 85%.
  • system further comprises directing the plurality of estimated speech values to a human-machine interface.
  • the system further comprises a voice synthesis system and speaker.
  • the hidden Markov model employed at least three states and five states to represent nonsilence phonemes and silence phonemes, respectively.
  • the hidden Markov model employed at least six states to represent each of non-silence and silence phonemes.
  • the hidden Markov model employed a Gaussian mixture model (GMM).
  • a method comprising: receiving, by a processor, audio data (e.g., recording or audio stream in a controller) having low-intelligibility, severe dysarthric speech; determining, via a trained model, a plurality of estimated speech values for the severe, low-intelligibility dysarthric speech of the audio data, wherein the trained model was configured using a training dysarthric speech dataset acquired from one or more speakers of at least 1000 words and 4500 phonemes; and outputting the determined plurality of estimated speech values, wherein the determined plurality of estimated speech values is used for control or voice synthesis.
  • the training dysarthric speech dataset has at least 40,000 words and 187,000 phonemes.
  • the trained model comprises a trained hidden Markov model.
  • the training dysarthric speech data comprises a non-dysarthric (i.e., normal) speech having 161 to 200 words per minute, a moderate dysarthric speech having 121 to 160 words per minute, a severe dysarthric speech having less than 120 words per minute, or a combination thereof.
  • the audio data is of the single speaker, and the estimated speech values are for the single speaker.
  • the audio data is of a speaker different from the single speaker, and the estimated speech values are for the different speaker (e.g., wherein the different speaker has dysarthria).
  • the trained model was characterized as having an accuracy for the speech recognition greater than 85%.
  • the hidden Markov model employed at least three states and five states to represent non-silence phonemes and silence phonemes, respectively.
  • the hidden Markov model employed at least six states to represent each of non-silence and silence phonemes.
  • the hidden Markov model employed a Gaussian mixture model (GMM), deep neural network (DNN), long-short term memory (LSTM)-recurrent neural network, and bidirectional-LSTM (BLSTM).
  • a non-transitory computer-readable medium having instructions stored thereon, wherein execution of the instructions by a processor causes the processor to: receive audio data (e.g., recording or audio stream in a controller) having low-intelligibility, severe dysarthric speech; determine, via a trained model, a plurality of estimated speech values for the severe, low-intelligibility dysarthric speech of the audio data, wherein the trained model was configured using a training dysarthric speech dataset acquired from one or more speakers of at least 1000 words and 4500 phonemes; and output the determined plurality of estimated speech values, wherein the determined plurality of estimated speech values is used for control or voice synthesis.
  • FIGURES 1A-1D depict an example workflow for an example system for determining estimated speech values in accordance with the present disclosure.
  • FIGURES 2A-2D depict example implementations of estimated text of dysarthric speech.
  • FIGURE 3 depicts an example method of determining estimated speech values in accordance with the present disclosure.
  • FIGURE 4 depicts phoneme error rates (PERs) of the testing experiment using Gaussian mixture model (GMM)-, deep neural network (DNN)-, long-short term memory (LSTM)-, bidirectional long-short term memory (BLSTM) with standard and adapted hidden Markov models.
  • FIGURE 5 depicts word error rates (WERs) of the testing experiment using Gaussian mixture model (GMM)-, deep neural network (DNN)-, long-short term memory (LSTM)-, bidirectional long-short term memory (BLSTM) with standard and adapted hidden Markov models.
  • ratios, concentrations, amounts, and other numerical data can be expressed herein in a range format. It can be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint. It is also understood that there are a number of values disclosed herein, and that each value is also herein disclosed as “about” that particular value in addition to the value itself. For example, if the value “10” is disclosed, then “about 10” is also disclosed. Ranges can be expressed herein as from “about” one particular value, and/or to “about” another particular value. Similarly, when values are expressed as approximations, by use of the antecedent “about,” it can be understood that the particular value forms a further aspect. For example, if the value “about 10” is disclosed, then “10” is also disclosed.
  • a further aspect includes from the one particular value and/or to the other particular value.
  • ranges excluding either or both of those included limits are also included in the disclosure, e.g. the phrase “x to y” includes the range from ‘x’ to ‘y’ as well as the range greater than ‘x’ and less than ‘y’.
  • the range can also be expressed as an upper limit, e.g., 'about x, y, z, or less', and should be interpreted to include the specific ranges of 'about x', 'about y', and 'about z' as well as the ranges of 'less than x', 'less than y', and 'less than z'.
  • the phrase 'about x, y, z, or greater' should be interpreted to include the specific ranges of 'about x', 'about y', and 'about z' as well as the ranges of 'greater than x', 'greater than y', and 'greater than z'.
  • the phrase "about 'x' to 'y'", where 'x' and 'y' are numerical values, includes "about 'x' to about 'y'".
  • a numerical range of "about 0.1% to 5%" should be interpreted to include not only the explicitly recited values of about 0.1% to about 5%, but also the individual values (e.g., about 1%, about 2%, about 3%, and about 4%) and the sub-ranges (e.g., about 0.5% to about 1.1%; about 0.5% to about 2.4%; about 0.5% to about 3.2%; and about 0.5% to about 4.4%, and other possible sub-ranges) within the indicated range.
  • the terms "about," "approximate," "at or about," and "substantially" mean that the amount or value in question can be the exact value or a value that provides equivalent results or effects as recited in the claims or taught herein. That is, it is understood that amounts, sizes, formulations, parameters, and other quantities and characteristics are not and need not be exact, but may be approximate and/or larger or smaller, as desired, reflecting tolerances, conversion factors, rounding off, measurement error and the like, and other factors known to those of skill in the art such that equivalent results or effects are obtained. In some circumstances, the value that provides equivalent results or effects cannot be reasonably determined.
  • FIG. 1A depicts an example method 100 which uses a trained AI model 102 to determine estimated text of dysarthric speech 104.
  • the trained AI model 102 can be configured using a training dysarthric speech dataset acquired from a speaker of at least 40,000 words (e.g., at least 50,000 words, at least 60,000 words, at least 70,000 words, at least 80,000 words, at least 90,000 words, at least 100,000 words) and at least 187,000 phonemes (e.g., at least 190,000 phonemes, at least 195,000 phonemes, at least 200,000 phonemes, at least 205,000 phonemes, at least 210,000 phonemes, at least 215,000 phonemes, at least 220,000 phonemes, at least 225,000 phonemes).
  • the trained AI model can be configured using a training dysarthric speech dataset acquired from a speaker of at least 20 hours (e.g., at least 25 hours, at least 30 hours, at least 35 hours, at least 40 hours, at least 45 hours, at least 50 hours).
  • the trained AI model can be configured using a training dysarthric speech dataset acquired from a speaker of at least 40,000 words, 187,000 phonemes, and at least 20 hours.
  • the estimated text of the dysarthric speech 104 can have a speech recognition accuracy greater than 85% (e.g., greater than 86%, greater than 87%, greater than 88%, greater than 89%, greater than 90%, greater than 91%, greater than 92%, greater than 93%, greater than 94%, greater than 95%, greater than 96%, greater than 97%, greater than 98%, greater than 99%).
  • the trained AI model 102 can be configured using a training dysarthric speech dataset, acquired from two or more dysarthric patients with various voice pitches and accents, having a duration of at least 5 hours with at least 1000 words (e.g., at least 2000 words, at least 3000 words, at least 4000 words, at least 5000 words, at least 6000 words) and at least 4500 phonemes (e.g., at least 5500 phonemes, at least 6500 phonemes, at least 7500 phonemes, at least 8500 phonemes, at least 9500 phonemes).
  • Each of the two or more dysarthric patients whose voices are used for generating the training dataset can have (i) a normal condition and produce speech at a rate of 161-200 words/minute, (ii) a moderate Dysarthric condition and produce speech at a rate of 121-160 words/minute, or (iii) a severe Dysarthric condition and produce speech at a rate of fewer than 120 words/minute.
  • a recording 106 (e.g., dysarthric speech recording or audio stream) of audio data having low-intelligibility, severe dysarthric speech is collected or obtained.
  • the recording can be generated by the same speaker who provided the training dysarthric speech dataset.
  • the recording can be generated by a different speaker.
  • the recording 106 undergoes pre-processing 108, e.g., filtering and normalization.
  • the audio then undergoes segmentation 110, which segments the audio into distinct silence and non-silence phonemes.
  • the silence phonemes can be determined using a pause detector 112, which can be used to establish the beginning and end of each non-silence phoneme (an illustrative pause-detection sketch is provided after the embodiments at the end of this list).
  • the segmented speech is then input into the trained AI model 102, which outputs the estimated text of the dysarthric speech 104.
  • the trained AI model 102 can include a trained hidden Markov model (HMM).
  • This hidden Markov model can employ at least three states (e.g., at least four states, at least five states) to represent nonsilence phonemes and at least five states to represent silence phonemes.
  • the hidden Markov model can employ at least six states (e.g., at least seven states, at least eight states, at least nine states, at least ten states) to represent each of the nonsilence and silence phonemes.
  • the hidden Markov model can further employ a Gaussian mixture model (GMM), deep neural network (DNN), long-short term memory (LSTM)-recurrent neural network, and/or bidirectional-LSTM (BLSTM) to assist with speech recognition.
  • the neural network (e.g., artificial neural network, convolutional neural network) employed by the hidden Markov model (of the trained AI model 102) can be a computing system including a plurality of interconnected neurons (e.g., also referred to as "nodes").
  • the nodes can be arranged in a plurality of layers, such as an input layer, an output layer, and optionally one or more hidden layers with different activation functions.
  • An artificial neural network (ANN) with hidden layers can be referred to as a deep neural network or multilayer perceptron (MLP).
  • Each node is connected to one or more other nodes in the ANN.
  • each layer is made of a plurality of nodes, where each node is connected to all nodes in the previous layer.
  • nodes in a given layer are not interconnected with one another, i.e., the nodes in a given layer function independently of one another.
  • nodes in the input layer receive data from outside of the ANN
  • nodes in the hidden layer(s) modify the data between the input and output layers
  • nodes in the output layer provide the results.
  • Each node is configured to receive an input, implement an activation function (e.g., binary step, linear, sigmoid, tanh, or rectified linear unit (ReLU) function), and provide an output in accordance with the activation function. Additionally, each node is associated with a respective weight.
  • ANNs are trained with a dataset to maximize or minimize an objective function.
  • the objective function is a cost function, which is a measure of the ANN's performance (e.g., error such as L1 or L2 loss) during training, and the training algorithm tunes the node weights and/or bias to minimize the cost function.
  • any algorithm that finds the maximum or minimum of the objective function can be used for training the ANN.
  • Training algorithms for ANNs include but are not limited to backpropagation.
  • an artificial neural network is provided only as an example machine learning model.
  • the machine learning model can be any supervised learning model, semi-supervised learning model, or unsupervised learning model.
  • the machine learning model is a deep learning model. Machine learning models are known in the art and are therefore not described in further detail herein.
  • a convolutional neural network is a type of deep neural network that has been applied, for example, to image analysis applications. Unlike traditional neural networks, each layer in a CNN has a plurality of nodes arranged in three dimensions (width, height, depth). CNNs can include different types of layers, e.g., convolutional, pooling, and fully connected (also referred to herein as "dense") layers.
  • a convolutional layer includes a set of filters and performs the bulk of the computations.
  • a pooling layer is optionally inserted between convolutional layers to reduce the computational power and/or control overfitting (e.g., by downsampling).
  • a fully connected layer includes neurons, where each neuron is connected to all of the neurons in the previous layer. The layers are stacked similarly to traditional neural networks.
  • GCNNs are CNNs that have been adapted to work on structured datasets such as graphs.
  • FIG. 1B depicts an example implementation of the example method 100.
  • This implementation uses a user device 114 to record dysarthric speech and cloud infrastructure 116 to determine the estimated text of the dysarthric speech 104.
  • the user device 114 uses a microphone 118 to capture a recording or audio stream of the dysarthric speech.
  • the recording or audio stream is transferred to the cloud infrastructure 116 via a device-based network interface 120 in communication with a cloud-based network interface 122.
  • the recording or audio stream can be temporarily placed in cloud storage 124; however, in other aspects, the recording or audio stream may be immediately processed.
  • the recording or audio stream is then subjected to the example method 100: the recording is pre-processed 108, segmented 110 using a pause detector 112, and then input into a trained AI model 102, which outputs the estimated text of the dysarthric speech 104.
  • FIG. 1C depicts another example implementation of the example workflow 100.
  • This implementation uses a local processing device 126 (e.g., smartphone, smartwatch, etc.) to record dysarthric speech and determine the estimated text of the dysarthric speech.
  • the local processing device 126 uses a microphone to capture a recording or audio stream of the dysarthric speech.
  • the recording or audio stream is then subjected to the example workflow 100: the recording is pre-processed 108, segmented 110 using a pause detector 112, and then input into a trained AI model 102, which outputs the estimated text of the dysarthric speech 104.
  • the estimated speech can be used in a user-interface controller, e.g., for navigation devices such as GPS navigation, or in a vehicle controller to do the same.
  • FIG. 1D depicts another example implementation of the example workflow 100, where the trained AI model 102 (shown as 102') was trained in a training system 120 using a transfer learning method, also referred to as a transfer-learning-based training system 120.
  • the transformer-based model 122 was input into the training system 120.
  • the transformer-based model 122 was then trained, in a training process 124, on a healthy speech dataset 126 (e.g., having tens of thousands of hours of speech).
  • the transformer-based model 122 (e.g., large-scale pretrained model) was then finetuned, in a finetuning operation 128, on a smaller dysarthric speech dataset (e.g., having less than 1 hour of speech).
  • the model 122 was generalized to various patients with minimal data during the finetuning operation 128.
  • the transformer-based model was fully trained, as a trained AI model 102', and could recognize various speech (e.g., having varying accents, speeds, etc.) when employed in the exemplary system.
  • FIG. 3 depicts an example method for determining estimated text of dysarthric speech.
  • audio data (e.g., a recording or audio stream in a controller) having low-intelligibility, severe dysarthric speech is received 302 by a processor.
  • a plurality of estimated speech values for the severe, low-intelligibility dysarthric speech of the audio data is determined 304 via a trained model.
  • the trained model may have been configured using a training dysarthric speech dataset acquired from a single speaker of at least [40,000 words and 187,000 phonemes; 20 hours].
  • the determined plurality of estimated speech values is output 306.
  • the determined plurality of estimated speech values can be used for control or voice synthesis.
  • Personalized implementation can be an effective approach when developing high-performance speech recognizers for speakers with dysarthria by (a) using speaker-specific training data for each speaker and/or (b) applying suitable model architectures to each speaker.
  • collecting large dysarthric speech data from single speakers is generally a challenging task. Therefore, non-personalized, speaker-independent speech recognition systems are currently more popular; in these systems, speaker normalization or adaptation approaches are usually applied.
  • Popular adaptation approaches include transforming acoustic features or adjusting model parameters trained on multiple typical speakers to a specific speaker using a small number of adaptation utterances from a dysarthric speaker.
  • a speaker-dependent speech recognition system, which is trained solely on the individual, would mostly outperform a speaker-independent system given enough training data and a similar experimental setup by effectively modeling the intra-speaker variability. This is not the case for a speaker-dependent system trained on a small dysarthric dataset. Speakers with dysarthria still show relatively consistent articulatory errors (Whurr, 1988), and so it is assumed that a speaker-dependent speech recognizer for an individual with a large training set would be promising for speakers with dysarthria.
  • the feasibility of speaker-dependent models for dysarthric speech recognition has rarely been explored due to the difficulties in collecting large datasets from single speakers with dysarthria.
  • UA speech corpus includes data from 19 speakers with cerebral palsy; speech materials consist of 765 isolated words per speaker.
  • the Nemours dysarthric speech data includes a collection of 740 short nonsense sentences spoken by 10 male speakers with varying degrees of dysarthria that have been marked at the phoneme-level (Menendez-Pidal et al., 1996).
  • TORGO corpus provides both acoustic and articulatory speech data from seven individuals with speech impediments caused by cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS) and age- and gender-matched control subjects (Rudzicz et al., 2010).
  • researchers collected dysarthric speech data for their studies (Kim et al., 2017, 2018, 2019; Shor et al., 2019). Kim et al.
  • Rudzicz (2007) compared the speaker-dependent and speaker-adaptive approaches in dysarthric speech recognition with a speaker-independent model trained on the Wall Street Journal (WSJ) corpus (Lamere et al., 2003). These speaker-dependent studies all used limited datasets (e.g., TORGO or Nemours).
  • current studies on dysarthric speech recognition are mainly focused on speaker-independent or speaker-adaptation approaches. Although there are a few studies on speaker-dependent dysarthric speech recognition, the data is limited and performances are unsatisfactory (Rudzicz, 2007). Large data collected from single speakers with dysarthria are needed to explore the speaker-dependent approach and further improve the performance (Takashima et al., 2019; Vachhani et al., 2018; Yue et al., 2020).
  • the developmental experiment was conducted, which was speaker-dependent speech recognition using Gaussian mixture model-hidden Markov model speech recognizers.
  • hidden Markov models with different numbers of states were explored to find a hidden Markov model state setup that better fit the lower speaking rate of the speaker.
  • the testing experiment was performed, which was speaker-dependent speech recognition using standard and adapted hidden Markov models.
  • the standard hidden Markov model used three and five states to represent nonsilence and silence phonemes, respectively, which is a conventional setup.
  • the adapted hidden Markov model used the numbers of states that were found in the developmental experiment.
  • acoustic models were investigated in this testing experiment, including hidden Markov models with the Gaussian mixture model (GMM), deep neural network (DNN), more recent long-short term memory (LSTM)-recurrent neural network, and bidirectional-LSTM (BLSTM).
  • Participants: A single participant collected and donated data on himself for the study. He is a 30-year-old male, native American English speaker with athetoid cerebral palsy. The speaking rate of the participant was about 37 words/min. His speech intelligibility was 20.9% (average), which was measured using the standard Sentence Intelligibility Test (SIT) by two speech-language pathologists who were blind to the participant. Compared to typical speech, the participant exhibited a lower speaking rate (100-130 words/minute for the average person). In addition, the participant produced unintelligible speech due to poor articulation, hoarseness, and breathy voice, but showed his own pronunciation patterns.
  • a desktop computer (Dell Precision 7820 tower, 64 gigabytes of RAM, Intel Xeon Silver 4114 central processing unit) with graphics processing unit support (Nvidia Quadro P4000) was used for training deep learning-based speech recognition models.
  • Stimuli: The stimuli included the Harvard sentences (IEEE, 1969) and humanities research papers and essays the participant had authored. A total of 2,708 phrases, 46,634 words (with 6,397 unique words), and 187,549 phonemes (with 39 unique phonemes) were included in the stimuli.
  • Automatic Speech Recognition (ASR) Models: hidden Markov model (HMM), deep neural network (DNN), long short-term memory (LSTM), and bidirectional-LSTM (BLSTM).
  • the hidden Markov model uses hidden states to model the probability transitions between sub-word units (e.g., phonemes), representing each phoneme by multiple states so that the stages (e.g., beginning and end) of pronouncing the phoneme can be isolated.
  • a popular scheme is to use three states for nonsilence phonemes and five states for silence.
  • the hidden Markov models in this study included 776 tied-state (senone) left-to-right triphone hidden Markov models for the baseline experiment, and 800 triphone hidden Markov models for the adapted models. The senones were obtained using a state tying method based on the decision tree (Reichl & Wu Chou, 2000).
  • the Gaussian mixture model-hidden Markov model (Xuan et al., 2001) was a long-standing speech recognition model before deep learning-based models appeared; it models the distribution of acoustic features for phonemes with a weighted mixture of Gaussian distributions (Yu & Deng, 2014).
  • the deep neural network-hidden Markov models (Yu & Deng, 2014) adopted a deep neural network to model the phoneme distributions. The networks were trained using the Mel-frequency cepstral coefficients with a context window of nine frames (four previous + one current + four succeeding frames), so that contextual information was used in each frame (an illustrative frame-splicing sketch is provided after the embodiments at the end of this list). Deep neural networks with one to six hidden layers and 256 to 1,024 hidden units at each layer had been explored in preliminary experiments. The best results were achieved by using six hidden layers with a dimension of 512 at each layer.
  • the (bidirectional) long short-term memory-hidden Markov models use (bidirectional) long short-term memory to model the distribution of phonemes on the acoustic features.
  • the long short-term memory (LSTM) is a type of recurrent neural network (Sak et al., 2014) that has memory blocks containing long and short-term previous input information as a component of the current input.
  • the bidirectional LSTM (BLSTM) processes data in both forward and backward directions in time with two separate hidden layers. Both the previous input information and the following backward input are used. Therefore, the BLSTM is expected to perform better than the LSTM.
  • the input acoustic features for all speech recognition models were the Mel-frequency cepstral coefficients (MFCCs) (Davis & Mermelstein, 1980), which included 12 cepstral coefficients and one energy term.
  • the combined 13-dimensional MFCCs and their first and second derivatives (39-dim.) were used as the input acoustic features of the recognizers (an illustrative feature-extraction sketch is provided after the embodiments at the end of this list).
  • the frame length of feature extraction was 25 ms, and the frame rate was 10 ms.
  • a bigram phoneme-level/word-level language model (Yu & Deng, 2014) was trained by using the training dataset of each experiment. All experiments in this study were implemented with the Kaldi speech recognition toolkit (Povey et al., 2011).
  • a summary of the speech recognition models setup is shown in TABLE 1.
  • Note for TABLE 1: MFCC = Mel-frequency cepstral coefficient; GMM = Gaussian mixture model; DNN = deep neural network; LSTM = long short-term memory; BLSTM = bidirectional long short-term memory; HMM = hidden Markov model; RBM = restricted Boltzmann machine.
  • Procedure: Experiments were conducted in this study, including a developmental experiment, a testing experiment, and a baseline (comparison) experiment.
  • the developmental and testing experiments were speaker-dependent speech recognition for dysarthric speech, in which only the data from the participants were used for speech recognizer training.
  • the developmental experiment adopted part of the data to find optimal hidden Markov model (HMM) state numbers for the participant.
  • the testing experiment verified the performance of using the state numbers found in the developmental experiment (adapted HMM) and compared it with the standard HMM.
  • the baseline experiment was speaker-independent (the nonpersonalized design) with the standard hidden Markov model, in which a large amount of typical speech data from multiple speakers was used for training the speech recognizer.
  • the LibriSpeech corpus is a collection of approximately 1,000 hours of audiobooks (Panayotov et al., 2015), which is a publicly available typical speech dataset. In this study, the training set of 100 hours of "clean" speech data (from 252 speakers, sampling rate 16,000 Hz) was used in the baseline experiment.
  • the testing experiment used all the data in the development set (13.48 hr), plus 600 additional phrases (5.9 hr) as the training and validation set, respectively (19.38 hr total). The remaining 200 phrases (2.28 hr, not included in development) were used as the final testing set. The total number of words tested was 4,482; the number of phonemes tested was 20,942.
  • TABLE 3 lists the most frequent 20 phoneme substitutions, deletions, and insertions of the bidirectional long short-term memory-hidden Markov model speech recognizer with and without adapted hidden Markov model states. Most of the phoneme recognition errors were significantly decreased by using the adapted hidden Markov model. In particular, most vowel substitutions were significantly reduced. This indicates that the increased number of hidden Markov model states can effectively model prolonged vowel patterns. There were exceptions, where the numbers of errors increased or only slightly decreased (e.g., the "ao/aa" substitution and the insertion of /n/).
  • the personalized lexicon approach (Mengistu & Rudzicz, 2011) was validated by using a phoneme-level speech recognizer to find the articulatory deviation patterns; however, due to intra-speaker variability (inconsistent pronunciations of the same sounds within a single speaker) in the data, the lexicon-based approach was not suitable for the participant in this study.
  • the participant's attempts to reach articulatory targets were not consistent across contexts. For example, the participant was able to accurately produce /t/ in all word positions (initial, medial, and final), yet missed the articulatory target in some speech contexts.
  • Ranges can be expressed herein as from “about” one particular value and/or to “about” another particular value. When such a range is expressed, another aspect includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by use of the antecedent “about,” it will be understood that the particular value forms another aspect. It should be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint.
  • Embodiment 1 A system for recognizing a speech of a single speaker comprising: a processor; a memory having instructions stored thereon, wherein execution of the instructions by the processor causes the processor to: receive audio data having low-intelligibility, severe dysarthric speech; determine, via a trained model, a plurality of estimated speech values for the severe, low-intelligibility dysarthric speech of the audio data, wherein the trained model was configured using a training dysarthric speech dataset acquired from one or more speakers of at least 1000 words and 4500 phonemes; and output the determined plurality of estimated speech values, wherein the determined plurality of estimated speech values is used for control or voice synthesis.
  • Embodiment 2 The system of the Embodiment 1, wherein the training dysarthric speech dataset has at least 40,000 words and 187,000 phonemes.
  • Embodiment 3 The system of any one of Embodiments 1-2, wherein the trained model comprises a trained hidden Markov model.
  • Embodiment 4 The system of any one of Embodiments 1-3, wherein the trained model comprises a transformer-based model (i) trained on a healthy speech dataset and (ii) finetuned on a smaller dysarthric speech dataset.
  • Embodiment 5 The system of any one of Embodiments 1-4, wherein the determined plurality of estimated speech values is used for health status monitoring and dysarthria condition analysis.
  • Embodiment 6 The system of any one of Embodiments 1-5, wherein the training dysarthric speech data comprises a non-dysarthric speech having 161 to 200 words per minute, a moderate dysarthric speech having 121 to 160 words per minute, a severe dysarthric speech having less than 120 words per minute, or a combination thereof.
  • Embodiment 7 The system of any one of Embodiments 1-6, wherein the audio data is of the single speaker, and wherein the estimated speech values are for the single speaker.
  • Embodiment 8 The system of any one of Embodiments 1-7, wherein the audio data is of a speaker different from the single speaker, and wherein the estimated speech values are for the different speaker.
  • Embodiment 9 The system of any one of Embodiments 1-8, wherein the trained model was characterized as having an accuracy for speech recognition greater than 85%.
  • Embodiment 10 The system of any one of embodiments 1-9, further comprising: directing the plurality of estimated speech values to a human-machine interface.
  • Embodiment 11 The system of any one of Embodiments 1-10, further comprising a voice synthesis system and speaker.
  • Embodiment 12 The system of any one of Embodiments 3-11, wherein the hidden Markov model employed at least three states and five states to represent non-silence phonemes and silence phonemes, respectively.
  • Embodiment 13 The system of any one of Embodiments 3-12, wherein the hidden Markov model employed at least six states to represent each of non-silence and silence phonemes.
  • Embodiment 14 The system of any one of Embodiments 3-13, wherein the hidden Markov model employed a Gaussian mixture model (GMM), deep neural network (DNN), long-short term memory (LSTM)-recurrent neural network, and bidirectional-LSTM (BLSTM).
  • Embodiment 15 A method for recognizing a speech of a single speaker comprising: receiving, by a processor, audio data having low-intelligibility, severe dysarthric speech; determining, via a trained model, a plurality of estimated speech values for the severe, low-intelligibility dysarthric speech of the audio data, wherein the trained model was configured using a training dysarthric speech dataset acquired from one or more speakers of at least 1000 words and 4500 phonemes; and outputting the determined plurality of estimated speech values, wherein the determined plurality of estimated speech values is used for control or voice synthesis.
  • Embodiment 16 The method of Embodiment 15, wherein the training dysarthric speech dataset has at least 40,000 words and 187,000 phonemes.
  • Embodiment 17 The method of any one of Embodiments 15-16, wherein the trained model comprises a trained hidden Markov model.
  • Embodiment 18 The method of any one of Embodiments 15-17, wherein the training dysarthric speech data comprises a non-dysarthric speech having 161 to 200 words per minute, a moderate dysarthric speech having 121 to 160 words per minute, a severe dysarthric speech having less than 120 words per minute, or a combination thereof.
  • Embodiment 19 The method of any one of Embodiments 15-18, wherein the audio data is of the single speaker, and wherein the estimated speech values are for the single speaker.
  • Embodiment 20 The method of any one of Embodiments 15-19, wherein the audio data is of a speaker different from the single speaker, and wherein the estimated speech values are for the different speaker (e.g., wherein the different speaker has dysarthria).
  • Embodiment 21 The method of any one of Embodiments 15-20, wherein the trained model was characterized as having an accuracy for speech recognition greater than 85%.
  • Embodiment 22 The method of any one of Embodiments 15-21, further comprising: directing the plurality of estimated speech values to a human-machine interface.
  • Embodiment 23 The method of any one of Embodiments 17-22, wherein the hidden Markov model employed at least three states and five states to represent non-silence phonemes and silence phonemes, respectively.
  • Embodiment 24 The method of any one of Embodiments 17-23, wherein the hidden Markov model employed at least six states to represent each of non-silence and silence phonemes.
  • Embodiment 25 The method of any one of Embodiments 17-24, wherein the hidden Markov model employed a Gaussian mixture model (GMM), deep neural network (DNN), long-short term memory (LSTM)-recurrent neural network, and bidirectional-LSTM (BLSTM).
  • Embodiment 26 A non-transitory computer-readable medium having instructions stored thereon, wherein execution of the instructions by a processor causes the processor to: receive audio data having low-intelligibility, severe dysarthric speech; determine, via a trained model, a plurality of estimated speech values for the severe, low-intelligibility dysarthric speech of the audio data, wherein the trained model was configured using a training dysarthric speech dataset acquired from one or more speakers of at least 1000 words and 4500 phonemes; and output the determined plurality of estimated speech values, wherein the determined plurality of estimated speech values is used for control or voice synthesis.
  • Embodiment 27 A method for operating the system of any one of Embodiments 1-14.
  • Embodiment 28 A non-transitory computer-readable medium having instructions stored thereon, wherein execution of the instructions by a processor causes the processor to operate the system of any one of Embodiments 1-14 or perform the method of any one of Embodiments 15-25.
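The following non-limiting sketches illustrate, under stated assumptions, several of the processing steps referenced in the workflow and experiments above; they are explanatory aids only and are not part of the disclosed embodiments. This first sketch shows one simple way the segmentation 110 and pause detector 112 of FIG. 1A could be realized with an energy-based silence detector; the frame sizes and threshold are assumptions rather than values from the disclosure.

```python
# Illustrative sketch only: a simple energy-based pause detector of the kind that
# could implement the segmentation (110) and pause-detection (112) steps above.
# The threshold, frame length, and hop size are assumptions, not disclosed values.
import numpy as np

def detect_pauses(signal, sample_rate, frame_ms=25, hop_ms=10, energy_floor=0.01):
    """Return a boolean array marking frames judged to be silence (pauses)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    energies = np.array([
        np.mean(signal[i * hop_len:i * hop_len + frame_len] ** 2)
        for i in range(n_frames)
    ])
    # Frames whose energy falls below a fraction of the median energy are treated
    # as silence; runs of silent frames delimit the non-silence segments.
    threshold = energy_floor * np.median(energies)
    return energies < threshold

def segments_from_pauses(is_pause, hop_ms=10):
    """Convert the frame-level pause mask into (start_ms, end_ms) speech segments."""
    segments, start = [], None
    for i, silent in enumerate(is_pause):
        if not silent and start is None:
            start = i
        elif silent and start is not None:
            segments.append((start * hop_ms, i * hop_ms))
            start = None
    if start is not None:
        segments.append((start * hop_ms, len(is_pause) * hop_ms))
    return segments
```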
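This second sketch illustrates extraction of the 39-dimensional acoustic features described for the experiments (13 MFCCs plus first and second derivatives, 25 ms frame length, 10 ms frame rate). The use of librosa is an assumption made for illustration; the study itself used the Kaldi toolkit.

```python
# Illustrative sketch only: 13 MFCCs plus delta and delta-delta features,
# 25 ms frames with a 10 ms frame rate, yielding 39-dimensional vectors.
import librosa
import numpy as np

def extract_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)          # 16 kHz, as in LibriSpeech
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr),                        # 25 ms frame length
        hop_length=int(0.010 * sr),                   # 10 ms frame rate
    )
    delta = librosa.feature.delta(mfcc)               # first derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)     # second derivatives
    return np.vstack([mfcc, delta, delta2]).T         # shape: (frames, 39)
```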
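This third sketch illustrates the nine-frame context window (four previous + one current + four succeeding frames) described for the DNN-hidden Markov model input. Padding the edge frames by repetition is an assumption; the disclosure does not specify a padding scheme.

```python
# Illustrative sketch only: splicing a nine-frame context window around each
# acoustic frame before it is fed to the DNN acoustic model.
import numpy as np

def splice_frames(features: np.ndarray, left: int = 4, right: int = 4) -> np.ndarray:
    """features: (num_frames, dim) -> (num_frames, dim * (left + 1 + right))."""
    num_frames, _ = features.shape
    padded = np.concatenate([
        np.repeat(features[:1], left, axis=0),    # repeat first frame on the left
        features,
        np.repeat(features[-1:], right, axis=0),  # repeat last frame on the right
    ])
    return np.hstack([padded[i:i + num_frames] for i in range(left + 1 + right)])

# Example: 39-dim MFCC+delta features spliced into 351-dim DNN inputs.
spliced = splice_frames(np.zeros((100, 39)))
assert spliced.shape == (100, 351)
```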

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

An exemplary system and method employing an adapted hidden Markov model-based speech recognition system, developed from a lengthy and diverse dysarthric speech dataset.

Description

PERSONALIZED DYSARTHRIC SPEECH RECOGNITION
GOVERNMENT SUPPORT CLAUSE
[0001] This invention was made with government support under Grant No. R01 DC016621 awarded by the National Institutes of Health. The government has certain rights in the invention.
RELATED APPLICATION
[0002] This PCT application claims priority to, and the benefit of, U.S. Provisional Patent Application No. 63/658,764, filed June 11, 2024, entitled “PERSONALIZED DYSARTHRIC SPEECH RECOGNITION,” which is incorporated by reference herein in its entirety.
BACKGROUND
[0003] Dysarthria is a collection of motor speech disorders resulting from neurological injury to the motor speech system. People with dysarthria cannot control their motor subsystems, including respiration, phonation, resonance, articulation, and prosody. Speech in people with dysarthria is typically characterized by poor articulation, breathy voice, and monotonic intonation. Often, speech intelligibility is reduced in proportion to the severity of dysarthria.
[0004] Despite the commercial success of automatic speech recognition (ASR) for typical speech, current ASR still exhibits unsatisfactory performance on dysarthric speech, with a high error rate, particularly for low-intelligibility dysarthric speech.
[0005] There would be a benefit to improving speech recognition systems for people with dysarthria.
SUMMARY
[0006] An exemplary system and method are disclosed employing an adapted hidden Markov model-based speech recognition system, developed from a lengthy and diverse dysarthric speech dataset. Due to the difficulties in collecting large dysarthric speech datasets for training, prior Markov and artificial intelligence (AI) based systems have produced unsatisfactory results. To date, there has been no showing that Markov model-based speech recognition or AI could produce a commercially acceptable speech recognition system for dysarthria.
[0007] A study was conducted that acquired at least 5 hours of Dysarthria speech training data comprising normal speech (e.g., 161-200 words/minute), moderate Dysarthric speech (e.g., 121-160 words/minute), severe Dysarthric speech (e.g., fewer than 120 words/minute), or a combination thereof, from one or more patients with various voice pitches and accents, having at least 1000 words and 4500 phonemes (e.g., preferably 20 hours of Dysarthria-speech training data having 40,000 words and 187,000 phonemes). The study demonstrated that the lengthy and diverse training dataset for Dysarthria speech can achieve higher word accuracy than human listening. Specifically, the study implemented speaker-dependent speech recognition experiments with an adapted hidden Markov model to reflect the speaker’s (lower) speaking rate on 21 hours of dysarthric speech recordings from a single speaker. A baseline speaker-independent experiment was implemented by adding 100 hours of typical speech data from the LibriSpeech dataset to the training set. The results showed the efficacy of both the speaker-dependent design and the adaptation of the hidden Markov model. An 86.6% word accuracy was achieved. This is significantly higher than human listening (20.9%) and the baseline speaker-independent, non-personalized design (7.3%), observed as comparisons in the study. Severe dysarthric speech is typically considered unintelligible speech.
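For reference only, word accuracy figures such as the 86.6% reported above are conventionally derived from the word error rate (WER) obtained by aligning the recognizer output against the reference transcript. The following minimal sketch computes WER with a standard Levenshtein alignment and reports word accuracy as 1 - WER; it is a generic illustration, not the Kaldi scoring pipeline actually used in the study.

```python
# Illustrative sketch only: word error rate (WER) via edit-distance alignment,
# with word accuracy taken as 1 - WER.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words into the
    # first j hypothesis words (substitutions, deletions, insertions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution in a five-word reference gives 20% WER (80% accuracy).
wer = word_error_rate("please turn the light on", "please turn a light on")
print(f"WER = {wer:.1%}, word accuracy = {1 - wer:.1%}")
```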
[0008] In an aspect, provided is a system for recognizing a speech of a single speaker comprising: a processor; a memory having instructions stored thereon, wherein execution of the instructions by the processor causes the processor to: receive audio data (e.g., recording or audio stream in a controller) having low-intelligibility, severe dysarthric speech; determine, via a trained model, a plurality of estimated speech values for the severe, low-intelligibility dysarthric speech of the audio data, wherein the trained model was configured using a training dysarthric speech dataset acquired from one or more speakers of at least 1000 words and 4500 phonemes; and output the determined plurality of estimated speech values, wherein the determined plurality of estimated speech values is used for control or voice synthesis.
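As a non-limiting illustration of the receive / determine / output flow recited above, the following sketch shows how such a system could be organized in software. The recognizer object, its recognize method, and the downstream handler are hypothetical placeholders and are not part of the disclosure.

```python
# Illustrative sketch only: receive audio data, determine estimated speech values
# via a trained model, and output them to a downstream consumer.
from typing import Callable, List, Protocol

class Recognizer(Protocol):
    # Hypothetical interface standing in for the trained model (e.g., the
    # adapted HMM recognizer or a finetuned transformer).
    def recognize(self, samples: List[float], sample_rate: int) -> str: ...

def run_recognition(samples: List[float],
                    sample_rate: int,
                    model: Recognizer,
                    on_text: Callable[[str], None]) -> str:
    """Receive audio, determine estimated speech values, and output them."""
    # Determine the estimated speech values (here, an estimated word sequence)
    # for the low-intelligibility dysarthric speech in the audio data.
    text = model.recognize(samples, sample_rate)
    # Output the result to a consumer such as a human-machine interface,
    # a device controller, or a voice-synthesis system.
    on_text(text)
    return text
```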
[0009] In some embodiments, the training dysarthric speech dataset has at least 40,000 words and 187,000 phonemes.
[0010] In some embodiments, the trained model comprises a trained hidden Markov model.
[0011] In some embodiments, the trained model comprises a transformer-based model (i) trained on a healthy speech dataset (e.g., acquired from one or more healthy speakers of at least 10,000 hours of speech) and (ii) finetuned on a smaller dysarthric speech dataset (e.g., acquired from one or more dysarthric speakers of less than 1 hour of speech).
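The following sketch illustrates one plausible way to realize the pretrain-then-finetune approach of paragraph [0011]: a large pretrained transformer acoustic model is adapted on a small dysarthric speech set. The choice of the Hugging Face wav2vec 2.0 checkpoint, the CTC objective, and the freezing strategy are assumptions for illustration; the disclosure only requires a transformer pretrained on healthy speech and finetuned on dysarthric speech.

```python
# Illustrative sketch only: finetuning a pretrained transformer acoustic model on
# a small dysarthric speech set. Checkpoint name and objective are assumptions.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.train()

# Freeze the convolutional feature encoder; only the transformer layers and the
# CTC head are updated, a common practice when adaptation data is scarce.
for p in model.wav2vec2.feature_extractor.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5)

def finetune_step(waveform, transcript):
    """One gradient step on a single (audio, text) pair from the dysarthric set."""
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    labels = processor.tokenizer(transcript.upper(), return_tensors="pt").input_ids
    loss = model(input_values=inputs.input_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```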
[0012] In some embodiments, the determined plurality of estimated speech values is used for health status monitoring and dysarthria condition analysis.
[0013] In some embodiments, the training dysarthric speech data comprises a non-dysarthric (i.e., normal) speech having 161 to 200 words per minute, a moderate dysarthric speech having 121 to 160 words per minute, a severe dysarthric speech having less than 120 words per minute, or a combination thereof.
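For illustration, the speaking-rate categories stated above can be expressed as a simple threshold rule. The function name, the treatment of the boundary at 120 words per minute, and the handling of rates above 200 words per minute are assumptions, since the disclosure only states the three ranges.

```python
# Illustrative sketch only: categorize speech by speaking rate using the stated
# thresholds (161-200 wpm non-dysarthric, 121-160 wpm moderate, <120 wpm severe).
def dysarthria_severity(word_count: int, duration_minutes: float) -> str:
    wpm = word_count / duration_minutes
    if wpm < 120:
        return "severe dysarthric speech"
    if wpm <= 160:
        return "moderate dysarthric speech"
    return "non-dysarthric (normal) speech"

# Example: the study participant produced roughly 37 words per minute,
# which falls well within the severe category.
print(dysarthria_severity(37, 1.0))
```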
[0014] In some embodiments, the audio data is of the single speaker, and wherein the estimated speech values are for the single speaker.
[0015] In some embodiments, the audio data is of a speaker different from the single speaker, and the estimated speech values are for the different speaker (e.g., wherein the different speaker has dysarthria).
[0016] In some embodiments, the trained model was characterized as having an accuracy for the speech recognition greater than 85%.
[0017] In some embodiments, the system further comprises directing the plurality of estimated speech values to a human-machine interface.
[0018] In some embodiments, the system further comprises a voice synthesis system and speaker.
[0019] In some embodiments, the hidden Markov model employed at least three states and five states to represent nonsilence phonemes and silence phonemes, respectively.
[0020] In some embodiments, the hidden Markov model employed at least six states to represent each of non-silence and silence phonemes.
[0021] In some embodiments, the hidden Markov model employed a Gaussian mixture model (GMM), deep neural network (DNN), long-short term memory (LSTM)-recurrent neural network, and bidirectional-LSTM (BLSTM).
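To illustrate the state-count adaptation of paragraphs [0019] and [0020], the sketch below builds left-to-right transition matrices for the standard topology (three states per non-silence phoneme, five for silence) and for an adapted topology with six states per phoneme, which gives the model more room for the prolonged phones of slow dysarthric speech. The 0.5 self-loop probability is an arbitrary initialization, not a value from the disclosure.

```python
# Illustrative sketch only: standard versus adapted left-to-right HMM topologies.
import numpy as np

def left_to_right_hmm(num_states: int, self_loop: float = 0.5) -> np.ndarray:
    """Transition matrix where each state either repeats or advances one state."""
    A = np.zeros((num_states, num_states))
    for s in range(num_states):
        A[s, s] = self_loop
        if s + 1 < num_states:
            A[s, s + 1] = 1.0 - self_loop
        else:
            A[s, s] = 1.0  # final state self-loops until the phoneme ends
    return A

standard_nonsilence = left_to_right_hmm(3)   # conventional setup
standard_silence    = left_to_right_hmm(5)
adapted_phoneme     = left_to_right_hmm(6)   # more states per phoneme for
                                             # prolonged, slow articulation
```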
[0022] In another aspect, provided is a method comprising: receiving, by a processor, audio data (e.g., recording or audio stream in a controller) having low-intelligibility, severe dysarthric speech; determining, via a trained model, a plurality of estimated speech values for the severe, low-intelligibility dysarthric speech of the audio data, wherein the trained model was configured using a training dysarthric speech dataset acquired from one or more speakers of at least 1000 words and 4500 phonemes; and outputting the determined plurality of estimated speech values, wherein the determined plurality of estimated speech values is used for control or voice synthesis.
[0023] In some embodiments, the training dysarthric speech dataset has at least 40,000 words and 187,000 phonemes.
[0024] In some embodiments, the trained model comprises a trained hidden Markov model.
[0025] In some embodiments, the training dysarthric speech data comprises a non-dysarthric (i.e., normal) speech having 161 to 200 words per minute, a moderate dysarthric speech having 121 to 160 words per minute, a severe dysarthric speech having less than 120 words per minute, or a combination thereof.
[0026] In some embodiments, the audio data is of the single speaker, and the estimated speech values are for the single speaker.
[0027] In some embodiments, the audio data is of a speaker different from the single speaker, and the estimated speech values are for the different speaker (e.g., wherein the different speaker has dysarthria).
[0028] In some embodiments, the trained model was characterized as having an accuracy for the speech recognition greater than 85%.
[0029] In some embodiments, the method further comprises directing the plurality of estimated speech values to a human-machine interface.
[0030] In some embodiments, the hidden Markov model employed at least three states and five states to represent non-silence phonemes and silence phonemes, respectively.
[0031] In some embodiments, the hidden Markov model employed at least six states to represent each of non-silence and silence phonemes.
[0032] In some embodiments, the hidden Markov model employed a Gaussian mixture model (GMM), deep neural network (DNN), long-short term memory (LSTM)-recurrent neural network, and bidirectional-LSTM (BLSTM).
[0033] In yet another aspect, provided is a non-transitory computer-readable medium having instructions stored thereon, wherein execution of the instructions by a processor causes the processor to: receive audio data (e.g., recording or audio stream in a controller) having low-intelligibility, severe dysarthric speech; determine, via a trained model, a plurality of estimated speech values for the severe, low-intelligibility dysarthric speech of the audio data, wherein the trained model was configured using a training dysarthric speech dataset acquired from one or more speakers of at least 1000 words and 4500 phonemes; and output the determined plurality of estimated speech values, wherein the determined plurality of estimated speech values is used for control or voice synthesis.
[0034] Other systems, methods, features and/or advantages will be or may become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features and/or advantages be included within this description and be protected by the accompanying claims.
BRIEF DESCRIPTION OF DRAWINGS
[0035] FIGURES 1A-1D depict an example workflow for an example system for determining estimated speech values in accordance with the present disclosure.
[0036] FIGURES 2A-2D depict example implementations of estimated text of dysarthric speech.
[0037] FIGURE 3 depicts an example method of determining estimated speech values in accordance with the present disclosure.
[0038] FIGURE 4 depicts phoneme error rates (PERs) of the testing experiment using Gaussian mixture model (GMM)-, deep neural network (DNN)-, long short-term memory (LSTM)-, and bidirectional long short-term memory (BLSTM)-based acoustic models with standard and adapted hidden Markov models.
[0039] FIGURE 5 depicts word error rates (WERs) of the testing experiment using Gaussian mixture model (GMM)-, deep neural network (DNN)-, long short-term memory (LSTM)-, and bidirectional long short-term memory (BLSTM)-based acoustic models with standard and adapted hidden Markov models.
DETAILED DESCRIPTION
[0040] It is appreciated that certain features of the disclosure, which are, for clarity, described in the context of separate aspects, can also be provided in combination with a single aspect. Conversely, various features of the disclosure, which are, for brevity, described in the context of a single aspect, can also be provided separately or in any suitable subcombination. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. Methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present disclosure.
DEFINITIONS
[0041] In this specification and in the claims that follow, reference will be made to a number of terms, which shall be defined to have the following meanings:
[0042] As used herein, “comprising” is to be interpreted as specifying the presence of the stated features, integers, steps, or components as referred to, but does not preclude the presence or addition of one or more features, integers, steps, or components, or groups thereof. Moreover, each of the terms “by,” “comprising,” “comprises,” “comprised of,” “including,” “includes,” “included,” “involving,” “involves,” “involved,” and “such as” are used in their open, non-limiting sense and may be used interchangeably. Further, the term “comprising” is intended to include examples and aspects encompassed by the terms “consisting essentially of” and “consisting of.” Similarly, the term “consisting essentially of” is intended to include examples encompassed by the term “consisting of.”
[0043] As used in the specification and the appended claims, the singular forms “a,” “an” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a compound”, “a composition”, or “a cancer”, includes, but is not limited to, two or more such compounds, compositions, or cancers, and the like.
[0044] It should be noted that ratios, concentrations, amounts, and other numerical data can be expressed herein in a range format. It can be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint. It is also understood that there are a number of values disclosed herein, and that each value is also herein disclosed as “about” that particular value in addition to the value itself. For example, if the value “10” is disclosed, then “about 10” is also disclosed. Ranges can be expressed herein as from “about” one particular value, and/or to “about” another particular value. Similarly, when values are expressed as approximations, by use of the antecedent “about,” it can be understood that the particular value forms a further aspect. For example, if the value “about 10” is disclosed, then “10” is also disclosed.
[0045] When a range is expressed, a further aspect includes from the one particular value and/or to the other particular value. For example, where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the disclosure, e.g. the phrase “x to y” includes the range from ‘x’ to ‘y’ as well as the range greater than ‘x’ and less than ‘y’. The range can also be expressed as an upper limit, e.g. ‘about x, y, z, or less’ and should be interpreted to include the specific ranges of ‘about x’, ‘about y’, and ‘about z’ as well as the ranges of ‘less than x’, ‘less than y’, and ‘less than z’. Likewise, the phrase ‘about x, y, z, or greater’ should be interpreted to include the specific ranges of ‘about x’, ‘about y’, and ‘about z’ as well as the ranges of ‘greater than x’, ‘greater than y’, and ‘greater than z’. In addition, the phrase “about ‘x’ to ‘y’”, where ‘x’ and ‘y’ are numerical values, includes “about ‘x’ to about ‘y’”.
[0046] It is to be understood that such a range format is used for convenience and brevity, and thus, should be interpreted in a flexible manner to include not only the numerical values explicitly recited as the limits of the range, but also to include all the individual numerical values or sub-ranges encompassed within that range as if each numerical value and sub-range is explicitly recited. To illustrate, a numerical range of “about 0.1% to 5%” should be interpreted to include not only the explicitly recited values of about 0.1% to about 5%, but also include individual values (e.g., about 1%, about 2%, about 3%, and about 4%) and the sub-ranges (e.g., about 0.5% to about 1.1%; about 0.5% to about 2.4%; about 0.5% to about 3.2%, and about 0.5% to about 4.4%, and other possible sub-ranges) within the indicated range.
[0047] As used herein, the terms “about,” “approximate,” “at or about,” and “substantially” mean that the amount or value in question can be the exact value or a value that provides equivalent results or effects as recited in the claims or taught herein. That is, it is understood that amounts, sizes, formulations, parameters, and other quantities and characteristics are not and need not be exact, but may be approximate and/or larger or smaller, as desired, reflecting tolerances, conversion factors, rounding off, measurement error and the like, and other factors known to those of skill in the art such that equivalent results or effects are obtained. In some circumstances, the value that provides equivalent results or effects cannot be reasonably determined. In such cases, it is generally understood, as used herein, that “about” and “at or about” mean the nominal value indicated ±10% variation unless otherwise indicated or inferred. In general, an amount, size, formulation, parameter or other quantity or characteristic is “about,” “approximate,” or “at or about” whether or not expressly stated to be such. It is understood that where “about,” “approximate,” or “at or about” is used before a quantitative value, the parameter also includes the specific quantitative value itself, unless specifically stated otherwise.
[0048] As used herein, the terms “optional” or “optionally” mean that the subsequently described event or circumstance can or cannot occur, and that the description includes instances where said event or circumstance occurs and instances where it does not.
[0049] EXAMPLE SYSTEMS AND METHODS
[0050] FIG. 1A depicts an example method 100 which uses a trained AI model 102 to determine estimated text of dysarthric speech 104. The trained AI model 102 can be configured using a training dysarthric speech dataset acquired from a speaker of at least 40,000 words (e.g., at least 50,000 words, at least 60,000 words, at least 70,000 words, at least 80,000 words, at least 90,000 words, at least 100,000 words) and at least 187,000 phonemes (e.g., at least 190,000 phonemes, at least 195,000 phonemes, at least 200,000 phonemes, at least 205,000 phonemes, at least 210,000 phonemes, at least 215,000 phonemes, at least 220,000 phonemes, at least 225,000 phonemes). In other aspects, the trained AI model can be configured using a training dysarthric speech dataset acquired from a speaker of at least 20 hours (e.g., at least 25 hours, at least 30 hours, at least 35 hours, at least 40 hours, at least 45 hours, at least 50 hours). In yet still other aspects, the trained AI model can be configured using a training dysarthric speech dataset acquired from a speaker of at least 40,000 words, 187,000 phonemes, and at least 20 hours. The estimated text of the dysarthric speech 104 can have a speech recognition accuracy greater than 85% (e.g., greater than 86%, greater than 87%, greater than 88%, greater than 89%, greater than 90%, greater than 91%, greater than 92%, greater than 93%, greater than 94%, greater than 95%, greater than 96%, greater than 97%, greater than 98%, greater than 99%).
[0051] In other aspects, the trained AI model 102 can be configured using a training dysarthric speech dataset, acquired from two or more dysarthric patients with various voice pitches and accents, having a duration of at least 5 hours with at least 1000 words (e.g., at least 2000 words, at least 3000 words, at least 4000 words, at least 5000 words, at least 6000 words) and at least 4500 phonemes (e.g., at least 5500 phonemes, at least 6500 phonemes, at least 7500 phonemes, at least 8500 phonemes, at least 9500 phonemes). The larger the training dataset, the more words and phonemes it requires. Each of the two or more dysarthric patients, whose voices are used for generating the training dataset, can have (i) a normal condition and produce speech at a rate of 161-200 words/minute, (ii) a moderate dysarthric condition and produce speech at a rate of 121-160 words/minute, or (iii) a severe dysarthric condition and produce speech at a rate of fewer than 120 words/minute.
[0052] First, a recording 106 (e.g., dysarthric speech recording or audio stream) of audio data having low-intelligibility, severe dysarthric speech is collected or obtained. In some aspects, the recording can be generated by the same speaker who provided the training dysarthric speech dataset. In other aspects, the recording can be generated by a different speaker. The recording 106 undergoes pre-processing 108, e.g., filtering and normalization. The audio then undergoes segmentation 110, which segments the audio into distinct silence and non-silence phonemes. The silence phonemes can be determined using a pause detector 112, which can be used to establish the beginning and end of each non-silence phoneme. The segmented speech is then input into the trained AI model 102, which outputs the estimated text of the dysarthric speech 104.
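By way of illustration only, the pause-based segmentation 110 and pause detector 112 can be sketched with a simple frame-energy rule. The following Python fragment is a minimal sketch under assumed parameters (the frame sizes, the -35 dB silence threshold, and the function name segment_by_pauses are illustrative, not the disclosed implementation): frames whose log-energy falls below the threshold are treated as pauses, and the pause boundaries delimit the non-silence segments passed to the trained model.

```python
import numpy as np

def segment_by_pauses(audio, sr=16000, frame_ms=25, hop_ms=10, silence_db=-35.0):
    """Energy-based pause detection: returns (start, end) sample indices of
    non-silence segments delimited by detected pauses."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = max(1, 1 + (len(audio) - frame) // hop)
    # Log-energy per frame, relative to the loudest frame.
    energy = np.array([np.sum(audio[i * hop:i * hop + frame] ** 2) + 1e-12
                       for i in range(n_frames)])
    log_e = 10.0 * np.log10(energy / energy.max())
    voiced = log_e > silence_db                 # False = pause (silence) frame
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start * hop, i * hop + frame))
            start = None
    if start is not None:
        segments.append((start * hop, len(audio)))
    return segments
```

In practice the threshold would be tuned to the speaker's recordings, since severe dysarthric speech may contain long within-utterance pauses.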
[0053] The trained AI model 102 can include a trained hidden Markov model (HMM). This hidden Markov model can employ at least three states (e.g., at least four states, at least five states) to represent nonsilence phonemes and at least five states to represent silence phonemes. In other aspects, the hidden Markov model can employ at least six states (e.g., at least seven states, at least eight states, at least nine states, at least ten states) to represent each of the nonsilence and silence phonemes. The hidden Markov model can further employ a Gaussian mixture model (GMM), deep neural network (DNN), long-short term memory (LSTM)-recurrent neural network, and/or bidirectional-LSTM (BLSTM) to assist with speech recognition.
[0054] The neural network (e.g., artificial neural network, convolutional neural network) employed by the hidden Markov model (of the trained AI model 102) can be a computing system including a plurality of interconnected neurons (e.g., also referred to as “nodes”). The nodes can be arranged in a plurality of layers, such as an input layer, an output layer, and optionally one or more hidden layers with different activation functions. An artificial neural network (ANN) with hidden layers can be referred to as a deep neural network or multilayer perceptron (MLP). Each node is connected to one or more other nodes in the ANN. For example, each layer is made of a plurality of nodes, where each node is connected to all nodes in the previous layer. The nodes in a given layer are not interconnected with one another, i.e., the nodes in a given layer function independently of one another. As used herein, nodes in the input layer receive data from outside of the ANN, nodes in the hidden layer(s) modify the data between the input and output layers, and nodes in the output layer provide the results. Each node is configured to receive an input, implement an activation function (e.g., binary step, linear, sigmoid, tanh, or rectified linear unit (ReLU) function), and provide an output in accordance with the activation function. Additionally, each node is associated with a respective weight. ANNs are trained with a dataset to maximize or minimize an objective function. In some implementations, the objective function is a cost function, which is a measure of the ANN's performance (e.g., error such as L1 or L2 loss) during training, and the training algorithm tunes the node weights and/or bias to minimize the cost function. This disclosure contemplates that any algorithm that finds the maximum or minimum of the objective function can be used for training the ANN. Training algorithms for ANNs include but are not limited to backpropagation. It should be understood that an artificial neural network is provided only as an example machine learning model. This disclosure contemplates that the machine learning model can be any supervised learning model, semi-supervised learning model, or unsupervised learning model. Optionally, the machine learning model is a deep learning model. Machine learning models are known in the art and are therefore not described in further detail herein.
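As a purely illustrative sketch of such a feed-forward network (the framework, layer widths, layer count, and output size are assumptions for this sketch, not values taken from the disclosure), a multilayer perceptron with ReLU hidden layers trained by backpropagation against a cross-entropy cost function might look as follows:

```python
import torch
import torch.nn as nn

class AcousticMLP(nn.Module):
    """Fully connected network: an input layer, hidden layers with ReLU
    activations, and an output layer of per-frame phoneme-state scores."""
    def __init__(self, n_inputs=39, n_hidden=512, n_layers=6, n_states=800):
        super().__init__()
        layers, width = [], n_inputs
        for _ in range(n_layers):
            layers += [nn.Linear(width, n_hidden), nn.ReLU()]
            width = n_hidden
        layers.append(nn.Linear(width, n_states))    # output layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):                            # x: (batch, n_inputs)
        return self.net(x)

# Training minimizes a cost function (here cross-entropy) by backpropagation.
model = AcousticMLP()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
features = torch.randn(8, 39)                        # dummy batch of acoustic frames
targets = torch.randint(0, 800, (8,))                # dummy per-frame state labels
loss = loss_fn(model(features), targets)
loss.backward()
optimizer.step()
```

Each nn.Linear layer holds the node weights and biases that the optimizer tunes to reduce the cost function.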
[0055] A convolutional neural network (CNN) is a type of deep neural network that has been applied, for example, to image analysis applications. Unlike traditional neural networks, each layer in a CNN has a plurality of nodes arranged in three dimensions (width, height, depth). CNNs can include different types of layers, e.g., convolutional, pooling, and fully connected (also referred to herein as “dense”) layers. A convolutional layer includes a set of filters and performs the bulk of the computations. A pooling layer is optionally inserted between convolutional layers to reduce the computational power and/or control overfitting (e.g., by downsampling). A fully connected layer includes neurons, where each neuron is connected to all of the neurons in the previous layer. The layers are stacked similarly to traditional neural networks. GCNNs are CNNs that have been adapted to work on structured datasets such as graphs.
[0056] FIG. 1B depicts an example implementation of the example method 100. This implementation uses a user device 114 to record dysarthric speech and cloud infrastructure 116 to determine the estimated text of the dysarthric speech 104. The user device 114 uses a microphone 118 to capture a recording or audio stream of the dysarthric speech. The recording or audio stream is transferred to the cloud infrastructure 116 via a device-based network interface 120 in communication with a cloud-based network interface 122. The recording or audio stream can be temporarily placed in cloud storage 124; however, in other aspects, the recording or audio stream may be immediately processed. The recording or audio stream is then subjected to the example method 100: the recording is pre-processed 108, segmented 110 using a pause detector 112, and then input into a trained AI model 102, which outputs the estimated text of the dysarthric speech 104.
[0057] FIG. 1C depicts another example implementation of the example workflow 100. This implementation uses a local processing device 126 (e.g., smartphone, smartwatch, etc.) to record dysarthric speech and determine the estimated text of the dysarthric speech. The local processing device 126 uses a microphone to capture a recording or audio stream of the dysarthric speech. The recording or audio stream is then subjected to the example workflow 100: the recording is pre-processed 108, segmented 110 using a pause detector 112, and then input into a trained AI model 102, which outputs the estimated text of the dysarthric speech 104. In some embodiments, the estimated speech can be used in a user-interface controller, e.g., for navigation devices such as GPS navigation, or in a vehicle controller to do the same.
[0058] Transfer Learning Model. FIG. 1D depicts another example implementation of the example workflow 100, where the trained AI model 102 (shown as 102') was trained in a training system 120 using a transfer learning method, also referred to as a transfer-learning-based training system 120. As shown, the trained AI model 102 (shown as 102'), initially as a transformer-based model 122 (e.g., large-scale pretrained model), was input into the training system 120. The transformer-based model 122 was then trained, in a training process 124, on a healthy speech dataset 126 (e.g., having tens of thousands of hours of speech). The transformer-based model 122 (e.g., large-scale pretrained model) was then finetuned, in a finetuning operation 128, on a smaller dysarthric speech dataset (e.g., having less than 1 hour of speech). In other words, the model 122 was generalized to various patients with minimal data during the finetuning operation 128. After that, the transformer-based model was fully trained, as a trained AI model 102', and could recognize various speeches (e.g., having varying accents, speeds, etc.) when employed in the exemplary system.
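The pretrain-then-finetune flow of FIG. 1D can be illustrated, under assumptions, with an off-the-shelf speech transformer. The snippet below assumes a HuggingFace Transformers-style API and the publicly available wav2vec 2.0 checkpoint pretrained on LibriSpeech; the dummy dysarthric utterance, the learning rate, and the single-pass loop are placeholders for illustration and do not describe the disclosed training system 120.

```python
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Transformer model pretrained on a large healthy-speech corpus (training process 124).
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Hypothetical small dysarthric dataset: (16 kHz waveform, transcript) pairs.
dysarthric_utterances = [(0.01 * np.random.randn(16000).astype(np.float32), "HELLO WORLD")]

# Finetuning operation 128: continue training the pretrained model on the dysarthric data.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for waveform, transcript in dysarthric_utterances:
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    labels = processor(text=transcript, return_tensors="pt").input_ids
    loss = model(inputs.input_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The design point of the transfer learning approach is that the heavy acoustic modeling is learned from abundant healthy speech, so only a small amount of dysarthric speech is needed to adapt the model to a new patient.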
[0059] As more and more patients contribute to training, diverse speech samples (e.g., varying accents and dialects, and multilingual speech) can be generated as a training dataset (e.g., 126, 130) for the training system 120 to train the AI model 102, making the exemplary system more robust in speech recognition.
[0060] Configured with the trained AI model 102 (e.g., as a transformer-based, large-scale pre-trained model), the exemplary system can support health status monitoring by analyzing speech patterns to assess the severity of dysarthria. By continuously processing speech signals, the exemplary system can provide a non-invasive way to track changes in speech characteristics over time, which can provide (i) ongoing monitoring of disease progression or healing and (ii) feedback to both patients and healthcare providers.
[0061] The estimated text of the dysarthric speech can be implemented into a variety of end-use platforms. For example, the estimated text of the dysarthric speech can be used for text-to-voice synthesis (FIG. 2A), HMI of an assisted technology device (FIG. 2B), UI of a software application such as a computer operating system (FIG. 2C), or clinical diagnosis such as speech therapy (FIG. 2D).
[0062] FIG. 3 depicts an example method for determining estimated text of dysarthric speech. First, audio data (e.g., a recording or audio stream in a controller) having low-intelligibility, severe dysarthric speech is received 302 by a processor. Next, a plurality of estimated speech values for the severe, low-intelligibility dysarthric speech of the audio data is determined 304 via a trained model. In some aspects, the trained model may have been configured using a training dysarthric speech dataset acquired from a single speaker of at least [40,000 words and 187,000 phonemes; 20 hours]. Finally, the determined plurality of estimated speech values is output 306. In some aspects, the determined plurality of estimated speech values can be used for control or voice synthesis.
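For orientation only, the flow of FIG. 3 can be summarized as a short skeleton; the function and method names (preprocess, segment, decode) are hypothetical stand-ins introduced for this sketch, not an API disclosed herein.

```python
def estimate_speech_values(audio_data, trained_model, preprocess, segment):
    """High-level flow of FIG. 3 with illustrative callables:
    receive audio -> estimate speech values with a trained model -> output them."""
    cleaned = preprocess(audio_data)            # step 302: receive and pre-process audio
    segments = segment(cleaned)                 # pause-based segmentation
    estimates = [trained_model.decode(seg)      # step 304: estimate speech values
                 for seg in segments]
    return estimates                            # step 306: used for control or voice synthesis
```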
[0063] Experimental Results and Additional Examples
[0064] A study was conducted to develop and evaluate a large low-intelligibility dysarthric speech dataset from a single participant for a personalized automatic speech recognition system for the participant. It is contemplated that once the model has been trained, the same model can be applied, with or without customization, to other users having dysarthric speech.
[0065] The experimental results indicated that speech recognition models trained with a large speaker-dependent dataset can overcome the intra-speaker variability in dysarthric speech and achieve high recognition performance. In addition, increasing the number of hidden Markov model states for representing the phoneme durations in dysarthric speech was observed to significantly improve the recognition performance. The highest word recognition accuracy achieved was significantly higher than that of the non-personalized design and human listening. The study demonstrates high-accuracy speech recognition for low-intelligibility dysarthric speech. The study verified the results for one speaker, and it is contemplated that the trained model can be applied to other speakers with low-intelligibility dysarthric speech.
[0066] High-Accuracy Personalized Automatic Speech Recognition for Low-Intelligibility Dysarthric Speech
[0067] Automatic speech recognition (ASR) has the potential to help people with dysarthria communicate faster and more effectively, particularly those for whom using AAC devices is not an efficient communication method. This can be done by using text-to-speech (Taylor, 2009) systems to convert the recognized text to speech (Dhanalakshmi et al., 2018) to assist in verbal conversation. Additionally, effective speech recognition systems can help people with limited mobility and/or dexterity control their devices using only their voice, which has the potential to enable them to control their surroundings (e.g., light switches, thermostats, door locks) when such control may otherwise be unavailable. Although automatic speech recognition has been successfully and widely used in many applications in recent years, current speech recognition systems for typical speech are not well-suited to dysarthric speech because of acoustic mismatch resulting from articulatory limitation (Young & Mihailidis, 2010). Dysarthric speech generally deviates from typical speech in various ways; however, it can be characterized by highly consistent articulatory errors for each speaker (Whurr, 1988).
[0068] Personalized implementation can be an effective approach when developing high-performance speech recognizers for speakers with dysarthria by (a) using speaker-specific training data for each speaker and/or (b) applying suitable model architectures to each speaker. In practice, collecting large dysarthric speech data from single speakers is generally a challenging task. Therefore, nonpersonalized speaker-independent speech recognition systems are currently more popular, in which speaker normalization or adaptation approaches are usually applied. Popular adaptation approaches include transforming acoustic features or adjusting model parameters trained on multiple typical speakers to a specific speaker using a small number of adaptation utterances from a dysarthric speaker. However, even though speaker-adapted speech recognition models often show higher performance than speaker-independent models, the performance is still unsatisfactory, especially on the low-intelligibility dysarthric speech from people with severe dysarthria, due to the inter-speaker variation and the variation between typical and dysarthric speech.
[0069] In addition to the inter-speaker variance, previous studies have shown that dysarthric speech exhibits greater intra-speaker variability than typical speech, which indicates that dysarthric speech patterns fluctuate when saying the same words and sentences (Takashima et al., 2020; Wilson, 2000). The higher degree of intra-speaker variability reduces the effectiveness of using dysarthric speech as training data when the speaker-specific training data is small. For typical speech, a speaker-dependent speech recognition system, which is trained solely on data from the individual, would mostly outperform a speaker-independent system given enough training data and a similar experimental setup by effectively modeling the intra-speaker variability. This is not the case for a speaker-dependent system trained on a small dysarthric dataset. Nevertheless, speakers with dysarthria still show relatively consistent articulatory errors (Whurr, 1988), and so it is assumed that a speaker-dependent speech recognizer for an individual with a large training set would be promising for speakers with dysarthria. However, the feasibility of speaker-dependent models for dysarthric speech recognition has rarely been explored due to the difficulties in collecting large datasets from single speakers with dysarthria.
[0070] Previous studies had investigated improving the performance of dysarthric speech recognition through multiple strategies. As previously mentioned, the mismatch between acoustic characteristics of dysarthric and typical speech is one of the primary reasons for the low speech recognition performance for dysarthric speech. To overcome this problem, data-driven approaches had been widely explored. Jiao et al. (2018) and Vachhani et al. (2018) conducted data augmentation by simulating dysarthric speech with typical speech data. Bhat et al. (2018) and Sidi Yakoub et al. (2020) enhanced dysarthric speech quality by transforming it to be more similar to typical speech before training the speech recognition system. In addition, Xiong et al. (2019) had explored both adjusting dysarthric speech towards typical speech for speech recognition trained with typical speech, and adjusting typical speech towards dysarthric speech for data augmentation in dysarthric speech recognition training. The adaptation of speech recognition for dysarthric speech could be applied in augmenting acoustic features as well. Xiong et al. (2018) used estimated articulatory-based representations to augment the conventional acoustic features. In Celin et al. (2020), a data augmentation approach was performed with virtual linear microphone array-based synthesis and multi-resolution feature extraction. In addition to the data manipulation approach, adapting speech recognition models trained with large datasets with small amounts of dysarthric data is a popular strategy as well (Mustafa et al., 2014; Shor et al., 2019; Sim et al., 2018; Takashima et al., 2019; Wang et al., 2020).
[0071] Other than the data-driven approaches, designing new machine learning models and modifying existing models for dysarthric speech had been demonstrated to be effective in previous studies. Kim et al. (2018) investigated the improvement achieved by a convolutional neural network speech recognizer. Kim et al. (2017) employed Kullback-Leibler divergence-based hidden Markov models to capture the phonetic variation of dysarthric speech. Kim et al. (2019) applied multi-view representation learning to dysarthric speech. The newly proposed end-to-end and listen-attend-spell models also have been shown to significantly improve dysarthric speech recognition performance (Lin et al., 2020; Shor et al., 2019). Woszczyk et al. (2020) explored domain adversarial neural networks for speaker-independent speech recognition. Takashima et al. (2020) proposed a speech recognition system based on deep metric learning to handle the challenge of intra-speaker variability in dysarthric speech.
[0072] There are also approaches to adapt speech recognition models based on the recognition errors found in the preliminary investigation. Mengistu & Rudzicz (2011) adapted the acoustic and lexical models by exploiting the consistent intra-speaker patterns in the recognition errors obtained by listening tests. Caballero Morales & Cox (2009) incorporated a model of the speaker’s phonetic confusion matrix into the speech recognition process with a cascade of weighted finite-state transducers (Mohri et al., 2002) at the confusion matrix, word, and language levels. Seong et al. (2012) proposed a dysarthric speech recognition error correction method based on weighted finite-state transducers.
[0073] Many of the previously noted studies used publicly available dysarthric speech datasets, which contain relatively small amounts of data from each speaker. Three of these commonly used English dysarthric speech datasets are the Universal Access (UA) speech corpus (H. Kim et al., 2008), Nemours (Menendez-Pidal et al., 1996), and TORGO (Rudzicz et al., 2010). The UA speech corpus includes data from 19 speakers with cerebral palsy; speech materials consist of 765 isolated words per speaker. The Nemours dysarthric speech data includes a collection of 740 short nonsense sentences spoken by 10 male speakers with varying degrees of dysarthria that have been marked at the phoneme level (Menendez-Pidal et al., 1996). The TORGO corpus provides both acoustic and articulatory speech data from seven individuals with speech impediments caused by cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS) and age- and gender-matched control subjects (Rudzicz et al., 2010). In addition to these publicly available corpora, researchers collected dysarthric speech data for their studies (Kim et al., 2017, 2018, 2019; Shor et al., 2019). Kim et al. (2018) used a dysarthric speech dataset collected from nine patients with ALS. Shor et al. (2019) collected 36.7 hr of audio from 67 people with ALS. The logistical difficulties in collecting data from people with dysarthria include short collection sessions due to low stamina and fewer sessions due to the difficulties in getting to and from the recording facility. Therefore, most of the current dysarthric speech datasets contain small amounts of data from individual speakers with dysarthria.
[0074] Most of the previous studies mentioned in this section focus on speaker-independent implementation or speaker adaptation. There are studies that use speaker-dependent models or include speaker-dependent models as baselines (Woszczyk et al., 2020). Noyes & Frankish (1992) reported that speaker-dependent models obtained 75% and 99% word-level accuracy for impaired speakers on a small vocabulary (compared to human listeners between 7% and 61%). Sawhney & Wheeler (2001) used an unspecified segmental phoneme recognizer to achieve an error rate reduction of 22% over speaker-independent models. Rudzicz (2007) compared the speaker-dependent and speaker-adaptive approaches in dysarthric speech recognition, as well as a speaker-independent model trained with the Wall Street Journal (WSJ) data corpus (Lamere et al., 2003). These speaker-dependent studies all used limited datasets (e.g., TORGO or Nemours).
[0075] In summary, current studies on dysarthric speech recognition are mainly focused on speaker-independent or speaker-adaptation approaches. Although there are a few studies on speaker-dependent dysarthric speech recognition, the data is limited and the performances are unsatisfactory (Rudzicz, 2007). Large data collected from single speakers with dysarthria are needed to explore the speaker-dependent approach and further improve the performance (Takashima et al., 2019; Vachhani et al., 2018; Yue et al., 2020).
[0076] In the instant study, the feasibility of personalized speech recognition to improve the performance of speech recognition on severe (low-intelligibility) dysarthric speech was investigated. Compared to studies that personalized speech recognition models trained with other speakers to the target speakers (Green et al., 2021; Shor et al., 2019; Tomanek et al., 2021), a speaker-dependent strategy was used, where the speaker's (unique) data was used for training. A large (more than 21 hr) low-intelligibility dysarthric speech dataset from one single speaker with cerebral palsy was collected, and speaker-dependent speech recognition models based on hidden Markov models were then built. During the experiments, the developmental experiment was first conducted, which was speaker-dependent speech recognition using Gaussian mixture model-hidden Markov model speech recognizers. In this developmental experiment, hidden Markov models with different numbers of states were explored to find a hidden Markov model state setup that better fit the lower speaking rate of the speaker. After that, the testing experiment was performed, which was speaker-dependent speech recognition using standard and adapted hidden Markov models. The standard hidden Markov model used three and five states to represent nonsilence and silence phonemes, respectively, which is a conventional setup. The adapted hidden Markov model used the numbers of states that were found in the developmental experiment. To find suitable model architectures for the speaker, several acoustic models were investigated in this testing experiment, including hidden Markov models with the Gaussian mixture model (GMM), deep neural network (DNN), more recent long-short term memory (LSTM)-recurrent neural network, and bidirectional-LSTM (BLSTM). After the developmental and testing experiments, a nonpersonalized experiment of adding a large typical speech dataset (from multiple speakers) to the training set was conducted (using the standard HMM) as the baseline for comparison. A detailed discussion of the experimental results is presented, along with the results from other related works.
[0077] Materials and Methods
[0078] Participants. A single participant collected and donated data that was collected on himself for the study. He is a 30-year-old male, native American English speaker with athetoid cerebral palsy. The speaking rate of the participant was about 37 words/min. His speech intelligibility was 20.9% (average), which was measured using the standard Sentence Intelligibility Test (SIT) by two speech-language pathologists who were blind to the participant. Compared to typical speech, the participant exhibited a lower speaking rate (100-130 words/minute for the average person). In addition, the participant produced unintelligible speech due to poor articulation, hoarseness, and breathy voice, but showed his own pronunciation patterns.
[0079] Setting. The data collection was conducted for between 45 and 60 minutes per day over several weeks by the participant at his home. With the collected data, the authors conducted the speech recognition experiments in a laboratory.
[0080] Devices. The participant used the internal microphones of a 2012 iMac and a 2015 MacBook Pro for the recordings. A user interface script was used to present sentences in the stimuli.
[0081] A desktop computer (Dell Precision 7820 tower, 64 gigabytes of RAM, Intel Xeon Silver 4114 central processing unit) with graphics processing unit support (Nvidia Quadro P4000) was used for training deep learning-based speech recognition models.
[0082] Stimuli. The stimuli included the Harvard sentences (IEEE, 1969) and humanities research papers and essays the participant had authored. A total of 2,708 phrases, 46,634 words (with 6,397 unique words), and 187,549 phonemes (with 39 unique phonemes) were included in the stimuli.
[0083] Automatic Speech Recognition Models. The automatic speech recognition (ASR) models in this study were hidden Markov model (HMM)-based (Gales & Young, 2007; Juang & Rabiner, 1991), with four different acoustic models: Gaussian mixture model (GMM) (Xuan et al., 2001), deep neural network (DNN) (Hinton et al., 2012), long short-term memory (LSTM), and bidirectional-LSTM (BLSTM) (Sak et al., 2014). The hidden Markov model uses hidden states to model the probability of transitions between sub-word units (e.g., phonemes), representing each phoneme by multiple states so that the stages (e.g., begin and end) of pronouncing the phonemes can be isolated. A popular scheme is to use three states for nonsilence phonemes and five states for silence. In this study, since the participant had a very low speaking rate (37 words/min), more hidden Markov model states were used for representing the phonemes, which better captured the prolonged pronunciation pattern of the subject. The hidden Markov models in this study included 776 tied-state (senone) left-to-right triphone hidden Markov models for the baseline experiment, and 800 triphone hidden Markov models for the adapted models. The senones were obtained using a state tying method based on the decision tree (Reichl & Wu Chou, 2000).
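The difference between the conventional and the adapted per-phoneme topologies can be illustrated by constructing left-to-right transition matrices; the self-loop probability below is an arbitrary illustrative value, and the actual models were built with Kaldi rather than with this sketch.

```python
import numpy as np

def left_to_right_hmm(n_states, self_loop=0.7):
    """Left-to-right phoneme HMM transitions: each state either repeats
    (self-loop) or advances to the next state; adding states stretches the
    duration the model can assign to a prolonged phoneme."""
    A = np.zeros((n_states, n_states))
    for s in range(n_states - 1):
        A[s, s], A[s, s + 1] = self_loop, 1.0 - self_loop
    A[-1, -1] = 1.0                      # final state loops until the phoneme ends
    return A

standard_nonsilence = left_to_right_hmm(3)   # conventional three-state setup
adapted_nonsilence = left_to_right_hmm(6)    # more states for the slow speaker
```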
[0084] The Gaussian mixture model-hidden Markov model (Xuan et al., 2001) had been a long-standing speech recognition model before deep learning-based models appeared (Yu & Deng, 2014); it models the distribution of acoustic features for phonemes with a weighted mixture of Gaussian distributions (Yu & Deng, 2014). The deep neural network-hidden Markov models (Yu & Deng, 2014) adopted a deep neural network to model the phoneme distributions, which were trained using the Mel-frequency cepstral coefficients with a context window of nine frames (four previous + one current + four succeeding frames), so that contextual information was used in each frame. Deep neural networks with one to six hidden layers and 256 to 1,024 hidden units at each layer had been explored in preliminary experiments. The best results were achieved by using six hidden layers with a dimension of 512 at each layer.
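The nine-frame context window described above amounts to splicing each MFCC frame with its four neighbors on either side before it is fed to the DNN. The following is a minimal sketch of that splicing step (the array shapes and edge padding are illustrative assumptions):

```python
import numpy as np

def splice_frames(mfcc, left=4, right=4):
    """Concatenate each frame with its 4 previous and 4 succeeding frames,
    giving the nine-frame context window used as DNN input."""
    T, D = mfcc.shape
    padded = np.pad(mfcc, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t:t + left + right + 1].reshape(-1) for t in range(T)])

frames = np.random.randn(100, 39)       # 100 frames of 39-dim MFCC features
spliced = splice_frames(frames)         # shape (100, 351): 9 frames x 39 dims each
```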
[0085] The (bidirectional) long short-term memory-hidden Markov models use a (bidirectional) long short-term memory network to model the distribution of phonemes over the acoustic features. The long short-term memory (LSTM) network is a type of recurrent neural network (Sak et al., 2014) that has memory blocks containing long- and short-term previous input information as a component of the current input. The bidirectional LSTM (BLSTM) processes data in both forward and backward directions in time with two separate hidden layers. Both the previous input information and the following backward input are used. Therefore, the BLSTM is expected to perform better than the LSTM.
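A minimal PyTorch sketch of a BLSTM acoustic model of this kind is given below; the hidden size, layer count, and output-state count are illustrative assumptions rather than the configurations reported in this study.

```python
import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    """Bidirectional LSTM over a sequence of acoustic frames; each frame
    receives a score for every HMM state (layer sizes are illustrative)."""
    def __init__(self, n_inputs=39, n_hidden=256, n_layers=3, n_states=800):
        super().__init__()
        self.blstm = nn.LSTM(n_inputs, n_hidden, num_layers=n_layers,
                             batch_first=True, bidirectional=True)
        self.output = nn.Linear(2 * n_hidden, n_states)   # forward + backward halves

    def forward(self, x):                 # x: (batch, time, n_inputs)
        hidden, _ = self.blstm(x)
        return self.output(hidden)        # (batch, time, n_states)

scores = BLSTMAcousticModel()(torch.randn(1, 100, 39))    # one short utterance
```

The concatenation of the forward and backward hidden states is what lets each frame's score draw on both past and future context.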
[0086] The input acoustic features for all speech recognition models were the Mel-frequency cepstral coefficients (MFCCs) (Davis & Mermelstein, 1980), which included 12 cepstral coefficients and one energy term. The combined 13-dimensional MFCCs and their first and second derivatives (39-dim.) were used as the input acoustic features of the recognizers. The frame length of feature extraction was 25 ms, and the frame rate was 10 ms. A bigram phoneme-level/word-level language model (Yu & Deng, 2014) was trained using the training dataset of each experiment. All experiments in this study were implemented with the Kaldi speech recognition toolkit (Povey et al., 2011). A summary of the speech recognition models setup is shown in TABLE 1.
TABLE 1. Experimental setup. MFCC = Mel-frequency cepstral coefficient; GMM = Gaussian mixture model; DNN = deep neural network; LSTM = long short-term memory; BLSTM = bidirectional long short-term memory; HMM = hidden Markov model; RBM = restricted Boltzmann machine.
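The 39-dimensional front end described above (13 MFCCs plus first and second derivatives, 25 ms frames at a 10 ms shift) can be sketched with librosa. In this sketch the zeroth cepstral coefficient stands in for the energy term, which is an assumption of the illustration rather than the exact Kaldi configuration used in the study.

```python
import numpy as np
import librosa

def extract_features(waveform, sr=16000):
    """39-dim features: 13 MFCCs (the 0th coefficient standing in for the
    energy term) plus first and second derivatives, with a 25 ms window
    and a 10 ms frame shift."""
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),        # 25 ms frame length
                                hop_length=int(0.010 * sr))   # 10 ms frame rate
    delta1 = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta1, delta2]).T                # shape: (frames, 39)

features = extract_features(np.random.randn(16000).astype(np.float32))
```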
[0087] Procedure. Experiments were conducted in this study, including a developmental experiment, a testing experiment, and a baseline (comparison) experiment. The developmental and testing experiments were speaker-dependent speech recognition for dysarthric speech, in which only the data from the participant were used for speech recognizer training. The developmental experiment adopted part of the data to find optimal hidden Markov model (HMM) state numbers for the participant. The testing experiment verified the performance of using the state numbers found in the developmental experiment (adapted HMM), and compared it with the standard HMM. The baseline experiment was speaker-independent (the nonpersonalized design) with the standard hidden Markov model, in which a large amount of typical speech data from multiple speakers was used for training the speech recognizer.
[0088] Both phoneme-level and word-level speech recognition were conducted in all the experiments; thus, the performance of the experiments was measured by the phoneme error rates (PERs) and word error rates (WERs), which were computed as the sum of insertions, deletions, and substitutions divided by the total number of phonemes or words tested.
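For reference, this error-rate computation is the standard Levenshtein (edit-distance) alignment between the reference and hypothesized sequences; a minimal sketch over word or phoneme sequences is shown below (the function name is illustrative).

```python
def error_rate(reference, hypothesis):
    """WER or PER: (substitutions + deletions + insertions) / len(reference),
    computed with Levenshtein alignment over words or phonemes."""
    ref, hyp = reference, hypothesis
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[len(ref)][len(hyp)] / len(ref)

# Example: one substitution over four reference words -> WER = 0.25
wer = error_rate("turn on the light".split(), "turn on a light".split())
```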
[0089] Data Collection. The data collection of dysarthric speech was conducted by the participant himself. During the data collection, the participant used a user interface script to present one sentence from the stimuli at a time, wait for him to press a key on the keyboard, start the recording, wait for another keystroke, stop and save the recording, and then present another sentence to the participant. The total recording time was 21.66 hr. The sampling rate of the audio recordings was 44,100 Hz, downsampled to 16,000 Hz for the experiments.
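The downsampling step can be illustrated with librosa; the file name is a hypothetical placeholder.

```python
import librosa

# Load one recording at its native 44,100 Hz rate, then downsample to 16,000 Hz.
waveform, sr = librosa.load("utterance_0001.wav", sr=None)   # hypothetical file name
waveform_16k = librosa.resample(waveform, orig_sr=sr, target_sr=16000)
```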
[0090] The LibriSpeech corpus is a collection of approximately 1,000 hours of audiobooks (Panayotov et al., 2015) and is a publicly available typical speech dataset. In this study, the training set of 100 hours of “clean” speech data (from 252 speakers, sampling rate 16,000 Hz) was used in the baseline experiment.
[0091] Developmental Experiment. In the developmental session, a subset of the dysarthric speech data collected was used as the developmental set (1,908 phrases, 13.48 hr), which was further separated into a training set (1,696 phrases, 11.2 hr) and a testing set (212 phrases, 0.7 hr), to explore the optimal number of hidden Markov model states. The developmental experiment was conducted with the Gaussian mixture model-hidden Markov model (GMM-HMM) in phoneme-level speech recognition.
[0092] Testing Experiment. In the testing experiment, speech recognizers with standard hidden Markov models were tested, which used three states to represent nonsilence phonemes and five states to represent silence. After that, the adapted hidden Markov models found in the developmental experiment were tested. All acoustic models for speech recognition introduced above were verified in the testing experiment, including Gaussian mixture model-, deep neural network-, long short-term memory-, and bidirectional long short-term memory-hidden Markov models.
[0093] The testing experiment used all the data in the development set (13.48 hr) as the training set, plus 600 additional phrases (5.9 hr) as the validation set (19.38 hr total). The remaining 200 phrases (2.28 hr, not included in development) were used as the final testing set. The total number of words tested was 4,482; the number of phonemes tested was 20,942.
[0094] Baseline Experiment. In addition to the developmental and testing experiments, a nonpersonalized baseline experiment of adding the 100-hr typical speech data (LibriSpeech) (Panayotov et al., 2015) to the training set used in the testing experiment was conducted. Therefore, the training set in the baseline experiment includes both typical speech data (100 hr) and dysarthric speech data (13.48 hr). Standard hidden Markov models were used in this nonpersonalized baseline experiment. Except for adding typical speech to the training set, the other experimental setups were the same as those in the testing experiment, and all acoustic models for speech recognition introduced above were used in this baseline experiment.
Results
[0095] Developmental Experiment. The results of the developmental experiment are shown in TABLE 2. The exploration of the number of hidden Markov model states started from the conventional setup: three states for nonsilence phonemes, and five states for silence. Under the premise of keeping the number of states for silence greater than that for non-silence phonemes, the state numbers were gradually increased and tested. As shown in TABLE 2, the phoneme error rates were generally reduced as the number of states for nonsilence phonemes increased. The reduction tapered off after six states for nonsilence and six states for silence, so the six-state setup was chosen to represent nonsilence phonemes. With the fixed six states for nonsilence phonemes, the number of states for silence was adjusted, and the eight-state setup was found to generate the best performance. Therefore, a hidden Markov model state setup of six states for nonsilence phonemes and eight states for silence was chosen for adapting the conventional hidden Markov model-based speech recognizer to the participant's speech in the following testing experiment.
TABLE 2. Results of the developmental experiment. Bold numbers indicate selected numbers of hidden Markov model states for adaptation and the phoneme error rates obtained with the numbers.
[0096] Testing Experiment. FIG. 4 and FIG. 5 present the phoneme error rate (PER) and word error rate (WER) results, respectively. As shown, the models with adapted (personalized) hidden Markov model states significantly improve the accuracies of all four acoustic models in both phoneme- and word-level recognition. Also, the improvement brought by the adaptation was more significant for the more advanced models. Specifically, the phoneme error rate and the word error rate of the bidirectional long short-term memory-based models were reduced significantly (40% and 49%, respectively), whereas smaller reductions (21% and 21%, respectively) were obtained for the Gaussian mixture model-based models. The lowest phoneme error rate achieved was 14.2% and the lowest word error rate was 13.4%, which were obtained by the bidirectional long short-term memory with adapted hidden Markov model states. All word-level recognition accuracies obtained in the testing experiment are significantly higher than human listening (20.9% word-level accuracy) for the participant.
[0097] TABLE 3 lists the 20 most frequent phoneme substitutions, deletions, and insertions of the bidirectional long short-term memory-hidden Markov model speech recognizer with and without adapted hidden Markov model states. Most of the phoneme recognition errors were significantly decreased by using the adapted hidden Markov model. In particular, most vowel substitutions are significantly reduced. This indicates that the increased number of hidden Markov model states can effectively model prolonged vowel patterns. There are exceptions, where the numbers of errors increased or only slightly decreased (e.g., the substitutions involving “ao/aa” and “ah”, and the insertion of “n”).
TABLE 3. The numbers of the 20 most frequent recognition errors in the testing experiment with the lowest phoneme error rate. Sub = substitution; Del = deletion; Ins = insertion; HMM = hidden Markov model. The numbers were obtained from the bidirectional long short-term memory-HMM models. Bolded values indicate lower numbers.
[0098] Baseline Experiment. In the baseline (comparison) experiment, 100 hr of typical speech data from the LibriSpeech dataset (Panayotov et al., 2015) was added to the training set. The highest performance was achieved by the deep neural network rather than the bidirectional long short-term memory model in this experiment. Feature space maximum likelihood linear regression (fMLLR) was needed to improve the performance by reducing the inter-speaker variations. In addition, using a language model trained only with the dysarthric data in the training set had been shown to improve the performance. Using the approaches above, the lowest phoneme error rate and word error rate achieved in this experiment were 89.8% and 92.7%, respectively.
[0099] Discussion
[0100] With the personalization approaches (speaker-dependent design and adapted hidden Markov model), a phoneme error rate of 14.2% and a word error rate of 13.4% were achieved for the low-intelligibility dysarthric speech in this study. These results indicated that although there is intra-speaker variability in dysarthric speech, machine learning-based speech recognition systems are able to learn from low-intelligibility speech data and achieve significantly higher recognition accuracy than humans (86.6% vs. 20.9% word accuracies). The nonpersonalized approach achieved poor performance (7.3% word accuracy and 10.2% phoneme accuracy), even though the typical speech training dataset was relatively large (100 hr) compared to the dysarthric data.
[0101] Compared to studies in the literature on similar tasks, namely continuous speech recognition of low-intelligibility English dysarthric speech, the results in this study represent state-of-the-art performance. The word error rate achieved in Yue et al. (2020) for a single subject was 69.07%. Kim et al. (2018) had achieved a phoneme error rate (PER) as low as 29.5% for a single speaker with amyotrophic lateral sclerosis (speech intelligibility < 65%). Hermann & Magimai-Doss (2020) reported an average word error rate (WER) of 25.9% on continuous severe dysarthric speech recognition. Yue et al. (2020) achieved 48.3% WER on severe dysarthric speech (speech intelligibility < 19%). Joy & Umesh (2018) achieved 25.9% WER on a severe dysarthria subject in the TORGO speech data. Shor et al. (2019) had reduced the WER for severe amyotrophic lateral sclerosis speakers to 20% with the data (speech intelligibility < 19%) they collected (36.7 hr from 67 speakers). Recently, Green et al. (2021) had also demonstrated that personalized automatic speech recognition (ASR) outperformed speaker-independent speech recognition and human perception, and the lowest word error rate achieved in recognizing severe dysarthric speech was about 50%. As mentioned previously, most of these multi-speaker studies were dedicated to reducing the variation between typical and dysarthric speech, which is challenging. The personalized approach in this study may have obtained high performance because the acoustic variation within the same speaker is significantly smaller than that between speakers.
[0102] In addition to the presented results, some findings in the preliminary experiments suggest the dysarthric speech of the participant is inconsistent in error production, which contrasts with the literature. Using a dataset collected from speakers with cerebral palsy and amyotrophic lateral sclerosis, Mengistu & Rudzicz (2011) designed an adapted lexicon in an attempt to increase word-level recognition accuracies. The authors identified deviations in the utterances compared to the true phoneme sequences based on perceptual evaluation. Using the deviation patterns observed during the listening task, the authors developed an adapted pronunciation lexicon. In the preliminary experiments of this study, the personalized lexicon approach (Mengistu & Rudzicz, 2011) was validated by using a phoneme-level speech recognizer to find the articulatory deviation patterns; however, due to intra-speaker variability (inconsistent pronunciations of the same sounds within a single speaker) in the data, the lexicon-based approach was not suitable for the participant in this study. The patterns of the participant in attempting to reach articulatory targets were not consistent across all contexts. For example, the participant was able to accurately produce /t/ in all word positions (initial, medial, and final), yet missed the articulatory target in some speech contexts. Data from the participant indicated that when errors were produced, they were mostly within-manner errors (e.g., /t/ and /d/), which was also reported by Platt and colleagues (Platt et al., 1980). It was also noticed that the participant inserted /m/ at the end of words that contained nasals; however, a consistent error pattern was not observed throughout all the speech samples. A better understanding of how speakers with dysarthria handle variant deviation patterns in dysarthric speech is an interesting challenge that is worth further investigation and has direct clinical relevance.
[0103] Conclusion
[0104] Throughout the description and claims of this specification, the word “comprise” and other forms of the word, such as “comprising” and “comprises,” mean including but not limited to, and are not intended to exclude, for example, other additives, segments, integers, or steps. Furthermore, it is to be understood that the terms comprise, comprising, and comprises as they relate to various aspects, elements, and features of the disclosed invention also include the more limited aspects of “consisting essentially of” and “consisting of.”
[0105] As used herein, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to a “polymer” includes aspects having two or more such polymers unless the context clearly indicates otherwise.
[0106] Ranges can be expressed herein as from “about” one particular value and/or to “about” another particular value. When such a range is expressed, another aspect includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by use of the antecedent “about,” it will be understood that the particular value forms another aspect. It should be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint.
[0107] As used herein, the terms “optional” or “optionally” mean that the subsequently described event or circumstance may or may not occur, and that the description includes instances where said event or circumstance occurs and instances where it does not.
[0108] For the terms "for example" and "such as," and grammatical equivalences thereof, the phrase "and without limitation" is understood to follow unless explicitly stated otherwise.
[0109] The following patents, applications, and publications, as listed below and throughout this document, are hereby incorporated by reference in their entirety herein.
Reference List
[1] Bhat, C., Das, B., Vachhani, B., & Kopparapu, S. K. (2018). Dysarthric speech recognition using time-delay neural network based denoising autoencoder. INTERSPEECH 2018, 451-455.
[2] Caballero Morales, S. O., & Cox, S. J. (2009). Modelling errors in automatic speech recognition for dysarthric speakers. EURASIP Journal on Advances in Signal Processing, 2009(1), 308340. https://doi.org/10.1155/2009/308340
[3] Celin, T. A. M., Nagarajan, T., & Vijayalakshmi, P. (2020). Data augmentation using virtual microphone array synthesis and multi-resolution feature extraction for isolated word dysarthric speech recognition. IEEE Journal of Selected Topics in Signal Processing, 14(2), 346-354. https://doi.org/10.1109/JSTSP.2020.2972161
[4] Davis, S., & Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357-366.
[5] Dhanalakshmi, M., Celin, T. A. M., Nagarajan, T., & Vijayalakshmi, P. (2018). Speech-input speech-output communication for dysarthric speakers using HMM-based speech recognition and adaptive synthesis system. Circuits, Systems, and Signal Processing, 37(2), 674-703. http://dx.doi.org/10.1007/s00034-017-0567-9
[6] Duffy, J. R. (2019). Motor Speech Disorders: Substrates, Differential Diagnosis, and Management (4th edition). Elsevier Health Sciences.
[7] Gales, M., & Young, S. (2007). The application of hidden Markov models in speech recognition. Foundations and Trends® in Signal Processing, 1(3), 195-304. https://doi.org/10.1561/2000000004
[8] Green, J. R., MacDonald, R. L., Jiang, P. P., Cattiau, J., Heywood, R., Cave, R., ... & Tomanek, K. (2021). Automatic speech recognition of disordered speech: personalized models outperforming human listeners on short phrases. INTERSPEECH 2021, 4778-4782.
[9] Hawley, M. S., Cunningham, S. P., Green, P. D., Enderby, P., Palmer, R., Sehgal, S., & O'Neill, P. (2013). A voice-input voice-output communication aid for people with severe speech impairment. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 21(1), 23-31. https://doi.org/10.1109/TNSRE.2012.2209678
[10] Hermann, E., & Magimai-Doss, M. (2020). Dysarthric speech recognition with lattice-free MMI. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6109-6113. https://doi.org/10.1109/ICASSP40776.2020.9053549
[11] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., & Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82-97. https://doi.org/10.1109/MSP.2012.2205597
[12] Hosom, J.-P., Kain, A. B., Mishra, T., van Santen, J. P. H., Fried-Oken, M., & Staehely, J. (2003). Intelligibility of modifications to dysarthric speech. IEEE International Conference on Acoustics, Speech, and Signal Processing, I-924-I-927. https://doi.org/10.1109/ICASSP.2003.1198933
[13] IEEE recommended practice for speech quality measurements. (1969). IEEE No 297-1969, 1-24. https://doi.org/10.1109/IEEESTD.1969.7405210
[14] Jette, A. M., Spicer, C. M., & Flaubert, J. L. (2017). The Promise of Assistive Technology to Enhance Activity and Work Participation. National Academies Press. https://doi.org/10.17226/24740
[15] Jiao, Y., Tu, M., Berisha, V., & Liss, J. (2018). Simulating dysarthric speech for training data augmentation in clinical speech applications. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6009-6013. https://doi.org/10.1109/ICASSP.2018.8462290
[16] Joy, N. M., & Umesh, S. (2018). Improving acoustic models in TORGO dysarthric speech database. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 26(3), 637-645. https://doi.org/10.1109/TNSRE.2018.2802914
[17] Juang, B. H., & Rabiner, L. R. (1991). Hidden Markov Models for Speech Recognition.
[18] Kent, R., & Netsell, R. (1978). Articulatory abnormalities in athetoid cerebral palsy. Journal of Speech and Hearing Disorders, 43(3), 353-373. https://doi.org/10.1044/jshd.4303.353
[19] Kim, H., Hasegawa-Johnson, M., Perlman, A., Gunderson, J., Huang, T., Watkin, K., & Frame, S. (2008). Dysarthric speech database for universal access research. INTERSPEECH, 22-26.
[20] Kim, M., Cao, B., & Wang, J. (2019). Multi-view representation learning via canonical correlation analysis for dysarthric speech recognition. In K. Deng, Z. Yu, S. Patnaik, & J. Wang (Eds.), Recent Developments in Mechatronics and Intelligent Robotics (pp. 1085-1095). Springer International Publishing. https://doi.org/10.1007/978-3-030-00214-5_133
[21] Kim, M. J., Cao, B., An, K., & Wang, J. (2018). Dysarthric speech recognition using convolutional LSTM neural network. INTERSPEECH, 2948-2952. https://doi.org/10.21437/Interspeech.2018-2250
[22] Kim, M., Kim, Y., Yoo, J., Wang, J., & Kim, H. (2017). Regularized speaker adaptation of KL-HMM for dysarthric speech recognition. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 25(9), 1581-1591. https://doi.org/10.1109/TNSRE.2017.2681691
[23] Krigger, K. W. (2006). Cerebral palsy: an overview. American Family Physician, 73(1), 91-100.
[24] Lamere, P., Kwok, P., Gouvea, E., Raj, B., Singh, R., Walker, W., Warmuth, M., & Wolf, P. (2003). The CMU SPHINX-4 Speech Recognition System.
[25] Lang, A. E., & Lozano, A. M. (1998). Parkinson’s disease. New England Journal of Medicine, 339(16), 1130-1143. https://doi.org/10.1056/NEJM199810153391607
[26] Lin, Y., Wang, L., Li, S., Dang, J., & Ding, C. (2020). Staged knowledge distillation for end-to-end dysarthric speech recognition and speech attribute transcription. INTERSPEECH 2020, 4791-4795. https://doi.org/10.21437/Interspeech.2020-1755
[27] Menendez-Pidal, X., Polikoff, J. B., Peters, S. M., Leonzio, J. E., & Bunnell, H. T. (1996). The Nemours database of dysarthric speech. Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP ’96, 3, 1962-1965. https://doi.org/10.1109/ICSLP.1996.608020
[28] Mengistu, K. T., & Rudzicz, F. (2011). Adapting acoustic and lexical models to dysarthric speech. 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4924-4927. https://doi.org/10.1109/ICASSP.2011.5947460
[29] Mohri, M., Pereira, F., & Riley, M. (2002). Weighted finite-state transducers in speech recognition. Computer Speech & Language, 16(1), 69-88.
[30] Mustafa, M. B., Salim, S. S., Mohamed, N., ALQatab, B., & Siong, C. E. (2014). Severity-based adaptation with limited data for ASR to aid dysarthric speakers. PLOS ONE, 9(1), e86285. https://doi.org/10.1371/journal.pone.0086285
[31] Noyes, J., & Frankish, C. (1992). Speech recognition technology for individuals with disabilities. Augmentative and Alternative Communication, 8(4), 297-303. https://doi.org/10.1080/07434619212331276333
[32] Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). Librispeech: An ASR corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206-5210. https://doi.org/10.1109/ICASSP.2015.7178964
[33] Platt, L. J., Andrews, G., & Howie, P. M. (1980). Dysarthria of adult cerebral palsy: II. Phonemic analysis of articulation errors. Journal of Speech, Language, and Hearing Research, 23(1), 41-55.
[34] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., & Vesely, K. (2011). The Kaldi Speech Recognition Toolkit. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. http://infoscience.epfl.ch/record/192584
[35] Reichl, W., & Chou, W. (2000). Robust decision tree state tying for continuous speech recognition. IEEE Transactions on Speech and Audio Processing, 8(5), 555-566. https://doi.org/10.1109/89.861375
[36] Rowland, L. P., & Shneider, N. A. (2001). Amyotrophic lateral sclerosis. New England Journal of Medicine, 344(22), 1688-1700. https://doi.org/10.1056/NEJM200105313442207
[37] Rudzicz, F. (2007). Comparing speaker-dependent and speaker-adaptive acoustic models for recognizing dysarthric speech. In ASSETS ’07: Proceedings of the Ninth International ACM SIGACCESS Conference on Computers and Accessibility (p. 256). https://doi.org/10.1145/1296843.1296899
[38] Rudzicz, F., Namasivayam, A., & Wolff, T. (2010). The TORGO database of acoustic and articulatory speech from speakers with dysarthria. Language Resources and Evaluation, 46, 1-19. https://doi.org/10.1007/s10579-011-9145-0
[39] Rudzicz, F. (2010). Learning mixed acoustic/articulatory models for disabled speech. Workshop Mach. Learn. Assist. Technol. Neural Inf. Process. Syst., 70-78.
[40] Sak, H., Senior, A., & Beaufays, F. (2014). Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv:1402.1128 [cs, stat]. http://arxiv.org/abs/1402.1128
[41] Sawhney, N., & Wheeler, S. (2001). Using Phonological Context for Improved Recognition of Dysarthric Speech.
[42] Seong, W. K., Park, J. H., & Kim, H. K. (2012, July). Dysarthric speech recognition error correction using weighted finite state transducers based on context-dependent pronunciation variation. International Conference on Computers for Handicapped Persons. Springer, Berlin, Heidelberg, 475-482.
[43] Shor, J., Emanuel, D., Lang, O., Tuval, O., Brenner, M., Cattiau, J., Vieira, F., McNally, M., Charbonneau, T., Nollstadt, M., Hassidim, A., & Matias, Y. (2019). Personalizing ASR for dysarthric and accented speech with limited data. INTERSPEECH 2019, 784-788. https://doi.org/10.21437/Interspeech.2019-1427
[44] Sidi Yakoub, M., Selouani, S., Zaidi, B.-F., & Bouchair, A. (2020). Improving dysarthric speech recognition using empirical mode decomposition and convolutional neural network. EURASIP Journal on Audio, Speech, and Music Processing, 2020(1), 1. https://doi.org/10.1186/s13636-019-0169-5
[45] Sim, K. C., Narayanan, A., Misra, A., Tripathi, A., Pundak, G., Sainath, T., Haghani, P., Li, B., & Bacchiani, M. (2018). Domain adaptation using factorized hidden layer for robust automatic speech recognition. INTERSPEECH 2018, 892-896. https://doi.org/10.21437/Interspeech.2018-2246
[46] Takashima, Y., Takashima, R., Takiguchi, T., & Ariki, Y. (2020). Dysarthric speech recognition based on deep metric learning. INTERSPEECH 2020, 4796-4800. https://doi.org/10.21437/Interspeech.2020-2267
[47] Takashima, Y., Takiguchi, T., & Ariki, Y. (2019). End-to-end dysarthric speech recognition using multiple databases. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6395-6399. https://doi.org/10.1109/ICASSP.2019.8683803
[48] Taylor, P. (2009). Text-to-Speech Synthesis. Cambridge University Press.
[49] Tomanek, K., Beaufays, F., Cattiau, J., Chandorkar, A., & Sim, K. C. (2021). On-device personalization of automatic speech recognition models for disordered speech. arXiv:2106.10259 [cs, eess]. http://arxiv.org/abs/2106.10259
[50] Vachhani, B., Bhat, C., & Kopparapu, S. K. (2018). Data augmentation using healthy speech for dysarthric speech recognition. INTERSPEECH 2018, 471-475. https://doi.org/10.21437/Interspeech.2018-1751
[51] Wang, D., Yu, J., Wu, X., Sun, L., Liu, X., & Meng, H. (2020). Improved end-to-end dysarthric speech recognition via meta-learning based model re-initialization. arXiv preprint arXiv:2011.01686.
[52] Whurr, R. (1988). Clinical management of dysarthric speakers. Journal of Neurology, Neurosurgery & Psychiatry, 51(11), 1467-1468. https://doi.org/10.1136/jnnp.51.11.1467-c
[53] Wilson, B. B., John. (2000). Acoustic variability in dysarthria and computer speech recognition. Clinical Linguistics & Phonetics, 14(4), 307-327. https://doi.org/10.1080/02699200050024001
[54] Woszczyk, D., Petridis, S., & Millard, D. (2020). Domain adversarial neural networks for dysarthric speech recognition. INTERSPEECH 2020, 3875-3879. https://doi.org/10.21437/Interspeech.2020-2845
[55] Xiong, F., Barker, J., & Christensen, H. (2018). Deep learning of articulatory-based representations and applications for improving dysarthric speech recognition. Speech Communication; 13th ITG-Symposium, 1-5.
[56] Xiong, F., Barker, J., & Christensen, H. (2019). Phonetic analysis of dysarthric speech tempo and applications to robust personalised dysarthric speech recognition. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5836-5840. https://doi.org/10.1109/ICASSP.2019.8683091
[57] Xuan, G., Zhang, W., & Chai, P. (2001, October). EM algorithms of Gaussian mixture model and hidden Markov model. In Proceedings 2001 International Conference on Image Processing (Cat. No. 01CH37205), 1, 145-148.
[58] Young, V., & Mihailidis, A. (2010). Difficulties in automatic speech recognition of dysarthric speakers and implications for speech-based applications used by the elderly: a literature review. Assistive Technology, 22(2), 99-112. https://doi.org/10.1080/10400435.2010.483646
[59] Yu, D., & Deng, L. (2014). Automatic Speech Recognition: A Deep Learning Approach. Springer.
[60] Yue, Z., Christensen, H., & Barker, J. (2020). Autoencoder bottleneck features with multi-task optimisation for improved continuous dysarthric speech recognition. INTERSPEECH 2020, 4581-4585. https://doi.org/10.21437/Interspeech.2020-2746
[61] Yue, Z., Xiong, F., Christensen, H., & Barker, J. (2020). Exploring appropriate acoustic and language modelling choices for continuous dysarthric speech recognition. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6094-6098.
[62] Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33, 12449-12460.
[63] Gulati, A., Qin, J., Chiu, C. C., Parmar, N., Zhang, Y., Yu, J., ... & Pang, R. (2020). Conformer: Convolution-augmented transformer for speech recognition. INTERSPEECH 2020, 5036-5040.
EMBODIMENTS
Embodiment 1. A system for recognizing a speech of a single speaker comprising: a processor; a memory having instructions stored thereon, wherein execution of the instructions by the processor causes the processor to: receive audio data having low-intelligibility, severe dysarthric speech; determine, via a trained model, a plurality of estimated speech values for the severe, low-intelligibility dysarthric speech of the audio data, wherein the trained model was configured using a training dysarthric speech dataset acquired from one or more speakers of at least 1000 words and 4500 phonemes; and output the determined plurality of estimated speech values, wherein the determined plurality of estimated speech values is used for control or voice synthesis.
Embodiment 2. The system of the Embodiment 1, wherein the training dysarthric speech dataset has at least 40,000 words and 187,000 phonemes.
Embodiment 3. The system of any one of Embodiments 1-2, wherein the trained model comprises a trained hidden Markov model.
Embodiment 4. The system of any one of Embodiments 1-3, wherein the trained model comprises a transformer-based model (i) trained on a healthy speech dataset and (ii) fine-tuned on a smaller dysarthric speech dataset (an illustrative fine-tuning sketch follows Embodiment 28).
Embodiment 5. The system of any one of Embodiments 1-4, wherein the determined plurality of estimated speech values is used for health status monitoring and dysarthria condition analysis.
Embodiment 6. The system of any one of Embodiments 1-5, wherein the training dysarthric speech data comprises a non-dysarthric speech having 161 to 200 words per minute, a moderate dysarthric speech having between 121 and 160 words per minute, a severe dysarthric speech having less than 120 words per minute, or a combination thereof (an illustrative speaking-rate categorization sketch follows Embodiment 28).
Embodiment 7. The system of any one of Embodiments 1-6, wherein the audio data is of the single speaker, and wherein the estimated speech values are for the single speaker.
Embodiment 8. The system of any one of Embodiments 1-7, wherein the audio data is of a speaker different from the single speaker, and wherein the estimated speech values are for the different speaker.
Embodiment 9. The system of any one of Embodiments 1-8, wherein the trained model was characterized as having an accuracy for speech recognition greater than 85%.
Embodiment 10. The system of any one of Embodiments 1-9, further comprising: directing the plurality of estimated speech values to a human-machine interface.
Embodiment 11. The system of any one of Embodiments 1-10, further comprising a voice synthesis system and speaker.
Embodiment 12. The system of any one of Embodiments 3-11, wherein the hidden Markov model employed at least three states and five states to represent non-silence phonemes and silence phonemes, respectively.
Embodiment 13. The system of any one of Embodiments 3-12, wherein the hidden Markov model employed at least six states to represent each of non-silence and silence phonemes.
Embodiment 14. The system of any one of Embodiments 3-13, wherein the hidden Markov model employed a Gaussian mixture model (GMM), deep neural network (DNN), long-short term memory (LSTM)-recurrent neural network, and bidirectional-LSTM (BLSTM). An illustrative GMM-HMM sketch follows Embodiment 28.
Embodiment 15. A method for recognizing a speech of a single speaker comprising: receiving, by a processor, audio data having low-intelligibility, severe dysarthric speech; determining, via a trained model, a plurality of estimated speech values for the severe, low-intelligibility dysarthric speech of the audio data, wherein the trained model was configured using a training dysarthric speech dataset acquired from one or more speakers of at least 1000 words and 4500 phonemes; and outputting the determined plurality of estimated speech values, wherein the determined plurality of estimated speech values is used for control or voice synthesis.
Embodiment 16. The method of Embodiment 15, wherein the training dysarthric speech dataset has at least 40,000 words and 187,000 phonemes.
Embodiment 17. The method of any one of Embodiments 15-16, wherein the trained model comprises a trained hidden Markov model.
Embodiment 18. The method of any one of Embodiments 15-17, wherein the training dysarthric speech data comprises a non-dysarthric speech having 161 to 200 words per minute, a moderate dysarthric speech having between 121 and 160 words per minute, a severe dysarthric speech having less than 120 words per minute, or a combination thereof.
Embodiment 19. The method of any one of Embodiments 15-18, wherein the audio data is of the single speaker, and wherein the estimated speech values are for the single speaker.
Embodiment 20. The method of any one of Embodiments 15-19, wherein the audio data is of a speaker different from the single speaker, and wherein the estimated speech values are for the different speaker (e.g., wherein the different speaker has dysarthria).
Embodiment 21. The method of any one of Embodiments 15-20, wherein the trained model was characterized as having an accuracy for speech recognition greater than 85%.
Embodiment 22. The method of any one of Embodiments 15-21, further comprising: directing the plurality of estimated speech values to a human-machine interface.
Embodiment 23. The method of any one of Embodiments 17-22, wherein the hidden Markov model employed at least three states and five states to represent non-silence phonemes and silence phonemes, respectively.
Embodiment 24. The method of any one of Embodiments 17-23, wherein the hidden Markov model employed at least six states to represent each of non-silence and silence phonemes.
Embodiment 25. The method of any one of Embodiments 17-24, wherein the hidden Markov model employed a Gaussian mixture model (GMM), deep neural network (DNN), long-short term memory (LSTM)-recurrent neural network, and bidirectional-LSTM (BLSTM).
Embodiment 26. A non-transitory computer-readable medium having instructions stored thereon, wherein execution of the instructions by a processor causes the processor to: receive audio data having low-intelligibility, severe dysarthric speech; determine, via a trained model, a plurality of estimated speech values for the severe, low-intelligibility dysarthric speech of the audio data, wherein the trained model was configured using a training dysarthric speech dataset acquired from one or more speakers of at least 1000 words and 4500 phonemes; and output the determined plurality of estimated speech values, wherein the determined plurality of estimated speech values is used for control or voice synthesis.
Embodiment 27. A method for operating the system of any one of Embodiments 1-14.
Embodiment 28. A non-transitory computer-readable medium having instructions stored thereon, wherein execution of the instructions by a processor causes the processor to operate the system of any one of Embodiments 1-14 or perform the method of any one of Embodiments 15-25.


CLAIMS
What is claimed is:
1. A system for recognizing a speech of a single speaker comprising: a processor; a memory having instructions stored thereon, wherein execution of the instructions by the processor causes the processor to: receive audio data having low-intelligibility, severe dysarthric speech; determine, via a trained model, a plurality of estimated speech values for the severe, low-intelligibility dysarthric speech of the audio data, wherein the trained model was configured using a training dysarthric speech dataset acquired from one or more speakers of at least 1000 words and 4500 phonemes; and output the determined plurality of estimated speech values, wherein the determined plurality of estimated speech values is used for control or voice synthesis.
2. The system of claim 1, wherein the training dysarthric speech dataset has at least 40,000 words and 187,000 phonemes.
3. The system of any one of claims 1-2, wherein the trained model comprises a trained hidden Markov model.
4. The system of any one of claims 1-3, wherein the trained model comprises a transformer-based model (i) trained on a healthy speech dataset and (ii) fine-tuned on a smaller dysarthric speech dataset.
5. The system of any one of claims 1-4, wherein the determined plurality of estimated speech values is used for health status monitoring and dysarthria condition analysis.
6. The system of any one of claims 1-5, wherein the training dysarthric speech data comprises a non-dysarthric speech having 161 to 200 words per minute, a moderate dysarthric speech having between 121 and 160 words per minute, a severe dysarthric speech having less than 120 words per minute, or a combination thereof.
7. The system of any one of claims 1-6, wherein the audio data is of the single speaker, and wherein the estimated speech values are for the single speaker.
8. The system of any one of claims 1-7, wherein the audio data is of a speaker different from the single speaker, and wherein the estimated speech values are for the different speaker.
9. The system of any one of claims 1-8, wherein the trained model was characterized as having an accuracy for speech recognition greater than 85%.
10. The system of any one of claims 1-9, further comprising: directing the plurality of estimated speech values to a human-machine interface.
11. The system of any one of claims 1-10, further comprising a voice synthesis system and speaker.
12. The system of any one of claims 3-11, wherein the hidden Markov model employed at least three states and five states to represent non-silence phonemes and silence phonemes, respectively.
13. The system of any one of claims 3-12, wherein the hidden Markov model employed at least six states to represent each of non-silence and silence phonemes.
14. The system of any one of claims 3-13, wherein the hidden Markov model employed a Gaussian mixture model (GMM), deep neural network (DNN), long-short term memory (LSTM)-recurrent neural network, and bidirectional-LSTM (BLSTM).
15. A method for recognizing a speech of a single speaker comprising: receiving, by a processor, audio data having low-intelligibility, severe dysarthric speech; determining, via a trained model, a plurality of estimated speech values for the severe, low-intelligibility dysarthric speech of the audio data, wherein the trained model was configured using a training dysarthric speech dataset acquired from one or more speakers of at least 1000 words and 4500 phonemes; and outputting the determined plurality of estimated speech values, wherein the determined plurality of estimated speech values is used for control or voice synthesis.
16. The method of claim 15, wherein the training dysarthric speech dataset has at least 40,000 words and 187,000 phonemes.
17. The method of any one of claims 15-16, wherein the trained model comprises a trained hidden Markov model.
18. The method of any one of claims 15-17, wherein the training dysarthric speech data comprises a non-dysarthric speech having 161 to 200 words per minute, a moderate dysarthric speech having between 121 and 160 words per minute, a severe dysarthric speech having less than 120 words per minute, or a combination thereof.
19. The method of any one of claims 15-18, wherein the audio data is of the single speaker, and wherein the estimated speech values are for the single speaker.
20. A non-transitory computer-readable medium having instructions stored thereon, wherein execution of the instructions by a processor causes the processor to: receive audio data having low-intelligibility, severe dysarthric speech; determine, via a trained model, a plurality of estimated speech values for the severe, low-intelligibility dysarthric speech of the audio data, wherein the trained model was configured using a training dysarthric speech dataset acquired from one or more speakers of at least 1000 words and 4500 phonemes; and output the determined plurality of estimated speech values, wherein the determined plurality of estimated speech values is used for control or voice synthesis.