WO2018077244A1 - Acoustic-graphemic model and acoustic-graphemic-phonemic model for computer-aided pronunciation training and speech processing - Google Patents

Acoustic-graphemic model and acoustic-graphemic-phonemic model for computer-aided pronunciation training and speech processing

Info

Publication number
WO2018077244A1
Authority
WO
WIPO (PCT)
Prior art keywords
phone
sequence
units
grapheme
utterance
Prior art date
Application number
PCT/CN2017/108098
Other languages
English (en)
French (fr)
Inventor
Helen Mei-Ling MENG
Kun Li
Lifa Sun
Xixin WU
Original Assignee
The Chinese University Of Hong Kong
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Chinese University Of Hong Kong filed Critical The Chinese University Of Hong Kong
Priority to CN201780065301.4A priority Critical patent/CN109863554B/zh
Publication of WO2018077244A1 publication Critical patent/WO2018077244A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 - Phonemes, fenemes or fenones being the recognition units

Definitions

  • the present disclosure relates to computer-aided speech processing and in particular to an acoustic-graphemic model (AGM) and acoustic-graphemic-phonemic model (AGPM) for computer-aided pronunciation training and speech processing.
  • AGM acoustic-graphemic model
  • AGPM acoustic-graphemic-phonemic model
  • Second-language learners make pronunciation errors for various reasons.
  • One reason is language transfer, where the learner replaces an unfamiliar phone in the second language (L2) with a familiar phone of the first language (L1) .
  • Another reason is incorrect letter-to-sound conversion.
  • a third reason is misreading a text prompt.
  • CAPT Computer-aided pronunciation training
  • MFCCs Mel-frequency cepstral coefficients
  • the frequency-space representation can be analyzed using various computer-implemented models to assess the user’s pronunciation, and the assessment can be used to provide real-time feedback to the user.
  • the efficacy of CAPT depends on the reliability of the analysis, which depends on the models.
  • acoustic models can be trained using native-like (treated as correct) pronunciations and non-native (incorrect) pronunciations.
  • the models can be used to score an L2 learner’s speech.
  • dynamic time warping can be used to correlate a teacher’s utterance and a student’s utterance, and acoustic features of the two utterances can be compared using a scoring algorithm.
  • A. Lee et al. “A comparison-based approach to mispronunciation detection, ” Proc.
  • ERNs extended recognition networks
  • A.M. Harrison et al. “Implementation of an extended recognition network for mispronunciation detection and diagnosis in computer-assisted pronunciation training, ” Proc. SLaTE (2009) .
  • ERNs generally do not provide high coverage of possible mispronunciations, and phones that are missing from the ERN cannot be recognized. Further, there is a tradeoff between coverage and precision: increasing the number of possible mispronunciations that can be recognized increases the rate of inaccurate identification.
  • LDA linear discriminant analysis
  • Another approach uses posteriorgrams to represent the L2 English acoustic-phonetic space. (See A. Lee et al., “Mispronunciation detection via dynamic time warping on deep belief network posteriorgrams, ” Proc. ICASSP (2013)) .
  • the present inventors previously developed an Acoustic-Phonemic Model (APM) (See K. Li and H. Meng, “Mispronunciation detection and diagnosis in L2 English speech using multi-distribution deep neural networks, ” Proc. ISCSLP (2014) ) .
  • the APM can be implemented using a multi-distribution deep neural network (MD-DNN) for which the inputs include a representation of the user’s utterance of a prompted text and a corresponding canonical phonemic transcription of the text.
  • MD-DNN multi-distribution deep neural network
  • the APM can be trained using samples of speech that have been hand-annotated to indicate the actual phones that were spoken.
  • a representation of acoustical features of a user’s utterance (from prompted text) and a canonical phonemic transcription of the text can be input into the APM, which computes a posterior probability for each phone.
  • the APM can operate on a sequence of frames representing the utterance, where each frame corresponds to a different (possibly overlapping) time slice of the speech and can produce a set of posterior probabilities for each frame. From the posterior probabilities, a most probable sequence of phones can be determined, e.g., using a Viterbi decoding algorithm and a state transition model that provides the probabilities of phone-state transitions given a particular sequence of preceding phones. The most probable phone sequence can be used to determine whether the speech was correct or incorrect. While the APM provides good performance compared to other existing techniques, further improvement is still desirable.
  • An AGM is a multi-distribution deep neural network (MD-DNN) for which the inputs include a representation of acoustical features of an utterance (which can be from a prompted text) and a corresponding graphemic transcription of the text.
  • MD-DNN multi-distribution deep neural network
  • Graphemes generally correspond to units of a writing system, e.g., letters in an alphabet, and for many languages (such as English) there is a strong, though imperfect, correlation between a grapheme and a phone.
  • An AGM can implicitly model the grapheme-to-likely-pronunciation conversion.
  • An AGM can be trained using utterances of prompted text that have been hand-annotated to align the utterance with actual spoken phones; a graphemic representation of the text is also used in training. After training, a representation of acoustic features of a user’s utterance (from prompted text) and a graphemic representation of the text can be input into the AGM, which computes a posterior probability for each phone.
  • the AGM can operate on a sequence of frames representing the utterance, where each frame corresponds to a different (possibly overlapping) time slice of the speech and can produce a set of posterior probabilities for each frame.
  • a most probable sequence of phones can be determined, e.g., using a Viterbi decoding algorithm and a state transition model that provides the probabilities of phone-state transitions given a particular sequence of preceding phones.
  • the most probable phone sequence can be used to determine whether the speech was correct or incorrect.
  • An AGPM is an MD-DNN for which the inputs include a representation of acoustical features of an utterance (which can be from a prompted text) , a corresponding graphemic transcription of the text, and a canonical phonemic transcription of the text. Accordingly, an AGPM can implicitly model both the grapheme-to-likely-pronunciation conversion and the phone-to-likely-pronunciation conversion.
  • An AGPM can be trained using utterances of prompted text that have been hand-annotated to align the utterance with actual spoken phones; a graphemic representation of the text and a canonical phonemic representation are also used in training. After training, a representation of acoustic features of a user’s utterance (from prompted text) , a graphemic representation of the text, and a canonical phonemic representation of the text can be input into the AGPM, which computes a posterior probability for each phone.
  • the AGPM can operate on a sequence of frames representing the utterance, where each frame corresponds to a different (possibly overlapping) time slice of the speech and can produce a set of posterior probabilities for each frame.
  • a most probable sequence of phones can be determined, e.g., using a Viterbi decoding algorithm and a state transition model that provides the probabilities of phone-state transitions given a particular sequence of preceding phones.
  • the most probable phone sequence can be used to determine whether the speech was correct or incorrect.
  • FIG. 1 shows a conceptual illustration of a method for analyzing speech according to an embodiment of the present invention.
  • FIG. 2 is a flow chart showing a training process that can be used to train an AGM according to an embodiment of the present invention.
  • FIG. 3 illustrates a representative structure of a grapheme-level acoustical model (G-AM) according to an embodiment of the present invention.
  • FIG. 4 shows an example of speech aligned to graphemes for a speaker uttering a phrase.
  • FIG. 5 illustrates a representative structure of an AGM according to an embodiment of the present invention.
  • FIG. 6 is a flow chart showing a speech analysis process incorporating an AGM according to an embodiment of the present invention.
  • FIG. 7 illustrates a representative structure of a state transition model that can be used to provide the probabilities of phone-state transitions according to an embodiment of the present invention.
  • FIG. 8 shows a conceptual illustration of a method for analyzing speech using an AGPM according to an embodiment of the present invention.
  • FIG. 9 is a flow chart showing a training process that can be used to train an AGPM according to an embodiment of the present invention.
  • FIG. 10 illustrates a representative structure of an AGPM according to an embodiment of the present invention.
  • FIG. 11 is a flow chart showing a speech analysis process incorporating an AGPM according to an embodiment of the present invention.
  • FIG. 12 shows correctness and accuracy for specific implementations of AGM and AGPM speech-analysis approaches, as well as for an implementation of the APM approach previously developed and reported by the inventors.
  • FIG. 13 shows a classification hierarchy that can be used in evaluating mispronunciation diagnosis and detection (MDD) performance.
  • FIG. 14 is a table showing various MDD-related metrics for specific implementations of the AGM and AGPM approaches, as well as for another approach previously developed and reported by the inventors.
  • FIG. 15 is a table showing additional MDD-related metrics for specific implementations of the AGM and AGPM approaches, as well as for another approach previously developed and reported by the inventors.
  • FIG. 16 is a table indicating various pronunciations of three example English words and implicit modeling ability of APM, AGM, and AGPM for each.
  • FIG. 1 shows a conceptual illustration of a method for analyzing speech according to an embodiment of the present invention.
  • a user is prompted to speak a text, and acoustic features 102 are extracted from the user’s utterance.
  • The graphemes 104 (e.g., individual letters of an alphabet or the like) that make up the text are known and can be time-correlated with acoustic features of the utterance.
  • Acoustic features 102 and graphemes 104 corresponding to a frame (i.e., a time window) of the utterance are provided to an acoustic-graphemic model (AGM) 106, which is a neural network that has been trained to compute a posterior probability of a given phone given an input data set consisting of acoustic features and graphemes. Examples of configuration and training of AGM 106 are described below.
  • the probabilities ⁇ p ⁇ determined by AGM 106 for each frame are provided to a Viterbi decoder 108, which can generate a most likely sequence of phones 110 using a state transition model that computes the probability of a particular phone given a sequence of prior phones.
  • FIG. 2 is a flow chart showing a training process 200 that can be used to train AGM 106 according to an embodiment of the present invention.
  • a training corpus consisting of representations of utterances of prompted texts by various speakers can be obtained.
  • a corpus can be generated by asking a number of speakers to speak a prompted text in a target language (L2) and recording the result, e.g., in a digital format.
  • the text may include a word, a phrase, a sentence, or a longer text, and a combination of different texts of various lengths may be used.
  • L2 should be the same for all samples, and the speakers can include both native and non-native speakers of L2.
  • the non-native speakers may have the same first language (L1) ; in other embodiments, the non-native speakers may have different first languages (or dialects) .
  • the utterances can include utterances from multiple speakers, and it is not necessary that all speakers who provide utterances for the training set speak the same texts.
  • pre-existing corpora of utterances can be used.
  • the CU-CHLOE (Chinese University Chinese Learners of English) corpus is an existing corpus of recorded speech that contains English-language speech recorded from 110 Mandarin speakers (60 males, 50 females) and 100 Cantonese speakers (50 males, 50 females) .
  • the speech includes sets of confusable words, sets of minimal pairs, phonemic sentences, Aesop’s fable “The North Wind and the Sun, ” and prompts from the TIMIT corpus, which is as described in W.M. Fisher et al., “The DARPA speech recognition research database: specifications and stats, ” Proc. DARPA Workshop on Speech Recognition (1986) ; L. F. Lamel et al., “Speech database development: Design and analysis of the acoustic-phonetic corpus, ” Speech Input/Output Assessment and Speech Databases (1989) ; V. Zue et al., Speech Communication, 9: 4, 351-356 (1990) .
  • the training corpus may include recordings that have been annotated by trained linguists to indicate the spoken phones.
  • the training corpus can be randomly divided into training, development, and testing sets. A specific example is described below.
  • The utterances are annotated with corresponding “heard” phones (q^Ann), which may be determined by a linguist (or other trained person) who listens to the samples and identifies phones; the sequence of heard phones may be referred to as a phonemic transcription.
  • utterances in a pre-existing corpus may already be annotated.
  • Utterances may also be annotated with other information, such as word boundaries, graphemes, canonical phones, or other features pertaining to the utterance.
  • sets of acoustic features representing particular utterances from the training corpus are generated.
  • speech can be digitally sampled at a constant rate of, e.g., 16 kHz, using standard audio digitizing equipment.
  • A pre-emphasis filter can be applied (e.g., a filter with a transfer function of 1 - 0.97z^-1).
  • the speech samples can be transformed to frequency space, e.g., by applying a Fast Fourier Transform in a 25-ms Hamming window with a 10-ms frame shift.
  • Coefficients representing acoustic features of a frame can be extracted (e.g., from each Hamming window); in some embodiments, the coefficients can be a set of 13 Mel-frequency cepstral coefficients (MFCCs), which can be computed using conventional techniques. Cepstral mean normalization may be applied to each utterance, and features can be scaled, e.g., to have zero mean and unit variance over the corpus. As used herein, x_t denotes the set of coefficients representing acoustic features of the frame associated with time t after all normalizations and scaling.
  • MFCCs Mel-frequency cepstral coefficients
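As an illustration of the front-end processing described above, the following Python sketch extracts 13 MFCCs per 25-ms Hamming window with a 10-ms frame shift after pre-emphasis, and applies per-utterance cepstral mean normalization. The use of numpy and librosa, and the function name extract_features, are assumptions of this sketch rather than requirements of the disclosure.

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    """Illustrative front end: 13 MFCCs per 25-ms Hamming window, 10-ms shift."""
    y, _ = librosa.load(wav_path, sr=sr)

    # Pre-emphasis filter with transfer function 1 - 0.97 z^-1.
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])

    # 13 MFCCs from 25-ms Hamming windows shifted by 10 ms.
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr), window="hamming",
    )  # shape: (13, num_frames)

    # Per-utterance cepstral mean normalization.
    mfcc -= mfcc.mean(axis=1, keepdims=True)

    # One 13-dimensional feature vector x_t per frame; scaling to zero mean
    # and unit variance over the whole corpus would be applied afterwards.
    return mfcc.T
```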
  • The acoustic features x_t are temporally aligned with graphemes of the prompted text using a grapheme-level acoustic model (G-AM).
  • The G-AM is a neural network that computes the posterior probability p(g_t | x_t) of a grapheme g_t given the acoustic features x_t of a frame.
  • FIG. 3 illustrates a representative structure of a G-AM.
  • Bottom layer 302 is an input layer that can receive acoustic features x_t for one or more frames. In some embodiments, features for multiple frames are provided.
  • Top layer 304 is an output layer that, in the case where the language being modeled is English, can include 28 units (or nodes) corresponding to 28 graphemes (26 letters in the English alphabet plus two units to represent the word boundary and the apostrophe).
  • The output units provide the posterior probability p(g_t | x_t) for each grapheme.
  • Hidden layers 306 may be arranged as desired. For example, there may be four hidden layers 306 with 512 units (or nodes) each. Optimal configurations may be determined experimentally.
  • Training of the G-AM requires information about the time boundaries of graphemes within each utterance, which may or may not be available (depending on how the training corpus has been annotated) . If annotations indicating word boundaries or phone boundaries are provided in the training corpus, time boundaries of graphemes can be determined during training. For example, if word boundaries are annotated (or derivable from annotated phones) , grapheme boundaries can be derived by dividing the graphemes equally within their words, training the G-AM, running forced alignment, then re-training. This training can be iterated until the performance converges (e.g., 3 to 5 iterations) .
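To make the bootstrapping concrete, the following sketch shows only the first step described above: dividing each annotated word interval equally among its graphemes to obtain initial grapheme boundaries. The training, forced-alignment, and re-training iterations would then refine these boundaries; the function name and data layout here are illustrative assumptions.

```python
from typing import List, Tuple

def initial_grapheme_boundaries(
    words: List[Tuple[str, float, float]]
) -> List[Tuple[str, float, float]]:
    """Bootstrap grapheme time boundaries by dividing each annotated word
    interval equally among the word's graphemes (letters)."""
    graphemes = []
    for spelling, start, end in words:
        step = (end - start) / len(spelling)
        for i, letter in enumerate(spelling.lower()):
            graphemes.append((letter, start + i * step, start + (i + 1) * step))
    return graphemes

# Example: word-boundary annotations for "the north" (times are illustrative).
words = [("the", 0.10, 0.35), ("north", 0.35, 0.90)]
for g in initial_grapheme_boundaries(words):
    print(g)
```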
  • The G-AM may be used to determine a probability of each grapheme (g_t) at a particular frame from a given set of acoustic features x_t.
  • FIG. 4 shows an example of speech aligned to graphemes for an L2 English speaker uttering the phrase “the north. ”
  • Shown in graph 400 is a speech waveform as a function of time for the utterance.
  • Spectrum plot 402 represents frequency-spectrum characteristics for windows as described above, which can be used to derive the acoustic features x t for each frame.
  • Section 404 represents the mapping of graphemes to the utterance, which can be determined using the G-AM as described above.
  • Section 406 represents the canonical phones (q^Dict) associated with the utterance, and section 408 represents the annotation of heard phones (q^Ann).
  • Heard phones q^Ann may be used in training of the AGM as described below.
  • Canonical phones q^Dict are not used in the AGM but may be used in other models, such as an AGPM as described below.
  • FIG. 5 illustrates a representative structure of an AGM according to an embodiment of the present invention.
  • Bottom layer 502 is an input layer that can receive the acoustic and graphemic features for one or more frames. More specifically, input units 502a receive acoustic features x_t for one or more frames, while input units 502b receive graphemes g_t for one or more frames. In some embodiments, information for multiple frames is provided.
  • Units 502a, which receive the acoustic features, can be linear units with Gaussian noise, and there can be one unit for each coefficient (for 13 coefficients per frame and 21 frames, there would be 273 units 502a).
  • Units 502b, which receive the graphemes, can be binary units.
  • The number of graphemes g_t can be the same as or different from the number of frames for which acoustic features are provided; for instance, a total of 7 graphemes (the grapheme at the tth frame, the 3 preceding graphemes, and the 3 succeeding graphemes) may be used.
  • Each grapheme can be encoded using 5 bits, and there can be 35 binary units 502b.
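A minimal sketch of this binary grapheme encoding is shown below; the 28-symbol inventory, the "<wb>" word-boundary symbol, and the helper names are assumptions of this illustration.

```python
import numpy as np

# Illustrative 28-symbol grapheme inventory: 26 letters plus a word-boundary
# marker and an apostrophe; indices 0..27 fit into 5 bits.
GRAPHEMES = list("abcdefghijklmnopqrstuvwxyz") + ["<wb>", "'"]
G_INDEX = {g: i for i, g in enumerate(GRAPHEMES)}

def encode_grapheme_window(window, bits=5):
    """Encode a window of graphemes (e.g., the grapheme aligned to the current
    frame plus its 3 preceding and 3 succeeding graphemes) as a binary vector;
    7 graphemes at 5 bits each gives the 35 binary inputs described above."""
    vec = []
    for g in window:
        code = G_INDEX[g]
        vec.extend((code >> b) & 1 for b in reversed(range(bits)))
    return np.array(vec, dtype=np.float32)

window = ["<wb>", "t", "h", "e", "<wb>", "n", "o"]
print(encode_grapheme_window(window).shape)  # (35,)
```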
  • Top layer 504 is an output layer that can include a unit (or node) corresponding to each phone that it is desirable to distinguish; for example, if there are 48 phones to be distinguished, top layer 504 would include 48 units.
  • The output units provide the posterior probability p(s_t | x_t, g_t) for each phone.
  • Hidden layers 506 may be configured as desired. For example, there may be four hidden layers with 512 nodes each. Optimal configurations may be determined experimentally. For example, it has been found that increasing the number of nodes per layer beyond 512 significantly increases computing time but does not significantly improve accuracy.
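For concreteness, a minimal PyTorch sketch of such a network is shown below, with 273 acoustic inputs, 35 grapheme inputs, four 512-unit hidden layers, and a 48-way softmax output. PyTorch, the sigmoid activations, and the class name AGM are assumptions of this sketch; the disclosure describes an MD-DNN built by stacking RBMs, and this plain feed-forward version only illustrates the layer dimensions and the input/output interface.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AGM(nn.Module):
    """Feed-forward sketch of an acoustic-graphemic model: 21 frames of 13
    MFCCs (273 inputs) concatenated with 7 binary-encoded graphemes (35
    inputs), four 512-unit hidden layers, softmax over 48 phones."""

    def __init__(self, n_acoustic=273, n_grapheme=35, n_hidden=512,
                 n_layers=4, n_phones=48):
        super().__init__()
        layers, width = [], n_acoustic + n_grapheme
        for _ in range(n_layers):
            layers += [nn.Linear(width, n_hidden), nn.Sigmoid()]
            width = n_hidden
        self.hidden = nn.Sequential(*layers)
        self.out = nn.Linear(width, n_phones)

    def forward(self, acoustic, graphemes):
        # Log posterior probabilities p(s_t | x_t, g_t) for each output unit.
        h = self.hidden(torch.cat([acoustic, graphemes], dim=-1))
        return F.log_softmax(self.out(h), dim=-1)

model = AGM()
log_post = model(torch.randn(1, 273), torch.zeros(1, 35))
print(log_post.shape)  # torch.Size([1, 48])
```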
  • Training may proceed using conventional techniques for training a multi-distribution deep neural network (MD-DNN) .
  • the AGM may be constructed by stacking up multiple Restricted Boltzmann Machines (RBMs) from bottom up.
  • RBMs Restricted Boltzmann Machines
  • a layer-by-layer unsupervised pre-training algorithm can be followed by fine-tuning using the back-propagation algorithm.
  • Other techniques may also be used.
  • the AGM may be used in analyzing utterances of a speaker of language L2, also referred to as a “user” of the AGM.
  • the user may be, for example, a non-native speaker of language L2.
  • FIG. 6 is a flow chart showing a speech analysis process 600 incorporating AGM 106 according to an embodiment of the present invention. It is assumed that, prior to execution of process 600, AGM 106 has been trained using process 200 or other similar process.
  • a representation of an utterance by the user from a prompted text is obtained. For instance, a user of a CAPT system may be prompted to speak, and the utterance may be recorded (e.g., in digital format) and saved for processing.
  • a set of acoustic features representing the utterance is generated. The generation of acoustic features may use the same processing as block 206 of process 200.
  • The trained G-AM (from block 208 of process 200) is used to align acoustic features x_t with graphemes g_t in a sequence of graphemes corresponding to the prompted text.
  • the frames and associated graphemes are provided as inputs to the AGM.
  • the configuration of inputs e.g., number of frames, binary encoding of graphemes, etc.
  • The AGM produces as its output the posterior probabilities p(s_t | x_t, g_t) for each frame.
  • A decoding algorithm can be applied to the set of posterior probabilities p(s_t | x_t, g_t) to determine a most likely phone sequence for the utterance.
  • In some embodiments, a Viterbi decoding algorithm is used, in which the most likely phone-state sequence is given by s* = argmax_s p(s | x, g) (Eq. (1)), where x is the sequence of acoustic feature vectors, g is the grapheme sequence extracted from the prompted words, and s denotes a possible phone-state sequence.
  • The posterior p(s | x, g) can be approximated as a frame-by-frame product of the phone-state posterior probabilities p(s_t | x_t, g_t) produced by the AGM and the phone-state transition probabilities p(s_t | s_(t-1), ..., s_(t-n)); that is, p(s | x, g) ≈ ∏_t p(s_t | s_(t-1), ..., s_(t-n)) p(s_t | x_t, g_t) (Eq. (2)).
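The following sketch shows frame-synchronous Viterbi decoding over a matrix of log posteriors. For brevity it uses a first-order (bigram) transition matrix in place of the 7-gram state transition model described below; numpy and the function name viterbi_decode are assumptions of this illustration.

```python
import numpy as np

def viterbi_decode(log_post, log_trans):
    """Frame-synchronous Viterbi decoding.

    log_post:  (T, S) log posteriors for each frame, e.g. from the AGM.
    log_trans: (S, S) log transition probabilities p(s_t | s_{t-1}); a
               first-order stand-in for the n-gram STM described below.
    Returns the most likely state sequence of length T.
    """
    T, S = log_post.shape
    score = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    score[0] = log_post[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans      # (prev_state, state)
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_post[t]
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 4 frames, 3 states, random (but normalized) probabilities.
rng = np.random.default_rng(0)
log_post = np.log(rng.dirichlet(np.ones(3), size=4))
log_trans = np.log(rng.dirichlet(np.ones(3), size=3))
print(viterbi_decode(log_post, log_trans))
```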
  • a state transition model (STM) can be generated to provide appropriate probability estimates.
  • FIG. 7 illustrates a representative structure of an STM that can be used to provide probabilities of particular phone states according to an embodiment of the present invention.
  • In some embodiments, a 7-gram STM is used, where the probability of phone state s_t depends on the six preceding phone states.
  • The STM can be implemented as a neural network.
  • The elements of a 90-phone-state set can be represented using 7 bits, and the inputs are the six preceding phone states (s_(t-6), ..., s_(t-1)).
  • Accordingly, bottom layer 702 may include 42 binary units.
  • Top layer 704 may include 90 output units representing the 90 possible phone states. These output units can provide the phone-state transition probability p(s_t | s_(t-6), ..., s_(t-1)).
  • Hidden layers 706 may be arranged as desired.
  • Optimal configurations may be determined experimentally and may depend on choices about the phone-state set (e.g., how many distinct phone states are being identified) .
  • the STM may be trained in a conventional manner.
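A compact sketch of such a state transition model is given below: six preceding phone-state indices are binary-encoded into 42 inputs and mapped to a softmax over 90 states. PyTorch, the single hidden layer, and the helper names are assumptions of this illustration (the number and size of hidden layers may be arranged as desired).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def encode_state_history(history, bits=7):
    """Encode the six preceding phone-state indices (each below 2**7 = 128,
    enough for a 90-state set) as a 6 * 7 = 42-dimensional binary vector."""
    vec = [(s >> b) & 1 for s in history for b in reversed(range(bits))]
    return torch.tensor(vec, dtype=torch.float32)

class StateTransitionModel(nn.Module):
    """Neural 7-gram STM sketch: 42 binary inputs mapped to a softmax over
    90 phone states, i.e. p(s_t | s_{t-6}, ..., s_{t-1})."""
    def __init__(self, n_in=42, n_hidden=512, n_states=90):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_states),
        )

    def forward(self, x):
        return F.log_softmax(self.net(x), dim=-1)

stm = StateTransitionModel()
hist = encode_state_history([12, 12, 12, 37, 37, 85])  # six prior states
print(stm(hist.unsqueeze(0)).shape)                    # torch.Size([1, 90])
```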
  • At this point, a most likely phone sequence for the utterance has been determined.
  • this phone sequence can be used for further analysis, such as mispronunciation detection and diagnosis (MDD) .
  • MDD may be based on comparing the most likely phone sequence with a canonical phone sequence associated with the text that was spoken at block 602; differences can be identified as mispronunciations.
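As a simple illustration of this comparison, the sketch below aligns a decoded phone sequence against a canonical phone sequence using Python's difflib and reports substitutions, deletions, and insertions; difflib is used here as a stand-in for a phone-level aligner, and the phone labels in the example are hypothetical.

```python
from difflib import SequenceMatcher

def detect_mispronunciations(canonical, recognized):
    """Align the decoded phone sequence against the canonical phone sequence
    and report substitutions, deletions, and insertions as mispronunciations."""
    findings = []
    sm = SequenceMatcher(None, canonical, recognized, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "replace":
            findings.append(("substitution", canonical[i1:i2], recognized[j1:j2]))
        elif op == "delete":
            findings.append(("deletion", canonical[i1:i2], []))
        elif op == "insert":
            findings.append(("insertion", [], recognized[j1:j2]))
    return findings

canonical  = ["dh", "ax", "n", "ao", "r", "th"]   # canonical phones, illustrative
recognized = ["d",  "ax", "n", "ao", "r", "s"]    # hypothetical decoded phones
print(detect_mispronunciations(canonical, recognized))
```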
  • MDD may lead to further actions, such as providing feedback to the user indicating whether the pronunciation was correct, prompting the user to listen to a correct pronunciation and try again, highlighting portions of the text where incorrect pronunciation occurred, selecting a different text for the user to utter, and so on.
  • the particular form of feedback and/or guidance may be implementation-dependent and is not critical to understanding the present invention.
  • an AGM can be used to identify spoken phones based on acoustic features of an utterance and graphemes associated with a prompted text.
  • the AGM can implicitly model the grapheme-to-likely-pronunciation conversion.
  • the AGM can improve on the performance of conventional acoustic models.
  • an AGM can be used in instances where a canonical phonemic transcription of the text is not available.
  • a canonical phonemic transcription of the prompted text may be available, and this information can be incorporated into a neural network along with the acoustic features and graphemes.
  • some embodiments of the present invention relate to an acoustic-graphemic-phonemic model (AGPM) .
  • AGPM can incorporate both the graphemes and the phones of the prompted text and accordingly can implicitly model both the grapheme-to-likely-pronunciation conversion and the phone-to-likely-pronunciation conversion.
  • Speech analysis processes using an AGPM can be generally similar to processes described above using an AGM, except for the introduction of additional information.
  • FIG. 8 shows a conceptual illustration of a method for analyzing speech using an AGPM according to an embodiment of the present invention.
  • a user is prompted to speak a text, and acoustic features 802 are extracted from the user’s utterance.
  • the graphemes 804 that make up the text are known and can be time-correlated with acoustic features of the utterance.
  • a canonical phonemic sequence 806 of the text is assumed to be available.
  • Acoustic features 802, graphemes 804, and canonical phones 806 corresponding to a frame of the utterance are provided to an acoustic-graphemic-phonemic model (AGPM) 808, which is a neural network that has been trained to compute a posterior probability of a given phone given an input data set consisting of acoustic features, graphemes, and canonical phones. Examples of configuration and training of AGPM 808 are described below.
  • the probabilities ⁇ p ⁇ determined by AGPM 808 for each frame are provided to a Viterbi decoder 810, which can generate a most likely sequence of phones 812 using a state transition model that predicts the probability of a particular phone given a sequence of prior phones.
  • FIG. 9 is a flow chart showing a training process 900 that can be used to train AGPM 808 according to an embodiment of the present invention.
  • a training corpus consisting of utterances of prompted texts by various speakers can be obtained, similarly to block 202 of process 200; as in process 200, pre-existing corpora of utterances can be used.
  • The utterances are annotated with corresponding “heard” phones (q^Ann) and canonical phones (q^Dict).
  • utterances in a pre-existing corpus may already be annotated.
  • Utterances may also be annotated with other information, such as word boundaries, graphemes, canonical phones, or other features pertaining to the utterance.
  • Sets of acoustic features x_t representing particular utterances from the training corpus are generated.
  • Generation of acoustic features can be similar or identical to block 206 of process 200.
  • The acoustic features x_t are temporally aligned with graphemes of the prompted text using a grapheme-level acoustic model (G-AM).
  • G-AM grapheme-level acoustic model
  • The acoustic features x_t are temporally aligned with phones of the canonical transcription.
  • the alignment can be obtained directly from the annotations at block 904.
  • the canonical phonemic transcription might not be temporally aligned with the utterance, in which case a state-level acoustic model (S-AM) can be used to perform the alignment.
  • The S-AM can be similar in implementation and operation to the G-AM, except that its outputs correspond to the posterior probability p(q_t | x_t) of a phone (or phone state) q_t, rather than a grapheme, given the acoustic features.
  • FIG. 10 illustrates a representative structure of an AGPM according to an embodiment of the present invention.
  • Bottom layer 1002 is an input layer that can receive acoustic, graphemic, and canonical phonemic features for one or more frames. More specifically, input units 1002a receive acoustic features x_t for one or more frames, while input units 1002b receive one or more graphemes g_t and input units 1002c receive one or more canonical phones q_t^Dict.
  • information for multiple frames is provided.
  • a total of 21 frames (the tth frame, the 10 preceding frames and the 10 succeeding frames) may be used.
  • Units 1002a, which receive the acoustic features, can be linear units with Gaussian noise, and there can be one unit for each coefficient (for 13 coefficients per frame and 21 frames, there would be 273 units 1002a).
  • Units 1002b, which receive the graphemes, can be binary units.
  • The number of graphemes g_t can be the same as or different from the number of frames for which acoustic features are provided; for instance, a total of 7 graphemes (the grapheme at the tth frame, the 3 preceding graphemes, and the 3 succeeding graphemes) may be used.
  • Each grapheme can be encoded using 5 bits, and there can be 35 binary units 1002b.
  • Units 1002c, which receive the canonical phones, can also be binary units.
  • The number of canonical phones q_t^Dict can be the same as or different from the number of graphemes; for instance, a total of 7 canonical phones (the canonical phone at the tth frame, the 3 preceding canonical phones, and the 3 succeeding canonical phones) may be used. If each phone is encoded using 6 bits (enough to distinguish 48 phones), there can be 42 binary units 1002c.
  • Top layer 1004 is an output layer that can include a unit (or node) corresponding to each phone state that it is desirable to distinguish; for example, if 90 phone states are to be distinguished, top layer 1004 may include 90 units.
  • The output units provide the posterior probability p(s_t | x_t, g_t, q_t^Dict) for each phone state.
  • Hidden layers 1006 may be configured as desired. For example, there may be four hidden layers with 512 nodes each. Optimal configurations may be determined experimentally. For example, it has been found that increasing the number of nodes per layer beyond 512 significantly increases computing time but does not significantly improve accuracy.
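Mirroring the AGM sketch above, a minimal PyTorch version of the AGPM input/output interface might look as follows; the 350-dimensional input (273 + 35 + 42), four 512-unit hidden layers, and 90 phone-state outputs follow the example dimensions in the text, while PyTorch, the activations, and the class name are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AGPM(nn.Module):
    """Feed-forward sketch of an acoustic-graphemic-phonemic model: 273
    acoustic inputs + 35 grapheme bits + 42 canonical-phone bits = 350 inputs,
    four 512-unit hidden layers, softmax over 90 phone states."""

    def __init__(self, n_in=273 + 35 + 42, n_hidden=512, n_layers=4, n_states=90):
        super().__init__()
        layers, width = [], n_in
        for _ in range(n_layers):
            layers += [nn.Linear(width, n_hidden), nn.Sigmoid()]
            width = n_hidden
        self.hidden = nn.Sequential(*layers)
        self.out = nn.Linear(width, n_states)

    def forward(self, acoustic, graphemes, canonical_phones):
        # Log posteriors p(s_t | x_t, g_t, q_t^Dict) over the phone-state set.
        x = torch.cat([acoustic, graphemes, canonical_phones], dim=-1)
        return F.log_softmax(self.out(self.hidden(x)), dim=-1)

out = AGPM()(torch.randn(1, 273), torch.zeros(1, 35), torch.zeros(1, 42))
print(out.shape)  # torch.Size([1, 90])
```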
  • training of the AGPM may proceed using conventional techniques for training a multi-distribution deep neural network (MD-DNN) .
  • MD-DNN multi-distribution deep neural network
  • Other techniques may also be used.
  • the AGPM may be used in analyzing utterances of a speaker of language L2, also referred to as a user of the AGPM.
  • the user may be, for example, a non-native speaker of language L2.
  • FIG. 11 is a flow chart showing a speech analysis process 1100 incorporating AGPM 808 according to an embodiment of the present invention. It is assumed that, prior to execution of process 1100, AGPM 808 has been trained using process 900 or other similar process.
  • an utterance spoken from a prompted text is obtained. For instance, a user of a CAPT system may be prompted to speak, and the utterance may be digitized and saved for processing. It is assumed that the prompted text has an associated canonical phoneme transcription, e.g., as shown at section 406 of FIG. 4, and associated graphemes.
  • a set of acoustic features representing the utterance is generated. The generation of acoustic features may use the same processing as block 906 of process 900 (or block 206 of process 200) .
  • The trained G-AM (from block 908 of process 900) is used to align acoustic features x_t with graphemes g_t of the prompted text.
  • The acoustic features x_t are temporally aligned with the phones of the canonical transcription.
  • A trained S-AM, as described above with reference to block 910 of process 900, can be used to determine the alignment. Other techniques can also be used.
  • the frames, associated graphemes, and canonical phones are provided as inputs to the AGPM.
  • the configuration of inputs e.g., number of frames, binary encoding of graphemes and phones, etc.
  • The AGPM produces as its output the posterior probabilities p(s_t | x_t, g_t, q_t^Dict) for each frame.
  • A decoding algorithm can be applied to the set of posterior probabilities p(s_t | x_t, g_t, q_t^Dict) to determine a most likely phone sequence, e.g., using a Viterbi decoder and a state transition model as described above.
  • At this point, a most likely phone sequence for the utterance has been determined.
  • this phone sequence can be used for further analysis, such as MDD.
  • MDD may be based on comparing the most likely phone sequence to the canonical phone sequence associated with the text that was spoken at block 1102; differences can be identified as mispronunciations.
  • MDD may lead to further actions, such as providing feedback to the user indicating whether the pronunciation was correct, prompting the user to listen to a correct pronunciation and try again, highlighting portions of the text where incorrect pronunciation occurred, selecting a different text for the user to utter, and so on.
  • the particular form of feedback and/or guidance may be implementation-dependent and is not critical to understanding the present invention.
  • L2 English speech in the CU-CHLOE corpus was annotated using acoustic models trained on the TIMIT corpus to align canonical transcriptions with the L2 English speech. Trained linguists annotated the speech with actual pronunciations; to save time, the annotation was done mainly by modifying the canonical phone sequences to indicate mispronunciations without changing phone boundaries. The annotated phone sequences were realigned using the S-AM described above.
  • Several DNNs, including an AGM and an AGPM, were trained for evaluation.
  • Each DNN had four hidden layers with 512 nodes each.
  • Training began with a pre-training stage, in which all data were used to maximize the log-likelihood of the RBMs.
  • One-step Contrastive Divergence was adopted to approximate the stochastic gradient.
  • Ten epochs were performed with a batch size of 512 frames.
  • In the fine-tuning stage, the standard back-propagation algorithm was performed using the labeled data. A dropout rate of 10% was applied.
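A minimal PyTorch sketch of the fine-tuning stage is shown below, using the batch size, epoch count, and 10% dropout rate given above; the random stand-in data, plain (synchronous) SGD, and the layer stack are assumptions of this illustration, and the RBM pre-training stage is omitted.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data: 4096 labeled frames with 350-dimensional inputs
# (acoustic + grapheme + canonical-phone features) and 90 phone-state targets.
features = torch.randn(4096, 350)
labels = torch.randint(0, 90, (4096,))
loader = DataLoader(TensorDataset(features, labels), batch_size=512, shuffle=True)

model = nn.Sequential(                       # AGPM-like stack with 10% dropout
    nn.Linear(350, 512), nn.Sigmoid(), nn.Dropout(0.1),
    nn.Linear(512, 512), nn.Sigmoid(), nn.Dropout(0.1),
    nn.Linear(512, 512), nn.Sigmoid(), nn.Dropout(0.1),
    nn.Linear(512, 512), nn.Sigmoid(), nn.Dropout(0.1),
    nn.Linear(512, 90),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):                      # ten epochs, as in the experiments
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)          # back-propagation fine-tuning
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```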
  • ASGD a technique of asynchronous stochastic gradient descent
  • Phone recognition performance can be measured in terms of correctness, (N - S - D) / N (Eq. (3)), and accuracy, (N - S - D - I) / N (Eq. (4)), where N is the total number of labeled phones, S is the number of substitution errors, D is the number of deletion errors, and I is the number of insertion errors.
  • FIG. 12 is a table 1200 showing correctness and accuracy (as defined in Eqs. (3) and (4)) for specific implementations of the AGM and AGPM approaches described above, as well as for an implementation of the APM approach previously developed by the inventors. These figures compare favorably with conventional approaches, where correctness has been observed to be about 79-87% and accuracy about 74-83%, depending on the particular approach.
  • MDD performance can be evaluated using hierarchical classifications as shown in FIG. 13.
  • the result for a phone may be: (1) true acceptance (node 1302) , where a correct pronunciation is recognized as such; (2) true rejection (node 1304) , where an incorrect pronunciation is recognized as such; (3) false acceptance (node 1306) , where an incorrect pronunciation is mistakenly identified as correct; and (4) false rejection (node 1308) , where a correct pronunciation is mistakenly identified as incorrect.
  • In the case of a true rejection, the result may be further classified as a correct diagnosis (node 1310) if the identified phone corresponds to the phone the speaker actually uttered, or as a diagnostic error (node 1312) if the identified phone corresponds to a different (but also incorrect) phone. From measurements of each category of result, figures of merit can be defined, including false rejection rate (FRR), false acceptance rate (FAR), and diagnostic error rate (DER).
  • FRR false rejection rate
  • FAR false acceptance rate
  • DER diagnostic error rate
  • These figures of merit can be computed as FRR = FR / (TA + FR) (Eq. (5)), FAR = FA / (FA + TR) (Eq. (6)), and DER = DE / (CD + DE) (Eq. (7)), where TA is the number of true acceptances, TR is the number of true rejections, FA is the number of false acceptances, FR is the number of false rejections, CD is the number of correct diagnoses, and DE is the number of diagnostic errors.
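A small sketch of these metric computations, with the standard definitions written out in code, is given below; the example counts are hypothetical and serve only to show the calculation.

```python
def mdd_metrics(ta, tr, fa, fr, cd, de):
    """False rejection rate, false acceptance rate, and diagnostic error rate
    computed from the counts in the hierarchy of FIG. 13."""
    frr = fr / (ta + fr)   # correct pronunciations wrongly rejected
    far = fa / (fa + tr)   # mispronunciations wrongly accepted
    der = de / (cd + de)   # true rejections diagnosed with the wrong phone
    return frr, far, der

# Hypothetical counts, for illustration only.
print(mdd_metrics(ta=9000, tr=1500, fa=400, fr=600, cd=1200, de=300))
```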
  • FIG. 14 is a table 1400 showing FRR, FAR, and DER for the specific implementations of the AGM and AGPM approaches described above, as well as for an implementation of the APM approach previously developed and reported by the inventors. These approaches compare favorably with conventional approaches, particularly in regard to FRR.
  • the FAR is comparable to some conventional approaches (though not as low as others) , and the DER is somewhat better.
  • The FAR is believed to be influenced by the fact that the APM, AGM, and AGPM are, by design, inclined to recognize a mispronunciation as correct when the acoustic features do not clearly indicate the mispronunciation.
  • In many applications, a low FRR is the more important consideration for MDD, since it is more important not to identify a correct pronunciation as wrong than to accept an incorrect pronunciation.
  • Additional figures of merit can also be defined from these counts; for example, accuracy can be defined as the fraction of phones whose acceptance or rejection is correct, (TA + TR) / (TA + TR + FA + FR).
  • FIG. 15 is a table 1500 showing the metrics of Eqs. (8)-(12) for the specific implementations of the AGM and AGPM approaches described above, as well as for an implementation of the APM approach previously developed and reported by the inventors. These metrics are comparable to some conventional techniques that focus on specific frequently-mispronounced phones or phones in isolated words; however, the AGM and AGPM (as well as the APM) are able to detect all kinds of mispronounced phones in continuous speech, a more difficult task.
  • AGM and AGPM approaches as described above can be effective tools for computer-assisted MDD in L2 learning.
  • AGM and AGPM configurations having different numbers of hidden layers and/or different numbers of nodes per hidden layer have been studied. It was found that increasing the number of hidden layers beyond 4 and increasing the number of nodes per hidden layer beyond 512 did not result in significant performance improvements but did result in significant increases in computing time.
  • FIG. 16 is a table 1600 indicating various pronunciations of three example English words. For each word, the canonical pronunciation is shown first, followed by some likely mispronunciations by English learners. For each of the APM, AGM, and AGPM approaches, it is indicated whether the approach would (check mark) or would not (X) be expected to implicitly model the particular pronunciation.
  • DNN size and structure may be modified, e.g., by varying the number of hidden layers and/or number of nodes per layer.
  • Training algorithms may also be modified, and any suitably annotated set of utterances may be used as a training data set.
  • training is accomplished using utterances from native speakers of a particular target language (L2) and from non-native speakers of L2.
  • the non-native speakers may or may not share the same first language (L1) .
  • examples herein use English as the target language, those skilled in the art will understand that other languages can be substituted.
  • neural networks, speech capture, and other data analysis and computational operations described herein can be implemented in computer systems that may be of generally conventional design.
  • Such systems may include microprocessors, input devices (e.g., microphones, keyboards) , output devices (e.g., display devices, speakers) , memory and other storage devices, signal input/output ports, network communication interfaces, and so on.
  • Computer programs incorporating various features of the present invention may be encoded and stored on various computer readable storage media; suitable media include magnetic disk or tape, optical storage media such as compact disk (CD) or DVD (digital versatile disk) , flash memory, and other non-transitory media. (It is understood that “storage” of data is distinct from propagation of data using transitory media such as carrier waves. )
  • Computer readable media encoded with the program code may be packaged with a compatible electronic device, or the program code may be provided separately from electronic devices (e.g., via Internet download or as a separately packaged computer-readable storage medium) .

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Machine Translation (AREA)
PCT/CN2017/108098 2016-10-27 2017-10-27 Acoustic-graphemic model and acoustic-graphemic-phonemic model for computer-aided pronunciation training and speech processing WO2018077244A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201780065301.4A CN109863554B (zh) 2016-10-27 2017-10-27 用于计算机辅助发音训练和语音处理的声学字形模型和声学字形音位模型

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201662413939P 2016-10-27 2016-10-27
US62/413,939 2016-10-27

Publications (1)

Publication Number Publication Date
WO2018077244A1 true WO2018077244A1 (en) 2018-05-03

Family

ID=62024302

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/108098 WO2018077244A1 (en) 2016-10-27 2017-10-27 Acoustic-graphemic model and acoustic-graphemic-phonemic model for computer-aided pronunciation training and speech processing

Country Status (2)

Country Link
CN (1) CN109863554B (zh)
WO (1) WO2018077244A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110851484A (zh) * 2019-11-13 2020-02-28 北京香侬慧语科技有限责任公司 一种获取多指标问题答案的方法及装置
CN112581963B (zh) * 2020-11-23 2024-02-20 厦门快商通科技股份有限公司 一种语音意图识别方法及系统

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998014934A1 (en) * 1996-10-02 1998-04-09 Sri International Method and system for automatic text-independent grading of pronunciation for language instruction
CN101739869B (zh) * 2008-11-19 2012-03-28 中国科学院自动化研究所 一种基于先验知识的发音评估与诊断系统
CN101887725A (zh) * 2010-04-30 2010-11-17 中国科学院声学研究所 一种基于音素混淆网络的音素后验概率计算方法
CN105976812B (zh) * 2016-04-28 2019-04-26 腾讯科技(深圳)有限公司 一种语音识别方法及其设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110131046A1 (en) * 2009-11-30 2011-06-02 Microsoft Corporation Features for utilization in speech recognition
US20150039301A1 (en) * 2013-07-31 2015-02-05 Google Inc. Speech recognition using neural networks
US20150127594A1 (en) * 2013-11-04 2015-05-07 Google Inc. Transfer learning for deep neural network based hotword detection
US20160248768A1 (en) * 2015-02-20 2016-08-25 Sri International Joint Speaker Authentication and Key Phrase Identification
US20160253989A1 (en) * 2015-02-27 2016-09-01 Microsoft Technology Licensing, Llc Speech recognition error diagnosis

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11694677B2 (en) 2019-07-31 2023-07-04 Samsung Electronics Co., Ltd. Decoding method and apparatus in artificial neural network for speech recognition
CN111862958A (zh) * 2020-08-07 2020-10-30 广州视琨电子科技有限公司 发音插入错误检测方法、装置、电子设备及存储介质
CN111862958B (zh) * 2020-08-07 2024-04-02 广州视琨电子科技有限公司 发音插入错误检测方法、装置、电子设备及存储介质
CN111933110A (zh) * 2020-08-12 2020-11-13 北京字节跳动网络技术有限公司 视频生成方法、生成模型训练方法、装置、介质及设备

Also Published As

Publication number Publication date
CN109863554A (zh) 2019-06-07
CN109863554B (zh) 2022-12-02

Similar Documents

Publication Publication Date Title
Li et al. Mispronunciation detection and diagnosis in l2 english speech using multidistribution deep neural networks
CN109863554B (zh) 用于计算机辅助发音训练和语音处理的声学字形模型和声学字形音位模型
Gruhn et al. Statistical pronunciation modeling for non-native speech processing
Cincarek et al. Automatic pronunciation scoring of words and sentences independent from the non-native’s first language
Dudy et al. Automatic analysis of pronunciations for children with speech sound disorders
Tu et al. Investigating the role of L1 in automatic pronunciation evaluation of L2 speech
US11282511B2 (en) System and method for automatic speech analysis
Harrison et al. Improving mispronunciation detection and diagnosis of learners' speech with context-sensitive phonological rules based on language transfer.
Lazaridis et al. Swiss French Regional Accent Identification.
Bartelds et al. A new acoustic-based pronunciation distance measure
Cucu et al. Statistical error correction methods for domain-specific ASR systems
Middag et al. Robust automatic intelligibility assessment techniques evaluated on speakers treated for head and neck cancer
Luo et al. Improvement of segmental mispronunciation detection with prior knowledge extracted from large L2 speech corpus
Mao et al. Applying multitask learning to acoustic-phonemic model for mispronunciation detection and diagnosis in l2 english speech
Mary et al. Searching speech databases: features, techniques and evaluation measures
Shen et al. Self-supervised pre-trained speech representation based end-to-end mispronunciation detection and diagnosis of Mandarin
Lo et al. Improving end-to-end modeling for mispronunciation detection with effective augmentation mechanisms
Alqadheeb et al. Correct pronunciation detection for classical Arabic phonemes using deep learning
Vythelingum et al. Acoustic-dependent Phonemic Transcription for Text-to-speech Synthesis.
Rao et al. Language identification using excitation source features
Pylkkönen Towards efficient and robust automatic speech recognition: decoding techniques and discriminative training
Baranwal et al. Improved Mispronunciation detection system using a hybrid CTC-ATT based approach for L2 English speakers
Abdou et al. Enhancing the confidence measure for an Arabic pronunciation verification system
Mabokela A multilingual ASR of Sepedi-English code-switched speech for automatic language identification
Jambi et al. Speak-Correct: A Computerized Interface for the Analysis of Mispronounced Errors.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17864740

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17864740

Country of ref document: EP

Kind code of ref document: A1