WO2020122974A1 - Transliteration for speech recognition training and scoring - Google Patents

Transliteration for speech recognition training and scoring

Info

Publication number
WO2020122974A1
WO2020122974A1 PCT/US2019/017258 US2019017258W WO2020122974A1 WO 2020122974 A1 WO2020122974 A1 WO 2020122974A1 US 2019017258 W US2019017258 W US 2019017258W WO 2020122974 A1 WO2020122974 A1 WO 2020122974A1
Authority
WO
WIPO (PCT)
Prior art keywords
script
speech recognition
words
language
recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2019/017258
Other languages
English (en)
French (fr)
Inventor
Bhuvana Ramabhadran
Min Ma
Pedro J. Moreno Mengibar
Jesse EMOND
Brian E. Roark
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC
Priority to EP19707226.7A (patent EP3877973B1)
Priority to CN201980082043.XA (patent CN113396455B)
Priority to KR1020217017741A (patent KR102731583B1)
Priority to JP2021533448A (patent JP7208399B2)
Priority to US16/712,492 (patent US11417322B2)
Publication of WO2020122974A1
Anticipated expiration
Current legal status: Ceased

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Definitions

  • language examples can be processed to
  • written language examples, such as those used to train or test a language model, may include words written in one script (e.g., a first script or primary script) as well as one or more words in a different script.
  • language samples can be normalized by transliterating out-of-script words into the primary script. This can be done for all out-of-script words or more selectively (e.g., by not transliterating proper names). The resulting model provides better accuracy than models that exclude examples with mixed scripts from training and models that use mixed scripts without transliteration.
  • Transliteration of the language model training data reduces inconsistency in transcription, provides better normalization and improves the overall performance of the automatic speech recognizer system.
  • This feature allows the language model training data to be augmented, and enforces the recognizer’s hypotheses to conform to one writing system.
  • if model training simply allowed all speech of a secondary writing system in training, the resulting model would tend to output more speech in the secondary writing system, further increasing the already present writing script mismatch between the model hypothesis and the transcript truth. It would also diffuse word counts across two or more representations, e.g., the word in the first script as well as the word in the second script, even though both refer to the same semantic meaning and pronunciation.
  • model output is maintained predominantly in the desired script, while increasing accuracy due to word counts for the same word being combined for instances in any script.
  • a method performed by one or more computers includes: accessing a set of data indicating language examples for a first script, where at least some of the language examples include words in the first script and words in one or more other scripts; transliterating at least portions of some of the language examples to the first script to generate a training data set having words transliterated into the first script; and generating a speech recognition model based on the
  • the method may optionally further include using the speech recognition model to perform speech recognition for an utterance.
  • the speech recognition model is a language model, an acoustic model, a sequence-to-sequence model, or an end-to-end model.
  • transliterating comprises mapping different tokens that represent text from different scripts to a single normalized transliterated
  • transliterating the language examples comprises transliterating words in the language examples that are not in the first script into the first script.
  • transliterating the language examples comprises: accessing a blacklist of terms in a script different from the first script; and bypassing transliteration of instances of terms from the blacklist that occur in the language examples.
  • transliterating the language examples comprises generating altered language examples in which words written in a second script different from the first script are replaced with one or more words in the first script that approximate acoustic properties of the word in the first script.
  • the words written in the second script are individually transliterated into the first script on a word-by-word basis.
  • the method includes: determining a test set of language examples with which to test the speech recognition model; generating a normalized test set by transliterating into the first script words of the language examples in the test set that are not written in the first script; obtaining output of the speech recognition model corresponding to the language examples in the test set; normalizing output of the speech recognition model by transliterating into the first script words of the speech recognition model output that are not written in the first script; and determining an error rate of the speech recognition model based on a comparison of the normalized test set with the normalized speech recognition model output.
  • the error rate is a word error rate
  • the method includes, based on the word error rate: determining whether to continue training or terminate training of the speech recognition model; altering a training data set used to train the speech recognition model; setting a size, structure, or other characteristic of the speech recognition model; or selecting one or more speech recognition models for a speech recognition task.
  • the method includes determining a modeling error rate for the speech recognition model in which acoustically similar words written in any of multiple scripts are accepted as correct transcriptions, without penalizing output of a word in a different script than a corresponding word in a reference transcription.
  • the method includes determining a rendering error rate for the speech recognition model that is a measure of differences between a script of words in the output of the speech recognition model relative to a script of
  • transliterating is performed using a finite state transducer network trained to perform transliteration into the first script.
  • transliterating comprises, for at least one language example, performing multiple rounds of transliteration between scripts to reach a transliterated representation in the first script that is included in the training data set in the first script.
  • the method includes determining a score indicating a level of mixing of scripts in the language examples; and based on the score: selecting a parameter for pruning a finite state transducer network for transliteration; selecting a parameter for pruning the speech recognition model; or selecting a size or structure for the speech recognition model.
  • generating the speech recognition model comprises: after transliterating at least portions of some of the language examples to the first script, determining, by the one or more computers, a count of occurrences of different sequences of words in the training data set in the first script; and generating, by the one or more computers, a speech recognition model based on the counts of occurrences of the different sequences of words in the training data set in the first script.
  • the speech recognition model comprises a recurrent neural network
  • generating the speech recognition model comprises training the recurrent neural network.
  • the present disclosure also provides a method of performing speech recognition, including: receiving, by one or more computers, audio data representing an utterance; and using, by the one or more computers, the speech recognition model to map the audio data to text (or some other symbolic representation) representing the utterance, where the speech recognition model has been previously generated in accordance with any of the techniques disclosed herein. It will be appreciated that the computers used to generate the speech recognition model may be different from those used to perform speech recognition. The method may further include outputting the text representing the utterance.
  • FIG. 1 is a block diagram that illustrates an example of a system for transliteration for speech recognition and evaluation.
  • FIG. 2 is a diagram that illustrates an example of a finite state transducer network for transliteration.
  • FIG. 3 is a chart illustrating error rates relative to amounts of code-switching in data sets.
  • FIG. 1 is a diagram that illustrates an example of a system 100 for
  • the system includes a computer system 110, which may include one or more computers located together or remotely from each other.
  • the computer system 110 has a transliteration module 120, a model training module 130, and a scoring module 140.
  • the system 100 is used to train and evaluate a speech recognition model, such as language model 150.
  • a set of language examples 112 are obtained from any of various sources, such as query logs, web pages, books, human or machine recognized speech transcriptions, and so on.
  • the language examples 112 are primarily in a first script.
  • the term “script” generally refers to a writing system.
  • a writing system is a system of symbols that are used to represent a natural language.
  • scripts with which the techniques disclosed herein can be used include Latin, Cyrillic, Greek, Arabic, Indic, or another writing system.
  • the language model 150 will be trained to provide output primarily representing text in a first script.
  • phrases or sentences primarily written in one script can include one or more words written using another script.
  • the language examples 112 typically include mostly examples written purely in the first script, but also include some language examples that combine words in another script with words in the first script.
  • Before training the language model 150, the transliteration module 120 processes the language examples 112 to generate normalized data sets for the first script.
  • transliteration module 120 can use finite state transducer (FST) networks to perform the transliteration.
  • the relationships between graphemes and words of different scripts can be learned through analysis of the language examples 112 or from other data.
  • transliteration is done separately for each individual word, e.g., on a word-by-word basis, to achieve a high quality correspondence in the transliteration.
  • the process may optionally take into consideration the context of surrounding words to provide transliterations with high accuracy.
  • the resulting word as transliterated into the first script can be one having a pronunciation that matches or closely approximates the pronunciation of the corresponding original word that occurs in the language examples 112.
  • the transliteration process can change the writing system for original words not originally written in the first script, with the replacement words in the first script representing the same or similar acoustic characteristics or sounds as the original words.
  • transliteration into the first script.
  • transliteration is done selectively.
  • a transliteration blacklist 126 can indicate words or phrases that should not be transliterated.
  • transliteration module 120 keeps these terms in their original script when generating training data and test data, even though the original script differs from the first script.
  • it may be particularly helpful for the blacklist to include proper names, e.g., for people, locations, companies, and other entities, which may be more common or more recognizable in their native writing system compared to transliterated versions. For example, it may be preferred for some names such as “George Washington,” “New York,” or “Google” to remain in Latin script even among text that is predominantly in another script such as Indic, Cyrillic, Hanzi, Kana, Kanji, etc. Including out-of-script words in the training data set can allow the language model 150 to learn to predict output of these words in their native scripts, even though the scripts are different from the dominant script.
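The selective, word-by-word transliteration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the lookup table `TOY_TABLE`, the contents of `BLACKLIST`, and the function names are hypothetical, and a plain dictionary lookup stands in for the trained FST-based transliteration.

```python
import unicodedata

# Hypothetical toy table (Latin -> Devanagari). A dictionary lookup stands in
# for the trained transliteration model here.
TOY_TABLE = {"school": "स्कूल", "time": "टाइम"}

# Hypothetical blacklist: terms that stay in their original script.
BLACKLIST = {"google", "washington"}


def is_latin(word):
    """Return True if the word contains letters and all of them are Latin."""
    letters = [ch for ch in word if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith("LATIN") for ch in letters
    )


def normalize(sentence, table=TOY_TABLE, blacklist=BLACKLIST):
    """Transliterate out-of-script words one at a time, skipping the blacklist."""
    out = []
    for word in sentence.split():
        key = word.lower()
        if is_latin(word) and key not in blacklist and key in table:
            out.append(table[key])
        else:
            # In-script words, blacklisted terms, and unknown words pass through.
            out.append(word)
    return " ".join(out)


print(normalize("मेरा school Google"))
```

In this sketch the blacklist is checked before the table, so a blacklisted term stays in Latin script even if a transliteration for it exists.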
  • From the language examples 112, the transliteration module 120 generates script-normalized training data 122 for training the language model 150.
  • the transliteration module 120 also generates script-normalized test data 124 to be used in testing the language model 150. Testing may occur at various stages, for example, after certain amounts of training have been completed. The language model 150 may be tested repeatedly, between training processes, until a desired level of performance is achieved.
  • the language model 150 is configured to receive, as input, data representing the acoustic or linguistic units representing a language sequence, e.g., data indicating a pronunciation of a language sequence.
  • the input may indicate a series of phones (which may be context-dependent or context-independent), or a distribution of scores for a set of phones.
  • These pronunciations can be determined for language examples 112 in any of multiple ways. For language examples for which audio data is available, the pronunciation may be output of an acoustic model for the audio data. For language examples where there is no
  • the system can use a pronunciation generator 126 to generate a pronunciation automatically from written text.
  • the pronunciation generator 126 can access a lexicon 128 that indicates pronunciations of words in the language(s) of the language examples 112, e.g., mappings of grapheme sequences to phoneme
  • words may have pronunciations provided by manual annotation from linguists.
  • Although FIG. 1 shows a pronunciation generator 126 and lexicon 128, these elements are optional in some implementations.
  • language examples 112 may be received that already have corresponding pronunciations associated with the text.
  • the language model 150 does not receive linguistic unit information as input, but instead simply receives data indicating a sequence of graphemes or words.
  • some language models may be used for second-pass re-scoring of candidate transcriptions, and so may receive data indicating a proposed sequence of words.
  • the language model in this scenario may receive data indicating the words themselves, and not linguistic units of a pronunciation, and may be configured to provide scores indicating how likely the overall sequence is given the training of the language model.
  • the model training module 130 performs training of the language model 150.
  • the language model 150 can be a statistical language model, such as an n-gram model, that is generated based on counts of occurrences of different words and phrases in the script-normalized training data 122.
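The effect of script normalization on n-gram counts can be illustrated with a small sketch. The mapping `TO_FIRST_SCRIPT` and the two example sentences are hypothetical, and the dictionary lookup is only a stand-in for a real transliteration step. The point is that a word written in two scripts splits its count across two entries until the data is normalized.

```python
from collections import Counter

# Hypothetical mapping from second-script (Latin) words to the first script.
TO_FIRST_SCRIPT = {"school": "स्कूल"}


def transliterate(sentence):
    """Stand-in transliteration: replace known Latin words, keep the rest."""
    return " ".join(TO_FIRST_SCRIPT.get(w.lower(), w) for w in sentence.split())


def bigram_counts(sentences):
    """Count adjacent word pairs, including sentence boundary markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


# The same phrase transcribed once with a Latin word and once in Devanagari.
EXAMPLES = ["मेरा school", "मेरा " + TO_FIRST_SCRIPT["school"]]

raw = bigram_counts(EXAMPLES)
normalized = bigram_counts(transliterate(s) for s in EXAMPLES)

pair = ("मेरा", TO_FIRST_SCRIPT["school"])
print(raw[pair], normalized[pair])
```

Before normalization the bigram is counted once in each script; afterwards both occurrences are combined under the first-script form.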
  • Other types of models, such as neural network models, may additionally or alternatively be used.
  • the language model 150 can include a neural network which the model training module 130 trains using backpropagation of errors, stochastic gradient descent, or other techniques.
  • the neural network may be a recurrent neural network such as one including one or more layers having long short-term memory (LSTM) cells.
  • the model may be trained by minimizing an objective function such as connectionist temporal classification (CTC) objective function, a state-level minimum Bayesian risk (sMBR) objective function, or another objective function.
  • the language model 150 is trained to predict words, e.g., to indicate their relative likelihoods of occurrence, based at least in part on a linguistic context, e.g., one or more surrounding words or phrases.
  • the language model 150 may be configured to provide outputs that indicate probabilities that different words in a vocabulary will occur given the occurrence of one or more immediately preceding words.
  • the prediction can also be based on acoustic or linguistic units, such as a pronunciation indicating a sequence of phones or output of a language model.
  • the language model 150 can indicate which words and word sequences best represent those sounds, according to the patterns of actual language usage observed from the script-normalized training data 122.
  • the language model 150 can be used to generate scores, e.g., probability scores or confidence scores, indicating the relative likelihood that different words would follow each other, which can be used to generate a speech lattice and then a beam search process can be used to determine the best path, e.g., a highest-scoring or lowest-cost path, through the lattice that represents the transcription considered most likely.
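A rough sketch of how language model scores could drive a beam search follows. It makes several simplifying assumptions that are not from the source: the lattice is reduced to a list of positions, each with alternative words; the bigram probabilities in `BIGRAM_PROB` are invented for illustration; and unseen bigrams receive a fixed floor probability.

```python
import math

# Simplified lattice (assumption): one list of alternative words per position.
SLOTS = [["मेरा", "मेरी"], ["स्कूल", "शूल"]]
(A1, A2), (B1, B2) = SLOTS

# Hypothetical bigram probabilities; a trained language model would supply these.
BIGRAM_PROB = {
    ("<s>", A1): 0.6,
    ("<s>", A2): 0.4,
    (A1, B1): 0.7,
    (A1, B2): 0.1,
    (A2, B1): 0.2,
    (A2, B2): 0.1,
}
FLOOR = 1e-6  # assumed probability for unseen bigrams


def bigram_logprob(prev, word):
    return math.log(BIGRAM_PROB.get((prev, word), FLOOR))


def beam_search(slots, beam=2):
    """Extend hypotheses position by position, keeping the `beam` best."""
    hyps = [(0.0, ["<s>"])]
    for candidates in slots:
        extended = [
            (score + bigram_logprob(path[-1], word), path + [word])
            for score, path in hyps
            for word in candidates
        ]
        extended.sort(key=lambda h: h[0], reverse=True)
        hyps = extended[:beam]
    score, path = hyps[0]
    return path[1:], score


words, score = beam_search(SLOTS)
print(" ".join(words), score)
```

A narrower beam trades accuracy for speed: with `beam=1` the search becomes greedy and can miss a path whose first word scores lower but whose overall sequence scores higher.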
  • the scoring module 140 is used to evaluate the accuracy of the language model 150.
  • the result of the evaluation includes the generation of one or more scores 142 indicative of the performance of the language model 150.
  • the scoring module 140 can provide examples from the script-normalized test data 124 as input to the language model 150, causing the language model 150 to generate outputs, e.g., predictions or probability scores, for each input example.
  • the outputs of the language model may be further processed, e.g., using a lattice and beam search or through other techniques, to generate a language sequence output for each input example.
  • words in the output sequences that are not in the first writing system are transliterated using the transliteration module 120 (which is duplicated in the figure for clarity in illustration).
  • the model 150 learns to indicate only words in the dominant first script, except for words in the blacklist 126.
  • the model so trained will indicate output sequences that are in the first script.
  • transliteration can be used to normalize output sequences for more accurate comparison.
  • the scoring module 140 can generate a word error rate that indicates the rate that the output sequences include different words than the original language examples 112.
  • a conventional word error rate calculation would consider a word to be incorrect if it is in a different script than the original, even if the two words are equivalent as transliterations of each other (e.g., representing the same sounds and semantic meaning).
  • this can artificially inflate the apparent error rates when the model has, in effect, predicted the correct word.
  • the scoring module 140 can generate a revised word error rate, referred to as a modeling error rate, that compares output sequences normalized into the first script with test data 124 normalized into the first script.
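The difference between a conventional word error rate and the modeling error rate can be sketched as follows. The transliteration here is a hypothetical dictionary lookup (`TO_FIRST_SCRIPT`), and the function names are chosen for the example; only the idea of normalizing both reference and hypothesis into the first script before comparison comes from the description above.

```python
# Hypothetical mapping from second-script (Latin) words to the first script.
TO_FIRST_SCRIPT = {"school": "स्कूल"}


def transliterate(text):
    """Stand-in transliteration: replace known Latin words, keep the rest."""
    return " ".join(TO_FIRST_SCRIPT.get(w.lower(), w) for w in text.split())


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Conventional WER; assumes a non-empty reference."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)


def modeling_error_rate(reference, hypothesis):
    """WER computed after normalizing both sides into the first script."""
    return word_error_rate(transliterate(reference), transliterate(hypothesis))


REFERENCE = "मेरा " + TO_FIRST_SCRIPT["school"]
HYPOTHESIS = "मेरा school"

print(word_error_rate(REFERENCE, HYPOTHESIS))
print(modeling_error_rate(REFERENCE, HYPOTHESIS))
```

In this example the hypothesis differs from the reference only in the script of one word, so the conventional rate counts an error while the normalized rate does not.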
  • the evaluation scores 142 from the scoring module 140 can be provided for output to a user and can also be used to manage the training of the language model 150.
  • the modeling error rate can be used to determine whether to continue training or terminate training of the language model 150. Training may be set to continue until the error rate is below a threshold level.
  • the computer system 110 may alter a training data set used to train the language model 150, for example, to bring in a different or expanded data set to achieve better accuracy.
  • the computer system 110 may set a size, structure, or other characteristic of the language model 150.
  • the computer system 110 may select one or more language models to be used to perform a speech recognition task.
  • other scores are determined and used to adjust training of the language model 150 and/or the transliteration module 120.
  • the computer system 110 can obtain data indicating a rate at which mixed use of script, e.g., “code-switching,” occurs for a language generally or in a specific data set (such as the language examples 112). With this score, the system may select a parameter for pruning a finite state transducer network for transliteration, may select a parameter for pruning the language model, and/or may select a size or structure for the language model.
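One simple way to estimate such a score is the fraction of words written outside the primary script. The sketch below is an assumption about how the score might be computed, not the disclosed method: it identifies a word's script from the first word of the Unicode name of its first letter, and the function names are hypothetical. How the score is then mapped to pruning parameters or model size is left open here.

```python
import unicodedata


def script_of(word):
    """Return a coarse script label from the Unicode name of the first letter."""
    for ch in word:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return None  # no letters, e.g. digits or punctuation


def mixing_rate(examples, primary="DEVANAGARI"):
    """Fraction of words whose script differs from the primary script."""
    total = 0
    mixed = 0
    for example in examples:
        for word in example.split():
            script = script_of(word)
            if script is None:
                continue
            total += 1
            if script != primary:
                mixed += 1
    return mixed / total if total else 0.0


print(mixing_rate(["मेरा school", "मेरा घर"]))
```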
  • many different parameters for the structure, training, and operation of the language model 150 can be set using the scores including a choice of a development set of data (e.g., a validation set used to tune the training algorithm and prevent overfitting), a training data set, a model size, a learning rate during training, a type of model used (e.g., n-gram, neural network, maximum entropy model, etc.), or a set of output targets for the model (e.g., to predict words, word pieces, or graphemes).
  • transliterated data can be used to train and score a language model, an acoustic model, a sequence-to-sequence model, and/or an end-to-end model (e.g., one that receives acoustic information or features and provides output indicating likelihoods of words, word pieces, or graphemes).
  • the sequence-to-sequence model can map an input sequence to an output sequence.
  • a sequence-to-sequence model can receive acoustic information or features representing one or more spoken words, and produce a symbolic output (e.g., text) that represents those words.
  • Code-switching is a commonly occurring phenomenon in many multilingual communities. Code-switching generally refers to a speaker switching between languages within a single utterance.
  • Conventional Word Error Rate (WER) measures are not sufficient for measuring the performance of code-mixed languages due to ambiguities in transcription, misspellings and borrowing of words from two different writing systems. These rendering errors artificially inflate the WER of an Automated Speech Recognition (ASR) system and complicate its evaluation. Furthermore, these errors make it harder to accurately evaluate modeling errors originating from code-switched language and acoustic models.
  • transliteration-optimized Word Error Rate (toWER) can smooth out many of these irregularities by mapping all text to one writing system. The metric also shows a correlation with the amount of code-switching present in a language.
  • These techniques can also be used to improve acoustic and language modeling for bilingual code-switched utterances. Examples involving Indic languages are discussed in detail, although the techniques may be used for any combination of languages and writing systems.
  • the transliteration approach can be used to normalize data for three types of language models, namely, a conventional n-gram language model, a maximum entropy based language model, and a Long Short Term Memory (LSTM) language model, as well as for a state-of-the-art Connectionist Temporal Classification (CTC) acoustic model.
  • Code-switching is common among bilingual speakers of Hindi-English, Bengali-English, Arabic-English and Chinese-English, among many others.
  • a word from a foreign language e.g., English
  • a native language e.g., Hindi
  • the distinction between code-switching, loan words, and creation of new words in the lexicon of the native language is often not very clear, and falls on a continuum. This phenomenon renders the transcription of code-switched speech difficult and inconsistent, resulting in the same word being transcribed using different writing systems.
  • These inconsistencies can lead to incorrect count distributions among words with similar acoustic and lexical context in both acoustic and language models.
  • WFSTs weighted finite state transducers
  • C context-dependent phone sequence model
  • L pronunciation lexicon
  • G language model
  • CLG language model
  • Table 1 illustrates the widespread distribution of code-switching.
  • Table 2: Examples containing Devanagari and Latin writing systems.
  • Code-switching is present in multiple writing systems.
  • Hindi uses the Devanagari script
  • Urdu uses an Arabic writing system.
  • code-switching is a part of daily life; the phenomenon routinely occurs in casual conversations, voice search queries, and presentations, leading to what is commonly referred to as Hinglish.
  • This type of code-switching can occur within a sentence at a phrase level.
  • A few examples of commonly transcribed spoken utterances are presented in Table 2. The first column illustrates the mixed writing systems used commonly. The second column shows the equivalent text in Latin script for ease of readability and to illustrate the mix of Hindi and English seen in the data.
  • Transliteration is the process of converting sequences from one writing system to another. Transliteration of Indic languages to Latin script is particularly challenging due to the large combination of consonants, vowels and diacritics that result in a non-unique mapping. It is worth noting that non-standard spellings exist in both scripts, for example, loan words that have variable possible spellings in Devanagari and Hindi words with variable Romanizations.
  • a general transliteration approach is applicable to address code switching for any two languages or writing systems. Transliteration can be done effectively via a weighted finite state transducer.
  • human transcribers were asked to transcribe spoken utterances in the native writing script (Devanagari, in this case) with exceptions for certain commonly used English words to be written in Latin script.
  • the context and range of input from the two writing systems was restricted to what was said in the utterance, unlike unrestricted text entry via the keyboard.
  • T is a composition of three transducers: I ∘ P ∘ O, where I maps input Unicode symbols to symbols in a pair language model, P is a bigram pair language model that maps between symbols in the two writing scripts, English and Devanagari, and O maps the pair language model symbols to the target output Devanagari symbols (illustrated in Figure 2).
  • conditional probability of the transliterated word is obtained by dividing the joint probability from T by the marginalization sum over all input and output sequences. This computation is efficiently implemented by computing the shortest path in T.
  • the transliteration transducer computes the shortest path, and significant speed improvements were obtained by the efficient pruning of the search space. All paths that score below the pruning threshold were discarded. This threshold was determined empirically so as to not affect ASR performance. A prune weight threshold of 5 was determined as a good operating point, particularly as the best path is the path of greatest interest.
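The interplay of shortest-path search and weight-based pruning can be sketched on a toy graph. Everything below is an assumption made for illustration: arc weights are treated as costs that add along a path (lower is better), a path is pruned when its cost exceeds the best cost by more than the threshold, and the arcs and output symbols in `ARCS` are invented. A production system would prune during the search instead of enumerating every path as this sketch does.

```python
def complete_paths(arcs, state, final):
    """Yield (labels, cost) for every path from `state` to `final` (acyclic graph)."""
    if state == final:
        yield [], 0.0
        return
    for next_state, label, cost in arcs.get(state, []):
        for labels, rest in complete_paths(arcs, next_state, final):
            yield [label] + labels, cost + rest


def prune(paths, threshold):
    """Keep paths whose cost is within `threshold` of the best (lowest) cost."""
    paths = list(paths)
    best = min(cost for _, cost in paths)
    return [(labels, cost) for labels, cost in paths if cost <= best + threshold]


# Hypothetical transliteration graph: state -> [(next_state, output, cost)].
ARCS = {
    0: [(1, "स्कू", 0.5), (1, "सकू", 2.0)],
    1: [(2, "ल", 0.5), (2, "ळ", 7.0)],
}

kept = prune(complete_paths(ARCS, 0, 2), threshold=5.0)
best_labels, best_cost = min(kept, key=lambda p: p[1])
print("".join(best_labels), best_cost, len(kept))
```

Because the best path always survives pruning under this rule, tightening the threshold shrinks the search space without changing the top result, which is consistent with the observation that the best path is the one of greatest interest.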
  • the use of ε-transitions to reduce the number of deletions and insertions is important when reducing epsilon cycles in the WFST.
  • a parallel implementation of epsilon removal was used, utilizing eight threads in parallel.
  • the operations for epsilon removal caused dramatic increases in memory use, rendering the transliteration process unusable for large-scale language models. This issue was addressed via weight-based pruning prior to epsilon removal with no impact on the transliteration performance.
  • the speed-up/memory reduction contributions from the above optimization steps are presented in Table 3.
  • the above optimizations may reduce the overall training time of a language model trained on 280 billion words from 165 hours to 12 hours.
  • the experiments discussed below were conducted on training and test sets that were anonymized and hand-transcribed utterances representative of voice search traffic in Indic languages.
  • the training set is augmented with several copies of the original, artificially corrupted by adding varying degrees of noise and reverberation using a room simulator such that the overall SNR varies between 0 and 20 dB.
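The noise part of this augmentation can be illustrated by scaling a noise signal so that the mixture reaches a chosen signal-to-noise ratio. This is a simplified sketch under stated assumptions: it omits reverberation and the room simulator entirely, uses a synthetic sine wave and Gaussian noise in place of real recordings, and assumes a 16 kHz sample rate. The function names are hypothetical.

```python
import math
import random


def mean_power(samples):
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the clean-to-noise power ratio equals snr_db, then add."""
    target_noise_power = mean_power(clean) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [c + scale * n for c, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """Recover the SNR of a mixture given the clean signal."""
    residual = [y - c for c, y in zip(clean, noisy)]
    return 10 * math.log10(mean_power(clean) / mean_power(residual))


rng = random.Random(0)
# One second of a 440 Hz tone at an assumed 16 kHz sample rate.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# Several corrupted copies, each at an SNR drawn between 0 and 20 dB.
copies = [mix_at_snr(clean, noise, rng.uniform(0, 20)) for _ in range(3)]
print([round(measured_snr_db(clean, c), 2) for c in copies])
```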
  • the signal processing pipeline for all languages extracted 80-dimensional log mel-filter bank output features with a standard frame rate of 10 ms.
  • the acoustic models for all languages are LSTMs with 5 layers, with each layer consisting of 768 LSTM cells.
  • the acoustic models were trained in TensorFlow using asynchronous stochastic gradient descent minimizing Connectionist Temporal Classification (CTC) and state-level Minimum Bayesian Risk (sMBR) objective functions.
  • The amount of training data used in the experiments for each of the Indic languages is presented in Table 4.
  • the test data varied between 6,000 and 10,000 words. It can be seen that there is a huge variance in available data across these languages.
  • a detailed analysis on Hindi is presented, as it is one of the languages that has the most code-switching with English and the largest number of training tokens. The Hindi training data set comprises approximately 10,000 hours of training data from 10 million utterances.
  • the proposed approach was also validated on the other Indic languages, which typically have 10-20% of the data Hindi does.
  • Table 5 illustrates the significant differences in measured WER after correcting for errors related to the writing systems.
  • the proposed toWER metric is computed after transliterating both the reference and hypothesis to one writing system corresponding to the native locale. It can be seen that there is a correlation between the percentage of Latin script and the proposed metric, which serves as a good indication of the extent of code-switching in these languages. In languages such as Malayalam, Telugu, Marathi and Urdu, there is less code-switching than in languages such as Hindi and Bengali, and toWER reflects that. Thus, transliteration can be a means to correct for errors in the writing system arising from inconsistencies and a means for separating modeling errors from rendering errors. Transliterated scoring may reduce ambiguity introduced by code-switching and may smooth out ambiguities and transcription errors. Therefore, the proposed toWER is a better metric to evaluate any algorithmic
  • The Devanagari-only data based LM is an LM that was built with all utterances containing Devanagari script only. Any utterance containing bilingual text in Devanagari and Latin scripts was not used in the language model builds. As expected, this resulted in a loss of contextual modeling and less data, and introduced mismatches between training and test set distributions. Transliterated scoring of the hypotheses produced by this LM fixes mismatches with reference transcriptions (row 2). Retaining data from both writing systems ensures that the contexts from code-switches are preserved but introduces all the challenges discussed in Section 2, including the same word appearing in both Devanagari and Latin.
  • transliterated text provides gains in performance similar to those seen with maximum-entropy based LM for the voice search task and less so for the dictation task. While not surprising, it validates the hypothesis that transliteration based normalization for training as well as scoring helps separate modeling errors from rendering errors and helps with accurate evaluation of the performance of models. For the voice search task shown in Tables 6 and 7, one would conclude that the performance of an LSTM LM and a maximum entropy based LM are very similar (32.0 vs 31.5) when using conventional WER, while toWER would suggest that the maximum entropy based LM is much better than the LSTM (20.6 vs 22.3). The significance of such gains can in fact be measured by human raters in a side-by-side comparison study explained below.
  • Transliteration can also improve the accuracy of acoustic models (AMs).
  • All words in the AM training data written in Latin were first transliterated to Devanagari script and pronunciations were derived in the Hindi phonetic alphabet.
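The normalization step can be sketched as follows. This is a simplified stand-in: the lookup table is a hypothetical toy lexicon rather than a trained transliteration model, Latin-script words are detected crudely as ASCII alphabetic tokens, and pronunciation derivation is not shown.

```python
# Hypothetical toy lexicon; a real system would use a transliteration model.
TOY_LATIN_TO_DEVANAGARI = {
    "time": "टाइम",
    "table": "टेबल",
}


def is_latin_word(word):
    """Crude check: an ASCII token that contains at least one letter."""
    return word.isascii() and any(ch.isalpha() for ch in word)


def normalize_transcript(text, lexicon=TOY_LATIN_TO_DEVANAGARI):
    """Rewrite Latin-script words in Devanagari; other words are left untouched."""
    out = []
    for word in text.split():
        if is_latin_word(word):
            # Words missing from the toy lexicon pass through unchanged.
            out.append(lexicon.get(word.lower(), word))
        else:
            out.append(word)
    return " ".join(out)


NORMALIZED = normalize_transcript("मेरा time table")
```

After this step every training transcript is rendered in one writing system, so a single phonetic alphabet can be used to derive pronunciations.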
  • the transliterated AM showed small improvements in performance over the model trained with both writing systems (see Table 8).
  • the improvements from sMBR training are expected to be even more significant as the numerator and denominator lattices needed for sMBR training will be consistently rendered in Devanagari script.
  • Table 9 presents the impact of the proposed approach on several other Indic languages. There is a significant, consistent gain on all languages except for Malayalam and Tamil. This can be attributed to the amount of Latin present in the training corpora. For these two languages, it can be seen from Figure 2 that there is very little Latin present in the voice search corpus containing spoken queries, while the corpus containing web-based queries contains a lot more Latin. However, the web-based corpus received a very low interpolation weight for this task and therefore had very little impact on the WER. A similar trend is observed with transliterated LMs on the dictation task (see Table 10, with relative reductions in toWER of up to 10%).
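To see why a low interpolation weight limits a corpus's influence, consider a minimal sketch that assumes linear interpolation of per-corpus language models (the source does not state the interpolation scheme). The probabilities and weights below are made-up illustrative values.

```python
def interpolate(prob_by_lm, weights):
    """Linear interpolation: P(w | h) = sum_i lambda_i * P_i(w | h)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return sum(weights[name] * p for name, p in prob_by_lm.items())


# Hypothetical probabilities of one word under two component LMs.
probs = {"voice_search": 0.010, "web": 0.050}

# With a small weight on the web LM, the mixture stays close to the
# voice search LM even though the web LM assigns a much higher probability.
mixed = interpolate(probs, {"voice_search": 0.95, "web": 0.05})
```

Here the mixture probability is about 0.012, close to the voice search LM's 0.010, so changes to the web corpus move the final model only slightly.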
  • FIG. 3 is a chart that illustrates WER values, toWER values, and correlation with the percentage of code-switching measured as the percentage of Latin in the data.
  • the reference was transliterated to Jamuna while the hypothesis produced Jumna, which is a result of the ambiguity in the transliteration process wherein either form is acceptable.
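One simple way to estimate the percentage of Latin in a corpus is to classify each word by the Unicode name of its first letter. The sketch below assumes word-level counting and whitespace tokenization; the source does not specify how the percentage was computed.

```python
import unicodedata


def script_of(word):
    """Classify a word by the Unicode name of its first alphabetic character."""
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("LATIN"):
                return "latin"
            if name.startswith("DEVANAGARI"):
                return "devanagari"
            return "other"
    return "other"


def latin_percentage(utterances):
    """Percentage of whitespace-separated words written in Latin script."""
    words = [w for u in utterances for w in u.split()]
    if not words:
        return 0.0
    latin = sum(1 for w in words if script_of(w) == "latin")
    return 100.0 * latin / len(words)


PERCENT = latin_percentage(["मेरा time table", "नमस्ते"])
```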
  • the third example produces a more classic error.
  • the utterance reads in Latin as BA first year time table. Note that in this example, the transcriber was consistent in producing text in Devanagari only. The ASR system hypothesized the utterance correctly but in a combination of writing systems and counted three substitution errors per the WER metric. In the process of transliterating the hypothesis, BA got mapped to Ba (pronounced as bah in the word 'bar') in
  • Wins/Losses: the ratio of wins to losses in the experimental system vs. the baseline. A p-value less than 5% is considered to be statistically significant.
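The source does not say which statistical test produced the p-values. One common choice for win/loss counts from a side-by-side study is an exact two-sided sign test, sketched here as an assumption; the counts in the example are hypothetical.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Two-sided exact binomial test of wins vs. losses under p = 0.5.

    Neutral ratings (ties) are assumed to have been dropped beforehand.
    """
    n = wins + losses
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical counts: 30 wins and 10 losses for the experimental system.
P_VALUE = sign_test_p_value(30, 10)
IS_SIGNIFICANT = P_VALUE < 0.05
```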
  • the human raters give a neutral rating to the transliterated hypothesis when compared to the mixed writing systems based hypothesis. This is not unexpected, as the semantic content of the two systems being compared has not changed. However, toWER smooths out the rendering errors and offers a better perspective.
  • the second row compares two LMs, baseline system (row 2 in Table 6 with a toWER of 20.6%) and the system with a transliterated LM (row 4 in Table 6 with a toWER of 17.2%). There are far more wins than losses with the transliterated LM (the experimental system).
  • WER Word Error Rate
  • Modeling errors can be accurately measured using the proposed transliteration-based “toWER” metric that smooths out the rendering errors. Consistent normalization of training transcripts for both language and acoustic modeling can provide significant gains of up to 10% relative across several code-switched Indic languages using voice search and dictation traffic. With a simple approach based on transliteration to consistently normalize training data and accurately measure the robustness and accuracy of the model, significant gains can be obtained.
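The difference between conventional WER and toWER can be illustrated with a short sketch. It assumes word-level Levenshtein distance for WER and uses a hypothetical four-word lookup table in place of a real transliteration model; only the idea of transliterating both reference and hypothesis before scoring comes from the text.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)


def to_wer(reference, hypothesis, transliterate):
    """WER computed after mapping both strings to one writing system."""
    return wer(transliterate(reference), transliterate(hypothesis))


# Hypothetical toy mapping standing in for a real transliteration model.
TOY = {"first": "फर्स्ट", "year": "ईयर", "time": "टाइम", "table": "टेबल"}


def toy_transliterate(text):
    return " ".join(TOY.get(w.lower(), w) for w in text.split())


# Reference entirely in Devanagari; hypothesis correct but in mixed scripts.
REFERENCE = " ".join(TOY[w] for w in ("first", "year", "time", "table"))
HYPOTHESIS = "first year " + TOY["time"] + " " + TOY["table"]

PLAIN_WER = wer(REFERENCE, HYPOTHESIS)
TO_WER = to_wer(REFERENCE, HYPOTHESIS, toy_transliterate)
```

In this toy case the conventional WER is 50% because two correctly recognized words are rendered in Latin script, while toWER is 0%, which separates the rendering mismatch from genuine modeling errors.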
  • Embodiments of the invention and all of the functional operations described in this specification may be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
  • Embodiments of the invention may be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, data processing apparatus.
  • the computer readable medium may be a non-transitory computer readable storage medium, a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them.
  • the computer readable medium may be a transitory medium, such as an electrical, optical or electromagnetic signal.
  • data processing apparatus encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
  • the apparatus may include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
  • a propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus.
  • a computer program (also known as a program, software, software application, script, or code) may be written in any form of programming language, including compiled or interpreted languages, and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • a computer program does not necessarily correspond to a file in a file system.
  • a program may be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code).
  • a computer program may be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • the processes and logic flows described in this specification may be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output.
  • the processes and logic flows may also be performed by, and apparatus may also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
  • processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
  • a processor will receive instructions and data from a read only memory or a random access memory or both.
  • the essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks.
  • a computer need not have such devices.
  • a computer may be embedded in another device, e.g., a tablet computer, a mobile telephone, a personal digital assistant (PDA), a mobile audio player, a Global Positioning System (GPS) receiver, to name just a few.
  • Computer readable media suitable for storing computer program instructions and data include all forms of non volatile memory, media, and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.
  • the processor and the memory may be supplemented by, or incorporated in, special purpose logic circuitry.
  • embodiments of the invention may be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user may provide input to the computer.
  • Other kinds of devices may be used to provide for interaction with a user as well; for example, feedback provided to the user may be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input.
  • Embodiments of the invention may be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user may interact with an implementation of the invention, or any combination of one or more such back end, middleware, or front end components.
  • the components of the system may be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
  • the computing system may include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • In each instance where an HTML file is mentioned, other file types or formats may be substituted. For instance, an HTML file may be replaced by an XML, JSON, plain text, or other types of files. Moreover, where a table or hash table is mentioned, other data structures (such as spreadsheets, relational databases, or structured files) may be used.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
PCT/US2019/017258 2018-12-12 2019-02-08 Transliteration for speech recognition training and scoring Ceased WO2020122974A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
EP19707226.7A EP3877973B1 (en) 2018-12-12 2019-02-08 Transliteration for speech recognition training and scoring
CN201980082043.XA CN113396455B (zh) 2018-12-12 2019-02-08 用于语音识别训练和评分的音译
KR1020217017741A KR102731583B1 (ko) 2018-12-12 2019-02-08 음성 인식 트레이닝 및 스코어링을 위한 음역
JP2021533448A JP7208399B2 (ja) 2018-12-12 2019-02-08 音声認識の訓練および採点のための音訳
US16/712,492 US11417322B2 (en) 2018-12-12 2019-12-12 Transliteration for speech recognition training and scoring

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862778431P 2018-12-12 2018-12-12
US62/778,431 2018-12-12

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/712,492 Continuation US11417322B2 (en) 2018-12-12 2019-12-12 Transliteration for speech recognition training and scoring

Publications (1)

Publication Number Publication Date
WO2020122974A1 (en) 2020-06-18

Family

ID=65520451

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/017258 Ceased WO2020122974A1 (en) 2018-12-12 2019-02-08 Transliteration for speech recognition training and scoring

Country Status (5)

Country Link
EP (1) EP3877973B1
JP (1) JP7208399B2
KR (1) KR102731583B1
CN (1) CN113396455B
WO (1) WO2020122974A1

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113626563A (zh) * 2021-08-30 2021-11-09 京东方科技集团股份有限公司 训练自然语言处理模型和自然语言处理的方法、电子设备
CN114299930B (zh) * 2021-12-21 2025-03-14 广州虎牙科技有限公司 端到端语音识别模型处理方法、语音识别方法及相关装置
KR102616598B1 (ko) * 2023-05-30 2023-12-22 주식회사 엘솔루 번역 자막을 이용한 원문 자막 병렬 데이터 생성 방법

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060041427A1 (en) * 2004-08-20 2006-02-23 Girija Yegnanarayanan Document transcription system training
US20090248395A1 (en) * 2008-03-31 2009-10-01 Neal Alewine Systems and methods for building a native language phoneme lexicon having native pronunciations of non-natie words derived from non-native pronunciatons
WO2009129315A1 (en) * 2008-04-15 2009-10-22 Mobile Technologies, Llc System and methods for maintaining speech-to-speech translation in the field
US20160171974A1 (en) * 2014-12-15 2016-06-16 Baidu Usa Llc Systems and methods for speech transcription

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080221866A1 (en) * 2007-03-06 2008-09-11 Lalitesh Katragadda Machine Learning For Transliteration
JP2009157888A (ja) 2007-12-28 2009-07-16 National Institute Of Information & Communication Technology 音訳モデル作成装置、音訳装置、及びそれらのためのコンピュータプログラム
US9176936B2 (en) * 2012-09-28 2015-11-03 International Business Machines Corporation Transliteration pair matching
JP2018028848A (ja) 2016-08-19 2018-02-22 日本放送協会 変換処理装置、音訳処理装置、およびプログラム
US10255909B2 (en) * 2017-06-29 2019-04-09 Intel IP Corporation Statistical-analysis-based reset of recurrent neural networks for automatic speech recognition

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114420159A (zh) * 2020-10-12 2022-04-29 苏州声通信息科技有限公司 音频评测方法及装置、非瞬时性存储介质
JP2023545103A (ja) * 2020-10-17 2023-10-26 インターナショナル・ビジネス・マシーンズ・コーポレーション 低リソース・セッティングにおいて多言語asr音響モデルを訓練するための翻字ベースのデータ増強
JP7706858B2 (ja) 2020-10-17 2025-07-14 インターナショナル・ビジネス・マシーンズ・コーポレーション 低リソース・セッティングにおいて多言語asr音響モデルを訓練するための翻字ベースのデータ増強
CN113889105A (zh) * 2021-09-29 2022-01-04 北京搜狗科技发展有限公司 一种语音翻译方法、装置和用于语音翻译的装置
CN114118108A (zh) * 2021-11-11 2022-03-01 支付宝(杭州)信息技术有限公司 建立转译模型的方法、转译方法和对应装置
CN114520001A (zh) * 2022-03-22 2022-05-20 科大讯飞股份有限公司 一种语音识别方法、装置、设备及存储介质

Also Published As

Publication number Publication date
JP2022515048A (ja) 2022-02-17
JP7208399B2 (ja) 2023-01-18
CN113396455B (zh) 2025-04-15
KR102731583B1 (ko) 2024-11-15
CN113396455A (zh) 2021-09-14
KR20210076163A (ko) 2021-06-23
EP3877973A1 (en) 2021-09-15
EP3877973B1 (en) 2025-09-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19707226

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20217017741

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2021533448

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019707226

Country of ref document: EP

Effective date: 20210610

WWG Wipo information: grant in national office

Ref document number: 201980082043.X

Country of ref document: CN

WWG Wipo information: grant in national office

Ref document number: 2019707226

Country of ref document: EP