GB2480649A - Non-native language spelling correction - Google Patents

Non-native language spelling correction

Info

Publication number
GB2480649A
GB2480649A GB1008799A GB201008799A
Authority
GB
United Kingdom
Prior art keywords
word
database
audio
language
audio signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1008799A
Other versions
GB201008799D0 (en)
GB2480649B (en)
Inventor
Lin Sun
Yichi Zhang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to GB1008799.7A priority Critical patent/GB2480649B/en
Publication of GB201008799D0 publication Critical patent/GB201008799D0/en
Publication of GB2480649A publication Critical patent/GB2480649A/en
Application granted granted Critical
Publication of GB2480649B publication Critical patent/GB2480649B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/232 - Orthographic correction, e.g. spell checking or vowelisation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G10L15/265

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to generating an audio database of pronounced words of a user in a language. The present invention relates to a system and method for finding one or more correction candidates given a misspelled word in a language, the system comprising: a first input, wherein the first input is an audio signal from a first database of audio recordings of words spoken in the language; a second input, wherein the second input is an audio signal from a second database of audio recordings of phones and diphones spoken in the language; a third input, which is a misspelled word in text; and a processor, wherein the processor generates an audio signal for the misspelled word using the audio recordings of phones and diphones stored in the second database, compares the audio signal of the misspelled word to the audio recordings of words in the first database, and outputs at least one possible correction for the misspelled word. Preferably, the audio signal from the first database of audio recordings of words spoken in the language and the audio signal from the second database of audio recordings of phones and diphones spoken in the language have been generated for users whose natural language is not the language in which the misspelled word has occurred.

Description

Non-native Language Spelling Correction
Field of the Invention
The present invention relates to an apparatus and method for finding one or more corrections given a misspelled word, wherein the word is misspelled by a non-native or heavily accented speaker of the language; specifically, but not exclusively, using language processing of audio signals of the non-native language speaker.
Background of the Invention
Spelling correction tools are generally used to fix, or suggest possible fixes for, spelling errors that result in invalid words in word processing software, in email clients and so on. The technology is also used on smart phones and Personal Digital Assistants (PDAs), for example.
These tools use two steps to suggest spelling corrections: 1) determine whether the spelling is a non-word; 2) select and rank the correction candidates. The first step can be accomplished by querying a dictionary. The second step is commonly achieved by using an "edit distance" based method, as known in the art. The idea is that edit operations such as insertion and substitution are needed to convert the misspelling into a word, and the correction candidates are ranked according to the number of edit operations required (the so-called edit distance). For example, the most likely candidate for the misspelling "teh" is "the", but "ten" is also an option, and both words would be presented to the user in the order "the"; "ten". The user then picks the word that they had originally intended to write, the spelling correction tool replaces the word in the text, and the tool moves on to the next misspelling. A minimal sketch of this ranking is given below.
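By way of illustration only, the edit-distance ranking described above can be sketched in Python as follows. This is a minimal sketch and not part of the invention: it uses the optimal-string-alignment variant of the Damerau-Levenshtein distance, which also counts adjacent transpositions (so "teh" is one edit from "the"); practical tools additionally break ties using word frequency.

    # Minimal sketch of edit-distance candidate ranking (illustrative only).
    def edit_distance(a: str, b: str) -> int:
        """Optimal string alignment distance: insertions, deletions,
        substitutions and adjacent transpositions each cost 1."""
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[m][n]

    dictionary = ["the", "ten", "tea", "toe"]   # toy dictionary
    ranked = sorted(dictionary, key=lambda w: edit_distance("teh", w))
    print(ranked)   # ['the', 'ten', 'tea', 'toe'] ('toe' costs 2; the rest cost 1)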
In recent years, phonetic factors that go beyond the typical edit operations have also been taken into account. It has been found that misspellings in English writing are closely related to incorrect pronunciations of the English word in question; for example, "edelvise" is a misspelling of "advise". The incorrect pronunciations usually result from the language background of the speaker, so a speaker of English as a second language is more likely to mispronounce, and hence misspell, an English word.
It has been shown that phonetic models can greatly improve spelling correction performance (e.g. K. Toutanova and R. Moore (2002), Pronunciation Modelling for Improved Spelling Correction). The existing approaches to phonetic modelling are mainly based on the "letter-to-phone model" and "phonetic algorithms".
Phonetic algorithms: The purpose of a phonetic algorithm is to encode "homophones" using the same representation, wherein a "homophone" is a word that has the same pronunciation as another word but a different meaning. The most well-known such algorithm is "Soundex".
The patents for the algorithm are US 1,261,167 (Russell) and US 1,435,663 (Russell). In the Soundex scheme, only the consonants are encoded, being converted to the numbers 1-6, and the first character of the Soundex code is the first letter of the word. For example, both "Amy" and "Ann" are encoded as A500.
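For illustration, a simplified Soundex encoder is sketched below. It omits the h/w separator refinement of the full Russell algorithm, and this sketch is not part of the patented scheme.

    # Simplified Soundex sketch: first letter kept, consonants mapped to 1-6,
    # vowels dropped, runs of the same digit collapsed, padded to 4 characters.
    CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}

    def soundex(word: str) -> str:
        word = word.lower()
        code, prev = word[0].upper(), CODES.get(word[0], "")
        for c in word[1:]:
            digit = CODES.get(c, "")
            if digit and digit != prev:
                code += digit
            prev = digit            # any non-coded letter resets the run
        return (code + "000")[:4]

    print(soundex("Amy"), soundex("Ann"))   # A500 A500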
Letter-to-phone model: This model predicts the pronunciation of a misspelling using rules automatically learned from a pronunciation dictionary, for example CMUDICT. CMUDICT stands for "CMU Pronouncing Dictionary", which is a machine-readable pronunciation dictionary for North American English. The pronunciation of a word under the letter-to-phone model is usually represented by the International Phonetic Alphabet, or one of its variants. For example, the misspelling "fisks" would be encoded as "F IH Z IH K S", which is the same as "physics".
Thus "physics" may be presented by a spell checker as a possible option for the misspelled word "fisks".
Both systems convert the spelling into a certain kind of textual-phonetic representation. This has major limitations: 1) None of the existing approaches can really model the pronunciation of an individual person: given a spelling, the existing systems will always produce the same phonetic representation.
2) The encoding process from the spelling to the phone representation is lossy: spellings with different pronunciations can be mapped to the same representation, and a wrong pronunciation can be assigned to a spelling.
Brief Summary of the Invention
The present invention therefore seeks to provide a method and system for finding one or more corrections given a misspelled word, wherein the word is written by a non-native language speaker, or by a speaker who is heavily accented, which overcomes, or at least reduces, some of the above-mentioned problems of the prior art.
Accordingly, in a first aspect, the invention provides a system for generating an audio database of pronounced words of a user in a language, the system comprising: a first input, wherein the first input is an audio recording of a speech corpus spoken by the user in the language; a second input, wherein the second input is a list of words in the language, wherein the list also comprises an encoding of each word phonetically; a first output which is an audio signal of a phone, as spoken by the user in the language, which is stored in a first database; a second output which is an audio signal of a word, as spoken by the user in the language, which is stored in a second database; wherein the first and second inputs and first and second outputs are coupled to a training processor, wherein the training processor analyses the audio recording of the speech corpus to find at least one phone and stores the audio signal of the at least one phone via the first output in the first database; wherein the training processor, for each word in the list of words, analyses the audio recording of the speech corpus to find that word; if it does find the word, the training processor stores the audio signal of that word in the second database, and if it cannot, the training processor synthesises the word using phones stored in the first database and then stores the audio signal of the synthesised word in the second database.
Preferably, the system further comprises a third input which may be a transcribed speech corpus of words spoken by natural speakers of the language, and/or the transcribed speech corpus may be TIMIT.
Preferably, wherein the processor may use the transcribed speech corpus to train a single phone recognizer and/or wherein the single phone recognizer may be a Hidden Markov model.
Preferably, wherein the processor comprises a fourth input which is a phonetic alphabet and/or wherein the phonetic alphabet may be the International Phone Alphabet.
Preferably, wherein the processor may use the trained single phone recognizer to align the audio signal's time window to the phones in the phone alphabet, then processes the speech corpus and stores the resulting audio signals of the analysed phones in the first database. Preferably, wherein the processor uses diphone synthesis to synthesise the word using the phones stored in the first database.
Accordingly, in a second aspect, the invention provides a system for finding one or more correction candidates given a misspelled word in a language, the system comprising: a first input, wherein the first input is an audio signal from a first database of audio recordings of words spoken in the language; a second input, wherein the second input is an audio signal from a second database of audio recordings of phones and diphones spoken in the language; a third input which is a misspelled word in text; and a processor, wherein the processor generates an audio signal for the misspelled word using the audio recordings of phones and diphones stored in the second database, compares the audio signal of the misspelled word to the audio recordings of words in the first database, and outputs at least one possible correction for the misspelled word.
Preferably, wherein the processor outputs a list of more than one possible correction for the misspelled word and/or wherein the list is ranked.
Further preferably, wherein the processor uses a similarity score method to compare the audio signal of the misspelled word to the audio recordings of words in the first database and/or wherein the similarity score method is the Bhattacharyya Kernel method.
Preferably, wherein the first and second databases are generated using the system of the first aspect.
Accordingly, in a third aspect, the invention provides a method of generating an audio database of pronounced words of a user in a language, the method comprising: obtaining an audio recording of a speech corpus spoken by the user in the language; obtaining a list of words in the language, wherein the list also comprises an encoding of each word phonetically; producing an audio signal of a phone, as spoken by the user in the language, and storing it in a first database; producing an audio signal of a word, as spoken by the user in the language, and storing it in a second database; analysing the audio recording of the speech corpus to find at least one phone and storing the audio signal of the at least one phone in the first database; and, for each word in the list of words, analysing the audio recording of the speech corpus to find that word; if that word is found, storing the audio signal of that word in the second database, and if the word is not found, synthesising the word using phones stored in the first database and then storing the audio signal of the synthesised word in the second database.
Preferably, the method further comprises obtaining a transcribed speech corpus of words spoken by natural speakers of the language, wherein the transcribed speech corpus may be TIMIT.
Preferably further comprising using the transcribed speech corpus to train a single phone recognizer, wherein the single phone recognizer is a Hidden Markov model.
Preferably, the method further comprises obtaining a fourth input which is a phonetic alphabet, wherein the phonetic alphabet may be the International Phone Alphabet.
Preferably, further comprising using the trained single phone recognizer to align the audio signal's time window to the phones in the phone alphabet, then processing the speech corpus and storing the resulting audio signals of the analysed phones in the first database.
Further preferably, the method comprises using diphone synthesis to synthesise the word using the phones stored in the first database.
Accordingly, in a fourth aspect, the invention provides a method for finding one or more correction candidates given a misspelled word in a language, the method comprising: obtaining an audio signal from a first database of audio recordings of words spoken in the language; obtaining an audio signal from a second database of audio recordings of phones and diphones spoken in the language; obtaining a misspelled word in text; generating an audio signal for the misspelled word using the audio recordings of phones and diphones stored in the second database; and comparing the audio signal of the misspelled word to the audio recordings of words in the first database and outputting at least one possible correction for the misspelled word.
Preferably, the method comprises outputting a list of more than one possible correction for the misspelled word, wherein the list is ranked.
Further preferably, the method comprises using a similarity score method to compare the audio signal of the misspelled word to the audio recordings of words in the first database, wherein the similarity score method is the Bhattacharyya Kernel method.
Further preferably, wherein the audio signals stored in the first and second databases are generated using the method of the third aspect.
Thus, the present invention selects and ranks the corrected word suggestions based on the audio representation of the misspelled word. It avoids the first limitation of the art, because the spelling can be represented by an individual's, or a group of individuals', audio signals. It also avoids the second limitation of the art, because it uses the audio signal directly, without the need for a textual representation of the phone.
Brief Description of the Drawings
One embodiment of the invention will now be more fully described, by way of example, with reference to the drawings, of which: Figure 1 is a diagram showing a training system for use with a correct spelling suggestion system as shown in Figure 2, according to one embodiment of the present invention; Figure 2 is a diagram showing a correct spelling suggestion system, which uses the training system of Figure 1, according to one embodiment of the present invention; Figure 3 is a flow diagram illustrating a method of training used in the system of Figure 1; and Figure 4 is a flow diagram illustrating a correct spelling suggestion method as used in the system of Figure 2.
Detailed Description of the Drawings
In a brief overview of one embodiment of the present invention, there is shown in Figure 1 a training system for use in the correct spelling suggestion system of Figure 2. The training system 100 comprises a training processor 50 and a set of four inputs (10, 20, 30, 40), as described in the following. The first input 10 is a speech "corpus" of the user or a group of users who speak English as a second language, herein to be defined as "U". A "speech corpus" is a large collection of audio recordings of a set of users who speak English as a second language, or who have a strong regional accent. For example, this could be a recording of a set of Chinese-born language students reading the works of Shakespeare in English as their second language. It should be clear to someone skilled in the art that any languages or regional accent variations could be used as the first input 10 to the training system 100.
The second input 20 is a transcribed speech training corpus, for example a "TIMIT corpus" as known in the art, herein to be defined as "T". "TIMIT" is a collection of transcribed speech of American speakers of different sexes and dialects. In general, a speech training corpus (T) is a set of speech which has been manually transcribed from audio to text. For example, this could be a set of audio recordings of natural English speakers with no strong accent reading the same works of Shakespeare, or a week's worth of newspapers, which has then been transcribed into text.
The specific content of the speech corpus "T" does not have to be directly related to "U", the set of audio recordings of the users who speak English as a second language. For example, the two sets of speakers do not both have to be reading the works of Shakespeare, but both sets need to be of a large enough size to give a good training input to the training system 100. The speech training corpus "T" should ideally contain all the words in the English dictionary, and the bigger the set of audio recordings of the users "U" is, the better trained the training system 100 will be.
The third input 30 is a phonetic alphabet (e.g. the International Phone Alphabet), herein to be defined as "P", which contains both "phones" and "diphones" for the English language. For example, using a phonetic alphabet results in the encoding "F IH Z IH K S" for the word "physics". "Diphones" are descriptors of the way two "phones" sound when put together in a specific sequence in speech. A person skilled in the art will know what "phones" and "diphones" are and how they are used in language processing and speech synthesis; a small illustration is given below.
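Purely as an illustration of the relationship between the two unit types, a word's diphone sequence can be read off its phone sequence as follows.

    # A diphone spans the transition between two adjacent phones.
    phones = ["F", "IH", "Z", "IH", "K", "S"]              # "physics"
    diphones = [f"{a}-{b}" for a, b in zip(phones, phones[1:])]
    print(diphones)   # ['F-IH', 'IH-Z', 'Z-IH', 'IH-K', 'K-S']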
The fourth input 40 is a word list, herein to be known as "W", and each word's pronunciation in the word list "W" is represented phonetically by "p", a sequence drawn from the relevant phonetic alphabet. For example, the word list "W" could be the list of words in the Oxford English dictionary.
The processor 50 uses the four inputs (10, 20, 30 & 40) to define, for each word (w) in the word list "W", the way in which the set of training users actually pronounce English words, and, for each phonetic part of the word (p), the way in which they pronounce phones and diphones in English as their non-first language, or with their strong regional accent.
A first output 70 of the training system 100 is therefore a set of audio files of the phones and diphones (p) for each word (w) in the set of words (W), recorded as they are pronounced by the users in the group of users, as opposed to natural or non-accented speakers. The phones and diphones (p) as pronounced by the users are then stored in a phones audio database 800 (DP).
A second output 80 of the training system 100 is a set of audio files of each word (w) in the list of words "W", again recorded as they are pronounced by the users, and these audio files are then stored in a word audio database 700 (DW).
In order to achieve this, the processor 50 compares every word (w) in the list of words "W" with every word pronounced by the group of users in the set of audio recordings (U). The pronunciation of the particular word and the pronunciation of that word's phones and diphones (p) are then stored as audio signals in the word audio database 700 and the phones audio database 800 respectively. If any particular word (w) in the list of words (W) is not spoken by a user in the group of users (U), then that word is synthesised using the phones and diphones stored in the phones audio database 800 (during previous iterations of the process). Once the word is synthesised, it is then stored in the word audio database 700. For further details regarding how the processor 50 does this, please refer to steps T3, T4 and T5 of Figure 3, and to the sketch below.
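The word-database construction can be sketched as follows. This is an illustrative sketch only: Python dictionaries stand in for the databases 700 and 800, and synthesise_from_phones is a crude stand-in for the diphone synthesis of step T5 (a fuller concatenation sketch appears with Figure 3).

    # Sketch of steps T3-T5: store the user's own recording of each word if
    # one exists, otherwise synthesise the word from the user's phone signals.
    import numpy as np

    def synthesise_from_phones(phone_seq, DP):
        return np.concatenate([DP[p] for p in phone_seq])

    def build_word_database(word_list, spoken_words, DP):
        """word_list: [(word, [phone, ...])]; spoken_words: {word: signal}
        found in the user corpus U; DP: {phone: signal} (database 800)."""
        DW = {}                                    # word database 700
        for word, phone_seq in word_list:
            if word in spoken_words:               # T3 -> T4
                DW[word] = spoken_words[word]
            else:                                  # T3 -> T5 -> T4
                DW[word] = synthesise_from_phones(phone_seq, DP)
        return DW

    DP = {"DH": np.zeros(160), "AH": np.ones(160)}         # toy phone signals
    DW = build_word_database([("the", ["DH", "AH"])], {}, DP)
    print(DW["the"].shape)                                  # (320,)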
Figure 2 is a diagram showing a spelling correction system, which uses the training system of Figure 1. The spelling correction system 200 suggests one or more corrections for misspelled words, for a particular user, using recorded audio signals of how the user actually pronounces words in that language, the system having already been trained as described previously with reference to Figure 1.
The spelling correction system 200 comprises a spelling correction suggestion processor 250, which has five inputs, as described in the following. The first input is a misspelled word Wmis 210, wherein Wmis 210 is a word which has been misspelled by a user using a handheld Personal Digital Assistant (PDA), on a smartphone, in an email, on a computer screen, or in a document being edited by a user, for example. The misspelled word Wmis 210 could also come from a computer on a desk in an office. It should be clear to someone skilled in the art how and where spelling mistakes are generated electronically, in word processing applications or on smart phones, for example.
The second input 220 is a phone alphabet (e.g. the International Phone Alphabet), herein to be defined as "P". This is the same International Phone Alphabet "P" as used in the training system described previously with reference to Figure 1.
The third input 290 is a word list (W), together with each word's pronunciation represented phonetically by phones and diphones (p).
The fourth input 270 is a signal stored in a word pronunciation audio database (DW), wherein the word database has already been trained for a particular group of users (U) for each word in the list of words (W). For further information on the training process, please refer to the description regarding Figures 1 and 3.
The fifth input 280 is a signal from a phone audio database (DP), wherein the phone audio database has already been generated using audio recordings of a particular group of users (U) for each word (w) in the list of words (W). For further information on the training process, please refer to the description regarding Figures 1 and 3.
The spelling correction suggestion processor 250 uses similarity scores to rank possible spelling correction candidates for the misspelled word (Wmis) and outputs, via its output 230, a ranked list (L) of correction candidates for Wmis 210. The list (L) is therefore based on audio recordings of the way users in the group of users actually pronounce words in English, as opposed to the way natural, or non-accented, speakers would.
The actual methods of training and spelling suggestion as used by the training and spelling correction processors 50, 250 will now be described with reference to Figures 3 and 4.
Figure 3 shows a flow diagram illustrating the method of training used in the system of Figure 1, wherein the training method aims to acquire the pronunciation of the phones and words in a word list (W) based on the way a set of users (U) actually pronounce English words, and comprises the following steps: Start iterative training process: go to step T1.
T1. A single phone recogniser (M) is first trained on the transcribed speech corpus (T). An example of such a recogniser is a Hidden Markov Model (HMM), and it is known in the art how to do this. Go to step T2.
T2. Use the single phone recogniser (M) to find the phones and diphones in the audio recordings of the group of users (U). The training processor as described with reference to Figure 1 uses the recogniser M to align the audio signal time windows to the phones in the phone alphabet P. The resulting audio phone signals are stored in the phone audio database DP, as are the resulting audio diphone signals. Again, it is known in the art how to do this; a sketch is given below. Go to step T3.
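A hedged sketch of steps T1 and T2 is given below, assuming one Gaussian HMM per phone trained on MFCC features; librosa and hmmlearn are assumed to be installed, and the TIMIT audio itself is licensed separately. The patent does not mandate this particular toolchain.

    # One GaussianHMM per phone, trained on MFCCs of that phone's examples.
    import numpy as np
    import librosa
    from hmmlearn import hmm

    def train_phone_model(wav_paths, n_states=3):
        feats = []
        for path in wav_paths:                     # excised examples of one phone
            y, sr = librosa.load(path, sr=16000)
            feats.append(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T)
        X = np.vstack(feats)
        lengths = [f.shape[0] for f in feats]
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
        model.fit(X, lengths)
        return model

    # Alignment/recognition then scores each candidate time window against
    # every phone model and keeps the best:
    #   best_phone = max(models, key=lambda p: models[p].score(window_mfcc))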
T3. For each word (w) in the list of words (W), check whether the particular word (w) is in the audio recordings of the group of users (U); if it is, move to step T4, and if not, move to step T5.
T4. Store either the learnt or synthesised audio signal of the particular word (w) in the word audio database (DW). Go to Finish.
T5. Synthesise the pronunciation signal of the particular word (w) using the learned phone and diphone signals stored in the phone audio database (DP).
An example of such technology is 'diphone synthesis', as known in the art: the generation of the pronunciation of a particular word (w) is based on the concatenation of the phone signals according to the phone and diphone representation (p) of that particular word (w), as sketched below. Go to step T4.
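A minimal concatenation sketch is given below; real diphone synthesis also smooths pitch and duration at the joins, which is omitted here.

    # Concatenate stored unit signals with a short linear crossfade per join.
    import numpy as np

    def concatenate_units(units, fade=80):
        """units: list of 1-D sample arrays; fade: overlap in samples."""
        ramp = np.linspace(0.0, 1.0, fade)
        out = units[0]
        for u in units[1:]:
            mixed = out[-fade:] * (1.0 - ramp) + u[:fade] * ramp
            out = np.concatenate([out[:-fade], mixed, u[fade:]])
        return out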
Finish: end of iterative training process.
Thus the training process has generated a word audio database (DW) and a phone audio database (DP) of the way in which the users actually pronounce words in the language being used. Every word in that language's dictionary is individually recorded, either from the actual speech corpus or synthesised from its phones as spoken by the users. The user-generated word audio database (DW) and phone audio database (DP) are then used in the spelling correction process (as described with reference to Figure 4), together with a similarity scoring method, to produce possible correction candidates.
Figure 4 is a flow diagram illustrating the spelling suggestion method as used in the system of Figure 2, wherein the correct spelling suggestion method aims to find the correction for a given misspelling according to the word and phone pronunciation signal databases (DW and DP), which have been trained for a particular group of users, and which comprises the following steps: Start of spelling suggestion method. Go to step P1.
P1. Generate a pronunciation signal (S) for the misspelling (Wmis) based on the learned phone and diphone signals in the phone audio database (DP), using the same method as step T5 in the training process of Figure 3. Go to step P2.
P2. Compare the pronunciation signal (S) against the signal of every particular word (w) in the list of words (W), which is stored in the word audio database (DW), and produce a similarity score using a similarity measure, as known in the art. An example of such a similarity measure is the Bhattacharyya Kernel, which is a compression-based similarity measure. Given two audio signals A1 and A2 and a perceptual audio codec (e.g. MPEG-1 Layer III (MP3) and/or Advanced Audio Coding (AAC)), A1 and A2 are compressed individually to produce compressed audio files F1 and F2 respectively. Then A1 and A2 are concatenated into a single audio file, which is compressed to produce one compressed audio file F3. The similarity of A1 and A2 can then be calculated from the file sizes of F1, F2 and F3 using a normalized compression distance, as known in the art and sketched below. Go to step P3.
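The normalized compression distance can be sketched as follows. zlib is substituted for the perceptual codec purely to keep the illustration dependency-free; with MP3 or AAC the compressed file sizes would play the same role.

    # Normalized compression distance: 0 ~ very similar, ~1 ~ unrelated.
    import zlib

    def ncd(a: bytes, b: bytes) -> float:
        ca, cb = len(zlib.compress(a)), len(zlib.compress(b))   # F1, F2
        cab = len(zlib.compress(a + b))                         # F3
        return (cab - min(ca, cb)) / max(ca, cb)

    s1 = b"abcabcabc" * 50          # stand-ins for the audio signals A1, A2
    s2 = b"abcabcabc" * 50
    s3 = b"xyzqrs" * 75
    print(ncd(s1, s2) < ncd(s1, s3))   # True: similar signals score lower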
P3. Store the particular word (w) and its similarity score in a list (L) of, for example, the five closest matches. Go to step P4.
P4. Using the similarity scores generated in step P2, rank the list (L) of possible spelling corrections for the misspelled word (Wmis) in order of the likelihood that each candidate is the correct spelling of the word originally intended. A sketch of steps P3 and P4 is given below. Go to Finish.
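Steps P3 and P4 then reduce to keeping and ordering the best-scoring candidates, for example as follows (assuming the ncd() function from the sketch above and a dictionary DW of candidate-word signals as bytes).

    # Keep the five closest candidates and rank them by similarity.
    import heapq

    def suggest(sig_mis: bytes, DW: dict, k: int = 5):
        return heapq.nsmallest(k, DW, key=lambda w: ncd(sig_mis, DW[w]))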
Finish: end of spelling suggestion process.
As such, it has been shown that the present invention uses audio recordings from a group of users (U) who speak a language such as English as a second, non-native language, or with a heavy accent, and uses these audio recordings to train a system in order to suggest spelling corrections when that very same user group misspells English words.
It will be appreciated that although only one particular embodiment of the invention has been described in detail, various modifications and improvements can be made by a person skilled in the art without departing from the scope of the present invention.

Claims (32)

  1. A system for generating an audio database of pronounced words of a user in a language, the system comprising: a first input, wherein the first input is an audio recording of a speech corpus spoken by the user in the language; a second input, wherein the second input is a list of words in the language, wherein the list also comprises an encoding of each word phonetically; a first output which is an audio signal of a phone, as spoken by the user in the language, which is stored in a first database; a second output which is an audio signal of a word, as spoken by the user in the language, which is stored in a second database; wherein the first and second inputs and first and second outputs are coupled to a training processor, wherein the training processor analyses the audio recording of the speech corpus to find at least one phone and stores the audio signal of the at least one phone via the first output in the first database; wherein the training processor, for each word in the list of words, analyses the audio recording of the speech corpus to find that word; if it does find the word, the training processor stores the audio signal of that word in the second database, and if it cannot, the training processor synthesises the word using phones stored in the first database and then stores the audio signal of the synthesised word in the second database.
  2. A system according to claim 1, further comprising a third input which is a transcribed speech corpus of words spoken by natural speakers of the language.
  3. A system according to claim 2, wherein the processor uses the transcribed speech corpus to train a single phone recognizer.
  4. A system according to claim 3, wherein the processor comprises a fourth input which is a phonetic alphabet.
  5. A system according to claim 4, wherein the processor uses the trained single phone recognizer to align the audio signal's time window to the phones in the phone alphabet, then processes the speech corpus, and the resulting audio signals of the analysed phones are stored in the first database.
  6. A system according to claim 2, wherein the transcribed speech corpus is TIMIT.
  7. A system according to claim 3, wherein the single phone recognizer is a Hidden Markov model.
  8. A system according to claim 4, wherein the phonetic alphabet is the International Phone Alphabet.
  9. A system according to any preceding claim, wherein the processor uses diphone synthesis to synthesise the word using the phones stored in the first database.
  10. A system for finding one or more correction candidates given a misspelled word in a language, the system comprising: a first input, wherein the first input is an audio signal from a first database of audio recordings of words spoken in the language; a second input, wherein the second input is an audio signal from a second database of audio recordings of phones and diphones spoken in the language; a third input which is a misspelled word in text; and a processor, wherein the processor generates an audio signal for the misspelled word using the audio recordings of phones and diphones stored in the second database, compares the audio signal of the misspelled word to the audio recordings of words in the first database, and outputs at least one possible correction for the misspelled word.
  11. A system according to claim 10, wherein the processor outputs a list of more than one possible correction for the misspelled word.
  12. A system according to claim 11, wherein the list is ranked.
  13. A system according to any of claims 10 to 12, wherein the processor uses a similarity score method to compare the audio signal of the misspelled word to the audio recordings of words in the first database.
  14. A system according to claim 13, wherein the similarity score method is the Bhattacharyya Kernel method.
  15. A system according to any of claims 10 to 14, wherein the first and second databases are generated using the system of any of claims 1 to 9.
  16. A method of generating an audio database of pronounced words of a user in a language, the method comprising: obtaining an audio recording of a speech corpus spoken by the user in the language; obtaining a list of words in the language, wherein the list also comprises an encoding of each word phonetically; producing an audio signal of a phone, as spoken by the user in the language, and storing it in a first database; producing an audio signal of a word, as spoken by the user in the language, and storing it in a second database; analysing the audio recording of the speech corpus to find at least one phone and storing the audio signal of the at least one phone in the first database; and, for each word in the list of words, analysing the audio recording of the speech corpus to find that word; if that word is found, storing the audio signal of that word in the second database, and if the word is not found, synthesising the word using phones stored in the first database and then storing the audio signal of the synthesised word in the second database.
  17. A method according to claim 16, further comprising obtaining a transcribed speech corpus of words spoken by natural speakers of the language.
  18. A method according to claim 17, further comprising using the transcribed speech corpus to train a single phone recognizer.
  19. A method according to claim 18, further comprising obtaining a fourth input which is a phonetic alphabet.
  20. A method according to claim 19, further comprising using the trained single phone recognizer to align the audio signal's time window to the phones in the phone alphabet, then processing the speech corpus and storing the resulting audio signals of the analysed phones in the first database.
  21. A method according to claim 17, wherein the transcribed speech corpus is TIMIT.
  22. A method according to claim 18, wherein the single phone recognizer is a Hidden Markov model.
  23. A method according to claim 19, wherein the phonetic alphabet is the International Phone Alphabet.
  24. A method according to any of claims 16 to 23, further comprising using diphone synthesis to synthesise the word using the phones stored in the first database.
  25. A method for finding one or more correction candidates given a misspelled word in a language, the method comprising: obtaining an audio signal from a first database of audio recordings of words spoken in the language; obtaining an audio signal from a second database of audio recordings of phones and diphones spoken in the language; obtaining a misspelled word in text; generating an audio signal for the misspelled word using the audio recordings of phones and diphones stored in the second database; and comparing the audio signal of the misspelled word to the audio recordings of words in the first database and outputting at least one possible correction for the misspelled word.
  26. A method according to claim 25, further comprising outputting a list of more than one possible correction for the misspelled word.
  27. A method according to claim 26, wherein the list is ranked.
  28. A method according to any of claims 25 to 27, further comprising using a similarity score method to compare the audio signal of the misspelled word to the audio recordings of words in the first database.
  29. A method according to claim 28, wherein the similarity score method is the Bhattacharyya Kernel method.
  30. A method according to any of claims 25 to 29, wherein the audio signals stored in the first and second databases are generated using the method of any of claims 16 to 24.
  31. A system and/or method as hereinbefore described for generating an audio database of pronounced words of a user in a language, with specific reference to Figures 1 and 3.
  32. A system and/or method as hereinbefore described for finding one or more correction candidates given a misspelled word in a language, with specific reference to Figures 2 and 4.
GB1008799.7A 2010-05-26 2010-05-26 Non-native language spelling correction Active GB2480649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1008799.7A GB2480649B (en) 2010-05-26 2010-05-26 Non-native language spelling correction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1008799.7A GB2480649B (en) 2010-05-26 2010-05-26 Non-native language spelling correction

Publications (3)

Publication Number Publication Date
GB201008799D0 GB201008799D0 (en) 2010-07-14
GB2480649A true GB2480649A (en) 2011-11-30
GB2480649B GB2480649B (en) 2017-07-26

Family

ID=42371026

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1008799.7A Active GB2480649B (en) 2010-05-26 2010-05-26 Non-native language spelling correction

Country Status (1)

Country Link
GB (1) GB2480649B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5812863A (en) * 1993-09-24 1998-09-22 Matsushita Electric Ind. Apparatus for correcting misspelling and incorrect usage of word
US20030036903A1 (en) * 2001-08-16 2003-02-20 Sony Corporation Retraining and updating speech models for speech recognition
WO2003021374A2 (en) * 2001-09-06 2003-03-13 Nir Einat H Language-acquisition apparatus
US20030130847A1 (en) * 2001-05-31 2003-07-10 Qwest Communications International Inc. Method of training a computer system via human voice input
WO2004063902A2 (en) * 2003-01-09 2004-07-29 Lessac Technology Inc. Speech training method with color instruction
US7139708B1 (en) * 1999-03-24 2006-11-21 Sony Corporation System and method for speech recognition using an enhanced phone set
US20080270118A1 (en) * 2007-04-26 2008-10-30 Microsoft Corporation Recognition architecture for generating Asian characters
US20090006097A1 (en) * 2007-06-29 2009-01-01 Microsoft Corporation Pronunciation correction of text-to-speech systems between different spoken languages
WO2009040790A2 (en) * 2007-09-24 2009-04-02 Robert Iakobashvili Method and system for spell checking
US20100145698A1 (en) * 2008-12-01 2010-06-10 Educational Testing Service Systems and Methods for Assessment of Non-Native Spontaneous Speech
US20100211376A1 (en) * 2009-02-17 2010-08-19 Sony Computer Entertainment Inc. Multiple language voice recognition
JP2010224563A (en) * 1997-11-17 2010-10-07 Nuance Communications Inc Method and apparatus for correcting speech, and recording medium
WO2011035986A1 (en) * 2009-09-28 2011-03-31 International Business Machines Corporation Method and system for enhancing a search request by a non-native speaker of a given language by correcting his spelling using the pronunciation characteristics of his native language

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2613271A1 (en) * 2012-01-09 2013-07-10 Research In Motion Limited Method and apparatus for database augmentation and multi-word substitution
WO2018097936A1 (en) * 2016-11-22 2018-05-31 Microsoft Technology Licensing, Llc Trained data input system
US10095684B2 (en) 2016-11-22 2018-10-09 Microsoft Technology Licensing, Llc Trained data input system

Also Published As

Publication number Publication date
GB201008799D0 (en) 2010-07-14
GB2480649B (en) 2017-07-26

Similar Documents

Publication Publication Date Title
CN105957518B (en) A kind of method of Mongol large vocabulary continuous speech recognition
EP2248051B1 (en) Computer implemented method for indexing and retrieving documents in database and information retrieval system
US20110238412A1 (en) Method for Constructing Pronunciation Dictionaries
US9978364B2 (en) Pronunciation accuracy in speech recognition
US20070255567A1 (en) System and method for generating a pronunciation dictionary
US20110106792A1 (en) System and method for word matching and indexing
US20080027725A1 (en) Automatic Accent Detection With Limited Manually Labeled Data
Reddy et al. Integration of statistical models for dictation of document translations in a machine-aided human translation task
Loots et al. Automatic conversion between pronunciations of different English accents
Nouza et al. ASR for South Slavic Languages Developed in Almost Automated Way.
JP2008243080A (en) Device, method, and program for translating voice
Al-Anzi et al. The impact of phonological rules on Arabic speech recognition
Masmoudi et al. Phonetic tool for the Tunisian Arabic
Liang et al. A Taiwanese text-to-speech system with applications to language learning
GB2480649A (en) Non-native language spelling correction
Pellegrini et al. Automatic word decompounding for asr in a morphologically rich language: Application to amharic
Adda-Decker et al. A first LVCSR system for luxembourgish, a low-resourced european language
Thatphithakkul et al. LOTUS-BI: A Thai-English code-mixing speech corpus
Unnibhavi et al. Development of Kannada speech corpus for continuous speech recognition
Awino et al. Phonemic Representation and Transcription for Speech to Text Applications for Under-resourced Indigenous African Languages: The Case of Kiswahili
Chen et al. Using Taigi dramas with Mandarin Chinese subtitles to improve Taigi speech recognition
JP2021089300A (en) Method and device for multilingual voice recognition and theme-meaning element analysis
Hendessi et al. A speech synthesizer for Persian text using a neural network with a smooth ergodic HMM
Pandey et al. Development and suitability of indian languages speech database for building watson based asr system
Al-Daradkah et al. Automatic grapheme-to-phoneme conversion of Arabic text

Legal Events

Date Code Title Description
732E Amendments to the register in respect of changes of name or changes affecting rights (sect. 32/1977)

Free format text: REGISTERED BETWEEN 20190103 AND 20190109