GB2373088A - Speech recognition apparatus - Google Patents

Speech recognition apparatus

Info

Publication number
GB2373088A
GB2373088A (Application GB0028144A)
Authority
GB
United Kingdom
Prior art keywords
phones
data
utterance
sequences
word
Prior art date
Legal status
Withdrawn
Application number
GB0028144A
Other versions
GB0028144D0 (en)
Inventor
Jason Peter Andre Charlesworth
Philip Neil Garner
Jebu Jacob Rajan
Current Assignee
Canon Inc
Original Assignee
Canon Inc
Application filed by Canon Inc filed Critical Canon Inc
Priority to GB0028144A
Publication of GB0028144D0
Publication of GB2373088A

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 — Speech classification or search
    • G10L2015/025 — Phonemes, fenemes or fenones being the recognition units


Abstract

A recognition module (17) matches utterances to sequences of phones. The recognition module (17) then selects a number of sequences of phones which correspond to words or parts of words defined by a phonetic dictionary (18) as being probably representative of a detected utterance. This selection is made by the recognition module (17) determining the differences between matched sequences of phones and sequences of phones defined by the phonetic dictionary (18). The recognition module (17) then assigns probability data to the sequences of phones in the phonetic dictionary (18) utilizing selected data from a confusability database (19) indicating the likelihood of errors, such as the insertion, omission or substitution of phones, occurring which would result in the mismatching of a sequence of phones.

Description

SPEECH PROCESSING APPARATUS AND METHOD
The present invention relates to a speech processing apparatus and method. In particular, embodiments of the present invention are applicable to speech recognition.
Speech recognition is a process by which an unknown speech utterance is identified. There are several different types of speech recognition systems currently available which can be categorised in several ways. For example, some systems are speaker dependent, whereas others are speaker independent. Some systems operate for a large vocabulary of words (greater than 10,000 words) whilst others operate with a limited size vocabulary (fewer than 1,000 words). Some systems can only recognise isolated words whereas others can recognise phrases comprising a series of connected words.
In a large vocabulary system, speech recognition is performed by comparing features of an unknown utterance with features of phonemes or phones which are stored in a database. The features of the phones or phonemes are determined during a training session in which one or more samples of phones or phonemes are used to generate reference patterns therefor. The reference patterns may
be acoustic templates of the modelled speech or statistical models such as hidden Markov models.
To recognise an unknown utterance the speech recognition apparatus divides an utterance into a number of distinct temporal portions. Each of these temporal portions is then compared against reference patterns for phonemes or phones stored in the database. A scoring technique is then used to provide a measure of how well each reference pattern matches the temporal portion of the input utterance. This enables a sequence of possible phones or phonemes corresponding to the utterance to be generated.
A generated sequence of phones or phonemes can then be compared with models of words within the recognition apparatus vocabulary. Where a generated sequence of phones or phonemes matches the stored model of a word, a possible candidate match for that portion of the utterance can be output.
A problem with large vocabulary speech recognition systems arises due to the fact that the total number of possible matches between an identified sequence of phones or phonemes for portions of an utterance and sequences of phones or phonemes stored as models of words increases exponentially with the length of the utterance. In order
to overcome this problem, a pruning algorithm is provided so that the speech recognition system only considers as matches for the remainder of an utterance sequences of phones or phonemes which are identified as probable continuations of the matched portion of an utterance. The criteria used to select the sequences of phones or phonemes which are retained are normally based upon the measures of the goodness of match of previous portions of an utterance to the stored models.
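The pruning described above can be sketched in a few lines. The following is a minimal Python illustration (the function and variable names are my own, not from the patent): hypotheses carry log probabilities, and only the best beam_width survive each step.

```python
import heapq

def prune_hypotheses(hypotheses, beam_width):
    """Keep only the beam_width highest-scoring hypotheses.

    hypotheses: list of (log_probability, phone_sequence) pairs.
    Returns the retained entries, highest log probability first.
    """
    return heapq.nlargest(beam_width, hypotheses, key=lambda h: h[0])
```

As the passage notes, any hypothesis discarded here can never be recovered, which is the weakness the invention addresses.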
Although selecting only a limited number of possible sequences of phones or phonemes as being suitable for continued matching with the remainder of an unknown utterance prevents the total number of sequences being processed from becoming unmanageable, the elimination of possible sequences of phones or phonemes can prevent an utterance from being properly identified.
The present invention aims to provide a speech recognition system which increases the likelihood of an unknown utterance being correctly matched whilst keeping the total number of sequences of candidate matches for an utterance within manageable bounds.
In particular, embodiments of the present invention aim
to provide speech recognition systems which are able to match sequences of phones or phonemes determined for an utterance to sequences of words, even if errors occur in determining the phones or phonemes representative of an utterance. In embodiments of the present invention speech recognition systems are provided which can match utterances to words even when the determination of phones or phonemes representative of utterances erroneously inserts, omits or misrecognises a portion of an utterance.
In accordance with one aspect of the present invention, there is provided a speech recognition apparatus comprising: detection means for detecting an utterance; decoding means for matching portions of an utterance detected by said detection means to data indicative of phones or phonemes; and matching means for matching an utterance detected by said detection means to data indicative of words utilising said matching of portions of said utterance to phones or phonemes by said decoding means wherein said matching
means is arranged to store data identifying the probability of a portion of an utterance being incorrectly matched and to utilise said stored data to select data indicative of words to match an utterance detected by said detection means.
An exemplary embodiment of the invention will now be described with reference to the accompanying drawings, in which:
Figure 1 is a schematic overview of a speech recognition system in accordance with an embodiment of the present invention;
Figure 2 is a block diagram of the preprocessor incorporated as part of the system shown in Figure 1, which illustrates some of the processing steps that are performed on an input speech signal;
Figure 3 is a block diagram of the recognition module, confusability database and phonetic dictionary incorporated as part of the system shown in Figure 1;
Figure 4 is a schematic block diagram of an exemplary data structure of an active phone list record stored within the active phone list buffer of Figure 3;
Figure 5 is a schematic block diagram of an exemplary data structure for data stored within the word hypothesis store of Figure 3;
Figure 6 is a schematic block diagram of an exemplary data structure for a record within the phonetic dictionary of Figure 3;
Figure 7 is a schematic illustration of part of a phonetic dictionary word tree;
Figure 8 is a schematic block diagram of an exemplary data structure for data stored within the confusability database of Figure 3;
Figure 9 is a schematic block diagram of an exemplary data structure for data identifying a word sequence;
Figure 10 is a flow diagram of the processing of a word decoder of a recognition module in accordance with an embodiment of the present invention;
Figure 11 is a flow diagram of the generation of new word hypotheses from new phones; and
Figure 12 is a flow diagram of the storage of a newly generated word hypothesis.
Embodiments of the present invention can be implemented in computer hardware, but the embodiment to be described is implemented in computer software which is run in conjunction with processing hardware such as a personal computer, work station, photocopier, facsimile machine, mobile phone or the like. In accordance with embodiments of the present invention speech recognition apparatus is provided which is arranged to convert acoustic speech signals into sequences of data identifying lists of words. Generated word sequence data can then be utilized by the apparatus incorporating the speech recognition system.
Figure 1 is a schematic overview of a context independent speech recognition system in accordance with an embodiment of the present invention. Acoustic speech of the user is detected by a microphone 7 that converts the acoustic speech signal into an equivalent electrical signal s(t) and supplies this signal to a preprocessor module 15. The preprocessor module 15 then converts the
input speech signal into a sequence of parameter frames fk, each representing a corresponding time frame of the input speech signal. The sequence of parameter frames fk is then supplied via a buffer 16 to a recognition module 17. The recognition module 17 processes the parameter frames fk and generates word sequence data comprising data identifying sequences of words known to the system.
When processing the parameter frames fk the recognition module 17 initially determines a number of phones or phonemes which the portion of an utterance represented by a parameter frame is likely to represent. The recognition module 17 then utilises the identified phones to determine word sequence data for output. Specifically, the recognition module 17 outputs word sequence data which is determined to have the greatest probability of being representative of the detected utterance given that the processing of the utterance resulted in the generation of a particular sequence of phones or phonemes.
When the recognition module 17 receives a signal from the buffer 16 that the buffer 16 contains no further data this causes the recognition module 17 to output data identifying a sequence of words matched to the parameter
frames fk received from the buffer 16. The word sequence data output from the recognition module 17 can then be utilized as an input into the system incorporating the speech recognition apparatus.
In conventional context independent speech recognition systems, as part of the process of matching sequences of parameter frames fk to word sequence data, the recognition module 17 utilises two pieces of information to reduce the total number of possible sequences of word data that the system considers at any one time. Firstly, the recognition module 17 processes the parameter frame data fk to identify the most likely phones or phonemes that each portion of an utterance represents. Secondly a phonetic dictionary 18 is provided which stores a language model identifying which phones or phonemes can follow one another in words in the vocabulary of the system. The recognition module 17 then limits its matching of a parameter frame to phones which are both probably representative of a portion of an utterance being processed and allowable extensions of the sequences matched for earlier parts of an utterance.
The inventors have realised that problems in recognising an utterance can arise due to difficulties in
matching parameter frames to phones and phonemes. Errors can arise because a word is pronounced by an individual differently from the manner expected by the system. As a result, portions of a word may be omitted, added or unclear. Additionally, errors can occur due to erroneous processing of an utterance by the system.
Furthermore, even if the utterance itself is a clear, correct, expected representation of a word within the vocabulary of the system, errors may still arise where a portion of an utterance can match many different phones or phonemes and the recognition module 17 selects only a limited number of the phones or phonemes as suitable for future processing.
If an error in matching occurs, this can then prevent later parts of an utterance from being matched at all, as the earlier erroneous match may mean that the sequence of phones represented by the rest of an utterance no longer corresponds to a valid extension of the matched portion of the utterance. There is therefore a need for a speech recognition system which can account for errors arising due to erroneous decoding and errors which arise due to variation of pronunciation.
Thus, in accordance with the present invention, in
addition to being connected to a phonetic dictionary 18, the recognition module 17 is also connected to a confusability database 19. The confusability database 19 is arranged to store data identifying probabilities of parts of an utterance being mismatched to phones or phonemes other than the phone or phoneme a portion of an utterance actually represents. When a limited number of most probable phones or phonemes for matching a portion of an utterance has been determined, probabilities for sequences of recognised words as identified by the phonetic dictionary 18 are then determined for all valid extensions of an earlier matched portion of an utterance, utilising the data within the confusability database 19.
The recognition module 17 then selects for future consideration the most likely candidate matches from all of the possible valid extensions of the portion of the utterance. Thus even if the matching of parts of an utterance to phones or phonemes results in a failure to match part of an utterance directly to a sequence of phones or phonemes corresponding to the actual word spoken, the recognition module 17 is able to match an utterance to the correct sequence of phones or phonemes.
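One way to realise the confusability-weighted scoring of valid extensions is sketched below in Python. All names and the dictionary-based data layout are illustrative assumptions; the patent specifies the data stored, not this exact procedure. A valid dictionary extension that was actually decoded takes its acoustic score directly; one that was not decoded takes the best score obtainable by assuming it was misrecognised as one of the decoded phones.

```python
def extension_score(prior_logp, valid_phone, decoded, confusion_logp):
    """Score for extending a hypothesis with `valid_phone`.

    prior_logp:     log probability of the hypothesis so far.
    decoded:        dict mapping each phone on the N-best active phone
                    list to its acoustic log probability.
    confusion_logp: dict mapping (true_phone, decoded_phone) to the log
                    probability of that misrecognition.
    """
    if valid_phone in decoded:
        # Direct match: add the acoustic score for the phone itself.
        return prior_logp + decoded[valid_phone]
    # No direct match: assume the decoder misrecognised valid_phone as
    # one of the phones it did output, and take the best such case.
    return prior_logp + max(
        decoded[q] + confusion_logp[(valid_phone, q)] for q in decoded
    )
```

This is what allows a hypothesis to survive even when its true phone never appears on the decoded list.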
A more detailed explanation will now be given of some of the apparatus blocks described above.
PREPROCESSOR
The preprocessor will now be described with reference to Figure 2.
The functions of the preprocessor 15 are to extract the information required from the speech and to reduce the amount of data that has to be processed. There are many different types of information which can be extracted from the input signal. In this embodiment the preprocessor 15 is designed to extract "formant" related information. Formants are defined as being the resonant frequencies of the vocal tract of the user, which change as the shape of the vocal tract changes.
Figure 2 shows a block diagram of some of the preprocessing that is performed on the input speech signal. Input speech S(t) from the microphone 7 is supplied to filter block 21, which removes frequencies within the input speech signal that contain little meaningful information. Most of the information useful for speech recognition is contained in the frequency band between 300 Hz and 4 kHz. Therefore, filter block 21 removes all frequencies outside this frequency band.
Since no information which is useful for speech recognition is filtered out by the filter block 21, there
is no loss of recognition performance. Further, in some environments, for example in a motor vehicle, most of the background noise is below 300 Hz and the filter block 21 can result in an effective increase in signal-to-noise ratio of approximately 10 dB or more. The filtered speech signal is then converted into 16-bit digital samples by the analogue-to-digital converter (ADC) 23. To adhere to the Nyquist sampling criterion, the ADC 23 samples the filtered signal at a rate of 8000 times per second. In this embodiment, the whole input speech utterance is converted into digital samples and stored in a buffer (not shown), prior to the subsequent steps in the processing of the speech signals.
After the input speech has been sampled it is divided into non-overlapping equal length frames in block 25. The speech frames Sk(r) output by the block 25 are then written into a circular buffer 26 which can store 62 frames corresponding to approximately one second of speech. The frames written in the circular buffer 26 are also passed to an endpoint detector 28 which processes the frames to identify when the speech in the input signal begins, and after it has begun, when it ends. Until speech is detected within the input signal, the frames in the circular buffer are not fed to the
computationally intensive feature extractor 30. However, when the endpoint detector 28 detects the beginning of speech within the input signal, it signals the circular buffer to start passing the frames received after the start of speech point to the feature extractor 30 which then extracts a set of parameters fk for each frame representative of the speech signal within the frame.
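The non-overlapping framing performed by block 25 can be sketched as follows (a Python illustration; the names are mine, not the patent's). At 8000 samples per second, 62 frames covering roughly one second of speech implies a frame length of about 129 samples, though the patent does not state the figure.

```python
def split_into_frames(samples, frame_length):
    """Divide sampled speech into non-overlapping, equal-length frames,
    discarding any incomplete final frame."""
    n_frames = len(samples) // frame_length
    return [samples[i * frame_length:(i + 1) * frame_length]
            for i in range(n_frames)]
```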
The parameters fk are then stored in the buffer 16 (not shown in Figure 3) prior to processing by the recognition module 17 (as will now be described).
RECOGNITION MODULE
Figure 3 is a schematic block diagram of a recognition module 17 connected to a phonetic dictionary 18 and a confusability database 19.
In this embodiment the recognition module 17 includes a phone decoder 40 that is arranged to receive parameter frames fk from the buffer 16 (not shown in Figure 3).
The phone decoder 40 is itself connected to an active phone list buffer 42.
The phone decoder 40 processes the frames in a conventional way one at a time to determine the extent to which a portion of an utterance represented by a frame
matches any of the phones or phonemes identified by the system. Processing each of the received parameter frames fk by the phone decoder 40 results in the output of a list of the N phones and phonemes that are most likely representative of the portion of an utterance being processed. This data is output together with data identifying the probability of each of the phones or phonemes being a correct representation given that the latest processed portion of an utterance resulted in the generation of the corresponding parameter frame data fk.
This list and the probability data are then passed to the active phone list buffer 42 where the lists and probability data are stored.
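Selecting the N-best portion of an active phone list record might look like this (an illustrative Python sketch; the real phone decoder scores frames against acoustic models, which is elided here as a plain score dictionary):

```python
def n_best_phones(frame_scores, n):
    """Select the N most likely phones for one parameter frame.

    frame_scores: dict mapping phone label to log probability.
    Returns a list of (phone, log_probability) pairs, most likely first.
    """
    ranked = sorted(frame_scores.items(), key=lambda item: item[1],
                    reverse=True)
    return ranked[:n]
```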
If the phone decoder 40 receives a signal from the buffer 16 (not shown in Figure 3) indicating that the buffer 16 is empty, a signal indicating this fact is passed to the active phone list buffer 42 and stored.
The active phone list buffer 42, in addition to being connected to the phone decoder 40, is also connected to a word decoder 44.
The word decoder 44 is also connected to a word hypothesis store 46, a word sequence memory 48 and an
output module 50. The word decoder is also connected to the phonetic dictionary 18 and the confusability database 19. The word sequence memory 48 is itself also connected directly to the output module 50.
Periodically the word decoder 44 obtains from the active phone list buffer 42 a list of phones or phonemes together with corresponding logarithmic probabilities resulting from the processing by the phone decoder 40 of the next parameter frame fk received from the buffer 16 (not shown in Figure 3). The word decoder 44 then processes the received list and probability data utilising data within the phonetic dictionary 18 and the confusability database 19 to generate data indicating logarithmic probabilities of the portions of the detected speech signal for which parameter frames fk have been processed being representative of a sequence of phones or phonemes corresponding to words identified within the phonetic dictionary 18. As a result of this processing, data identifying the logarithmic probability of the decoded utterance corresponding to identified sequences of words and parts of words is stored in the word sequence memory 48 and word hypothesis store 46 respectively.
Specifically, as will be described in detail later, the word decoder 44 generates and stores within the word hypothesis store 46 and word sequence memory 48, records including: data identifying a sequence of phones or phonemes corresponding to sequences of words and parts of words within the vocabulary of the speech recognition apparatus, data identifying a last phone or phoneme utilised to generate the records, and probability data.
This probability data associated with a record is generated in a number of ways as will now be described.
Whenever a new active phone list record is obtained from the active phone list buffer, the logarithmic probabilities associated with records which identify as a last phone or phoneme a phone or phoneme on the list are all updated. This is achieved by the word decoder 44 incrementing the logarithmic probability associated with each such record by the acoustic logarithmic probability associated with that phone or phoneme within the new active phone list record.
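A minimal sketch of this score update, assuming records are held as plain Python dictionaries (the field names are illustrative, not from the patent):

```python
def update_scores(records, active_list):
    """Increment each record's log probability by the acoustic log
    probability of its last phone, if that phone appears on the new
    active phone list.

    records:     list of dicts with 'last_phone' and 'log_prob' keys.
    active_list: dict mapping phone -> acoustic log probability for
                 the newest parameter frame.
    """
    for record in records:
        phone = record["last_phone"]
        if phone in active_list:
            record["log_prob"] += active_list[phone]
    return records
```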
The word decoder 44 then generates and stores within the word hypothesis store 46 and word sequence memory 48, new records including data identifying as the last phone the remaining phones or phonemes from the active phone list
record. These records are generated representative of sequences of phones for all the valid extensions of sequences of phones or phonemes represented by records already within the word hypothesis store 46 and word sequence memory 48.
Where a phone is both a valid extension of a sequence of phones identified by records in the word hypothesis store 46 and word sequence memory 48 and identified within the list of phones from the active phone list buffer 42 currently being processed, a single new record is generated. For this new record, the logarithmic probability associated with the record comprises a determined value indicative of the likelihood of the sequence of phones represented by the record corresponding to a detected utterance.
Where a phone which is a valid extension of a sequence of phones identified by records within the word hypothesis store 46 and word sequence memory 48 does not appear within the active phone list record currently being considered, a set of new records is generated with one new record for each of the new phones in the active phone list record. For these records, the probability associated with each record is a determined value
indicative of the likelihood of the sequence of phones represented by the record corresponding to a detected utterance, varied by a factor to allow for the misrecognition of the last phone in the sequence as the corresponding phone from the active phone list.
After the new records have all been stored, the word decoder then filters the stored records so that only records which include, as data identifying a last phone, data corresponding to one of the phones or phonemes within the last active phone list processed, and which are also associated with probabilities above a filtering threshold, are retained within the word hypothesis store 46 and word sequence memory 48. The word decoder 44 then proceeds to process the next active phone list record from the active phone list buffer 42.
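The filtering step can be sketched similarly (Python; field and function names are assumptions, and the threshold value would be application dependent):

```python
def filter_records(records, active_phones, threshold):
    """Retain only records whose last phone is on the current active
    phone list and whose log probability clears the filtering
    threshold."""
    return [r for r in records
            if r["last_phone"] in active_phones
            and r["log_prob"] > threshold]
```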
When the word decoder 44 receives a signal from the active phone list buffer 42 indicating that no parameter frames fk are currently stored within the buffer 16 (not shown in Figure 3) this causes the word decoder 44 to send a signal to the output module 50. The output module 50 then retrieves and outputs as word sequence data identifying a sequence of complete words, data stored within the word sequence memory 48 determined to be the
most probable sequence of complete words which corresponds to the detected utterance. Prior to describing in detail the processing of the word decoder 44 in accordance with this embodiment of the present invention, data structures for the data stored within the active phone list buffer 42, the word hypothesis store 46, the phonetic dictionary 18, the confusability database 19 and the word sequence memory 48 will now be described with reference to Figures 4 to 9.
Figure 4 is a schematic block diagram of an exemplary data structure of data stored within the active phone list buffer 42. In this embodiment of the present invention data stored within the active phone list buffer comprises a plurality of records, each comprising time data 51 identifying the parameter frame fk processed by the phone decoder 40 which resulted in the generation of the record, and a set of N pairs of data 52-1 to 52-N each comprising data identifying a phone or phoneme and an associated logarithmic probability indicative of the probability of the portion of an utterance represented by the parameter frame fk being representative of that phone or phoneme.
In accordance with this embodiment of the present invention when a parameter frame fk is processed by the phone decoder 40, the phone decoder 40 determines the logarithmic probability of the received parameter frame fk representing an utterance indicating a particular phone or phoneme in a conventional manner. The phone decoder then determines the N most likely phones and phonemes that the parameter frame fk represents. The phone decoder 40 then outputs as an active phone list record for the parameter frame fk, for storage within the active phone list buffer 42, a record comprising time data 51 identifying the parameter frame fk processed and N pairs of data 52-1 to 52-N identifying a phone or phoneme and associated logarithmic probability data for each of the N most probable matches for the parameter frame fk.
Figure 5 is a schematic block diagram of an exemplary data structure for data stored within the word hypothesis store 46. In this embodiment data within the word hypothesis store 46 comprises a plurality of hypothesis records 53, each record comprising current phone data 54 being data identifying the last phone or phoneme of a sequence of phones or phonemes matched to the processed portions of a detected utterance; a lexicon pointer 55 being a pointer to a node within the phonetic dictionary 18
identifying a sequence of phones or phonemes corresponding to part of a valid word within the vocabulary of the speech recognition system; a prior probability 56 being a calculated logarithmic probability for the matching of a processed portion of an utterance to a part of a word identified by the lexicon pointer 55; an acoustic probability 58 being data identifying a logarithmic probability of the last phone or phoneme of a hypothesised sequence of phones or phonemes being representative of the latest portion of an utterance being processed; and a previous word number 59 being a pointer to word sequence data within the word sequence memory 48.
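The hypothesis record 53 described above maps naturally onto a small structure. A hypothetical Python rendering (the field names are mine, keyed to the reference numerals in the text):

```python
from dataclasses import dataclass

@dataclass
class WordHypothesis:
    current_phone: str         # last matched phone (current phone data 54)
    lexicon_pointer: int       # node number in the phonetic dictionary (55)
    prior_log_prob: float      # log probability of the match so far (56)
    acoustic_log_prob: float   # log probability of the last phone (58)
    previous_word_number: int  # index into the word sequence memory (59)
```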
The word hypothesis records 53 stored within the word hypothesis store 46 provide a means by which records of candidate matches of parts of words to sequences of phones decoded from parameter frames fk received by the phone decoder 40 can be recorded. As will be described in detail later the word decoder 44 utilises active phone list records received from the active phone list buffer 42 to generate and amend data stored within the word hypothesis store 46, so that the data consistently represents the matching of the portions of an utterance processed by the recognition module 17.
Figure 6 is a schematic block diagram of an exemplary data structure for a record within the phonetic dictionary 18.
In this embodiment, the records within the phonetic dictionary 18 define a language model identifying all of the words within the vocabulary of the speech recognition apparatus. In this embodiment of the present invention these records define the words within the vocabulary of the speech recognition apparatus in the form of a tree structure.
This is achieved by each of the records within the phonetic dictionary 18 comprising node number data 60 and a set of transition records 61 and word records 63, with the transition records 61 all containing data identifying pointers to other records within the phonetic dictionary 18.
The transition records 61 and word records 63 also contain data identifying phones and phonemes.
Specifically, in this embodiment, the transition records 61 each comprise next phone data 64 being data identifying a phone or phoneme and a next node number 65 being a pointer to a node number 60 of another record within the phonetic dictionary 18 and a transition
probability 67. The word records 63 in this embodiment each comprise next phone data 68 identifying a phone or phoneme which, when added to the sequence of phones or phonemes represented by the record within the phonetic dictionary 18 of which the word record 63 forms a part, completes a sequence of phones or phonemes representative of a word within the vocabulary of the speech recognition apparatus; word identification data 70 comprising data identifying the word represented by the extended sequence of phones and phonemes; and a transition probability 72.
The word records 63 and transition records 61 within phonetic dictionary records in this embodiment of the present invention are each ordered within the records by phone. The ordering of records 61, 63 within each record in the phonetic dictionary 18 enables faster processing, as whether or not a phone matches any of the phones identified by the records can be determined utilizing a binary search strategy.
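Because the transition records are kept sorted by phone, a standard binary search can locate a candidate extension. A Python sketch using the standard bisect module (the tuple layout of a transition record is an assumption):

```python
import bisect

def find_transition(transitions, phone):
    """Look up a phone among a node's sorted transition records.

    transitions: list of (phone, next_node, log_prob) tuples, sorted by
                 phone, mirroring an ordered dictionary record.
    Returns the matching tuple, or None if the phone is not a valid
    extension at this node.
    """
    i = bisect.bisect_left(transitions, (phone,))
    if i < len(transitions) and transitions[i][0] == phone:
        return transitions[i]
    return None
```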
Figure 7 is a schematic illustration of part of a phonetic dictionary word tree defined by four exemplary records within the phonetic dictionary 18 in accordance with this embodiment of the present invention. In the illustration of Figure 7 each of the records is
illustrated by a circle containing a node number 60 below which extend a number of arrows representing the transition records 61 and word records 63 forming part of the record within the phonetic dictionary 18. The arrows associated with transition records 61 each comprise arrows pointing to another circle containing a node number 60. Each of the arrows representing word records 63 comprises an arrow pointing to a circle containing a word. Each of the arrows within the illustration of Figure 7 has associated with it a probability and a letter
representing a transition probability 67; 72 and next phone data 64; 68, within the transition record 61 or word record respectively.
The exemplary illustration of Figure 7 illustrates a word tree for the words: AN, ANT, AXE, ACT and ACTED. Figure 7 also illustrates how each of the records enables each node number 60 and an item of word identification data 70 to be associated with a sequence of phonemes, being the sequence of phonemes identified by the next phone data 64,68 of the transition records 61 and word records 63 required to reach the node number 60 or word record 63 containing the word identification data 70 from the root node zero. In a similar way the records within the phonetic dictionary 18 enable each of the nodes 60 and
items of word identification data 70 to be associated with probability data, being the sum of the transition probabilities 67; 72 associated with the transition records 61 and word records 63 which identify a path from the origin node zero.
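The association of each node with a phone sequence and a summed probability can be sketched as follows. The node numbers and phones mirror part of the Figure 7 example (A at node 1, then N or K); the probability values and the dictionary layout are illustrative assumptions, not taken from the patent:

```python
# Hypothetical log-domain transition probabilities standing in for P1, P2, P4.
P1, P2, P4 = -1.0, -2.0, -4.0

# Transition records: node number -> list of (next phone, next node, prob).
transitions = {
    0: [("A", 1, P1)],
    1: [("N", 2, P2), ("K", 3, P4)],
}

def path_to(target, node=0, seq=(), prob=0.0):
    """Walk from the root node 0 to `target`, accumulating the phone
    sequence and the sum of transition probabilities along the path."""
    if node == target:
        return seq, prob
    for phone, nxt, p in transitions.get(node, []):
        found = path_to(target, nxt, seq + (phone,), prob + p)
        if found is not None:
            return found
    return None
```

Because the structure is a tree, each node is reached by exactly one path, so the sequence and summed probability are well defined.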
Figure 8 is a schematic block diagram of an exemplary data structure for data stored within the confusibility database 19. In this embodiment, in which the recognition module 17 is arranged to process frame parameters fK to identify the probabilities of portions of an utterance matching any of M phones or phonemes, the confusibility database 19 comprises M x (M-1) items of confusion probability data 75, where each of the items of confusion probability data comprises data indicative of the logarithmic probability of a portion of an utterance representative of a first phoneme being decoded by the phone decoder 40 as a second phoneme.
In this embodiment the data representative of these probabilities is determined by utilising the preprocessor module 15 and phone decoder 40 to process utterances representative of known sequences of phones or phonemes and comparing the known sequences of phones and phonemes with an output sequence of phones resulting from the
processing. For each of the M phones and phonemes which can be recognised by the recognition module 17 the probability of the specific phones or phonemes being misrecognised as any of the other M-1 different phones or phonemes can then be determined. Data representative of all these determined probabilities is then stored within the confusibility database 19.
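This estimation step can be sketched as below, assuming a one-to-one alignment between the known and decoded phone sequences is available (the patent does not specify how the sequences are aligned, and the counts here are invented):

```python
import math
from collections import Counter

def build_confusion(aligned_pairs, phones):
    """Estimate logarithmic confusion probabilities from aligned
    (true phone, decoded phone) pairs, keeping only the M x (M-1)
    off-diagonal misrecognition entries."""
    counts = Counter(aligned_pairs)
    totals = Counter(t for t, _ in aligned_pairs)
    confusion = {}
    for true in phones:
        for decoded in phones:
            if true != decoded and counts[(true, decoded)]:
                confusion[(true, decoded)] = math.log(
                    counts[(true, decoded)] / totals[true])
    return confusion

# Toy alignment: K decoded correctly twice, as N once, as T once.
pairs = [("K", "K"), ("K", "K"), ("K", "N"), ("K", "T")]
confusion = build_confusion(pairs, ["K", "N", "T"])
```

Only misrecognitions are stored, matching the database holding M x (M-1) entries rather than the full M x M matrix.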
Figure 9 is a schematic diagram of an exemplary data structure of data stored within the word sequence memory 48. In this embodiment the word sequence memory 48 is arranged to store a number of word sequence records 80, each identifying words matched to portions of an utterance processed by the recognition module 17. The word sequence records 80 within the word sequence memory 48 each comprise: a word number 82 identifying the record within the word sequence memory 48; a word ID 84 identifying the word represented by the record 80; a path probability 86; previous word data 87; and following words data 88.
As will be described in detail later, the processing of the word decoder 44 is such to generate word sequence records 80 for storage within the word sequence memory 48. When an entire utterance has been processed by the
word decoder 44, the output module 50 then selects from the word sequence memory 48 a sequence of data identifying the most likely sequence of words representative of the utterance utilising the stored word sequence records 80.
PROCESSING OF THE WORD DECODER

The processing of the word decoder 44 will now be described with reference to Figures 10, 11 and 12.
Figure 10 is a flow diagram of the processing of the word decoder 44 in accordance with this embodiment. Initially (S1) the word decoder 44 obtains from the active phone list buffer 42 the active phone record corresponding to the next portion of an utterance to be processed, as identified by the time data 51 of the obtained active phone record. The word decoder 44 then (S2) determines whether the obtained active phone list record is a record identifying that the buffer 16 is empty. If this is the case, this indicates that the end of an utterance has been reached, and the word decoder 44 sends a signal to the output module 50 to cause the output module 50 to output word sequence data (S7), as will be described in detail later.
If the next active phone record obtained from the active phone list buffer 42 is not a record indicating that the buffer 16 is empty, the word decoder 44 then utilises the active phone data 52-1-52-N in the obtained record to update (S3) the word hypothesis records 53 within the word hypothesis store 46.
Specifically, for each of the word hypothesis records 53 in the word hypothesis store 46 including current phone data 54 identifying a phone or phoneme within the active phone list record 51,52-1-52-N, the acoustic probability data 58 for each record is incremented by a value corresponding to the logarithmic probability associated by the active phone list record with the phone or phoneme identified by the current phone data 54 of the word hypothesis record.
Thus, for example, where the word hypothesis store 46 has stored within it a single word hypothesis record 53 containing the following data:

Word Hypothesis Record
Current Phone: A
Lexicon Pointer: 1
Prior Probability: P1
Acoustic Probability: P(a0)
Previous Word Number: 0

where the lexicon pointer 55 is a pointer to record 1 of the records illustrated in Figure 7, if the word decoder 44 received from the active phone list buffer 42 an active phone list record of the following form:

Active Phone List Record
Time: 1
Phone (A): P(a1)
Phone (N): P(n1)
Phone (T): P(t1)

indicating that the portion of an utterance at time t = 1 is most probably representative of phones A, N or T, each with a logarithmic probability of P(a1), P(n1) and P(t1) respectively, the word hypothesis record 53 would, after it had been updated, become:

Current Phone: A
Lexicon Pointer: 1
Prior Probability: P1
Acoustic Probability: P(a0)+P(a1)
Previous Word Number: 0

Thus in this way the hypothesis record 53 is associated with an acoustic probability 58 indicative of the likelihood that the last phone or phoneme of the sequence of phones that it represents is the phone or phoneme actually representative of the portion of an utterance currently being processed. In the above example this is achieved by the acoustic probability data 58, representing the logarithmic probability of a portion of an utterance representing phone A, being increased from P(a0) to P(a0)+P(a1) within the example word hypothesis record 53.
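The update of step S3 can be sketched as follows; the field names and the numeric stand-ins for P(a0), P(a1), P(n1) and P(t1) are illustrative assumptions:

```python
def update_acoustics(hypotheses, active_phones):
    """Step S3 sketch: for every hypothesis whose current phone appears in
    the active phone list record, add that phone's logarithmic
    probability to the hypothesis's accumulated acoustic probability."""
    for hyp in hypotheses:
        phone = hyp["current_phone"]
        if phone in active_phones:
            hyp["acoustic"] += active_phones[phone]

# One hypothesis for phone A with acoustic probability P(a0) = -2.0.
hyp = {"current_phone": "A", "lexicon": 1, "prior": -1.0,
       "acoustic": -2.0, "prev_word": 0}
# Active phone list record: illustrative values for P(a1), P(n1), P(t1).
update_acoustics([hyp], {"A": -0.5, "N": -1.5, "T": -3.0})
```

Because the probabilities are logarithmic, the multiplication of acoustic likelihoods becomes a simple addition.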
Returning to Figure 10, after the acoustic probability data 58 for all of the hypothesis records 53 with current phone data 54 corresponding to a phone within the phone list record being processed has been updated, the word decoder 44 then (S4) determines which of the phones represented by the active phone list record received from the active phone list buffer 42 represent phones or phonemes which are not currently represented by word hypothesis records 53 within the word hypothesis store 46.
Thus, in the case of the above example where there is a single word hypothesis record 53 including current phone data 54 representative of the phone A, and the word decoder 44 receives a current phone list record from the active phone list buffer 42 including data identifying the phones A, N and T, the word decoder 44 would determine that phones N and T were not currently represented by word hypothesis records 53 within the word hypothesis store 46.
After the phones within the phone list record being processed, which are not currently represented by word hypothesis records 53 within the word hypothesis store 46 have been determined, data identifying these phones is stored and the word decoder 44 then (S5) proceeds to generate new hypothesis records 53 for these phones as will now be described with reference to Figures 11 and 12.
Figure 11 is a flow diagram of the processing of the word decoder 44 generating new hypotheses for new phones or phonemes. Initially (S10) the word decoder 44 selects a first hypothesis record 53 from the hypothesis records stored within the word hypothesis store 46. The word decoder 44 then (S11) determines the first item of next phone data 64; 68 for a transition record 61 or word record 63 of the record within the phonetic dictionary 18 having a node number 60 corresponding to the lexicon pointer data 55 of the word hypothesis record 53 currently under consideration.
Thus, for example, in the case of a word hypothesis record 53 including a lexicon pointer 55 to a record within the phonetic dictionary 18 of the following form, corresponding to node 1 in Figure 7:

Phonetic Dictionary Record
Node Number: 1

First Transition Record
Next Phone: N
Next Node: 2
Transition Prob: P2

Second Transition Record
Next Phone: K
Next Node: 3
Transition Prob: P4

First Word Record
Next Phone: N
Next Word: An
Transition Prob: P3

the word decoder 44 would select N as the next phone, being the next phone associated with the first transition record.
The word decoder 44 then (S12) determines whether the identified candidate phone corresponds to one of the phones previously identified and stored (S4) as being a new phone or phoneme present in the active phone list record being processed.
If this is the case the word decoder 44 then proceeds (S15) to generate and store a new word hypothesis record within the word hypothesis store 46 as will be described in detail later with reference to Figure 12.
If, however, the word decoder 44 determines that the phone identified by the transition record 61 or word record 63 currently under consideration does not correspond to one of the new phones previously identified, the word decoder 44 then proceeds to obtain
(S13) data from the confusibility database 19. Specifically the word decoder 44 proceeds to obtain confusion probability data 75 for the logarithmic probabilities of the phone identified as the next phone within the transition record 61 or word record 63 under consideration being mistakenly decoded by the phone decoder 40 as any of the phones or phonemes previously identified as being new phones or phonemes.
Thus, for example, in the case of processing the second transition record in the above example, where the next phone data identifies the phone K against identified new active phones N and T, since neither of these phones corresponds to the next phone K the word decoder 44 would obtain from the confusibility database 19 data for the logarithmic probabilities P(K/N) and P(K/T) that an utterance representative of the phone K had been decoded as a phone N or T.
The word decoder 44 then proceeds to generate new word hypothesis records (S15) which will now be described in detail with reference to Figure 12.
Figure 12 is a flow diagram of the processing of the word decoder 44 for creating and storing new word hypothesis
records within the word hypothesis store 46. Initially (S20) the word decoder 44 determines whether the candidate phone currently under consideration corresponds to next phone data 64,68 from a transition record 61 or word record 63.
If the next phone data 64, 68 is next phone data 64 from a transition record 61, the word decoder 44 then (S21) proceeds to store within the word hypothesis store 46 one or more word hypothesis records 53.
Specifically, if no probability data 75 for misrecognising the candidate phone under consideration has been obtained from the confusibility database 19 (S13), the word decoder 44 proceeds to generate a single word hypothesis record 53 where: the current phone data 54 corresponds to the next phone data 64 of the transition record 61 under consideration; the lexicon pointer 55 corresponds to the next node number 65 of the transition record 61 under consideration; the prior probability 56 corresponds to the sum of the transition probability 67 of the transition record 61 under consideration and the prior probability 56 of the word hypothesis record 53 currently under consideration; the acoustic probability 58 is set equal to the acoustic probability associated with the current phone 54 for the new record by the active phone list record; and the previous word number 59 corresponds to the previous word number 59 of the hypothesis record 53 currently under consideration.
Thus, for example, in the case of the following hypothesis record, when a probability of P(n1) is given for the phone N within the active phone list record:

Hypothesis Record
Current Phone: A
Lexicon Pointer: 1
Prior Probability: P1
Acoustic Probability: P(a0) + P(a1)
Previous Word Number: 0

where the lexicon pointer corresponds to record 1 of the records illustrated in Figure 7, and the following transition record of the phonetic dictionary record identified by node number 1 is processed:

Transition Record
Next Phone: N
Next Node No.: 2
Transition Prob: P2

a new hypothesis record 53 would be generated of the form:

New Hypothesis Record
Current Phone: N
Lexicon Pointer: 2
Prior Probability: P1+P2
Acoustic Probability: P(n1)
Prev. Word No: 0

If, instead, the word decoder 44 determines that confusion probability data 75 was retrieved (S13) when processing the next phone data 64 of the transition record currently under consideration, the word decoder 44 proceeds to generate new word hypothesis records 53 for each of the phones for which confusion probability data 75 was retrieved.
Specifically, the word decoder 44 creates new word hypothesis records 53, where each of the new records comprises: current phone data 54 corresponding to one of the different phones for which confusion probability data 75 was retrieved; acoustic probability data 58 corresponding to the probability associated with that phone by the data 52 within the active phone list record; a lexicon pointer 55 corresponding to the next node number 65 of the transition record 61 under consideration; a previous word number 59 corresponding to the previous word number 59 of the hypothesis record 53 currently under consideration; and prior probability data 56 being the sum of the prior probability data 56 of the word hypothesis record 53 currently under consideration, the transition probability 67 of the transition record currently under consideration and the confusion probability 75 obtained from the confusibility database 19 for the phone represented by the current phone data 54 in the newly generated record.
Thus, for example, in the case of processing the second transition record of the phonetic dictionary record described above, whose next phone data identifies the phone K, against the hypothesis record 53 including a lexicon pointer 55 to the dictionary record corresponding to node 1 as illustrated in Figure 7 and as previously described, if the active phone list were to associate only the phones N and T with probabilities P(n1) and P(t1) respectively, the following new hypothesis records would then be generated:

Current Phone: N
Lexicon Pointer: 3
Prior Probability: P1+P4+P(K/N)
Acoustic Probability: P(n1)
Prev. Word No: 0

Current Phone: T
Lexicon Pointer: 3
Prior Probability: P1+P4+P(K/T)
Acoustic Probability: P(t1)
Prev. Word No: 0

where P(K/N) and P(K/T) are the logarithmic probabilities obtained from the confusibility database 19 for the probabilities that phone K would be misrecognised as an N or a T respectively.
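This branch of step S15 can be sketched as follows; the record layout and the numeric stand-ins for P1, P4, P(K/N) and P(K/T) are illustrative assumptions:

```python
def confusion_hypotheses(hyp, transition, new_phones, active, confusion):
    """Sketch of step S15 when the transition's next phone is absent from
    the active list: one new hypothesis per newly active phone, with the
    confusion probability folded into the prior probability."""
    next_phone, next_node, trans_prob = transition
    records = []
    for phone in new_phones:
        records.append({
            "current_phone": phone,
            "lexicon": next_node,
            "prior": hyp["prior"] + trans_prob + confusion[(next_phone, phone)],
            "acoustic": active[phone],
            "prev_word": hyp["prev_word"],
        })
    return records

# Second transition record of node 1: next phone K, next node 3, prob P4.
hyp = {"prior": -1.0, "prev_word": 0}                 # prior = P1
new = confusion_hypotheses(hyp, ("K", 3, -4.0), ["N", "T"],
                           {"N": -1.5, "T": -3.0},
                           {("K", "N"): -0.5, ("K", "T"): -2.0})
```

Note that the confusion penalty is added to the prior, while the acoustic probability is taken directly from the active phone list entry for the substituted phone.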
Returning to Figure 12, if the word decoder 44 determines (S20) that the record currently being processed comprises a word record 63, the word decoder 44 first proceeds to generate (S22) a new word sequence record 80 for storage
within the word sequence memory 48. Specifically, the word decoder 44 generates a new word sequence record 80 comprising the next available word number 82, a word ID 84 corresponding to the word ID 70 of the word record 63 currently under consideration, a path probability 86 comprising the sum of the prior probability 56 for the hypothesis record 53 currently under consideration, the transition probability 72 of the word record 63 currently under consideration and the path probability 86 of the word sequence record 80 having a word number 82 corresponding to the previous word number 59 for the word hypothesis record 53 currently under consideration; previous word data 87 corresponding to the previous word number 59 of the word hypothesis record 53 currently under consideration and following words data 88 set to zero identifying the new record 80 as the final word in a word sequence.
The word decoder 44 then amends the word sequence record 80 having a word number 82 corresponding to the previous word number 59 of the word hypothesis record 53 currently under consideration, by incrementing the following words data 88 in that record 80 by one, to indicate that that word sequence record 80 no longer identifies the last word in a word sequence.
Thus, for example, when processing the following word hypothesis record 53:

Hypothesis Record
Current Phone: A
Lexicon Pointer: 1
Prior Probability: P1
Acoustic Probability: P(a0) + P(a1)
Previous Word Number: 0

where the word sequence record 80 with word number zero is associated with a path probability equal to zero, against a word record 63 such as:

Word Record
Next Phone: N
Next Word: An
Transition Prob: P3

the following word sequence record 80 would be generated, if the next available word number was 1:

Word Sequence Record
Word Number: 1
Word ID: An
Path Probability: P1 + P3
Previous Word: 0
Following Words: 0

and the word sequence record 80 associated with word number zero would be updated to increase the following words data 88 for that record 80 by one.
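Step S22 and the accompanying bookkeeping can be sketched as below; the field names and the numeric values standing in for P1 and P3 are illustrative:

```python
def add_word(sequences, hyp, word_record):
    """Sketch of step S22: append a new word sequence record and bump the
    following-words count of its predecessor record."""
    _phone, word_id, trans_prob = word_record
    prev = hyp["prev_word"]
    rec = {
        "word_no": len(sequences),          # next available word number
        "word_id": word_id,
        "path_prob": hyp["prior"] + trans_prob + sequences[prev]["path_prob"],
        "prev_word": prev,
        "following": 0,                     # currently the last word in its sequence
    }
    sequences.append(rec)
    sequences[prev]["following"] += 1       # predecessor no longer final
    return rec

# Origin record (word number 0) with zero path probability.
seqs = [{"word_no": 0, "word_id": None, "path_prob": 0.0,
         "prev_word": None, "following": 0}]
hyp = {"prior": -1.0, "prev_word": 0}       # prior = P1, illustrative
rec = add_word(seqs, hyp, ("N", "An", -3.0))  # word record with prob P3
```

The `following` counter is what later lets unreferenced final records be garbage-collected.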
After word sequence data has been generated and stored within the word sequence memory 48 the word decoder 44 then (S23) proceeds to generate and store within the word hypothesis store 46 one or more word hypotheses records 53 containing previous word number data 59 corresponding to the word number 82 of the newly generated word sequence record 80.
Specifically, if the next phone data 68 of the word record 63 currently under consideration did not result in the retrieval of confusion probability data 75 (S13), a single new word hypothesis record 53 is generated comprising current phone data 54 corresponding to the next phone data 68 of the word record 63 under
consideration, a lexicon pointer 55 pointing to the origin node within the phonetic dictionary 18, a prior probability 56 set to zero, an acoustic probability 58 corresponding to the acoustic probability of the phone identified as the current phone 54 of the newly generated record as identified by the active phone list record being processed and a previous word number 59 corresponding to the word number 82 of the newly generated word sequence record 80.
If the next phone data 68 of the word record 63 currently under consideration did result (S13) in the retrieval of data from the confusibility database 19, the word decoder 44 proceeds to generate, for each of the new phones previously identified (S4), word hypothesis records 53 comprising: current phone data 54 identifying one of the new phones; a lexicon pointer 55 pointing to the origin record within the phonetic dictionary 18; a prior probability 56 comprising data corresponding to the confusion probability 75 obtained from the confusibility database 19 for the logarithmic probability that the next phone 68 identified within the word record 63 currently under consideration might be mis-recognised as the phone identified as the current phone 54 of the record being generated; an acoustic probability 58 corresponding to
the acoustic probability associated with the current phone 54 for the newly generated record by the active phone list obtained from the active phone list buffer 42; and a previous word number 59 corresponding to the word number 82 of the most recently generated word sequence record 80 stored within the word sequence memory 48.
Returning to Figure 11, after new word hypothesis records 53 have been stored (S15) in the word hypothesis store 46 for the transition record 61 or word record 63 currently under consideration, the word decoder 44 determines (S16) whether the transition record 61 or word record 63 currently being considered is the last transition record 61 or word record 63 in the phonetic dictionary record identified by the lexicon pointer 55 of the word hypothesis record 53 being processed. If this is not the case the word decoder 44 proceeds to generate new hypothesis records 53 utilising the next transition record 61 or word record 63 within the phonetic dictionary record identified by the word hypothesis record 53 (S12-S15).
If the word decoder 44 determines that the transition record 61 or word record 63 currently under consideration is the last transition record 61 or word record 63 within
the record in the phonetic dictionary 18 identified by the word hypothesis record 53 currently under consideration, the word decoder 44 determines (S18) whether the hypothesis record 53 under consideration is the last hypothesis record 53 of the hypotheses records stored within the word hypothesis memory 46 when the acoustic probabilities for the hypotheses records were previously updated (S3). If this is not the case, the word decoder 44 selects (S19) the next hypothesis record 53 which was in existence when the acoustic probabilities were updated (S3) and processes that word hypothesis record 53 to generate further new word hypothesis records (S11-S18).
Thus in this way all of the transition records 61 and word records 63 in phonetic dictionary records having node numbers 60 corresponding to lexicon pointers 55 of all of the word hypothesis records 53 stored within the word hypothesis memory 46 when the acoustic probabilities for the word hypothesis records were updated (S3) are utilised to generate new word hypothesis records which are then stored within the hypothesis store 46.
Thus at this stage the word hypothesis store 46 has stored within it word hypothesis records 53 for which the
acoustic probability has been updated (S3); word hypothesis records 53 having current phone data 54 which does not correspond to any of the phones within the latest active phone list record which has been processed; and new word hypothesis records 53 for the new phones within the active phone list received from the active phone list buffer 42 (S5).
Thus, for example, in the case of a single hypothesis record including a lexicon pointer to the record within the phonetic dictionary 18 shown as node 1 in Figure 7 being processed against an active phone list assigning phones A and N acoustic probabilities of P(a1) and P(n1) respectively, where the hypothesis record includes a previous word number 59 identifying the origin record in the word sequence memory 48, the following hypothesis records would at this stage be stored within the word hypothesis store 46:
Current Phone:         A              N       N       N
Lexicon Pointer:       1              2       0       3
Prior Probability:     P1             P1+P2   0       P1+P4+P(K/N)
Acoustic Probability:  P(a0)+P(a1)    P(n1)   P(n1)   P(n1)
Prev. Word No:         0              0       1       0
Phone Sequence:        a..            an..    an/..   ak..
where the phone sequence corresponds to the sequence of phones matched to a portion of an utterance implicitly identified by the corresponding word hypothesis record 53. Similarly the word sequence records 80 stored within the word sequence memory 48 would comprise the following two word sequence records 80, word sequence record number 0 being the origin word sequence record and word sequence record 1 being a word sequence record 80 identifying the word 'An' as having been recognised.
Word No.:          0     1
Word ID:           -     An
Path Probability:  0     P1+P3
Prev No.:          -     0
Following:         1     0

Returning to Figure 10, after new word hypothesis records 53 have been generated and stored within the word hypothesis store 46, the word decoder 44 then (S6) proceeds to filter the word hypothesis records 53 stored
in the word hypothesis store. Initially this is achieved by deleting all the word hypothesis records 53 having current phone data 54 which does not correspond to any of the phones identified by the latest active phone list record processed by the word decoder 44. The word decoder 44 then proceeds to determine for each of the remaining word hypothesis records 53 a probability identifying the likelihood of the phone sequence represented by the hypothesis record 53 and associated word sequence records 80 being the correct representation of the portion of an utterance decoded so far.
This is achieved by the word decoder 44 calculating for each hypothesis record 53 within the word hypothesis store 46 the sum of the prior probability data 56 for the hypothesis record 53, the acoustic probability 58 for the hypothesis record 53 and the path probability 86 of the word sequence record 80 having a word number 82 corresponding to the previous word number 59 of the hypothesis record 53 under consideration.
Thus, for example, for the phone sequences represented by the above hypothesis records 53 and word sequence records 80, each of the phone sequences would be associated with the following probabilities:

Phone Sequence   Path Prob.  +  Prior Prob.    +  Acoustic Prob.
a..              0           +  P1             +  P(a0)+P(a1)
an..             0           +  P1+P2          +  P(n1)
an/..            P1+P3       +  0              +  P(n1)
ak..             0           +  P1+P4+P(K/N)   +  P(n1)

As will be appreciated, the path probability associated with the word sequence record 80 identified by the previous word number 59 of a hypothesis record 53 identifies a logarithmic probability of the words identified within a sequence of phones matched to an utterance being correctly matched, in terms of a sum of the transition probabilities 67, 72 associated with the transition records 61 and word records 63 corresponding to the sequences of phones identifying the words forming the word sequence, and probabilities indicative of the assumption of errors arising in decoding to match portions of an utterance to words.
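The score used to filter each hypothesis, described above, can be sketched as follows; the record layout and the numeric log-probability values are illustrative assumptions:

```python
def total_score(hyp, sequences):
    """The filtering score of a hypothesis: the path probability of its
    predecessor word sequence record plus the hypothesis's own prior and
    acoustic logarithmic probabilities."""
    return (sequences[hyp["prev_word"]]["path_prob"]
            + hyp["prior"] + hyp["acoustic"])

# Word sequence records: the origin (0) and one recognised word (1).
sequences = {0: {"path_prob": 0.0}, 1: {"path_prob": -4.0}}
# A hypothesis continuing from word sequence record 1.
hyp = {"prev_word": 1, "prior": 0.0, "acoustic": -1.5}
```

Because all three terms are logarithmic, the sum corresponds to the product of the underlying probabilities.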
The path probability can therefore be made representative of a probability of a sequence of words matching an utterance by setting the transition probabilities 67, 72 in the phonetic dictionary records in the phonetic dictionary 18 so that the sum of the probabilities
associated with each sequence of the phones and phonemes representative of a word equals the probability of the word identified by word data 70 in a word record 63 occurring within speech.
In this embodiment, these transition probabilities 67, 72 are set so that all the transition probabilities are equal for the longest sequence of phones or phonemes corresponding to a word within the vocabulary defined by the phonetic dictionary 18. The remaining probabilities are then set so that the probabilities associated with shorter sequences of phones and phonemes forming other valid extensions are also equal, whilst the sum of the probabilities also corresponds to the probability of the words corresponding to the sequences of phones.
Thus for example in the case of the word tree of Figure 7 where Acted is the longest sequence of phones corresponding to a word the probabilities associated with each of the word records 63 and transition records 61
would be selected so that:

P1 = P4 = P7 = P9
P1 + P4 + P7 + P9 = P(Acted)
P2 = P5
P1 + P2 + P5 = P(Ant)
P3 = P(An) - P1
P6 = P(Axe) - P1 - P4
P8 = P(Act) - P1 - P4

where P(Word) is the probability of a "word" occurring in speech to be detected by the speech apparatus, as identified by, for example, data from the British National Corpus (BNC) database. In this embodiment, where the word probabilities P(W) are very small a higher default value is used instead to ensure that words which are rare in one context but common in another are not unduly penalised. In other embodiments, the word probabilities used could be made topic dependent to be representative of words being uttered in the context in which the speech recognition apparatus is to be used.
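Numerically, the spreading rule can be sketched as below for the words ACTED and AN from Figure 7; the word probabilities are invented for illustration, and the four-record path for ACTED follows the relations above:

```python
import math

# Hypothetical word log probabilities (e.g. from a corpus such as the BNC).
P_acted = math.log(0.0001)
P_an = math.log(0.002)

# Equal share per record along the longest word's four records,
# giving P1 = P4 = P7 = P9.
share = P_acted / 4
P1 = share

# Word record probability for the shorter word 'An', whose path shares
# the transition P1 for the phone A: P1 + P3 = P(An).
P3 = P_an - P1
```

Spreading the word probability along the path keeps partial-word and complete-word hypotheses on a comparable scale, which is what permits the simple threshold-based pruning described below.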
This manner of spreading the probability data across a sequence of phones enables the word hypothesis records 53 corresponding to possible matches to sequences of phones to be pruned more efficiently. Most words have a relatively low probability of occurring within speech. If probabilities associated with sequences of phones were only to be varied by a value indicative of the word occurring when a complete word had been detected, this would mean probabilities associated with sequences of
phones for complete and partial words would vary considerably. So that sequences corresponding to complete words were not then rejected when pruning, any algorithm utilizing the associated probabilities would need to allow for this variation. This would result in a number of partial words associated with relatively low probabilities being retained for future processing.
In contrast, in accordance with the present invention, whenever a new phone is encountered by the word decoder 44, probabilities are assigned to word records 53 in a way which utilizes word frequency data but is substantially independent of whether a sequence of phones corresponds to a complete or partial word. Thus no allowance needs to be made for wide variations in associated probabilities arising between sequences for complete and partial words. A simple filter based only on the associated probabilities can therefore be used as will be described in detail later.
Considering the prior probability data 56 of hypothesis records 53, in this embodiment this data comprises data identifying a similar calculated logarithmic probability of the portion of an identified sequence of phones corresponding to a partial word as identified by a
pointer to the phonetic dictionary 18 having been correctly matched. In both the case of the path probabilities and the prior probabilities, where a phone sequence is matched on the basis of an assumption that a phone has been incorrectly decoded, as is the case with the phone sequence ak.. in the example above, the probability associated with the sequence is then adjusted by a factor, in that case P(K/N), to account for the assumption that an error has occurred.
Finally, considering the acoustic probability 58 this comprises data identifying the probability that the latest phone matched to a portion of an utterance is actually representative of that phone.
The sum of the path probability 86, prior probability 56 and acoustic probability 58 associated with a word hypothesis record 53 therefore corresponds to a value indicative of the likelihood that the sequence of phones implicitly identified by the record is in fact representative of a decoded part of an utterance.
After a probability has been determined for each of the word hypothesis records 53, the word decoder 44 then deletes all the word hypothesis records 53 having a total
probability less than a certain threshold.
In this embodiment the threshold utilised to select word hypothesis records 53 for deletion is selected to be equal to the lesser of either the sixteenth greatest probability score of all the hypothesis records 53 in the hypothesis store 46 or four times the highest probability score associated with a hypothesis record 53 stored in the hypothesis store 46. Thus, in this way, when a large number of hypothesis records 53 are associated with very similar probability scores, a large number of records are retained for further processing, whilst when the scores have a broader range a limit of sixteen records is set.
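This pruning rule can be sketched as follows, assuming scores are logarithmic (hence negative) values; the sample scores are invented for illustration:

```python
def prune(scores, keep=16, margin_factor=4.0):
    """Retain scores at or above the lesser of the keep-th greatest score
    and margin_factor times the best (logarithmic, hence negative)
    score, mirroring the two-part threshold described above."""
    ordered = sorted(scores, reverse=True)
    kth = ordered[min(keep, len(ordered)) - 1]
    threshold = min(kth, margin_factor * ordered[0])
    return [s for s in scores if s >= threshold]

# Tightly clustered scores: the margin term dominates, all are retained.
clustered = [-1.0, -1.1, -1.2]
# Widely spread scores: the sixteen-record limit dominates.
spread_out = [-1.0] + [-10.0 - i for i in range(19)]
```

With clustered scores the 4x margin lies far below every score, so nothing is pruned; with a broad spread the sixteenth-greatest score becomes the binding constraint.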
After the word hypothesis records 53 associated with probability scores below the threshold have been deleted, the word decoder 44 then proceeds to delete word sequence records 80 which are no longer required.
Specifically, in this embodiment the word decoder 44 identifies each of the word sequence records 80 within the word sequence memory 48 including following words data 88 equal to zero indicating the word is the last full word in a sequence of words. For each of these word
sequence records 80, the word decoder 44 then determines whether at least one of the word hypothesis records 53 stored within the word hypothesis store 46 includes previous word number data 59 corresponding to the word number 82 of a word sequence record 80 having following words data 88 equal to zero. If no such record can be identified amongst the hypothesis records 53 stored within the word hypothesis memory 46, the word sequence record 80 is deleted, and the following words data 88 of the word sequence record 80 identified by the previous word data 87 of the deleted word sequence record 80 is then reduced by one, to indicate that the deleted word sequence record 80 has been removed from the word sequence memory 48.
Thus, in this way, the only word sequence records 80 retained within the word sequence memory 48 with following words data 88 indicating no words follow them are those referred to by the hypothesis records 53 in the word hypothesis store 46. The word decoder 44 then proceeds to process the next active phone list received from the active phone list buffer 42 (S1-S6).
When (S2) the word decoder 44 receives from the active phone list buffer 42 a signal indicating the end of an
utterance has been reached, the word decoder 44 then selects all the word hypothesis records 53 in the word hypothesis store 46, with lexicon pointers 55 pointing to the origin node within the phonetic dictionary 18, and passes these records to the output module 50.
The output module 50 then determines which of the word sequence records 80 having a word number 82 corresponding to the previous word numbers 59 received from the word decoder 44 is the most likely match for a processed utterance. This is achieved by the output module 50 initially determining for each of the word hypothesis records 53 the sum of the prior probability 56 of the record 53, the acoustic probability 58 of the record 53 and the path probability 86 of the word sequence record 80 identified by the previous word number 59 of the record 53, and then selecting the word sequence record 80 identified by the previous word number 59 of the word hypothesis record 53 for which the highest value is determined.
The output module 50 then stores the word identification data 84 for this word sequence record 80 as part of a sequence of words to be output and then selects the word sequence record 80 having a word number 82 corresponding
to the previous word data 87 of the word sequence record 80 currently under consideration. This process is then repeated for the newly selected word sequence record 80, with each successive item of word identification data being appended to the front of the stored list of words until the origin word sequence record, indicated by null previous word data 87, is reached. The output module 50 then outputs as word sequence data the stored list of word identification data, being data identifying a list of complete words which has been determined as the most likely list of words corresponding to the detected utterance. This list of words can then be utilised by the rest of the apparatus incorporating the speech recognition system.
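The selection and backtracking procedure performed by the output module can be sketched as below. The dictionary-based records and field names are illustrative assumptions, with the origin record modelled as one whose word identification data is null.

```python
def backtrace_best(word_sequences, final_hypotheses):
    """Pick the best final hypothesis (prior + acoustic + path score of its
    predecessor sequence), then walk the previous-word links back towards
    the origin, prepending each word ID to the output list."""
    def score(h):
        seq = word_sequences[h["previous_word"]]
        return h["prior"] + h["acoustic"] + seq["path"]

    best = max(final_hypotheses, key=score)
    words = []
    num = best["previous_word"]
    while num is not None:
        rec = word_sequences[num]
        if rec["word_id"] is None:        # origin record: null previous word data
            break
        words.insert(0, rec["word_id"])   # append to the front of the list
        num = rec["previous_word"]
    return words
```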
ALTERNATIVES AND AMENDMENTS

Although in the above described embodiment, a speech recognition system has been described in which erroneous recognition of one phone or phoneme for another is accounted for by the provision of a confusibility database 19, the present invention is also equally applicable to speech recognition systems able to account for erroneous insertion or omission of phones or phonemes
during decoding.
In the above described embodiment, whenever a portion of an utterance is matched to part of an allowable sequence of phones or phonemes on the assumption an error has occurred, a logarithmic probability associated with the match is varied by a factor to account for the assumption. Specifically, for each assumption the logarithmic probability is increased by a factor representative of the logarithmic probability of the assumed error occurring. Similar factors may be utilized to determine probabilities for matches on the assumption that decoded phones are erroneously omitted or erroneously inserted.
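The log-domain adjustment described here can be written compactly; the helper name and the use of natural logarithms are assumptions for illustration only.

```python
import math

def penalised_score(match_log_prob, error_prob):
    """Adjust a match's logarithmic probability by the log probability of the
    assumed decoding error (substitution, insertion or omission); since the
    error probability is at most 1, the adjustment can only lower the score."""
    return match_log_prob + math.log(error_prob)
```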
In the case of the erroneous insertion of phones or phonemes, whenever a new word hypothesis record or set of records is to be generated (S15), for each record created in the above described embodiment, a set of records could also be generated to enable the system to cope with erroneous insertion of phones and phonemes.
In such a system the extra records would comprise current phone data 54 and acoustic probability data 58 corresponding to the current phone data 54 and acoustic
probability data 58 for each of the new phones identified within the active phone list being processed, and lexicon pointer data 55 and previous word number data 59 corresponding to the lexicon pointer data 55 and previous word number data 59 in the word hypothesis record 53 being processed. The prior probability data 56 would then be set equal to the prior probability data of the word hypothesis record 53 being processed, to which is added an insertion probability factor to account for the assumption of the erroneous insertion of the phone or phoneme identified by the current phone data 54 of the new record occurring. This factor could be obtained by selecting appropriate data from additional data stored in the confusibility database 19 for this purpose.

Additional word hypothesis records 53 could also be generated to enable the system to cope with erroneous omission of phones. These records could be generated (S15) at the same time as the new word hypothesis records are generated by additionally processing each of the transition records 61 and word records 63 of phonetic dictionary records identified by node numbers 60, corresponding to next node numbers 65 of transition
records 61 selected for processing by the word decoder 44.
For each of these identified transition records 61 and word records 63, new word hypothesis records 53 and word sequence records 80 would then be generated in the same way as has previously been described, except that to each of the prior probabilities 56 of the extra word hypothesis records generated a factor to account for the assumption of an erroneous omission occurring is added. This factor could either be a fixed value for all of the extra records, or alternatively, extra data could be stored in the confusibility database 19 for each phone or phoneme identifying a value to be used.
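The generation of extra records for assumed insertions, described a few paragraphs above, might be sketched like this. All field names and the shape of the penalty table are hypothetical.

```python
def insertion_hypotheses(base_hyp, active_phones, acoustic, insert_log_prob):
    """For each newly decoded phone, create an extra hypothesis that keeps
    the base hypothesis's lexicon position and previous word (the inserted
    phone does not advance through the dictionary), takes the inserted
    phone's acoustic score, and adds an insertion penalty to the prior."""
    return [{
        "current_phone": p,
        "acoustic": acoustic[p],
        "lexicon_pointer": base_hyp["lexicon_pointer"],
        "previous_word": base_hyp["previous_word"],
        "prior": base_hyp["prior"] + insert_log_prob[p],
    } for p in active_phones]
```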
In other embodiments, further new records could be generated to account for erroneous omission of two or more phones by generating records identified by other transition records 61 and word records 63 selected utilizing word hypothesis records generated on the assumption of a single omission of a phone or phoneme.
Where extra data is stored in the confusibility database 19 indicative of probabilities of erroneous omission or insertion of phones and phonemes, the data utilised could be generated by processing known sequences of phones and phonemes to determine appropriate values in a similar way to that described for generating data indicative of
erroneous decoding of phones and phonemes.

Although in the above described embodiment, a speech recognition system has been described in which each portion of an utterance is processed independently, it will be appreciated that many decoding errors are context dependent. Thus, for example, decoding of a silence phoneme before a t phoneme is very likely, as the first part of a t phoneme is silent.
Thus in other embodiments of the present invention the variation in probabilities assigned on the basis of the assumption of errors could be made dependent upon the context in which such assumed errors are taken to have occurred.
The way in which context information may be obtained and utilized could be of a number of different forms. Where certain errors are prone to arise within particular words, the calculation of factors for varying probabilities could be made dependent both upon the error in decoding assumed to be occurring and the node numbers of phonetic dictionary records corresponding to preidentified problem words. Alternatively, in a speech recognition system matching groups of phonemes, the
identity of the groups of phonemes could be utilized to vary factors for amending probabilities to account for the assumption of errors occurring during decoding.
Although in the above embodiment word hypothesis records 53 have been described in which acoustic probability data 58 is stored as part of each record 53, it will be appreciated that in other embodiments this data could be stored separately. Specifically, acoustic probability data 58 could be stored for all phones identifying the probability of a processed portion of an utterance being representative of a phone. Probability values associated with a word hypothesis record 53 could then be determined at any time by utilizing the acoustic probability 58 associated with the phone corresponding to the current phone data 54 within the record 53. Thus in this way repeated storage of the same acoustic probability data 58 could be avoided.
Similarly, although in the above embodiment, the prior probability data 56 associated with a new hypothesis record 53 is set to zero, it will be appreciated that by setting the prior probability data 56 of a new record equal to the path probability 86 for the sequence of words preceding a new record, the need for separate
storage of path probability data could be avoided. Probabilities associated with word hypotheses could then be calculated by summing the prior probability data 56 and acoustic probability data 58 associated with a word hypothesis record 53.
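Under this alternative storage scheme, a hypothesis score would be computed on demand from a shared per-phone acoustic table and a prior that already folds in the path probability. A minimal sketch, with invented field names:

```python
def hypothesis_score(hypothesis, acoustic_table):
    """Acoustic log probabilities live in one shared per-phone table rather
    than in each record; the record's prior already includes the preceding
    sequence's path probability, so the score is a two-term sum."""
    return hypothesis["prior"] + acoustic_table[hypothesis["current_phone"]]
```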
In the above embodiment, phonetic dictionary records are described comprising word records 63 and transition records 61 each comprising next phone data 64,68 identifying the final phone in a sequence of phones. In other embodiments the word records 63 for complete words could be incorporated as part of the phonetic dictionary records 18 representative of sequences of phones corresponding to complete words. In such an embodiment, the word records 63 would comprise null next phone data 68 identifying the record as a word record 63, a word ID 70 and a transition probability 72 being the difference between the probability associated with the phonetic dictionary record in question and the word identified by the word ID 70.
The advantage of such an alternative data storage system would be that utilizing such a system all processing of data relating to an individual phone could be completed separately from the processing required to generate new
word sequence records. Specifically, new word hypothesis records 53 for a phone could first be generated utilizing data from the confusibility database 19 if necessary.
Where any word records 63 were included in the phonetic dictionary record identified by a new word hypothesis, the new word hypothesis record 53 could then be utilized when processing the word records 63. Thus in this way repeated access of the confusibility database 19 could be avoided.
In the above described embodiment, the confusibility data obtained from the confusibility database 19 is determined independently of the acoustic probability for phones determined by the phone decoder 40. However, it will be appreciated that the acoustic probabilities assigned to phones provide further information about the likelihood of particular decoding errors occurring. Thus, in an alternative embodiment, data for varying probabilities on the basis of the assumption of errors in decoding could be based on a calculated value determined utilising the generated acoustic probability data and stored functions associated with the error assumed to have occurred.
Although in the above embodiment a preprocessor is described designed to extract "formant" related information, it will be appreciated that any suitable processing of an input signal to reduce the amount of data to be processed whilst extracting information relevant to speech recognition could be used. Thus, for example, a preprocessor might in other embodiments be arranged to extract linear prediction coefficients or Mel-Cepstrum coefficients which were then processed to identify likely phones.
Furthermore, although in the above described embodiment division of a speech signal into non-overlapping frames is described, in other embodiments frames to be processed might overlap slightly so as to give portions of the signal a certain amount of additional contextual information.
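Overlapping framing as suggested here can be sketched in a few lines; the frame length and overlap values are illustrative, not taken from the embodiment.

```python
def frames(samples, frame_len, overlap):
    """Split a signal into frames of `frame_len` samples that overlap by
    `overlap` samples, so each portion shares context with its neighbours."""
    step = frame_len - overlap
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, step)]
```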
Although the embodiments of the invention described with reference to the drawings comprise computer apparatus and processes performed in computer apparatus, the invention also extends to computer programs, particularly computer programs on or in a carrier, adapted for putting the invention into practice. The program may be in the form of source or object code or in any other form suitable for use in the implementation of the processes according to the invention. The carrier may be any entity or device capable of carrying the program.
For example, the carrier may comprise a storage medium, such as a ROM, for example a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example a floppy disc or hard disk. Further, the carrier may be a transmissible carrier such as an electrical or optical signal which may be conveyed via electrical or optical cable or by radio or other means.
When a program is embodied in a signal which may be conveyed directly by a cable or other device or means, the carrier may be constituted by such cable or other device or means.
Alternatively, the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted for performing, or for use in the performance of, the relevant processes.

Claims (35)

1. A speech recognition apparatus comprising: receiving means for receiving a signal indicative of an utterance; decoding means for processing a signal received by said receiving means to determine for portions of said utterance, one or more phones said portions of said utterance are likely to represent, thereby matching said utterance with one or more sequences of phones; determining means for determining probability data indicative of the likelihood of a number of predetermined errors occurring in the matching of an utterance with one or more sequences of phones; and selection means for selecting a number of sequences of phones as being representative of an utterance utilizing said probability data.
2. An apparatus in accordance with claim 1, further comprising data storage means for storing data defining sequences of phones corresponding to words and parts of words within the vocabulary of the apparatus wherein said selection means is arranged to select sequences of phones being representative of an utterance corresponding to data stored in said data storage means.
3. An apparatus in accordance with claim 2, wherein said determining means is arranged to determine for sequences of phones corresponding to data stored in said data storage means, differences between said sequences and said one or more sequences of phones matched to an utterance by said decoding means and to determine for said sequences of phones, data indicative of the probability of errors occurring in the matching of an utterance such that an utterance representative of a said sequence of phones would be matched to a sequence of phones to which said decoding means matches an utterance.
4. An apparatus in accordance with any preceding claim, wherein said determining means is arranged to determine probability data indicative of a number of predetermined errors including at least one of the omission of phones in a sequence of phones, the insertion of phones in a sequence of phones or the erroneous substitution of specified phones for other specified phones in a sequence of phones occurring in the matching of an utterance with one or more sequences of phones.
5. An apparatus in accordance with claim 3 or claim 4, wherein said determining means comprises database means
for storing data indicative of the likelihood of each of a number of predetermined errors occurring in association with data identifying said errors, wherein said determining means is arranged to determine said probability data utilizing data stored in said database means.
6. An apparatus in accordance with claim 5, wherein said data stored in said database means comprises data determined by comparing the matching of utterances representative of known sequences of phones to sequences of phones matched to said utterances by a said decoding means.
7. An apparatus in accordance with any of claims 2 to 6, wherein said data storage means is arranged to associate each of said sequences of phones corresponding to words and parts of words within the vocabulary of the apparatus with occurrence data indicative of predetermined likelihoods of said words and parts of words occurring within utterances, wherein said selection means is arranged to select sequences of phones as being representative of an utterance utilizing said probability data and said occurrence data.
8. An apparatus in accordance with claim 7, wherein said data storage means has stored therein data associating at least some of said sequences of phones corresponding to complete words in the vocabulary of said apparatus with occurrence data indicative of the frequency of said words occurring in speech.
9. Apparatus in accordance with claim 7, wherein said data storage means has stored therein data associating a default value with at least some of said sequences of phones corresponding to complete words in the vocabulary of said apparatus occurring within speech with less than a threshold frequency.
10. An apparatus in accordance with any of claims 7 to 9, wherein said data storage means has stored therein occurrence data such that for each sequence of phones corresponding to concatenations of phones representative of increasing portions of the same word, the occurrence data varies approximately in proportion to the number of phones in a sequence.
11. An apparatus in accordance with any of claims 7 to 10, wherein said data storage means is arranged to store data indicative of logarithmic probabilities of words and
parts of words occurring within utterances and said selection means is arranged to select a number of sequences of phones as being representative of an utterance utilizing a determined sum of said occurrence data for a sequence of phones and determined probability data representative of logarithmic probabilities of predetermined errors occurring in the matching of an utterance with one or more sequences of phones.
12. An apparatus in accordance with any preceding claim, wherein said decoding means is arranged to determine for each consecutive portion of an utterance acoustic probability data indicative of the probability of each of said one or more determined phones being representative of that processed portion of an utterance, said selection means being arranged to select sequences of phones as being representative of an utterance utilizing said acoustic probability data for the final processed portion of an utterance.
13. An apparatus in accordance with any preceding claim, wherein said decoding means is arranged to match portions of an utterance to sequences of phones comprising sequences of phones for earlier portions of an utterance selected by said selection means as being representative
of said earlier portions of an utterance to which said one or more phones are concatenated.
14. An apparatus in accordance with any preceding claim, further comprising: means for associating each of said sequences of phones corresponding to words within the vocabulary of said apparatus to word data; and output means for outputting sequences of items of word data indicative of a sequence of words wherein said output means is arranged to output items of word data associated with sequences of phones corresponding to sequences of phones selected by said selection means.
15. A method of selecting a sequence of phones as being representative of an utterance comprising the steps of: receiving a signal indicative of an utterance; processing a received signal to determine for portions of said utterance, one or more phones said portions of said utterance are likely to represent, thereby matching said utterance with one or more sequences of phones; determining probability data indicative of the likelihood of a number of predetermined errors occurring in the matching of an utterance with one or more
sequences of phones; and selecting a number of sequences of phones as being representative of an utterance utilizing said probability data.
16. A method in accordance with claim 15, further comprising the steps of: storing vocabulary data defining sequences of phones corresponding to words and parts of words in a vocabulary wherein said selection step comprises selecting sequences of phones being representative of an utterance corresponding to stored vocabulary data.
17. A method in accordance with claim 16, further comprising the steps of: determining for sequences of phones corresponding to stored vocabulary data differences between said sequences and said one or more sequences of phones matched to an utterance; and determining for said sequences of phones, data indicative of the probability of errors occurring in the matching of an utterance such that an utterance representative of a sequence of phones corresponding to stored vocabulary data, would be matched to a sequence of phones matched to an utterance.
18. A method in accordance with any of claims 15 to 17, wherein the step of determining probabilities of errors comprises the step of determining probability data indicative of a number of predetermined errors occurring in the matching of an utterance with one or more sequences of phones, said errors including at least one of the omission of phones in a sequence of phones, the insertion of phones in a sequence of phones or the erroneous substitution of specified phones for other specified phones in a sequence of phones.
19. A method in accordance with claim 17 or claim 18, further comprising the steps of storing error data indicative of the likelihood of each of a number of predetermined errors occurring in association with data identifying said errors, wherein said determining probabilities of errors comprises determining said probability data utilizing said stored error data.
20. A method in accordance with claim 19, wherein storing error data comprises storing data determined by comparing the matching of utterances representative of known sequences of phones to sequences of phones matched to said utterances.
21. A method in accordance with any of claims 16 to 20,
wherein storing vocabulary data further comprises associating each of said sequences of phones corresponding to words and parts of words within the vocabulary with occurrence data indicative of predetermined likelihoods of said words and parts of words occurring within utterances, wherein said selection step comprises selecting sequences of phones as being representative of an utterance utilizing said probability data and said occurrence data.
22. A method in accordance with claim 21, wherein said storing vocabulary data comprises associating at least some of said sequences of phones corresponding to complete words in a vocabulary with occurrence data indicative of the frequency of said words occurring in speech.
23. A method in accordance with claim 21, wherein said storing vocabulary data comprises associating default value occurrence data with at least some of said sequences of phones corresponding to complete words in a vocabulary which occur within speech with less than a threshold frequency.
24. A method in accordance with any of claims 21 to 23,
wherein said storing vocabulary data comprises associating sequences of phones with occurrence data such that for each sequence of phones corresponding to concatenations of phones representative of increasing portions of the same word, the occurrence data varies approximately in proportion to the number of phones in a sequence.
25. A method in accordance with any of claims 21 to 24, wherein said storing vocabulary data comprises storing data indicative of logarithmic probabilities of words and parts of words occurring within utterances and said selection step comprises selecting a number of sequences of phones as being representative of an utterance utilizing a determined sum of said occurrence data for a sequence of phones and determined probability data representative of logarithmic probabilities of predetermined errors occurring in the matching of an utterance with one or more sequences of phones.
26. A method in accordance with any of claims 15 to 25, wherein said processing step comprises determining for each consecutive portion of an utterance acoustic probability data indicative of the probability of each of said one or more determined phones being representative
of that processed portion of an utterance, said selection step comprising selecting sequences of phones as being representative of an utterance utilizing said acoustic probability data for the final processed portion of an utterance.
27. A method in accordance with any of claims 15 to 26, wherein said processing step comprises matching portions of an utterance to sequences of phones comprising sequences of phones for earlier portions of an utterance selected by said selection step as being representative of said earlier portions of an utterance to which said one or more phones are concatenated.
28. A method in accordance with any of claims 15 to 27, further comprising the steps of: associating each of said sequences of phones corresponding to words within a vocabulary to word data; and outputting sequences of items of word data indicative of a sequence of words comprising items of word data associated with sequences of phones corresponding to sequences of phones selected in said selection step.
29. A recording medium, storing computer implementable processor steps for performing a method in accordance with any of claims 15 to 28.
30. A recording medium storing computer implementable processor steps for generating within a programmable computer an apparatus in accordance with any of claims 1 to 14.
31. A recording medium in accordance with claim 29 or claim 30 comprising a computer disc.
32. A recording medium in accordance with claim 29 or claim 30, comprising an electric signal transferred via the Internet.
33. A computer disc in accordance with claim 31, wherein said computer disc comprises an optical, magneto-optical or magnetic disc.
34. Speech recognition apparatus substantially as herein described with reference to the accompanying drawings.
35. A method of selecting a sequence of words as being representative of an utterance substantially as hereinbefore described with reference to the accompanying drawings.
GB0028144A 2000-11-17 2000-11-17 Speech recognition apparatus Withdrawn GB2373088A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0028144A GB2373088A (en) 2000-11-17 2000-11-17 Speech recognition apparatus

Publications (2)

Publication Number Publication Date
GB0028144D0 GB0028144D0 (en) 2001-01-03
GB2373088A true GB2373088A (en) 2002-09-11

Family

ID=9903407

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0028144A Withdrawn GB2373088A (en) 2000-11-17 2000-11-17 Speech recognition apparatus

Country Status (1)

Country Link
GB (1) GB2373088A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862958B (en) * 2020-08-07 2024-04-02 广州视琨电子科技有限公司 Pronunciation insertion error detection method, pronunciation insertion error detection device, electronic equipment and storage medium

Citations (2)

Publication number Priority date Publication date Assignee Title
JPH0736481A (en) * 1993-07-19 1995-02-07 Osaka Gas Co Ltd Interpolation speech recognition device
GB2355836A (en) * 1999-10-28 2001-05-02 Canon Kk Pattern matching


Similar Documents

Publication Publication Date Title
Szöke et al. Comparison of keyword spotting approaches for informal continuous speech.
US6138095A (en) Speech recognition
KR101417975B1 (en) Method and system for endpoint automatic detection of audio record
US6195634B1 (en) Selection of decoys for non-vocabulary utterances rejection
US6308151B1 (en) Method and system using a speech recognition system to dictate a body of text in response to an available body of text
EP1936606B1 (en) Multi-stage speech recognition
JP4414088B2 (en) System using silence in speech recognition
JP5255769B2 (en) Topic-specific models for text formatting and speech recognition
EP1960997B1 (en) Speech recognition system with huge vocabulary
EP2048655B1 (en) Context sensitive multi-stage speech recognition
EP0664535A2 (en) Large vocabulary connected speech recognition system and method of language representation using evolutional grammar to represent context free grammars
KR19990014292A (en) Word Counting Methods and Procedures in Continuous Speech Recognition Useful for Early Termination of Reliable Pants- Causal Speech Detection
JP2000122691A (en) Automatic recognizing method for spelling reading type speech speaking
WO1997008686A2 (en) Method and system for pattern recognition based on tree organised probability densities
JP2001249684A (en) Device and method for recognizing speech, and recording medium
JP7507977B2 (en) Long-context end-to-end speech recognition system
US8234112B2 (en) Apparatus and method for generating noise adaptive acoustic model for environment migration including noise adaptive discriminative adaptation method
EP1460615B1 (en) Voice processing device and method, recording medium, and program
EP2842124A1 (en) Negative example (anti-word) based performance improvement for speech recognition
JP2003515778A (en) Speech recognition method and apparatus using different language models
JP6690484B2 (en) Computer program for voice recognition, voice recognition device and voice recognition method
US6502072B2 (en) Two-tier noise rejection in speech recognition
US6345249B1 (en) Automatic analysis of a speech dictated document
KR101122590B1 (en) Apparatus and method for speech recognition by dividing speech data
KR101122591B1 (en) Apparatus and method for speech recognition by keyword recognition

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)