EP1055228A1 - Automatisches sprachgesteuertes abfragesystem - Google Patents

Automatisches sprachgesteuertes abfragesystem

Info

Publication number
EP1055228A1
Authority
EP
European Patent Office
Prior art keywords
acoustic
operative
input
transcription
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP99965413A
Other languages
English (en)
French (fr)
Inventor
Matthias Pankert
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nuance Communications Inc
Original Assignee
Koninklijke Philips Electronics NV
Nuance Communications Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Philips Electronics NV, Nuance Communications Inc filed Critical Koninklijke Philips Electronics NV
Priority to EP99965413A priority Critical patent/EP1055228A1/de
Publication of EP1055228A1 publication Critical patent/EP1055228A1/de
Withdrawn legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/3332 - Query translation
    • G06F16/3338 - Query expansion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/685 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using automatically derived transcript of audio data, e.g. lyrics
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M3/00 - Automatic or semi-automatic exchanges
    • H04M3/42 - Systems providing special services or facilities to subscribers
    • H04M3/487 - Arrangements for providing information services, e.g. recorded voice services or time announcements
    • H04M3/493 - Interactive information services, e.g. directory enquiries; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals
    • H04M3/4931 - Directory assistance systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L2015/085 - Methods for reducing search complexity, pruning
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M2201/00 - Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M2201/40 - Electronic components, circuits, software, systems or apparatus used in telephone systems using speech recognition

Definitions

  • the invention relates to an automatic inquiry system including: a storage for storing a plurality of inquiry entries, each inquiry entry comprising a plurality of data fields; a dialogue engine for executing a machine-controlled human-machine dialogue to determine a plurality of pre-determined query fields; a speech recognizer operative to represent each utterance of a sequence of utterances specifying the respective query fields as a corresponding input transcription set, which includes at least two alternative acoustic transcriptions of the corresponding utterance; and a search engine for locating inquiry entries in the storage in dependence on the query fields.
  • Automatic inquiry systems, for instance for obtaining directory information, increasingly use automatic human-machine dialogues.
  • a user establishes a connection with the inquiry system using a telephone.
  • a dialogue engine tries to establish a number of query items enabling obtaining the desired information from a storage, such as a database.
  • the query items specify the data of corresponding query fields.
  • the system typically issues an initial question. For a directory assistance system, such an initial question might be "Whose telephone number would you like to have?".
  • the system uses speech recognition techniques to extract the query items from the user's utterances.
  • a directory assistance system needs to obtain at least a few of the following items to perform a query: the family name, given name, town, street and street number of a person (or similar items for a business).
  • the dialogue may follow a step-wise approach where the system prompts the user for one query item at a time and the system assumes that information recognized from the response specifies the query item. For instance, for a directory assistance system the system could successively ask the user to specify the town, family name, and street.
  • the system may allow a free-speech dialogue where the system extracts the relevant query items from the sequence of utterances of the user. Once all essential query items have been recognized, the answer(s) to the query are obtained from the storage. This results in one or more information items.
  • the information items are typically presented to the user in spoken form. Normally, textual presentations of the information items are inserted in a sequence of preformatted phrase or sentence templates. Usually, the templates are prosodically enriched. If no suitable speech presentation is available, for instance in the form of sampled phrases/words, the prosodically enriched text may be converted to speech using a speech synthesis technique. The speech output is fed back to the user. As an alternative to presenting information to the user, the system may also act on the data. For instance, a directory assistance system could automatically connect the user to the retrieved telephone number. In known automatic inquiry systems, the speech recognizer as a first step in the recognition represents utterances of the user as a set of acoustic transcriptions.
  • an acoustic transcription represents a time-sequence of phonemes or similar subword units.
  • the set includes several alternative transcriptions since normally several sequences of phonemes/sub-word units are acoustically close to the utterance.
  • Hidden Markov Models are used to identify several likely acoustic transcriptions.
  • the speech recognizer makes a selection between the various transcriptions.
  • the transcriptions are converted into textual representations.
  • the textual representations are compared to a lexicon which comprises all textual representations allowed for the query item in question. For instance, for the query field "town" the lexicon comprises all names of towns in the area for which the system is operational.
  • the dialogue engine is then provided with a limited number of textual representations for each query field.
  • the dialogue engine performs a query on the storage to determine one or more inquiry entries which match all or a selected few of the query items.
  • the system is characterized in that <refer to the characterizing part of claim 1>.
  • the searching of the storage is not performed on textual representations of the utterances but on acoustic representations.
  • At least one of the query fields is specified as a set of acoustic transcriptions and compared to a set of reference acoustic transcriptions in the storage. Other query fields may still be represented in a textual form.
  • the speech recognizer is relieved from performing the difficult process of representing the utterance in a textual form.
  • by using alternative acoustic transcriptions, the system can easily cope with a wide spread in pronunciation.
  • in a system based on a user-provided spelling, the system selects, starting from that spelling, alternative, similar-sounding spellings. If the user makes an unfortunate spelling (which may be caused by entering only one wrong character), the system may interpret the entry as sounding different than intended and be unable to find the desired alternative textual form. Consequently, the system will not be able to locate the desired inquiry entry.
  • phonetic conversion routines, which inevitably reduce the success rate, are avoided.
  • for difficult entries such as foreign names, where the spread in pronunciation is larger than normal, many alternative reference acoustic transcriptions may be stored. It will be appreciated that according to the invention, a correct textual transcription of the entry is not required for the searching.
  • the storage may still comprise such a textual transcription for supply to the user or other purposes.
  • At least two query items are represented by transcription sets.
  • a combined search is performed on those (and possibly other) query items.
  • in conventional systems, the entries were individually recognized in the sense that a few most likely candidates were output for each entry; afterwards, a combined search was performed for the remaining candidates.
  • a specific family name may sound similar to hundreds of names, e.g. two hundred names. The number of candidates is then reduced to, for instance, ten most likely candidates. In order to make this reduction, millions of family names may need to be compared to the acoustic transcription of the utterance.
  • a specific town may sound similar to a few dozen names, e.g. twenty names.
  • the database is then used to verify whether any of the possible 100 remaining options actually exists. In the reduction process the actual correct entries may have been filtered out. In fact, for the total number of 4000 (200x20) similar sounding entries, the database may only have one or very few matches. By using the database to perform a combined matching operation, no extensive pre-filtering is required. Any suitable database technique may be used to achieve a fast searching. For instance, indexing may be used on entries with least candidates. In a directory assistance system, advantageously indexing on a town name is used, reducing the search space for family names. By also indexing on street names, the search space for family names can in most cases be reduced to only a few dozen or to a few hundred.
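  • as a minimal sketch of the indexing idea above (Python; the records, keys and candidate sets are hypothetical, and textual keys are used instead of acoustic sets purely for brevity):

        from collections import defaultdict

        # Hypothetical directory records: (family_name, town, street, phone).
        records = [
            ("smith",  "london", "highstreet", "020-111"),
            ("smit",   "london", "mainroad",   "020-222"),
            ("schmid", "berlin", "hauptstr",   "030-333"),
        ]

        # Index the records by town, so that family-name candidates only have to
        # be compared against the entries of the candidate towns.
        by_town = defaultdict(list)
        for rec in records:
            by_town[rec[1]].append(rec)

        candidate_towns = {"london"}         # e.g. a few dozen town candidates
        candidate_names = {"smith", "smit"}  # e.g. a few hundred name candidates

        hits = [rec for town in candidate_towns
                for rec in by_town.get(town, [])
                if rec[0] in candidate_names]
        print(hits)   # both London entries match; the Berlin entry is never touched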
  • two sets of acoustic transcriptions are considered related (i.e. matching) if at least one transcription of the first set is acoustically similar to at least one transcription of the second set. It is not required that all transcriptions of the sets are acoustically similar. This allows for a wide variation in pronunciation. Any conventional technique for determining that two transcriptions are acoustically similar may be used, for instance by using similarity measures or a similarity ranking which indicate similarity of two phonemes, for each phoneme pair.
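  • a minimal sketch of the matching rule just described, i.e. two transcription sets are related if at least one transcription of the first set is acoustically similar to at least one transcription of the second set (Python; the phoneme similarity scores and the threshold are hypothetical):

        # Hypothetical per-phoneme-pair similarity scores (1.0 = identical);
        # unlisted pairs are treated as dissimilar.
        PHONEME_SIM = {("m", "n"): 0.8, ("e", "a"): 0.7}

        def phoneme_sim(a, b):
            if a == b:
                return 1.0
            return PHONEME_SIM.get((a, b), PHONEME_SIM.get((b, a), 0.0))

        def transcription_sim(t1, t2):
            # Crude measure: average per-position similarity over the shorter length.
            n = min(len(t1), len(t2))
            if n == 0:
                return 0.0
            return sum(phoneme_sim(a, b) for a, b in zip(t1, t2)) / n

        def sets_match(set1, set2, threshold=0.8):
            # Related as soon as any pair of transcriptions is similar enough.
            return any(transcription_sim(a, b) >= threshold
                       for a in set1 for b in set2)

        input_set     = [["m", "a", "i", "e", "r"], ["n", "a", "i", "a", "r"]]
        reference_set = [["m", "a", "i", "e", "r"], ["m", "a", "j", "e", "r"]]
        print(sets_match(input_set, reference_set))   # True (the first pair is identical)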
  • At least one set of transcriptions is represented as a graph. This allows for an effective representation where paths through the graph may diverge, where part of a 'word' is pronounced differently and paths may converge, where parts are pronounced similarly. In this way, a sharing of representation of the transcriptions occurs. An individual transcription is then represented by a path through the graph. Preferably, the comparison of sets is not performed by isolating all individual transcriptions and comparing the individual transcriptions, but by using the sharing given the graph.
  • the set of acoustic transcriptions is pruned to reduce the number of comparisons to be made. Since in the system according to the invention no full recognition of an utterance is required, this enables more extensive pruning than in conventional systems.
  • the pruning is based on acoustic similarity.
  • the reference set of acoustic transcriptions could be reduced by only keeping transcriptions representing very different pronunciations, for instance to cover entirely different name pronunciations. Transcriptions representing very similar pronunciations can be reduced to one or a few representing the most common ("average") pronunciation.
  • the input transcription set may be reduced in a conventional manner to represent the acoustically most likely transcription.
  • a statistical model, which for instance specifies the likelihood of phoneme sequences, is used to eliminate unlikely acoustical transcriptions.
  • a predetermined portion of the transcription is selected. This is possible in the system according to the invention, since no textual representation needs to be made. For instance, if an utterance represents a family name it will be in most cases sufficient to only use the first three to five phonemes to be able to identify a directory entry, particularly in combination with other data such as street and town.
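  • a minimal sketch of the pruning just described, keeping only the acoustically most likely alternatives and only a predetermined portion (e.g. the first three to five phonemes) of each (Python; the candidate transcriptions and likelihoods are hypothetical):

        def prune(transcription_set, keep_best=3, prefix_len=5):
            """transcription_set: list of (likelihood, phoneme_list) pairs."""
            # 1. Keep only the most likely alternatives.
            best = sorted(transcription_set, key=lambda t: t[0], reverse=True)[:keep_best]
            # 2. Keep only a predetermined prefix of each transcription; duplicates
            #    that arise from the truncation collapse into one entry.
            pruned = {tuple(phones[:prefix_len]) for _, phones in best}
            return [list(p) for p in pruned]

        candidates = [
            (0.42, ["j", "a", "n", "s", "e", "n"]),
            (0.31, ["j", "a", "n", "s", "s", "e", "n"]),
            (0.02, ["j", "e", "n", "s", "e", "n"]),
        ]
        print(prune(candidates, keep_best=2, prefix_len=4))   # [['j', 'a', 'n', 's']]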
  • Fig. 1 shows a block diagram of the system according to the invention
  • Fig. 2 shows a block diagram of the speech recognizer according to the invention
  • Fig. 3 illustrates full word and sub-word models
  • Fig. 4 shows a database structure
  • Figure 1 shows a block diagram of a system 10 according to the invention. Examples of the working of the system will be described in particular for a directory assistance system which automatically provides information with respect to telephone numbers (similar to the so-called white or yellow pages). It will be appreciated that these examples are not limiting.
  • the system may equally well be used for supplying information like journey scheduling information, such as involving a train, bus or plane.
  • the system may be used to supply other types of information, such as bank related information (e.g. an account overview), information from the public utility boards, information from the council or other governmental organizations, or, more in general, information related to a company (e.g. product or service information).
  • the system may also act on the information, e.g. by automatically connecting the user to a retrieved telephone number.
  • item 20 represents an interconnection for receiving a speech representative signal from a user.
  • a microphone may be connected to the interconnection 20.
  • the system comprises an interface 30 to receive the input from the user. This may for instance be implemented using a conventional modem. If the interface has an input for receiving speech in an analogue form, the interface preferably comprises an A/D converter for converting the analogue speech to digital samples of a format suitable for further processing by a speech recognition system 40. If the interface has an input for receiving the speech in a digital form, e.g.
  • Block 40 represents a speech recognition subsystem.
  • the speech recognition system 40 typically analyses the received speech by comparing it to trained material (acoustical data and a lexicon/grammar).
  • the speech recognition is preferably speaker-independent and allows continuous speech input.
  • speech recognition is known and has been disclosed in various documents, such as EP 92202782.6, corresponding to US Serial No. 08/425,304 (PHD 91136), EP 92202783.4, corresponding to US Serial No. 08/751,377 (PHD 91138), EP 94200475.5, corresponding to US 5,634,083 (PHD 93034), all to the assignee of the present application.
  • At least one utterance of the user is not fully recognized (i.e. is not transcribed to a textual representation) by the speech recognition system 40.
  • the speech recognizer 40 outputs a set of acoustic transcriptions representing the utterance.
  • the speech recognizer 40 recognizes part of the sequence of utterances.
  • a description is given of an exemplary speech recognition system 40 which can fully recognize an utterance as well as represent the utterance as a set of acoustic transcriptions.
  • Speech recognition systems, such as large vocabulary continuous speech recognition systems, typically use a collection of recognition models to recognize an input pattern.
  • Fig. 2 illustrates a typical structure of a large vocabulary continuous speech recognition system 40 [refer L. Rabiner, B.-H. Juang, "Fundamentals of speech recognition", Prentice Hall 1993, pages 434 to 454].
  • the system 40 comprises a spectral analysis subsystem 210 and a unit matching subsystem 220.
  • in the spectral analysis subsystem 210, the speech input signal (SIS) is spectrally and/or temporally analyzed to calculate a representative vector of features (observation vector, OV).
  • typically, the speech signal is digitized and pre-processed, for instance using LPC (Linear Predictive Coding) spectral analysis, to obtain the observation vectors.
  • in the unit matching subsystem 220, the most likely word sequence W for an observed sequence of vectors Y is determined by maximizing P(W|Y), which according to Bayes' rule is proportional to P(Y|W)·P(W) (equation (1)).
  • an acoustic model provides the first term of equation (1).
  • the acoustic model is used to estimate the probability P(Y|W) of the observation sequence Y for a given word string W.
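  • written out, the decision rule behind equation (1) is the standard Bayes decomposition (the notation below is a reconstruction, not a quotation of the original text):

        \hat{W} = \arg\max_{W} P(W \mid Y)
                = \arg\max_{W} \frac{P(Y \mid W)\, P(W)}{P(Y)}
                = \arg\max_{W} P(Y \mid W)\, P(W)

    where the acoustic model supplies the first term P(Y|W), the language model supplies the second term P(W), and P(Y) does not depend on W and can be ignored during the maximization.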
  • a speech recognition unit is represented by a sequence of acoustic references. Various forms of speech recognition units may be used. As an example, a whole word or even a group of words may be represented by one speech recognition unit.
  • a word model (WM) provides for each word of a given vocabulary a transcription in a sequence of acoustic references.
  • a direct relationship exists between the word model and the speech recognition unit.
  • Other systems, in particular large vocabulary systems, may use linguistically based sub-word units, such as phones, diphones or syllables, as well as derivative units, such as fenenes and fenones, as the speech recognition unit.
  • a word model is given by a lexicon 234, describing the sequence of sub-word units relating to a word of the vocabulary, and the sub-word models 232, describing sequences of acoustic references of the involved speech recognition unit.
  • a word model composer 236 composes the word model based on the sub word model 232 and the lexicon 234.
  • Figure 3A illustrates a word model 300 for a system based on whole-word speech recognition units, where the speech recognition unit of the shown word is modeled using a sequence of ten acoustic references (301 to 310).
  • Figure 3B illustrates a word model 320 for a system based on sub-word units, where the shown word is modeled by a sequence of three sub-word models (350, 360 and 370), each with a sequence of four acoustic references (351, 352, 353, 354; 361 to 364; 371 to 374).
  • the word models shown in Fig. 3 are based on Hidden Markov Models (HMMs), which are widely used to stochastically model speech signals.
  • each recognition unit is typically characterized by an HMM, whose parameters are estimated from a training set of data.
  • For large vocabulary speech recognition systems, usually a limited set of, for instance, 40 sub-word units is used, since a lot of training data would be required to adequately train an HMM for larger units.
  • An HMM state corresponds to an acoustic reference.
  • Various techniques are known for modeling a reference, including discrete or continuous probability densities.
  • Each sequence of acoustic references which relates to one specific utterance is also referred to as an acoustic transcription of the utterance. It will be appreciated that if other recognition techniques than HMMs are used, details of the acoustic transcription will be different.
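  • a minimal sketch of how a word model is composed from the lexicon 234 and the sub-word models 232 by the word model composer 236 (Python; the word, the sub-word units and the acoustic reference labels are hypothetical):

        # Hypothetical lexicon: word -> sequence of sub-word units (e.g. phonemes).
        LEXICON = {"main": ["m", "ei", "n"]}

        # Hypothetical sub-word models: unit -> sequence of acoustic references
        # (e.g. HMM states); here simply four labelled references per unit,
        # as in the word model 320 of Fig. 3B.
        SUBWORD_MODELS = {u: [f"{u}{i}" for i in range(1, 5)] for u in ["m", "ei", "n"]}

        def compose_word_model(word):
            # Concatenate the acoustic-reference sequences of the word's sub-word units.
            return [ref for unit in LEXICON[word] for ref in SUBWORD_MODELS[unit]]

        print(compose_word_model("main"))   # 3 sub-word units x 4 references = 12 references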
  • a word level matching system 230 of Fig. 2 matches the observation vectors against all sequences of speech recognition units and provides the likelihoods of a match between the vector and a sequence. If sub-word units are used, constraints can be placed on the matching by using the lexicon 234 to limit the possible sequences of sub-word units to sequences in the lexicon 234. This reduces the outcome to possible sequences of words. If no full recognition is required for certain utterances, such as those representing family names, the involved words need not be in the lexicon. Consequently, the lexicon then cannot be used to reduce the possible sequences of sub-word units. Even if certain words are not in the lexicon, it is still possible to restrict the number of acoustic transcriptions of an utterance.
  • a statistical N-gram subword-string model 238 is used which provides the likelihood of the sequence of the last N subword units. For instance, a bigram could be used which specifies for each phoneme pair, the likelihood that those two phonemes follow each other. As such, based on general language characteristics the set of acoustic transcriptions of an utterance can be pruned, without the specific word (with its acoustic transcription) being in the lexicon. For non-full recognition, the unit matching subsystem 220 outputs the set 250 of acoustic transcriptions which correspond to an utterance.
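  • a minimal sketch of pruning a transcription set with a statistical bigram over phoneme pairs, without the specific word being in the lexicon (Python; the bigram probabilities and the floor value are hypothetical):

        import math

        # Hypothetical bigram: probability that the second phoneme follows the first.
        BIGRAM = {("s", "t"): 0.20, ("t", "r"): 0.15, ("s", "r"): 0.01}
        FLOOR = 1e-4   # assumed probability for unseen phoneme pairs

        def log_score(phones):
            return sum(math.log(BIGRAM.get(pair, FLOOR))
                       for pair in zip(phones, phones[1:]))

        def prune_by_bigram(transcriptions, keep=2):
            # Keep the transcriptions whose phoneme sequences are most likely
            # according to the general, word-independent bigram model.
            return sorted(transcriptions, key=log_score, reverse=True)[:keep]

        candidates = [["s", "t", "r"], ["s", "r", "t"], ["t", "s", "r"]]
        print(prune_by_bigram(candidates))   # the sequence "s t r" scores best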
  • a sentence level matching system 240 may be used which, based on a language model (LM), places further constraints on the matching so that the paths investigated are those corresponding to word sequences which are proper sequences as specified by the language model.
  • the language model provides the second term P(W) of equation (1).
  • the language model used in pattern recognition may include syntactical and/or semantical constraints 242 of the language and the recognition task.
  • a language model based on syntactical constraints is usually referred to as a grammar 244.
  • N-gram word models are widely used. In an N-gram model, the term P(wj | w1 w2 w3 ... wj-1) is approximated by P(wj | wj-N+1 ... wj-1).
  • in practice, bigrams or trigrams are used. For a bigram, the term P(wj | w1 w2 w3 ... wj-1) is approximated by P(wj | wj-1); for a trigram by P(wj | wj-2 wj-1).
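  • in formula form (a reconstruction in standard N-gram notation, consistent with the bigram and trigram cases above):

        P(W) = \prod_{j=1}^{q} P(w_j \mid w_1 w_2 \cdots w_{j-1})
             \approx \prod_{j=1}^{q} P(w_j \mid w_{j-N+1} \cdots w_{j-1})

    so that a bigram (N = 2) uses P(w_j | w_{j-1}) and a trigram (N = 3) uses P(w_j | w_{j-2} w_{j-1}).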
  • the output of the speech recognition subsystem 40 (fully recognized speech and/or transcription sets) is fed to the dialogue management subsystem 50 of Fig. 1.
  • the dialogue manager 50 forms the core of the system 10 and contains application specific information.
  • the dialogue manager 50 is programmed to determine in which information the user is interested. For a simple, fully controlled system, the dialogue manager issues specific question statements, to which the user is expected to reply with only one utterance. In such a simple system, the dialogue manager can sometimes assume that the reply to a specific question represents the information the system requires (without the system needing to recognize the utterance).
  • the dialogue manager 50 scans the output of the speech recognizer in order to extract key-words or phrases which indicate which information the user wishes to obtain.
  • the key words or phrases to be searched for may be stored in a storage 54, such as a hard disk, and be represented in the lexicon 234. This allows for full recognition of the keywords or phrases. Keywords, such as family names, which cannot easily be fully recognized, need not be stored in this way. Contextual information which enables isolating an utterance representing a not-to-be-recognized keyword is preferably stored and recognized in full.
  • the extracted information elements are typically stored in main memory (not shown).
  • a search engine 59 is used to obtain the specified information from a storage 52.
  • the storage is preferably based on a database, and the search engine 59 on search engines usually used for searching databases.
  • the storage is capable of storing a plurality of inquiry entries (e.g. directory entries), where each inquiry entry comprises a plurality of data fields, such as family name, street, town, and telephone number. It may be possible to specify a query in various ways.
  • a directory query may be specified by a family or company name and a town. If the query performed by the search engine 59 results in too many hits (for instance only a frequently occurring family name and a name of a large town were given), additional information such as a street, house number or first name may be obtained in a further dialogue. A user may also have provided the additional information already in the main dialogue. Normally, the main dialogue is initiated by the dialogue manager 50 triggering the issuing of a predetermined welcome statement. If all the query items can be extracted from the initial user response, the dialogue may be completed after the first user response. Otherwise, one or more sub-dialogues may be started to determine the missing items. A sub-dialogue usually starts with a question statement.
  • the question statement may be combined with a confirmation statement, to confirm already identified items. If a user rejects an item already identified, the rejected item may be re-determined, for instance from the rejection utterance, from less likely alternatives derived from previous utterances or by starting a new sub-dialogue. Particularly at the end of the dialogue, all the query items that have been identified may be confirmed by issuing a confirmation statement.
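  • a minimal sketch of such a dialogue flow, i.e. an initial question, sub-dialogues for missing query items and a final confirmation (Python; the field names, prompts and the callbacks ask/recognize_items/confirm are hypothetical placeholders for the speech front-end):

        QUERY_FIELDS = ["family_name", "town", "street"]

        def run_dialogue(ask, recognize_items, confirm):
            # Main dialogue: initial question, then extract whatever query items
            # the recognizer finds in the first response.
            items = recognize_items(ask("Whose telephone number would you like to have?"))
            # Sub-dialogues: ask once more for each query item that is still missing.
            for field in QUERY_FIELDS:
                if field not in items:
                    items.update(recognize_items(ask("Please give the " + field + ".")))
            # Final confirmation statement covering all identified items.
            if not confirm(items):
                items.clear()   # e.g. fall back to alternatives or restart
            return items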
  • the request/confirmation statements typically are formed from predetermined sentences/phrases, which may be stored in a background storage 56.
  • a template sentence/phrase may be completed by the dialogue manager 50 filling in information believed to have been identified in utterances of the user. For instance, the dialogue manager could form a question/confirmation statement like "In which town lives Mr.
  • the system 10 further comprises a speech generation subsystem 60.
  • the speech generation subsystem 60 may receive the question/confirmation statements from the dialogue manager 50 in various forms, such as a (potentially prosodically enriched) textual form or as speech fragments.
  • the speech generation subsystem 60 may be based on speech synthesis techniques capable of converting text-to-speech.
  • the speech generation subsystem 60 may itself prosodically enrich the speech fragments or text in order to generate more naturally sounding speech. The enriched material is then transformed to speech output.
  • a question/confirmation statement may include information identified by the dialogue manager 50. If such information was supplied in a fully recognized form (e.g. a textual representation), this information can be synthesized in a conventional manner, together with other elements of the statement. If the information was not fully recognized, the identified original input utterance may be used.
  • speech synthesis techniques are used to change the characteristics of the input utterance such that the user is not confronted with a sentence with certain system-specific voice characteristics (e.g. the voice of an actor) and one isolated utterance (his own) in between.
  • the prosody of the input utterance is changed to correspond to the prosody of the entire statement.
  • the speech output is provided to the user at the speech output interconnection 80.
  • a loudspeaker is connected to the interconnection 80 for reproducing the speech output.
  • the loudspeaker forms part of the telephone used for the speech input.
  • the speech output interface 70 is usually combined with the speech input interface 30 in a known manner, for instance in the form of a modem.
  • the storage 52 stores for those elements which are not fully recognized a respective reference transcription set including at least two alternative acoustic transcriptions of data stored in a data field corresponding to the element. For instance, if only the family name is not fully recognized, then for each directory entry a set of acoustic transcriptions of the family name corresponding to that entry is stored. In principle, the family name need not be stored in a textual form. However, for most practical applications it will be useful to also store the family name as text, for instance for display purposes or for verifying whether the acoustic transcription set is accurate.
  • Fig. 4 illustrates a database structure for a directory assistance system.
  • each directory record (entry) is identified by an entry number (key) 402.
  • each record provides information with respect to the following data fields: the family name 404, the initial(s) 406, the town 408, the street 410, the house number 412 and the telephone number 414.
  • the fields family name 404, town 408 and street 410 are not searched for based on a textual representation but on an acoustic representation (set of acoustic transcriptions). This is illustrated further for the family name.
  • a reference is stored to a record in a separate table 420 which stores the textual representation 422 of the family name.
  • Each record of table 420 also comprises an identifier (key) 424 of the family name.
  • a reference acoustic transcription 432 is stored together with a reference 434 to the corresponding family name.
  • several acoustic transcriptions, forming a set of reference acoustic transcriptions, are associated with one entry.
  • all acoustic transcriptions in table 430 relating to the same record of table 420 refer back to the same identifier 424 of table 420.
  • records 440, 442 and 444 of table 430 all relate to record 426 of table 420.
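  • a minimal sketch of the structure of Fig. 4, with the directory entries referring to a family-name table (420) and a separate table (430) holding the alternative reference acoustic transcriptions (Python; the keys reuse the reference numerals of the figure and all field values are hypothetical):

        # Table 420: textual family names, keyed by a family-name identifier (424).
        family_names = {426: "Janssen"}

        # Table 430: reference acoustic transcriptions (432), each with a reference
        # (434) back to the family-name identifier it belongs to.
        acoustic_transcriptions = {
            440: {"transcription": ["j", "a", "n", "s", "e", "n"],      "name_id": 426},
            442: {"transcription": ["j", "a", "n", "s", "s", "e", "n"], "name_id": 426},
            444: {"transcription": ["i", "a", "n", "s", "e", "n"],      "name_id": 426},
        }

        # Directory records: the family-name field refers to table 420, so the whole
        # reference transcription set is shared between entries with the same name
        # (town and street are shown as plain text purely for brevity).
        directory = {
            1: {"family_name_id": 426, "initials": "P.", "town": "Eindhoven",
                "street": "Mainstreet", "house_number": 12, "phone": "040-1234567"},
        }

        def reference_set(name_id):
            return [r["transcription"] for r in acoustic_transcriptions.values()
                    if r["name_id"] == name_id]

        print(len(reference_set(426)))   # 3 alternative transcriptions for this name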
  • the reference acoustic transcription set may be obtained in any suitable way as for instance is known from training speech recognition systems. As an example, a selected group of people may be used to pronounce the item several times. Conventional techniques, such as the Viterbi algorithm, may be used to align the utterance with the states of the Hidden Markov Models and obtain a transcription of the utterance. As described earlier, such a reference transcription set is preferably pruned to a sufficiently small set of transcriptions to enable an adequately fast search of the storage.
  • the storage 52 stores for at least two data fields of the directory records respective associated transcription sets.
  • Each transcription set includes at least two alternative acoustic transcriptions of data stored in the associated data field.
  • three data fields (family name, town, street) were associated with acoustic transcription sets.
  • the search engine 59 can perform a query on two or more query fields in combination, where the query fields are specified by textual representations. In such a situation, a record is located by the query if all the query fields match the corresponding data fields of the record (a so-called AND operation). Such a query can be speeded up using conventional techniques, such as indexing on one or more of the query fields.
  • the same combined query is performed where at least two of the query fields (and the associated data fields) are specified by sets of acoustic transcriptions.
  • whereas for an individual query field the number of hits may be very large, this will normally not be the case for a combined (AND) query. This avoids having to select between hits for individual query fields and as such no 'recognition' of the individual query fields is required.
  • the query fields are only 'recognized' as belonging to the record(s) which was/were found to match the combined query.
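  • a minimal sketch of such a combined (AND) query, where a record is located only if every acoustically specified query field matches the corresponding reference transcription set (Python; the similarity test is reduced to prefix equality purely for brevity, and all records and transcriptions are hypothetical):

        def sets_match(query_set, reference_set, prefix_len=4):
            # Placeholder test: the sets match if any pair of transcriptions
            # shares a phoneme prefix.
            return any(tuple(q[:prefix_len]) == tuple(r[:prefix_len])
                       for q in query_set for r in reference_set)

        def combined_query(records, acoustic_query):
            # A record is a hit only if ALL acoustic query fields match.
            return [rec for rec in records
                    if all(sets_match(acoustic_query[f], rec[f]) for f in acoustic_query)]

        records = [
            {"family_name": [["j", "a", "n", "s", "e", "n"]], "town": [["l", "e", "i", "d", "e", "n"]]},
            {"family_name": [["j", "a", "n", "s", "e", "n"]], "town": [["d", "e", "l", "f", "t"]]},
        ]
        query = {"family_name": [["j", "a", "n", "s", "s", "e", "n"], ["j", "a", "n", "s", "e", "n"]],
                 "town": [["d", "e", "l", "f", "t"]]}
        print(combined_query(records, query))   # only the second record matches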
  • a set of acoustic transcriptions is stored/processed as a graph.
  • the graph representation may be used for the input transcription set(s) as well as the reference transcription set(s) stored in the storage 52. If a reference transcription set is represented as one graph, this makes table 430 of Fig. 4 obsolete.
  • the records 440, 442, and 444 can be replaced by one graph, which can be inserted in an additional field in table 420 for record 426.
  • Fig. 5 illustrates a relatively simple graph.
  • the nodes (501 to 515) each represent an acoustic (sub-)word unit, such as an HMM state, like a phoneme.
  • a path through the graph represents an acoustic transcription of the utterance. All paths through one graph are transcriptions of the same utterance. Normally, a likelihood is associated with the paths too.
  • the graph of Fig. 5 consists of seven paths (transcriptions): 1. 501, 502, 503, 504
  • the search engine 59 matches the individual transcriptions of the set which is not represented as a graph against the graph. This may, for instance, be done by sequentially comparing the nodes of the individual transcription against the possible paths through the graph. Since the paths of the graph have a common representation (they normally share nodes), this has the advantage of reducing the number of comparisons. For instance, if an individual path is compared with the graph of Fig. 5, a comparison with path 1 of the graph (as indicated above) may show a match of nodes 501 and 502, but a mismatch at node 503.
  • a match at node-level is not restricted to a binary yes-no decision.
  • a measure of similarity between the nodes may be used. Such a measure may be based on a perceived acoustic similarity between sub-word units (nodes in the graph and transcription), such as phonemes. Preferably such a measure is combined with a likelihood measure of the paths to provide a likelihood of a match. A selection of most likely matches may be considered further. Instead of using a likelihood, fuzzy logic rules may also be used for determining whether a path and a graph match.
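  • a minimal sketch of matching one individual transcription against a graph whose paths share nodes, using a per-node similarity measure rather than a binary yes-no decision (Python; the graph, the similarity scores and the node labels are hypothetical and much smaller than the graph of Fig. 5):

        # A small graph with two paths, m-a-i-e-r and m-a-j-e-r, sharing most nodes.
        NODES = {1: "m", 2: "a", 3: "i", 4: "e", 5: "j", 6: "r"}
        EDGES = {1: [2], 2: [3, 5], 3: [4], 5: [4], 4: [6], 6: []}
        START = 1

        PHONEME_SIM = {("i", "j"): 0.8}   # hypothetical similarity of two phonemes
        def sim(a, b):
            return 1.0 if a == b else PHONEME_SIM.get((a, b), PHONEME_SIM.get((b, a), 0.0))

        def best_path_score(transcription):
            # Depth-first walk over all paths; the shared prefix (nodes 1 and 2)
            # is scored only once before the paths diverge.
            def walk(node, pos, score):
                score += sim(NODES[node], transcription[pos])
                if not EDGES[node]:                    # end of a path reached
                    return score if pos == len(transcription) - 1 else float("-inf")
                if pos + 1 == len(transcription):      # transcription exhausted early
                    return float("-inf")
                return max(walk(nxt, pos + 1, score) for nxt in EDGES[node])
            return walk(START, 0, 0.0)

        print(best_path_score(["m", "a", "j", "e", "r"]))   # 5.0: the m-a-j-e-r path matches exactly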
  • the search engine 59 is programmed to compare both graphs in one operation, benefiting from the node sharing in both graphs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)
EP99965413A 1998-12-17 1999-11-29 Automatisches sprachgesteuertes abfragesystem Withdrawn EP1055228A1 (de)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP99965413A EP1055228A1 (de) 1998-12-17 1999-11-29 Automatisches sprachgesteuertes abfragesystem

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
EP98204286 1998-12-17
EP98204286 1998-12-17
PCT/EP1999/009263 WO2000036591A1 (en) 1998-12-17 1999-11-29 Speech operated automatic inquiry system
EP99965413A EP1055228A1 (de) 1998-12-17 1999-11-29 Automatisches sprachgesteuertes abfragesystem

Publications (1)

Publication Number Publication Date
EP1055228A1 true EP1055228A1 (de) 2000-11-29

Family

ID=8234476

Family Applications (1)

Application Number Title Priority Date Filing Date
EP99965413A Withdrawn EP1055228A1 (de) 1998-12-17 1999-11-29 Automatisches sprachgesteuertes abfragesystem

Country Status (3)

Country Link
EP (1) EP1055228A1 (de)
JP (1) JP2002532763A (de)
WO (1) WO2000036591A1 (de)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030093272A1 (en) * 1999-12-02 2003-05-15 Frederic Soufflet Speech operated automatic inquiry system
DE10035523A1 (de) * 2000-07-21 2002-01-31 Deutsche Telekom Ag Virtuelles Testbett
WO2004036887A1 (en) 2002-10-16 2004-04-29 Koninklijke Philips Electronics N.V. Directory assistant method and apparatus
US7346151B2 (en) 2003-06-24 2008-03-18 Avaya Technology Corp. Method and apparatus for validating agreement between textual and spoken representations of words
US9837070B2 (en) 2013-12-09 2017-12-05 Google Inc. Verification of mappings between phoneme sequences and words

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5268990A (en) * 1991-01-31 1993-12-07 Sri International Method for recognizing speech using linguistically-motivated hidden Markov models
CA2088080C (en) * 1992-04-02 1997-10-07 Enrico Luigi Bocchieri Automatic speech recognizer
EP0645757B1 (de) * 1993-09-23 2000-04-05 Xerox Corporation Semantische Gleichereignisfilterung für Spracherkennung und Signalübersetzungsanwendungen
US5799276A (en) * 1995-11-07 1998-08-25 Accent Incorporated Knowledge-based speech recognition system and methods having frame length computed based upon estimated pitch period of vocalic intervals
GB9601925D0 (en) * 1996-01-31 1996-04-03 British Telecomm Database access

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO0036591A1 *

Also Published As

Publication number Publication date
JP2002532763A (ja) 2002-10-02
WO2000036591A1 (en) 2000-06-22

Similar Documents

Publication Publication Date Title
US6856956B2 (en) Method and apparatus for generating and displaying N-best alternatives in a speech recognition system
US6243680B1 (en) Method and apparatus for obtaining a transcription of phrases through text and spoken utterances
US6937983B2 (en) Method and system for semantic speech recognition
US8065144B1 (en) Multilingual speech recognition
US6910012B2 (en) Method and system for speech recognition using phonetically similar word alternatives
US6085160A (en) Language independent speech recognition
US5581655A (en) Method for recognizing speech using linguistically-motivated hidden Markov models
US20050033575A1 (en) Operating method for an automated language recognizer intended for the speaker-independent language recognition of words in different languages and automated language recognizer
KR20060041829A (ko) 호출자 식별 방법 및 시스템, 컴퓨터 판독가능 매체
WO1999021172A2 (en) Pattern recognition enrolment in a distributed system
Rabiner et al. Speech recognition: Statistical methods
US20040006469A1 (en) Apparatus and method for updating lexicon
JPH10274996A (ja) 音声認識装置
US20020095282A1 (en) Method for online adaptation of pronunciation dictionaries
EP1055228A1 (de) Automatisches sprachgesteuertes abfragesystem
Georgila et al. A speech-based human-computer interaction system for automating directory assistance services
Wu et al. Application of simultaneous decoding algorithms to automatic transcription of known and unknown words
EP1135768B1 (de) Buchstabiermodus in einem spracherkenner
Huggins et al. The use of shibboleth words for automatically classifying speakers by dialect
EP1158491A2 (de) Spracheingabe und Wiederauffiden von Personendaten
Wan et al. Bob: A lexicon and pronunciation dictionary generator
Georgila et al. An integrated dialogue system for the automation of call centre services.
Neubert et al. Directory name retrieval over the telephone in the Picasso project
JP2005534968A (ja) 漢字語の読みの決定
Georgila et al. Large Vocabulary Search Space Reduction Employing Directed Acyclic Word Graphs and Phonological Rules

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

17P Request for examination filed

Effective date: 20001222

17Q First examination report despatched

Effective date: 20030212

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: SCANSOFT, INC.

RBV Designated contracting states (corrected)

Designated state(s): DE FR GB

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20040508