EP1518221A1

EP1518221A1 - Method for natural voice recognition based on a generative transformation/phrase structure grammar

Info

Publication number: EP1518221A1
Application number: EP03761435A
Authority: EP
Inventors: Klaus Dieter Liedtke; Guntbert Markefka
Original assignee: T Mobile Deutschland GmbH
Current assignee: Telekom Deutschland GmbH
Priority date: 2002-06-28
Filing date: 2003-06-26
Publication date: 2005-03-30
Also published as: PL373306A1; IL165957A0; CN1315109C; US20060161436A1; DE10229207B3; US7548857B2; IL165957A; JP4649207B2; CA2493429C; JP2005539249A; AU2003250272A1; CN1666254A; WO2004003888A1; CA2493429A1; WO2004003888B1

Abstract

The invention relates to a method for natural voice recognition based on a generative transformation/phrase structure grammar, known as GT/PS grammar. According to the invention, a spoken phrase is analyzed for the triphones contained therein, the words contained in the spoken phrase are formed from the recognized triphones with the aid of dictionaries, and the spoken phrase is syntactically reconstructed from the recognized words using a grammar. The GT/PS grammar is a novel method of storing target sentences in a grammar. It uses the traditional Grammar Specification Language (GSL) but structures the sentences in an innovative manner. It is oriented towards the rules of phrase structure grammar and Noam Chomsky's concept of generative transformation grammar.

Description

Process for natural speech recognition based on a generative transformation / phrase structure grammar

The invention relates to a method for natural speech recognition based on a generative transformation / phrase structure grammar (GT / PS grammar).

Current speech recognition systems with natural speech recognition (NLU = Natural Language Understanding) are able to understand a variety of possible utterances and convert them into complex command structures that cause the speech recognition system, e.g. a computer, to take certain actions. They do this on the basis of predefined, meaningful sample sentences, which are defined by application developers and so-called dialog designers. This collection of sample sentences, also called a "grammar", includes individual command words as well as complicated nested sentences that make sense at a certain point in the dialog. If the user utters such a sentence, the system understands it with great certainty and executes the instruction associated with it.

When programming a recognition application, such as an NLU telephone application, the grammar is an indispensable component. It is generated using a special tool, the so-called Grammar Specification Language (GSL). The GSL is used to reproduce the words to be understood, as well as their links, in advance and to lay them down for the speech recognizer. The predefined sentences are formed from combinations of words that are interchangeable (paradigmatic axis) and combinable (syntagmatic axis). An example of this is shown in FIG. 7. The possible utterances result from the syntagmatic connection of the paradigmatic word combinations. To keep the range of answers as large as possible, it must be accepted that grammatically wrong sentences, such as "Would you perhaps replace the Telly tariff?", are also covered. The recognition of nonsensical sample sentences, or of expressions with the same meaning, should nevertheless be kept to a minimum: it requires considerable system resources and at the same time reduces recognition performance, because the system has to compare every user utterance with an abundance of predefined sentence combinations that are hardly ever uttered.
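The way possible utterances arise from the two axes can be illustrated with a small sketch. The word sets below are hypothetical (the content of FIG. 7 is not reproduced here), and the function name `possible_utterances` is chosen purely for illustration. The sketch only shows how quickly the number of predefined sentences grows with each additional paradigmatic set.

```python
from itertools import product

# Hypothetical paradigmatic sets: the words inside one set are interchangeable.
PARADIGMS = [
    ["would you", "could you"],
    ["perhaps", "please"],
    ["change", "replace", "cancel"],
    ["the Telly tariff", "my contract"],
]


def possible_utterances(paradigms):
    """Chain one choice from every set, in order (syntagmatic connection)."""
    return [" ".join(choice) for choice in product(*paradigms)]


sentences = possible_utterances(PARADIGMS)
print(len(sentences))  # 2 * 2 * 3 * 2 = 24 predefined sentences
```

Every additional set multiplies the number of sentences the recognizer has to compare against, regardless of whether a user would ever utter them.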

In previous practice, the paradigmatic word combinations were defined by grouping together words that apparently belonged together. The meaningful quality of the words was assumed. This process, which is based on a presumed set of success sentences, meets the requirements of simple applications and leads to satisfactory results. In the case of complex applications with an abundance of sensible answer options, on the other hand, these conventional grammars become so large that they push even the computing capacity of current high-performance servers to the limit. The consequences are:

- Greatly increased overgeneration

- Noticeable delays in recognition (latency)

- Decreasing recognition reliability (accuracy)

- Reduced system stability (robustness)

The main shortcoming of this method is that the specified sentences only follow a superficial combinatorics. The overgeneration is so great because the apparently related elements actually follow other combination rules that have long been known in linguistics. In summary, the currently widespread grammars, which determine which sentences are recognized by an ASR system, follow traditional grammatical conventions that depict natural language expressions in an insufficiently structured manner. So far, no distinction between "surface" and "deep" structures has been made. The linguistic hypothesis states that a syntactic deep structure and its "generative implementation" towards concrete surface structures constitute the performance of a language system. If, with increasing complexity, only the surface structure continues to be used, the grammar must be dimensioned so large to still fulfill its task that it can hardly be maintained properly in operation and the server is loaded to the limits of its capacity.

The object of the invention is to provide a method for speech recognition on the basis of a generative transformation / phrase structure grammar which, compared to conventional recognition methods, requires fewer system resources and thereby enables reliable and fast recognition of speech while reducing overgeneration.

This object is achieved by the features of claim 1.

According to the invention, a spoken phrase is analyzed for the triphones contained therein, the words contained in the spoken phrase are formed from the recognized triphones with the aid of phonetic word databases (dictionaries), and the spoken phrase is syntactically reconstructed from the recognized words using a grammatical set of rules (grammar).

Advantageous refinements and developments of the invention result from the features of the subclaims. Particularly striking is the contrast between the method according to the invention and the traditional Grammar Specification Language approach, which achieved good results in small applications even with syntactic surfaces, i.e. the concrete formulation of success sentences.

According to the invention, the linking rules of grammatical sentences are not reproduced on the surface. Instead, the grammar represents the deep structures that the syntagmatic links of all Indo-European languages follow. Each sentence is described using a syntactic model in the form of so-called structure trees.

The GT / PS grammar is not based on the potential statements of a specific application, but on the deep structure of the syntax (sentence formation rules) of Indo-European languages. It provides a framework that can be filled with different words and depicts the reality of the spoken language better than the previously used "mimetic" process.

Within the deep structures described by the structure trees, it can be seen that certain phrases are repeated within a sentence. Such repetitions can be reproduced and captured with the help of the GSL. This not only significantly reduces the size of a grammar, but also significantly reduces the overgeneration of grammatically incorrect sentences.
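The size effect of reusing a repeated phrase can be sketched with a toy phrase structure grammar. The rule names (`S`, `NP`, `VP`, ...) and the vocabulary are assumptions made for this illustration and are not taken from the patent's GSL grammar. The noun phrase is defined once and referenced in both the subject and the object position.

```python
from itertools import product

# Hypothetical rules: keys are subgrammars, other strings are words.
RULES = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["DET", "N"]],  # defined once, used twice per sentence
    "DET": [["the"], ["my"]],
    "N": [["tariff"], ["contract"], ["customer"]],
    "V": [["changes"], ["cancels"]],
}


def expand(symbol):
    """Return every word sequence that a symbol can produce."""
    if symbol not in RULES:
        return [[symbol]]  # a plain word expands to itself
    results = []
    for alternative in RULES[symbol]:
        parts = [expand(s) for s in alternative]
        for combo in product(*parts):
            results.append([word for part in combo for word in part])
    return results


sentences = [" ".join(words) for words in expand("S")]
rule_count = sum(len(alternatives) for alternatives in RULES.values())
print(rule_count, len(sentences))  # 10 rule alternatives cover 72 sentences
```

This toy grammar illustrates only the reuse of a repeated phrase. It still produces odd sentences and does not model how the GT / PS grammar limits overgeneration.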

While in the traditional GSL grammar, for example, around 500 subgrammars are intertwined in seven hierarchical levels, the number of subgrammars in the GT / PS model can be reduced to, for example, 30 subgrammars in just two hierarchical levels.

The new grammar type depicts natural language expressions in a structured form and is, for example, only around 25% of the size of the previous grammar. Because of its small size, this grammar is easier to maintain, and compilation times decrease sharply. Its small size also increases the recognition reliability (accuracy) and decreases the recognition delay (latency). Current computer capacities are better used and the performance of the servers increases. In addition, the new grammar is not tied to a specific application, but can be used in its basic structures for different applications, which increases the homogeneity of the systems and reduces development times.

The universal code of the deep structure enables use in, and added value for, multilingual language systems on a scale that has not yet been achieved. The standard Western European languages in particular can be processed with comparatively little effort.

In contrast to the previous grammar for natural-language dialog applications, the new GT / PS grammar is based on current linguistic models that represent natural-language utterances in terms of surface and deep structures. The abstract structure patterns are transferred with a Grammar Specification Language (GSL) into a hierarchically nested and networked set of rules, the structures of which are shown in the two systems.

The technical advantages of the GT / PS grammar are:

- The GT / PS grammar is much smaller than the previous grammar because it needs only two subgrammar levels instead of up to seven;

- The number of grammatically incorrect sentences covered by the grammar (overgeneration) drops drastically;

- It only needs around a third of the slots used up to now;

- Contrary to today's speech recognition philosophy, it fills the slots in the lower grammar levels instead of the upper ones;

- It consistently uses the instrument provided by the GSL (Grammar Specification Language) to pass slot values up to higher grammar levels;

- It has a new slot called ACTION, which can only be filled with the values GET and KILL;

- It works with nested slots that are highly multitasking capable;

- It leads to an improvement in the recognition performance;

- It offers a simplified way of introducing multilingual applications;

- It integrates seamlessly into Nuance technology.
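The idea of a two-valued ACTION slot that is filled at the word level and carried upward in a nested structure can be sketched as follows. Only the slot name ACTION and its values GET and KILL come from the text. The verb lexicon, the assignment of verbs to GET or KILL, the `OBJECT` and `NAME` keys, and the function `fill_slots` are hypothetical and serve only as an illustration. This is not GSL syntax.

```python
from enum import Enum


class Action(Enum):
    """The ACTION slot accepts exactly two values."""
    GET = "GET"
    KILL = "KILL"


# Hypothetical word-level entries: the slot value is attached to the word
# itself, i.e. at the lowest grammar level.
VERB_ACTIONS = {
    "order": Action.GET,
    "book": Action.GET,
    "cancel": Action.KILL,
    "delete": Action.KILL,
}


def fill_slots(verb, noun):
    """Return a nested slot structure for a verb and the noun it acts on."""
    action = VERB_ACTIONS[verb]  # raises KeyError for verbs outside the lexicon
    return {"ACTION": action, "OBJECT": {"NAME": noun}}


print(fill_slots("cancel", "tariff"))
```

Restricting the slot to an enumeration means that any other value is rejected when the slot is constructed rather than later in the application.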

The economic advantages of the PSG are:

- Reduction of hardware costs through better use of system resources

- Reduction of transmission times through more efficient recognition

- Saving of human resources through easier maintenance

- Greater customer satisfaction

- Applicable to all world languages (English to Chinese)

The invention is explained in more detail below on the basis of a simplified exemplary embodiment with reference to the drawings. Further features, advantages and possible uses of the invention result from the drawings and their description. The figures show:

Figure 1: A triphone analysis as the first step in the recognition process;

Figure 2: Word recognition from the recognized triphones as the second step in the recognition process;

Figure 3: A syntactic reconstruction of the recognized words as the third step in the recognition process;

Figure 4: An example of the structuring of the recognized words into parts of speech as well as into nominal and verbal phrases;

Figure 5: A sample program for a possible grammar;

Figure 6: An overview of the structure of a PSG grammar;

Figure 7: An example of the formation of word combinations in a grammar according to the prior art.

Figure 1 shows the first step of speech recognition: the triphone analysis. The continuous flow of speech of a person 1 is picked up, e.g., by the microphone of a telephone and fed to a speech recognizer 2 as an analog signal. There, the analog voice signal is converted into a digital voice signal 3. The speech signal contains a variety of triphones, i.e. sound segments, which are compared in the speech recognizer 2 with existing, i.e. predefined, triphone linking rules. The existing triphones are stored in a database which contains one or more phonetic dictionaries. The recognized triphones are then present as a triphone chain 4, e.g. "pro", "red", "ote", "tel".

In a second step according to FIG. 2, meaningful words are formed from the recognized triphones. For this purpose, the existing triphone chain 4 is compared with predetermined words 6 stored in a further phonetic dictionary 5, e.g. "professional", "portal", "protel", "hotel". The phonetic dictionary 5 can comprise a certain vocabulary from colloquial language as well as a special vocabulary tailored to the respective application. If the recognized triphones, e.g. "pro" and "tel", match the triphones contained in a word, e.g. "protel", the corresponding word 7 is recognized as such: "protel".
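The comparison of a triphone chain with dictionary words can be sketched using the example values from the text. Two simplifications are assumptions of this sketch and not part of the described method: overlapping three-letter windows of the spelling stand in for acoustic triphones, and a simple overlap ratio decides which word is chosen. A real recognizer works on phonetic units and acoustic scores.

```python
# Triphone chain and candidate words taken from the example in the text.
TRIPHONE_CHAIN = ["pro", "red", "ote", "tel"]
DICTIONARY = ["professional", "portal", "protel", "hotel"]


def letter_trigrams(word):
    """Assumption: overlapping 3-letter windows stand in for a word's triphones."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def score(word, chain):
    """Fraction of the word's trigrams that occur in the recognized chain."""
    trigrams = letter_trigrams(word)
    hits = sum(1 for trigram in trigrams if trigram in chain)
    return hits / len(trigrams)


def best_word(chain, dictionary):
    """Pick the dictionary word with the highest overlap score."""
    return max(dictionary, key=lambda word: score(word, chain))


print(best_word(TRIPHONE_CHAIN, DICTIONARY))  # protel
```

With these inputs, "protel" matches three of its four trigrams ("pro", "ote", "tel"), "hotel" two of three, "professional" one of ten and "portal" none. The stray segment "red" does not contribute to any candidate.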

In the next step, shown in FIG. 3, the recognized words 7 are reconstructed using the grammar 8. For this purpose, the recognized words are assigned to their part of speech, such as noun, verb, adverb, article, adjective, etc., as shown in FIG. 6. This is done using databases that are divided into parts of speech. As can be seen in FIG. 5, the databases 9-15 can contain both the conventional part-of-speech categories mentioned above and special categories, such as a yes / no grammar 9 or telephone numbers 14, 15. A detection of DTMF inputs 16 can also be provided. The described assignment of the part of speech to the recognized words can already take place during the word recognition process.
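A minimal sketch of the assignment of recognized words to parts of speech, assuming one small word database per category. The database contents, the category names and the function `tag` are hypothetical. The actual databases 9-15 are shown only in FIG. 5.

```python
# Hypothetical word databases, one per part of speech.
WORD_DATABASES = {
    "article": {"the", "a"},
    "noun": {"customer", "tariff", "protel"},
    "verb": {"orders", "cancels"},
    "adjective": {"new", "old"},
    "yes_no": {"yes", "no"},
}

# Invert the databases once so that each word maps to its category.
CATEGORY_OF = {
    word: category
    for category, words in WORD_DATABASES.items()
    for word in words
}


def tag(words):
    """Pair every recognized word with its category; unknown words get None."""
    return [(word, CATEGORY_OF.get(word)) for word in words]


print(tag(["the", "customer", "orders", "protel"]))
```

The sketch assumes that every word belongs to exactly one category. Words with several possible parts of speech would need a different data structure.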

In the next step (step 17), the recognized words are assigned, on the basis of their word categories, to a verbal phrase, i.e. a phrase based on a verb, and to a nominal phrase, i.e. a phrase based on a noun; cf. FIG. 6.
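The grouping of tagged words into nominal and verbal phrases can be sketched as follows. The mapping from part of speech to phrase type and the merging rule (consecutive words of the same phrase type form one phrase) are assumptions of this sketch. The patent does not spell out the grouping procedure.

```python
# Assumed mapping from part of speech to phrase type.
PHRASE_OF = {
    "article": "NP", "adjective": "NP", "noun": "NP",
    "verb": "VP", "adverb": "VP",
}


def group_phrases(tagged_words):
    """Merge consecutive words of the same phrase type into one phrase."""
    phrases = []
    for word, category in tagged_words:
        phrase_type = PHRASE_OF[category]
        if phrases and phrases[-1][0] == phrase_type:
            phrases[-1][1].append(word)  # extend the current phrase
        else:
            phrases.append((phrase_type, [word]))  # start a new phrase
    return phrases


tagged = [("the", "article"), ("customer", "noun"),
          ("quickly", "adverb"), ("orders", "verb"),
          ("a", "article"), ("new", "adjective"), ("tariff", "noun")]
print(group_phrases(tagged))
```

A known limitation of this simple rule is that two directly adjacent nominal phrases would be merged into one.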

Then the nominal phrases and verbal phrases are merged into objects according to phrase-structural aspects. In step 18, the objects for multitasking are linked to the corresponding voice-controlled application.

Each object 19 comprises a target sentence stored in the grammar 8, more precisely a sentence model. Figure 4 shows that such a sentence model can be defined, e.g., by the word order "subject, verb, object" or "object, verb, subject". Many other sentence structures are stored in this general form in the grammar 8. If the word categories of the recognized words 7 correspond to the order of one of the predefined sentence models, the words are assigned to the associated object. The sentence is then considered recognized. In other words, each sentence model comprises a number of variables assigned to the different word categories, which are filled with the corresponding word categories of the recognized words 7.
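Matching a word sequence against sentence models can be sketched as follows. A sentence model is represented here as an ordered list of (variable, category) pairs, which is an assumed encoding. The model names and the function `match` are hypothetical. Only the "subject, verb, object" order from the text is included, because category order alone cannot distinguish it from "object, verb, subject".

```python
# Hypothetical sentence models: ordered (variable, category) pairs.
SENTENCE_MODELS = {
    "statement": [("subject", "noun"), ("verb", "verb"), ("object", "noun")],
    "yes_no_answer": [("answer", "yes_no")],
}


def match(tagged_words):
    """Return (model name, filled variables) for the first fitting model, else None."""
    categories = [category for _, category in tagged_words]
    for name, model in SENTENCE_MODELS.items():
        if categories == [category for _, category in model]:
            variables = {
                variable: word
                for (variable, _), (word, _) in zip(model, tagged_words)
            }
            return name, variables
    return None


print(match([("customer", "noun"), ("orders", "verb"), ("tariff", "noun")]))
```

If the category order fits no stored model, the function returns `None`, which corresponds to a sentence that is not recognized.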

The procedure uses the traditional Grammar Specification Language (GSL), but structures the stored sentences in an innovative way. It is based on the rules of phrase structure grammar and the concept of a generative transformation grammar.

Through the consistent application of the deep structures of a sentence described there, in particular the distinction between noun phrases and verb phrases, it is much closer to the sentence constitution of natural language than the intuitive grammar concepts that have prevailed so far. The GT / PS grammar is therefore based on a theoretical model that is suitable for determining the abstract principles of natural language utterances. In the field of modern speech recognition systems, it opens up the possibility for the first time to reverse the abstraction of sentence formation rules and to substantiate them as a prediction of the statements made by application users. This enables systematic access to speech recognition grammars that have always been based on the intuitive accumulation of example sentences.

A central feature of conventional and GT / PS grammars is the hierarchical nesting into so-called subgrammars, which combine individual words and variables at the highest level to form an entire sentence. The GT / PS grammar is much smaller and hierarchically much clearer than the previously known grammars. In contrast to conventional grammars, almost exclusively "meaningful" sentences are stored in the new grammar, so that the degree of overgeneration, i.e. of stored sentences that are incorrect in the natural-language sense, decreases. This, in turn, is the prerequisite for improved recognition performance, since the application only has to choose between a few stored alternatives.

Claims

1. Method for natural speech recognition based on a generative transformation / phrase structure grammar, characterized by the steps:

- Analysis of a spoken phrase for triphones contained therein;

- Formation of words contained in the spoken phrase from the recognized triphones with the help of phonetic word databases (dictionaries); and

- Syntactic reconstruction of the spoken phrase from the recognized words using a grammatical set of rules (grammar).

2. The method according to claim 1, characterized in that the syntactic reconstruction of the spoken phrase comprises the steps:

- Assignment of the recognized words to part of speech categories (verb, noun etc.);

- Assignment of part of speech types to noun phrases and verb phrases;

- Merging the nominal phrases and verbal phrases according to syntactic rules in objects using different sentence models, whereby the recognized word sequences are compared with the given sentence models, whereby in the event of a match, the sentence is considered recognized and an action is triggered in a voice-controlled application.

3. The method according to any one of claims 1 or 2, characterized in that each sentence model has a number of variables assigned to word categories, which are filled with the corresponding word categories of the recognized words.

4. The method according to any one of claims 1 to 3, characterized in that the words to be recognized are kept in the word databases divided into different word categories.

5. The method according to any one of claims 1 to 4, characterized in that the objects or parts thereof are linked to corresponding action parameters of a voice-controlled application.