WO1989003083A1 - Systems architecture for an acoustic man/machine dialogue system

Info

Publication number: WO1989003083A1
Application number: PCT/DE1988/000596
Authority: WO
Inventors: Lothar Glasser; Harald Höge; Erwin Marschall; Gerhard Niedermair; Montserrat Meya-Llopart; Jorge Romano-Rodriguez; Robert J. Sommer; Otto Schmidbauer; Gregor Thurmair; Hendrich Bunt; Jan B. Van Hemert; Kees Van Deemter; Dieter Mergel; Hermann Ney; Andreas Noll; John H. M. De Vet
Original assignee: Siemens Aktiengesellschaft; N.V. Philips' Gloeilampenfabrieken
Priority date: 1987-09-29
Filing date: 1988-09-27
Publication date: 1989-04-06
Also published as: DE3732849A1

Abstract

The systems architecture described comprises a speech input device for the dialogue system, a configuration system and an adaptation system. Said architecture contains essentially a signal analysis unit (31) which forms an input device for the dialogue system (30) and into which the input speech signal is fed, and a word sequence generating unit (32) connected downstream of the signal analysis unit (31). A phoneme lexicon module (321), a phonetic lexicon module (322), and a speech model module (323) are connected to the word sequence generation unit (32). The architecture also comprises a contents analysis unit (33) connected downstream of the word sequence generation unit (32) for syntactic, semantic, and pragmatic contents analysis, a module for syntactic, semantic, and pragmatic rules (331) and a linguistic lexicon module (332) being connected to the contents analysis unit (33). The architecture further comprises a dialogue control unit (34), connected downstream of the contents analysis unit (33), to which are connected a module (341) for adaptation to an input/output procedure for data processing applications and an answer-generating unit (35) connected to a phonetic-linguistic module (351) for producing a synthetic speech signal and a video signal.

Description

System architecture for an acoustic human / machine dialogue system

The present invention relates to a system architecture for an acoustic human / machine dialog system with a voice input device for voice input into the dialog system, a configuration system and an adaptation system, the voice input device generating an input voice signal.

Nowadays, human-machine communication is largely carried out using mechanical aids such as a keyboard, mouse, light pen, etc. In a dialog system of the type mentioned at the beginning, communication takes place via human language. The dialog system translates the user's wishes, expressed in natural language, into the language of the machine. The machine is usually an EDP system on which an application with highly formalized input / output procedures in machine language is implemented (see FIG. 1).

The voice input / output can take place via a voice terminal with additional aids (image output, light pen, etc.) or via a telephone. Conceivable computer applications are, for example, automatic information and advisory services, such as train and flight information, automatic transfer services, such as booking or ordering from a catalog, or office management services.

In order to implement a dialog system, methods of automatic speech recognition, linguistic text entry and dialog guidance have to be combined in an overall system with a suitable architecture. Some architectures have already been proposed, but they are incomplete in terms of an overall system and sometimes lead to very inefficient implementations, cf. e.g. G. Goodman, R. Reddy, "Alternative Control Structures for Speech Understanding Systems" in "Trends in Speech Recognition", Prentice-Hall, Signal Processing Series, 1980.

The interpretation of fluent spoken language has so far only been realized for very restricted applications in research, and the technical maturity required for practical use has not yet been achieved, cf. e.g. B. Lowerre, R. Reddy, "The Harpy Speech Understanding System" in "Trends in Speech Recognition", Prentice-Hall, Signal Processing Series, 1980.

FIG. 2 shows the basic structure of a human-machine dialog system, which consists of three subsystems: the configuration system, the adaptation system and the dialog system. The core of the system is the dialog system, which conducts the dialog between a user and an EDP application. The configuration system is used to adapt the dialog system to the respective EDP application. The application-specific vocabulary needed for the dialog is entered here with its conceptual relationships (syntactic / semantic / pragmatic relationships).

The task of the adaptation system is to adapt the dialogue system to the voice characteristics of the respective user. This increases the recognition performance of the dialog system, which leads to smoother dialog operation.

The object of the present invention is to provide a system architecture of the type mentioned, with the help of which it is possible to implement a working and efficient human / machine dialog system with which a user can direct instructions, commands, questions, etc. to an EDP system, and which can process answers or queries from the EDP system and, where appropriate, pass them on to the user in the form of synthetic speech and / or in the form of a screen display. The object on which the present invention is based is achieved by a system architecture of the type mentioned at the outset and according to the preamble of patent claim 1, which is characterized according to the invention by the features specified in the characterizing part of patent claim 1.

Advantageous developments of the invention are characterized by the features specified in the subclaims.

The present invention is described in detail below with reference to several figures.

As already explained, FIG. 1 shows the basic structure of a block diagram of an overall system to be implemented, as has already been discussed in the technical field.

FIG. 2 shows, as also already explained, a block diagram of the human / machine dialog system to be provided according to FIG. 1 in more detail.

FIG. 3 shows a block diagram of the system architecture according to the invention of the human / machine dialog system shown in FIG. 2.

The architecture of the dialog system 30, as shown in FIG. 3, consists of a recognition module with the units “signal analysis” 31, “word sequence generation” 32 and “syntactic-semantic-pragmatic content analysis” 33, a dialog control unit 34 with adaptation to the EDP application and response generation unit 35.

The speech signal of a user coming from a microphone is interpreted in the recognition module and brought into a content-oriented representation. Here, the speech signal is first analyzed with regard to language-specific features. In the word sequence generation unit 32, these features are mapped to word sequences using a phonetic word lexicon 322. In general, this mapping is not unambiguous because of the limited acoustic signal analysis, which is taken into account by tracking possible word sequences (word sequence hypotheses) in parallel. The number of word sequence hypotheses can become very large. This effort can be reduced using a language model 323 in which the possible sequences of words for the EDP application are stored, as a result of which only "valid" word sequence hypotheses need to be considered. The check for valid word sequence hypotheses can also be carried out during content analysis, with the meaningful word sequences being filtered out of the word sequence hypotheses on the basis of linguistic rules. In order to obtain the single correct word sequence, statistical methods are additionally used: the probability with which the acoustic features are mapped onto each word sequence is calculated, and the sequence with the highest probability is passed on to the dialog control unit 34 as the interpreted utterance of the user. The dialog control unit 34 decides whether the content of the utterance makes "sense" for the application or whether a further dialog with the user has to be conducted. In the case of a meaningful request, the content-oriented representation of the utterance in the dialog system is converted into a machine language that is understandable for the EDP application. Feedback from the EDP application is converted back into a content-oriented representation of the dialog system 30, and an answer is generated from it. The answer is output either acoustically through speech synthesis or pictorially through an image terminal.
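The selection of a word sequence described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the toy hypotheses, their probabilities, the representation of the language model as a set of permitted word pairs, and the names `is_valid` and `best_hypothesis` are all assumptions made for illustration. The sketch only shows the two steps named in the text, namely discarding word sequence hypotheses that the language model does not allow and then choosing the remaining hypothesis with the highest probability.

```python
# Hypothetical toy data: each hypothesis is (word sequence, acoustic probability).
# All words and numbers are invented for illustration.
hypotheses = [
    (("train", "to", "munich"), 0.30),
    (("rain", "to", "munich"), 0.35),
    (("train", "two", "munich"), 0.20),
]

# Assumed form of the language model: the set of adjacent word pairs
# that the application permits. The patent does not specify a format.
allowed_pairs = {("train", "to"), ("to", "munich")}


def is_valid(words, pairs):
    """Return True if every adjacent word pair is permitted by the language model."""
    return all(pair in pairs for pair in zip(words, words[1:]))


def best_hypothesis(hyps, pairs):
    """Drop invalid word sequences, then return the most probable remaining one."""
    valid = [(words, prob) for words, prob in hyps if is_valid(words, pairs)]
    if not valid:
        return None  # no usable hypothesis; a further dialog would be needed
    return max(valid, key=lambda item: item[1])


result = best_hypothesis(hypotheses, allowed_pairs)
```

In this toy example the acoustically most probable hypothesis ("rain to munich") is rejected by the language model, so `result` holds the sequence "train to munich" with its probability of 0.30.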

The architecture of the dialog system 30 allows simple configuration for various types of EDP applications by restructuring the databases “phonetic lexicon”, “language model”, “linguistic rules” and “word lexicon” and by redesigning the adaptation to the I / O procedure. The adaptation to the speaker characteristics of the user is carried out via a phonetic lexicon 321, into which the speaker-specific data is entered through user training.
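The separation between application-specific and speaker-specific data can be pictured with the following Python sketch. The class names, field names, field types and sample values are hypothetical and chosen only for illustration; the patent names the modules (321 to 351) but does not prescribe any data format. The sketch groups the modules filled by the configuration system apart from the phonetic lexicon 321 filled by the adaptation system, so that switching to another application leaves the user's trained data untouched.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ConfigurationData:
    """Application-specific databases (assumed to be filled by the configuration system)."""
    phonetic_word_lexicon: dict   # module 322
    language_model: frozenset     # module 323
    linguistic_rules: tuple       # module 331
    linguistic_lexicon: dict      # module 332
    io_adapter: str               # module 341, here just a label
    response_lexicon: dict        # module 351


@dataclass(frozen=True)
class AdaptationData:
    """Speaker-specific data (assumed to be filled by user training)."""
    phonetic_lexicon: dict        # module 321


@dataclass(frozen=True)
class DialogSystem:
    configuration: ConfigurationData
    adaptation: AdaptationData


# Invented sample content for two applications and one trained speaker.
train_info = ConfigurationData(
    phonetic_word_lexicon={"train": "t r ey n"},
    language_model=frozenset({("train", "to")}),
    linguistic_rules=("request -> vehicle destination",),
    linguistic_lexicon={"train": "noun"},
    io_adapter="train_information",
    response_lexicon={"departure": "d ih p aa r ch er"},
)
flight_booking = replace(train_info, io_adapter="flight_booking")
speaker = AdaptationData(phonetic_lexicon={"ey": [0.1, 0.7, 0.2]})

system = DialogSystem(configuration=train_info, adaptation=speaker)

# Reconfiguring for another application swaps only the configuration data.
retargeted = replace(system, configuration=flight_booking)
```

After the swap, `retargeted.adaptation` is the same object as `system.adaptation`, which mirrors the idea that speaker training does not have to be repeated when the application changes.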

The architecture according to the invention is also suitable for real-time implementation. Because of the high computing power required, the various modules can be implemented as separate processing units, so that several modules can operate in parallel.
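The idea of modules working in parallel as separate processing units can be sketched in Python as a chain of stages connected by queues. This is only an illustration under stated assumptions: threads stand in for separate processing units, the three stage functions are trivial placeholders rather than real signal analysis, word sequence generation or content analysis, and the names `stage`, `run_pipeline` and `SENTINEL` are hypothetical.

```python
import queue
import threading

SENTINEL = None  # marks the end of the input stream; inputs must not be None


def stage(func, inbox, outbox):
    """Run one module: take items from inbox, process them, pass results on."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # tell the next stage to stop as well
            break
        outbox.put(func(item))


def run_pipeline(funcs, items):
    """Connect the stages with queues and run each stage in its own thread."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(SENTINEL)
    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for thread in threads:
        thread.join()
    return results


# Placeholder stage functions (assumptions, not the patent's algorithms).
def signal_analysis(utterance):
    return utterance.lower()


def word_sequence_generation(features):
    return features.split()


def content_analysis(words):
    return {"words": words, "count": len(words)}


results = run_pipeline(
    [signal_analysis, word_sequence_generation, content_analysis],
    ["Train to Munich", "Flight to Paris"],
)
```

While one stage handles the second utterance, the next stage can already process the first, which is the kind of overlap the separate processing units are meant to provide. Because each stage is a single worker reading from a first-in, first-out queue, the results come out in input order.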

Claims

1. System architecture for an acoustic human / machine dialog system, with a voice input device for voice input into the dialog system, a configuration system and an adaptation system, the voice input device generating an input voice signal, characterized by

- a signal analysis unit (31) which forms an input device of the dialog system (30) and to which the input speech signal is supplied,

- a word sequence generation unit (32) downstream of the signal analysis unit (31) for generating word sequences, to which a phonetic lexicon module (321), a phonetic word lexicon module (322) and a language model module (323) are assigned,

- a content analysis unit (33) downstream of the word sequence generation unit (32) for carrying out a syntactic-semantic-pragmatic content analysis, to which a module for syntactic-semantic-pragmatic rules (331) and a module for a linguistic word lexicon (332) are assigned,

- a dialog control unit (34) which is downstream of the content analysis unit (33) and to which a module (341) for adapting to an input / output procedure for IT applications is assigned,

- a response generation unit (35), to which a "phonetic-linguistic word lexicon" module (351) is assigned, for generating a synthetic speech signal and an image signal, and

- in that the phonetic lexicon module (321) is arranged at an interface between the architecture and the adaptation system (ADS), and all other modules (322, 323, 331, 332, 341, 351) are arranged at an interface between the architecture and the configuration system (KFS).

2. Architecture according to claim 1, characterized in that the speech signal of a user generated by a microphone is interpreted and converted into a content-oriented representation in the recognition module, which is formed from the signal analysis unit (31), the word sequence generation unit (32) and the content analysis unit (33).

3. Architecture according to claim 1, characterized in that the speech signal characteristics are mapped to word sequences in the word sequence generation unit (32) with the aid of the phonetic word lexicon module (322).

4. Architecture according to one of the preceding claims, characterized in that a language model is stored in the language model module (323) in which possible sequences of words are stored on the basis of certain IT applications, with the aid of which only "valid" word sequence hypotheses need to be checked.

5. Architecture according to one of claims 1 to 3, characterized in that the check for "valid" word sequence hypotheses is carried out by a content analysis in the content analysis unit (33), the relevant word sequences being filtered out of the word sequence hypotheses on the basis of linguistic rules.

6. Architecture according to one of the preceding claims, characterized in that, in order to obtain the single correct word sequence, statistical methods are additionally used in which the probability with which the acoustic features are mapped to the word sequence is calculated, and in that the sequence with the highest probability is passed on to the dialog control unit (34) as the interpreted utterance of the user.

7. Architecture according to one of the preceding claims, characterized in that the dialog control unit (34) decides whether the content of the utterance makes "sense" for the EDP application or whether a further dialog must be conducted with the user.

8. Architecture according to one of the preceding claims, characterized in that, in the case of a "meaningful" request, the content-oriented representation of the utterance in the dialog system is converted into a machine language that is understandable for the EDP application in question.

9. Architecture according to one of claims 1 to 7, characterized in that, in the event of feedback from the relevant EDP application, this feedback is converted into a content-oriented representation of the dialog system and a response signal is generated.

10. Architecture according to claim 9, characterized in that outputting of the response signal is carried out either acoustically by speech synthesis or pictorially by an image terminal.

11. Architecture according to claim 1, characterized in that simple configuration for different types of EDP applications is provided by restructuring the databases "phonetic lexicon", "language model", "linguistic rules" and "word lexicon" and by redesigning the adaptation to the relevant input / output procedure.

12. Architecture according to claim 1, characterized in that an adaptation to the speaker characteristics of the user takes place via the phonetic lexicon, into which the speaker-specific data are entered by user training.

13. Architecture according to one of the preceding claims, characterized in that real-time operation is provided.

14. Architecture according to one of the preceding claims, characterized in that different modules are implemented in separate processing units to increase the computing speeds, so that a large number of modules can work in parallel.