CA2231504C - Process for automatic control of one or more devices by voice commands or by real-time voice dialog and apparatus for carrying out this process - Google Patents


Info

Publication number
CA2231504C
Authority
CA
Canada
Prior art keywords
voice
cndot
input
characterized
command
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
CA002231504A
Other languages
French (fr)
Other versions
CA2231504A1 (en
Inventor
Walter Stammler
Fritz Class
Carsten-Uwe Moller
Gerhard Nussle
Frank Reh
Burkard Buschkuhl
Christian Heinrich
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nuance Communications Inc
Original Assignee
Harman Becker Automotive Systems GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Family has litigation
Priority to DE19533541A priority Critical patent/DE19533541C1/en
Priority to DE19533541.4 priority
Application filed by Harman Becker Automotive Systems GmbH filed Critical Harman Becker Automotive Systems GmbH
Priority to PCT/EP1996/003939 priority patent/WO1997010583A1/en
Publication of CA2231504A1 publication Critical patent/CA2231504A1/en
First worldwide family litigation filed litigation Critical https://patents.darts-ip.com/?family=7771821&utm_source=google_patent&utm_medium=platform_link&utm_campaign=public_patent_search&patent=CA2231504(C) "Global patent litigation dataset” by Darts-ip is licensed under a Creative Commons Attribution 4.0 International License.
Publication of CA2231504C publication Critical patent/CA2231504C/en
Application granted granted Critical
Anticipated expiration legal-status Critical
Application status is Expired - Lifetime legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142Hidden Markov Models [HMMs]
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/183Speech classification or search using natural language modelling using context dependencies, e.g. language models

Abstract

The invention pertains to a voice dialog system in which a process for the automatic control of devices by voice dialog is used, applying methods of voice input, voice signal processing and voice recognition, syntactical-grammatical postediting as well as dialog, executive sequencing and interface control, and which is characterized in that syntax and command structures are fixed during real-time dialog operation; preprocessing, recognition and dialog control are designed for operation in a noise-encumbered environment; no user training is required for recognition of general commands; training of individual users is necessary for recognition of special commands; the input of commands is done in linked form, the number of words used to form a command for voice input being variable; a real-time processing and execution of the voice dialog is established; the voice input and output is done in the hands-free mode.

Description

PROCESS FOR AUTOMATIC CONTROL OF ONE OR MORE DEVICES BY
VOICE COMMANDS OR BY REAL-TIME VOICE DIALOG AND APPARATUS
FOR CARRYING OUT THIS PROCESS
The invention concerns a process for automatic control of one or more devices by voice control or by real-time voice dialog, as well as an apparatus for carrying out this process.
Processes or apparatuses of this kind are generally used in the so-called voice dialog systems or voice-operated systems, e.g. for vehicles, computer-controlled robots, machines, plants etc.
In general, a voice dialog system (VDS) can be reduced to the following components:
• A voice recognition system that compares a spoken-in command ("voice command") with other allowed voice commands and decides which command in all probability was spoken in;
• A voice output, which issues the voice commands and signaling sounds necessary for the user control and, if necessary, feeds back the results from the recognizer;
• A dialog control and sequencing control to make it clear to the user which type of input is expected, or to check whether the input that occurred is consistent with the query and the momentary status of the application, and to trigger the resulting action during the application (e.g. the device to be controlled);
• A control interface as application interface: concealed behind this are hardware and software modules for selecting various actuators or computers, which comprise the application;
• A voice-selected application: this can be an order system or an information system, for example, a CAE work station or a wheel chair suitable for a handicapped person.
Without being limited to the general usability of the described processes, devices, and sequences, the present description focuses on the voice recognition, the dialog structure, as well as a special application in motor vehicles.
The difficulties for the solutions known so far include:
a) The necessity for an involved training in order to adapt the system to the characteristics of the respective speaker or an alternating vocabulary. The systems are either completely speaker-independent or completely speaker-dependent or speaker-adaptive, wherein the latter require a training session for each new user. This requires time and greatly reduces the operating comfort if the speakers change frequently. That is the reason why the vocabulary range for traditional systems is small for applications where a frequent change in speakers and a lack of time for the individual speakers must be expected.
b) The insufficient user comfort, which expresses itself in that
- the vocabulary is limited to a minimum to ensure a high recognition reliability;
- the individual words of a command are entered isolated (meaning with pauses in-between);
- individual words must be acknowledged to detect errors;
- multi-stage dialog hierarchies must be processed to control multiple functions;
- a microphone must be held in the hand or a headset (combination of earphones and lip microphone) must be worn.
c) The lack of robustness
- to operating errors;
- to interfering environmental noises.
d) The involved and expensive hardware realization, especially for medium and small production volumes.
It is the object of the invention to specify on the one hand a process, which allows the reliable control or operation of one or several devices by voice commands or by voice dialog in the real-time operation and at the lowest possible expenditure. The object is furthermore to specify a suitable apparatus for carrying out the process to be developed.
In accordance with one aspect of this invention, there is provided a process for the automatic control of one or several devices by voice commands or by voice dialog in the real-time operation, characterized by the following features:
the entered voice commands are recognized by means of a speaker-independent compound-word voice recognizer and a speaker-dependent additional voice recognizer and are classified according to their recognition probability;
recognized, admissible voice commands are checked for their plausibility, and the admissible and plausible voice command with the highest recognition probability is identified as the entered voice command, and functions assigned to this voice command of the device or devices or responses of the voice dialogue system are initiated or generated.
In accordance with another aspect of this invention, there is provided an apparatus for carrying out the above process, in which a voice input/output unit is connected via a voice signal preprocessing unit with a voice recognition unit, which in turn is connected to a sequencing control, a dialog control, and an interface control, characterized in that the voice recognition unit consists of a speaker-independent compound-word recognizer and a speaker-dependent additional voice recognizer, which are both connected on the output side with a unit for syntactical-grammatical or semantical postprocessing that is linked to the sequencing control, the dialog control, and the interface control.
The fact that a reliable control or operation of devices by voice command or real-time voice dialog is possible with relatively low expenditure must be seen as the essential advantage of the invention.
A further essential advantage must be seen in the fact that the system permits a voice command input or voice dialog control that is for the most part adapted to the natural way of speaking, and that an extensive vocabulary of admissible commands is made available to the speaker for this.
A third advantage must be seen in the fact that the system operates failure-tolerant and, in an advantageous modification of the invention, for example, generally recognizes even non-admissible words, names, sounds or word rearrangements in the voice commands entered by the speaker as such and extracts from these entered voice commands admissible voice commands, which the speaker actually intended.
The invention is explained in the following in more detail with the aid of the figures, which show:
Figure 1 The block diagram of a preferred embodiment of the apparatus according to the invention for carrying out the process according to the invention ("voice dialog system");
Figure 2 A detailed illustration of the actual voice dialog system according to figure 1;
Figure 3 The flow diagram for a preferred embodiment showing the segmentation of the input voice commands for a voice dialog system according to figure 2;
Figures 4 and 5 Exemplary embodiments of Hidden-Markov models;
Figure 6 The hardware configuration of a preferred embodiment of the voice dialog system according to figure 2;
Figure 7 The status diagram for the application of the voice dialog system according to figure 2, for a voice-controlled telephone operation;
Figure 8 The flow diagram for operating a telephone according to figure 7;
Figures 9 and 10 The flow diagram for the function "name selection" (figure 9) or "number dialing" (figure 10) when operating a telephone according to the flow diagram based on figure 8.
The voice dialog system (VDS) 1 in figure 1, described in the following, comprises the components voice input (symbolically represented by a microphone 2), voice recognition, dialog control and sequencing control, communication interface and control interface, voice output (with connected speaker 3), as well as an application (exemplary), meaning a device to be controlled or operated by the VDS. VDS and application together form a voice operating system (VOS), which is operated in real-time ("on-line").
The syntax structure and dialog structure as well as the base commands that are mandatory for all users/speakers are created and fixed "off-line" outside of the VDS or the VOS, for example with the aid of a PC work station in the "off-line dialog editor mode" 4, and are then transferred in the form of data files to the VDS or the VOS prior to the start-up, together with the parameters and executive sequencing structures to be specified.
The VDS 1 in figure 1 is shown in detail in figure 2. A microphone (not shown) is connected to an analog/digital converter, which is connected via devices for the echo compensation, the noise reduction and the segmentation to a speaker-independent compound-word voice recognizer and to a speaker-dependent voice recognizer.
The two voice recognizers are connected on the output side to a postprocessing unit for the syntactical-grammatical and semantical processing of the recognizer output signals. This unit, in turn, is connected to the dialog control and the sequencing control, which itself forms the control for the VDS and the devices to be controlled by the VDS. A voice input/output unit is furthermore provided, which includes a voice encoder, a voice decoder and a voice memory.
On the input side, the voice encoder is connected to the device for noise reduction and on the output side to the voice memory. The voice memory is connected on the output side to the voice decoder, which itself is connected on the output side via a digital/analog converter to a speaker (not shown).
The echo compensation device is connected via interfaces with units/sensors (not shown), which supply audio signals that may have to be compensated (referred to as "audio" in the figure).
The speaker-independent compound-word voice recognizer on the one hand comprises a unit for the feature extraction, in which the cepstrum formation takes place and the recognizer is adapted, among other things, to the analog transmission characteristic of the incoming signals and, on the other hand, it has a downstream-connected classification unit.
The speaker-dependent voice recognizer also has a unit for the feature extraction on the one hand and a classification unit on the other hand. In place of the classification unit, it is also possible to add with a selector switch a unit for the input of the speaker-specific additional voice commands that must be trained by the voice recognizer in the training phases before, during or after the real-time operation of the VDS. The speaker-dependent recognizer operates, for example, based on the dynamic-time-warping process (DTW), based on which its classification unit determines the distances between the command to be recognized and the previously trained reference patterns and identifies the reference pattern with the smallest distance as the command to be recognized.
The speaker-dependent recognizer can operate with feature extraction methods such as the ones used in speaker-independent voice recognizers (cepstrum formation, adaptation, etc.).
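The DTW classification described above can be illustrated with a minimal sketch. It assumes a Euclidean local distance and the simplest step pattern, neither of which is specified in the description; the reference patterns and the one-dimensional "feature vectors" are hypothetical placeholders.

```python
def dtw_distance(a, b):
    """Dynamic-time-warping distance between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated distance when aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean local distance (an assumption, not taken from the text)
            local = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify(utterance, references):
    """Return the label of the trained reference pattern with the smallest distance."""
    return min(references, key=lambda label: dtw_distance(utterance, references[label]))


# Hypothetical trained reference patterns (one feature per frame for brevity).
references = {
    "uncle Willi": [[0.0], [1.0], [2.0], [1.0]],
    "Gloria": [[3.0], [3.0], [0.0], [0.0]],
}
spoken = [[0.0], [1.0], [1.0], [2.0], [1.0]]  # slower rendition of the first pattern
result = classify(spoken, references)
```

The time warping absorbs the repeated frame in `spoken`, so the slower rendition still matches its reference pattern exactly.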
On the output side, the two recognizers are connected to the postprocessing unit for the syntactical-grammatical and semantical processing of the recognizer output signals (object and function of this unit are explained later on). The dialog control that is connected to the sequencing control is connected downstream of the postprocessing unit on the output side. Dialog and sequencing control together form the VDS control unit, which controls the preprocessing, the voice input unit and the voice output unit, the two recognizers, the postprocessing unit, the communication interface and the control interface, as well as the devices to be controlled or operated (the latter via suitable interfaces - as shown in figure 2).

The mode of operation for the VDS is explained in more detail in the following.
As previously explained, the VDS contains two different types of voice recognizers for recognizing specified voice commands. The two recognizers can be characterized as follows:
• Speaker-independent recognizer: the speaker-independent recognition of words spoken in linked form. This permits the recognition of general control commands, numbers, names, letters, etc., without requiring that the speaker or user train one or several of the words ahead of time.
The input furthermore can be in the compound-word mode, meaning a combination of several words, numbers, names results in a command, which is spoken in linked form, meaning without interruption (e.g. the command: "circle with radius one"). The classification algorithm is an HMM (Hidden Markov Model) recognizer, which essentially builds on phonemes (sound subunits) and/or whole-word models and composes words or commands from this. The vocabulary and the commands ("syntax structure") constructed from this are fixed ahead of time in the laboratory and are transmitted to the recognizer in the form of data files ("off-line dialog editing mode"). In the real-time operation, the vocabulary and syntax structure of the independent recognizer cannot be modified by the user.
• Speaker-dependent recognizer: Speaker-dependent recognition of user-specific/speaker-specific names or functions, which the user/speaker defines and trains. The user/speaker has the option of setting up or editing a personal vocabulary in the form of name lists, function lists, etc. The user/speaker consequently can select his/her personal vocabulary and adapt this vocabulary at any time "on-line," that is in the real-time operation, to his/her needs.
The "list of names" can be cited as an example for use in the telephone environment, meaning a list of names of telephone subscribers compiled individually by the user/speaker, wherein
- during a training phase, the respective name is spoken in once or several times by the user (e.g. "uncle Willi") and a telephone number is assigned to the name via keyboard input, but preferably via the independent voice recognizer;
- at the conclusion of the above training and assigning of the number, the user only supplies a name to the speaker-dependent recognizer ("uncle Willi"), but not the associated telephone number, which is already known to the system.
The speaker-dependent recognizer is:
- in the most simple form designed as a single-word recognizer;
- in the more powerful form designed as a compound-word recognizer, which is connected without interface to the speaker-independent recognizer (e.g. "call uncle Willi" as a complete command, wherein the word "call" is part of the speaker-independent vocabulary and "uncle Willi" is part of the speaker-dependent vocabulary).
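The interplay of the fixed vocabulary and the user-trained name list can be sketched as follows. This is only an illustration on already-recognized text, not on audio; the command table, the action identifiers and the telephone number are hypothetical placeholders.

```python
# Fixed, speaker-independent part (hypothetical entries, set up off-line).
FIXED_COMMANDS = {"call": "DIAL", "new target": "SET_DESTINATION"}

# Speaker-dependent part: filled by the user during on-line training.
user_names = {}


def train_name(name, value):
    """Store a user-defined name together with its assigned number or address."""
    user_names[name.lower()] = value


def interpret(utterance):
    """Split a compound command into fixed keyword and user-defined name."""
    text = utterance.lower()
    for phrase, action in FIXED_COMMANDS.items():
        if text.startswith(phrase + " "):
            name = text[len(phrase) + 1:]
            if name in user_names:
                return action, user_names[name]
    return None  # no admissible combination found


train_name("uncle Willi", "0123 456")  # placeholder number
```

With this table, `interpret("call uncle Willi")` yields the dial action together with the stored number, without the user having to speak the number again.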
Following the voice recognition, a postprocessing of the results of the two voice recognizers, which are encumbered with a certain recognition probability, takes place in the postprocessing unit.
The speaker-independent compound-word voice recognizer, for example, supplies several sentence hypotheses in a sequence, which represents the recognition probabilities. These sentence hypotheses as a rule already take into account the allowed syntax structure. Where this is not the case, non-admissible word sequences are separated out or evaluated based on different criteria within the syntactical postprocessing (figure 2), to determine the probability of the therein occurring word combination. The sentence hypotheses generated by the voice recognizers are furthermore checked as to their semantical plausibility, and the hypothesis with the highest probability is then selected.
A correctly recognized voice command is passed on to the dialog control and subsequently leads to an intervention, assigned to this voice command, in the application, wherein the message is transmitted via the control interface. If necessary, the recognized voice command is also (or only) transmitted from the dialog control to the voice output and is issued there.
The system outlined here is characterized in the "on-line" operation by a fixed syntax structure and a fixed command structure as well as by a combination of fixed vocabulary (speaker-independent recognizer) and freely definable vocabulary such as names (speaker-dependent recognizer).
This framework, which initially appears to be inflexible, is a precondition for a high recognition capacity with an extensive vocabulary (at the present time up to several hundred words), e.g. for a noise-encumbered environment, for changing acoustic conditions in the passenger compartment, as well as for a variety of speakers. The extensive vocabulary is used to increase the user friendliness by using synonymous words or different variations in the pronunciation. Also, the syntax permits the rearranging of words in the voice command, for example as follows:
"larger radius for left circle"
or, as an alternative to this,
"For the left circle a larger radius"
wherein these alternatives, however, must be defined from the beginning during the setting up with the "off-line dialog editor."
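One way to picture such a predefined syntax is a table in which every admissible word order is listed explicitly and mapped to the same command. The table layout and the command identifier below are assumptions for illustration only; the description does not specify the data file format.

```python
# Hypothetical export of an off-line dialog editor: each admissible word
# sequence is listed explicitly; both orders lead to the same command.
SYNTAX = {
    ("larger", "radius", "for", "left", "circle"): "ENLARGE_LEFT_CIRCLE",
    ("for", "the", "left", "circle", "a", "larger", "radius"): "ENLARGE_LEFT_CIRCLE",
}


def lookup_command(utterance):
    """Return the command for an admissible word sequence, else None."""
    words = tuple(utterance.lower().split())
    return SYNTAX.get(words)
```

A word order that was not defined beforehand, such as "radius larger circle left for", is not found in the table and is therefore rejected.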
The approach to a solution outlined here proves to be advantageous, in particular because
• the compound-word input of commands is more natural and faster than the input of isolated words. It has turned out in practical operations that the inexperienced user has difficulty getting used to speaking in a choppy manner (with clear pauses in-between) in order to enter a multiword command (that is why the acceptance of such systems is clearly lower);
• the input of, for example, number columns or letter columns in a compound form is easier and requires less concentration than the individual input;
• the dialog control is more natural, for example, as not every individual number must be acknowledged in number columns, but only the entered number block;
• owing to the vocabulary of, for example, up to several hundred words, a plurality of functions that previously required a manual operation can be operated by voice;
• the number of manual switching elements can be reduced or the hands can otherwise be used during the voice input, e.g. for the quality control of motors.
The user comfort is further increased in the present system through the advantageous use of hands-free microphones in place of (or to complement) headsets (earphones and lip microphone) or a hand-held microphone. However, the use of a hands-free microphone generally requires a powerful noise reduction (figure 2) and, if necessary, an echo compensation of signals, e.g. coming from the dialog speaker or other speakers. These measures may also be necessary when using a headset or hand-held microphone, depending on the application or noise level.
The echo compensation in particular permits the user/speaker to interrupt the voice output, meaning to address the recognizer while the voice output is active.
The vocabulary and the commands furthermore can be changed at any time in the laboratory via "off-line dialog editor," without requiring a new training with a plurality of speakers for the new words of the speaker-independent recognizer. The reason for this is that the database for speaker-independent phonemes and/or speaker-independent whole-word models exists in the laboratory and that with the existing development environment, new words and commands can be generated without problems from these phonemes or whole-word models. In the final analysis, a command or vocabulary change amounts to transferring the new parameters and data, computed in the laboratory with the development system, as a data file to the speaker-independent "real-time recognizer" and storing them in the memory there.
It is possible with the aid of the VDS to operate functions within the computer, of which the VDS is an integral component, as well as to operate external devices. In addition to a PCMCIA interface, the VDS, for example, also has interfaces that are accessible to external devices. These include, for example, a V.24 interface, an optical data control bus, a CAN interface, etc. The VDS can be provided optionally with additional interfaces.
The VDS is preferably activated by actuating a push-to-talk key (PTT key) or through a defined key word. The system is shut down by entering a respective voice command ("termination command") at defined locations in the dialog, or at any time by actuating the PTT key or an escape key, or automatically through the internal sequencing control if, following a time interval that is specified by the VDS or is adjusted adaptively to the respective user and/or following a query by the VDS, no voice input has taken place, or if the dialog selected by the user has been completed as planned (e.g. the desired telephone number has been transmitted to the telephone for making a connection). In a low-noise environment, the VDS can also be activated continuously.
Description of the sequence
It must be stressed at this point that the VDS in figure 2 is only one example for a voice dialog system possible in accordance with the invention. The configuration of the interfaces for the data input or the data output or the control of the connected components is also shown only as an example here.
The functional blocks shown in figure 2 are explained in more detail in the following:
1. Echo compensation
The digitized speaker signals, e.g. from the voice output or a turned-on radio, are subtracted from the microphone signal via the echo compensation, using adaptive filter algorithms. The filter algorithms model the echo path from the speaker to the microphone.
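The description does not name a particular adaptive filter algorithm. As one common choice, a normalized LMS (NLMS) filter is sketched below; the filter length, the step size `mu` and the simulated two-tap echo path are assumptions made for this example.

```python
import random


def nlms_echo_cancel(mic, ref, taps=4, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `ref` from `mic` (NLMS sketch)."""
    w = [0.0] * taps        # estimated echo path (FIR coefficients)
    history = [0.0] * taps  # most recent loudspeaker samples, newest first
    out = []
    for d, x in zip(mic, ref):
        history = [x] + history[:-1]
        echo_estimate = sum(wi * hi for wi, hi in zip(w, history))
        e = d - echo_estimate  # echo-compensated microphone sample
        norm = sum(h * h for h in history) + eps
        # Normalized gradient step towards the true echo path
        w = [wi + mu * e * hi / norm for wi, hi in zip(w, history)]
        out.append(e)
    return out, w


# Simulated loudspeaker signal and a hypothetical two-tap echo path (0.6, 0.3).
rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(500)]
mic = [0.6 * ref[n] + (0.3 * ref[n - 1] if n > 0 else 0.0) for n in range(len(ref))]
residual, w = nlms_echo_cancel(mic, ref)
```

In this noise-free simulation the coefficients converge to the simulated echo path and the residual signal approaches zero; with a talker present, the residual would contain the talker's voice.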
2. Noise reduction
The noise reduction makes it possible to differentiate stationary or quasi-stationary environmental noises from the digitized voice signal and to subtract these from the voice signal. Noises of this type are, for example, driving noises in a motor vehicle (MV), environmental noises inside laboratories and offices such as fan noises, or machine noises in factory buildings.
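The text does not specify how the noise is estimated or subtracted. A minimal sketch, assuming simple magnitude spectral subtraction with a slowly tracked noise estimate, could look as follows; the smoothing factor `alpha` and the spectral floor are hypothetical parameters.

```python
def update_noise(noise_mag, frame_mag, alpha=0.9):
    """Track (quasi-)stationary noise from a frame judged to contain no speech."""
    return [alpha * n + (1.0 - alpha) * s for n, s in zip(noise_mag, frame_mag)]


def subtract_noise(frame_mag, noise_mag, floor=0.05):
    """Subtract the noise estimate per frequency bin, keeping a small floor."""
    return [max(s - n, floor * s) for s, n in zip(frame_mag, noise_mag)]


# Example: magnitude spectrum of one frame and a noise estimate (made-up values).
cleaned = subtract_noise([1.0, 0.5, 0.2], [0.2, 0.2, 0.3])
```

The floor prevents negative magnitudes in bins where the noise estimate exceeds the current frame, as in the third bin of the example.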

3. Segmentation:
As shown in figure 3, the segmentation is based on spectrally transformed data. For this, the signals are combined block by block to form so-called "frames" and are converted to the frequency domain with the aid of a Fast Fourier Transformation (FFT). By forming the magnitude and weighting with an auditory MEL filter, meaning a filter that models the perception of pitch, for which an auditory division of the voice range (approx. 200 Hz to 6 kHz) into individual frequency ranges ("channels") is carried out, the spectral values are combined to form channel vectors, which indicate the power in the various frequency bands. This is followed by a rough segmentation that is permanently active and roughly detects the beginning and the end of the command, as well as a precise segmentation, which subsequently determines the exact limits.
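Two of the steps above lend themselves to a short sketch: the division of the voice range into mel-spaced channels and the rough detection of the command boundaries. The mel formula used here is the widely used 2595·log10(1 + f/700) variant, and the band limits, channel count and energy threshold are assumptions; the description does not state how the rough segmentation decides.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def channel_edges(n_channels, f_lo=200.0, f_hi=6000.0):
    """Band edges in Hz that are equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
    step = (hi - lo) / n_channels
    return [mel_to_hz(lo + i * step) for i in range(n_channels + 1)]


def rough_segment(frame_energies, threshold):
    """Return (first, last) index of frames above the threshold, or None."""
    active = [i for i, e in enumerate(frame_energies) if e > threshold]
    return (active[0], active[-1]) if active else None


edges = channel_edges(16)
bounds = rough_segment([0.1, 0.2, 5.0, 6.0, 4.0, 0.3, 0.1], threshold=1.0)
```

The channels become wider towards higher frequencies, which mirrors the coarser frequency resolution of hearing in that range.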
4. Feature extraction
The feature extractor computes feature vectors over several stages from the digitized and segmented voice signals and determines the associated standardized energy value.
For this, the channel vectors are transformed in the speaker-independent recognizer with a discrete cosine transformation (DCT) to cepstral vectors. In addition, the energy of the signal is calculated and standardized. Parallel to this, the mean of the cepstral values is calculated continuously, with the goal of adapting the recognizer to the momentary speaker as well as to the transmission characteristics, e.g. of the microphone and the channel (speaker → microphone). The cepstral vectors are freed of this adapted mean value and are combined with the previously calculated standardized energy to so-called CMF vectors (cepstral coefficients, mean value free).
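The chain of DCT, running mean and mean subtraction can be sketched as below. The sketch assumes that the channel values are already logarithmized and uses an exponentially weighted mean with a hypothetical adaptation rate `alpha`; neither detail is given in the description.

```python
import math


def dct_cepstrum(log_channels, n_ceps):
    """DCT-II of the (log) channel vector gives the cepstral coefficients."""
    n = len(log_channels)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(log_channels))
        for k in range(n_ceps)
    ]


class RunningMean:
    """Continuously updated mean vector (exponential weighting is an assumption)."""

    def __init__(self, size, alpha=0.05):
        self.mean = [0.0] * size
        self.alpha = alpha

    def update(self, vec):
        self.mean = [(1 - self.alpha) * m + self.alpha * v for m, v in zip(self.mean, vec)]
        return self.mean


def cmf_vector(log_channels, energy, tracker, n_ceps=4):
    """Mean-free cepstral coefficients combined with the standardized energy."""
    ceps = dct_cepstrum(log_channels, n_ceps)
    mean = tracker.update(ceps)
    return [c - m for c, m in zip(ceps, mean)] + [energy]
```

For a constant input, the mean converges to the cepstral vector itself, so the mean-free part tends to zero. This is how a fixed microphone or channel characteristic is removed from the features.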
5. Classification of the speaker-independent compound-word voice recognizer
5.1 Hidden-Markov-Model (HMM)
A Hidden-Markov-Model is a collection of states connected to each other by transitions (figure 4).
Each transition from a state qi to another state qj is described by a so-called transition probability. A vector of so-called emission probabilities with length M is assigned to each node (state). The connection to the physical world is made via these emission probabilities. The model idea goes so far as to state that in a specific state qi, one of M different symbols is "emitted" in accordance with the emission probability related to the state. The symbols represent the feature vectors.
The sequence of "emitted" symbols generated by the model is visible. However, the concrete sequence of the states passed through within the model is not visible ("hidden").

A Hidden-Markov-Model is defined by the following quantities:
• T: number of symbols
• t: point in time for an observed symbol, t = 1 ... T
• N: number of states (nodes) of the model
• M: number of possible symbols (= code book value)
• Q: states of the model {q1, q2, ... qN}
• V: set of symbols that are possible
• A: transition probability from one state to another
• B: probability for an output symbol in a model state (emission probability)
• π: probability for the initial state of the model (during the HMM training).

Output symbols can be generated with the aid of this model and using the probability distributions A and B.
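As an illustration of this generative view, the following sketch draws a state sequence and a symbol sequence from a discrete HMM. The function name and the deterministic toy model are hypothetical; symbols and states are represented by their indices.

```python
import random


def sample_hmm(pi, A, B, length, rng):
    """Generate (states, symbols) from a discrete HMM with parameters pi, A, B."""

    def draw(probs):
        return rng.choices(range(len(probs)), weights=probs)[0]

    state = draw(pi)  # initial state according to pi
    states, symbols = [], []
    for _ in range(length):
        states.append(state)
        symbols.append(draw(B[state]))  # visible: emitted symbol
        state = draw(A[state])          # hidden: transition to the next state
    return states, symbols


# Toy model: two states that alternate, each emitting "its own" symbol.
pi = [1.0, 0.0]
A = [[0.0, 1.0], [1.0, 0.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
states, symbols = sample_hmm(pi, A, B, 4, random.Random(0))
```

Only `symbols` would be observable to a recognizer; `states` is returned here purely to make the hidden part visible in the example.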

5.2 Design of the phoneme-based HMM recognizer
The word recognition for a voice recognition system with a larger vocabulary is usefully based not on whole words, but on phonetic word subunits. Such a word subunit is, for example, a phoneme, a diphone (double phoneme) or a phoneme transition.
A word to be recognized is then represented by the linking of the respective models for word subunits. Figure 5 shows such an example of a representation with linked Hidden-Markov-Models (HMM), on the one hand by the standard phonetic description of the word "frying" (figure 5a) and on the other hand by the phonetic description of the pronunciation variants (figure 5b). When setting up the system, these word subunits are trained with random samples from many speakers and form the data base on which the "off-line dialog editor" builds.
This concept with word subunits has the advantage that new words can be incorporated relatively easily into the existing dictionary since the parameters for the word subunits are already known.
Theoretically, an arbitrarily large vocabulary can be recognized with this recognizer. In practical operations, however, limits will be encountered owing to a limited computing power and the recognition capacity necessary for the respective application.
The classification is based on the so-called Viterbi algorithm, which is used to compute the probability of each word for the arriving symbol sequence, wherein a word here must be understood as a linking of various phonemes. The Viterbi algorithm is complemented by a word sequence statistic ("language model"), meaning the multiword commands specified in the "off-line dialog editor" supply the allowed word combinations. In the extreme case, the classification also includes the recognizing and separating out of filler phonemes (ah, hm, pauses, throat clearing sound) or garbage words ("non-words"). Garbage words are language complements, which are added by the speaker - unnecessarily - to the actual voice commands, but which are not part of the vocabularies of the voice recognizer. For example, the speaker can further expand the command "circle with radius one" by using terms such as "I now would like to have a..." or "please a...." Depending on the application or the scope of the necessary vocabulary, these phoneme-based Hidden-Markov-Models can also be complemented by or expanded with Hidden-Markov-Models based on whole words.
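The core of the Viterbi algorithm can be shown on a small discrete HMM. The sketch below finds the most probable state path for one model; scoring several word models and applying the language model, as described above, are left out, and the two-state example values are made up for illustration.

```python
import math


def viterbi(obs, pi, A, B):
    """Most probable state path and its log-probability for a discrete HMM."""

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n = len(pi)
    score = [log(pi[s]) + log(B[s][obs[0]]) for s in range(n)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev, score, ptr = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + log(A[r][s]))
            ptr.append(best)
            score.append(prev[best] + log(A[best][s]) + log(B[s][o]))
        back.append(ptr)
    # Trace the best path backwards from the best final state
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1], max(score)


# Made-up two-state model with three possible symbols (0, 1, 2).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
path, logp = viterbi([0, 1, 2], pi, A, B)
```

Working in log-probabilities avoids numerical underflow for long symbol sequences. In a recognizer, the same recursion would run over the linked subunit models of each admissible word.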

6. Speaker-dependent recognizer
The speaker-dependent recognition is based on the same preprocessing as is used for the speaker-independent recognizer. Different approaches to a solution are known from the literature (e.g. "dynamic time warping" (DTW), neural net classifiers), which permit real-time training. Above all, this concerns individual word recognizers, wherein the dynamic time warping process is preferably used in this case.
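Dynamic time warping can be sketched in a few lines. The example below assumes one scalar feature per frame and invented template values; a real recognizer would compare feature vectors produced by the preprocessing stage.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def recognize(utterance, templates):
    """Return the name of the stored template closest to the utterance."""
    return min(templates, key=lambda name: dtw_distance(utterance, templates[name]))


# Hypothetical reference patterns recorded during training.
TEMPLATES = {
    "Gloria": [1, 3, 4, 3, 1],
    "uncle Willi": [5, 5, 2, 2, 5, 5],
}
```

Because a template is simply a stored recording, training reduces to saving the user's utterance, which is why this approach suits real-time training of individual words.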
In order to increase the user friendliness, the VDS described here uses a combination of a speaker-independent (compare point 5) and a speaker-dependent recognizer in the compound word mode ("call Gloria," "new target uncle Willi," "show function oblique ellipse"), wherein the words "Gloria," "uncle Willi," "oblique ellipse" were selected freely by the user during the training and were recorded in respective lists, together with the associated telephone numbers/target addresses/function descriptions. The advantage of this approach to a solution is that one to two (or if necessary even more) dialog steps are saved.

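The compound word mode described in point 6 can be pictured as a two-part lookup: the speaker-independent recognizer supplies the command words, and the speaker-dependent recognizer supplies the user-trained name. The sketch below assumes hypothetical list names and stored values and is not taken from the document.

```python
# Fixed command prefixes (speaker-independent) and the list each one selects.
COMMAND_LISTS = {
    ("call",): "telephone_directory",
    ("new", "target"): "target_list",
    ("show", "function"): "function_list",
}

# User-trained entries (speaker-dependent); all values are invented.
USER_LISTS = {
    "telephone_directory": {"Gloria": "0123456"},
    "target_list": {"uncle Willi": "Example Street 1"},
    "function_list": {"oblique ellipse": "ellipse(rotated=True)"},
}


def resolve(command_words, name):
    """Map a compound command to (list name, stored value), or None."""
    list_name = COMMAND_LISTS.get(tuple(command_words))
    if list_name is None:
        return None
    value = USER_LISTS[list_name].get(name)
    if value is None:
        return None
    return list_name, value
```

A single utterance such as "call Gloria" thus selects both the list and the entry, which is where the saved dialog steps come from.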
7. Postprocessing: check of syntax and semantics:
The VDS includes an efficient postprocessing of the results, supplied by the voice recognizers. This includes a check of the syntax to detect whether the determined sentence hypotheses correspond to the a priori fixed configuration of the voice command ("syntax"). If this is not the case, the respective hypotheses are discarded. In individual cases, this syntactical analysis can be partially or totally integrated into the recognizer itself, e.g. in that the syntax is already taken into account in the decision trees of the classifier.
The sentence hypotheses supplied by the voice recognizer are also checked as to their meaning and plausibility.
Following this plausibility check, the active sentence hypothesis is either transmitted to the dialog control or rejected.
In case of a rejection, the next most probable hypothesis of the voice recognizer is taken and treated the same way.
In case of a syntactically correct and plausible command, this command is transmitted together with the description of the meaning to the dialog control.
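The postprocessing loop can be sketched as follows. The syntax table, the meaning labels and the plausibility rule are assumptions chosen for illustration; the document does not specify how meanings are represented.

```python
# Hypothetical fixed syntax: allowed word sequences and their meanings.
SYNTAX = {
    ("dial", "number"): "dial_number",
    ("delete", "name"): "delete_name",
}


def postprocess(hypotheses, syntax, is_plausible):
    """Return the best (words, meaning) that passes both checks, or None.

    hypotheses is a list of (score, words) pairs from the recognizer.
    """
    for _, words in sorted(hypotheses, key=lambda h: h[0], reverse=True):
        meaning = syntax.get(tuple(words))
        if meaning is None:
            continue  # violates the fixed command syntax: discard
        if not is_plausible(meaning):
            continue  # syntactically valid but implausible: reject
        return words, meaning
    return None


directory = {}  # empty telephone directory, so deleting a name is implausible
result = postprocess(
    [(0.9, ["dial", "lumber"]), (0.8, ["delete", "name"]), (0.7, ["dial", "number"])],
    SYNTAX,
    lambda meaning: meaning != "delete_name" or bool(directory),
)
```

Here the top hypothesis fails the syntax check, the second fails the plausibility check, and the third is passed on to the dialog control together with its meaning.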
8. Dialog and sequence control
The dialog control reacts to the recognized sentence and determines the functions to be carried out. For example, it determines:
- which repetition requests, information or queries are issued to the user;
- which actuators are to be addressed in what way;
- which system modules are active (speaker-independent recognizer, training);
- which partial-word vocabularies (partial vocabularies) are active for the response expected to come next (e.g. numbers only).
The dialog control furthermore maintains a general view of the application status, as far as this is communicated to the VDS.
Underlying the dialog control is the sequence control, which controls the individual processes logically and temporally.
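One way to picture how the dialog control selects the active module and partial vocabulary is a table keyed by dialog state. The state names and vocabularies below are hypothetical and only illustrate the idea of restricting the recognizer to the response expected next.

```python
DIGITS = {"zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"}

# Hypothetical table: dialog state -> (active recognizer, partial vocabulary).
ACTIVE = {
    "command": ("speaker-independent",
                {"number dialing", "name selection", "terminate"}),
    "number_input": ("speaker-independent",
                     DIGITS | {"dialing", "error", "repeat", "terminate"}),
    # User-trained names have no fixed vocabulary.
    "name_input": ("speaker-dependent", None),
}


def is_admissible(state, word):
    """Check whether a word belongs to the vocabulary active in this state."""
    _, vocabulary = ACTIVE[state]
    return vocabulary is None or word in vocabulary
```

Limiting the active vocabulary in this way reduces the number of competing hypotheses and therefore the computing effort per dialog step.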

9. Interface for communication and control
This is where the communication with the connected peripheral devices, including the devices to be operated, takes place.
Various interfaces are available for this. However, not all these interfaces are generally required by the VDS. The options named in figure 2 are only examples of an implementation. The interface for communication and control among other things also handles the voice input and output, e.g. via the A/D or D/A converter.
10. Voice input/output
The voice input/output is composed of a "voice signal compression module" (= "voice encoder"), which removes the redundancy or irrelevancy from the digitized voice signal and thus can store a voice signal of a given length in considerably less memory than would be needed directly after the A/D conversion. The compressed information is stored in a voice memory and is regenerated for the output in the "voice decoder," so that the originally input word can be heard once more. Given the presently available encoding and decoding processes, the loss in quality that may occur during playback is within acceptable limits.
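The document does not name a particular encoding process. As a stand-in, the following sketch uses mu-law companding, a simple and well-known form of lossy waveform coding, to show how a sample can be stored in 8 bits and regenerated with a small error. The function names and the 8-bit code width are assumptions; a speech codec of the kind the text alludes to would compress far more strongly.

```python
import math

MU = 255.0  # companding parameter commonly used with 8-bit mu-law


def encode(sample):
    """Compress a sample in [-1, 1] to an 8-bit code in 0..255."""
    x = max(-1.0, min(1.0, sample))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1.0) / 2.0 * 255))


def decode(code):
    """Expand an 8-bit code back to an approximate sample in [-1, 1]."""
    y = code / 255.0 * 2.0 - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
```

Storing one byte per sample instead of a 16-bit value halves the memory requirement, at the cost of a quantization error that grows with the signal amplitude.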
A number of commands, auxiliary texts or instructions are stored from the start in the voice memory for the dialog control ("off-line dialog editor"), which are designed to aid the user during the operation or to supply him/her with information from the application side.
Furthermore, the voice encoding is activated during the training for the speaker-dependent recognizer since the name spoken in by the user is also stored in the voice memory. By listening to the name list or the function list, the user can be informed acoustically at any time of the content, that is to say the individual names or functions.
With respect to the algorithm for the voice encoding and decoding, it is possible to use processes, for example, which are known from the voice transmission under the catchword "source coding" and which are implemented with software on a programmable processor.
Figure 6 shows an example of a possible hardware configuration of the VDS according to figure 2. The configuration of the individual function blocks as well as the interfaces to the data input and the data output or for the control of the connected components is shown only as an example in this case. The active stock of words (vocabulary) assumed here for speaker-independently spoken words can comprise, for example, several hundred words.
The digital signal processor (DSP) is a commercially available, programmable processor, which is distinguished from a microprocessor by having a different bus architecture (e.g. Harvard architecture instead of Von-Neumann architecture), special "on-chip" hardware arithmetic logic units (multipliers/accumulators/shifters, etc.) and I/O functionalities, which are necessary for the real-time digital signal processing. Powerful RISC processors increasingly offer similar functionalities as the DSP's and, if necessary, can replace these.
The digital signal processor shown here (or another microprocessor with comparable capacity) can process all functions shown in figure 2 with the aid of software or integrated hardware, with the exception of special interface control functions. With the DSP's that are presently available commercially and the concept presented here, vocabularies of several hundred words (an example) can be realized, wherein it is assumed that this vocabulary is available completely as "active vocabulary" and is not reduced considerably through forming partial vocabularies. In the event that partial vocabularies are formed, each of these can comprise the aforementioned size.
The use of the hardware structure according to figure 6 and especially omitting the additional special components for the recognition and/or the dialog control, sequencing control, voice encoding and interface protocol processing, offers the chance of a realization with compact, cost-effective hardware with low current consumption. In the future, DSP's will have higher arithmetic capacities and higher storage capacities owing to the technological improvements, and it will be possible to address larger external storage areas, so that more extensive vocabularies or more powerful algorithms can be realized.
The VDS is activated by the "push-to-talk" key (PTT) connected to the DSP. Actuating this key causes the control software to start the recognition process. In detail, the following additional hardware modules exist besides the DSP:
A/D and D/A converter:
Via a connected A/D and D/A converter:
- the microphone signal and, if necessary, the speaker signals are digitized and transmitted to the DSP for further processing;
- the digitized voice data for the voice output/dialog control are converted back into an analog signal, are amplified and transmitted to a suitable playback medium (e.g. a speaker).
D2B optical:
This is an optical bus system, which can be used to control diverse audio devices and information devices (e. g. car radio and CD changer, car telephone and navigation equipment, etc.).
This bus not only transmits control data, but also audio data.
In the extreme case (meaning if it is used to transmit microphone and speaker signals), the A/D and D/A conversion in the VDS can be omitted.
CAN bus:
This is a bus system, which can be used to control information devices and actuators in the motor vehicle. As a rule, an audio transmission is not possible.
V.24 interface:
This interface can be used to control diverse peripheral devices. The VDS software can furthermore be updated via this interface. A respective vocabulary or a corresponding language (e.g. German, English, French...) can thus be loaded in.
PCMCIA interface:
In addition to communicating with a desktop or portable computer, this interface also functions to supply voltage to the VDS. Several of the above-listed functions can be combined here. In addition to the electrical qualities, this interface can also determine the mechanical dimensions of the VDS. These can be selected, for example, such that the VDS can be plugged into a PCMCIA port of a desktop or portable computer.
Memory:
The memory (data/program RAM and ROM) connected to the DSP serves as data and program storage for the DSP. It furthermore includes the specific classification models and, if necessary, the reference patterns for the two voice recognizers and the fixed texts for the dialog control and the user prompting. The user-specific information (address list, data list) is filed in a FLASH memory or a battery-buffered memory.
The hardware configuration outlined here, in particular with respect to the interfaces, depends strongly on the respective application or the special client requirements and is described here in examples for several application cases. The selection of interfaces can be totally different for other applications (e.g. when linking it to a PC or a work station or when using it in portable telephones). The A/D and the D/A converters can also be integrated on the DSP already.

Function description using the example of a voice-operated car telephone
The dialog sequences are described in the following with the example of a voice-controlled telephone control (e.g. in a motor vehicle).
This example can be expanded to the selecting of telephone and radio and/or CD and/or navigation in the motor vehicle or the operation of a CAE work station or the like.
Characteristic for each of these examples is:
- The speaker-independent recognition of multiword commands, as well as letter columns and number columns.
- The speaker-dependent input of a freely selected name or function word, previously trained by the user, which is associated with a function, a number code (e.g. telephone number of a telephone directory or station frequency of a radio station list) or a letter combination (e.g. target location for navigation systems).
- In the process of defining the association, the user enters the function, letter combination or number combination in the speaker-independent compound-word mode (wherein the function, the letters, the numbers must be included in the admissible vocabulary, meaning they must be initially fixed with the "off-line dialog editor").
- This name selection is always linked to the management of a corresponding list of different names or function words of the same user (telephone directory, station list, target location list). This list can be expanded, deleted, polled or corrected.
Diagram of VDS states (figure 7):
When operating the telephone via the voice input, the VDS assumes different states, some of which are shown as examples in figure 7 (deactivated state; command mode "telephone"; number input or number dialing, as well as input or selection of name in connection with the selection function; number input or name training in connection with the storage function; name deleting or complete or selective deleting of telephone directory in connection with the delete function). The transitions are controlled by issuing voice commands ("number dialing," "name selection," "name storage," "number storage," "termination," "deleting"), wherein the VDS is activated by actuating the PTT key. A dialog termination occurs, for example, through the input of a special termination command ("terminate") or by activating an escape key.
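The state transitions described for figure 7 can be sketched as a small table-driven state machine. The state identifiers and the event names "PTT" and "ESCAPE" are hypothetical labels; the voice commands are those quoted in the text.

```python
# (current state, event) -> next state; events are voice commands or keys.
TRANSITIONS = {
    ("deactivated", "PTT"): "command",
    ("command", "number dialing"): "number_dialing",
    ("command", "name selection"): "name_selection",
    ("command", "name storage"): "name_storage",
    ("command", "number storage"): "number_storage",
    ("command", "deleting"): "deleting",
}

# A termination command or the escape key ends the dialog from any state.
TERMINATION_EVENTS = {"termination", "terminate", "ESCAPE"}


def next_state(state, event):
    """Return the follow-up state; unknown events leave the state unchanged."""
    if event in TERMINATION_EVENTS:
        return "deactivated"
    return TRANSITIONS.get((state, event), state)
```

Keeping the transitions in a table means that further functions can be added without changing the control logic itself.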
Operating state "deactivated":
The voice dialog system is not ready for recognition when in this state. However, it is advantageous if parts of the signal processing software are continuously active (noise reduction, echo compensation) in order to update the noise and echo state permanently.
Operating state "active" (figure 8):
The voice dialog system has been activated with the PTT key and is now awaiting the commands, which are allowed for the further control of the peripheral devices (telephone). The function sequences of the operating state "active" are shown in figure 8 in the form of a flow diagram (as example), that is to say for the functions "select telephone directory," "delete telephone directory," "delete name," "select name," "dial number," "store name," "store number," "listen to telephone directory," and the associated actions and reactions (output of name lists, complete or selective deleting, name selection or number selection, number input or name training). Of course, these functions can be complemented or expanded if necessary, or can be replaced partially or totally by other functions.
It must be mentioned in general in this connection that the activated VDS can be deactivated at any time, meaning also during one of the function sequences explained further in the following, with the result that the function sequence, which may not be complete, is terminated or interrupted. The VDS can be deactivated, for example, at any time by actuating, if necessary, the existing escape key or the input of a special termination command (e.g. "stop," "terminate," or the like) at defined locations in the dialog.
Operating state "name selection" (figure 9):
This state presumes the correct recognition of the respective voice command "name selection" or "telephone name selection" or the like. It is possible in this state to dial a telephone number by entering a name. For this, a switch to a speaker-dependent voice recognizer is made.

The voice dialog system requests the input of a name. This name is acknowledged for the user. The voice dialog system then switches again to the speaker-independent recognizer. If the name was recognized correctly, the telephone number assigned to the name is transmitted to the telephone where the connection to the respective telephone subscriber is made.
If the name was misunderstood, a dialing of the telephone number can be prevented through a termination function (e.g. by activating the escape key). Alternatively, a request for repetition from the VDS is conceivable, to determine whether the action/function assigned to the voice command must be carried out or not.
Depending on the effort or the storage capacity, the telephone directory can comprise, for example, 50 or more stored names. The function sequences for the operating state "name selection" are shown in the form of a flow diagram in figure 9.
Operating state "number dialing" (figure 10):
This state presumes a correct recognition of the respective voice command (e.g. "number dialing" or the like). A telephone number is dialed in this state by entering a number sequence. The input is made in a linked form (if necessary in blocks) and speaker-independent. In this operating state, the VDS requests the input of a number. The user then enters the number either as a whole or in individual blocks as voice command. The entered numbers or the respectively entered number block is acknowledged for the user following the input of the respective voice command.
Following the request "dialing," the number is transmitted to the telephone where the connection is made to the respective telephone subscriber. If the number was misunderstood, then the number can be corrected or deleted with an error function, or the voice operation can be terminated via a termination function, for example with a command "terminate," that is to say the VDS can be deactivated.
The function sequences of the operating state "number dialing" are shown in the form of a flow diagram in figure 10.
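The block-wise number input with correction commands can be sketched as follows. The class and method names are assumptions, recognized digit blocks are represented as digit strings, and the command words "error," "wrong," "delete," "repeat" and "dialing" are the ones the document itself uses for number input.

```python
class NumberInput:
    """Collects spoken number blocks and reacts to correction commands."""

    def __init__(self):
        self.blocks = []

    def handle(self, utterance):
        """Return (voice feedback, finished number or None)."""
        if utterance in ("error", "wrong"):
            if self.blocks:
                self.blocks.pop()  # drop only the last block
            return " ".join(self.blocks), None
        if utterance == "delete":
            self.blocks.clear()  # drop everything entered so far
            return "", None
        if utterance == "repeat":
            return " ".join(self.blocks), None
        if utterance == "dialing":
            return "", "".join(self.blocks)
        if utterance.isdigit():
            self.blocks.append(utterance)
            return utterance, None  # acknowledge the block just entered
        return "please repeat", None
```

Echoing the remaining blocks after an "error" command uses the same blocking as the input, so the user hears the number in the grouping in which it was spoken.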
Operating state "connection":
The telephone connection to the desired telephone subscriber is established. The voice recognition unit is deactivated in this state. The telephone conversation is ended, for example, by using the escape key.
Operating state "store number/store names"
After the VDS has requested that the user/speaker input the numbers, following the voice command "store number" or "store name," and after the user has spoken those in (compare operating state "number dialing"), the command "storing" or a comparable command is input in place of the command "dialing." The telephone number is then stored. The VDS subsequently requests that the user speak in the associated name and makes sure that the name input is repeated once or several times to improve the training result.
Following this repetition, the dialog is ended. For completeness, it should be noted that the initial number input can be controlled with dialog commands such as "terminating" or "termination," "repeat," "correct" or "correction," "error," etc.
Operating state "delete telephone directory/delete name"
In connection with the "telephone directory" (list of all trained names and associated telephone numbers), a number of editing functions are defined, which increase the system comfort for the user, for example:

Deleting of telephone directory:
A complete or selective deleting, wherein an accidental deleting caused by recognition errors is avoided through a repetition request by the VDS ("are you sure?") prior to the final deleting and, if necessary, an output of the specific name.
Name deleting:
The VDS urges the user to speak in the name to be deleted.
The name is then repeated by the VDS. With the question "are you sure?" the user is subsequently urged to confirm the deleting operation:
The input of the voice command "yes" triggers the deleting of the name from the telephone directory.
Any other word input as a voice command will end the dialog.
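The confirmation step for deleting a name can be sketched with two callbacks standing in for voice input and voice output. The callback names and the wording of the first prompt are assumptions; the question "Are you sure?" and the rule that only "yes" triggers the deletion come from the text.

```python
def delete_name_dialog(directory, listen, say):
    """Run the delete-name dialog; return True if an entry was deleted.

    listen() returns the next recognized utterance, say(text) outputs speech.
    """
    say("Please speak the name to be deleted.")
    name = listen()
    say(name)  # the VDS repeats the name it understood
    say("Are you sure?")
    if listen() == "yes" and name in directory:
        del directory[name]
        return True
    return False  # any other word ends the dialog without deleting
```

Treating every answer other than "yes" as a termination keeps a recognition error from deleting an entry by accident.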
Operating state "listen to telephone directory":
The VDS announces the content of the total telephone directory. An acknowledgment of the PTT key (see the note below) or the input of the termination command terminates the announcement or the dialog.
Note: The German word "Bestätigen" means "confirm," whereas "Betätigen" means "activate" or "actuate."
Operating state "telephone directory dialing":
The VDS announces the content of the complete telephone directory. If a termination or dialing command is issued following the announcement of the desired name, or if the PTT key is actuated, then the selected name is announced once more and the following question is asked: "should this number be dialed?"
The input of the voice command "yes" triggers the dialing operation, meaning the connection is established.
A "no" causes the VDS to continue the announcement of the telephone directory. The voice command "termination," "terminate," or the like or an actuation of the escape key ends the announcement or the dialog.
The two last-named functions "listen to telephone directory" and "telephone directory dialing" can also be combined to form a single function. This can be done, for example, if the PTT key is actuated following the relevant name during the function "listen to telephone directory," and if the VDS initiates the dialing operation, e.g. following the announcement "the name 'uncle Willi' is selected."

By taking into account further applications, the characteristics of the above-described VDS can be summarized as follows:
Used is a process for the automatic control and/or operation of one or several devices by voice command or by voice dialog in real-time operation, in which processes for the voice output, voice signal preprocessing and voice recognition, syntactical-grammatical postprocessing as well as dialog control, sequence control, and interface control are used. In its basic version, the process is characterized in the "on-line" operation by a fixed syntax structure and a fixed command structure, as well as a combination of fixed vocabulary (speaker-independent recognizer) and freely definable vocabulary, e.g. names or function values (speaker-dependent recognizer). In advantageous embodiments and modifications, it can be characterized through a series of features, based on which it is provided that:
- Syntax structure and command structure are fixed during the real-time operation;
- Preprocessing, recognition and dialog control are configured for the operation in a noise-encumbered environment;
- No user training is required ("speaker-independence") for the recognition of general commands, names or data;
- Training is necessary for the recognition of specific names, data or commands of individual users ("speaker-dependence" for user-specific names or function words);
- The input of commands, names or data is preferably done in a linked form, wherein the number of words used to form a command for the voice input varies, meaning that not only one or two word commands, but also three, four or more word commands can be defined;
- A real-time processing and executing of the voice dialog is ensured;
- The voice input and the voice output occur not or not only via a hand-held device, earphones, headset or the like, but preferably in the hands-free operation;
- The speaker echoes recorded during the hands-free talking into the microphone are electrically compensated (echo compensation) to permit a simultaneous operation of voice input and speaker (e.g. for a voice output, ready signals, etc.);
- There is a continuous automatic adaptation to the analog transmission characteristic (acoustics, microphone and amplifier characteristic, speaker characteristic) during the operation;
- In the "off-line dialog editor," the syntax structure, the dialog structure, the vocabulary and the pronunciation variants for the recognizer can be reconfigured and fixed, without this requiring additional or new voice recordings for the independent recognizer;
- The voice range for the voice output is fixed in the off-line dialog editor, wherein a) the registered voice signals are subjected to a digital voice data compression ("voice encoding"), are subsequently stored, and a corresponding voice decoding takes place during the real-time operation and following the reading-out of the memory, or b) the voice content was previously stored in the form of text and is subjected during the real-time voice output operation to a "text-to-voice" synthesis ("text-to-speech" synthesis);
- The word order can be changed by interchanging individual words in a command;
- Predetermined synonymous words can be used;
- The same function can be realized through commands with a different number of words (e. g. through two-word or three-word commands);
- Additional words or phoneme units can be added to the useful vocabulary ("non-words," "garbage words") or word spotting approaches can be used to recognize and subsequently remove interjections such as "ah," "hm," "please," or other commands that do not belong to the vocabulary;
- The dialog structure is distinguished by the following characteristics:
- a flat hierarchy, meaning a few hierarchy planes, preferably one or two selection planes;
- integrating of "ellipses," that is to say omitting the repeating of complete command sentences with several command words and instead a limiting to short commands, e.g. "further," "higher," "stronger," wherein the system knows from the respectively preceding command what this statement refers to;
- including of the help menu or the information menu;
- including of repetition requests from the VDS in case of unsure decisions by the recognizer ("what do you mean,"
"please repeat," "and further");
- including of voice outputs in order to ensure that the recognition is increased by stimulating certain manners of speaking (e.g. by the query: "please louder");
- The voice recognition is activated by a one-time actuation of a push-to-talk key (PTT key) and this is acknowledged acoustically (e.g. with a beeping sound) to indicate that the input can now take place;
- It is not necessary to actuate the PTT key if a voice input is required following a repetition request by the voice output, wherein the PTT key
- either performs or comprises multiple functions, for example during the telephoning ("hanging up the receiver," "lifting off the receiver") or during the restart of the voice dialog system or the termination of a telephone dialing operation;
- or is complemented by additional switches, e.g. permitting a restart or the termination of a function/action ("escape key"); if necessary, the PTT and the termination function can be integrated into one single lever (e.g. triggering the PTT function by pulling the lever toward oneself; triggering the termination function by pushing the lever away);
- The dialog system has one or more of the following performance features:
- the specific (e.g. trained) commands, data, names, or parameters of the various users are stored on demand for a later use;
- the commands or names trained by the speaker are not only supplied to the recognition system during the training phase, but are also recorded as to their time history, are fed to a data compression ("voice encoding") and are stored in a non-volatile memory in order to provide the user with the updated status by reading it out;
- the commands or names trained by the speaker are processed during the training phase in such a way that environmental noises are for the most part compensated during the recording;
- If necessary, the completion of a recognition operation is optically or acoustically acknowledged ("beeping" sound or the like) or, alternatively (and if necessary only), the recognition result is repeated acoustically (voice output) for decisions relevant to safety, time, or costs, and that the user has the option of stopping the execution of the respective action through a voice command or by activating a switch (e. g. the escape key);
- The voice dialog system is connected to an optical display medium (LCD display, monitor, or the like), wherein the optical display medium can take over individual, several, or all of the following functions:
- output of the recognized commands for control purposes;
- display of the functions adjusted by the target device as reaction to the voice command;
- display of various functions/alternatives, which are subsequently adjusted or selected or modified via voice command;
- Each user can set up his/her own name lists or abbreviation lists (comparable to a telephone directory or address book), wherein
- the name trained by the user on the speaker-dependent recognizer is associated with a number sequence, a letter sequence or a command or a command sequence, input in the speaker-independent operating mode;
- in place of the renewed input of the number sequence, letter sequence, or command sequence, the user enters the list designation and the name selected by him/her, or a suitable command is entered in addition to the name, which suggests the correct list;

- the list can be expanded at any time through additional entries by voice control;
- the list can be deleted either completely or selectively by voice control;
- the list can be listened to for a voice command, wherein the names entered by the user and, if necessary, the associated number sequence, letter sequence or commands can be output acoustically;
- the acoustical output of the list can be terminated at any point in time;
- A sequence of numbers (number column) can be spoken in either continuously (linked together) or in blocks, wherein the VDS
preferably exhibits one or more or all of the following characteristics:
- an acknowledgment follows each input pause in that the last input block is repeated by the voice output;
- following the acknowledgment through a command "error,"
"wrong," or the like, the last input block is deleted and the remaining, stored blocks are output acoustically;

- following the acknowledgment through a command "delete"
or a similar command input, all entered number blocks are deleted;
- following the acknowledgment through a command "repeat"
or the like, the blocks stored until then are output acoustically;
- following the acknowledgment through a command "termination" or a similar command input, the input of the number column is terminated completely;
- additional numbers or number blocks can be input following the acknowledgment;
- the input of numbers is concluded with a suitable command following the acknowledgment;
- the same blocking as for the input is used for the output of the numbers spoken in so far, which output follows the command "error" or the like or the command "repeat;"
- A sequence of letters (letter column) is spoken in, which is provided for selecting complex functions or the input of a plurality of information bits, wherein the letter column is input in a linked form or in blocks and the VDS preferably exhibits one or several or all of the following characteristics:
- an acknowledgment follows each input pause, in that the last input block is repeated by the voice output;
- following the acknowledgment through a command "error,"
"wrong," or the like, the last input block is deleted and the remaining, stored blocks are output acoustically;
- following the acknowledgment through a command "delete"
or the like, all input letters are deleted and this is followed by a new input;
- following the acknowledgment through a command "repeat"
or the like, the blocks stored so far are output acoustically;
- additional letters or letter blocks are input following the acknowledgment;
- if necessary, the letter column is matched to a stored word list and the most suitable word(s) is (are) extracted from this; alternatively, this matching can already take place following the input of the individual letter blocks;
- following the acknowledgment through a command "termination" or a similar command input, the input of the letter column is terminated completely;
- the letter input is concluded with a suitable command following the acknowledgment.
- The volume of the voice output and the "beep" sound must be adapted to the environmental noises, wherein the environmental noises are detected during the speaking pauses with respect to their strength and characteristic.
- That access to the voice dialog system or access to the user-specific data/commands is possible only after special key words or pass words have been input or after special key words or pass words have been entered by an authorized speaker whose speech characteristics are known to the dialog system and checked by the dialog system.
- That voice outputs with a longer duration (e. g. information menus) can be terminated prematurely through spoken termination commands, or the PTT, or the escape key.
- That the voice dialog system in one of the following forms either complements or replaces the manual operation of the above functions (e. g. via switch, key, rotary knob):
- using the voice command does not replace any manual operation, but exists along with the manual operation (meaning the operation can at any time be performed or continued manually);
- some special performance characteristics can be activated only via voice input, but that the essential device functions and operating functions continue to be controlled manually as well as by voice;
- the number of manual operating elements is clearly reduced and individual keys or rotary knobs take over multiple functions; manual operating elements are assigned a special function by voice; only the essential operating functions can still be actuated manually; the operating functions are based, however, on the voice command control.

- That a plurality of different devices as well as device functions can be made to respond and can be modified with a single multiword command, and an involved, multistage mode of action (e. g. selection of device in the first step, followed by selection of function in step 2, and subsequently selection of the type of change in step 3) is thus not required.
- That the voice dialog system in the motor vehicle is used for one or several of the functions named in the following:
- the operation of individual or several devices, e.g. a car telephone, car radio (if necessary with tape deck, CD changer, sound system), navigation system, emergency call, telematics services, onboard monitor, air-conditioning system, heating, travel computer, lighting, sun roof, window opener, seat adjuster, seat heater, rear-windshield heater, mirror adjuster and memory, seat adjuster and memory, steering wheel adjuster and memory, etc.;
- information polling of parameters, e.g. oil pressure, oil temperature, cooling-water temperature, consumption, tire pressure, etc.;
- information on measures required in special situations, e.g. if the cooling-water temperature is too high, the tire pressure is too low, etc.;
- warning the driver of defects in the vehicle.
- The voice-controlled selection of a new station in the car radio preferably occurs in accordance with one of the following sequences:
- issuing command for the search operation up or down;
- voice input of the station frequency, preferably in the colloquial form (e.g. "one hundred three comma seven" or "hundred three comma seven," "hundred and three comma seven" or including the frequency information (e.g. "hundred three comma seven megahertz"));
- voice input of the commonly used station name (e.g. "SDR1").
- That for the air-conditioning system, it is possible to set the desired temperature (if necessary staggered according to location in the passenger cell of the motor vehicle, divided into left, right, front, back) not only relatively, but preferably also absolutely (meaning as to degree, Fahrenheit, or the like) and that commands for a minimum, maximum, or average temperature or the normal temperature can additionally be issued; the operating states for the fan in the passenger space can be set in a similar way.
- The navigation system is informed of a target location (location name, street name) by entering letter columns in the "spelling mode," wherein it is also sufficient to use the beginning of the name for the input and wherein the navigation system, if necessary, offers several candidates for selection.
- One or several of the following, user-specific name lists are set up:
- a list for storing telephone numbers under predetermined names/abbreviations;
- a list for storing targets for the navigation system under predetermined names/abbreviations;
- a list for storing function names for commands or command sequences;
- a list for storing car radio station frequencies under station names or abbreviations that can be specified.
- The output sound level of the voice output and the "beeping" sound, if necessary also the sound level of the radio, are set or adaptively adjusted by taking into account one or several of the following parameters:
- the vehicle speed;
- the rotational number;
- the opening width for the window and the sun roof;
- the fan setting;
- the vehicle type;
- the importance of the voice output in the respective dialog situation.
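By way of illustration only, the adaptive adjustment of the output sound level from the parameters listed above could be sketched as follows. The function name `output_gain_db`, the linear weighting, and every numeric weight are assumptions made for this sketch; the description does not specify how the parameters are combined.

```python
def output_gain_db(speed_kmh, engine_rpm, opening_fraction, fan_level,
                   importance, vehicle_offset_db=0.0):
    """Return an extra playback gain in dB for voice output and beep.

    All weights are made-up illustration values, not taken from the patent.
    """
    gain = vehicle_offset_db          # fixed offset per vehicle type
    gain += 0.05 * speed_kmh          # road and wind noise grow with speed
    gain += 0.001 * engine_rpm        # engine noise (rotational number)
    gain += 6.0 * opening_fraction    # window or sun roof, 0.0 closed .. 1.0 open
    gain += 1.0 * fan_level           # fan stage, e.g. 0 .. 4
    gain += 3.0 * importance          # 0 = routine prompt, 1 = important output
    return max(0.0, min(gain, 20.0))  # clamp to a safe range


parked = output_gain_db(0, 0, 0.0, 0, 0)
motorway = output_gain_db(100, 3000, 0.5, 2, 1)
```

A weighted sum with a clamp is only one possible design; a lookup table per vehicle type would serve the same purpose.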
For one preferred embodiment of the described voice dialog system, it is provided, among other things, that executive sequence control, dialog control, interface control, voice input/output, as well as voice signal preprocessing, recognition, syntactical-grammatical and semantical postprocessing are carried out with the aid of microprocessors and signal processors, memories and interface modules, but preferably with a single digital signal processor or microprocessor, as well as the required external data memories and program memories, the interfaces and the associated driver modules, the clock generator, the control logic and the microphones and speakers, including the associated converters and amplifiers necessary for the voice input/output, as well as a push-to-talk (PTT) key and an escape key if necessary.
It is furthermore possible that with the aid of one or several interfaces:
- data and/or parameters can be loaded or reloaded in order to realize, for example, process changes or a voice dialog system for another language;
- the syntax structure, dialog structure, executive sequencing control, voice output etc., which are fixed or modified on a separate computer, are transferred to the voice dialog system ("off-line dialog editor");
- the VDS can request and collect status information or diagnostic information;

- the voice dialog system is linked via a bus system and/or a ring-shaped net with several of the devices to be actuated (in place of point-to-point connections to the individual devices) and that control data or audio signals or status information from the motor vehicle or the devices to be serviced are transmitted via this bus or the net;
- the individual devices to be selected do not respectively comprise their own voice dialog system, but are serviced by a single (joint) voice dialog system;
- one or several interfaces to vehicle components or vehicle computers exist, which are used to transmit information on permanent or actual vehicle data to the voice dialog system, e.g. speed, engine temperature, etc.;
- the voice dialog system takes over other functions such as the radio, telephone, or the like during the waiting period (in which there is no voice input or output);
- a multilingual, speaker-independent dialog system is set up with the aid of an expanded memory, which permits a quick switching between the dialog systems of various languages;

- an optical display is coupled with the voice dialog system via a special interface or via the bus connection, wherein this bus preferably is an optical data bus and that control signals as well as audio signals can be transmitted via this bus.
It is understood that the invention is not limited to the embodiments and application examples shown here, but can be transferred to others in a corresponding way. Thus, it is conceivable, for example, that such a voice dialog system is used to operate an electronic dictionary or an electronic dictation or translation system.
One special embodiment of the invention consists in that for relatively limited applications with little syntax, the syntactical check is incorporated into the recognition process in the form of a syntactical bigram language model and the syntactical postprocessing can thus be eliminated; for complex problem definitions, the interface between recognizer and postprocessing no longer consists of individual sentences, but a so-called "word hypotheses net," from which the most suitable sentence is extracted in a postprocessing stage and on the basis of predetermined syntactical values with special pairing strategies.
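As a minimal sketch of the syntactical bigram idea, a word sequence can be treated as admissible only if every adjacent word pair, including the sentence start and end, is permitted. The vocabulary, the marker symbols `<s>` and `</s>`, and the function name `is_admissible` are assumptions for this example. In an actual recognizer the bigram would constrain the search itself rather than filter finished sentences.

```python
# Hypothetical toy syntax: which word may follow which.
ALLOWED_BIGRAMS = {
    ("<s>", "radio"), ("radio", "on"), ("radio", "off"),
    ("on", "</s>"), ("off", "</s>"),
    ("<s>", "dial"), ("dial", "number"), ("number", "</s>"),
}


def is_admissible(words, allowed=ALLOWED_BIGRAMS):
    """Return True if every neighbouring word pair is in the bigram set."""
    sequence = ["<s>", *words, "</s>"]
    return all(pair in allowed for pair in zip(sequence, sequence[1:]))


ok = is_admissible(["radio", "on"])
swapped = is_admissible(["on", "radio"])
```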
It is furthermore possible to provide an output unit (e.g. display) that operates on an optical basis as a complement or alternative to the voice output, which output unit displays the entered voice command, for example, in the form recognized by the VDS.
Finally, it is conceivable that the activated VDS can also be deactivated in that no new voice command is input by the user/speaker during a prolonged interval, which is either specified by the system or adaptively adjusted to the user/speaker.
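The deactivation after a prolonged interval without voice input could, for example, be sketched as below. The class name `Activation`, the default interval of 8 seconds, and the use of caller-supplied timestamps are assumptions for illustration; an adaptive variant would additionally adjust `timeout_s` to the observed speaking pauses of the user.

```python
class Activation:
    """Tracks whether the voice dialog system is still activated."""

    def __init__(self, timeout_s=8.0):
        self.timeout_s = timeout_s
        self.last_input_s = None  # None means deactivated

    def activate(self, now_s):
        # E.g. triggered by the push-to-talk key.
        self.last_input_s = now_s

    def is_active(self, now_s):
        return (self.last_input_s is not None
                and now_s - self.last_input_s < self.timeout_s)

    def on_voice_command(self, now_s):
        """Accept a command only while active; each command restarts the interval."""
        if self.is_active(now_s):
            self.last_input_s = now_s
            return True
        self.last_input_s = None
        return False


vds = Activation(timeout_s=5.0)
vds.activate(0.0)
accepted = vds.on_voice_command(4.0)   # within the interval
rejected = vds.on_voice_command(20.0)  # interval elapsed, system deactivated
```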

Abbreviations
PTT push-to-talk
HMM Hidden Markov Models
DTW dynamic time warping
CMF cepstral vectors mean-value free
DCT discrete cosine transformation
FFT Fast Fourier Transformation
LDA linear discriminant analysis
PCM pulse code modulation
VQ vector quantization
SDS voice dialog system
SBS voice operating system

Claims (58)

1. A process for the automatic control of one or several devices by voice commands or by voice dialog in the real-time operation, characterized by the following features:
- the entered voice commands are recognized by means of a speaker-independent compound-word voice recognizer and a speaker-dependent additional voice recognizer and are classified according to their recognition probability;
- recognized, admissible voice commands are checked for their plausibility, and the admissible and plausible voice command with the highest recognition probability is identified as the entered voice command, and functions assigned to this voice command of the device or devices or responses of the voice dialogue system are initiated or generated.
2. A process according to claim 1, characterized by the following features:
• the voice commands or the voice dialog are or is formed or controlled on the basis of at least one syntax structure, at least one base command vocabulary and, if necessary, at least one speaker-specific additional command vocabulary;
• the at least one syntax structure and the at least one base command vocabulary are provided in speaker-independent form and are fixed during the real-time operation;
• the at least one speaker-specific additional command vocabulary is entered and/or changed by the respective speaker in that during training phases within and/or outside of the real-time operation, an additional voice recognizer that operates on the basis of a speaker-dependent recognition method is trained by the respective speaker through single or multiple input of the additional commands for the voice-specific features of the respective speaker;
• in real-time operation, the voice dialog and/or the control of the device or devices takes place as follows:
- voice commands spoken in by the respective speaker are transmitted to a speaker-independent compound-word recognizer operating on the basis of phonemes and/or whole-word models and to the speaker-dependent additional voice recognizer, where they are respectively subjected to a feature extraction and
- are examined and classified in the compound-word voice recognizer with the aid of the features extracted there to determine the existence of base commands from respective base command vocabulary according to the respectively specified syntax structure, and
- are examined and classified in the speaker-dependent additional voice recognizer with the aid of the features extracted there to determine the existence of additional commands from the respective additional command vocabulary;
- the commands that have been classified as recognized with a certain probability and the syntax structures of the two voice recognizers are then joined to form hypothetical voice commands, and that these are examined and classified according to the specified syntax structure as to their reliability and recognition probability;
- the admissible hypothetical voice commands are subsequently examined as to their plausibility on the basis of predetermined criteria, and that among the hypothetical voice commands recognized as plausible, the one with the highest recognition probability is selected and is identified as the voice command entered by the respective speaker;
- that subsequently
- a function or functions assigned to the identified voice command of the respective device or devices to be controlled is or are initiated and/or
- a response or responses is or are generated in accordance with a specified voice dialog structure for continuing the voice dialog.
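For illustration, the final selection step of claims 1 and 2 (keep the admissible hypotheses, check plausibility, take the one with the highest recognition probability) might be sketched as follows. The function `select_command`, the toy syntax, the placeholder names, and the probabilities are hypothetical; how the two recognizers produce and join their hypotheses is not modeled here.

```python
def select_command(hypotheses, is_admissible, is_plausible):
    """hypotheses: iterable of (words, probability) pairs.

    Returns the words of the best admissible and plausible hypothesis,
    or None if nothing survives both checks.
    """
    candidates = [(words, prob) for words, prob in hypotheses
                  if is_admissible(words) and is_plausible(words)]
    if not candidates:
        return None  # the dialog could then ask the speaker to repeat
    return max(candidates, key=lambda item: item[1])[0]


# Hypothetical joined hypotheses: a base command from the speaker-independent
# recognizer plus a trained name from the speaker-dependent recognizer.
KNOWN_NAMES = {"name_a", "name_b"}
hypotheses = [
    (("radio", "name_a"), 0.80),  # best score, but violates the toy syntax
    (("call", "name_c"), 0.75),   # admissible, but no such trained name
    (("call", "name_a"), 0.70),
    (("call", "name_b"), 0.40),
]


def is_admissible(words):
    return len(words) == 2 and words[0] == "call"


def is_plausible(words):
    return words[1] in KNOWN_NAMES


best = select_command(hypotheses, is_admissible, is_plausible)
```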
3. A process according to one of the claims 1 or 2, characterized in that the input of voice commands occurs acoustically and preferably in hands-off operation.
4. A process according to any one of claims 1 to 3, characterized in that acoustically input voice commands are transmitted noise-reduced to the two voice recognizers, in that noise signals, caused by stationary or quasi-stationary environmental noises, are compensated in the voice signal receiving channel in front of the two voice recognizers and preferably by means of adaptive digital filtering methods.
5. A process according to any one of claims 1 to 4, characterized in that acoustically input voice commands are transmitted echo-compensated to the two voice recognizers, in that signals of a voice or music output unit that are fed back into the voice signal receiving channel are compensated in a voice signal receiving channel in front of the two voice recognizers, in particular in front of the noise reduction unit, and preferably by means of adaptive digital filtering methods.
6. A process according to any one of claims 1 to 5, characterized in that the entered voice commands are combined in blocks after digitizing, are converted to a frequency range following a weighting by means of a spectral transformation, preferably a Fast Fourier Transformation (FFT), and are subsequently combined to form channel vectors through sum formation and subsequent audio-related MEL filtering, and that this is followed by a segmentation.
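The block-wise preprocessing of claim 6 can be sketched roughly as below, assuming a Hamming window as the weighting, a naive discrete Fourier transform in place of an FFT (same result, slower), and triangular filters spaced on the common mel scale. The function names, the 8 kHz sampling rate, the block length of 64 samples, and the number of channels are assumptions for this example; the segmentation step is not shown.

```python
import cmath
import math


def power_spectrum(block):
    """Weight one block with a Hamming window and return its power spectrum."""
    n = len(block)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    x = [s * w for s, w in zip(block, window)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def mel_channels(spectrum, sample_rate, n_channels):
    """Sum spectral power into triangular, mel-spaced channels."""
    nyquist = sample_rate / 2
    top = hz_to_mel(nyquist)
    edges = [mel_to_hz(top * i / (n_channels + 1)) for i in range(n_channels + 2)]
    channels = []
    for c in range(n_channels):
        lo, mid, hi = edges[c], edges[c + 1], edges[c + 2]
        total = 0.0
        for k, power in enumerate(spectrum):
            f = k * nyquist / (len(spectrum) - 1)
            if lo < f <= mid:
                total += power * (f - lo) / (mid - lo)   # rising slope
            elif mid < f < hi:
                total += power * (hi - f) / (hi - mid)   # falling slope
        channels.append(total)
    return channels


# One 64-sample block of a 1 kHz tone sampled at 8 kHz.
block = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
channel_vector = mel_channels(power_spectrum(block), 8000, 8)
```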
7. A process according to claim 6, characterized in that the segmentation is divided into a rough and precise segmentation.
8. A process according to one of the claims 6 or 7, characterized in that the feature extraction is carried out in the speaker-independent compound-word recognizer in such a way that
• the channel vectors are transformed with a discrete cosine transformation into cepstral vectors;
• additionally the energy of the associated signal is calculated and standardized;
• in order to adapt the recognizer to the respective speaker and/or the respective transmission characteristics of the voice signal receiving channel, a cepstral vector mean value is constantly computed and is subtracted from the cepstral vectors;
• the cepstral vectors freed of the cepstral vector mean value and a computed, standardized signal energy are combined to form mean-value free cepstral coefficients.
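A minimal sketch of the feature extraction in claim 8 follows. It assumes that the channel vectors are already logarithmically compressed, that the running mean is an exponentially weighted average with a made-up factor `alpha`, and that the energy is standardized by dividing by the largest energy in the utterance. None of these details is given in the claim, and the names `dct2` and `mean_free_features` are hypothetical.

```python
import math


def dct2(vector, n_coeffs):
    """Unnormalized DCT-II of one channel vector, first n_coeffs coefficients."""
    n = len(vector)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(vector))
            for k in range(n_coeffs)]


def mean_free_features(channel_vectors, energies, n_coeffs=4, alpha=0.9):
    """Return one feature vector per frame: mean-free cepstrum + energy."""
    peak = max(energies) or 1.0          # assumption: normalize by peak energy
    mean = None
    features = []
    for channels, energy in zip(channel_vectors, energies):
        cepstrum = dct2(channels, n_coeffs)
        if mean is None:
            mean = list(cepstrum)        # initialize with the first frame
        else:                            # running (constantly updated) mean
            mean = [alpha * m + (1 - alpha) * c for m, c in zip(mean, cepstrum)]
        mean_free = [c - m for c, m in zip(cepstrum, mean)]
        features.append(mean_free + [energy / peak])
    return features


# A constant channel vector carries no information after mean subtraction.
frames = [[1.0, 2.0, 3.0, 4.0]] * 3
features = mean_free_features(frames, energies=[2.0, 4.0, 1.0])
```

Subtracting the running mean removes constant offsets in the cepstral domain, which is why it compensates for a fixed speaker or channel characteristic.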
9. A process according to any one of claims 1 to 8, characterized in that the speaker-independent compound-word recognizer uses Hidden Markov Models (HMM) based on phonemes and/or whole words for the classification.
10. A process according to claim 9, characterized in that the classification is carried out with the aid of the Viterbi algorithm and that the Viterbi algorithm preferably is complemented by a specified word-sequence statistic.
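The Viterbi algorithm named in claim 10 finds the most probable state sequence of a Hidden Markov Model for a given observation sequence. The following is a generic textbook sketch in the log domain, not the recognizer of the patent; the two-state model and its probabilities are invented for the example. A word-sequence statistic would enter such a search as additional transition scores between word models.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (log-probability, state path) of the most probable path."""
    # best[s] = (score, path) of the best path that ends in state s
    best = {s: (log_start[s] + log_emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        updated = {}
        for s in states:
            score, path = max(
                ((best[p][0] + log_trans[p][s], best[p][1]) for p in states),
                key=lambda item: item[0])
            updated[s] = (score + log_emit[s][obs], path + [s])
        best = updated
    return max(best.values(), key=lambda item: item[0])


def logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {key: math.log(value) for key, value in table.items()}


# Invented two-state model with discrete observation symbols "x" and "y".
states = ["a", "b"]
log_start = logs({"a": 0.5, "b": 0.5})
log_trans = {"a": logs({"a": 0.7, "b": 0.3}), "b": logs({"a": 0.3, "b": 0.7})}
log_emit = {"a": logs({"x": 0.9, "y": 0.1}), "b": logs({"x": 0.1, "y": 0.9})}

score, path = viterbi(["x", "x", "y"], states, log_start, log_trans, log_emit)
```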
11. A process according to any one of claims 1 to 10, characterized in that for the classification, filler words or filler phonemes or other faulty commands not included in the specified basic vocabulary are recognized as such, are correspondingly classified, and are separated out.
12. A process according to any one of claims 1 to 11, characterized in that the speaker-independent compound-word voice recognizer and the speaker-dependent additional voice recognizer build onto the same signal preprocessing for the input voice commands, preferably including the methods for noise reduction, echo compensation, and segmentation.
13. A process according to any one of claims 1 to 12, characterized in that the additional voice recognizer operates as single-word voice recognizer, preferably based on a dynamic time warping process.
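A single-word recognizer based on dynamic time warping, as preferred in claim 13, compares the input against stored templates while tolerating differences in speaking rate. The sketch below uses scalar features and an absolute-difference cost purely for brevity; a real system would compare feature vectors. The names `dtw_distance`, `recognize`, and the template labels are hypothetical.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # step both
    return cost[n][m]


def recognize(features, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))


# Hypothetical templates recorded by the speaker during a training phase.
templates = {"home": [1.0, 2.0, 3.0], "office": [3.0, 2.0, 1.0]}
label = recognize([1.0, 1.0, 2.0, 3.0, 3.0], templates)  # slower "home"
```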
14. A process according to any one of claims 1 to 13, characterized in that the speaker-independent compound-word voice recognizer and the speaker-dependent voice recognizer operate jointly in a compound-word mode.
15. A process according to any one of claims 1 to 14, characterized in that during the real-time operation, there is a continuous adaptation of a voice signal receiving channel to an analog transmission characteristic, in particular to the characteristic for acoustic and/or microphone and/or amplifier and/or speaker.
16. A process according to any one of claims 1 to 15, characterized in that predetermined basic commands are specified and stored in voice-encoded form and/or additional commands input by a respective speaker during training phases and/or voice commands input during the real-time operation are further processed in voice-encoded form following their input and/or are stored in a non-volatile memory, and that encoded voice commands that must be output acoustically are voice-decoded prior to their output.
17. A process according to any one of claims 1 to 15, characterized in that specified basic commands and/or additional commands and/or the voice commands input during the real-time operation are stored in text form, and that voice commands that must be output acoustically are subjected to a text-to-language synthesis prior to their output.
18. A process according to any one of claims 1 to 17, characterized in that syntax structure and speaker-independent commands are created and fixed ahead of time in an "off-line dialog editor mode" in a laboratory and are transmitted to the compound-word voice recognizer in the form of data files.
19. A process according to any one of claims 1 to 18, characterized in that
• the word order in the voice commands can be changed by exchanging the individual words in a command and/or
• specified synonymous words can be used for generating the voice command and/or
• the same function can be realized through voice commands with a varying number of words.
20. A process according to any one of claims 1 to 19, characterized in that for the recognition and subsequent separating out of insertions or other commands not belonging to a vocabulary, additional words or phonemes are added to an admissible vocabulary or that word spotting approaches are used.
21. A process according to any one of claims 1 to 20, characterized in that a dialog structure has the following features:
• a flat hierarchy with only a few hierarchy levels, preferably one or two hierarchy levels;
• integration of ellipses for the processing of the voice dialog;
• inclusion of auxiliary and information menus;
• inclusion of repetition requests from the voice dialog system in case of unsure decisions by the recognizer;
• inclusion of voice outputs, in order to increase the recognition certainty by stimulating certain manners of speech.
22. A process according to any one of claims 1 to 21, characterized in that the voice recognition or the voice dialog for control of one or several device functions is preferably activated by a one-time actuation of a push-to-talk key (PTT) and that this activation is preferably acknowledged acoustically and/or optically.
23. A process according to any one of claims 1 to 22, characterized in that activation is terminated automatically if no voice input has occurred, following a time interval that can be specified or adaptively adjusted to the respective user and/or following a repetition request by the voice dialog system, or if the dialog selected by the user has been completed according to plan.
24. A process according to any one of claims 1 to 23, characterized in that the voice dialog or the input of voice commands can be terminated through the input of a specified, special termination voice command at defined locations in the voice dialog or at any time by actuating a key, preferably the push-to-talk key or an escape key.
25. A process according to any one of claims 1 to 24, characterized in that a voice dialog system has one or more of the following performance characteristics:
• specific voice commands from various speakers are stored, if necessary, for a later reuse;
• voice commands or names trained by the speaker are not only transmitted to a recognition system during a training phase, but are also recorded as to their time history, are transmitted to a data compression, and are stored in a non-volatile memory;
• the voice commands trained by the speaker are processed during the training phase in such a way that environmental noises are for the most part compensated during the recording.
26. A process according to any one of claims 1 to 25, characterized in that the completion of a recognition operation is acknowledged acoustically with a control sound.
27. A process according to any one of claims 1 to 26, characterized in that the recognition result is acoustically repeated (voice output), especially for decisions involving safety, time, or cost and that the speaker is given an option of preventing or reversing the carrying out of the function assigned to the voice command with the aid of a voice command or by actuating a switch, preferably a push-to-talk key or an escape key.
28. A process according to any one of claims 1 to 27, characterized in that the voice dialog system is connected to an optical display medium, preferably a LCD display or a monitor or a display for a selected device.
29. A process according to claim 28, characterized in that the optical display medium takes over individual or a plurality of the following functions:
• output of the recognized voice command for control purposes;
• illustration of the functions adjusted by the target device in reaction to the voice command;
• illustration of various functions/alternatives, which are subsequently adjusted or selected or modified by voice command.
30. A process according to any one of claims 1 to 29, characterized in that each speaker can set up his/her own name or abbreviation lists, comprising one or several or all of the following features:
• the name trained by the speaker on the speaker-dependent recognizer represents a number sequence, a letter sequence and/or a command or a command sequence, input in the speaker-independent operating mode;
• the user can input the list designation and the name selected by the user in place of the renewed input of the number sequence, letter sequence or command sequence, or the user can input a suitable command in addition to the name, which suggests the correct list;
• the list can be expanded at any time by voice control to comprise further entries;
• the list can be deleted completely or selectively with voice control;
• the list can be listened to for a voice command, wherein the names input by the user and, if necessary the associated number sequence, letter sequence or commands can be output acoustically;
• the acoustic output of the list can be terminated at any point in time.
31. A process according to any one of claims 1 to 30, characterized in that a sequence of numbers or number column can be spoken in a linked form or in blocks, wherein the voice input or the voice dialog preferably exhibits one or more or all of the following features:
• each input pause is followed by an acknowledgement in which the last input block is repeated by the voice output;
• following the acknowledgement through a voice command "error" or the like, the last input block is deleted and the remaining, stored blocks are acoustically output;
• following the acknowledgement through a voice command "delete" or the like, all entered number blocks are deleted;
• following the acknowledgement through a voice command "repeat" or the like, the blocks stored until then are output acoustically;
• following the acknowledgement through a voice command "termination" or the like, the input of the number column is terminated completely;
• additional numbers or number blocks can be input following the acknowledgement;
• following the acknowledgment, the number input is ended with a suitable voice command "stop", "store", or the like;
• the input is completed by entering a voice command starting an action/function, e.g. "select" or the like, which initiates the action/function associated with the voice command.
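The block-wise number input of claim 31 can be pictured as a small state holder that reacts to either a number block or one of the correction commands. The sketch below covers only "error", "delete", "repeat", and "stop"/"store"; the class name `NumberInput` and the convention that `handle` returns the text for the voice output are assumptions for illustration.

```python
class NumberInput:
    """Collects spoken number blocks and applies correction commands."""

    def __init__(self):
        self.blocks = []
        self.done = False

    def handle(self, utterance):
        """Process one recognized utterance and return the text to speak back."""
        if utterance == "error":        # drop the last block, read back the rest
            if self.blocks:
                self.blocks.pop()
            return " ".join(self.blocks)
        if utterance == "delete":       # drop everything entered so far
            self.blocks.clear()
            return ""
        if utterance == "repeat":       # read back the blocks stored until now
            return " ".join(self.blocks)
        if utterance in ("stop", "store"):
            self.done = True            # number input is ended
            return "".join(self.blocks)
        if utterance.isdigit():         # a new number block: acknowledge it
            self.blocks.append(utterance)
            return utterance
        raise ValueError(f"unexpected utterance: {utterance!r}")


dialog = NumberInput()
echoes = [dialog.handle(u) for u in ["0711", "123", "999", "error", "456", "store"]]
```

Reading back with the same blocking as the input, as in claim 33, falls out of storing the blocks separately rather than as one digit string.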
32. A process according to any one of claims 1 to 31, characterized in that the sequence of letters or letter column is spoken in, which is provided for the selection of complex functions or the input of a plurality of information bits, wherein the letter column is entered either linked together or in blocks and the voice input or the voice dialog preferably exhibits one or several or all of the following features:
• each input pause is followed by an acknowledgement, in which the last input block is repeated by the voice output;
• following the acknowledgement through a voice command "error" or the like, the last input block is deleted and the remaining, stored blocks are output acoustically;
• following the acknowledgement through a voice command "delete" or the like, all previously entered letters are deleted and a new input can subsequently take place;
• following the acknowledgement through a voice command "repeat" or the like, the blocks stored until then are output acoustically;
• following the acknowledgement, additional letters or letter blocks can be input;
• if necessary, the letter column or the individual letter blocks are matched with a stored word list, and the most suitable word or words is or are extracted from this;
• following the acknowledgement through a voice command "termination" or the like, the input of the letter column is terminated completely;
• following the acknowledgement, the letter input is completed with a voice command "stop", "store", or the like;
• the input is completed by entering a voice command starting an action/function, such as "select" or the like and the action/function associated with the voice command is initiated.
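Matching a spelled letter column against a stored word list, where the beginning of a word may already suffice, could be sketched as follows. The similarity measure (Python's `difflib.SequenceMatcher` applied to the word beginning), the cutoff of 0.6, and the sample word list are assumptions chosen for this example; the claim does not prescribe a matching method.

```python
from difflib import SequenceMatcher


def match_candidates(letters, word_list, cutoff=0.6, limit=3):
    """Return up to `limit` words whose beginning resembles the spelled letters."""
    spelled = "".join(letters).lower()
    scored = []
    for word in word_list:
        head = word.lower()[:len(spelled)]          # compare with the word start
        score = SequenceMatcher(None, spelled, head).ratio()
        if score >= cutoff:
            scored.append((score, word))
    scored.sort(key=lambda item: -item[0])           # best match first, stable
    return [word for _, word in scored[:limit]]


# Hypothetical stored word list of target locations.
places = ["Stuttgart", "Ulm", "Stuttgart-Vaihingen", "Munich"]
candidates = match_candidates(["s", "t", "u"], places)
```

Because the score is tolerant of single substitutions, one misrecognized letter still leaves the intended word among the candidates offered for selection.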
33. A process according to any one of claims 31 or 32, characterized in that the same blocking as for the input is used for the output of the numbers entered until then, which output follows the voice command "error" or the like or the voice command "repeat" or the like.
34. A process according to any one of claims 1 to 33, characterized in that voice output volume and control sound volume are adapted to environmental noises, wherein the environmental noises are detected during speaking pauses with respect to their strength and characteristic.
35. A process according to any one of claims 1 to 34, characterized in that access to a voice dialog system or access to user-specific data/commands can be gained only through the input of special command words or the input of special command words from an authorized speaker, whose speech characteristics are known to the voice dialog system and are analyzed by this system.
36. A process according to any one of claims 1 to 35, characterized in that voice output operations of a longer duration (e.g. information menus) can be terminated prematurely through spoken or manual termination commands.
37. A process according to any one of claims 1 to 36, characterized in that a voice dialog system in one of the following forms complements or replaces manual operation of functions (e.g. by switch, key, rotary button):
• the voice command control exists in addition to the manual operation, so that it is possible at any time to have a manual operation or to continue the operation manually;
• some special performance characteristics can be activated only by voice input, while other device functions and operating functions continue to be controlled manually as well as by voice;
• a number of manual operating elements is clearly reduced, and individual keys or rotary knobs take over multiple functions; manual operating elements are assigned a special function by voice; only essential operating functions can still be actuated manually; voice command control forms the basis for operating functions.
38. A process according to any one of claims 1 to 37, characterized in that a plurality of different devices as well as device functions can be addressed and modified with a single one-word or multiword command, and a multistage action is therefore either not required at all or required only to a minor extent.
39. A process according to any one of claims 1 to 38, characterized in that the voice dialog system in vehicles is used for individual or a plurality of the functions named in the following:
• the operation of individual or multiple devices, e.g. car telephone, car radio, car radio with tape deck, CD changer, and sound system, navigation system, emergency call, onboard monitor, air-conditioning system, heater, travel computer, lighting, sun roof, window opener, seat adjuster;
• information polling of parameters, e.g. oil pressure, oil temperature, cooling-water temperature, consumption, tire pressure;
• information concerning necessary measures such as a cooling-water temperature that is too high, a tire pressure that is too low;
• warning of the driver in case of vehicle malfunctions.
40. A process according to claim 39, characterized in that voice-controlled selection of a new station on the car radio occurs based on the following processes:
• issuing a command for the search operation up or down;
• voice input of the station frequency, preferably in its colloquial form, and preferably also including its frequency information;
• voice input of its commonly used station name.
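As an illustration of the colloquial frequency input, a recognized word sequence such as "one hundred three comma seven", with or without "and" or "megahertz", can be mapped to a numeric frequency. The sketch assumes English number words, covers only values built from digits, tens, and "hundred" (which suffices for FM broadcast frequencies), and uses the hypothetical name `parse_frequency`.

```python
DIGITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
          "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
IGNORED = {"and", "megahertz"}  # optional words that carry no numeric value


def parse_frequency(utterance):
    """Convert a spoken frequency such as 'hundred three comma seven' to MHz."""
    words = [w for w in utterance.lower().split() if w not in IGNORED]
    if "comma" in words:
        split = words.index("comma")
        whole_words, frac_words = words[:split], words[split + 1:]
    else:
        whole_words, frac_words = words, []
    whole = 0
    for word in whole_words:
        if word == "hundred":
            whole = max(whole, 1) * 100   # bare "hundred" means one hundred
        elif word in DIGITS:
            whole += DIGITS[word]
        elif word in TENS:
            whole += TENS[word]
        else:
            raise ValueError(f"unknown word: {word!r}")
    if any(word not in DIGITS for word in frac_words):
        raise ValueError("only single digits are expected after 'comma'")
    fraction = "".join(str(DIGITS[w]) for w in frac_words) or "0"
    return float(f"{whole}.{fraction}")


mhz = parse_frequency("hundred and three comma seven megahertz")
```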
41. A process according to claim 39, characterized in that a desired temperature for the air-conditioning system can be set relatively and/or absolutely through voice input, and preferably and additionally, a minimum, maximum, or average temperature or a normal temperature can be issued.
42. A process according to claim 39, characterized in that the navigation system is informed of a target location (e.g. location name, street name) through the input of letter columns in the "spelling mode", wherein the beginning of a name is preferably sufficient for the input and wherein the navigation system, if necessary, offers several candidates for selection.
43. A process according to any one of claims 39 to 42, characterized in that one or several of the following, user-specific name lists are set up:
• a list for storing telephone numbers under names/abbreviations that can be specified;
• a list for storing targets for the navigation system under names/abbreviations that can be specified;
• a list for storing function names for commands or command sequences;
• a list for storing station frequencies for car radios under specifiable station names or abbreviations.
44. A process according to any one of claims 39 to 43, characterized in that the volume for the voice output and the control sound or the control sounds, if necessary also the radio volume, are adaptively adjusted or set by taking into account one or several of the following parameters:
• vehicle speed;
• fan setting;
• rotational number;
• opening width for the window and sun roof;
• vehicle type;
• importance of the voice output in the respective dialog situation.
45. A process according to any one of claims 22 to 44, characterized in that a push-to-talk key
• either makes use of or contains multifunctions, e.g. when using the telephone ("replace receiver", "lift off receiver") or for restart of the voice dialog system or when terminating the telephone dialing operation;
• or is complemented by an additional switch or an additional switching position that permits, for example, a restart or the termination of a function.
46. An apparatus for carrying out this process in accordance with any one of claims 1 to 45, in which a voice input/output unit is connected via a voice signal preprocessing unit with a voice recognition unit, which in turn is connected to a sequencing control, a dialog control, and an interface control, characterized in that the voice recognition unit consists of a speaker-independent compound-word recognizer and a speaker-dependent additional voice recognizer, which are both connected on the output side with a unit for syntactical-grammatical or semantical postprocessing that is linked to the sequencing control, the dialog control, and the interface control.
47. An apparatus according to claim 46, characterized in that the voice signal preprocessing unit includes a noise reduction device and/or an echo compensation device and/or a segmenting device.
48. An apparatus according to one of the claims 46 or 47, characterized in that the voice input/output unit includes a voice encoder, a voice decoder, as well as a voice memory.
49. An apparatus according to one of the claims 46 to 48, characterized in that the sequencing control, the dialog control, and the interface control, the voice input/output, as well as the voice signal preprocessing, the voice recognition, the syntactical-grammatical and semantical postprocessing are carried out with microprocessors and signal processors, memories, and interface modules, but preferably with a single digital signal processor or microprocessor as well as the required external memories for data and programs, the interfaces, as well as an associated driver module, a clock generator, a control logic, and microphones and speakers necessary for the voice input/output, including associated converters and amplifiers, as well as a push-to-talk (PTT) key and an escape key if necessary.
50. An apparatus according to claim 49, characterized in that with the aid of one or several interfaces,
• data and/or parameters can be loaded or reloaded, e.g. to realize processing changes or a voice dialog system for another language;
• syntax structure, dialog structure, sequencing control, voice output, etc., which are fixed or modified on a separate computer, are transmitted to a voice dialog system;
• diagnostic and status information can be requested and collected by the voice dialog system.
51. An apparatus according to claim 49, characterized in that this apparatus is linked via a bus system or a ring-shaped net with several of the devices to be controlled, and that control data and/or audio signals and/or status reports of a voice dialog system and/or the devices to be operated can be transmitted via this bus or the net.
52. An apparatus according to one of the claims 46 to 51 for use in vehicles, characterized in that the individual devices to be selected do not contain a separate voice dialog system each, but are operated with the aid of a single, joint voice dialog system.
53. An apparatus according to one of the claims 46 to 52, characterized by the existence of one or several interfaces to vehicle components or vehicle computers, which are used to provide a voice dialog system with permanent or up-to-date vehicle data, e.g. the speed.
54. An apparatus according to one of the claims 46 to 53, characterized in that this apparatus takes on other functions, e.g. for the radio, telephone, etc., during the waiting periods in which no voice input or voice output occurs.
55. An apparatus according to one of the claims 46 to 54, characterized in that a multilingual, speaker-independent dialog system is realized by means of an expanded memory, which permits the switching between the dialog systems of various languages.
56. An apparatus according to one of the claims 46 to 55, characterized in that an optical display is coupled to the voice dialog system via a special interface or via the bus connection.
57. An apparatus according to one of the claims 46 to 56, characterized in that the complete voice dialog system is coupled via a PCMCIA interface with a voice-controlled or voice-operated device or with a host computer or an application computer.
58. An apparatus according to claim 51 or 57, characterized in that this bus or this net is an optical data bus, and that control signals as well as audio signals or status reports from the voice dialog system and the devices to be operated are transmitted via this data bus or net.
CA002231504A 1995-09-11 1996-09-09 Process for automatic control of one or more devices by voice commands or by real-time voice dialog and apparatus for carrying out this process Expired - Lifetime CA2231504C (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
DE19533541A DE19533541C1 (en) 1995-09-11 1995-09-11 A method for automatically controlling one or more appliances by voice commands or by voice dialogue in real-time operation and apparatus for performing the method
DE19533541.4 1995-09-11
PCT/EP1996/003939 WO1997010583A1 (en) 1995-09-11 1996-09-09 Process for automatic control of one or more devices by voice commands or by real-time voice dialog and apparatus for carrying out this process

Publications (2)

Publication Number Publication Date
CA2231504A1 CA2231504A1 (en) 1997-03-20
CA2231504C CA2231504C (en) 2005-08-02

Family

ID=7771821

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002231504A Expired - Lifetime CA2231504C (en) 1995-09-11 1996-09-09 Process for automatic control of one or more devices by voice commands or by real-time voice dialog and apparatus for carrying out this process

Country Status (8)

Country Link
US (1) US6839670B1 (en)
EP (1) EP0852051B1 (en)
JP (1) JP3479691B2 (en)
AT (1) AT211572T (en)
CA (1) CA2231504C (en)
DE (2) DE19533541C1 (en)
ES (1) ES2170870T3 (en)
WO (1) WO1997010583A1 (en)

US10101822B2 (en) 2015-06-05 2018-10-16 Apple Inc. Language input correction
US9578173B2 (en) 2015-06-05 2017-02-21 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10255907B2 (en) 2015-06-07 2019-04-09 Apple Inc. Automatic accent detection using acoustic models
US10186254B2 (en) 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection
US20170069312A1 (en) * 2015-09-04 2017-03-09 Honeywell International Inc. Method and system for remotely training and commanding the speech recognition system on a cockpit via a carry-on-device in a connected aircraft
US9697820B2 (en) 2015-09-24 2017-07-04 Apple Inc. Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US9978367B2 (en) 2016-03-16 2018-05-22 Google Llc Determining dialog states for language models
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
DK201670578A1 (en) 2016-06-09 2018-02-26 Apple Inc Intelligent automated assistant in a home environment
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
DK179415B1 (en) 2016-06-11 2018-06-14 Apple Inc Intelligent device arbitration and control
DK201670540A1 (en) 2016-06-11 2018-01-08 Apple Inc Application integration with a digital assistant
DK179343B1 (en) 2016-06-11 2018-05-14 Apple Inc Intelligent task discovery
EP3270575A1 (en) 2016-07-12 2018-01-17 Veecoo Ug Platform for integration of mobile terminals and peripheral aftermarket equipment in a vehicle
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
KR20180073115A (en) * 2016-12-22 2018-07-02 삼성전자주식회사 Electronic device including component mounting structure through bended display
US10311860B2 (en) 2017-02-14 2019-06-04 Google Llc Language model biasing system
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
WO2019027992A1 (en) * 2017-08-03 2019-02-07 Telepathy Labs, Inc. Omnichannel, intelligent, proactive virtual agent
US10311874B2 (en) 2017-09-01 2019-06-04 4Q Catalyst, LLC Methods and systems for voice-based programming of a voice-controlled device

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1987001546A1 (en) * 1985-09-03 1987-03-12 Motorola, Inc. Hands-free control system for a radiotelephone
US4751737A (en) * 1985-11-06 1988-06-14 Motorola Inc. Template generation method in a speech recognition system
US4856072A (en) * 1986-12-31 1989-08-08 Dana Corporation Voice actuated vehicle security system
DE3819178C2 (en) * 1987-06-04 1991-06-20 Ricoh Co., Ltd., Tokio/Tokyo, Jp
US5033087A (en) * 1989-03-14 1991-07-16 International Business Machines Corp. Method and apparatus for the automatic determination of phonological rules as for a continuous speech recognition system
DE3928049A1 (en) * 1989-08-25 1991-02-28 Grundig Emv Voice-controlled archiving system
US5144672A (en) * 1989-10-05 1992-09-01 Ricoh Company, Ltd. Speech recognition apparatus including speaker-independent dictionary and speaker-dependent
US5127043A (en) * 1990-05-15 1992-06-30 Vcs Industries, Inc. Simultaneous speaker-independent voice recognition and verification over a telephone network
US5125022A (en) * 1990-05-15 1992-06-23 Vcs Industries, Inc. Method for recognizing alphanumeric strings spoken over a telephone network
US5303299A (en) * 1990-05-15 1994-04-12 Vcs Industries, Inc. Method for continuous recognition of alphanumeric strings spoken over a telephone network
US5241619A (en) * 1991-06-25 1993-08-31 Bolt Beranek And Newman Inc. Word dependent N-best search method
DE4130632A1 (en) * 1991-09-14 1993-03-18 Philips Patentverwaltung A method for recognizing the spoken words in a speech signal
US5388183A (en) * 1991-09-30 1995-02-07 Kurzwell Applied Intelligence, Inc. Speech recognition providing multiple outputs
US5353376A (en) * 1992-03-20 1994-10-04 Texas Instruments Incorporated System and method for improved speech acquisition for hands-free voice telecommunication in a noisy environment
US5297183A (en) * 1992-04-13 1994-03-22 Vcs Industries, Inc. Speech recognition system for electronic switches in a cellular telephone or personal communication network
US5475791A (en) * 1993-08-13 1995-12-12 Voice Control Systems, Inc. Method for recognizing a spoken word in the presence of interfering speech
US5893059A (en) * 1997-04-17 1999-04-06 Nynex Science And Technology, Inc. Speech recoginition methods and apparatus
US5913192A (en) * 1997-08-22 1999-06-15 At&T Corp Speaker identification with user-selected password phrases

Also Published As

Publication number Publication date
ES2170870T3 (en) 2002-08-16
JP3479691B2 (en) 2003-12-15
EP0852051B1 (en) 2002-01-02
EP0852051A1 (en) 1998-07-08
US6839670B1 (en) 2005-01-04
DE59608614D1 (en) 2002-02-28
AT211572T (en) 2002-01-15
CA2231504A1 (en) 1997-03-20
JPH11506845A (en) 1999-06-15
DE19533541C1 (en) 1997-03-27
WO1997010583A1 (en) 1997-03-20

Similar Documents

Publication Publication Date Title
Rabiner et al. Theory and applications of digital speech processing
Walker et al. Sphinx-4: A flexible open source framework for speech recognition
Furui 50 years of progress in speech and speaker recognition research
US7174300B2 (en) Dialog processing method and apparatus for uninhabited air vehicles
US6711543B2 (en) Language independent and voice operated information management system
EP0965979B1 (en) Position manipulation in speech recognition
Juang et al. Automatic speech recognition–a brief history of the technology development
US7209880B1 (en) Systems and methods for dynamic re-configurable speech recognition
US6925154B2 (en) Methods and apparatus for conversational name dialing systems
US8510103B2 (en) System and method for voice recognition
US6243680B1 (en) Method and apparatus for obtaining a transcription of phrases through text and spoken utterances
US7228275B1 (en) Speech recognition system having multiple speech recognizers
EP0965978A1 (en) Non-interactive enrollment in speech recognition
CN1655235B (en) Automatic identification of telephone callers based on voice characteristics
CN1327406C (en) Open type word table speech identification method
US7200555B1 (en) Speech recognition correction for devices having limited or no display
US6856956B2 (en) Method and apparatus for generating and displaying N-best alternatives in a speech recognition system
US20030125955A1 (en) Method and apparatus for providing a dynamic speech-driven control and remote service access system
CN101272416B (en) Voice dialing using a rejection reference
US7016849B2 (en) Method and apparatus for providing speech-driven routing between spoken language applications
US5983177A (en) Method and apparatus for obtaining transcriptions from multiple training utterances
US5615296A (en) Continuous speech recognition and voice response system and method to enable conversational dialogues with microprocessors
EP2196989B1 (en) Grammar and template-based speech recognition of spoken utterances
US8560313B2 (en) Transient noise rejection for speech recognition
US8355915B2 (en) Multimodal speech recognition system

Legal Events

Date Code Title Description
EEER Examination request
MKEX Expiry Effective date: 20160909