WO2004029932A1

WO2004029932A1 - Method and device for the computer-aided comparison of a first sequence of phoneme units with a second sequence of phoneme units, voice recognition device, and speech synthesis device

Info

Publication number: WO2004029932A1
Application number: PCT/DE2003/003206
Authority: WO
Inventors: Diane Hirschfeld; Michael Küstner; Ronald Römer
Original assignee: Infineon Technologies Ag
Priority date: 2002-09-25
Filing date: 2003-09-25
Publication date: 2004-04-08
Also published as: DE10244722A1

Abstract

According to the invention, each phoneme unit of a first sequence of phoneme units is mapped onto a corresponding phoneme unit of a second sequence of phoneme units, a comparison cost function being increased if the mapping requires an insertion, an omission or an exchange of a phoneme unit. Articulation feature vectors are taken into account in the mapping in such a way that the comparison cost function is increased by different amounts for phoneme units having different articulation feature vectors.

Description


Method and device for the computer-aided comparison of a first sequence of spoken units with a second sequence of spoken units, speech recognition device and speech synthesis device

The invention relates to a method and a device for computer-aided comparison of a first sequence of spoken units with a second sequence of spoken units, a speech recognition device and a speech synthesis device.

In digital speech processing, the problem frequently arises of comparing two symbol sequences, in particular two sequences of spoken units, preferably two phoneme chains (i.e. two sequences of phonemes), with regard to their similarity.

Such a comparison often takes place, for example, as part of the automatic generation of databases for linguistic processing stages (for example phonetic rules or electronic lexicons), as part of the selection of suitable phoneme-labeled phonetic units (building blocks) on the basis of a sequence of phonemes to be synthesized for the speech signal generating unit of a text-to-speech system, or as part of a search for suitable words in an electronic lexicon (hereinafter also referred to as an electronic dictionary) for phoneme hypotheses of a speech recognition device. Linguistic units are represented by symbols of a phonetic alphabet (phonemes).

For the comparison of two sequences of spoken units, it is known to use a comparison method that takes into account the so-called Levenshtein distance (cf. [1], [2]).

The Levenshtein distance is well suited for the comparison of two symbol sequences, in particular of phoneme sequences, since it evaluates the insertions, omissions and exchanges of individual symbols in notional mapping transformations that convert a first sequence of symbols, e.g. a first phoneme chain, into another sequence of symbols, e.g. another phoneme chain. For this purpose, the transformation operations "insertion" and "omission" are each assigned a flat cost of "1", while the exchange operation is assigned a cost of "2" because it combines an insertion and an omission, i.e. two operations.

After the transformation cost has been calculated for each symbol (i.e. each phoneme) of one sequence with respect to each symbol (i.e. each phoneme) of the other sequence, the least expensive path is sought in the resulting distance matrix. Along this path the minimum number of insertions, omissions and exchanges, in other words the Levenshtein distance, can be determined.

Thus, the Levenshtein distance describes the optimal mapping between two sequences of symbols, i.e. between two sequences of spoken units.

The smaller the Levenshtein distance, the fewer transformation operations are necessary to map a first sequence of spoken units to a second sequence of spoken units, and the more similar the two sequences are.
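The known method described above can be sketched as a small dynamic program. This is a minimal illustration, not code from [1] or [2]: the function name `levenshtein` and its parameter names are hypothetical, and only the flat costs (1 for insertion and omission, 2 for exchange) are taken from the text.

```python
def levenshtein(a, b, ins_cost=1, del_cost=1, sub_cost=2):
    """Classic Levenshtein distance with the flat costs named in the text.

    a, b: sequences of symbols (e.g. lists of phoneme labels).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = cheapest cost of turning a[:i] into b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * del_cost          # omit all symbols of a[:i]
    for j in range(1, cols):
        d[0][j] = j * ins_cost          # insert all symbols of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            exchange = 0 if a[i - 1] == b[j - 1] else sub_cost
            d[i][j] = min(
                d[i - 1][j] + del_cost,      # omission
                d[i][j - 1] + ins_cost,      # insertion
                d[i - 1][j - 1] + exchange,  # match or exchange
            )
    return d[-1][-1]


# One exchanged symbol costs 2, regardless of which symbols are involved.
distance = levenshtein(["h", "a", "u", "s"], ["m", "a", "u", "s"])
```

Because every exchange costs the same, the distance cannot tell an exchange of two similar sounds from an exchange of two very different ones.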

In the method described in [1] and [2], the Levenshtein distance does not take into account that some phonemes are more similar to one another than others. In the known Levenshtein method, all exchanges incur the same cost.

One solution would be to group similar phonemes into similarity classes, using, among other things, the place of articulation as a decision criterion, since this feature causes significant transitions of the spectral characteristics to the neighboring sound. Exchange costs for two phonemes of the same class would then be lower than for two phonemes of different classes.

The disadvantage is that it is difficult to assign the spoken units to individual classes; vowels, for example, have no place of articulation. Furthermore, the evaluation is complicated and inhomogeneous.

Furthermore, the feature used in [1] and in [2] is static and therefore does not take into account the temporally developing co-articulatory influences, which can span up to three phonemes.

Furthermore, the known method does not allow an exact quantification of the degree of similarity, both within a similarity class and between different similarity classes.

From [3] it is known in the context of speech synthesis that several independent articulation settings can be used for the numerical quantification of phonemic similarities.

[4] describes a method for speech recognition by means of a dynamic programming algorithm (DP algorithm) in which the search range is restricted depending on the gradient of the search paths in order to shorten the execution of the standard DP algorithm, since fewer mathematical operations have to be performed.

[5] describes a speech recognition system based on a neural network in which two types of information are taken into account, namely dynamically changing acoustic and visual signals. A visual vector, whose component features include the nose-chin distance, the mouth opening, the horizontal center lip distance, the vertical center lip distance and the corner angle separation, and an audio vector, whose features are short-term energy spectra of a spoken voice signal, together form an audiovisual vector. This vector is fed to a time delay neural network (TDNN), which performs the speech recognition.

The invention is based on the problem of comparing two sequences of spoken units with one another in a simpler manner, and thus with less computing time, than with the known methods.

The problem is solved by the method and the device for computer-aided comparison of a first sequence of spoken units with a second sequence of spoken units, by the speech recognition device and by the speech synthesis device having the features according to the independent claims.

In a method for computer-aided comparison of a first sequence of spoken units with a second sequence of spoken units, each spoken unit is assigned an articulation feature vector which contains articulatory properties of the spoken units and/or physical properties relating to their production. Each spoken unit of the first sequence of spoken units is mapped to a corresponding spoken unit of the second sequence of spoken units, preferably using a cost function with which the effort involved in converting the first spoken unit into the second spoken unit can be calculated numerically. The costs are calculated as a numerical distance between the individual numerically coded articulatory features of the first and the second spoken unit. Any metric, in particular the Euclidean metric, can be used to calculate the distance:

$$c(i, j) = \sum_{k=1}^{K} w_k \cdot \lVert x_i(k) - y_j(k) \rVert$$

where $c(i, j)$ is the distance between the two phonemes i and j to be compared, $w_k$ is the weighting factor for the k-th feature component, $x_i(k)$ is the k-th feature component of phoneme i and $y_j(k)$ is the k-th feature component of phoneme j, each phoneme having a total of K features.
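The weighted distance between two articulation feature vectors can be illustrated with a few lines of code. This is only a sketch: the function name `phoneme_cost` is hypothetical, the feature components are assumed to be scalars (so the norm reduces to an absolute value), and the numbers in the usage lines are invented for illustration.

```python
def phoneme_cost(x, y, weights=None):
    """Weighted distance c(i, j) between two articulation feature vectors.

    x, y: sequences of K numerically coded feature components.
    weights: optional sequence of K weighting factors w_k (default: all 1).
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length K")
    if weights is None:
        weights = [1.0] * len(x)
    # Scalar components assumed, so ||x(k) - y(k)|| is an absolute difference.
    return sum(w * abs(xk - yk) for w, xk, yk in zip(weights, x, y))


# Hypothetical three-component vectors; the values are illustrative only.
unweighted = phoneme_cost([1.0, 0.0, 2.0], [0.0, 0.0, 4.0])
weighted = phoneme_cost([1.0, 0.0, 2.0], [0.0, 0.0, 4.0], weights=[2.0, 1.0, 1.0])
```

Raising a weight makes differences in that feature count for more, which is how individual articulatory features could be emphasized for a given language.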

In the mapping, the articulation feature vectors are taken into account in such a way that the comparison cost function is increased by different amounts for spoken units with different articulation feature vectors.

In other words, each spoken unit of the first sequence of spoken units is mapped to a corresponding spoken unit of the second sequence of spoken units, the numerical distance between the two unit-dependent articulation feature vectors being calculated for each pairing of spoken units and these distances being arranged in the form of a matrix, illustratively a cost matrix spanned by the two sequences of spoken units to be compared.
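The cost matrix spanned by the two sequences can be sketched as follows. The feature table `FEATURES` and its values are purely hypothetical placeholders, not codings taken from the invention, and `cost_matrix` is an illustrative name.

```python
# Hypothetical articulation feature vectors (three features per phoneme).
FEATURES = {
    "i": (1.0, 0.0, 0.0),
    "j": (1.0, 0.0, 0.5),
    "u": (1.0, 1.0, 0.0),
}


def cost_matrix(seq_a, seq_b, features, weights):
    """Return m with m[r][c] = weighted distance between seq_a[r] and seq_b[c]."""
    def dist(p, q):
        return sum(w * abs(a - b)
                   for w, a, b in zip(weights, features[p], features[q]))
    return [[dist(p, q) for q in seq_b] for p in seq_a]


# Rows follow the first sequence, columns follow the second sequence.
matrix = cost_matrix(["i", "u"], ["j", "u"], FEATURES, (1.0, 1.0, 1.0))
```

With these invented values the pairing of "i" with "j" costs 0.5 while "i" with "u" costs 1.0, so the matrix already encodes that some exchanges are cheaper than others.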

A device for computer-aided comparison of a first sequence of spoken units with a second sequence of spoken units has a processor unit which is set up in such a way that the method steps described above can be carried out.

A speech recognition device with a device described above is also provided.

Furthermore, a speech synthesis device with a device specified above is provided.

According to the invention, the knowledge is exploited for the first time that the articulatory formation of spoken-language units significantly influences their physical characteristics. The individual articulation organs act relatively independently of each other.

For this reason, several features, which are summarized in the articulation feature vector and are relevant for their differentiation, are used to describe the phonemes.

After an optional examination of the articulatory features relevant for a phonetic classification (articulatory range of variation) and their appropriate subdivision according to physical characteristics (physical range of variation), the feature space is partitioned, the articulatory features are numerically coded, and the codes are assigned to the phonemes in the phoneme system (generally, in the system of spoken units) on the basis of their physical features.

A numerical similarity measure can thus be calculated in a very simple manner for any phonemes, generally for any spoken units, by calculating the distance between the associated articulation feature vectors. Illustratively, this provides a system of features and a calculation method that allow a simple determination of a quantitative measure of similarity between any two sequences of spoken units, preferably between two arbitrary phoneme sequences. The measure is based on articulatory-physical relationships and quantifies the similarity in articulatory units.

Thus, according to the invention, it is taken into account that some spoken units, for example individual phonemes, are more similar to one another than others (for example the different /e/ phonemes in German, or /i/ and /j/), so that exchanging them incurs lower costs.

According to the invention, for the computer-aided comparison of a first sequence of symbol representations of a spoken utterance with a second sequence of symbol representations of a spoken utterance, each symbol representation is assigned an articulation feature vector which contains articulatory properties of the symbol representations and/or physical properties of the symbol representations. Each symbol representation of the first sequence of symbol representations is mapped to a corresponding symbol representation of the second sequence of symbol representations, the numerical distance between the two unit-dependent articulation feature vectors being calculated for each pairing of symbol representations and these distances being arranged in the form of a cost matrix.

A device according to the invention has a processor unit which is set up in such a way that the method steps described above can be carried out or are carried out. Preferred developments of the invention result from the dependent claims.

In the context of the invention, a spoken unit is understood to be a symbolic representation of a spoken utterance, for example:

• a symbol representation of individual characters that represent a spoken utterance,

• phoneme segments,

• phonemes,

• allophones,

• diphones,

• half-syllables,

• syllables, or

• whole words.

The configurations described below relate to the method, the device, the speech recognition device and the speech synthesis device alike.

According to one embodiment of the invention, the optimal (least expensive) route is determined in the cost matrix in accordance with the Levenshtein method.

Illustratively, the spoken units of the first sequence are mapped onto the second sequence of spoken units in accordance with the Levenshtein method as modified according to the invention, in other words using the Levenshtein distance, with the articulation feature vectors being taken into account in the calculation of the Levenshtein distance.
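One way to picture the modified Levenshtein method is a dynamic program in which the exchange cost is the articulatory distance of the two units instead of a flat value. The sketch below rests on assumptions that the text does not fix: insertions and omissions keep a flat cost (`indel_cost`), and the feature table `FEATURES` contains invented values. All names are hypothetical.

```python
# Hypothetical articulation feature vectors; values are illustrative only.
FEATURES = {
    "i": (1.0, 0.0, 0.0),
    "j": (1.0, 0.0, 0.5),
    "u": (1.0, 1.0, 0.0),
}


def articulatory_cost(p, q):
    """Unweighted distance between the feature vectors of two phonemes."""
    return sum(abs(a - b) for a, b in zip(FEATURES[p], FEATURES[q]))


def weighted_alignment_cost(seq_a, seq_b, sub_cost, indel_cost=1.0):
    """Cheapest path cost when exchanges are priced by sub_cost(p, q).

    indel_cost is an assumed flat price for insertions and omissions.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * indel_cost
    for j in range(1, cols):
        d[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + indel_cost,                                # omission
                d[i][j - 1] + indel_cost,                                # insertion
                d[i - 1][j - 1] + sub_cost(seq_a[i - 1], seq_b[j - 1]),  # exchange
            )
    return d[-1][-1]


similar = weighted_alignment_cost(["i", "u"], ["j", "u"], articulatory_cost)
different = weighted_alignment_cost(["i"], ["u"], articulatory_cost)
```

With the invented values, replacing "i" by the similar "j" yields a lower total cost than replacing "i" by "u", which is the behavior the flat-cost Levenshtein distance cannot express.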

According to another embodiment of the invention, it is provided that phonemes are used as spoken units. With this embodiment of the invention, the accuracy of the Levenshtein distance is considerably improved by taking the spoken units into account and the meaningfulness of the result achieved is considerably increased.

The articulation feature vectors can be dependent on different articulation settings of an organ producing a spoken unit, preferably a phoneme.

Particularly suitable articulation features can be:

• the jaw position of a person,

• the place of articulation of the respective spoken unit,

• the rounding of a person's lips,

• the nasality,

• the degree of glottalization, or

• the pitch.

Depending on the language, different other characteristics can be added to the articulation characteristic vector.

The characteristic of whether a sound is glottalized or not is particularly important for African languages.

For Asian languages, especially Chinese, the pitch has turned out to be a suitable feature within the framework of the articulation feature vector.

The respective features in the articulation feature vector are preferably assigned several different articulation feature vector values, each of which describes a different form of the feature.

The modeling of the actual physical parameters of the organ producing the spoken unit can be further improved by assuming that the respective articulation feature vector values are subject to statistical variance around a given mean.

If the variances of the individual feature vector components are taken into account, the comparability of articulation feature vectors whose components scatter to different degrees can be ensured by means of an additional variance normalization. In other words, a variance within one of the features of the articulation feature vectors can be taken into account in the mapping.
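The text does not spell out how the variance normalization is performed. One common reading, shown here purely as an assumption, is to divide each component difference by the standard deviation of that component, so that strongly scattering features do not dominate the distance. The function name and the example numbers are hypothetical.

```python
import math


def normalized_distance(x, y, variances):
    """Distance with each component difference scaled by its standard deviation.

    variances: per-component variances (assumed known and strictly positive).
    """
    if not (len(x) == len(y) == len(variances)):
        raise ValueError("all inputs must have the same length")
    return sum(abs(a - b) / math.sqrt(v) for a, b, v in zip(x, y, variances))


# The second component differs more in absolute terms, but it also scatters
# more, so both components contribute equally after normalization.
result = normalized_distance([2.0, 0.0], [0.0, 3.0], [4.0, 9.0])
```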

The invention is preferably used in the context of speech synthesis and speech recognition or in the context of the automatic generation of an electronic dictionary. Further areas of application are the automatic creation of databases for linguistic processing stages, for example for phonetic rules or electronic lexicons, the selection of suitable phoneme-labeled phonetic units on the basis of a phoneme sequence to be synthesized for the artificial generation of a speech signal in a text-to-speech system, or the search for the correct word in an electronic lexicon for phoneme hypotheses of a speech recognition device.

An embodiment of the invention is shown in the figures and is explained in more detail below.

The figures show the following:

Figure 1 is a block diagram of a speech recognition device according to an embodiment of the invention;

Figure 2 is a block diagram showing the merging of an electronic dictionary, which contains only words previously entered and trained for speaker-independent speech recognition, with a dictionary in which utterances for speaker-dependent speech recognition are stored, the two being merged according to the invention into a common dictionary for speaker-independent speech recognition;

Figure 3 is a table showing the intended speech recognition states and the state transitions between different speech recognition states;

FIG. 4 shows a dialog state diagram according to an embodiment of the invention;

Figure 5 is a flow chart showing the individual method steps for speech recognition and for supplementing the electronic dictionary of the speech recognition device according to an embodiment of the invention;

Figure 6 is a detailed view of the individual method steps for initializing the speech recognition device according to an embodiment of the invention;

FIG. 7 shows a message flow diagram in which the individual method steps for carrying out a voice dialog are shown in accordance with an exemplary embodiment of the invention;

FIG. 8 shows a message flow diagram in which the individual steps for supplementing an electronic dictionary in the speech recognition device according to an exemplary embodiment of the invention are shown;

Figure 9 shows a first functional diagram according to an embodiment of the invention;

Figure 10 shows a second functional diagram according to an embodiment of the invention;

FIGS. 11A and 11B are tables showing the speech prompts presented to the user in the speech dialog according to a first exemplary embodiment of the invention (FIG. 11A) and additional actions of the speech recognition device according to the first exemplary embodiment of the invention (FIG. 11B);

Figure 12 is a speech dialog state diagram of a first state according to a first embodiment of the invention;

FIGS. 13A and 13B show a speech dialog state diagram of a second state of a speech recognition device according to the first exemplary embodiment of the invention (FIG. 13A) and the associated flow diagram (FIG. 13B);

FIGS. 14A and 14B show a speech dialog state diagram of a third state of a speech recognition device according to the first exemplary embodiment of the invention (FIG. 14A) and the associated flow diagram (FIG. 14B);

FIGS. 15A and 15B show a speech dialog state diagram of a fourth state of a speech recognition device according to the first exemplary embodiment of the invention (FIG. 15A) and the associated flow diagram (FIG. 15B);

FIGS. 16A and 16B show a speech dialog state diagram of a fifth state of a speech recognition device according to the first exemplary embodiment of the invention (FIG. 16A) and the associated flow diagram (FIG. 16B);

FIGS. 17A and 17B show a speech dialog state diagram of a sixth state of a speech recognition device according to the first exemplary embodiment of the invention (FIG. 17A) and the associated flow diagram (FIG. 17B);

FIGS. 18A and 18B show a speech dialog state diagram of a seventh state of a speech recognition device according to the first exemplary embodiment of the invention (FIG. 18A) and the associated flow diagram (FIG. 18B);

FIG. 19 shows a speech dialog state diagram of a first state of a speech recognition device according to a second exemplary embodiment of the invention;

Figure 20 is a speech dialog state diagram of a second state of a speech recognition device according to the second embodiment of the invention;

Figure 21 is a speech dialog state diagram of a third

FIG. 22 shows a telecommunication device with a speech recognition device according to an embodiment of the invention;

Figure 23 shows a car radio with a speech recognition device according to an embodiment of the invention.

FIG. 1 shows a speech recognition device 100 according to an embodiment of the invention.

The speech recognition device 100 operates in one of two operating modes. In a first operating mode, the speech recognition mode, a spoken utterance 101 spoken by a user (not shown) of the speech recognition device 100 is recognized using a method for speaker-independent speech recognition. In a second operating mode, also referred to below as the dictionary supplement mode, a spoken utterance is converted into a sequence of spoken units, more precisely into a sequence of phonemes, and, as will be explained in more detail below, may be fed to the electronic dictionary as a supplementary entry and stored therein.

In both operating modes, the speech signal 101 spoken by the user is fed to a microphone 102. The recorded electrical analog signal 103 is subjected to preprocessing, in particular preamplification, by means of a preprocessing unit 104, in particular a preamplifier. The preprocessed and amplified analog signal 105 is supplied to an analog/digital converter 106, converted there into a digital signal 107 and supplied as such to a computer 108.

In this context, it should be noted that the microphone 102, the preprocessing unit 104, in particular the amplification unit, and the analog / digital converter 106 can be implemented as separate units or as units integrated in the computer 108.

According to this exemplary embodiment, it is provided that the digitized signal 107 is fed to the computer 108 via its input interface 109. The computer 108 also has a microprocessor 110, a memory 111 and an output interface 112, all of which are coupled to one another by means of a computer bus 113.

The method steps described below, in particular the methods for supplementing the electronic dictionary and the respectively provided speech dialogue, are carried out by means of the microprocessor 110. An electronic dictionary 114, which contains the entries which contain speech words as reference words, is stored in the memory 111.

Furthermore, a digital signal processor (DSP) 123 is provided, which is likewise coupled to the computer bus 113 and which has a microcontroller specifically tailored to the speaker-independent speech recognition algorithms used.

A computer program, which is set up for speaker-independent speech recognition, is also stored in the digital signal processor 123. Alternatively, the algorithms used can be implemented in hard-wired logic, that is, directly in hardware.

Furthermore, the computer 108 is coupled by means of the input interface 109 to a keyboard 115 and a computer mouse 116 via electrical lines 117, 118 or a radio connection, for example an infrared connection or a Bluetooth connection.

Via additional cables or radio connections 119, 120, for example an infrared connection or a Bluetooth connection, the computer 108 is coupled by means of the output interface 112 to a loudspeaker 121 and an actuator 122. The actuator 122 in FIG. 1 generally represents any possible actuator for controlling a technical system, implemented for example in the form of a hardware switch or in the form of a computer program, in the event that, for example, a telecommunication device or another technical system, for example a car radio, a stereo system, a video recorder, a television, the computer itself or any other technical system, is to be controlled.

According to the exemplary embodiment of the invention, the preprocessing unit 104 has a filter bank with a plurality of bandpass filters, which measure the energy of the input speech signal 103 in individual frequency bands. So-called short-term spectra are formed by means of the filter bank in that the output signals of the bandpass filters are rectified, smoothed and sampled at short intervals, according to the exemplary embodiment every 10 msec. The so-called cepstrum coefficients of two successive time windows, as well as their first and second time derivatives, are determined, combined into a super feature vector and fed to the computer 108.

In the computer 108, as described above, a speech recognition unit for speaker-independent speech recognition is implemented in the form of a computer program, according to the exemplary embodiment in the DSP 123. The speech recognition is based on the principle of Hidden Markov models.

In a basic vocabulary, which is stored in the electronic dictionary 114 at the beginning of the method, a Hidden Markov model is stored for each basic entry. Each model is determined in the following manner using a training data set, that is to say a set of training speech signals spoken by one or more training users. According to this exemplary embodiment, the training of the Hidden Markov models takes place in three phases:

• a first phase in which the speech signals contained in the training database are segmented,

• a second phase in which the LDA matrix (linear discriminant analysis matrix) is calculated, and

• a third phase in which the code book, i.e. the HMM prototype feature vectors, is calculated for a number of feature vector components selected in a selection step.

The entirety of these three phases is referred to below as the training of the Hidden Markov models (HMM training).

The HMM training is carried out using the DSP 123 and predetermined training scripts, in other words by means of suitably configured computer programs.

According to this exemplary embodiment, each phonetic unit formed, i.e. each phoneme, is divided into three successive phoneme segments, corresponding to an initial phase (first phoneme segment), a central phase (second phoneme segment) and an end phase (third phoneme segment) of a sound that is called a phoneme.

In other words, each sound is modeled in a three-state sound model, that is, with a three-state HMM.

During speech recognition, the three phoneme segments are lined up in a Bakis topology or generally a left-right topology and the concatenation of these three lined up segments is carried out as part of the speaker-independent speech recognition. As will be explained in more detail below, a Viterbi algorithm for decoding the feature vectors which are formed from the input speech signal 101 is carried out in the speech recognition mode.
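Viterbi decoding over a three-state left-right model can be sketched as follows. The transition and emission probabilities below are invented for illustration, and in a real recognizer the emission scores would come from the trained models rather than from a fixed table. The function names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Logarithm that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi_left_right(trans, emit):
    """Best state path that starts in state 0 and ends in the last state.

    trans[p][s]: transition probability from state p to state s.
    emit[t][s]:  probability of the frame at time t under state s.
    Returns (path, log probability of that path).
    """
    n_states = len(trans)
    log_t = [[safe_log(p) for p in row] for row in trans]
    score = [safe_log(emit[0][0])] + [NEG_INF] * (n_states - 1)
    backpointers = []
    for frame in emit[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: score[p] + log_t[p][s])
            new_score.append(score[best_prev] + log_t[best_prev][s] + safe_log(frame[s]))
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    path = [n_states - 1]                # forced to end in the final segment
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[-1]


# Left-right topology: each state may only repeat or advance by one.
TRANS = [[0.5, 0.5, 0.0],
         [0.0, 0.5, 0.5],
         [0.0, 0.0, 1.0]]
EMIT = [[0.9, 0.05, 0.05],
        [0.1, 0.8, 0.1],
        [0.1, 0.8, 0.1],
        [0.1, 0.1, 0.8]]
best_path, log_prob = viterbi_left_right(TRANS, EMIT)
```

The zero entries in `TRANS` are what enforce the left-right ordering of the initial, central and end segments.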

After segmentation has taken place, the LDA matrix 304 (step 403) is determined by means of an LDA matrix calculation unit 303.

The LDA matrix 304 is used to transform a respective super feature vector y into a feature vector x in accordance with the following rule:

$$x = A^T \cdot (y - \bar{y}) \tag{1}$$

where

• $x$ denotes a feature vector,

• $A$ denotes an LDA matrix,

• $y$ denotes a super feature vector, and

• $\bar{y}$ denotes a global displacement vector.
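The transformation rule (1) amounts to subtracting the global displacement vector and multiplying by the transposed LDA matrix. A minimal sketch with plain lists follows; the function name and the tiny 3 x 2 matrix are hypothetical and serve only to show the index handling.

```python
def lda_transform(A, y, y_mean):
    """Compute x = A^T (y - y_mean).

    A: LDA matrix given as a list of rows (one row per input component,
       one column per output component).
    y: super feature vector, y_mean: global displacement vector.
    """
    if not (len(A) == len(y) == len(y_mean)):
        raise ValueError("A must have one row per component of y")
    centered = [a - b for a, b in zip(y, y_mean)]
    n_out = len(A[0])
    # Column c of A, dotted with the centered vector, gives component c of x.
    return [sum(A[r][c] * centered[r] for r in range(len(A)))
            for c in range(n_out)]


# Illustrative numbers only: a 3-dimensional input reduced to 2 dimensions.
x = lda_transform([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                  [3.0, 4.0, 5.0],
                  [1.0, 1.0, 1.0])
```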

The LDA matrix A is determined in such a way that

• the components of the feature vector x are essentially uncorrelated with one another on a statistical average,

• the statistical variances within a segment class are normalized on a statistical average,

• the centers of the segment classes are at a maximum distance from one another on a statistical average, and

• the dimension of the feature vectors x is reduced as far as possible, preferably depending on the speech recognition application.

The method for determining the LDA matrix A according to this exemplary embodiment is explained below.

However, it should be noted that, alternatively, all known methods for determining an LDA matrix A can be used without restriction.

It is assumed that J segment classes exist, each segment class j containing a set of $D_y$-dimensional super feature vectors y, that is to say:

$$\text{class } j = \{ y_j^1, y_j^2, \ldots, y_j^{N_j} \} \tag{2}$$

where $N_j$ is the number of super feature vectors $y_j^i$ in class j.

$$N = \sum_{j=1}^{J} N_j \tag{3}$$

is the total number of super feature vectors y.

It should be noted that the super feature vectors $y_j^i$ have been determined using the segmentation of the speech signal database described above.

According to this exemplary embodiment, each super feature vector $y_j^i$ has a dimension $D_y$ of

$$D_y = 78 \; (= 2 \cdot 3 \cdot 13)$$

since the super feature vector contains 13 MFCC coefficients (cepstrum coefficients) together with their respective first and second time derivatives (which accounts for the factor 3 above).

Furthermore, each super feature vector $y_j^i$ contains the components of two immediately consecutive time windows of the short-term analysis (which accounts for the factor 2 above).
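The count 2 · 3 · 13 = 78 can be made concrete with a small sketch that stacks coefficients, first differences and second differences for two consecutive time windows. Two things are assumptions here: the time derivatives are approximated by simple frame-to-frame differences, and the stacking order inside the vector is arbitrary. The function name is hypothetical.

```python
def super_feature_vector(frames, t):
    """Stack [c, delta, delta-delta] of windows t-1 and t into one vector.

    frames: list of per-window coefficient lists (e.g. 13 values each).
    Derivatives are approximated by plain differences (an assumption).
    """
    if t < 3 or t >= len(frames):
        raise ValueError("need three preceding frames and a valid index t")

    def delta(k):
        return [a - b for a, b in zip(frames[k], frames[k - 1])]

    def window(k):
        first = delta(k)
        second = [a - b for a, b in zip(first, delta(k - 1))]
        return list(frames[k]) + first + second

    return window(t - 1) + window(t)


# Synthetic frames: coefficient k of frame i has the value i * k.
frames = [[float(i * k) for k in range(13)] for i in range(5)]
vector = super_feature_vector(frames, 4)
```

With 13 coefficients per window the result has 78 components, matching the dimension stated above.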

In this context, it should be noted that any number of vector components adapted to the respective application can be included in the super feature vector y, for example up to 20 cepstrum coefficients together with their first and second time derivatives.

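The statistical mean or, in other words, the center $\bar{y}_j$ of class j results from the following rule:

$$\bar{y}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} y_j^i \tag{4}$$

The covariance matrix $\Sigma_j$ of class j results from the following rule:

$$\Sigma_j = \frac{1}{N_j} \sum_{i=1}^{N_j} \left( y_j^i - \bar{y}_j \right) \left( y_j^i - \bar{y}_j \right)^T \tag{5}$$

The average intra-scatter matrix $S_w$ is defined as:

$$S_w = \sum_{j=1}^{J} p(j) \cdot \Sigma_j \tag{6}$$

with

$$p(j) = \frac{N_j}{N} \tag{7}$$

where $p(j)$ is called the weighting factor of class j.

Analogously, the average inter-scatter matrix $S_b$ is defined as:

$$S_b = \sum_{j=1}^{J} p(j) \cdot \left( \bar{y}_j - \bar{y} \right) \left( \bar{y}_j - \bar{y} \right)^T \tag{8}$$

with

$$\bar{y} = \sum_{j=1}^{J} p(j) \cdot \bar{y}_j \tag{9}$$

as the average super feature vector across all classes.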
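The class means, the weighting factors and the two scatter matrices can be computed with a few helper functions. The sketch assumes the usual definitions of the class covariance matrix (normalized by the class size) and of the inter-scatter matrix (weighted outer products of the class mean offsets). All names and the toy data are hypothetical.

```python
def mean(vectors):
    """Component-wise mean of a list of equally long vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def outer(u, v):
    return [[a * b for b in v] for a in u]


def add_scaled(m1, m2, scale):
    """Return m1 + scale * m2 as a new matrix."""
    return [[a + scale * b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def scatter_matrices(classes):
    """Return (S_w, S_b) for a list of classes, each a list of vectors."""
    total = sum(len(c) for c in classes)
    dim = len(classes[0][0])
    zero = [[0.0] * dim for _ in range(dim)]
    priors = [len(c) / total for c in classes]           # p(j) = N_j / N
    means = [mean(c) for c in classes]                   # class centers
    global_mean = [sum(p * m[d] for p, m in zip(priors, means))
                   for d in range(dim)]
    s_w, s_b = zero, zero
    for vectors, p, m in zip(classes, priors, means):
        cov = zero
        for y in vectors:
            diff = [a - b for a, b in zip(y, m)]
            cov = add_scaled(cov, outer(diff, diff), 1.0 / len(vectors))
        s_w = add_scaled(s_w, cov, p)                    # weighted covariances
        offset = [a - b for a, b in zip(m, global_mean)]
        s_b = add_scaled(s_b, outer(offset, offset), p)  # weighted mean offsets
    return s_w, s_b


# Two toy classes with two vectors each (illustrative values only).
S_w, S_b = scatter_matrices([
    [[0.0, 0.0], [2.0, 0.0]],
    [[4.0, 2.0], [4.0, 4.0]],
])
```

In this toy example the classes are tight compared with the distance between their centers, so the entries of the inter-scatter matrix are larger than those of the intra-scatter matrix.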

The LDA matrix A is decomposed according to the following rule:

$$A = U \cdot W \cdot V \tag{10}$$

where

• $U$ denotes a first transformation matrix,

• $W$ denotes a second transformation matrix, and

• $V$ denotes a third transformation matrix.

The first transformation matrix U is used to diagonalize the average intra-scatter matrix $S_w$ and is determined by transforming the positive definite and symmetric average intra-scatter matrix $S_w$ into its eigenvector space. In its eigenvector space, the average intra-scatter matrix $S_w$ is a diagonal matrix whose components are greater than or equal to zero. The components whose values are greater than zero correspond to the average variance in the respective dimension defined by the corresponding vector component.

The second transformation matrix W is used to normalize the average variances and is determined according to the following rule:

$$W = \left( U^T \cdot S_w \cdot U \right)^{-1/2} \tag{11}$$

The transformation U • W is also called whitening.

With

$$B = U \cdot W \tag{12}$$

the matrix $B^T \cdot S_w \cdot B$ becomes the unit matrix, which remains unchanged under any orthonormal linear transformation.
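Why the whitening step yields the unit matrix can be seen in a short derivation. It is a sketch under the assumption that all eigenvalues of the average intra-scatter matrix are strictly positive, so that the inverse square root exists; the symbols $\Lambda$ (the diagonalized intra-scatter matrix) and $Q$ (an arbitrary orthonormal matrix) are introduced here only for illustration.

```latex
\begin{align*}
\Lambda &= U^T \cdot S_w \cdot U
  && \text{(diagonal, since the columns of } U \text{ are eigenvectors of } S_w\text{)} \\
W &= \Lambda^{-1/2}
  && \text{(diagonal, hence } W^T = W\text{)} \\
B^T \cdot S_w \cdot B
  &= W^T \cdot U^T \cdot S_w \cdot U \cdot W
   = \Lambda^{-1/2} \cdot \Lambda \cdot \Lambda^{-1/2}
   = I \\
Q^T \cdot I \cdot Q &= Q^T \cdot Q = I
  && \text{(for any orthonormal } Q\text{)}
\end{align*}
```

The last line is the reason a further orthonormal transformation can be applied to diagonalize the inter-scatter matrix without disturbing the normalized intra-class variances.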

In order to diagonalize the average inter-scatter matrix $S_b$, the third transformation matrix V is formed by transforming the matrix

$$B^T \cdot S_b \cdot B \tag{13}$$

which likewise represents a positive definite and symmetric matrix, into its eigenvector space.

In the transformation space

$$x = A^T \cdot (y - \bar{y}) \tag{14}$$

the following matrices result:

A diagonalized average intra-scatter matrix $S_w$:

$$S_w = \operatorname{diag}(1)_{d = 1, \ldots, D_y} \tag{15}$$

and a diagonalized average inter-scatter matrix $S_b$:

$$S_b = \operatorname{diag}\left(\sigma_d^2\right)_{d = 1, \ldots, D_y} \tag{16}$$

where $\operatorname{diag}(c_d)_{d = 1, \ldots, D_y}$ denotes a $D_y \times D_y$ diagonal matrix with the component $c_d$ in row/column d and the value zero elsewhere.

The values $\sigma_d^2$ are the eigenvalues of the average inter-scatter matrix $S_b$ and represent a measure of the so-called pseudo-entropy of the feature vector components, which is also referred to below as the information content of the feature vector components. It should be noted that the trace of every matrix is invariant with respect to any orthogonal transformation, so that the sum

$$\sum_{d=1}^{D_y} \sigma_d^2$$

represents the total average variance of the average vectors $\bar{x}_j$ of the J classes.

This results in a defined dependency of the pseudo-entropy of the feature vectors on the feature vector components contained or considered in the feature vector.

According to this exemplary embodiment, a dimension reduction is then carried out by sorting the $\sigma_d^2$ values in order of decreasing size and omitting, i.e. disregarding, those $\sigma_d^2$ values which are smaller than a predetermined threshold value. The predetermined threshold value can also be defined cumulatively.

The LDA matrix A can then be adapted by sorting the rows according to the eigenvalues $\sigma_d^2$ and omitting the rows which belong to the sufficiently "small" variances and thus have only a low information content (low pseudo-entropy).

According to this exemplary embodiment, the components with the 24 largest eigenvalues $\sigma_d^2$ are used, in other words

$$D_x = 24.$$
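The selection of components by eigenvalue can be sketched as a sort followed by a cut. The function names, the example eigenvalues and the choice to treat the matrix as a list of rows are assumptions made for illustration. A fixed count (such as 24) and a threshold are both supported.

```python
def select_components(eigenvalues, keep=None, threshold=None):
    """Return component indices ordered by decreasing eigenvalue.

    keep: optional maximum number of components to retain.
    threshold: optional minimum eigenvalue for a component to be retained.
    """
    order = sorted(range(len(eigenvalues)),
                   key=lambda d: eigenvalues[d], reverse=True)
    if threshold is not None:
        order = [d for d in order if eigenvalues[d] >= threshold]
    if keep is not None:
        order = order[:keep]
    return order


def reduce_rows(matrix_rows, kept):
    """Keep only the selected rows, in the selected order."""
    return [matrix_rows[d] for d in kept]


# Illustrative eigenvalues; a real system would keep the 24 largest of 78.
kept_by_count = select_components([0.1, 5.0, 2.0, 0.01], keep=2)
kept_by_threshold = select_components([0.1, 5.0, 2.0, 0.01], threshold=0.05)
reduced = reduce_rows([[1, 0], [0, 1], [1, 1], [2, 2]], kept_by_count)
```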

The four steps described above for determining the LDA matrix A are summarized in the following table:

The last partial method in the course of training the hidden Markov models is the clustering of the feature vectors, which is carried out by means of a cluster unit and which yields a respective code book, in each case specific to a training data record with a predetermined number of feature vector components.

The entirety of the representatives of the segment classes is referred to as a code book and the representatives themselves are also referred to as prototypes of the phoneme segment class.

The prototypes, hereinafter also referred to as prototype feature vectors, are determined in accordance with the Baum-Welch training known per se. The basic entries of the electronic dictionary, that is to say the basic entries for speaker-independent speech recognition, were created and stored in the manner described above, and the corresponding Hidden Markov models were trained.

There is therefore a Hidden Markov model for each basic entry.

The electronic dictionary with the basic entries 201 is designated by the reference symbol 200 in FIG.

As will be explained in more detail below, a sequence 203 of phonemes is determined for each of one or more utterances spoken by a user, which in their entirety are illustratively referred to as a speaker-dependent dictionary 202 in FIG. 2, and each such sequence 203 of phonemes is stored as a new entry in the common electronic dictionary 204, which now contains the basic entries 201 and the new entries 203 in the form of phoneme chains.

The new electronic dictionary 204 thus illustratively contains the basic entries 201 as well as the linguistic utterances converted into phoneme chains, that is to say the originally speaker-dependent entries, which can be regarded as speaker-independent representatives of new entries due to the conversion into phoneme chains.

In this way, speaker-independent speech recognition is made possible on the basis of the common electronic dictionary 204.

The common electronic dictionary 204 thus forms the search space for the Viterbi search as part of the speaker-independent speech recognition. The utterances of a user are illustratively mapped to a sequence of phonemes, and a phoneme dictionary 202 is formed which contains the sequences of phonemes (phoneme chains).

Thus, no commands or words are recognized for the speaker-dependent part as part of speaker-independent speech recognition, but rather phoneme chains.

These phoneme chains are stored in the new electronic dictionary 204.

The addition of the user-defined entries, that is to say the phoneme chains, to the electronic dictionary 200 is made possible, as will be explained in more detail below, in particular by means of file management in the computer 108, by means of which the speaker-dependent list entries and the basic entries are stored in the speaker-independent dictionary 200 and the communication with the speech recognition application is realized and managed.

According to this exemplary embodiment, the microprocessor 110 is the controller SDA 80D51U-A from Infineon Technologies AG and the method for forming hidden Markov models is based on the software and a digital signal processor (DSP) from Oak.

Electronic messages are used for communication between the microprocessor 110 and the DSP 123 from Oak in order to trigger predetermined events and actions in the microprocessor 110 or the DSP 123.

According to this exemplary embodiment of the invention, the following messages are provided for different speech recognition states, hereinafter also referred to as HMM states: First, the speech recognition states permitted within the speech recognizer are explained.

As can be seen in FIG. 3, the speech recognizer has the following four HMM speech recognition states:

• an initialization state INIT 301,

• a stop state STOP 302,

• a pause state PAUSE 303 and

• an operating mode state RUN 304.

As can be seen from the state transition diagram 300 in FIG. 3, the following state transitions are provided.

The initialization state INIT 301 can be changed to the stop state STOP 302, which happens automatically when all databases are loaded (step 305).

Another transition from the initialization state INIT 301 to one of the other states PAUSE, RUN 303, 304 is not provided.

From the stop state STOP 302, a transition can be made to the initialization state INIT 301 if the loaded databases are inconsistent (step 306), in which case the InitHMMSRDefault message is transmitted from the DSP 123 to the microprocessor 110.

If the "PAUSE" command in the form of the PauseHMMSR message is received from the microprocessor 110 by the speech recognizer, that is to say the DSP 123, the DSP 123 changes to the pause state PAUSE 303 (step 307).

After receiving the "RUN" command in the form of the StartHMMSR message from the microprocessor 110, the DSP 123 changes to the RUN 304 operating mode state (step 308). The pause state PAUSE 303 can be changed to the initialization state INIT 301 (step 309) if the databases are inconsistent. This occurs in the case when the DSP 123 receives the InitHMMSRDefault message from the microprocessor 110.

After receiving the "STOP" command in the form of the StopHMMSR message from the microprocessor 110, the DSP 123 changes to the stop state STOP 302 (step 310).

After receiving the "RUN" command in the form of the StartHMMSR message from the microprocessor 110, the DSP 123 changes from the pause state PAUSE 303 to the operating mode state 304 (step 311).

Finally, the speech recognition unit can return from the RUN 304 operating mode state to the INIT 301 initialization state if the databases are inconsistent (step 312). This happens when the DSP 123 receives the InitHMMSRDefault message from the microprocessor 110.

After receiving the "STOP" command in the form of the StopHMMSR message from the microprocessor 110, the speech recognition unit, in other words the DSP 123, returns from the RUN 304 operating mode state to the STOP 302 stop state (step 313).

Finally, it is provided that the speech recognition unit goes from the operating mode state RUN 304 to the pause state PAUSE 303 after receiving the command "PAUSE" (step 314). This happens when the DSP 123 receives the message PauseHMMSR from the microprocessor 110.
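The state transitions of FIG. 3 can be summarized as a small table-driven state machine. The following is only a sketch: the class and method names are hypothetical, and the source state STOP for steps 307 and 308 is an assumption, since the text names only the target state for these two steps.

```python
from enum import Enum


class State(Enum):
    INIT = 301
    STOP = 302
    PAUSE = 303
    RUN = 304


# (current state, received message) -> (next state, step number in FIG. 3)
TRANSITIONS = {
    (State.STOP, "InitHMMSRDefault"): (State.INIT, 306),
    (State.STOP, "PauseHMMSR"): (State.PAUSE, 307),   # source state assumed
    (State.STOP, "StartHMMSR"): (State.RUN, 308),     # source state assumed
    (State.PAUSE, "InitHMMSRDefault"): (State.INIT, 309),
    (State.PAUSE, "StopHMMSR"): (State.STOP, 310),
    (State.PAUSE, "StartHMMSR"): (State.RUN, 311),
    (State.RUN, "InitHMMSRDefault"): (State.INIT, 312),
    (State.RUN, "StopHMMSR"): (State.STOP, 313),
    (State.RUN, "PauseHMMSR"): (State.PAUSE, 314),
}


class Recognizer:
    """Hypothetical model of the recognizer states, not the DSP firmware."""

    def __init__(self):
        self.state = State.INIT

    def databases_loaded(self):
        """Step 305: INIT -> STOP happens automatically after loading."""
        if self.state is State.INIT:
            self.state = State.STOP

    def receive(self, message):
        """Apply a message; messages without a transition keep the state."""
        next_state, _step = TRANSITIONS.get(
            (self.state, message), (self.state, None))
        self.state = next_state
        return self.state
```

Because no transition leads from INIT to PAUSE or RUN, a StartHMMSR message received in the INIT state leaves the modelled recognizer in INIT, which matches the statement that no such transition is provided.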

In summary, the following messages are provided for communication between the microprocessor 110 and the Oak DSP 123.

The following messages, which can be sent to the Oak DSP 123 or from the Oak DSP 123, are defined in the initialization state INIT 301, in which the speech recognizer is initialized with default values after activation:

• InitHMMSRDefault:

This message initiates the setting of default values for the speech recognizer in the Oak software in the DSP 123,

• InitHMMSRParams:

This message initiates the loading of the speech recognition parameters into the Oak DSP 123,

• StartLoadHMMLDA:

This message starts loading the program to determine the LDA matrix,

• StartLoadHMMDictionary:

With this message, the loading of an electronic dictionary is started depending on the respective status or the respective speech dialog,

• CodebookBlockLoadedAndSwitched:

With this message, the Oak software in the DSP 123 is notified that the switched memory block (SMB), which according to this exemplary embodiment has a size of 16 Kbytes, is accessible to the Oak software, since the total number of blocks and segments that the speech recognizer has to take into account in the application is now known in the microprocessor 110.

In the STOP state STOP 302, in which the speech recognizer is deactivated, the following messages are provided, which can be sent to the Oak DSP 123 or from the Oak DSP 123:

• InitHMMSRDefault,

• InitHMMSRParams,

• StartLoadHMMLDA,

• StartLoadHMMDictionary,

• PauseHMMSR:

This message informs the Oak DSP 123 that the speech recognizer should go into the deactivated state,

• StartHMMSR:

With this message, the speech recognizer in the Oak DSP 123 is started.

In the pause state PAUSE 303, in which the preprocessing is carried out in the speech recognizer, but no feature extraction has yet taken place, the following messages are provided, which can be transmitted to the Oak DSP 123 or from the Oak DSP 123:

• InitHMMSRDefault,

• InitHMMSRParams,

• StartLoadHMMLDA,

• StartLoadHMMDictionary,

• StartHMMSR,

• StopHMMSR:

This message indicates to the Oak DSP 123 that the speech recognizer should be stopped and should go into the stop state 302,

• CodebookBlockLoadedAndSwitched.

In the operating mode state RUN 304, in which the speech recognizer is fully active and carries out the speech recognition, the following messages are provided, which can be transmitted to the Oak DSP 123 or from the Oak DSP 123:

• InitHMMSRDefault,

• InitHMMSRParams,

• StartLoadHMMLDA,

• StartLoadHMMDictionary,

• PauseHMMSR,

• StopHMMSR,

• CodebookBlockLoadedAndSwitched.

In addition, a message SetHMMSearchParams (minStableTime, wordStartupPenalty, transitionPenalty) is provided, with which the search parameters minStableTime, wordStartupPenalty and transitionPenalty can be set for the speaker-dependent dictionary.

This message can be transmitted to the microprocessor 110 by the speech recognizer in the Oak DSP 123 in any state of the firmware of the DSP 123. With the message SetHMMSearchParams (minStableTime, wordStartupPenalty, transitionPenalty) the search is reset and the parameters for the search space are defined.
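The per-state message lists above lend themselves to a simple lookup. The sketch below only restates those lists as data; the names `ALLOWED`, `ANY_STATE` and `is_allowed` are hypothetical and not part of the described firmware.

```python
# Messages that the text lists for every one of the four states.
COMMON = {
    "InitHMMSRDefault",
    "InitHMMSRParams",
    "StartLoadHMMLDA",
    "StartLoadHMMDictionary",
}

# State-specific additions, as enumerated in the description.
ALLOWED = {
    "INIT": COMMON | {"CodebookBlockLoadedAndSwitched"},
    "STOP": COMMON | {"PauseHMMSR", "StartHMMSR"},
    "PAUSE": COMMON | {"StartHMMSR", "StopHMMSR",
                       "CodebookBlockLoadedAndSwitched"},
    "RUN": COMMON | {"PauseHMMSR", "StopHMMSR",
                     "CodebookBlockLoadedAndSwitched"},
}

# SetHMMSearchParams is described as usable in any state.
ANY_STATE = {"SetHMMSearchParams"}


def is_allowed(state, message):
    """Return True if `message` is listed for `state` in the description."""
    return message in ANY_STATE or message in ALLOWED[state]
```

Such a table could be checked before a message is sent, so that, for example, a StartHMMSR message is rejected while the recognizer is still in the INIT state.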

The structure of the speech dialog states of a speech dialog in which the speech recognition is carried out is explained below.

A speech dialog is used for the interactive communication of the speech recognizer with a human user in order to allow the user predefined control options and thus an interactive intervention in the system to be controlled, that is to say in the computer 108 using the speech recognizer.

In this context, it should be pointed out that the speech dialog states should not be confused with the states of the speech recognizer described above.

Every speech dialog, i.e. every speech application, starts from a basic state after its activation. According to this exemplary embodiment, a number of commands are defined for a speech application, which are also referred to below as keywords.

Each command can have a single word or multiple words. Each command is linked to an action that is uniquely assigned to the respective command (see voice dialog state diagram 400 in FIG. 4).

The actions 408, each of which is associated with a command 406 in the form of a command-action tuple (407, 409), can control a device such as a CD player, a communication device or another element of a stereo system, or in general a technical system, and terminate the speech recognition application, or perform additional actions in the same step, triggered by the same command, or, for example, emit a sound by means of the loudspeaker 121, and then change into the same speech dialog state or into a different speech dialog state.

The speech dialog state diagram 400 in FIG. 4 shows a temporally preceding speech dialog state X-1 401 as well as its transition to the current speech dialog state X 402. The speech dialog state diagram 400 also shows that the words to be taken into account in the current speech dialog state X 402 are stored in an electronic dictionary 403. The electronic dictionary 403 contains both the words 404 previously stored in the form of speaker-independent, previously trained HMMs, in general the speaker-independent utterances with an arbitrary number of words 404, and list entries 405, which contain originally speaker-dependent utterances that, as will be explained below, are mapped to a sequence of phonemes, which is then accessible to speaker-independent speech recognition. In a respective speech dialog state diagram 400, as explained above, the commands 406 and the actions 408 are each contained in a command-action tuple (407, 409).

According to this exemplary embodiment, the maximum length of a phoneme sequence is application-dependent with regard to its temporal length or with regard to the number of permissible phonemes.

According to this exemplary embodiment, 255 states are permissible per list entry 405, so that the largest number of permissible phonemes per list entry 405 is less than or equal to 85.
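The relationship between the two limits can be checked with a short calculation. The value of three HMM states per phoneme is an assumption inferred from the ratio 255 / 85; it is not stated in this passage, and the function names are hypothetical.

```python
STATES_PER_LIST_ENTRY = 255   # limit given in the text
STATES_PER_PHONEME = 3        # assumption, inferred from 255 / 85


def max_phonemes(states_per_entry=STATES_PER_LIST_ENTRY,
                 states_per_phoneme=STATES_PER_PHONEME):
    """Largest number of phonemes that fit into one list entry."""
    return states_per_entry // states_per_phoneme


def fits_in_list_entry(phoneme_chain):
    """Check whether a phoneme chain respects the per-entry limit."""
    return len(phoneme_chain) <= max_phonemes()
```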

The commands are generated from the available vocabulary, that is to say from the words 404 or list entries 405 contained in the dictionary 403.

Each command 406, 407 is necessarily linked to an action 408, 409.

The command structure is as follows:

• command = one or more words,

• command = one or more words + a list entry,

• command = one list entry + one or more words,

• command = one or more words + one list entry + one or more words,

• command = a list entry.
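Taken together, the five forms listed above amount to a non-empty sequence of words containing at most one list entry. A minimal sketch of such a check follows; the token representation as `("word", text)` and `("entry", text)` pairs and the function name are assumptions made for illustration.

```python
def is_valid_command(tokens):
    """Check a command against the command structure.

    tokens -- list of ("word", text) or ("entry", text) pairs
    """
    if not tokens:
        return False
    kinds = [kind for kind, _text in tokens]
    if any(kind not in ("word", "entry") for kind in kinds):
        return False
    # Every listed form contains zero or one list entry, in any position.
    return kinds.count("entry") <= 1
```

For example, `[("word", "call"), ("entry", "anna")]` is accepted as "one or more words + a list entry", whereas a sequence with two list entries is rejected.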

In the context of this description, an action 407, 409 is a synonym for a reaction, in principle of any complexity, of the respective technical system, in abstract terms the reaction of the speech recognizer for controlling the actuator 122 in FIG. 1. An action 407, 409 is, for example, feedback from the speech recognizer according to the invention in order to indicate to the user that something was recognized at all, switching to another speech dialog state, loading a new electronic dictionary, or executing an action assigned to the command, such as dialing an entered telephone number.

In this context it should be noted that the same action can be defined for different commands.

The communication between the microprocessor 110, that is to say the 80D51 controller and the digital signal processor 123 from Oak Technologies, is described below in the form of message flow diagrams.

As is shown in the sequence diagram 500 in FIG. 5, starting from a start state 501 after the DSP 123 has been activated, in other words after the speech recognizer has started, an acoustic signal is output to the user by means of the loudspeaker 121 in order to signal to the user that the speech recognition process has been activated.

Starting from the start state 501, a transition is made to a first cycle (Cycle_0) 502, in which the HMMs are initialized.

Subsequently, a transition is made to a second cycle (Cycle_1) 503, and after it has been carried out it is checked whether the speech recognition process is to be ended (test step 504). If the speech recognition process is to be ended, a transition is made to an end state 505, in which the speech recognizer is deactivated.

However, if the speech recognition is not to be ended, an additional test step (test step 506) is used to check whether a change should be made to a new speech dialog state.

If this is the case, then the second cycle 503 is carried out for the new speech dialog state.

However, if the speech dialog state is not to be changed, a third test step (test step 507) checks whether a change should be made in the electronic dictionary, for example whether a new entry should be added to the electronic dictionary, whether an entry should be removed from the electronic dictionary, or whether only an entry in the electronic dictionary should be changed.

If this is the case, the system branches again to the second cycle 503.

However, if this is not the case, a third cycle (Cycle_2) 508, which will be explained in more detail below, is carried out.

After the third cycle 508 has ended, a fourth test step (test step 509) again checks whether the speech recognition is to be ended.

If the speech recognition is to be ended, the end state 505 is entered and the speech recognizer is deactivated, otherwise a branch is made to the second test step 506, in which it is checked whether a new speech dialog state is to be assumed.
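The control flow of FIG. 5 can be traced with a short sketch. The encoding of the test results as the strings "end", "new_state", "dict_change" and "stay" is hypothetical, as is the function name; the sketch only reproduces the order in which the cycles are entered.

```python
def run_dialog(decisions):
    """Trace the cycles of FIG. 5 for a scripted list of decisions.

    decisions -- one entry per pass through the test steps:
                 "end"         -> end state 505 (test steps 504 / 509)
                 "new_state"   -> second cycle 503 (test step 506)
                 "dict_change" -> second cycle 503 (test step 507)
                 "stay"        -> third cycle 508
    """
    # Cycle_0 (502) initializes the HMMs, Cycle_1 (503) always follows.
    trace = ["Cycle_0", "Cycle_1"]
    for decision in decisions:
        if decision == "end":
            trace.append("END")
            return trace
        if decision in ("new_state", "dict_change"):
            trace.append("Cycle_1")
        elif decision == "stay":
            trace.append("Cycle_2")
        else:
            raise ValueError("unknown decision: %r" % (decision,))
    return trace
```

A call such as `run_dialog(["stay", "new_state", "end"])` shows that the third cycle is only entered when neither the speech dialog state nor the dictionary changes.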

Fig. 6 shows the implementation of the first cycle 502 in detail in a message flow diagram 600. In the first cycle 502, the HMMs are initialized.

In an initialization phase, in which the speech recognizer is in the initialization state INIT 301, a message StartHMMSRFlowgraph 601 is transmitted by the microprocessor 110 to the DSP 123, with which the respective HMM in the DSP 123 is started.

In response to the StartHMMSRFlowgraph 601 message, the DSP 123 sends a confirmation message 602 to the microprocessor 110.

A message InitHMMSRParams 603 is then sent by the microprocessor 110 to the DSP 123, which activates the loading of the speech recognition parameters into the DSP 123.

In response to the InitHMMSRParams 603 message, a confirmation message 604 is transmitted from the DSP 123 to the microprocessor 110.

In a further step, a StartLoadHMMLDA 605 message is sent from the microprocessor 110 to the DSP 123 in order to start the loading of the computer program which causes the determination of an LDA matrix, as described above, for the respective HMM.

In response, the DSP 123 sends a message SMBRequestLoadHMMLDA 606 to the microprocessor 110, which responds to this message with a confirmation message 607, with which the microprocessor 110 indicates that the program codes necessary for performing a linear discriminant analysis are available in a switched memory block.

With a further message StartLoadHMMDictionary 608, which is sent from the microprocessor 110 to the DSP 123, the microprocessor 110 starts the loading of the electronic dictionary for the basic dialog state of the respective speech dialog in the respective application into the DSP 123.

The DSP 123 responds to the receipt of the StartLoadHMMDictionary 608 message with a message SMBRequestLoadHMMDictionary 609, with which the active vocabulary of the respective electronic dictionary is requested from the microprocessor 110. This request SMBRequestLoadHMMDictionary 609 is acknowledged with a confirmation message 610 from the microprocessor 110 after it has been received.

Subsequently, a message SMBRequestCodebookBlock 611 is transmitted from the DSP 123 to the microprocessor 110 and the microprocessor 110 responds with a message CodebookBlockLoadedAndSwitched 612 with which the microprocessor 110 transmits the switched memory blocks (SMBs) with the requested codebook data to the DSP 123.

With a further message SMBRequestCodebookBlock 613, the DSP 123 requests additional blocks containing the required code book, in other words the required code book data.

The microprocessor 110 in turn reacts to the message SMBRequestCodebookBlock 613 with a message CodebookBlockLoadedAndSwitched 614, with which it transmits one or more further SMBs with the required code book data to the DSP 123, the total number of blocks and required segments now being known in the speech recognizer.

The speech recognizer then changes to the stop state STOP 302 and only leaves this state after receipt of the command "Start", which the microprocessor 110 transmits to the DSP 123 by means of the message StartHMMSR 615, and the DSP 123 confirms receipt of the StartHMMSR 615 message with a confirmation message 616.

In the operating mode state RUN 304, the DSP 123 transmits a message SMBRequestCodebookBlock 617 to the microprocessor 110 at periodic intervals, illustratively for each frame. The microprocessor 110 then periodically sends a response, likewise for each frame, in the form of a message CodebookBlockLoadedAndSwitched 618 to the DSP 123, with which it indicates that the SMB has been switched and that a new code book block has been transmitted by the microprocessor 110 to the DSP 123.

The messages SMBRequestCodebookBlock 617 and CodebookBlockLoadedAndSwitched 618 are exchanged for each frame between the microprocessor 110 and the DSP 123 until the DSP 123 transmits the speech recognition result in a message HMMHypothesisStable 619 to the microprocessor 110, which responds upon receipt with an acknowledgment message 620.

Despite the transmission of the speech recognition result, the DSP 123 remains in the RUN 304 operating mode state.

Only after receiving a PauseHMMSR 621 message, which is sent by the microprocessor 110, does the DSP 123 go into the pause state PAUSE 303, having first sent a confirmation message 622 to the microprocessor 110.

Thus, a speech recognition result is available in the microprocessor 110 and the microprocessor 110 now decides what should happen next in the speech dialog depending on the application.
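The per-frame exchange in the operating mode state can be pictured as a message log. The following sketch is purely illustrative: the labels "DSP" and "MCU" (for the microprocessor 110), the generic "Confirm" message name, and the parameter `frames_until_stable` are assumptions, and no actual recognition is performed.

```python
def run_recognition(frames_until_stable):
    """Return the (sender, message) log of one recognition pass in RUN.

    frames_until_stable -- hypothetical number of frames processed before
                           the DSP reports a stable hypothesis
    """
    log = []
    for _frame in range(frames_until_stable):
        # One request/response pair is exchanged for each frame.
        log.append(("DSP", "SMBRequestCodebookBlock"))
        log.append(("MCU", "CodebookBlockLoadedAndSwitched"))
    # The result is reported and acknowledged; the DSP stays in RUN.
    log.append(("DSP", "HMMHypothesisStable"))
    log.append(("MCU", "Confirm"))
    # Only the pause command moves the recognizer to the PAUSE state.
    log.append(("MCU", "PauseHMMSR"))
    log.append(("DSP", "Confirm"))
    return log
```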

It is assumed below that an application controls several different devices, of which a telecommunication device, in particular a telephone, is just one among many.

In this case, in a first dialog step of the speech dialog, a method for speaker-independent speech recognition is used to determine which device is actually to be controlled, for example a CD player, a tape recorder, a cassette recorder, a radio or a telecommunications terminal (such as a telephone, a fax machine, a teletext device, a mobile device, a PDA, etc.).

Without loss of generality, it is now assumed that a telephone is selected as the device to be controlled in the first speech dialog step, so that the result of the first cycle 502 is that the menu, in other words the speech dialog state, should be changed in order to control a telephone.

In the event that a state transition from a speech dialogue state to another speech dialogue state is to take place, it is usually necessary to change the electronic dictionary currently used in the context of speech recognition, which is normally dependent on the respective speech dialogue state.

In this case, the second cycle 503 is carried out, the exchanged messages between the microprocessor 110 and the DSP 123 being shown in detail in a message flow diagram 700 in FIG.

In a first phase, in which the speech recognizer is in the pause state PAUSE 303, a message SetHMMSearchParams 701 is transmitted from the microprocessor 110 to the DSP 123 in order to indicate to the DSP 123 that it should load new parameters for the search. The receipt of this message is acknowledged with a confirmation message 702 from the DSP 123 to the microprocessor 110.

A message StartLoadHMMDictionary 703 is then transmitted from the microprocessor 110 to the DSP 123, by means of which message the loading of the electronic dictionary for the respective new speech dialog state into the DSP 123 is started.

With a message SMBRequestLoadHMMDictionary 704, the signal processor 123 requests the respective electronic dictionary from the microprocessor 110, divided into blocks (SMB) of 16 KB each. With a confirmation message 705, the microprocessor 110 indicates that the requested electronic dictionary is available in SMBs and can be copied by the DSP 123 to another memory location.
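The division of a dictionary into 16 KB blocks can be sketched as follows. The function name and the representation of the dictionary as a byte string are assumptions; the description does not specify the memory layout of a dictionary.

```python
SMB_SIZE = 16 * 1024  # size of one switched memory block in bytes


def split_into_smbs(data, block_size=SMB_SIZE):
    """Split a serialized dictionary into consecutive blocks.

    All blocks except possibly the last one have exactly `block_size` bytes.
    """
    return [data[start:start + block_size]
            for start in range(0, len(data), block_size)]
```

A dictionary of 40000 bytes, for example, would occupy three blocks under this scheme, the last of which is only partially filled.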

By means of a message StartHMMSR 706, which is sent by the microprocessor 110 to the DSP 123, received by the latter and confirmed by the DSP 123 by means of a confirmation message 707, the speech recognizer in the DSP 123 changes to the RUN 304 operating mode state. In a second phase, a message exchange then takes place in which the DSP 123 transmits a message SMBRequestCodebookBlock 708 for each frame to the microprocessor 110, which responds to this message with a message CodebookBlockLoadedAndSwitched 709, also for each frame. With the message CodebookBlockLoadedAndSwitched 709, the DSP 123 is informed by the microprocessor 110 that the SMB has been switched and that a new code book block with code book data has been loaded on the part of the microprocessor 110 and is therefore available for the DSP 123.

These messages SMBRequestCodebookBlock 708 and CodebookBlockLoadedAndSwitched 709 are, as described above, exchanged for each frame until the speech recognizer has determined a speech recognition result, whereupon the DSP 123 transmits a message HMMHypothesisStable 710 to the microprocessor 110, which simultaneously indicates that a speech recognition result has been determined by the DSP 123 and which result it is.

Again, the speech recognizer remains in the RUN 304 operating mode state.

The microprocessor 110 responds to the HMMHypothesisStable 710 message with a confirmation message 711, and only after receiving a "pause" command does the microprocessor 110 transmit a PauseHMMSR 712 message to the DSP 123, which acknowledges receipt of this message with a confirmation message 713, whereupon the speech recognizer goes into the pause state PAUSE 303.

In summary, regarding the second cycle 503 shown in FIG. 7, it should be noted that in the event that the speech recognizer is in a speech dialog state in which both basic entries (speaker-independent) and list entries (speaker-dependent per se, but converted into phoneme sequences and thus likewise speaker-independent) are contained in a dictionary, each new dictionary is a dynamic dictionary.

As described above, it contains speaker-independent commands and speaker-dependent, user-defined list entries, for example in a telephone book list.

The list entries according to this exemplary embodiment of the invention are stored in a fixed memory location of the flash memory, which is provided as memory 111 in the computer 108. According to the invention, new list entries can be added to the electronic dictionary and old, no longer required list entries can be removed from the electronic dictionary, so that a dynamic, that is to say a changeable, number of words is contained in the electronic dictionary which can be recognized by the speech recognition unit using a speaker-independent speech recognition technique. It should also be noted that the search space also depends on the current number and size of the basic entries and list entries contained in the dictionary. For this reason, the most efficient file management, in particular efficient management of the electronic dictionaries, makes sense.

To add or remove list entries from the speaker-dependent part of the respective dictionary, that is, the list entries in the dictionary, a change in the current dictionary is required.

The third cycle 508 is provided in particular for adding new list entries, as is shown in detail below in a message flow diagram 800 (see FIG. 8).

The third cycle 508 is executed when a state transition from one speech dialog state to another speech dialog state does not require a change in the dictionary.

An example of such a state transition is an intrinsic loop, that is, a transition from a speech dialog state to itself.

As can be seen in FIG. 8, in a first phase a message StartHMMSR 801 is transmitted from the microprocessor 110 to the DSP 123, which then acknowledges receipt of the StartHMMSR 801 message with a confirmation message 802, whereupon the speech recognizer is in the RUN 304 operating mode state.

Subsequently, the DSP 123 transmits a message SMBRequestCodebookBlock 803, in the manner described above in connection with the second cycle 503, periodically for each frame to the microprocessor 110, which in turn, as described above, responds with a message CodebookBlockLoadedAndSwitched 804 for each frame and thus provides the requested data to the DSP 123.

The exchange of the messages SMBRequestCodebookBlock 803 and CodebookBlockLoadedAndSwitched 804 takes place in a corresponding manner until a result of the speech recognition has been determined by the DSP 123 and the result is transmitted by means of a message HMMHypothesisStable 805 to the microprocessor 110, which acknowledges receipt of the message HMMHypothesisStable 805 with a confirmation message 806.

Again, the speech recognizer remains in the RUN 304 operating mode state, and only after receiving a "pause" command does the microprocessor 110 send a PauseHMMSR 807 message to the DSP 123, which acknowledges receipt with an acknowledgment message 808, whereupon the speech recognizer goes into the pause state PAUSE 303.

In summary, the respective HMM is thus initialized in the first cycle 502, which is always carried out after an HMM speech dialog is started.

For each new speech dialog state, it is usually provided to load a new electronic dictionary into the DSP 123. To achieve this, the second cycle 503 is performed. The second cycle 503 is also used in the event that the current dictionary is changed without a change being made to another dialog state.

If the current speech dialog state is retained without changing the dictionary, the third cycle 508 is carried out.

Thus, two operating levels to be distinguished from one another are usually provided in a speech recognition application according to the invention.

The first operating level is formed by the communication between the DSP 123 and the microprocessor 110 and the speech recognition carried out by the DSP 123 of a speech signal spoken by a user.

The second level of operation is the level of software that runs on the microprocessor 110, in other words, the speech dialog.

Thus, the tasks in a speech recognition application are divided between the microprocessor 110 and the DSP 123 in such a way that the less computationally intensive implementation of the speech dialog is usually carried out by the microprocessor 110, and the very computation-intensive actual speech recognition by the DSP 123.

The file management, in particular the management of the electronic dictionaries, is carried out by the microprocessor 110.

The microprocessor 110 performs the following tasks in particular:

• adding a new list entry to the respective dictionary,

• removing a list entry from the respective dictionary,

• removing the entire list from the dictionary.

The three tasks described above are explained in more detail below:

1. Remove the entire list from the relevant electronic dictionary

The voice dialog starts with the first cycle 502. Each state transition from one voice dialog state to the next causes the second cycle 503 to be executed.

After recognizing the command "delete list" the file management is started.

All list entries in the list of the electronic dictionary are deleted.

Subsequently, an electronic dictionary still exists, which, however, now only contains the speaker-independent basic entries or is even empty if the dictionary only contained speaker-dependent and user-defined entries.

After the list is deleted from the dictionary, the speech recognition process is either ended (step 505) or continued in the third cycle 508 or the second cycle 503.

2. Remove a single list entry from the respective dictionary

Again, the speech dialog is started in the first cycle 502, and one or more state transitions to other speech dialog states may previously have been performed using the second cycle 503.

If the command "delete [entry]" is recognized, the file management is started by the microprocessor 110. The phoneme chain [entry] represents the recognized speaker-dependent list entry in the dictionary.

This entry is removed from the list in the respective dictionary.

Furthermore, according to this exemplary embodiment it is provided that the memory management reorganizes the memory which is occupied by the list entry and thus releases it again to be occupied by other list entries. In this way, economical memory management can be implemented.
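The dictionary tasks performed by the microprocessor 110 can be modelled with a small data structure. This is a sketch under stated assumptions: the class `DynamicDictionary`, its method names, and the use of a Python dict (which releases the storage of a removed entry automatically) are illustrative and do not describe the flash memory management of the exemplary embodiment.

```python
class DynamicDictionary:
    """Basic entries plus a changeable list of phoneme-chain entries."""

    def __init__(self, basic_entries):
        # Speaker-independent basic entries are fixed.
        self.basic_entries = tuple(basic_entries)
        # User-defined list entries: name -> phoneme chain.
        self.list_entries = {}

    def add_entry(self, name, phonemes):
        """Add (or overwrite) a list entry given as a phoneme chain."""
        self.list_entries[name] = tuple(phonemes)

    def remove_entry(self, name):
        """Remove one list entry; return True if it existed."""
        return self.list_entries.pop(name, None) is not None

    def clear_list(self):
        """Remove the entire list; the basic entries remain."""
        self.list_entries.clear()

    def vocabulary(self):
        """All entries currently available to the recognizer."""
        return list(self.basic_entries) + list(self.list_entries)
```

After `clear_list()`, the vocabulary contains only the basic entries, or is empty if the dictionary held only user-defined entries, as described for the "delete list" command.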

In order to ensure that the correct desired list entry is actually removed from the electronic dictionary, a security loop can be provided according to the invention, in which the user is asked again whether he really wants to delete the entered list entry. This possibly means a self-loop in the current speech dialog state without changing the dictionary, that is to say in this case the third cycle 508 is carried out.

In this case, the phoneme chain [entry] and the “delete” command are buffered.

After the desired entry has been removed from the dictionary, or after the delete command has been aborted on the basis of a user input in the security loop indicating that the specified entry is not to be deleted, the speech dialog and thus also the speech recognition process can be ended (step 505), or the speech recognition process can be continued by means of a transition to another dialog state (second cycle 503) or a self-loop in the current dialog state (third cycle 508).

3. Add new list entries to the respective dictionary

This task is the most complex routine. As in the previous two procedures, the first cycle 502 is started this time and the second cycle 503 may have been performed one or more times.

After receiving the command "Add [entry]", which was recognized by the speech recognizer, the system goes into a speech dialog state in which the phoneme dictionary is loaded. This phoneme dictionary contains the available phonemes, which depend on the language used.

The speech recognizer asks the user to speak the respective utterance, which is to be added to the dictionary as a list entry, one or more times into the speech recognition system.

The utterance must therefore be spoken into the speech recognition system at least once. As will be explained in more detail below, the recognition rate for the list entry improves the more representative utterances of the same speech utterance are received by the speech recognition unit, in other words, the more often the user speaks the same utterance into the speech recognition system.

In the following, it is assumed that three representative utterances are available, that is to say that the speech utterance has been spoken three times by the user into the speech recognition system. According to the invention, a computer variable is provided which specifies the number of inputs of a speech utterance into the speech recognition system, in other words the number of representative utterances; this variable is updated for each new representative utterance. After the first utterance has been spoken, the third cycle 507 is carried out, the first utterance is temporarily stored in a buffer as a representative utterance, and the value of the computer variable described above is increased by the value "1".

When the value of the computer variable is "3", the procedure explained in more detail below is started, in which it is checked whether the utterance is stored as a new list entry in the electronic dictionary or whether the desired utterance cannot be stored in the electronic dictionary.

In accordance with this exemplary embodiment of the invention, it is checked for an utterance to be entered as a list entry whether (i) the several spoken-in speech signals for the same utterance are sufficiently similar to one another in a comparison space, and (ii) at least one of the representative utterances has a sufficiently large distance from the words or list entries already stored in the dictionary. Only if both conditions are met is the electronic dictionary supplemented by the new spoken utterance.
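The two conditions can be illustrated with a minimal Python sketch. The function names `mismatch_ratio` and `may_add_entry` are hypothetical, and the simple position-wise mismatch ratio is only a stand-in for whichever distance measure is actually used; the thresholds correspond to the intra- and inter-threshold values introduced further below.

```python
from typing import Callable, Sequence

Phonemes = Sequence[str]
Distance = Callable[[Phonemes, Phonemes], float]


def mismatch_ratio(a: Phonemes, b: Phonemes) -> float:
    """Stand-in distance (an assumption): share of positions that differ,
    where positions missing in the shorter sequence count as differences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 1.0 - same / longest


def may_add_entry(candidates: Sequence[Phonemes],
                  vocabulary: Sequence[Phonemes],
                  min_for_word: float,
                  max_to_dict: float,
                  distance: Distance = mismatch_ratio) -> bool:
    """Return True only if both conditions for a new list entry hold."""
    # Condition (i): every pair of candidates is closer than the intra-threshold.
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            if distance(candidates[i], candidates[j]) >= min_for_word:
                return False
    # Condition (ii): at least one candidate is farther than the
    # inter-threshold from every entry already stored.
    return any(all(distance(c, w) > max_to_dict for w in vocabulary)
               for c in candidates)
```

For example, three utterances that differ in at most one phoneme pass condition (i) for a threshold of 0.5, and they pass condition (ii) as long as no stored entry is similarly close to all of them.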

In this way it is possible to use an electronic dictionary for speaker-independent speech recognition in a user-friendly and resource-saving manner, the following framework conditions being sought:

• The smallest possible number of entry acceptances should occur for unequal acoustic utterances;

• The highest possible number of entry acceptances should occur for equal acoustic utterances;

• The number of additional acoustic additions, which would lead to resource problems, should be as small as possible;

• A vocabulary should be formed which offers the highest possible recognition rate.

Clearly, similarity measures or distance measures are thus used as threshold value parameters, whereby the optimum range of the speech recognizer for a predetermined number of words or list entries contained in the dictionary can also be determined using statistical methods.

To solve the optimization problem mentioned above, two parameters are introduced in the following:

• MIN_FOR_WORD: This parameter, also referred to below as the intra-threshold value, is used to determine the similarity between the entry candidates,

• MAX_TO_DICT: This parameter, hereinafter also referred to as the inter-threshold value, is used to determine the similarity between an entry candidate and an entry of the existing vocabulary.

Based on the assumption that the variables MIN_FOR_WORD and MAX_TO_DICT are statistically independent of one another, the optimum threshold values for the respective application are determined independently of one another, the threshold values being determined in a training phase. The threshold values determined in the training phase are accordingly used as constant values in operation.

The training procedure for determining the threshold values is explained below for the specific case of two entry candidates. However, it should be pointed out that the method can be applied to any number of entry candidates for a speech utterance.

In addition, the number of speaker-dependent entries is set to N.

The distance d(s_i, s_j) is used as a measure of similarity. The distance d(s_i, s_j) is a measure of the distance between the phoneme sequences s_i and s_j. According to this exemplary embodiment, a normalized distance is used, since the phoneme chains can also have unequal lengths.

In principle, any measure that is suitable for determining a distance between two symbol chains, that is to say between two sequences of spoken units, according to this exemplary embodiment between two phoneme sequences, can be used as a distance measure, for example the so-called Levenshtein distance or the so-called Mahalanobis distance.

Any comparison standard that meets the above requirement can also be used.
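As an illustration of a normalized distance between two phoneme chains of unequal length, the following Python sketch computes the Levenshtein distance and divides it by the length of the longer chain. Normalizing by the longer length is an assumption made here for illustration; the text only states that a normalized distance is used. The function names are hypothetical.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Levenshtein distance scaled to [0, 1] by the longer sequence
    (the choice of normalization is an assumption)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest
```

With this normalization, identical chains have distance 0 and chains that share no phoneme at any alignment have distance 1, so thresholds can be chosen independently of the chain length.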

First, both the acceptance rate (AR) and the false acceptance rate (FAR) are determined to determine the intra-threshold value.

In this case, a threshold value variable T takes numbers from the range of values of the intra-threshold value MIN_FOR_WORD. For each value of the threshold value variable T, the function values AR(T) and FAR(T) are determined using a loop that includes all N entries.

The function curve is then plotted for both functions against the threshold value variable T. The optimal intra-threshold value MIN_FOR_WORD* is determined by evaluating an optimality criterion which is explained in more detail below.

First, the acceptance rate AR is calculated for equal acoustic utterances, that is, for the case that s_i = s_j.

In this case, on a statistical average, the relative distances d(s_i, s_j) between the entry candidates should only assume small values, that is to say the distance should tend toward the value "0". For a small threshold variable value T, however, the values of the distance d(s_i, s_j) can exceed the threshold variable value T, which would make the acceptance rate low. For a large threshold variable value T, however, the acceptance rate increases relatively quickly.

The acceptance rate results according to the following rule:

AR(T) = \frac{N_{AR}(T)}{N}, (18)

where N_{AR}(T) denotes the number of acceptances that have been made for a threshold variable value T.

The following rule is used as an acceptance condition, that is, an entry candidate is accepted if:

d(s_i, s_j) < T. (19)

The function AR(T) will assume a small value for a small threshold value variable T; a monotonically increasing function curve then results until the saturation of the function AR(T), that is to say AR(T) = 1, is reached.

The false acceptance rate (FAR) is then calculated for unequal acoustic utterances, that is to say for the case that s_i ≠ s_j.

In this case, on a statistical average, the values of the relative distance d(s_i, s_j) between the entry candidates should assume larger values. For a small threshold variable value T, the distance d(s_i, s_j) would often exceed the intra-threshold value, and the false acceptance rate is therefore low. For a larger threshold variable value T, the false acceptance rate increases relatively slowly.

The false acceptance rate is determined in this exemplary embodiment of the invention in accordance with the following rule:

FAR(T) = \frac{N_{FAR}(T)}{N}, (20)

where N_{FAR}(T) denotes the number of false acceptances for a threshold variable value T.

The false acceptance condition according to this exemplary embodiment is:

d(s_i, s_j) < T. (21)

The function FAR (T) will also assume a small value for a small threshold value variable T and in this case there is also a monotonically increasing function curve.

Compared with the AR(T) function, however, saturation of the FAR(T) function only occurs for a larger threshold variable value T. Thus, the functional image of both functions AR(T) and FAR(T) corresponds roughly to that of a hysteresis curve. If both functions are plotted against the threshold value variable T as a parameter, the optimal intra-threshold value MIN_FOR_WORD* is obtained at the point at which the two curves of the functions AR(T) and FAR(T) have the greatest distance from one another, as outlined in the function diagram 900 in FIG. 9, which means:

MIN_FOR_WORD* = \arg\max_T |AR(T) - FAR(T)|. (22)
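The search for the optimal intra-threshold value can be sketched in Python as follows. The sketch assumes that the training phase has already produced two lists of normalized distances, one for pairs of equal utterances and one for pairs of unequal utterances; the function names and the sample values are hypothetical and purely illustrative.

```python
from typing import Sequence, Tuple


def rate_below(distances: Sequence[float], t: float) -> float:
    """Share of distances that satisfy the acceptance condition d < t."""
    return sum(1 for d in distances if d < t) / len(distances)


def optimal_intra_threshold(same: Sequence[float],
                            different: Sequence[float],
                            thresholds: Sequence[float]) -> Tuple[float, float]:
    """Return (T*, gap), where T* maximizes |AR(T) - FAR(T)| over the given
    grid of thresholds. The first maximum is kept in case of ties."""
    best_t, best_gap = thresholds[0], -1.0
    for t in thresholds:
        ar = rate_below(same, t)        # acceptance rate for equal utterances
        far = rate_below(different, t)  # false acceptance rate
        gap = abs(ar - far)
        if gap > best_gap:
            best_t, best_gap = t, gap
    return best_t, best_gap


# Illustrative (made-up) training distances.
same_pairs = [0.05, 0.10, 0.15, 0.20]
different_pairs = [0.50, 0.60, 0.70, 0.90]
grid = [0.0, 0.1, 0.25, 0.4, 0.6, 0.8, 1.0]
print(optimal_intra_threshold(same_pairs, different_pairs, grid))
```

Evaluating the criterion on a fixed grid is a design simplification; a finer grid or the sorted distance values themselves could be used as candidate thresholds.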

The optimal inter-threshold value is then calculated. In this calculation, it should be noted that after each rejection of an utterance to be entered, after a comparison with the content of the electronic dictionary, an acoustic addition, that is to say that the same utterance has to be repeated by the user, should be requested.

An acoustic addition is to be understood as speaking an additional parameter to the utterance to be spoken in, for example in the event that the utterance is a surname of a person, in addition the first name of that person.

However, an acoustic addition is not absolutely necessary according to the invention; alternatively, the request for entry of the new utterance in the electronic dictionary can be rejected immediately, or it can be rejected after a predetermined number of spoken utterances if there is still no sufficiently high-quality signal.

For the following explanation it is assumed that only one comparison per entry is permitted. In other words, the acoustic addition is not provided in the exemplary embodiment described below.

If one finds an optimal threshold value for this case, the number of acoustic additions is reduced to a reasonable level.

In this case, the threshold value variable T takes numbers from the range of values of the inter-threshold value MAX_TO_DICT.

Using the independence assumption as set out above, the previously determined intra-threshold value MIN_FOR_WORD * can be used to carry out the second optimization task.

First, the entry acceptance rate (EAR) is measured at the intra-threshold value MIN_FOR_WORD*.

The relative number of accepted speaker-dependent entries EAR(T) is measured for each threshold variable value T from the value range of the inter-threshold value MAX_TO_DICT, that is, EAR(T) results according to the following rule:

EAR(T) = \frac{N_{EAR}(T)}{N}, (23)

where N_{EAR}(T) denotes the number of accepted candidates for a threshold variable value T.

The following rule is used as an acceptance condition:

d(w_{k+1}, W) > T, (24)

where

• W denotes the vocabulary already stored in the dictionary, that is, the stored words or list entries (W = [w_1, w_2, w_3, ..., w_k]), and

• w_{k+1} denotes the new entry.

For a small threshold variable value T, entries are also accepted which are only a short distance from the existing vocabulary W in the respective comparison space. On the other hand, hardly any entries are made for a large threshold value variable T. The function EAR (T) thus has a monotonically falling course.
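The falling course of EAR(T) can be reproduced with a small Python sketch. Two assumptions are made here that the text does not state: the distance d(w_{k+1}, W) between a candidate and the vocabulary is taken to be the distance to the closest stored entry, and the distances are treated as fixed, whereas in practice the vocabulary grows with each accepted entry. All names and values are hypothetical.

```python
from typing import Sequence


def distance_to_vocabulary(distances_to_entries: Sequence[float]) -> float:
    """Assumption: d(w_{k+1}, W) is the distance to the closest stored entry."""
    return min(distances_to_entries)


def entry_acceptance_rate(per_candidate: Sequence[Sequence[float]],
                          t: float) -> float:
    """EAR(T): share of the N candidates whose distance to the vocabulary
    exceeds the threshold t."""
    accepted = sum(1 for dists in per_candidate
                   if distance_to_vocabulary(dists) > t)
    return accepted / len(per_candidate)


# Illustrative (made-up) distances of four candidates to two stored entries.
candidates = [[0.9, 0.4], [0.3, 0.8], [0.7, 0.6], [0.2, 0.95]]
for t in (0.1, 0.35, 0.7):
    print(t, entry_acceptance_rate(candidates, t))
```

For the sample data the rate falls from 1.0 at T = 0.1 to 0.5 at T = 0.35 and to 0.0 at T = 0.7.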

The recognition rate for the current threshold value variable T is then determined using the current total vocabulary M (T).

In this context, it should be noted that the total number of entries in the dictionary depends on the current threshold variable value T. If there are speaker-dependent and speaker-independent entries in the respective common dictionary, the total vocabulary M(T) results from the sum of the number of speaker-dependent entries and the number of speaker-independent entries. The recognition rate ER(T) thus results from the following rule:

ER(T) = \frac{CRW(T)}{M(T)}, (25)

where CRW(T) denotes the number of correctly recognized words for a threshold variable value T.

For a small threshold variable value T, a relatively large number of entries are recorded in the common electronic dictionary; the risk of confusion is accordingly high, and the word recognition rate is therefore relatively low. For a large threshold variable value T, fewer entries are accepted for storage in the common dictionary, but due to the increasing spacing of the entries in the vocabulary, the recognition rate ER(T) increases. The function ER(T) thus has a monotonically increasing profile.

To determine the optimality criterion, it should be noted that the requirement applies to find an optimal range of the inter-threshold value MAX_TO_DICT for a fixed N.

A compromise between the number of entries to be recorded and the associated recognition rate ER (T) is therefore necessary.

If the function EAR(T) is observed up to the point at which it leaves its maximum, that is to say as long as EAR(T) has the value "1", in which range all N words are accepted for storage in the electronic dictionary, the corresponding recognition rate ER(T) can be determined as a function of the threshold value variable T by simple reading or by an appropriate automated evaluation.

It should be noted that, until EAR(T) leaves its maximum, the recognition rate is normalized to the value M(T) = N.

This results in a working range for the inter-threshold value MAX_TO_DICT which, for a fixed predetermined number N of speaker-dependent entries in the dictionary, is limited toward low values of the threshold value variable T by the minimum recognition rate ER_{MIN,N}.

This situation is shown in the function diagram 1000 in FIG. 10. Toward high values, the inter-threshold value MAX_TO_DICT is limited by the maximum recognition rate ER_{MAX,N} for the fixed number of speaker-dependent entries in the dictionary.

For the inter-threshold value MAX_TO_DICT, the following must therefore apply:

T_{MIN} < MAX_TO_DICT < T_{MAX}. (26)

This two-stage statistical method according to the invention has the advantage that, given the number of speaker-dependent entries N, a work area is found in which the recognition rate can move.

Conversely, it is also possible to request a minimum recognition rate and then to determine the maximum number of entries that are permitted to be stored in the dictionary.
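Reading the working range off the measured curves can be sketched in Python as follows. The sketch assumes that EAR(T) and ER(T) have been sampled on a grid of threshold values during training; the tuple layout, the function name and the sample values are hypothetical.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float, float]  # (T, EAR(T), ER(T))


def working_range(samples: List[Sample],
                  er_min: float) -> Optional[Tuple[float, float]]:
    """Return the smallest and largest sampled T for which all N entries are
    still accepted (EAR(T) = 1) and the recognition rate reaches er_min.
    Returns None if no sampled threshold satisfies both requirements."""
    usable = [t for t, ear, er in samples if ear >= 1.0 and er >= er_min]
    if not usable:
        return None
    return min(usable), max(usable)


# Illustrative (made-up) training measurements.
curve = [(0.1, 1.0, 0.80), (0.2, 1.0, 0.88), (0.3, 1.0, 0.93),
         (0.4, 0.9, 0.95), (0.5, 0.6, 0.97)]
print(working_range(curve, er_min=0.85))
```

The lower bound is set by the required minimum recognition rate, the upper bound by the point at which EAR(T) drops below 1, which matches the two limits described for the inter-threshold value.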

In the speech dialog state, in which speaker-dependent and speaker-independent entries are stored in the dictionary, it is possible to add new user-defined names, that is to say generally user-definable utterances in the dictionary, as a list entry or to delete an old list entry from the dictionary.

In both cases, the microprocessor 110 manages the file that represents the dictionary. After the file has been changed, the speech recognition unit can be activated again for speech recognition.

As long as the speech dialogue continues, the second cycle 503 is repeated, in which cycle the message StartLoadHMMDictionary is provided with the respective identifier, i.e. the ID of the respective defined dictionary, it being noted that the ID remains unchanged for the respective application and cannot be changed at runtime. In order to insert a new entry in the electronic dictionary, this means that the electronic dictionary is not changed for the first step, which is why the third cycle 507 is carried out in order to return to the current speech dialog state.

In order to add a new utterance to the speaker-dependent list in the dictionary, a special dictionary is loaded, namely the phoneme dictionary. The utterance is analyzed using the phonemes stored in the phoneme dictionary.

The file management of the dictionary file, which has saved speaker-dependent list entries, is a task of the microprocessor 110, as stated above.

After recognizing a command which triggers the application to delete an entry in the dictionary or to add an entry to the dictionary, the speech recognizer can be put into the pause state PAUSE 303 or it can be ended (step 505).

Then the file management can be done.

The file management process in which the dictionary is accessed starts by comparing the speaker-dependent spoken utterance with all entries stored in the dictionary, both with the speaker-dependent entries and with the speaker-independent entries.

In the event that an entry is to be deleted from the dictionary, the corresponding entry is searched in the entire dictionary. The file manager is responsible for deleting the entries or changing any entries only in the speaker-dependent list of entries in the dictionary.

In the context of file management, the microprocessor 110 knows the memory size available for the dictionary, that is to say in particular the microprocessor 110 knows at which point in the memory 111 the speaker-dependent list begins and at which it ends.

In the case where the memory 111 is full, or in other words, in the case where there is not enough space in the memory 111 to record a new utterance, the request for a new utterance is rejected.
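The management of the speaker-dependent list can be sketched in Python as follows. The class name `SpeakerDependentList`, the reason strings and the use of an entry count as a stand-in for the memory size available in the memory 111 are assumptions made for illustration only.

```python
from typing import List, Optional


class SpeakerDependentList:
    """Minimal model of the speaker-dependent list in the dictionary file."""

    def __init__(self, capacity: int) -> None:
        # Assumption: capacity is counted in entries, not in bytes.
        self.capacity = capacity
        self.entries: List[str] = []

    def add(self, entry: str) -> Optional[str]:
        """Add an entry. Returns None on success, otherwise the reason
        for the rejection ('exists' or 'full')."""
        if entry in self.entries:
            return "exists"
        if len(self.entries) >= self.capacity:
            return "full"
        self.entries.append(entry)
        return None

    def delete(self, entry: str) -> bool:
        """Remove a single entry and release its slot. Returns False if
        the entry is not in the list."""
        if entry not in self.entries:
            return False
        self.entries.remove(entry)
        return True

    def clear(self) -> None:
        """Delete all speaker-dependent entries."""
        self.entries.clear()
```

Returning a reason instead of raising an error keeps the dialog logic simple: the caller can map each reason to a predefined voice prompt.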

The rejection can be done using special voice prompts, for example predefined and recorded voice prompts for the different cases:

• In the event that an input which was actually intended as a command was not recognized as a word contained in the dictionary, a possible voice prompt to be output to the user is:

"I beg your pardon?"

• In the event that the utterance that is to be added to the dictionary already exists in the dictionary, a possible voice prompt to be output to the user is:

"[Entry] is already available."

• In the event that there is no more storage space available, a possible voice prompt to be output to the user is:

"[List] is full."

The following is an example of such management in a metalanguage in C pseudo code:

StartDialogue () {

    State 1 (Action) {    // Dialogue State 1, basic State
        if (cycle_0 (HMMHypothesisStable (RecResult)) = true) then {
            case (
                RecResult = CD player: Action = goto state_2; do b, a
                RecResult = Tape: Action = goto state_3; do c, a
                RecResult = Radio: Action = goto state_4; do d, a
                RecResult = Telephone: Action = goto state_5; do e, a
                RecResult = cancel: Action = StopHMMSRFlowgraph; do f
            )
        }
    } do (Action)

    State 5 (Action) {    // Dialogue State with SD-HMM
        if (cycle_1 (HMMHypothesisStable (RecResult)) = true) then {
            case (
                RecResult = dial number: Action = goto state_6; do g, a
                RecResult = dial <name>: Action = StopHMMSRFlowgraph; do h, i, o
                RecResult = save name: Action = goto state_7; do g, a
                RecResult = delete <name>: Action = StopHMMSRFlowgraph; do h, j, k
                RecResult = cancel: Action = StopHMMSRFlowgraph; do f
            )
        }
    } do (Action)

    State 6 (Action) {    // digits for dialling
        if (cycle_1 (HMMHypothesisStable (RecResult)) = true) then {
            case (
                RecResult = <digits>: Action = goto state_6 (cycle_2); do l, m, t
                RecResult = dial: Action = StopHMMSRFlowgraph; do n, o
                RecResult = cancel: Action = StopHMMSRFlowgraph; do t
            )
        }
    } do (Action)

    State 7 (Action) {    // digits for storing
        if (cycle_1 (HMMHypothesisStable (RecResult)) = true) then {
            case (
                RecResult = <digits>: Action = goto state_6; do l, m, t
                RecResult = save: Action = goto state_8; do p, a
                RecResult = cancel: Action = StopHMMSRFlowgraph; do f
            )
        }
    } do (Action)

    State 8 (Action) {    // name input for list
        if (cycle_1 (HMMHypothesisStable (RecResult)) = true) then {
            case (
                RecResult = <name>: Action = goto state_9; do q, a, r
            )
        }
    } do (Action)

    State 9 (Action) {    // name input for list
        if (cycle_1 (HMMHypothesisStable (RecResult)) = true) then {
            case (
                RecResult = <name>: Action = StopHMMSRFlowgraph; do u, s
            )
        }
    } do (Action)
}

A first exemplary embodiment for a concrete speech dialogue is explained in more detail below.

Different speech dialog states are specified and the actions of the system are shown, which are carried out after recognizing a defined command.

The voice dialogue according to this exemplary embodiment of the invention is a simplified schematic telephone voice dialogue.

In FIGS. 11A and 11B, the speech prompts defined according to this exemplary embodiment are listed in a first table 1100 (FIG. 11A) and the additional system reactions in a table 1101 (FIG. 11B).

A voice prompt is to be understood as a predefined utterance of the system. It is either a voice utterance of a system administrator that has previously been recorded and is simply reproduced by the computer 108, or a synthesized voice signal that was converted from textual information into a voice signal by means of the computer 108.

The additional system reaction means actions of the system after a special command has been recognized by the speech recognizer.

FIG. 12 shows an HMM state diagram 1200 for a first state 0 in which the HMMs have not yet started.

The state diagram 1200 is to be understood in such a way that, after receipt of the command 1201 StartHMMSRFlowgraph 1202, the action 1203 is carried out, namely the state transition 1204 to the second state and the output 1205 of the speech prompt <a>, that is to say the output of the beep by means of the loudspeaker 121.

FIG. 13A shows an HMM state diagram 1300 for the second state 1 and FIG. 13B the associated flow diagram.

As can be seen from the speech dialog state diagram 1300, the dictionary in this second HMM state 1 has the following entries: "CD player, cassette recorder, radio, telephone, cancel", in other words, the speech recognizer loads in the second speech dialogue state 1 this dictionary and can only recognize the words contained in this electronic dictionary.

If the speech recognizer recognizes the received words as a command contained in the command list in the speech dialog state diagram 1300, a corresponding state transition into a subsequent state is initiated. For reasons of clarity, the speech dialog states 2, 3 and 4 are not explained in more detail.

Thus, without restricting the generality, only the voice dialog for the telephone application is explained in more detail below, that is to say the branch of the voice dialog taken when the "telephone" command has been recognized and the system has thus changed into the sixth voice dialog state 5, as shown in the speech dialog state diagram 1400 in FIG. 14A and the associated sequence diagram 1410 in FIG. 14B.

The dictionary of the sixth speech dialog state 5 has the following terms as basic entries, that is to say as speaker-independent entries: "number, dial, name, save, delete, cancel", and in addition a list of speaker-dependent entries, designated <name> in FIG. 14A.

In the event that the "dial number" command has been entered by a user and recognized by the speech recognizer, the system branches into a seventh speech dialog state 6 and the speech prompts <g> and <a> are output.

The speech dialogue state diagram 1500 for the seventh speech dialogue state 6 is shown in FIG. 15A and the associated flow diagram 1510 in FIG. 15B.

The dictionary in the seventh speech dialog state 6 has the following entries: "zero, one, two, three, four, five, six, seven, eight, nine, dial, cancel", which words can be recognized by the speaker-independent speech recognizer in the seventh speech dialog state 6.

Depending on which command is recognized by the speech recognizer, the system either remains in the seventh speech dialog state 6, for example when digits are recognized, which are temporarily stored, or changes to the first state 0 when the "Dial" command or the "Cancel" command is recognized.

In the event that the "save name" command was recognized in the sixth speech dialog state 5, the system branches to the eighth speech dialog state 7, which is shown in the speech dialog state diagram 1600 in FIG. 16A and the associated flowchart 1610 in FIG. 16B.

For the eighth speech dialog state 7, an electronic dictionary is provided with the following entries: "zero, one, two, three, four, five, six, seven, eight, nine, dial, cancel".

It is further provided that in this state essentially three commands can be recognized and processed: the input of individual digits, whereupon the system remains in the eighth speech dialog state 7 and the respectively recognized digit is temporarily stored; the command "Save", whereupon the system changes to the ninth speech dialog state 8, explained in more detail below, and outputs the voice prompt <p> and the beep <a>; and the command "Cancel", whereupon the system changes to the first speech dialog state 0 and outputs the voice prompt <f> (see FIG. 11A).

The voice dialog state diagram 1700 of the ninth voice dialog state 8 is shown in Fig. 17A and the associated flow diagram 1710 is shown in Fig. 17B.

In the ninth speech dialog state 8, an input name is converted into a sequence of phonemes using the phoneme dictionary, which is the dictionary of the ninth speech dialogue state.

After recognizing the command <name>, the tenth speech dialog state 9 is entered, in which the respective list entry is stored in the electronic dictionary.

The speech dialogue state diagram 1800 for the tenth speech dialogue state 9 and the associated flow diagram 1810 are shown in FIGS. 18A and 18B.

After the name has been stored in the electronic dictionary, the transition to the first speech dialog state 0 takes place.
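The state transitions of this first exemplary embodiment can be summarized as a table-driven state machine. The following Python sketch is an illustration only: the command tokens (for example "save name" and "<digits>") are hypothetical labels, StopHMMSRFlowgraph is modelled as a return to state 0, and speech prompts and file management are omitted.

```python
from typing import Dict, Iterable

# State numbers follow the text: 0 = HMMs not started, 1 = basic state,
# 5 = telephone, 6 = digits for dialling, 7 = digits for storing,
# 8 and 9 = name input for the list.
TRANSITIONS: Dict[int, Dict[str, int]] = {
    1: {"CD player": 2, "Tape": 3, "Radio": 4, "Telephone": 5, "cancel": 0},
    5: {"dial number": 6, "dial <name>": 0, "save name": 7,
        "delete <name>": 0, "cancel": 0},
    6: {"<digits>": 6, "dial": 0, "cancel": 0},
    7: {"<digits>": 7, "save": 8, "cancel": 0},
    8: {"<name>": 9},
    9: {"<name>": 0},
}


def step(state: int, command: str) -> int:
    """Return the next state. Commands that are not in the dictionary of
    the current state leave the state unchanged (self-loop)."""
    return TRANSITIONS.get(state, {}).get(command, state)


def run(commands: Iterable[str], start: int = 1) -> int:
    """Feed a sequence of recognized commands through the state machine."""
    state = start
    for command in commands:
        state = step(state, command)
    return state


# Storing a new number and name: 1 -> 5 -> 7 -> 7 -> 8 -> 9 -> 0
print(run(["Telephone", "save name", "<digits>", "save", "<name>", "<name>"]))
```

Keeping the transitions in a table mirrors the idea that each speech dialog state loads its own dictionary: only the commands listed for the current state can cause a transition.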

A second, simplified exemplary embodiment of the invention is explained in more detail below.

FIG. 19 shows a speech dialog state diagram 1900 of a first speech dialog state 1 according to the second exemplary embodiment of the invention. The dictionary of the first speech dialog state 1 has the following entries: "Name, Save, Delete, Dial, Phonebook, Help, Yes, No, List of Names".

The following commands 1902 are defined in the first speech dialog state 1 and can be recognized by the speech recognizer: "Save name, delete <name>, select <name>, delete phone book, yes, no, help".

The commands are clearly linked to the corresponding actions 1903, which are explained in more detail below.

After recognizing the command “Save name”, the system switches to the second speech dialog state 2 and the phoneme dictionary is loaded. The following voice prompt is also output to the user: “Please speak the name.”

If the command "Delete <name>" is recognized, no voice dialog state transition is carried out; instead, the voice prompt "Do you really want to delete <name>?" is output, and both the name and the command "Delete <name>" are buffered.

If the command "Select <name>" is recognized, the system does not change to another voice dialog state and the following voice prompt is output: "Do you really want to select <name>?" and the information of "<name>" and of "select <name>" are buffered.

After recognizing the command "delete phone book", the first voice dialog state 1 also remains and the voice prompt is output: "Do you really want to delete the phone book?" and the command "delete phone book" is cached.

In the event that the "yes" command is recognized, the first speech dialog state 1 also remains, and the command temporarily stored in the buffer with the associated information is executed and, depending on the respective temporarily stored command, file management is started.

For example, in the event that the "delete <name>" command is cached, the cached "<name>" is deleted from the dictionary, that is, the list of names. The basic dictionary is also reloaded.

If the command "select <name>" is temporarily stored, the sequence of digits belonging to the temporarily stored "<name>" is dialed by the telephone; in other words, the actuator controls the telephone in such a way that a communication connection to the subscriber with the associated telephone number is established.

If the command "delete phone book" is cached, the entire "list", as it is also cached as information, is deleted from the dictionary. The basic dictionary is also reloaded.

If the speech recognizer recognizes the command "No", the system remains in the first speech dialog state 1 and, as actions, the buffer memory is initialized and the basic dictionary 1901 is reloaded.

In the event that the "Help" command has been recognized, all commands available in this speech dialog state are graphically presented to the user for selection.

FIG. 20 shows a speech dialogue state diagram 2000 for a second speech dialogue state 2 according to the second exemplary embodiment of the invention. The available phonemes are stored in the phoneme dictionary 2001, and the user-defined entries <name> can be recognized as possible commands 2002. The action 2003 for a recognized command is to remain in the second speech dialog state 2 and to increase the number of attempts by the value 1 and, in the event that the number of permitted attempts has been exceeded, to change to the first speech dialog state 1.

If no utterance is temporarily stored in the buffer in the second speech dialog state, the first phoneme sequence, that is to say the first utterance, is buffered. However, if an utterance has already been buffered in the buffer, the utterance is buffered as a second utterance in the form of a second phoneme chain.

In the event that two utterances have already been cached, the uttered utterance is cached as a third utterance in the form of a third phoneme chain and file management is then started, which carries out the following method steps:

First, the three cached utterances are matched against one another. In the event that no match of the three buffered utterances is achieved, the request is rejected, the buffer is emptied and the action 2003 of the second speech dialog state is carried out again.

The two best utterances are then compared with the entries contained in the dictionary. If the similarity in the sense of the above description using the inter-threshold value is too great, the request for an additional name in the dictionary is rejected and the buffer is emptied. In this case, the action 2003 of the second speech dialog state is carried out again.

If no match with an existing dictionary entry was found, a check is carried out to determine whether the memory is already fully occupied. If it is, the request for the entry of a new user-defined entry in the dictionary is rejected, and a speech dialog state transition to the first speech dialog state 1 is carried out while the basic dictionary is loaded.

In the event that sufficient free space is available and the above checks have been positive, the desired user-defined entry is added to the dictionary as a list entry. In this case, the best acoustic representation of the entered linguistic utterance is saved as a voice prompt.

Subsequently, a speech dialog state transition takes place to the third speech dialog state 3, explained in more detail below, with simultaneous loading of a digit dictionary 2101, that is to say of the dictionary used in the third speech dialog state 3.

A speech dialog state diagram 2100 of the third speech dialog state is shown in Fig. 21.

The electronic dictionary 2101 of the third speech dialog state 3 has the following entries: "zero, one, two, two, three, four, five, six, seven, eight, nine, save, correction, back, cancel, help".

As commands 2102 are the following chains of commands from the

Speech recognizers recognizable and interpretable:

„<Numeric keypad>, save, correction, back, cancel,

Help". If the command "<number block>" is recognized, the respectively recognized numbers are temporarily stored in the buffer and the speech recognizer remains in the third speech dialog state 3.

If the "Save" command is recognized, the content of the buffer memory is saved in the list of telephone numbers and a speech dialog state transition takes place to the first speech dialog state 1 with simultaneous loading of the basic dictionary.

When the command "Correction" is recognized, the last recognized block of digits that is temporarily stored in the buffer memory is deleted as action 2103. Furthermore, the speech recognizer remains in the third speech dialog state 3.

When the command "Back" is recognized, the speech recognizer also remains in the third speech dialog state 3 and the last digit temporarily stored in the buffer memory is deleted.

When the "Cancel" command is recognized, the system switches to the first speech dialog state 1 while simultaneously loading the basic dictionary.

When the "Help" command is recognized, the user is again presented with all of the commands available in the third speech dialog state 3.
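The command handling of the third speech dialog state can be illustrated with a small state machine. The sketch below is only an assumption-laden illustration: the class name `DigitDialog`, its attributes and the decision to clear the buffer on "save" and "cancel" are hypothetical and not taken from the description; the digit words follow the entries of the digit dictionary 2101.

```python
from typing import List, Optional

# Spoken digit words of the digit dictionary mapped to digit characters.
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}
COMMANDS = "<number block>, save, correction, back, cancel, help"


class DigitDialog:
    """Hypothetical model of the third speech dialog state."""

    def __init__(self) -> None:
        self.blocks: List[str] = []      # buffer: one string per recognized digit block
        self.phone_list: List[str] = []  # list of stored telephone numbers
        self.state = 3

    def handle(self, words: List[str]) -> Optional[str]:
        """Process one recognized utterance given as a list of dictionary words."""
        if words and all(w in DIGITS for w in words):
            self.blocks.append("".join(DIGITS[w] for w in words))  # stay in state 3
        elif words == ["save"]:
            self.phone_list.append("".join(self.blocks))
            self.blocks.clear()          # assumption: buffer is emptied after saving
            self.state = 1
        elif words == ["correction"]:
            if self.blocks:
                self.blocks.pop()        # delete the last recognized digit block
        elif words == ["back"]:
            if self.blocks:
                self.blocks[-1] = self.blocks[-1][:-1]  # delete the last digit
                if not self.blocks[-1]:
                    self.blocks.pop()
        elif words == ["cancel"]:
            self.blocks.clear()          # assumption: buffer is discarded on cancel
            self.state = 1
        elif words == ["help"]:
            return COMMANDS              # present all available commands again
        return None
```

Storing the buffer as a list of blocks rather than one flat string makes the difference between "correction" (whole block) and "back" (single digit) explicit.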

FIG. 22 shows a mobile radio telephone device 2200 in which the speech recognition device 100 shown in FIG. 1 is integrated. Furthermore, a PDA (Personal Digital Assistant) can be integrated in the mobile radio telephone device 2200, as well as further telecommunication functions, for example the sending and/or receiving of fax messages, SMS messages (Short Message Service messages) or MMS messages (Multimedia Messaging Service messages). The mobile radio telephone device 2200 can also be expanded by additional multimedia functionalities; for example, a camera can be integrated in the mobile radio telephone device 2200.

Fig. 23 shows a car radio 2300 in which (shown symbolically in Fig. 23) a large number of different components are integrated, for example a navigation system 2301, a CD player 2302, a cassette recorder 2303, a radio 2304, a telephone device with hands-free system 2305 and the speech recognition device 100 as shown in Fig. 1. Information can be exchanged between the user and the car radio 2300 both by means of the speech recognition device 100 and via a screen 2306.

The speech recognition device 100 is particularly well suited for controlling a system that provides a large number of different functionalities, such as a car radio 2300 equipped with a multitude of different functions, since an arbitrarily complicated voice dialog structure can be set up and implemented very flexibly and independently of the speaker.

The distance calculations set out above use the method described below, which is a variant of the Levenshtein distance. These calculations arise in particular when the respective electronic dictionary is supplemented. First, they are used to check whether the respective phonetic transcription, i.e. the respective sequence of phonemes, is sufficiently representative of the current name entry, in other words whether the respective sequences of spoken units, i.e. the phoneme sequences, are sufficiently similar to one another. Second, they are used to check whether the phoneme sequence provided for supplementing an electronic dictionary differs sufficiently from the phoneme sequences already stored in the electronic dictionary.

As explained above, the Levenshtein distance is used to determine the distance between two phoneme sequences, supplemented according to the invention by the articulation feature vectors as part of the cost function.

According to the Levenshtein method using the Levenshtein distance, two sequences of spoken units, generally symbol sequences, are compared with one another in such a way that a first symbol sequence is mapped onto a second symbol sequence by means of insertions (q), omissions (r) and exchanges (c) of individual elements.

The costs arising in the context of this mapping are described by means of a cost function and can be viewed as the distance between the two symbol chains to be compared with one another, in this exemplary embodiment the two phoneme sequences.
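For orientation, the conventional Levenshtein distance with unit costs can be sketched as follows. The function name `levenshtein` and the choice of a fixed cost of 1 for insertion, omission and exchange are assumptions made for this illustration.

```python
from typing import Sequence


def levenshtein(first: Sequence[str], second: Sequence[str]) -> int:
    """Minimum number of insertions, omissions and exchanges mapping first onto second."""
    rows, cols = len(first), len(second)
    # d[i][j] holds the cost of mapping the first i symbols onto the first j symbols.
    d = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        d[i][0] = i  # i omissions
    for j in range(1, cols + 1):
        d[0][j] = j  # j insertions
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            exchange = 0 if first[i - 1] == second[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + exchange,  # exchange (or match)
                d[i][j - 1] + 1,             # insertion
                d[i - 1][j] + 1,             # omission
            )
    return d[rows][cols]
```

Because the function accepts any sequence of symbols, it works for strings as well as for lists of phoneme labels.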

At the start of the method, a large number of articulation feature vectors are determined, one articulation feature vector each being assigned to a phoneme and stored in the memory.

The articulation feature vectors are formed according to the method described in [3] and are summarized in the following table for the phonemes of the German language, a pause phoneme being included in the set of phonemes provided for describing the German language.

According to the invention, use is made of the knowledge that the articulation settings for speech production are closely related to the expression of the physical characteristics of a phoneme.

As shown in [3], experimental studies on natural language databases for different speakers have shown that the phonemes of a speaker's word base can be assigned to different categories of articulatory settings according to the expression of their physical characteristics.

According to the invention, the following are provided as elements (components) of the articulation feature vector to take articulatory settings into account:

• the jaw position of a person,

• the articulation location of the respective phoneme,

• the rounding of a person's lips,

• the nasality.

Each feature vector component is assigned a value that describes the respective category of the articulatory setting.

Starting from a neutral articulatory position of the organs producing the phoneme, deviations from this are represented by means of positive or negative numbers.

According to the invention, four categories are observed for the jaw position, in other words for the jaw opening, to which the following numerical values are assigned:

• closed jaw position → 0,

• semi-closed jaw position → 1,

• semi-open jaw position → 2, and

• open jaw position → 3.

The following categories are provided in the Articulation Location characteristic component:

• Palatal → 1,

• Velar → 0,

• Uvular → -1, and

• Pharyngeal → -2.

The lip rounding is provided as a further feature vector component, to which the following categories are assigned: wide lip position → -1, indifferent lip position → 0, rounded lip position → 1 and protruded lip position → 2.
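The categories listed above can be captured in a small data structure. The following sketch is an illustration only: the type and member names are hypothetical, the value range of the nasality component is not stated in the text and is therefore left as a plain integer, and the example values are those the description gives for the phoneme "U".

```python
from enum import IntEnum
from typing import NamedTuple


class Jaw(IntEnum):
    CLOSED = 0
    SEMI_CLOSED = 1
    SEMI_OPEN = 2
    OPEN = 3


class Place(IntEnum):
    PALATAL = 1
    VELAR = 0
    UVULAR = -1
    PHARYNGEAL = -2


class Lips(IntEnum):
    WIDE = -1
    INDIFFERENT = 0
    ROUNDED = 1
    PROTRUDED = 2  # "pre-turned" lip position in the source wording


class ArticulationVector(NamedTuple):
    jaw: int
    place: int
    lips: int
    nasality: int  # assumption: integer category, range not given in the text


# Values the description gives for the phoneme "U": (1, -1, 2, 0).
m_u = ArticulationVector(Jaw.SEMI_CLOSED, Place.UVULAR, Lips.PROTRUDED, 0)
```

Using integer-valued enumerations keeps the vector directly usable in numerical distance calculations while documenting what each value means.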

Most vowels show a stable setting of the articulatory positions for all descriptive features, since these have to be achieved during the production process so that the correct vowel is produced.

Consonants, on the other hand, have free settings in which the position of the articulation organs can vary slightly. These enable the transition from a previous, stable articulatory position to a subsequent, stable articulatory position with the least articulatory effort. This process of overlapping articulatory gestures under the influence of preceding or following phonemes is also called coarticulation.

The specified characteristic values of the consonants vary around the mean value given in brackets.

The articulation feature vectors m(i), which thus have "jaw position", "articulation location", "lip rounding" and "nasality" as articulation feature vector components, are taken into account in the calculation of the Levenshtein distance, as explained in more detail below.

For example, the articulation feature vector m(U) of the phoneme "U" has the structure:

m(U) = (1, -1, 2, 0).

Within the general Levenshtein distance, an insertion variable q is first assigned the value 1 and an omission variable r is also assigned the value 1.

The procedure according to the invention is different. In the modified Levenshtein method, a local distance is determined for each spoken unit, i.e. according to this exemplary embodiment for each phoneme of the first phoneme sequence and for each phoneme of the second phoneme sequence, starting from the first phoneme of the first phoneme sequence and the corresponding first phoneme of the second phoneme sequence.

The method of the Levenshtein distance according to this exemplary embodiment is shown below in the form of a pseudo code:

    q = 1, r = 1
    for i = 1 : I
        for j = 1 : J
            d(i, j) = min[ d(i-1, j-1) + c(i, j),
                           d(i, j-1) + q,
                           d(i-1, j) + r ]
        end
    end
    dNorm = d(I, J) / len_max

The cost value c(i, j) for two phonetic units results directly from the magnitude of the difference between the articulation feature vectors m_i and m_j of the phonemes i and j; in other words, the costs are determined in accordance with the following rule:

c(i, j) = || m_i - m_j ||.

This calculation rule thus replaces the cost calculation for the exchange of phonemes in the Levenshtein algorithm known per se, as described above.
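The exchange cost can be written as a one-line function. The sketch assumes that the norm is the Euclidean length of the difference vector, which the text does not state explicitly; the function name `exchange_cost` is hypothetical.

```python
import math
from typing import Sequence


def exchange_cost(m_i: Sequence[float], m_j: Sequence[float]) -> float:
    """Cost of exchanging two phonemes: length of the difference of their feature vectors.

    Assumption: Euclidean norm. math.dist raises ValueError if the
    two vectors have different numbers of components.
    """
    return math.dist(m_i, m_j)
```

Two phonemes with identical articulation feature vectors therefore have an exchange cost of zero, regardless of their symbols.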

If one also takes into account the variances of the individual articulation feature vector components, the comparability of vectors whose components scatter differently can be ensured by means of an additional normalization of variance.

This modified expression then corresponds to the variance normalized length of the difference vector.
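One plausible reading of this variance normalization is sketched below. The exact expression is not reproduced in this text, so the formula used here, in which each squared component difference is divided by the variance of that component, is an assumption, as are the function name `normalized_cost` and the `variances` parameter.

```python
import math
from typing import Sequence


def normalized_cost(
    m_i: Sequence[float],
    m_j: Sequence[float],
    variances: Sequence[float],
) -> float:
    """Variance-normalized length of the difference vector (assumed form)."""
    if not (len(m_i) == len(m_j) == len(variances)):
        raise ValueError("vectors and variances must have the same length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    # Components that scatter widely contribute less to the distance.
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(m_i, m_j, variances)))
```

With all variances equal to 1, this reduces to the plain Euclidean length of the difference vector.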

In contrast to the fixed costs that arise when exchanging any phonemes, this type of cost calculation according to the invention allows a more differentiated statement about the similarity of phonemes than is possible with the Levenshtein method according to the prior art.

The expression len_max corresponds to the number of phonemes that belong to the longer phoneme sequence.

To calculate the Levenshtein distance, both phoneme sequences are processed successively, phoneme by phoneme, generally symbol by symbol, and a cost matrix is determined. Then the path with the lowest local costs is determined within the cost matrix and the costs along this path are added. The result of the summation is the Levenshtein distance.

The intermediate results are stored in the matrix d(i, j), the rows of which are assigned to the symbols of the first symbol sequence, i.e. the phonemes of the first phoneme sequence, and the columns of which are assigned to the symbols of the second symbol sequence, i.e. the phonemes of the second phoneme sequence. The start row and the start column contain the values of the so-called recursion basis. After the calculations have been completed, the total distance d(I, J) is found in the bottom right corner of the distance matrix.
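The complete procedure, including the distance matrix and the normalization by len_max, can be sketched as follows. Several points are assumptions made for this illustration: the function name `articulatory_distance`, the Euclidean exchange cost, the recursion basis (accumulated omission and insertion costs in the start column and start row), and in particular the feature vectors in the usage example, which are hypothetical values and not the entries of the table referred to in the description.

```python
import math
from typing import Dict, Sequence, Tuple

Vector = Tuple[float, float, float, float]


def articulatory_distance(
    first: Sequence[str],
    second: Sequence[str],
    vectors: Dict[str, Vector],
    q: float = 1.0,  # insertion cost
    r: float = 1.0,  # omission cost
) -> float:
    """Normalized Levenshtein distance with articulatory exchange costs."""
    rows, cols = len(first), len(second)
    len_max = max(rows, cols)
    if len_max == 0:
        return 0.0
    d = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    # Recursion basis (assumption): accumulated omission / insertion costs.
    for i in range(1, rows + 1):
        d[i][0] = d[i - 1][0] + r
    for j in range(1, cols + 1):
        d[0][j] = d[0][j - 1] + q
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            c = math.dist(vectors[first[i - 1]], vectors[second[j - 1]])
            d[i][j] = min(d[i - 1][j - 1] + c, d[i][j - 1] + q, d[i - 1][j] + r)
    # The total distance sits in the bottom right corner of the matrix.
    return d[rows][cols] / len_max


# Hypothetical feature vectors (jaw, location, lips, nasality) for illustration only.
example_vectors: Dict[str, Vector] = {
    "g": (1, 0, 0, 0),
    "e:": (1, 1, -1, 0),
    "z": (0, 1, 0, 0),
    "s": (0, 1, 0, 0),  # same articulatory setting as "z" in this made-up table
    "a:": (3, 0, 0, 0),
}
similar = articulatory_distance(
    ["g", "e:", "z", "a:"], ["g", "e:", "s", "a:"], example_vectors
)
```

In this made-up table the two sequences differ only in a pair of phonemes that share one feature vector, so `similar` evaluates to 0.0.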

For further illustration, the following tables show a comparison for calculating the similarity of two phoneme sequences, on the one hand according to the prior art, and on the other hand according to the inventive method for improved calculation of the Levenshtein distance.

The first phoneme sequence is the sequence of the phonemes (g e: z a:) and the second phoneme sequence is (g e: s a:).

The following table shows the distance calculated according to the prior art, i.e. according to the usual method for calculating the Levenshtein distance. As can be seen in the table, i.e. the distance matrix, the result is a normalized Levenshtein distance d_norm of 1/2:

According to the new method, there is a distance value of 0, since the similarity in the articulatory properties of the respective phonemes is now taken into account and the two phoneme sequences sounding very similar are also classified as very similar:

In an alternative embodiment of the invention, individual articulation features are weighted differently and different metrics are used to calculate the distance, for example a Euclidean metric or a Mahalanobis distance. In this way, the articulatorily modified Levenshtein distance can subsequently be adapted to the system properties of the respective target application and to the classification target.
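A feature-weighted variant of the exchange cost might look as follows. The weighting scheme shown here (one non-negative weight per feature inside a Euclidean metric) is an assumption; the text does not specify how the weights enter the calculation, and a Mahalanobis distance would additionally require a covariance matrix that is not given.

```python
import math
from typing import Sequence


def weighted_cost(
    m_i: Sequence[float],
    m_j: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Weighted Euclidean distance between two articulation feature vectors."""
    if not (len(m_i) == len(m_j) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    # A weight of 0 switches a feature off; larger weights emphasize it.
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(m_i, m_j, weights)))
```

Choosing the weights per target application is one way to adapt the distance to a particular classification target.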

In illustrative terms, the invention provides a novel method for determining a quantitative measure of similarity between two given phoneme sequences, which can be used for the first time in a phoneme-based speech recognizer as a measure for the simple calculation and assessment of the similarity between two recognition results, i.e. two phoneme sequences. Supplementing the Levenshtein distance used for the similarity calculation with the novel evaluation method according to the invention enables differentiated statements about the similarity of two phoneme sequences for the first time.

The following publications are cited in this document:

[1] John Nerbonne and Wilbert Heeringa, Measuring Dialect Distance Phonetically, in: John Coleman (ed.), Workshop on Computational Phonology, Special Interest Group of the Association for Computational Linguistics, Madrid, 1997, pp. 11-18

[2] J. Nerbonne et al, Phonetic Distance between Dutch Dialects, Proceedings of CLIN '95, pp. 185-202, Antwerp, 1995

[3] D. Hirschfeld, Comparing static and dynamic features for segmental cost function calculation in concatenative speech synthesis, ICSLP, Beijing, 2000

[4] DE 44 33 366 A1

[5] DE 43 17 372 A1

LIST OF REFERENCE NUMBERS

100 speech recognition device

101 voice signal

102 microphone

103 Recorded analog voice signal

104 preprocessing

105 Preprocessed voice signal

106 analog / digital converter

107 Digital voice signal

108 computer

109 input interface

110 microprocessor

111 memory

112 output interface

113 computer bus

114 Electronic dictionary

115 keyboard

116 computer mouse

117 cables

118 cables

119 radio connection

120 radio connection

121 speakers

122 actuator

123 DSP

200 speaker-independent dictionary

201 basic entry

202 Speaker dependent dictionary

203 Entry memory-dependent dictionary

204 Common dictionary

300 speech recognizer state diagram

301 initialization state

302 stop state

303 pause state

304 operating mode state

305 step

306 method step

307 method step

308 method step

309 step

310 step

311 procedural step

312 procedural step

313 procedural step

314 procedural step

400 speech dialog state diagram

401 temporally preceding speech dialog state

402 Speech dialog state X

403 dictionary

404 word

405 language-dependent list entry

406 command

407 command request

408 action

409 request for action

500 flowchart

501 start state

502 First cycle

503 Second cycle

504 First test step

505 end state

506 Second test step

507 Third test step

508 Third cycle

509 Fourth test step

600 message flow diagram

601 message StartHMMSRFlowgraph

602 confirmation message

603 InitHMMSRParams message

604 confirmation message

605 Message StartLoadHMMLDA

606 Message SMBRequestLoadHMMLDA

607 confirmation message

608 Message StartLoadHMMDictionary

609 Message SMBRequestLoadHMMDictionary

610 confirmation message

611 Message SMBRequestCodebookBlock

612 CodebookBlockLoadedAndSwitched message

613 Message SMBRequestCodebookBlock

614 CodebookBlockLoadedAndSwitched message

615 Message StartHMMSR

616 confirmation message

617 Message SMBRequestCodebookBlock

618 CodebookBlockLoadedAndSwitched message

619 message HMMHypothesisStable

620 confirmation message

621 Message pause HMMSR

622 confirmation message

700 message flow diagram

701 message SetHMMSearchParams

702 confirmation message

703 Message StartLoadHMMDictionary

704 Message SMBRequestLoadHMMDictionary

705 confirmation message

706 message StartHMMSR

707 confirmation message

708 Message SMBRequestCodebookBlock

709 Message CodebookBlockLoadedAndSwitched

710 Message HMMHypothesisStable

711 confirmation message

712 Message break HMMSR

713 confirmation message

800 message flow diagram

801 message StartHMMSR

802 confirmation message

803 Message SMBRequestCodebookBlock

804 Message CodebookBlockLoadedAndSwitched

805 Message HMMHypothesisStable

806 confirmation message

807 Message break HMMSR

808 confirmation message

900 functional diagram

1000 function diagram

1100 "Language Prompts" Table

1101 Table "Additional system reactions"

1200 speech dialog state diagram

1201 command

1202 Command request

1203 action

1204 Action request

1205 Action request

1300 Speech dialog state diagram

1310 Sequence diagram

1400 Speech dialog state diagram

1410 Flow diagram

1500 Speech dialog state diagram

1510 Flow diagram

1600 Speech dialog state diagram

1610 Flow diagram

1700 Speech dialog state diagram

1710 Flow diagram

1800 Speech dialog state diagram

1810 Flow diagram

1900 Speech dialog state diagram

1901 dictionary

1902 command list

1903 action list

2000 speech dialog state diagram

2001 interrogation dictionary

2002 command list

2003 action list

2100 speech dialog state diagram

2101 number dictionary

2102 command list

2103 Action list

2200 mobile radio telephone

2300 car radio

2301 navigation system

2302 CD player

2303 radio

2304 tape recorder

2305 phone

2306 screen

Claims

claims

1. Method for computer-aided comparison of a first sequence of symbol representations of a spoken utterance with a second sequence of symbol representations of a spoken utterance,

• in which each symbol representation is assigned an articulation feature vector which contains articulatory properties of the symbol representations and/or physical properties of the symbol representations,

• in which each symbol representation of the first sequence of symbol representations is mapped onto a corresponding symbol representation of the second sequence of symbol representations,

• in which the numerical distance between two units is calculated for each pair of symbol representations as a function of the articulation feature vectors, and these distances are arranged in the form of a cost matrix.

2. The method according to claim 1, in which the optimal path is determined in the cost matrix according to the Levenshtein method.

3. The method according to claim 1 or 2, in which phonemes are used as phonetic units.

4. The method according to any one of claims 1 to 3, in which the articulation feature vectors are formed as a function of different articulation settings of an organ producing a spoken unit.

5. The method according to any one of claims 1 to 4, wherein a respective articulation feature vector contains one of the following features:

The jaw position of a person, the place of articulation, the rounding of the lips of a person, the nasality, the glottalization, or the pitch.

6. The method according to claim 5, in which a variance of at least one of the features of the articulation feature vectors is taken into account in the context of the mapping.

7. The method according to any one of claims 1 to 6, used in one of the following areas:

• speech synthesis,

• speech recognition, or

• automatic generation of an electronic dictionary.

8. Device for computer-aided comparison of a first sequence of symbol representations of a spoken utterance with a second sequence of symbol representations of a spoken utterance, with a processor unit, which is set up in such a way that the following method steps can be carried out:

• each symbol representation is assigned an articulation feature vector which contains articulatory properties of the symbol representations and/or physical properties of the symbol representations,

• each symbol representation of the first sequence of symbol representations is mapped onto a corresponding symbol representation of the second sequence of symbol representations,

• the numerical distance between two units is calculated for each pair of symbol representations as a function of the articulation feature vectors, and these distances are arranged in the form of a cost matrix.

9. Speech recognition device with a device according to claim 8.

10. Speech synthesis device with a device according to claim 8.