WO2005034087A1

WO2005034087A1 - Selection of a voice recognition model for voice recognition

Info

Publication number: WO2005034087A1
Application number: PCT/EP2004/050645
Authority: WO
Inventors: Sorel Stan
Original assignee: Siemens Aktiengesellschaft
Priority date: 2003-09-29
Filing date: 2004-04-29
Publication date: 2005-04-14

Abstract

The invention relates to a method for selecting a voice recognition model (5) for recognising a voice contained in a voice signal. According to the invention, a user profile associated with the voice signal is received, which specifies a group of values of parameters of the voice signal that are relevant for voice recognition; the group contained in the user profile is compared to groups of parameter values of a plurality of predetermined voice recognition models (5); the voice recognition model (5) whose group of parameter values best corresponds to the group of the user profile is selected, and voice recognition is carried out by means of said voice recognition model (5).

Description

Selection of a speech recognition model for speech recognition

The invention relates to a method for selecting a speech recognition model for recognizing speech contained in a speech signal, and to a speech recognition unit and a terminal for recording the speech signal.

Automatic speech recognition requires a set of components or resources that must be available to a speech recognition unit to enable it to identify words in a sequence of feature vectors generated from a speech signal. In addition to a vocabulary of the language to be recognized, these resources include a so-called acoustic model, which specifies the probability of the observed feature vectors for a given sequence of words selected from the vocabulary, and a so-called language model, which specifies the probability of sequences of individual words in the language to be recognized. These resources work together to determine the word sequence that is assessed as the most probable for a received sequence of feature vectors.
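The interaction of the acoustic model and the language model described above is commonly written as a maximization over word sequences. The formula below is the usual textbook formulation, given here only as an illustration; the symbols W (word sequence) and X (sequence of feature vectors) are notation assumed for this sketch and do not appear in the original text.

```latex
\hat{W} = \arg\max_{W} \; P(X \mid W) \, P(W)
```

Here P(X | W) is supplied by the acoustic model and P(W) by the language model.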

A complete set of the resources that enable a speech recognition unit to convert spoken language into text is referred to below as a "speech recognition model", regardless of whether it uses the components listed above or others serving the same purpose.

Since the sound of the phonemes of a language differs individually from speaker to speaker, an acoustic model that is supposed to be suitable for recognizing the speech of different speakers must describe a “mean” or “typical” sound of the phonemes. In order to create such a model, the speech of speakers of different sexes and different ages can be recorded with different background noise levels, and an averaged acoustic model can be created from these recordings. Obviously, the recognition of the phonemes in the speech of a given speaker becomes more unreliable the further his speech differs from the acoustic model. In addition, speakers with different regional dialects sometimes pronounce the same words with different phonemes, e.g. a phoneme of the standard language is replaced by another in a dialect, swallowed, or replaced by a combination of two phonemes.

It is therefore hardly possible with a single speech recognition model to satisfactorily process speech signals from any speaker of a language.

If the speech of only one person or a small number of people needs to be recognized, the speech recognition model can be refined individually for each person. In known speech recognition programs for the PC, an individual speech recognition model is thus created for each individual user of the program. The program starts from a given basic speech recognition model set for an average speaker. During its interaction with a person, the program refines this basic speech recognition model and adapts it better and better to this person. This creates an individual speech recognition model for each user of the program. From this plurality of speech recognition models, the program then selects the speech recognition model individually created for the person who is currently using the program. For this purpose, the person logs on to the program with a user name, and the program accesses the individual speech recognition model assigned to this user name.

Difficulties with automatic speech recognition are not only due to the different ways of speaking of different speakers. In addition to variability in the person of the speaker, background noises are the main source of recognition errors. With the individual speech recognition models of the application programs mentioned above, different speakers are excluded as a source of error, but a high background noise level still remains as a possible source of error for the speech recognition. A satisfactory solution to this problem is not known. PC speech recognition programs are able to adapt the speech recognition model used for a specific user dynamically and thus gradually improve the recognition accuracy when a model originally adapted to a specific background noise spectrum is used over a long period of time in an environment with a changed noise spectrum. As a result, however, the adaptation to the previous noise spectrum is lost over time, so that when the previous noise spectrum is restored, the recognition rate deteriorates and the model needs to be re-trained.

There are numerous applications for automatic speech recognition which must be able to recognize speech in speech signals from a large number of different speakers with a high degree of certainty and without prior training, where these speech signals may differ not only in the sound of the speech of the different speakers but also in the strength and type of the secondary noise contained in them, which complicates the recognition. Examples of such applications are automated information systems, for example for telephone numbers, timetables or the like, which must be able to understand a question from a user in order to answer it.

The object of the invention is to provide a method and devices for carrying out the method which enable accurate speech recognition in speech signals from a large number of different speakers, possibly with different secondary noises contained in the speech signal. This object is achieved by the method according to claim 1, the speech recognition unit according to claim 15 and the terminal according to claim 16.

To carry out speech recognition, the method according to the invention selects from the predefined speech recognition models the one that best matches the user profile of any speaker whose speech is to be recognized. It can therefore happen that, with changing environmental conditions of the speaker, for example different or differently loud background noise, a different speech recognition model is selected each time the speech of one and the same speaker is recognized. Also, no speech recognition model is generated for a speaker. A speech recognition unit using the method according to the invention thus differs from the PC speech recognition programs described above, in which a speech recognition model generated specifically for the user is selected on the basis of his user name, the speech recognition model being assigned to this user from the outset. Because the assignment of one of the speech recognition models to a speaker whose speech is to be recognized is determined anew each time in the method according to the invention, the method makes speech recognition flexible for arbitrary speakers and for any number of them. The specific selection of the most suitable speech recognition model leads to accurate speech recognition with a minimized error rate. In the invention, the comparison of the set of parameter values transferred as the user profile with the sets of parameter values of the individual speech recognition models can be carried out very quickly and without great computational effort.

Particularly preferably, at least one of the parameters designates a language-inherent feature of the speech signal. Such language-inherent features can be, for example, an age group or a gender of the speaker. Very particularly preferably, however, the language-inherent feature is a national language of the speaker. This reveals another particular strength of the method according to the invention, because it makes it possible to implement a multilingual system for automatic speech recognition. If the system is an automatic information system to which more than one person connects via a telephone network, foreign persons, for example with foreign mobile phones, can use the information system in their respective national language, provided that the information system has a speech recognition model for this national language. Furthermore, a language-inherent feature can also be used to differentiate between regional dialects of a language in order to reduce sources of error that can arise from a speaker's accent.

If the terminal used to record the voice signal has a language-oriented user interface, e.g. a menu, and the language used by this interface can be selected by the user from several languages, the language selected for the interface can advantageously be adopted as the national language of the user in the user profile stored in the terminal.

Likewise particularly preferably, at least one of the parameters denotes an environment-inherent feature of the speech signal, which is, in particular, a background noise level. However, a parameter value can also specify a type of environment in which the speaker is located. For example, the parameter value can be used to distinguish between the environments “street”, “interior of the building” or “interior of the vehicle”, in order to be able to more easily identify secondary noises contained in a speech signal and distinguish them from the spoken language.

The method according to the invention is particularly suitable for a system in which the voice signal is picked up by a terminal and transmitted via a data network, in particular a telephone network, to a speech recognition unit which carries out the speech recognition. The speech recognition unit can, as already mentioned above, be an automatic information system. Another possibility is a dictation system, such as a voice-controlled short message generator, which converts the voice signal received from a user into a text message and sends it in a suitable format supported by the respective telephone network, for example as an SMS message, to a recipient specified by the user.

For example, a mobile phone can be used as the terminal. Some known mobile telephones can be set manually, for the purpose of noise suppression, to various background noise levels such as “normal”, “quiet” or “loud”. This setting can be used advantageously to determine the value of a parameter that designates a background noise level as an environment-inherent feature of the speech signal.

In the simplest case, a voice signal picked up by the terminal could be transmitted to the speech recognition unit in the same format as, for example, to another telephone in the data network. However, since telephone networks generally do not have the bandwidth required for the faithful reproduction of the voice signal, it is preferred to preprocess the voice signal at the terminal into a sequence of feature vectors, whose amount of data is smaller than that of the digitized voice signal from which they have been derived and which can be transmitted digitally in the data network without loss of quality.

The terminal can be designed in such a way that the user profile is transmitted from the terminal to the speech recognition unit. In such a case, a parameter of the user profile can be determined by the terminal. This can be done, for example, by a frequency analysis of the speech, which the terminal performs in order to obtain the parameter values, for example for the parameters “age group” and “gender”. Some of the parameter values can also be recognized by the terminal on the basis of its operating mode. If, for example, the user profile has a parameter whose values specify a type of environment in which the speaker can be located, and one of these environments is a vehicle interior, a hands-free device to which a mobile phone serving as the terminal is connected can advantageously be used to identify the type of environment. The vehicle interior is easily recognized as the speaker's environment from the fact that the mobile radio telephone is connected to the hands-free device. The mobile radio telephone or terminal sends this information to the speech recognition unit by means of a correspondingly set parameter value in the user profile.

Typically, mobile telephones have a language-oriented user interface, usually in the form of a screen, on which the options available for operating and configuring the device are shown in text form and offered to the user for selection. If the language of such an interface can be set by the user, it is likely that the language set is the one he speaks most, so that the terminal advantageously enters the language selected for the user interface into the user profile as the national language of the user.

The user profile can be transferred to the speech recognition unit before the speech recognition begins. For example, the terminal can transfer the user profile to the speech recognition unit when a connection between the two is established. The speech recognition unit then selects the speech recognition model at the beginning of the speech recognition and carries out the speech recognition with this one speech recognition model.

However, it is advantageous if the user profile is transmitted repeatedly during the speech recognition. This gives the terminal the possibility of updating the user profile if necessary, in particular adapting it to changing environmental conditions, and transferring it to the speech recognition unit. Using the repeatedly transmitted user profile, the speech recognition unit can continuously check whether the speech recognition model it has chosen is still appropriate and, if necessary, replace it with another one. In this way, changing environmental conditions, such as a changing background noise level, can be taken into account. The speech recognition is always carried out with an up-to-date speech recognition model, so that the error rate of the speech recognition can be reduced further. The repeated transmission of the user profile can also be used, in the event of a handover of a mobile radio telephone to another mobile radio network, to inform a speech recognition unit of this other network about the speech recognition model to be used.

The invention is explained in more detail below on the basis of an exemplary embodiment. It shows:

Fig. 1 is a schematic representation of a system for carrying out the method according to the invention.

A speech recognition unit 1 is connected via a telephone network to a mobile radio base station 6, which has an antenna 2. The base station 6 is connected via the antenna 2 and a radio link 3 to a terminal 4, which is a mobile radio telephone.

The speech recognition unit 1 has a plurality of different speech recognition models 5 available for selection, each of which is adapted to different forms of certain features of a speech signal that may be inherent in the speech of a speaker or in the environment in which the speaker speaks, more precisely in its background noise. In the present case, the language-inherent features taken into account by the speech recognition models 5 include the gender of the speaker, an age group of the speaker and a national language spoken by the speaker. Each speech recognition model 5 has access to a vocabulary of its national language. With regard to the environment-inherent features, the speech recognition models 5 are set to different types of environment in which the speaker can be located, as well as to different levels of background noise in that environment.

Each of these features is designated by a parameter from a predetermined set of parameters, each of which can assume certain parameter values. In the present case, the set comprises five parameters P1, P2, P3, P4, P5. Depending on the value that it takes, the parameter P1 denotes the speaker's gender. In all speech recognition models 5 which are provided for a female speaker, the parameter P1 has the same parameter value, and in all speech recognition models 5 which are provided for male speakers, the parameter P1 assumes the corresponding parameter value for male speakers. The parameter P2 differentiates between predefined age groups to which the speaker can belong. Depending on which of these age groups is assigned to the speaker for whom a speech recognition model 5 is set, the parameter P2 of this speech recognition model 5 assumes the corresponding parameter value.

In the same way, the parameter value belonging to parameter P3 is used to differentiate between three different types of environment in which the speaker can be located, namely between a vehicle interior, a building interior, and a street.

Likewise, parameter P4 designates a level of background noise in the vicinity of the speaker. A distinction is made between a normal noise level, a low noise level and a loud noise level.

Finally, parameter P5 is provided, which e.g. can take five different parameter values, each of which stands for a different language. In the present case, the parameter values are used to differentiate between the national languages German, English, French, Italian and Spanish.
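The five parameters P1 to P5 described above can be pictured as a small record type. The following is only a minimal sketch: the class and member names are hypothetical, the numeric codes are arbitrary, and the two age groups are placeholders, since the text does not say which age groups are predefined.

```python
from dataclasses import dataclass
from enum import Enum


class Gender(Enum):            # P1
    FEMALE = 0
    MALE = 1


class AgeGroup(Enum):          # P2 (placeholder groups; not enumerated in the text)
    GROUP_A = 0
    GROUP_B = 1


class Environment(Enum):       # P3: three types of environment
    VEHICLE_INTERIOR = 0
    BUILDING_INTERIOR = 1
    STREET = 2


class NoiseLevel(Enum):        # P4: three background noise levels
    NORMAL = 0
    LOW = 1
    LOUD = 2


class Language(Enum):          # P5: five national languages
    GERMAN = 0
    ENGLISH = 1
    FRENCH = 2
    ITALIAN = 3
    SPANISH = 4


@dataclass(frozen=True)
class ParameterSet:
    """One value per parameter P1..P5, usable for a model or a user profile."""
    gender: Gender
    age_group: AgeGroup
    environment: Environment
    noise_level: NoiseLevel
    language: Language

    def as_tuple(self):
        # Order follows P1, P2, P3, P4, P5.
        return (self.gender, self.age_group, self.environment,
                self.noise_level, self.language)
```

The same type can describe both what a speech recognition model 5 is set to and what a user profile reports, which is what makes the two directly comparable.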

In operation, the mobile radio telephone 4 records speech that is spoken by a speaker and is to be recognized, and digitizes it, possibly including background noise that is also recorded. The mobile radio telephone 4 preprocesses the digitized speech signal into a sequence of feature vectors, which are sent via the radio link 3 to the base station 6 and from there to the speech recognition unit 1. These feature vectors are compatible with feature vectors used by a speech model of the speech recognition unit 1 and can be compared by the speech recognition unit 1 with feature vectors of the speech model without further preprocessing in order to identify the words contained in them. This measure, which can be implemented with very little technical effort, reduces the amount of data to be transmitted between the telephone 4 and the speech recognition unit 1 to such an extent that the bandwidth of a telephone channel is sufficient to enable speech recognition in the speech recognition unit 1 with the same quality as if it were connected to the terminal without bandwidth limitation.

In addition, the mobile radio telephone 4 sends a set of parameter values via the radio link 3 to the speech recognition unit 1, which represents a user profile for the speaker. Just as the sets of parameter values of the speech recognition models 5 provide information about the features of the speaker and his environment to which the respective speech recognition model 5 is set, this user profile with its parameter values contains information about the corresponding features of the speaker whose speech is to be recognized and of his environment.

The user profile can, for example, be created entirely or partially manually, in particular entered by a user of the mobile telephone 4 via the keyboard. Once a user profile has been entered, it remains stored in the mobile phone and can be transmitted to the speech recognition unit each time the mobile phone establishes a connection.

Some known mobile radio telephones 4 can, for example, be set manually to different background noise levels. These settings can then be adopted by the mobile phone 4 as a parameter value for the user profile. On the other hand, parameter values can also be determined by the mobile radio telephone 4 itself. If it is equipped accordingly, it can, for example, determine a background noise level itself and create the user profile with corresponding parameter values. It can also carry out a frequency analysis of the speaker's speech and classify the speaker into a specific age group based on the determined spectrum of the speech. A corresponding parameter value identifying this age group is then set in the user profile. However, it is also possible for the speech recognition unit 1 to carry out such an analysis with subsequent classification of the speaker into an age group. The mobile radio telephone 4 can recognize a type of environment of the speaker, for example a vehicle interior, from the fact that the mobile radio telephone 4 is connected to a hands-free device. The mobile radio telephone 4 then sets the value of the parameter that designates the type of environment of the speaker in the user profile accordingly. The mobile radio telephone 4 sets the parameter of the user profile which designates the national language of the speaker to the value assigned to the language selected by the user for operating the user interface of the mobile radio telephone 4.
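How a terminal might fill in the user profile from its own settings can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the mobile radio telephone 4: the names `TerminalState` and `build_user_profile`, the dictionary keys and the string values are all hypothetical, and the frequency analysis for age group or gender is left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TerminalState:
    ui_language: str                        # language selected for the user interface
    noise_setting: str                      # manual setting: "normal", "quiet" or "loud"
    hands_free_connected: bool              # True if a hands-free device is attached
    manual_gender: Optional[str] = None     # optional values entered via the keyboard
    manual_age_group: Optional[str] = None


# Hypothetical mapping from the manual noise-suppression setting to parameter P4.
NOISE_SETTING_TO_LEVEL = {"normal": "normal", "quiet": "low", "loud": "loud"}


def build_user_profile(state: TerminalState) -> Dict[str, str]:
    """Derive user-profile parameter values from the terminal's settings."""
    profile = {
        # The interface language is adopted as the national language.
        "language": state.ui_language,
        "noise_level": NOISE_SETTING_TO_LEVEL[state.noise_setting],
    }
    # A connected hands-free device indicates a vehicle interior.
    if state.hands_free_connected:
        profile["environment"] = "vehicle_interior"
    # Manually entered values are copied if present.
    if state.manual_gender is not None:
        profile["gender"] = state.manual_gender
    if state.manual_age_group is not None:
        profile["age_group"] = state.manual_age_group
    return profile
```

Parameters that the terminal cannot determine are simply absent from the profile in this sketch; how a real terminal would signal an unknown value is not specified in the text.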

The user profile is compared by the speech recognition unit 1 with the sets of parameter values of the individual speech recognition models 5. The speech recognition model 5 whose set of parameter values matches the user profile most closely is selected by the speech recognition unit 1 and used for automatic speech recognition of the speech contained in the speech signal.

Because the speech recognition unit 1, when selecting the speech recognition model 5, does not require an exact match between the user profile and the set of parameter values of the speech recognition model 5, but simply selects the speech recognition model 5 whose set of parameter values best matches the user profile, operation of the speech recognition system is also ensured in the event that a speech recognition model with parameter values which correspond exactly to those of the transmitted user profile is not available at the speech recognition unit 1.
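The best-match selection described above can be sketched in a few lines. The text does not define how the degree of match is measured, so this sketch assumes the simplest possible measure, the number of parameters whose values are equal; the function names, dictionary keys and model names are hypothetical.

```python
def match_score(profile, model_params):
    """Count the profile parameters whose value equals the model's value."""
    return sum(1 for name, value in profile.items()
               if model_params.get(name) == value)


def select_model(profile, models):
    """Return the name of the model whose parameter values best match the profile.

    `models` maps a model name to its dictionary of parameter values.
    On a tie, the first best-scoring model in iteration order is returned.
    """
    return max(models, key=lambda name: match_score(profile, models[name]))


# Hypothetical set of predefined speech recognition models.
MODELS = {
    "de_f_street_loud": {"gender": "female", "age_group": "senior", "language": "German",
                         "environment": "street", "noise_level": "loud"},
    "de_f_vehicle_normal": {"gender": "female", "age_group": "senior", "language": "German",
                            "environment": "vehicle_interior", "noise_level": "normal"},
    "en_m_building_low": {"gender": "male", "age_group": "adult", "language": "English",
                          "environment": "building_interior", "noise_level": "low"},
}

# No model matches this profile exactly (the age group differs everywhere it
# could match), yet a best match is still found.
PROFILE = {"gender": "female", "age_group": "adult", "language": "German",
           "environment": "vehicle_interior", "noise_level": "normal"}

selected = select_model(PROFILE, MODELS)
```

Because only a handful of small value sets are compared, the selection costs very little computation, which is consistent with the remark in the description that the comparison can be carried out quickly.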

In order to enable the speech recognition unit to select a speech recognition model, the user profile must be transmitted to the speech recognition unit at least once when communication is established. Preferably, however, the profile is also transmitted repeatedly during the communication. This is the prerequisite for a mobile phone that is able to determine certain parameters of the user profile automatically to report the current value of these parameters to the speech recognition unit at any time, so that the speech recognition unit can, if necessary, optimize the speech recognition by changing to a different speech recognition model adapted to the current parameter values or, if the speech recognition unit changes as a result of a handover, so that the new speech recognition unit can immediately select the best-matched speech recognition model and work with it.
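The effect of transmitting the profile repeatedly can be illustrated with a small session object that re-runs the selection whenever a profile arrives. This is a sketch under assumptions: the class name `RecognitionSession`, its method and the match measure (number of equal parameter values) are hypothetical and not taken from the text.

```python
class RecognitionSession:
    """Keeps track of the speech recognition model currently in use."""

    def __init__(self, models):
        self.models = models          # model name -> dict of parameter values
        self.current_model = None     # nothing selected before the first profile

    def _score(self, profile, model_name):
        params = self.models[model_name]
        return sum(1 for key, value in profile.items() if params.get(key) == value)

    def on_profile_received(self, profile):
        """Re-select the best-matching model; return True if the model changed."""
        best = max(self.models, key=lambda name: self._score(profile, name))
        changed = best != self.current_model
        self.current_model = best
        return changed
```

A newly contacted speech recognition unit after a handover would start with `current_model` set to `None`, so the first profile it receives immediately yields a selection.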

According to a preferred embodiment, the mobile radio telephone 4 transmits the voice signal as a multi-frame message packet, for example according to ETSI ES 201 108 V1.1.2. The header of such a message packet comprises nine previously non-standardized bits, called "expansion bits" EXP1 to EXP9, which are available for functional expansions. Of these, one can be used, for example, to encode the gender of a speaker, two to encode four different accents or dialects, one for the age group of the speaker, one to differentiate between operation with and without a hands-free system, and the remaining four to encode up to 16 national languages.
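The allocation of the nine expansion bits (1 + 2 + 1 + 1 + 4) can be sketched as a simple bit field. The order of the fields within the nine bits and the numeric codes for each value are assumptions made for this illustration; the text gives only the number of bits per feature, and the function names are hypothetical.

```python
def pack_profile(gender, accent, age_group, hands_free, language):
    """Pack the profile into a 9-bit integer.

    Assumed layout, lowest bit first:
      bit 0     gender       (0..1)
      bits 1-2  accent       (0..3)
      bit 3     age group    (0..1)
      bit 4     hands-free   (0..1)
      bits 5-8  language     (0..15)
    """
    fields = ((gender, 1), (accent, 3), (age_group, 1), (hands_free, 1), (language, 15))
    for value, maximum in fields:
        if not 0 <= value <= maximum:
            raise ValueError("field value out of range")
    return gender | (accent << 1) | (age_group << 3) | (hands_free << 4) | (language << 5)


def unpack_profile(bits):
    """Inverse of pack_profile: split a 9-bit integer into its fields."""
    return {
        "gender": bits & 0x1,
        "accent": (bits >> 1) & 0x3,
        "age_group": (bits >> 3) & 0x1,
        "hands_free": (bits >> 4) & 0x1,
        "language": (bits >> 5) & 0xF,
    }
```

With this split, the four language bits distinguish 2^4 = 16 national languages and the two accent bits distinguish 2^2 = 4 accents or dialects, matching the counts given in the text.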

Claims

1. A method for selecting a speech recognition model (5) for recognizing speech contained in a speech signal, in which (a) a user profile assigned to the speech signal is received, which specifies a set of values of a set of parameters of the speech signal relevant for speech recognition; (b) the set contained in the user profile is compared with sets of parameter values of a plurality of predefined speech recognition models (5); (c) that speech recognition model (5) is selected whose set of parameter values best matches the set of the user profile, and the speech recognition is carried out with this speech recognition model (5).

2. The method according to claim 1, characterized in that at least one of the parameters designates a language-inherent feature of the speech signal, in particular a national language or an accent or an age group or a gender of a speaker.

3. The method according to claim 1, characterized in that one of the parameters designates a national language, and that the national language entered in the user profile is the language that is set on a language-oriented user interface of a terminal used to record the speech signal.

4. The method according to any one of the preceding claims, characterized in that at least one of the parameters designates an environment-inherent feature of the speech signal, in particular a background noise level.

5. The method according to claim 4, characterized in that, for a parameter that designates an environment-inherent feature of the speech signal, each parameter value specifies a type of environment in which the speaker can be located.

6. The method according to any one of the preceding claims, characterized in that the voice signal is picked up by a terminal (4) and transmitted via a data network (3) to a speech recognition unit (1) which carries out the speech recognition.

7. The method according to any one of claims 1 to 5, characterized in that the voice signal is picked up by a terminal (4) and processed into a sequence of feature vectors, and that the sequence of feature vectors is transmitted via a data network (3) to a speech recognition unit (1), which carries out the speech recognition.

8. The method according to claim 6 or 7, characterized in that the transmission in the data network takes place via a radio channel.

9. The method according to claim 6, 7 or 8, characterized in that the user profile is transmitted from the terminal (4) to the speech recognition unit (1).

10. The method according to claim 9, characterized in that a parameter of the user profile is determined by the terminal.

11. The method according to claim 9 or 10, characterized in that the user profile is transmitted via a radio channel.

12. The method according to claim 5 in conjunction with one of claims 6 to 11, characterized in that the type of environment in which the speaker is located is recognized as a vehicle interior from the fact that the terminal (4) is connected to a hands-free device.

13. The method according to any one of the preceding claims, characterized in that the user profile is transferred to the speech recognition unit (1) before the start of the speech recognition.

14. The method according to any one of the preceding claims, characterized in that the user profile is updated repeatedly during the speech recognition.

15. Speech recognition unit (1) for recognizing speech contained in a speech signal with a speech recognition model (5), wherein the speech recognition unit (1) selects the most suitable one from a plurality of speech recognition models (5) with a method according to one of the preceding claims.

16. Terminal (4) for recording a voice signal containing voice, characterized in that a user profile is stored in the terminal (4) which specifies a set of values of a set of parameters of the voice signal relevant for voice recognition, and that the terminal (4) transfers the voice signal and the user profile to a voice recognition unit (1).

17. The terminal according to claim 16, characterized in that it has a language-oriented user interface, that the language used by this interface can be selected by the user from a plurality of languages, and that the terminal adopts the language selected for the interface as the national language of the user in the user profile stored in the terminal.