WO2005034087A1 - Selection of a speech recognition model for speech recognition - Google Patents

Selection of a speech recognition model for speech recognition

Info

Publication number
WO2005034087A1
WO2005034087A1 (PCT/EP2004/050645)
Authority
WO
WIPO (PCT)
Prior art keywords
speech recognition
speech
user profile
language
terminal
Prior art date
Application number
PCT/EP2004/050645
Other languages
German (de)
English (en)
Inventor
Sorel Stan
Original Assignee
Siemens Aktiengesellschaft
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens Aktiengesellschaft
Publication of WO2005034087A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/28Constructional details of speech recognition systems
    • G10L15/32Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • G10L2015/228Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of application context

Definitions

  • The invention relates to a method for selecting a speech recognition model for recognizing speech contained in a speech signal, as well as to a speech recognition unit and a terminal for recording the speech signal.
  • Automatic speech recognition requires a set of components or resources that must be available to a speech recognition unit to enable it to identify words in a sequence of feature vectors generated from a speech signal.
  • These resources include a so-called acoustic model, which specifies the probability of the observed feature vectors for a given sequence of words selected from the vocabulary, and a so-called language model, which defines the probability of sequences of individual words in the language to be recognized.
  • A complete set of the resources that enable a speech recognition unit to convert spoken language into text, regardless of whether it comprises the components listed above or other content serving the same purpose, is referred to below as a "speech recognition model".
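As a toy illustration of how these two resources cooperate, the following sketch scores each candidate word sequence with the sum of an acoustic log-probability and a language-model log-probability and picks the best one; the scoring functions and the bigram table are invented stand-ins, not the patent's models:

```python
import math

def recognize(feature_vectors, candidates, acoustic_score, language_score):
    """Pick the word sequence W maximizing log P(X|W) + log P(W)."""
    best, best_score = None, -math.inf
    for words in candidates:
        score = acoustic_score(feature_vectors, words) + language_score(words)
        if score > best_score:
            best, best_score = words, score
    return best

# Invented stand-ins for the two resources of a "speech recognition model":
acoustic = lambda X, W: -abs(len(X) - 3 * len(W))        # crude log P(X|W)
bigram = {("call", "home"): math.log(0.3),
          ("call", "gnome"): math.log(0.001)}
language = lambda W: sum(bigram.get(p, math.log(1e-6))   # crude log P(W)
                         for p in zip(W, W[1:]))

result = recognize([0] * 6, [("call", "home"), ("call", "gnome")],
                   acoustic, language)
print(result)  # → ('call', 'home')
```

Here the acoustic scores tie, so the language model decides in favor of the more probable word sequence.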
  • An acoustic model that is intended to be suitable for recognizing the speech of different speakers must describe a "mean" or "typical" sound of the phonemes.
  • For this purpose, the speech of different speakers of different sexes and ages, with different background noise levels, can be recorded and an averaged acoustic model created from these recordings.
  • The recognition of the phonemes in the speech of a given speaker becomes less reliable the more his speech differs from the acoustic model.
  • The same words are spoken by speakers with different regional dialects, sometimes with different phonemes; for example, a phoneme of the standard language may be replaced by another in a dialect, swallowed, or replaced by a combination of two phonemes.
  • the speech recognition model can be individually refined for each person.
  • an individual speech recognition model is thus created for each individual user of the program.
  • The program starts from a given basic speech recognition model for an average speaker. During its interaction with a person, the program refines this basic speech recognition model and makes it better and better suited to this person. An individual speech recognition model is thus created for each user of the program. Depending on which of these persons is currently using the program, the program then selects from this plurality of speech recognition models the one created individually for that person. For this purpose, the person logs on to the program with a user name, and the program accesses the individual speech recognition model assigned to this user name.
  • Difficulties with automatic speech recognition are not only due to different ways of speaking by different speakers.
  • background noises are the main sources of recognition errors.
  • If only a single speaker uses the program, different speakers are excluded as a source of error, but a high background noise level still remains as a possible source of error for the speech recognition.
  • Although PC speech recognition programs are able to adapt the speech recognition model used for a specific user dynamically and thus gradually improve the recognition accuracy, a model originally adapted to a specific background noise spectrum that is used over a long period for speech recognition in an environment with a changed noise spectrum loses its adaptation to the previous noise spectrum over time, so that when the latter is restored, the recognition rate deteriorates and the model must be retrained.
  • The object of the invention is to provide a method, and devices for carrying out the method, which enable accurate speech recognition in speech signals from a large number of different speakers, possibly with different secondary noises contained in the speech signal. This object is achieved by the method according to claim 1, the speech recognition unit according to claim 15 and the terminal according to claim 16.
  • The method according to the invention selects, from the predefined speech recognition models, the one that best matches the user profile of any speaker whose speech is to be recognized. It can therefore happen that, under changing environmental conditions of the speaker, for example different or differently loud background noise, different speech recognition models are selected on repeated recognition of the speech of one and the same speaker. Moreover, no speech recognition model has to be generated individually for a speaker.
  • A speech recognition unit using the method according to the invention thus differs from the PC speech recognition programs described above, in which a speech recognition model generated specifically for the user, and assigned to him from the outset, is selected on the basis of his user name.
  • Particularly preferably, at least one of the parameters designates a language-inherent feature of the speech signal.
  • Such language-inherent features can be, for example, an age group or a gender of the speaker.
  • the language-inherent feature is very particularly preferably a national language of the speaker.
  • If the system is an automatic information system to which more than one person connects via a telephone network, foreign users, for example with foreign mobile phones, can use the information system in their respective national language, provided that the information system has a speech recognition model for this national language.
  • a language-inherent characteristic can also be used to differentiate between regional dialects of a language in order to reduce sources of error that can occur due to an accent of a speaker.
  • If the terminal used to record the voice signal has a language-oriented user interface, e.g. a menu, and the language used by this interface can be selected by the user from several languages, the language selected for the interface can advantageously be adopted in the user profile stored in the terminal as the national language of the user.
  • At least one of the parameters denotes an environment-inherent feature of the speech signal, which is, in particular, a background noise level.
  • a parameter value can also specify a type of environment in which the speaker is located.
  • the parameter value can be used to distinguish between the environments “street”, “interior of the building” or “interior of the vehicle”, in order to be able to more easily identify secondary noises contained in a speech signal and distinguish them from the spoken language.
  • the method according to the invention is particularly suitable for a system in which the voice signal is picked up by a terminal and transmitted via a data network, in particular a telephone network, to a speech recognition unit which carries out the speech recognition.
  • a speech recognition unit can, as already mentioned above, be an automatic information system.
  • Another example is a dictation system such as a voice-controlled short message generator, which converts the voice signal received from a user into a text message and sends it, in a suitable format supported by the respective telephone network, for example as an SMS message, to a recipient specified by the user.
  • a mobile phone can be used as the terminal.
  • Some known mobile telephones can be set manually, for the purpose of noise suppression, to various background noise levels such as "normal", "quiet" or "loud". This setting can advantageously be used to determine the parameter value of a parameter that designates a background noise level as an environment-inherent feature of the speech signal.
  • a voice signal picked up by the terminal could be transmitted to the voice recognition unit in the same format as, for example, to another telephone in the data network.
  • Since telephone networks generally do not have the bandwidth required for faithful reproduction of the voice signal, it is preferable to preprocess the voice signal at the terminal into a sequence of feature vectors whose data volume is smaller than that of the digitized voice signal from which they were derived, and which can be transmitted digitally in the data network without loss of quality.
  • A parameter of the user profile can also be determined by the terminal itself. This can be done, for example, by a frequency analysis of the speech, which the terminal uses to obtain parameter values, for example for the parameters "age group" and "gender". Some parameter values can also be recognized by the terminal from its mode of operation. If, for example, the user profile has a parameter whose values specify a type of environment in which the speaker may be located, and one of these environments is a vehicle interior, it is advantageous to use a hands-free device, to which a mobile phone serving as the terminal is connected, to identify the type of environment.
  • the vehicle interior is easily recognized as the type of surroundings of the speaker by the fact that the mobile radio telephone is connected to the hands-free device.
  • the mobile radio telephone or the terminal device sends this information to the speech recognition unit by means of a correspondingly set parameter value in the user profile.
  • mobile telephones have a language-oriented user interface, usually in the form of a screen, in which options available for operating and configuring the device are shown in text form and are offered to the user for selection.
  • As the national language of the user, the terminal advantageously enters in the user profile the language that the user has selected for the user interface.
  • the user profile can be transferred to the speech recognition unit before the speech recognition begins.
  • For example, the terminal can transmit the user profile to the speech recognition unit when a connection is established between the two. The speech recognition unit then selects the speech recognition model at the beginning of the speech recognition and carries out the speech recognition with this one speech recognition model.
  • the user profile is transmitted repeatedly during the speech recognition. This gives the terminal the possibility to update the user profile if necessary, in particular to adapt it to changing environmental conditions, and to transfer it to the speech recognition unit.
  • The speech recognition unit can continuously check whether the speech recognition model it has chosen is still appropriate and, if necessary, replace it with another one. In this way, changing environmental conditions, such as a changing background noise level, can be taken into account.
  • the speech recognition is carried out with a speech recognition model that is always up to date, so that the error rate of the speech recognition can be further reduced.
  • In the event of a handover of a mobile radio telephone to another mobile radio network, the repeated transmission of the user profile can also be used to inform a speech recognition unit of this other network about the speech recognition model to be used.
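The effect of retransmitting the profile during recognition can be sketched as follows; the dictionary format for profiles and models is assumed purely for illustration:

```python
def reselect_on_update(profile_updates, models):
    """Re-select the best-matching model each time the terminal
    retransmits the user profile, so that e.g. a changed noise level
    or a handover immediately leads to a better-suited model."""
    history = []
    for profile in profile_updates:
        best = max(models, key=lambda m: sum(
            1 for key, value in profile.items() if m.get(key) == value))
        history.append(best["id"])
    return history

models = [{"id": "quiet-de", "noise": "quiet", "lang": "de"},
          {"id": "loud-de", "noise": "loud", "lang": "de"}]
history = reselect_on_update(
    [{"noise": "quiet", "lang": "de"},   # initial profile
     {"noise": "loud", "lang": "de"}],   # updated after a noise change
    models)
print(history)  # → ['quiet-de', 'loud-de']
```

Each retransmission triggers a fresh selection, so the recognizer switches models as soon as the reported conditions change.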
  • Figure 1 is a schematic representation of a system for carrying out the method according to the invention.
  • A speech recognition unit 1 is connected via a telephone network to a mobile radio base station 6, which has an antenna 2.
  • the base station 6 is connected to a terminal 4, which is a mobile radio telephone, via a radio link 3 and the antenna 2.
  • The speech recognition unit 1 has a plurality of different speech recognition models 5 available for selection, each of which is adapted to different forms of certain features that may be inherent in a speech signal: features of the speech of a speaker or of the environment in which the speaker speaks, more precisely of its background noise.
  • The language-inherent features taken into account by the speech recognition models 5 include the gender of the speaker, an age group of the speaker, and a national language spoken by the speaker. Each speech recognition model 5 has access to a vocabulary of its national language. With regard to the environment-inherent features, the speech recognition models 5 are set to different types of environment in which the speaker can be located, as well as to different background noise levels of the particular environment.
  • the set comprises five parameters P1, P2, P3, P4, P5.
  • The parameter P1 denotes the speaker's gender.
  • For all speech recognition models 5 which are provided for female speakers, the parameter P1 has the same parameter value, and for all speech recognition models 5 which are provided for male speakers, the parameter P1 assumes the corresponding parameter value for male speakers.
  • The parameter P2 differentiates between predefined age groups to which the speaker can belong. Depending on the age group to which the speaker for whom a speech recognition model 5 is intended belongs, the parameter P2 of this speech recognition model 5 assumes the corresponding parameter value.
  • The parameter value belonging to parameter P3 is used to differentiate between three different types of environment in which the speaker can be located, namely a vehicle interior, a building interior, and a street.
  • parameter P4 designates a level of background noise in the vicinity of the speaker. A distinction is made between a normal noise level, a low noise level and a loud noise level.
  • parameter P5 is provided, which e.g. can take five different parameter values, each of which stands for a different language.
  • the parameter values are used to differentiate between the national languages German, English, French, Italian and Spanish.
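On the terminal side, the five-parameter profile could be represented roughly as follows; the field names and value sets are illustrative, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UserProfile:
    gender: str       # P1: e.g. "male" / "female"
    age_group: str    # P2: one of several predefined age bands
    environment: str  # P3: "vehicle", "building" or "street"
    noise_level: str  # P4: "quiet", "normal" or "loud"
    language: str     # P5: "de", "en", "fr", "it" or "es"

# Example profile for a speaker in a loud vehicle, speaking German:
profile = UserProfile(gender="female", age_group="adult",
                      environment="vehicle", noise_level="loud",
                      language="de")
print(profile.language)  # → de
```

Both the transmitted user profile and each model's parameter set would be instances of such a structure, which is what makes the later comparison straightforward.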
  • The mobile radio telephone 4 records speech spoken by a speaker that is to be recognized and digitizes it, possibly together with background noise that is also picked up.
  • the mobile radio telephone 4 preprocesses the digitized speech signal into a sequence of feature vectors which are sent via the radio link 3 to the base station 6 and from there to the speech recognition unit 1.
  • These feature vectors are compatible with the feature vectors used by a speech model of the speech recognition unit 1 and can be compared by the speech recognition unit 1 with feature vectors of the speech model, without further preprocessing, in order to identify the words contained therein.
  • This measure, which can be implemented with very little technical effort, reduces the amount of data to be transmitted between the telephone 4 and the speech recognition unit 1 to such an extent that the bandwidth of a telephone channel is sufficient to enable speech recognition in the speech recognition unit 1 with the same quality as if it were connected to the terminal without any bandwidth limitation.
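The data reduction can be illustrated with a toy front end: the digitized signal is cut into frames and each frame is reduced to a few coefficients. A real terminal would compute e.g. mel-cepstral coefficients as in ETSI ES 201 108; the log-energy stand-in here only demonstrates the framing and the shrinkage:

```python
import math

def frames_to_features(samples, frame_len=160, n_coeffs=13):
    """Split the signal into non-overlapping frames and keep a small
    feature vector per frame instead of the raw samples."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = math.log(sum(s * s for s in frame) + 1e-9)
        features.append([energy] * n_coeffs)  # stand-in for real MFCCs
    return features

pcm = [((i * 37) % 200 - 100) / 100 for i in range(8000)]  # 1 s at 8 kHz
feats = frames_to_features(pcm)
print(len(pcm), "samples ->", len(feats), "vectors of", len(feats[0]))
# → 8000 samples -> 50 vectors of 13
```

One second of 8 kHz audio shrinks from 8000 samples to 50 vectors of 13 coefficients, which is why a telephone channel suffices for transmission.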
  • the mobile radio telephone 4 sends a set of parameter values via the radio link 3 to the speech recognition unit 1, which represents a user profile for the speaker.
  • Just as the sets of parameter values of the speech recognition models 5 provide information about features of the speaker and his environment to which the respective speech recognition model 5 is set, this user profile with its parameter values contains information about the corresponding features of the speaker of the speech to be recognized and of his environment.
  • the user profile can, for example, be created entirely or partially manually, in particular entered by a user of the mobile telephone 4 via the keyboard. Once a user profile has been entered, it remains stored in the mobile phone and can be transmitted to the speech recognition unit each time the mobile phone establishes a connection.
  • Some known mobile radio telephones 4 can, for example, be set manually to different background noise levels. These settings can then be adopted by the mobile phone 4 as parameter values for the user profile.
  • Parameter values can also be determined by the mobile radio telephone 4 itself. If it is equipped accordingly, it can, for example, measure the background noise level itself and create the user profile with corresponding parameter values. It can also carry out a frequency analysis of the speaker's speech and classify the speaker into a specific age group on the basis of the determined spectrum. A corresponding parameter value identifying this age group is then set in the user profile. It is, however, also possible for the speech recognition unit 1 to carry out such an analysis with subsequent classification of the speaker into an age group.
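A device-side estimate of the noise-level parameter might look like the following sketch; the RMS thresholds and the mapping to the three labels are invented for illustration:

```python
def noise_level_label(noise_samples):
    """Map measured background samples (normalized to [-1, 1]) to one
    of the three values of the noise-level parameter; the thresholds
    are made up for this sketch."""
    rms = (sum(s * s for s in noise_samples) / len(noise_samples)) ** 0.5
    if rms < 0.05:
        return "quiet"
    if rms < 0.3:
        return "normal"
    return "loud"

print(noise_level_label([0.01, -0.02, 0.015, -0.01]))  # → quiet
print(noise_level_label([0.5, -0.6, 0.4, -0.5]))       # → loud
```

The returned label would simply be written into the corresponding field of the user profile before transmission.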
  • The cellular phone 4 can recognize the type of the speaker's surroundings, for example a vehicle interior, from the fact that the cellular phone 4 is connected to a hands-free device. The mobile telephone 4 accordingly sets the parameter value of the parameter that characterizes the type of environment of the speaker in the user profile. The mobile telephone 4 sets the parameter of the user profile which designates the national language of the speaker to the value assigned to the language selected by the user for operating the user interface of the mobile radio telephone 4.
  • the user profile is compared by the speech recognition unit 1 with the sets of parameter values of the individual speech recognition models 5.
  • the speech recognition model 5 whose set of parameter values matches the user profile most closely is selected by the speech recognition unit 1 and used for automatic speech recognition of the speech contained in the speech signal.
  • Since the speech recognition unit 1 does not require an exact match between the user profile and the set of parameter values of a speech recognition model 5 when making its selection, but simply selects the speech recognition model 5 whose set of parameter values best matches the user profile, operation of the speech recognition system is ensured even if no speech recognition model with parameter values exactly corresponding to those of the transmitted user profile is available at the speech recognition unit 1.
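The best-match selection described here can be sketched by counting agreeing parameter values and taking the model with the highest count; the data layout is assumed for illustration, and an exact match is deliberately not required:

```python
def select_model(profile, models):
    """Return the model whose parameter set shares the most values
    with the user profile (exact agreement is not required)."""
    def overlap(params):
        return sum(1 for key, value in profile.items()
                   if params.get(key) == value)
    return max(models, key=lambda m: overlap(m["params"]))

profile = {"gender": "f", "age": "adult", "env": "vehicle",
           "noise": "loud", "lang": "de"}
models = [
    {"name": "A", "params": {"gender": "f", "age": "adult", "env": "street",
                             "noise": "normal", "lang": "de"}},
    {"name": "B", "params": {"gender": "f", "age": "adult", "env": "vehicle",
                             "noise": "loud", "lang": "de"}},
]
chosen = select_model(profile, models)
print(chosen["name"])  # → B
```

Even if model B were absent, the same function would still return model A as the closest available match, which is exactly the fallback behavior the paragraph above describes.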
  • In order to enable the speech recognition unit to select a speech recognition model, the user profile must be transmitted to the speech recognition unit at least once when communication is established. Preferably, however, the profile is also transmitted repeatedly during the communication. This allows a mobile phone that is able to determine certain parameters of the user profile automatically to report the current values of these parameters to the speech recognition unit at any time, so that the speech recognition can be optimized, if necessary, by switching to a different speech recognition model adapted to the current parameter values, or, if the speech recognition unit changes as a result of a handover, the new speech recognition unit can immediately select the best-matching speech recognition model and work with it.
  • The mobile radio telephone 4 transmits the voice signal as a multi-frame message package, for example according to ETSI ES 201 108 v1.1.2.
  • the header of such a message packet comprises nine previously non-standardized bits, called "expansion bits" EXP1 to EXP9, which are available for functional expansions.
  • One of these bits can be used, for example, to encode the gender of the speaker, two for encoding four different accents or dialects, one for the age group of the speaker, one for differentiating between operation with and without a hands-free system, and the remaining four for coding up to 16 national languages.
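The bit allocation suggested above (1 bit gender, 2 bits accent/dialect, 1 bit age group, 1 bit hands-free, 4 bits national language) could be packed into the nine expansion bits as follows; the field order and the encodings are illustrative, not part of ETSI ES 201 108:

```python
def pack_expansion_bits(gender, accent, age_group, handsfree, language):
    """Pack the profile fields into a 9-bit value: 1 bit gender,
    2 bits accent/dialect, 1 bit age group, 1 bit hands-free,
    4 bits national language (layout chosen for illustration)."""
    assert 0 <= accent < 4 and 0 <= language < 16
    bits = gender & 1
    bits = (bits << 2) | accent
    bits = (bits << 1) | (age_group & 1)
    bits = (bits << 1) | (handsfree & 1)
    bits = (bits << 4) | language
    return bits  # always fits in 9 bits

packed = pack_expansion_bits(gender=1, accent=2, age_group=0,
                             handsfree=1, language=4)
print(f"{packed:09b}")  # → 110010100
```

The receiving speech recognition unit would unpack the same fields in reverse order to reconstruct the user profile from the message header.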


Abstract

The invention relates to a method for selecting a speech recognition model (5) for the recognition of speech contained in a speech signal. The method comprises: receiving a user profile assigned to the speech signal, specifying a set of values of a number of parameters of the speech signal relevant to the speech recognition; comparing the set contained in the user profile with the sets of parameter values of a plurality of predefined speech recognition models (5); selecting the speech recognition model (5) whose set of parameter values best matches the set of the user profile; and carrying out the speech recognition with this speech recognition model (5).
PCT/EP2004/050645 2003-09-29 2004-04-29 Selection of a speech recognition model for speech recognition WO2005034087A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE10345254 2003-09-29
DE10345254.0 2003-09-29

Publications (1)

Publication Number Publication Date
WO2005034087A1 (fr) 2005-04-14

Family

ID=34399051

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2004/050645 WO2005034087A1 (fr) 2003-09-29 2004-04-29 Selection of a speech recognition model for speech recognition

Country Status (1)

Country Link
WO (1) WO2005034087A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1096473A2 (fr) * 1999-10-26 2001-05-02 Persay Inc., c/o Corporation Service Company Partition de modèles de bruits ambiants pour l'identification et la vérification du locuteur
EP1134726A1 (fr) * 2000-03-15 2001-09-19 Siemens Aktiengesellschaft Méthode pour la reconnaissance de prononciations d'un locuteur étranger dans un système de traitement de la parole
EP1215653A1 (fr) * 2000-12-18 2002-06-19 Siemens Aktiengesellschaft Procédé et dispositif pour la reconnaissance de la parole dans un petit appareil
US20020138272A1 (en) * 2001-03-22 2002-09-26 Intel Corporation Method for improving speech recognition performance using speaker and channel information


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"IMPROVING SPEECH RECOGNITION ACCURACY WITH MULTIPLE PHONETIC MODELS", IBM TECHNICAL DISCLOSURE BULLETIN, IBM CORP. NEW YORK, US, vol. 38, no. 12, 1 December 1995 (1995-12-01), pages 73, XP000588077, ISSN: 0018-8689 *


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase