WO2001018788A2 - Method for end-of-sentence determination in automatic speech processing - Google Patents

Method for end-of-sentence determination in automatic speech processing

Info

Publication number
WO2001018788A2
Authority
WO
WIPO (PCT)
Prior art keywords
tokens
token
sentence
assessment
category
Prior art date
Application number
PCT/DE2000/002979
Other languages
German (de)
English (en)
Other versions
WO2001018788A3 (fr)
Inventor
Horst-Udo Hain
Martin Holzapfel
Original Assignee
Siemens Aktiengesellschaft
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens Aktiengesellschaft filed Critical Siemens Aktiengesellschaft
Publication of WO2001018788A2 publication Critical patent/WO2001018788A2/fr
Publication of WO2001018788A3 publication Critical patent/WO2001018788A3/fr

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding

Definitions

  • The present invention relates to a method for end-of-sentence determination in automatic speech processing.
  • The two main areas of application for automatic speech processing are automatic speech recognition and automatic speech synthesis.
  • Methods for synthesizing speech are known for example from EP 793 218 A2, EP 821 344 A2 or WO 96/42079.
  • A text present in the form of a text file is converted into an audio file, which is output as speech by means of an acoustic output unit.
  • An attempt is made to reproduce human speech as accurately as possible.
  • The two main criteria for this are the intelligibility of the generated speech itself and its prosody.
  • The prosody is essentially determined by the fundamental frequency (pitch), the sound energy (loudness) and the sound duration (lengthening and pauses).
  • A complex problem in creating the correct prosody is recognizing the end of a sentence in an arbitrary text. To do this, the punctuation marks valid in the respective language must be interpreted correctly. So far, this problem has been solved by rule-based routines implemented in a corresponding speech-generation program. To set up such a rule-based routine, a language expert is required who draws up a rule set for the respective language. Creating the rule set involves considerable effort, which must be repeated for each language in which the method is to be used.
  • The invention is therefore based on the object of providing a method for end-of-sentence determination in automatic speech processing that can be adapted to different languages more easily than the known methods and nevertheless recognizes sentence ends with a low error rate.
  • The method according to the invention for determining the end of a sentence in automatic speech processing comprises the following steps:
  • The assessment of the flagged tokens can be carried out using a data-driven routine, that is to say a learning program part which can adapt itself to a language essentially on its own.
  • Data-driven routines are routines that independently compile statistics and evaluate them when making a decision, or neural networks.
  • The disambiguation of the tokens can likewise be realized by means of data-driven routines.
  • The method according to the invention is particularly suitable for data-driven routines, since the assessment of the flagged tokens is carried out after disambiguation of the tokens on the basis of the category assigned to them, so that the linguistic categories determined for the individual tokens are almost completely correct and the flagged tokens can accordingly be assessed precisely.
  • The two method steps of disambiguating the tokens and assessing the flagged tokens are designed as neural networks, each of which accesses the same context, e.g. three tokens before and three tokens after the token to be examined.
  • Fig. 4 shows the structure of a neural network for assessing sentence endings.
  • A text file is first divided into tokens.
  • Tokens are all text elements that are located between two token separators.
  • The token separators include spaces, tabs and end-of-line characters.
  • A token begins with a character that is not a separator and ends with the character after which a separator follows. These separators can be stored in a separate file for each language.
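  • The tokenization step can be illustrated with the following minimal Python sketch; loading the separators from a per-language file follows the description above, while the default separator set and everything else are illustrative assumptions rather than an implementation prescribed by the patent:

```python
# Minimal tokenizer sketch: split a text into tokens at language-specific
# separators. Reading the separators from a per-language file follows the
# description above; the default set here is an illustrative assumption.

def load_separators(path=None):
    """Return the set of token separators, optionally read from a file."""
    if path is None:
        return {" ", "\t", "\n"}
    with open(path, encoding="utf-8") as f:
        return set(f.read())

def tokenize(text, separators):
    """A token starts with a non-separator and ends before the next separator."""
    tokens, current = [], []
    for ch in text:
        if ch in separators:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

print(tokenize("Das ist z.B. ein Satz.\nEin zweiter Satz!", load_separators()))
# ['Das', 'ist', 'z.B.', 'ein', 'Satz.', 'Ein', 'zweiter', 'Satz!']
```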
  • The tokens that can represent a sentence end are marked with a corresponding flag.
  • Flags in the sense of the invention are any data assignments with which individual tokens, once assigned, can be identified simply and quickly as a possible sentence end. This flag is called PEOS (possible end of sentence). All tokens that contain a character that can possibly be understood as an end-of-sentence character are regarded as tokens that can represent a sentence end.
  • Among end-of-sentence characters, a distinction is made between characters that always mark the end of a sentence, such as the question mark or the exclamation mark, and characters that can also have other uses, such as the period, which can also appear in abbreviations, acronyms and numbers.
  • A special case for the determination of prosody is the colon, since it never stands at the end of a grammatical sentence but, for prosody and in particular for a pause in speech, generally has the same significance as the period at the end of a sentence.
  • The end-of-sentence character is at the end of the token and a token beginning with a lower-case letter follows: in this case it is not a sentence end.
  • The punctuation mark is inside the token, which means that it is not followed by a token separator. This case occurs, for example, in numbers (1.5, 13:20).
  • In this case the end-of-sentence character never marks a sentence end.
  • The end-of-sentence character is at the end of the token and the next token does not begin with a lower-case letter.
  • This token, which has the end-of-sentence character at its end, represents a possible sentence end and is marked with the PEOS flag (PEOS: possible end of sentence).
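  • The flagging rules above can be sketched as follows; the character sets and the lower-case test are illustrative assumptions, since the patent leaves these details language-specific:

```python
# Sketch of the flagging rules described above. The character sets and the
# lower-case test are illustrative assumptions; the patent leaves them
# language-specific.

ALWAYS_END = {"?", "!"}   # characters that always mark a sentence end
MAYBE_END  = {".", ":"}   # characters that can also have other uses

def flag_possible_sentence_ends(tokens):
    """Return a list of (token, peos_flag) pairs."""
    flagged = []
    for i, tok in enumerate(tokens):
        last = tok[-1]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        peos = False
        if last in ALWAYS_END:
            peos = True
        elif last in MAYBE_END:
            # Not a sentence end if the following token starts with a lower-case letter.
            peos = not (nxt is not None and nxt[:1].islower())
        # A punctuation mark inside the token (e.g. "1.5", "13:20") is never
        # flagged, because it is not the last character of the token.
        flagged.append((tok, peos))
    return flagged
```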
  • Linguistic categories are assigned to the individual tokens.
  • The linguistic categories include word classes and other characters that can be contained in a text.
  • The linguistic categories used in the present exemplary embodiment are listed:
  • The division of the linguistic categories given above is only an example. Other subdivisions of linguistic categories can also be used. For example, up to 40 linguistic categories are used in speech recognition. In the present invention, however, a division with fewer categories is advantageous, since the neural networks explained in more detail below can be implemented more easily and can be trained faster.
  • The linguistic categories belonging to the respective tokens are read from a lexicon. It is possible that several linguistic categories are assigned to a single token. As a rule, however, not all tokens of a text are present in the lexicon, so that the appropriate category or categories cannot be determined for all tokens with the help of the lexicon.
  • The linguistic category of the tokens that cannot be assigned to a category by means of the lexicon is determined using a so-called OOV routine (out of vocabulary).
  • This OOV routine is designed as a neural network which uses the last four letters of the respective token to infer its category. However, the OOV routine can also be based on another data-driven method.
  • The neural network of the OOV routine can also evaluate the last three or five characters of the token in order to infer its category. In another language it may be appropriate to determine the category not from the ending but from another section of the token.
  • The linguistic category can be ambiguous both in the categorization using the lexicon and in the categorization using the OOV routine, that is to say the token is assigned several linguistic categories.
  • The lexicons for the individual languages are in turn language-specific, so that the lexicon must be replaced accordingly when the method according to the invention is transferred to another language.
  • Such lexicons are available for most languages, which is why exchanging the lexicon is not a serious problem when the method according to the invention is transferred to another language.
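  • A rough sketch of this categorization step is given below; the lexicon contents, category names and the suffix rules are stand-ins chosen only for illustration, since the patent uses a lexicon file and a trained neural network for the OOV routine:

```python
# Categorization sketch: look up each token in a lexicon; for tokens that are
# not in the lexicon, fall back to an OOV step that judges the token by its
# ending. Lexicon contents, category names and the suffix rules below are
# stand-ins; in the patent the OOV routine is a trained neural network.

LEXICON = {
    "haus":   ["NOUN"],
    "das":    ["ARTICLE", "PRONOUN"],   # ambiguous: several categories
    "laufen": ["VERB", "NOUN"],
}

def oov_categories(token, suffix_len=4):
    """Stand-in for the OOV network: infer categories from the last letters."""
    suffix = token.lower()[-suffix_len:]
    if suffix.endswith("ung") or suffix.endswith("heit"):
        return ["NOUN"]
    if suffix.endswith("en"):
        return ["VERB", "NOUN"]          # still ambiguous
    return ["UNKNOWN"]

def categorize(token):
    cats = LEXICON.get(token.lower().rstrip(".!?:"))
    return cats if cats else oov_categories(token)

for t in ["Das", "Laufen", "Zeitung", "Wolkenkratzer"]:
    print(t, categorize(t))
```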
  • The tokens can be subjected to further processing operations, which are summarized as step S4 in the flow chart shown in FIG. 1.
  • In these processing operations, abbreviations, acronyms and formulas contained in the text can be evaluated. This evaluation can show that a token flagged as a potential sentence end cannot be a sentence end. In such a case, the corresponding flag is deleted during these processing operations.
  • Other such processing operations are, for example, normalizing or expanding the tokens.
  • When normalizing, tokens that contain characters of different categories, such as "54jährig", are categorized.
  • When tokens are expanded, several tokens such as "New" and "York" are combined into a single token "New York". These processing operations, too, can result in a flag set in step S2 having to be deleted, which is then carried out accordingly.
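  • The following sketch illustrates these additional processing operations; the abbreviation list and the multi-word table are assumptions chosen only for illustration:

```python
# Sketch of the additional processing operations: an abbreviation check that
# deletes a PEOS flag set in step S2, and an expansion step that merges
# multi-word expressions into a single token. Both tables are illustrative.

ABBREVIATIONS = {"z.B.", "Dr.", "etc."}
MULTI_WORD    = {("New", "York"): "New York"}

def revise_flags(flagged):
    """Delete the PEOS flag for tokens that are known abbreviations."""
    return [(tok, peos and tok not in ABBREVIATIONS) for tok, peos in flagged]

def expand(flagged):
    """Combine token pairs such as ('New', 'York') into one token."""
    out, i = [], 0
    while i < len(flagged):
        pair = (flagged[i][0], flagged[i + 1][0]) if i + 1 < len(flagged) else None
        if pair in MULTI_WORD:
            out.append((MULTI_WORD[pair], flagged[i + 1][1]))
            i += 2
        else:
            out.append(flagged[i])
            i += 1
    return out
```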
  • The ambiguous tokens, that is to say the tokens to which several linguistic categories are assigned, are then disambiguated.
  • This is carried out by a neural network which is based on a standard feed-forward architecture with a hidden layer.
  • This neural network is shown schematically and in roughly simplified form in FIG. 3.
  • On the input side it has nodes for the token to be disambiguated and for the corresponding predecessors and successors.
  • Three tokens preceding the token to be disambiguated and three tokens following it are taken into account. This means that, for each of the three predecessor tokens, 14 nodes are provided for the individual categories.
  • 13 nodes are provided for the token to be disambiguated, since the category of punctuation marks does not have to be taken into account here.
  • 3 × 14 = 42 nodes are therefore provided for the successors as well as for the predecessors. Each of these nodes thus represents a linguistic category for a specific token.
  • The input signal +1 is applied to a node if the respective category is assigned to the respective token, or -1 if this category is not assigned to the respective token. If there is no predecessor or successor token, as at the beginning and at the end of the text, the respective nodes are assigned the value 0.
  • On the output side, 13 nodes are provided for the respective categories of the token to be disambiguated.
  • A hidden layer is located between the output nodes and the input nodes.
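  • The input coding described above can be sketched as follows; the resulting vector has 13 + 3 × 14 + 3 × 14 = 97 values, while the concrete category inventory is not spelled out in this text and is therefore left abstract:

```python
# Sketch of the input coding: 13 values for the token to be disambiguated and
# 14 values for each of the three predecessors and three successors
# (+1 = category assigned, -1 = category not assigned, 0 = no token present).
# The concrete category inventory is an assumption; the text only gives counts.
import numpy as np

N_CATS_TOKEN   = 13   # categories of the token itself (no punctuation category)
N_CATS_CONTEXT = 14   # context tokens additionally carry a punctuation category

def encode(cats, n_cats):
    """cats: set of category indices assigned to the token, or None if absent."""
    if cats is None:
        return np.zeros(n_cats)
    v = -np.ones(n_cats)
    v[list(cats)] = 1.0
    return v

def input_vector(token_cats, predecessors, successors):
    """predecessors/successors: three category sets (or None), nearest first."""
    parts = [encode(token_cats, N_CATS_TOKEN)]
    parts += [encode(c, N_CATS_CONTEXT) for c in predecessors]
    parts += [encode(c, N_CATS_CONTEXT) for c in successors]
    return np.concatenate(parts)          # 13 + 42 + 42 = 97 values
```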
  • FIG. 4 shows the neural network for assessing the flagged tokens; its output is either 'end of sentence' or 'no end of sentence'.
  • This neural network in turn has 13 nodes for the token to be assessed, 42 nodes for the predecessors (3 tokens) and 42 nodes for the successors (3 tokens).
  • A hidden layer is arranged above this, and on the output side there is only a single node, which represents the binary result: the token is a sentence end or is not a sentence end.
  • This structure of the neural network shows that the linguistic categories of the predecessors and successors, as well as that of the token to be assessed, are taken into account in the assessment of the flagged token.
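  • Both networks can be sketched as standard feed-forward models over this 97-value input; the hidden-layer size, the activation functions and the random initialization are assumptions, since the text only specifies the node counts and the single binary output of the assessment network:

```python
# Sketch of the two feed-forward networks: the disambiguation network maps the
# 97 input values to scores for the 13 categories of the middle token, the
# assessment network maps the same kind of input to a single end-of-sentence
# score. Hidden size, activations and the random weights are assumptions;
# in practice the weights are learned in the training phase.
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_HIDDEN = 97, 30

def mlp(n_in, n_hidden, n_out):
    """Randomly initialized weights of a network with one hidden layer."""
    return (rng.normal(scale=0.1, size=(n_hidden, n_in)), np.zeros(n_hidden),
            rng.normal(scale=0.1, size=(n_out, n_hidden)), np.zeros(n_out))

def forward(params, x):
    W1, b1, W2, b2 = params
    h = np.tanh(W1 @ x + b1)              # hidden layer
    return W2 @ h + b2                    # raw output scores

disambiguation_net = mlp(N_IN, N_HIDDEN, 13)   # one output per category
assessment_net     = mlp(N_IN, N_HIDDEN, 1)    # single binary output

x = rng.uniform(-1.0, 1.0, N_IN)               # stand-in for input_vector(...)
category = int(np.argmax(forward(disambiguation_net, x)))
is_end   = 1.0 / (1.0 + np.exp(-forward(assessment_net, x)[0])) > 0.5
print(category, is_end)
```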
  • An audio file can thus be generated on the basis of these data (step S7); further parameters for determining the prosody are taken into account here, but they are not the subject of the present invention.
  • The neural networks or other data-driven routines of the method according to the invention are initially trained in a training phase using a training text.
  • The linguistic categories of the tokens and the sentence ends of this training text are known and are presented to the routines to be trained as input during the training.
  • The method according to the invention thus learns the rules of a language automatically; only known and easily available knowledge (division into tokens, assignment of flags for possible sentence ends, lexicon) has to be supplied as expert knowledge.
  • The language rules that are difficult to formulate in practice are learned by the method according to the invention during training. The method according to the invention can thus be quickly and easily transferred to another language.
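  • How the annotated training text could be turned into training examples for the assessment network is sketched below; the record format is an assumption, and the sketch reuses the input_vector() helper from the coding sketch above:

```python
# Sketch: derive (input vector, target) pairs for the assessment network from a
# training text whose categories and sentence ends are already annotated.
# The record format is an assumption; input_vector() is the helper from the
# coding sketch above.

def training_pairs(annotated_tokens):
    """annotated_tokens: list of dicts with 'cats' (category indices),
    'peos' (flag from step S2) and 'is_end' (known sentence end)."""
    pairs = []
    n = len(annotated_tokens)
    for i, tok in enumerate(annotated_tokens):
        if not tok["peos"]:
            continue                      # only flagged tokens are assessed
        pred = [annotated_tokens[j]["cats"] if j >= 0 else None
                for j in range(i - 1, i - 4, -1)]
        succ = [annotated_tokens[j]["cats"] if j < n else None
                for j in range(i + 1, i + 4)]
        x = input_vector(tok["cats"], pred, succ)
        y = 1.0 if tok["is_end"] else 0.0
        pairs.append((x, y))
    return pairs
```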
  • The method according to the invention is implemented as a computer program on a computer system, as is shown schematically in simplified form in FIG. 2.
  • The computer program can also be stored on an electronically readable data carrier and can thus be transferred to another computer system.
  • The computer system 1 has an internal bus 2, which is connected to a memory area 3, a central processor unit 4 and an interface 5.
  • The interface 5 establishes a data connection to further computer systems via a data line 6.
  • The acoustic output unit 7 is connected to a loudspeaker 10, the graphic output unit 8 to a screen 11, and the input unit 9 to a keyboard 12.
  • Texts that are stored in the memory 3 can be transmitted to the computer system 1 via the data line 6 and the interface 5.
  • The memory area 3 is subdivided into several areas in which texts, audio files, application programs for carrying out the method according to the invention and further application and auxiliary programs are stored.
  • The texts stored as text files are converted into audio files by the application programs for executing the method according to the invention; these audio files are transmitted via the internal bus 2 to the acoustic output unit 7 and output by the latter at the loudspeaker 10 as speech.
  • The invention is explained in more detail above using an exemplary embodiment for the German language.
  • The invention is not restricted to the German language but, in comparison with known methods, can be transferred very easily to other languages.
  • An essential advantage of the method according to the invention compared to known methods is that it also enables end-of-sentence recognition in languages for which expert knowledge of the language rules for determining the token category and the end of sentences is not yet available.
  • The method according to the invention can thus also be used easily for languages which are less widespread and therefore little researched.
  • In a variant of the exemplary embodiment described above, the two neural networks are designed as a single neural network for disambiguating the tokens and for assessing the sentence ends. It is also possible to use any other statistical, data-driven method instead of neural networks.

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

According to the invention, a text divided into lexical units (tokens) is processed in such a way that the individual tokens are first subdivided into predetermined linguistic categories, ambiguous tokens being disambiguated in a separate step, and the final determination of the sentence ends being carried out on the basis of the linguistic categories.
PCT/DE2000/002979 1999-09-03 2000-08-31 Procede de determination de fins de phrase dans le traitement vocal automatique WO2001018788A2 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE19942171.4 1999-09-03
DE1999142171 DE19942171A1 (de) 1999-09-03 1999-09-03 Verfahren zur Satzendebestimmung in der automatischen Sprachverarbeitung

Publications (2)

Publication Number Publication Date
WO2001018788A2 true WO2001018788A2 (fr) 2001-03-15
WO2001018788A3 WO2001018788A3 (fr) 2001-09-07

Family

ID=7920746

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/DE2000/002979 WO2001018788A2 (fr) 1999-09-03 2000-08-31 Procede de determination de fins de phrase dans le traitement vocal automatique

Country Status (2)

Country Link
DE (1) DE19942171A1 (fr)
WO (1) WO2001018788A2 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102016008855A1 (de) 2016-07-20 2018-01-25 Audi Ag Verfahren zum Durchführen einer Sprachübertragung

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE3733674A1 (de) * 1986-10-03 1988-04-21 Ricoh Kk Sprachanalysator
US4773009A (en) * 1986-06-06 1988-09-20 Houghton Mifflin Company Method and apparatus for text analysis
EP0327266A2 (fr) * 1988-02-05 1989-08-09 AT&T Corp. Méthode pour la détermination des élements de langage et utilisation

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5860064A (en) * 1993-05-13 1999-01-12 Apple Computer, Inc. Method and apparatus for automatic generation of vocal emotion in a synthetic text-to-speech system
US6330538B1 (en) * 1995-06-13 2001-12-11 British Telecommunications Public Limited Company Phonetic unit duration adjustment for text-to-speech system
JPH09230896A (ja) * 1996-02-28 1997-09-05 Sony Corp 音声合成装置
JPH1039895A (ja) * 1996-07-25 1998-02-13 Matsushita Electric Ind Co Ltd 音声合成方法および装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4773009A (en) * 1986-06-06 1988-09-20 Houghton Mifflin Company Method and apparatus for text analysis
DE3733674A1 (de) * 1986-10-03 1988-04-21 Ricoh Kk Sprachanalysator
EP0327266A2 (fr) * 1988-02-05 1989-08-09 AT&T Corp. Méthode pour la détermination des élements de langage et utilisation

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
EDGINGTON M ET AL: "OVERVIEW OF CURRENT TEXT-TO-SPEECH TECHNIQUES: PART II - PROSODY AND SPEECH GENERATION" BT TECHNOLOGY JOURNAL, GB, BT LABORATORIES, Vol. 14, No. 1, 1996, pages 84-99, XP000554641, ISSN: 1358-3948 *
PALMER D D ET AL: "Adaptive multilingual sentence boundary disambiguation" COMPUTATIONAL LINGUISTICS, June 1997, MIT PRESS FOR ASSOC. COMPUT. LINGUISTICS, USA, Vol. 23, No. 2, pages 241-267, XP002164114, ISSN: 0891-2017 *
SPROAT R W ET AL: "TEXT-TO-SPEECH SYNTHESIS" AT & T TECHNICAL JOURNAL, US, AMERICAN TELEPHONE AND TELEGRAPH CO. NEW YORK, Vol. 74, No. 2, 1 March 1995 (1995-03-01), pages 35-44, XP000495044, ISSN: 8756-2324 *
YUKIKO YAMAGUCHI ET AL: "A NEURAL NETWORK APPROACH TO MULTI-LANGUAGE TEXT-TO-SPEECH SYSTEM" PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING (ICSLP), TOKYO, JP, 18 November 1990 (1990-11-18), pages 325-328, XP000503375 *

Also Published As

Publication number Publication date
DE19942171A1 (de) 2001-03-15
WO2001018788A3 (fr) 2001-09-07

Similar Documents

Publication Publication Date Title
DE69908047T2 (de) Verfahren und System zur automatischen Bestimmung von phonetischen Transkriptionen in Verbindung mit buchstabierten Wörtern
DE69519328T2 (de) Verfahren und Anordnung für die Umwandlung von Sprache in Text
DE60203705T2 (de) Umschreibung und anzeige eines eingegebenen sprachsignals
DE60020434T2 (de) Erzeugung und Synthese von Prosodie-Mustern
DE69618503T2 (de) Spracherkennung für Tonsprachen
DE69923379T2 (de) Nicht-interaktive Registrierung zur Spracherkennung
DE69821673T2 (de) Verfahren und Vorrichtung zum Editieren synthetischer Sprachnachrichten, sowie Speichermittel mit dem Verfahren
DE60200857T2 (de) Erzeugung einer künstlichen Sprache
DE69413052T2 (de) Sprachsynthese
DE69712216T2 (de) Verfahren und gerät zum übersetzen von einer sparche in eine andere
DE3337353C2 (de) Sprachanalysator auf der Grundlage eines verborgenen Markov-Modells
DE60216069T2 (de) Sprache-zu-sprache erzeugungssystem und verfahren
DE69937176T2 (de) Segmentierungsverfahren zur Erweiterung des aktiven Vokabulars von Spracherkennern
DE3783154T2 (de) Spracherkennungssystem.
DE69831991T2 (de) Verfahren und Vorrichtung zur Sprachdetektion
DE3876207T2 (de) Spracherkennungssystem unter verwendung von markov-modellen.
DE69828141T2 (de) Verfahren und Vorrichtung zur Spracherkennung
DE3236832C2 (de) Verfahren und Gerät zur Sprachanalyse
DE19825205C2 (de) Verfahren, Vorrichtung und Erzeugnis zum Generieren von postlexikalischen Aussprachen aus lexikalischen Aussprachen mit einem neuronalen Netz
EP1273003B1 (fr) Procede et dispositif de determination de marquages prosodiques
EP1217610A1 (fr) Méthode et système pour la reconnaissance vocale multilingue
EP0994461A2 (fr) Procédé de reconnaissance automatique d'une expression vocale épellée
DE2212472A1 (de) Verfahren und Anordnung zur Sprachsynthese gedruckter Nachrichtentexte
DE10306599B4 (de) Benutzeroberfläche, System und Verfahren zur automatischen Benennung von phonischen Symbolen für Sprachsignale zum Korrigieren von Aussprache
EP1214703B1 (fr) Procede d'apprentissage des graphemes d'apres des regles de phonemes pour la synthese vocale

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): JP US

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: A3

Designated state(s): JP US

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP