JPH11344990A

JPH11344990A - Method and device utilizing decision trees generating plural pronunciations with respect to spelled word and evaluating the same

Info

Publication number: JPH11344990A
Application number: JP11121710A
Authority: JP
Inventors: Roland Kuhn; ローランド・クーン; Jean-Claude Junqua; ジャン−クロード・ジュンカ; Matteo Contolini; マッテオ・コントリーニ
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1998-04-29
Filing date: 1999-04-28
Publication date: 1999-12-14
Anticipated expiration: 2019-04-28
Also published as: EP0953970A2; ATE261171T1; KR19990083555A; DE69915162D1; TW422967B; JP3481497B2; EP0953970A3; CN1118770C; EP0953970B1; KR100509797B1; CN1233803A

Abstract

PROBLEM TO BE SOLVED: To automatically generate the pronunciation of a spelled word by generating a second set of scored phoneme sequences for an input sequence of letters. SOLUTION: In the second stage of a pronunciation system, a mixed-tree score estimator 20 assesses the viability of each pronunciation in a list 18. The score estimator works by sequentially examining each letter in the input sequence together with the phoneme assigned to that letter by a sequence generator 16. The estimator 20 then rescores each pronunciation in the list 18 on the basis of the mixed-tree questions, using the probability data in the leaf nodes of the mixed trees. A selector module 24 may access a list 22 to retrieve one or more pronunciations. Typically, the selector 24 retrieves the pronunciation with the highest score as an output pronunciation 26.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

FIELD OF THE INVENTION

The present invention relates generally to language processing. More particularly, it relates to a system for generating pronunciations of spelled words. The invention can be used in a variety of applications, including speech recognition, speech synthesis, and lexicography.

[0002]

BACKGROUND OF THE INVENTION

Spelled words that need accompanying pronunciations arise in many places in language processing. In speech recognition, every term in the dictionary must be phonetically transcribed before use so that the recognizer can be trained. Traditionally, these phonetic transcriptions have been produced by hand by lexicographers skilled in the fine points of pronunciation in the particular language of interest. Producing a good phonetic transcription for every word in a dictionary is time consuming and demands considerable skill. A reliable system that could transcribe words phonetically from their spelling would make much of this labor and specialized expertise unnecessary. Such a system could also extend current recognition systems to recognize words not found in existing dictionaries, such as geographic locations and surnames.

[0003] Spelled words also appear frequently in speech synthesis. Today's speech synthesizers convert text to speech by retrieving digitally sampled phonemes from a dictionary and concatenating them to form sentences.

[0004] As these examples show, both speech recognition and speech synthesis benefit from the ability to generate accurate pronunciations from spelled words. The need for this technology is not, however, limited to language processing. Lexicographers have by now completed large and accurate pronunciation dictionaries for many of the major world languages, yet hundreds of regional languages still lack good phonetic transcriptions. Because producing a good transcription has so far been largely manual work, it may take years before some regional languages are transcribed at all. The transcription process could be greatly accelerated by a good computer-based technique for assessing transcription accuracy. Such a scoring system would use an existing body of transcriptions to identify entries in a transcription prototype whose pronunciations are doubtful, which would greatly speed up the production of high-quality transcriptions.

[0005] Until now, most attempts at spelling-to-pronunciation transcription have relied solely on the letters themselves, and these techniques have many problems. For example, a letter-only pronunciation generator has great difficulty pronouncing the word "Bible" properly. Working from the letter sequence alone, such a system tends to pronounce the word "Bib-l", much as many first graders learning to read would. The weakness of conventional systems lies in the inherent ambiguity imposed by the pronunciation rules of many languages. English, for example, has hundreds of different pronunciation rules, which makes a word-by-word approach to the problem difficult and computationally expensive.

[0006]

SUMMARY OF THE INVENTION

The present invention approaches the problem from a different angle. It uses specially constructed mixed decision trees that encompass both letter-sequence and phoneme-sequence decision-making rules. Specifically, a mixed decision tree contains a series of yes-no questions located at its internal nodes. Some of these questions concern a letter and its neighboring letters in the spelled-word sequence; others concern a phoneme and its neighboring phonemes in the word sequence. The internal nodes ultimately lead to leaf nodes containing probability data that indicates which phonetic pronunciations of a given letter are most likely to be correct when pronouncing the word defined by the letter sequence.

[0007] The pronunciation generator of the invention uses these mixed decision trees to score candidate pronunciations, so that the most probable candidate can be selected as the best pronunciation for a given spelled word. The best pronunciation is preferably generated by a two-stage process: in the first stage, letter-only trees generate a plurality of candidate pronunciations; in the second stage, the candidates are scored using mixed decision trees, and the best candidate is selected.

[0008] Although mixed decision trees are used to advantage in the two-stage pronunciation generator, they are also useful for problems that do not require the letter-only first stage. For example, mixed decision trees can be used to score pronunciations that linguists have produced by manual techniques.

[0009] For a more complete understanding of the invention, its objects, and its advantages, refer to the following specification and the accompanying drawings.

[0010]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

To illustrate the principles of the invention, the exemplary embodiment of FIG. 1 shows a spelled-letter-to-pronunciation generator. As explained more fully below, the mixed decision trees of the invention can be used in a variety of applications besides the pronunciation generator shown here. The pronunciation generator was chosen for illustration because it highlights many of the features and benefits of the mixed decision tree structure.

[0011] The pronunciation generator uses two stages. The first stage uses a set of letter-only decision trees 10, and the second stage uses a set of mixed decision trees 12. An input sequence 14, such as the letter sequence B-I-B-L-E, is supplied to a dynamic programming phoneme sequence generator 16. Using the letter-only trees 10, the sequence generator produces a pronunciation list 18 of candidate pronunciations for the spelled-word input sequence.

[0012] The sequence generator examines each letter of the sequence in turn, applies the decision tree associated with that letter, and selects a phoneme pronunciation for the letter based on the probability data contained in the letter-only tree.

[0013] The set of letter-only decision trees preferably includes one decision tree for each letter of the alphabet. FIG. 2 shows an example of a letter-only decision tree for the letter E. The decision tree comprises a plurality of internal nodes (shown as ovals in the figure) and a plurality of leaf nodes (shown as rectangles). Each internal node holds a yes-no question, that is, a question that can be answered either yes or no. In a letter-only tree, these questions are directed to the given letter (here, the letter E) and its neighboring letters in the input sequence. In FIG. 2, each internal node branches left or right depending on whether the answer to the associated question is yes or no.

[0014] The following abbreviations are used in FIG. 2. Numbers in questions, such as "+1" or "-1", refer to positions in the spelling relative to the current letter. For example, "+1L=='R'?" means "Is the letter after the current letter (here, the letter E) an R?" The abbreviations CONS and VOW denote classes of letters, namely consonants and vowels. The absence of a neighboring letter, or null letter, is represented by the symbol "-", which is used as a filler or placeholder when aligning letters with their corresponding phoneme pronunciations. The symbol # denotes a word boundary.

[0015] The leaf nodes hold probability data that associates each possible phoneme pronunciation with a numeric value representing the probability that the phoneme is the correct pronunciation of the given letter. For example, the notation "iy=0.51" means that the probability of the phoneme "iy" in this leaf is 0.51. The null phoneme, i.e. silence, is represented by the symbol "-".
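The tree lookup described for FIG. 2 can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the `Node` and `Leaf` classes, the `lookup` and `letter_at` helpers, and the toy tree for the letter E (its questions and all of its probabilities other than the 0.51 borrowed from the example above) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    # phoneme -> probability; "-" stands for the null (silent) phoneme
    probs: Dict[str, float]


@dataclass
class Node:
    # yes-no question about the word, asked at a given letter position
    question: Callable[[str, int], bool]
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def letter_at(word: str, pos: int) -> str:
    """Return the letter at pos, or '#' beyond the word boundary."""
    return word[pos] if 0 <= pos < len(word) else "#"


def lookup(tree: Union[Node, Leaf], word: str, pos: int) -> Dict[str, float]:
    """Walk from the root to a leaf, answering each question for word[pos]."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(word, pos) else node.no
    return node.probs


# Hypothetical letter-only tree for E (questions and numbers are made up).
tree_e = Node(
    question=lambda w, i: letter_at(w, i + 1) == "R",      # "+1L=='R'?"
    yes=Leaf({"er": 0.7, "eh": 0.3}),
    no=Node(
        question=lambda w, i: letter_at(w, i + 1) == "#",  # "+1L=='#'?"
        yes=Leaf({"-": 0.8, "iy": 0.2}),
        no=Leaf({"iy": 0.51, "eh": 0.49}),
    ),
)

# The final E of BIBLE is followed by a word boundary.
print(lookup(tree_e, "BIBLE", 4))
```

With this toy tree, the final E of BIBLE reaches the leaf in which the null phoneme is most probable.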

[0016] The sequence generator 16 (FIG. 1) thus uses the letter-only decision trees 10 to construct one or more pronunciation hypotheses, which are stored in list 18. Each pronunciation is preferably associated with a numerical score obtained by combining the probability scores of the individual phonemes selected using the decision trees 10. Word pronunciations may be scored by constructing a matrix of possible combinations and then using dynamic programming to select the n best candidates. Alternatively, the n best candidates may be selected by first identifying the most probable word candidate and then generating additional candidates through iterative substitution, as follows.

[0017] The pronunciation with the highest probability score is selected first, by multiplying together the scores of the highest-scoring phonemes (identified by examining the leaf nodes) and using this selection as the most probable candidate, or first-best word candidate. Additional (n-best) candidates are then selected by examining the phoneme data in the leaf nodes again to identify the phoneme, not previously selected, whose score differs least from that of an initially selected phoneme. This minimally different phoneme is substituted for the initially selected one, thereby generating the second-best word candidate. The process may be repeated until the desired number of n-best candidates has been selected. List 18 is sorted in descending score order, so that the pronunciation judged best by the letter-only analysis appears first in the list.
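The substitution procedure can be sketched as follows. The text leaves some details open, so this sketch makes assumptions: each additional candidate differs from the first-best candidate by a single substituted phoneme, alternatives are ranked by how little their probability differs from the phoneme they replace, and the per-letter leaf probabilities in `leaf_probs` are invented for illustration. The function name `n_best` is likewise hypothetical.

```python
from math import prod


def n_best(leaf_probs, n):
    """Return up to n (phonemes, score) candidates, best first.

    leaf_probs: one dict per letter, mapping phoneme -> probability,
    as read from the leaf reached in that letter's tree.
    """
    # First-best candidate: the top phoneme at every letter position.
    best = [max(p, key=p.get) for p in leaf_probs]

    def score(phones):
        return prod(p[ph] for p, ph in zip(leaf_probs, phones))

    # Every unused phoneme, ranked by its gap to the initially chosen one.
    alternatives = sorted(
        (p[best[i]] - prob, i, ph)
        for i, p in enumerate(leaf_probs)
        for ph, prob in p.items()
        if ph != best[i]
    )

    candidates = [(best, score(best))]
    for _, i, ph in alternatives[: n - 1]:
        cand = best[:i] + [ph] + best[i + 1:]
        candidates.append((cand, score(cand)))

    # Keep the list in descending score order, like list 18.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates


# Made-up leaf data for the letters B, I, B, L, E.
leaf_probs = [
    {"b": 1.0},
    {"ih": 0.6, "ay": 0.4},
    {"b": 1.0},
    {"l": 0.9, "-": 0.1},
    {"-": 0.7, "iy": 0.3},
]

for phones, s in n_best(leaf_probs, 3):
    print("-".join(phones), round(s, 3))
```

With these invented numbers the letter-only stage ranks b-ih-b-l first, which mirrors the "Bib-l" failure that motivates the second stage.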

[0018] As noted above, a letter-only analysis frequently produces poor results, because when it processes a given letter it has no way of knowing what phonemes subsequent letters will generate. A letter-only analysis can therefore produce a high-scoring pronunciation that would never occur in natural speech. For example, the proper name Achilles would likely result in the pronunciation ah-k-ih-l-l-iy-z, which sounds both l's. In natural speech the second l is actually silent: ah-k-ih-l-iy-z. The sequence generator using letter-only trees has no mechanism for screening out word pronunciations that would never occur in natural speech.

[0019] The second stage of the pronunciation system addresses this problem. A mixed-tree score estimator 20 uses the set of mixed decision trees 12 to assess the viability of each pronunciation in list 18. The score estimator works by sequentially examining each letter in the input sequence together with the phoneme assigned to that letter by the sequence generator 16.

[0020] Like the set of letter-only trees, the set of mixed trees has one mixed tree for each letter of the alphabet. An exemplary mixed tree is shown in FIG. 3. Like a letter-only tree, a mixed tree has internal nodes and leaf nodes; in FIG. 3 the internal nodes are shown as ovals and the leaf nodes as rectangles. Each internal node holds a yes-no question, and each leaf node holds probability data. The structure of a mixed tree resembles that of a letter-only tree, with one important difference: the internal nodes of a mixed tree can contain two different classes of questions. An internal node can contain a question about a given letter and its neighboring letters in the sequence, or it can contain a question about the phoneme associated with that letter and the neighboring phonemes corresponding to the sequence. The decision tree is thus mixed in that it contains questions of both classes.

[0021] The abbreviations used in FIG. 3 are the same as those in FIG. 2, with some additions. The symbol L denotes a question about a letter and its neighboring letters; the symbol P denotes a question about a phoneme and its neighboring phonemes. For example, the question "+1L=='D'?" means "Is the letter in the +1 position a D?" The abbreviations CONS and SYL are phoneme classes, namely consonant and syllabic. For example, the question "+1P==CONS?" means "Is the phoneme in the +1 position a consonant?" As in the letter-only trees, the numbers in the leaf nodes give phoneme probabilities.

[0022] The mixed-tree score estimator rescores each pronunciation in list 18 on the basis of the mixed-tree questions, using the probability data in the leaf nodes of the mixed trees. The rescored pronunciations may be stored, together with their respective scores, as list 22. List 22 may be sorted in descending order, so that the first pronunciation listed is the one with the highest score.
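The rescoring step can be sketched in Python using the Achilles example. Everything concrete here is an assumption made for illustration: the tuple encoding of a tree, the single hypothetical mixed tree for the letter L with the phoneme question "-1P=='l'?", its probabilities, the per-letter alignment of the two candidates, and the choice to treat letters without a tree as neutral (probability 1.0).

```python
from math import prod


def prev_phoneme(phones, i):
    """Phoneme assigned to the previous letter, or '#' at the word start."""
    return phones[i - 1] if i > 0 else "#"


# A tree is either a dict leaf (phoneme -> probability) or a tuple
# (question, yes_subtree, no_subtree). Only L has a tree in this toy set.
MIXED_TREES = {
    "L": (
        lambda letters, phones, i: prev_phoneme(phones, i) == "l",  # "-1P=='l'?"
        {"-": 0.9, "l": 0.1},    # previous phoneme is l: this L is usually silent
        {"l": 0.95, "-": 0.05},  # otherwise this L is usually sounded
    ),
}


def leaf_for(tree, letters, phones, i):
    """Descend the tree for position i until a leaf dict is reached."""
    while not isinstance(tree, dict):
        question, yes, no = tree
        tree = yes if question(letters, phones, i) else no
    return tree


def rescore(letters, phones):
    """Multiply the leaf probabilities of the phonemes assigned to each letter."""
    probs = []
    for i, (letter, ph) in enumerate(zip(letters, phones)):
        tree = MIXED_TREES.get(letter)
        if tree is None:
            probs.append(1.0)  # toy simplification: no tree, no opinion
        else:
            probs.append(leaf_for(tree, letters, phones, i).get(ph, 0.0))
    return prod(probs)


letters = "ACHILLES"
list18 = [
    ["ah", "k", "-", "ih", "l", "l", "iy", "z"],  # sounds both l's
    ["ah", "k", "-", "ih", "l", "-", "iy", "z"],  # second l silent
]
list22 = sorted(list18, key=lambda p: rescore(letters, p), reverse=True)
print(list22[0])
```

Because the phoneme question can see that the previous letter already produced an l, the candidate with the silent second l moves to the top of the rescored list, which a letter-only question could not achieve.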

[0023] In many instances the pronunciation occupying the highest-score position in list 22 differs from the one occupying the highest-score position in list 18. This happens because the mixed-tree score estimator, using the mixed trees 12, screens out pronunciations that do not contain self-consistent phoneme sequences or that otherwise would not occur in natural speech.

[0024] A selector module 24 may access list 22 to retrieve one or more of the pronunciations in the list. Typically, the selector 24 retrieves the pronunciation with the highest score and provides it as the output pronunciation 26.

[0025] As noted above, the pronunciation generator of FIG. 1 represents only one possible embodiment that uses the mixed trees of the invention. In an alternative embodiment, the dynamic programming phoneme sequence generator 16 and its associated letter-only decision trees 10 may be dispensed with in applications where one or more pronunciations for a given spelled-word sequence are already available, for instance when a previously developed pronunciation dictionary exists. In such a case, the mixed-tree score estimator 20, with its associated mixed trees 12, may be used to score the entries in the pronunciation dictionary and identify those with low scores, thereby flagging suspicious pronunciations in the dictionary under construction. Such a system may, for example, be incorporated into a tool for lexicographers.

[0026] The output pronunciation, or the pronunciations selected from list 22, can be used to build pronunciation dictionaries for both speech recognition and speech synthesis applications. In speech recognition, the pronunciation dictionary can be used during the recognizer training phase by supplying pronunciations for words not already found in the recognizer lexicon. In synthesis, the pronunciation dictionary can be used to generate phoneme sounds for concatenated playback. The system may be used, for example, to enhance the features of an e-mail reader or other text-to-speech application.

[0027] The mixed-tree scoring system of the invention can be used in a variety of applications where a single possible pronunciation, or a list of them, is desired. For example, in a dynamic online dictionary, the user types a word and the system provides a list of possible pronunciations in order of probability. The scoring system can also serve as a user feedback tool in language learning systems. A language learning system with speech recognition capability can display a spelled word and analyze the speaker's attempts at pronouncing that word in the new language; the system then tells the user how probable or improbable his or her pronunciation is for that word.

[0028] <<Generating the Decision Trees>> The system for generating the letter-only trees and the mixed trees is shown in FIG. 4. At the heart of the decision tree generation system is a tree generator 40. The tree generator employs a tree-growing algorithm that operates on a predetermined set of training data 42 supplied by the system developer. Typically, the training data comprises aligned letter-phoneme pairs corresponding to known, correct pronunciations of words. The training data may be generated through the alignment process shown in FIG. 5, which illustrates the alignment performed on the example word BIBLE. The spelled word 44 and its pronunciation 46 are supplied to a dynamic programming alignment module 48, which aligns the letters of the spelled word with the phonemes of the corresponding pronunciation. In the example shown, the final E is silent. The letter-phoneme pairs are then stored as data 42.
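The text does not say how the dynamic programming alignment module scores an alignment, so the following Python sketch fills that gap with assumptions: each letter receives either one phoneme or the null phoneme "-", a letter-phoneme pair listed in the hypothetical `COMPAT` set scores +2, any other pair scores -1, and a silent letter scores 0. The phoneme string b-ay-b-l for BIBLE is also illustrative.

```python
def align(letters, phones, compat, null="-"):
    """Align each letter with one phoneme or the null phoneme by dynamic programming.

    best[i][j] is the best score for aligning the first i letters with the
    first j phonemes; back[i][j] records which move produced it.
    """
    n, m = len(letters), len(phones)
    neg = float("-inf")
    best = [[neg] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(0, min(i, m) + 1):
            # Move 1: letter i-1 is silent (takes the null phoneme).
            if best[i - 1][j] > neg:
                best[i][j], back[i][j] = best[i - 1][j], "null"
            # Move 2: letter i-1 takes phoneme j-1.
            if j > 0 and best[i - 1][j - 1] > neg:
                pair_ok = (letters[i - 1], phones[j - 1]) in compat
                s = best[i - 1][j - 1] + (2.0 if pair_ok else -1.0)
                if s > best[i][j]:
                    best[i][j], back[i][j] = s, "pair"

    if best[n][m] == neg:
        raise ValueError("more phonemes than letters; cannot align one-to-one")

    # Trace back from the final cell to recover the letter-phoneme pairs.
    pairs, i, j = [], n, m
    while i > 0:
        if back[i][j] == "pair":
            pairs.append((letters[i - 1], phones[j - 1]))
            j -= 1
        else:
            pairs.append((letters[i - 1], null))
        i -= 1
    return pairs[::-1]


# Hypothetical compatibility set: letter-phoneme pairs considered plausible.
COMPAT = {("B", "b"), ("I", "ay"), ("L", "l")}

print(align("BIBLE", ["b", "ay", "b", "l"], COMPAT))
```

Under these assumed scores the final E is paired with the null phoneme, matching the silent E described for FIG. 5.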

[0029] Returning to FIG. 4, the tree generator works in conjunction with three additional components: a set 50 of possible yes-no questions; a set 52 of rules for selecting the best question for each node or for deciding whether the node should be a leaf node; and a pruning method 53 for preventing overtraining.

[0030] The set of possible yes-no questions may include letter questions 54 and phoneme questions 56, depending on whether a letter-only tree or a mixed tree is being grown. When growing a letter-only tree, only the letter questions 54 are used; when growing a mixed tree, both the letter questions 54 and the phoneme questions 56 are used.

[0031] In the presently preferred embodiment, the rules for selecting the best question to place at each node are designed to follow the Gini criterion. Other splitting criteria can be used instead. For more information on splitting criteria, see Breiman, Friedman et al., "Classification and Regression Trees." Essentially, the Gini criterion is used to select a question from the set 50 of possible yes-no questions and to apply a stop rule that decides when a node is a leaf node. The Gini criterion uses a concept called "impurity". Impurity is always a non-negative number. It is applied to a node such that a node containing equal proportions of all possible categories has maximum impurity, and a node containing only one of the possible categories has zero impurity (the minimum possible value). Several functions satisfy these conditions; they depend on the counts of each category within a node. The Gini impurity is defined as follows. If C is the set of classes to which data items can belong, and T is the current tree node, let f(1|T) be the proportion of training data items in node T that belong to class 1, f(2|T) the proportion that belong to class 2, and so on. Then

(Equation 1)
$$i(T) = 1 - \sum_{j \in C} f(j \mid T)^2$$

[0032] To illustrate by example, assume the system is growing the tree for the letter "E". At a given node T of that tree, the system has, say, 10 examples of how "E" is pronounced in words. In 5 of these examples, "E" is pronounced "iy" (the sound of "ee" in "cheese"); in 3 of the examples, "E" is pronounced "eh" (the sound of "e" in "bed"); and in the remaining 2 examples, "E" is "-" (i.e., silent, as the "e" in "maple").

[0033] Assume the system is considering two possible yes-no questions, Q₁ and Q₂, that can be applied to the 10 examples. The items that answer "yes" to Q₁ comprise four examples of "iy" and one example of "-" (the other five items answer "no" to Q₁). The items that answer "yes" to Q₂ comprise three examples of "iy" and three examples of "eh" (the other four items answer "no" to Q₂). FIG. 7 compares these two cases diagrammatically.

[0034] The Gini criterion determines which question, Q₁ or Q₂, the system should choose for this node. The criterion for choosing the right question is to find the question that yields the greatest drop in impurity in going from the parent node to its child nodes. This impurity drop ΔI is defined as

$$\Delta I = i(T) - P_{\text{yes}} \, i(\text{yes}) - P_{\text{no}} \, i(\text{no})$$

where P_yes is the proportion of items going to the "yes" child node and P_no is the proportion going to the "no" child node.

[0035] Applying the Gini criterion to the above example:

(Equation 2)
$$i(T) = 1 - \sum_{j \in C} f(j \mid T)^2 = 1 - 0.5^2 - 0.3^2 - 0.2^2 = 0.62$$

To find ΔI for Q₁:

(Equation 3)
$$i(\text{yes}, Q_1) = 1 - 0.8^2 - 0.2^2 = 0.32$$
$$i(\text{no}, Q_1) = 1 - 0.2^2 - 0.6^2 - 0.2^2 = 0.56$$

so that

(Equation 4)
$$\Delta I(Q_1) = 0.62 - 0.5 \times 0.32 - 0.5 \times 0.56 = 0.18$$

For Q₂:

(Equation 5)
$$i(\text{yes}, Q_2) = 1 - 0.5^2 - 0.5^2 = 0.5$$
$$i(\text{no}, Q_2) = 1 - 0.5^2 - 0.5^2 = 0.5$$

so that

(Equation 6)
$$\Delta I(Q_2) = 0.62 - 0.6 \times 0.5 - 0.4 \times 0.5 = 0.12$$

[0036] In this case, Q₁ gives the greatest drop in impurity, so Q₁ is chosen rather than Q₂.
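The worked example can be checked numerically. The following Python sketch computes the Gini impurity and the impurity drop for the ten "E" examples and the two candidate questions; the function names `gini` and `impurity_drop` are illustrative, and the label lists simply restate the counts given in the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_drop(parent, yes, no):
    """Drop in impurity when `parent` is split into `yes` and `no` children."""
    p_yes = len(yes) / len(parent)
    p_no = len(no) / len(parent)
    return gini(parent) - p_yes * gini(yes) - p_no * gini(no)


# Node T: 5 x "iy", 3 x "eh", 2 x "-" (silent).
parent = ["iy"] * 5 + ["eh"] * 3 + ["-"] * 2

# Q1: "yes" gets 4 x iy and 1 x "-"; the other five items go to "no".
q1_yes = ["iy"] * 4 + ["-"]
q1_no = ["iy"] + ["eh"] * 3 + ["-"]

# Q2: "yes" gets 3 x iy and 3 x eh; the other four items go to "no".
q2_yes = ["iy"] * 3 + ["eh"] * 3
q2_no = ["iy"] * 2 + ["-"] * 2

print(round(gini(parent), 2))                          # 0.62
print(round(impurity_drop(parent, q1_yes, q1_no), 2))  # 0.18
print(round(impurity_drop(parent, q2_yes, q2_no), 2))  # 0.12
```

Q₁ produces the larger drop (0.18 against 0.12), so it is the question selected for the node.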

[0037] The rule set 52 declares that the best question for a node is the one that brings about the greatest drop in impurity in going from the parent node to its children.

[0038] The tree generator applies the rules 52 to grow a decision tree of yes-no questions selected from the set 50. The generator continues growing the tree until a tree of optimal size has been grown. The rules 52 include a set of stopping rules that terminate growth once the tree reaches a predetermined size. In the preferred embodiment, the tree is grown to a size larger than ultimately desired, and the pruning method 53 is then used to cut the tree back to the desired size. The pruning method implements the Breiman technique described in the reference cited above.

【００３９】 The tree generator thus generates either a set of character-only trees, shown generally at 60, or a set of mixed trees, shown generally at 70, depending on whether the set 50 of possible yes-no questions contains only character questions or combines them with phoneme questions. The training data corpus 42 comprises character-phoneme pairs, as described above. When character-only trees are grown, only the character portions of these pairs are used to populate the internal nodes. Conversely, when mixed trees are grown, both the character and the phoneme components of the training data pairs are used to populate the internal nodes. In both cases, the phoneme portions of the pairs are used to populate the leaf nodes. The probability data associated with the phoneme data in the leaf nodes is generated by counting the number of times a given phoneme is aligned with a given character across the entire training data corpus.
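The counting step that produces the leaf-node probability data can be sketched as follows. This is a simplified illustration under stated assumptions: the training pairs are invented, the function name is hypothetical, and all examples for a character are pooled into one distribution, whereas in a grown tree only the examples that reach a given leaf would be counted there.

```python
from collections import Counter, defaultdict


def leaf_probabilities(aligned_pairs):
    """Estimate P(phoneme | character) by relative frequency over aligned pairs."""
    counts = defaultdict(Counter)
    for character, phoneme in aligned_pairs:
        counts[character][phoneme] += 1
    return {
        character: {ph: n / sum(c.values()) for ph, n in c.items()}
        for character, c in counts.items()
    }


# Hypothetical aligned training pairs; "-" stands for the null phoneme.
pairs = [("E", "iy"), ("E", "eh"), ("E", "iy"), ("E", "-"),
         ("L", "l"), ("L", "-")]
probs = leaf_probabilities(pairs)
print(probs["E"])
```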

【００４０】 The character-to-pronunciation decision trees generated by the above method can be stored in memory for use in a variety of language-processing applications. These applications are many and varied; a few examples follow to better highlight the capabilities and advantages of these trees.

【００４１】 FIG. 6 illustrates the use of both character-only trees and mixed trees to generate a pronunciation from the character sequence of a spelled word. Although the illustrated embodiment uses both the character-only tree component and the mixed tree component, other applications may use one component and not the other. In the illustrated embodiment, the set of character-only trees is stored in memory at 80 and the mixed trees are stored in memory at 82. In many applications there is one tree for each letter of the alphabet. Dynamic programming sequence generator 84 operates on input sequence 86 to generate a pronunciation at 88 based on the character-only trees 80. In essence, each character of the input sequence is considered individually, and the applicable character-only tree is used to select the most probable pronunciation for that character. As explained earlier, the character-only trees ask a series of yes-no questions about the given character and its neighboring characters in the sequence. After all characters in the sequence have been considered, the resulting pronunciation is generated by concatenating the phonemes selected by the sequence generator.
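The per-character tree walk described above can be sketched as follows. The `Node` structure, the toy trees, and the probabilities are all hypothetical and serve only to show the mechanics: answer yes-no questions until a leaf is reached, take the most probable phoneme there, and concatenate the results. The dynamic programming that produces several ranked pronunciations is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

NULL = "-"  # null phoneme (silence)


@dataclass
class Node:
    # Internal nodes carry a yes-no question; leaf nodes carry probabilities.
    question: Optional[Callable[[str, int], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    leaf: Optional[Dict[str, float]] = None


def best_phoneme(tree, word, i):
    """Walk one character's tree and return its most probable phoneme."""
    node = tree
    while node.leaf is None:
        node = node.yes if node.question(word, i) else node.no
    return max(node.leaf, key=node.leaf.get)


def pronounce(trees, word):
    """Concatenate the best phoneme for each character, dropping null phonemes."""
    chosen = [best_phoneme(trees[ch], word, i) for i, ch in enumerate(word)]
    return [p for p in chosen if p != NULL]


# Toy trees, one per character. The single question on "e" asks whether the
# next position is the word boundary.
TREES = {
    "b": Node(leaf={"b": 1.0}),
    "d": Node(leaf={"d": 1.0}),
    "e": Node(
        question=lambda w, i: i + 1 == len(w),
        yes=Node(leaf={"iy": 0.9, NULL: 0.1}),
        no=Node(leaf={"eh": 0.8, "iy": 0.2}),
    ),
}
print(pronounce(TREES, "be"), pronounce(TREES, "bed"))
```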

【００４２】 The mixed tree set 82 can be used to improve the pronunciation. Whereas the character-only trees ask questions only about characters, the mixed trees can ask questions about characters and also questions about phonemes. Scorer 90 may receive phoneme information from the output of sequence generator 84. In this regard, sequence generator 84 can use the character-only trees 80 to generate a plurality of different pronunciations and sort them by their respective probability scores. This sorted list of pronunciations may be stored at 92 for access by scorer 90.

【００４３】 Scorer 90 receives as input the same input sequence that is supplied to sequence generator 84. Scorer 90 applies the questions of the mixed trees 82 to the character sequence, using data from store 92 to answer the phoneme questions. The resulting output at 94 is typically a better pronunciation than the output provided at 88, because the mixed trees tend to filter out pronunciations that do not occur in natural speech. For example, the proper name Achilles tends to yield a pronunciation that phoneticizes both l's: ah-k-ih-l-l-iy-z. In natural speech the second l is not actually sounded: ah-k-ih-l-iy-z.

【００４４】 Scorer 90 may also produce a sorted list of n possible pronunciations at 96. The score associated with each pronunciation represents the composite of the individual probability scores assigned to each phoneme in the pronunciation. These scores can themselves be used in applications that need to identify dubious pronunciations. For example, phonetic transcriptions supplied by a team of lexicographers could be checked with the mixed trees to identify questionable pronunciations quickly.
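A minimal sketch of how composite scores might be used to flag dubious transcriptions. It assumes the composite is a plain product of the per-phoneme probabilities, which the text does not state explicitly at this point; the probabilities and the threshold are invented for illustration.

```python
import math


def composite_score(phoneme_probs):
    """Combine per-phoneme probabilities into one score by multiplication."""
    return math.prod(phoneme_probs)


def flag_dubious(scored, threshold):
    """Return the transcriptions whose composite score falls below the threshold."""
    return [pron for pron, probs in scored.items()
            if composite_score(probs) < threshold]


# Hypothetical per-phoneme probabilities for two transcriptions of "Achilles".
scored = {
    "ah-k-ih-l-l-iy-z": [0.9, 0.9, 0.8, 0.9, 0.05, 0.9, 0.9],
    "ah-k-ih-l-iy-z": [0.9, 0.9, 0.8, 0.9, 0.9, 0.9],
}
print(flag_dubious(scored, threshold=0.1))
```

Only the transcription that sounds both l's falls below the threshold and would be passed on for manual checking.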

【００４５】 << Character-to-Speech Pronunciation Generator >> To illustrate the principles of the invention, the exemplary embodiment of FIG. 8 shows a two-stage pronunciation generator that converts spelled characters to pronunciations. As explained more fully below, the mixed decision tree approach of the invention can be used in a variety of applications besides the pronunciation generator shown here. The two-stage pronunciation generator was chosen for illustration because it highlights many aspects and benefits of the mixed decision tree structure.

【００４６】 The two-stage pronunciation generator includes a first stage 116, which preferably uses a set of character-syntax-context-dialect decision trees 110, and a second stage 120, which uses a set of phoneme-mixed decision trees 112 that examine input sequence 114 at the phoneme level. The character-syntax-context-dialect decision trees examine questions involving a character in the spelled-word sequence and its neighboring characters (i.e., character-related questions). Other questions examined concern which words precede or follow a particular word (i.e., context-related questions). Still other questions concern the part of speech a word has within the sentence and the syntax of the other words in the sentence (i.e., syntax-related questions). Yet other questions concern which dialect is to be spoken. Preferably, the user selects the dialect to be spoken by means of dialect selection device 150.

【００４７】 An alternate embodiment of the invention uses character-related questions together with at least one of the language-level characteristics (i.e., syntax-related questions or context-related questions). For example, one embodiment uses a set of character-syntax decision trees for the first stage. Another embodiment uses a set of character-context-dialect decision trees that do not examine the syntax of the input sequence.

【００４８】 It should be understood that the invention is not limited to words occurring in a sentence, but also covers other linguistic constructs that exhibit syntax, such as sentence fragments or phrases.

【００４９】 An input sequence 114, such as the character sequence of a sentence, is supplied to text-based pronunciation generator 116. For example, input sequence 114 could be the following sentence: "Did you know who read the autobiography?"

【００５０】 Syntax data 115 is an input to text-based pronunciation generator 116. This input provides the information that text-based pronunciation generator 116 needs to route properly through the character-syntax-context-dialect decision trees 110. Syntax data 115 addresses what part of speech each word in input sequence 114 has. For example, the word "read" in the example input sequence above is tagged as a verb (rather than a noun or an adjective) by syntax tagger software module 129. Syntax tagger software technology is available from institutions such as the University of Pennsylvania, under the project "Xtag". In addition, the following reference discusses syntax tagger software technology: George Foster, "Statistical Lexical Disambiguation", Master's Thesis in Computer Science, McGill University, Montreal, Canada (November 11, 1991).

【００５１】 Text-based pronunciation generator 116 uses decision trees 110 to generate a pronunciation list 118 presenting possible pronunciation candidates for the spelled-word input sequence. Each pronunciation in list 118 (e.g., pronunciation A) preferably represents a pronunciation of input sequence 114, including how each word is to be stressed. In addition, in the preferred embodiment, the rate at which each word is spoken is determined.

【００５２】 Sentence rate calculator software module 152 is used by text-based pronunciation generator 116 to determine how quickly each word should be spoken. For example, sentence rate calculator 152 examines the context of the sentence to determine whether particular words in the sentence should be spoken faster or slower than normal. A sentence ending in an exclamation mark, for instance, yields rate data indicating that a predetermined number of words before the end of the sentence should have a shorter duration than normal, to better convey the impact of the exclamation.
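The exclamation-mark rule can be sketched as a function that returns one duration multiplier per word. The function name, the number of affected words, and the multiplier value are all hypothetical; the text states only that a predetermined number of words before the end of the sentence receive a shorter duration than normal.

```python
def rate_factors(sentence, n_last=3, fast=0.8):
    """Return one duration multiplier per word (1.0 means normal duration)."""
    words = sentence.split()
    factors = [1.0] * len(words)
    # A sentence ending in "!" shortens the last n_last words.
    if sentence.rstrip().endswith("!"):
        for i in range(max(0, len(words) - n_last), len(words)):
            factors[i] = fast
    return factors


print(rate_factors("Look out for that car!"))
print(rate_factors("Did you know?"))
```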

【００５３】 Text-based pronunciation generator 116 examines each character and word in the sequence in turn, applying the decision tree associated with that character or with the syntax (or context) of the word, and selecting a phoneme pronunciation for the character based on the probability data contained in the decision tree. The set of decision trees 110 preferably includes a decision tree for each letter of the alphabet and for the syntax of the language involved.

【００５４】 FIG. 9 shows an example of a character-syntax-context-dialect decision tree 140 applicable to the letter "E" in the word "READ". The decision tree comprises a plurality of internal nodes (shown as ovals in the figure) and a plurality of leaf nodes (shown as rectangles in the figure). Each internal node is populated with a yes-no question, that is, a question that can be answered either yes or no. In character-syntax-context-dialect decision tree 140, these questions are directed to a given character in the input sequence (in this case the letter "E") and its neighboring characters, to the syntax of the words in the sentence (e.g., noun, verb, etc.), or to the context and dialect of the sentence. In FIG. 9, each internal node branches left or right depending on whether the answer to the associated question is yes or no.

【００５５】 Preferably, the first internal node inquires about the dialect to be spoken. Internal node 138 represents such an inquiry. If the southern dialect is to be spoken, the data is routed through southern dialect decision tree 139, which ultimately produces, at its leaf nodes, phoneme values that are more characteristic of the southern dialect.

【００５６】 The abbreviations used in FIG. 9 are as follows. Numbers in questions, such as "+1" or "-1", refer to positions in the spelling relative to the current character. The symbol L indicates that the question concerns a character and its neighboring characters. For example, the question "-1L=='R' or 'L'?" asks whether the character before the current character (which is 'E') is 'R' or 'L'. The abbreviations 'CONS' and 'VOW' denote classes of characters, namely consonants and vowels. The symbol '#' indicates a word boundary. The term 'tag(i)' denotes a question about the syntax tag of the word at position i, where i=0 is the current word, i=-1 is the preceding word, i=+1 is the following word, and so on. Thus "tag(0)==PRES?" asks "Is the current word a present-tense verb?"
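The question notation of FIG. 9 can be made concrete with two small helpers. This is a sketch with hypothetical function names; it assumes upper-case spellings, treats only A, E, I, O, U as the 'VOW' class, and uses invented tag names other than PRES.

```python
VOWELS = set("AEIOU")


def letter_at(word, i, offset):
    """Character at a position relative to index i; '#' marks a word boundary."""
    j = i + offset
    return word[j] if 0 <= j < len(word) else "#"


def letter_question(word, i, offset, choices):
    """Answer a question such as -1L=='R' or 'L'? for the character at index i."""
    ch = letter_at(word, i, offset)
    if ch == "#":
        return "#" in choices
    for choice in choices:
        if choice == ch:
            return True
        if choice == "VOW" and ch in VOWELS:
            return True
        if choice == "CONS" and ch not in VOWELS:
            return True
    return False


def tag_question(tags, k, offset, wanted):
    """Answer a question such as tag(0)==PRES? for the word at index k."""
    j = k + offset
    return 0 <= j < len(tags) and tags[j] == wanted


# The current character is the 'E' of "READ" (index 1).
print(letter_question("READ", 1, -1, ["R", "L"]))   # -1L=='R' or 'L'?
print(letter_question("READ", 1, +1, ["VOW"]))      # +1L==VOW?
print(letter_question("READ", 3, +1, ["#"]))        # +1L=='#'? asked at 'D'
print(tag_question(["PRON", "PAST", "DET", "NOUN"], 1, 0, "PRES"))
```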

【００５７】 The leaf nodes are populated with probability data that associates possible phoneme pronunciations with numeric values representing the probability that the particular phoneme is the correct pronunciation of the given character. The null phoneme, i.e., silence, is represented by the symbol '-'.

【００５８】 For example, the "E" in the present-tense verbs "READ" and "LEAD" is assigned its correct pronunciation by the decision tree at leaf node 142 with a probability of 1.0. The "E" in the past tense of "read" (e.g., "Who read a book") is assigned the pronunciation "eh" at leaf node 144 with a probability of 0.9.

【００５９】 Decision trees 110 (of FIG. 8) preferably include context-related questions. For example, a context-related question at an internal node may examine whether the word "you" is preceded by the word "did". In such a context, the "y" in "you" is typically pronounced "ja" in colloquial speech.

【００６０】 The invention also generates prosody-indicative data to convey the stress, pitch, grave (lowered tone), or pause aspects of speaking a sentence. Syntax-related questions help determine how a phoneme is to be stressed, pitched, or lowered. For example, internal node 141 (of FIG. 9) inquires whether the first word of the sentence is an interrogative pronoun, such as "who" in the example sentence "who read a book?" Since the first word in this example is an interrogative pronoun, leaf node 144, with its phoneme stress, is selected. Leaf node 146 illustrates the other option, in which the phoneme is not stressed.

【００６１】 As another example, in an interrogative sentence, the phonemes of the last syllable of the last word in the sentence are given a pitch mark so as to convey the questioning aspect of the sentence more naturally. As yet another example, the invention can accommodate natural pauses in speaking a sentence. The invention captures such pause detail by asking questions about punctuation, such as commas and periods.

【００６２】 Text-based pronunciation generator 116 (FIG. 8) thus uses decision trees 110 to construct one or more pronunciation hypotheses, which are stored in list 118. Preferably, each pronunciation has associated with it a numerical score obtained by combining the probability scores of the individual phonemes selected using decision trees 110. Word pronunciations may be scored by constructing a matrix of possible combinations and then using dynamic programming to select the n best candidates.

【００６３】 Alternatively, the n best candidates may be selected using a substitution technique that first identifies the most probable word candidate and then generates additional candidates through iterative substitution, as follows. The pronunciation with the highest probability score is selected first, by multiplying the respective scores of the highest-scoring phonemes (identified by examining the leaf nodes) and using this selection as the most probable candidate, or first-best word candidate. Additional (n-best) candidates are then selected by examining the phoneme data in the leaf nodes again to identify the phoneme, not previously selected, that differs least from an initially selected phoneme. This minimally different phoneme is then substituted for the initially selected one, thereby generating the second-best word candidate. The above process may be repeated iteratively until the desired number of n-best candidates has been selected. List 118 may be sorted in descending score order, so that the pronunciation judged best by the character-only analysis appears first in the list.
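One possible reading of the substitution technique is sketched below. It is an interpretation rather than the patent's exact procedure: each additional candidate is produced by a single substitution into the first-best pronunciation, substitutions are ranked by the difference in leaf probability as the text describes, and the leaf distributions are invented.

```python
import math


def n_best_by_substitution(leaves, n):
    """leaves: one {phoneme: probability} dict per character, in spelling order."""
    # First-best candidate: the highest-scoring phoneme at every position.
    best = [max(leaf, key=leaf.get) for leaf in leaves]

    def score(phonemes):
        return math.prod(leaf[p] for leaf, p in zip(leaves, phonemes))

    candidates = [(list(best), score(best))]

    # Every single-phoneme substitution, ranked by how little probability
    # it gives up relative to the initially selected phoneme.
    swaps = sorted(
        (leaf[best[i]] - prob, i, phoneme)
        for i, leaf in enumerate(leaves)
        for phoneme, prob in leaf.items()
        if phoneme != best[i]
    )
    for _, i, phoneme in swaps[: n - 1]:
        alt = list(best)
        alt[i] = phoneme
        candidates.append((alt, score(alt)))

    # Descending score order, as for list 118.
    return sorted(candidates, key=lambda c: c[1], reverse=True)


# Hypothetical leaf data for the characters R, E, A, D ("-" is the null phoneme).
leaves = [{"r": 1.0}, {"iy": 0.6, "eh": 0.4}, {"-": 0.9, "ae": 0.1}, {"d": 1.0}]
result = n_best_by_substitution(leaves, 2)
print(result)
```

With these numbers the first-best candidate is r-iy-(null)-d, and the smallest probability gap is at the "E" position, so the second candidate swaps "iy" for "eh".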

【００６４】 Decision trees 110 frequently produce only moderately successful results. This is because these decision trees have no way of determining, at each character, what phoneme will be generated by the subsequent characters. Decision trees 110 can therefore generate a high-scoring pronunciation that does not actually occur in natural speech. For example, the proper name Achilles tends to yield a pronunciation that phoneticizes both l's: ah-k-ih-l-l-iy-z. In natural speech the second l is not actually sounded: ah-k-ih-l-iy-z. A pronunciation generator that uses decision trees 110 has no mechanism for screening out word pronunciations that would never occur in natural speech.

【００６５】 The second stage 120 of pronunciation system 108 addresses the above problem. Phoneme-mixed tree score evaluator 120 uses the set of phoneme-mixed decision trees 112 to assess the viability of each pronunciation in list 118. Score evaluator 120 works by sequentially examining each character in input sequence 114 together with the phonemes assigned to each character by text-based pronunciation generator 116.

【００６６】 Phoneme-mixed tree score evaluator 120 rescores each of the pronunciations in list 118 based on the questions of the phoneme-mixed trees 112 and using the probability data in the leaf nodes of the mixed trees. The pronunciations may be stored, in association with their respective scores, as list 122. List 122 may be sorted in descending order so that the first listed pronunciation is the one with the highest score.

【００６７】 In many instances, the pronunciation occupying the highest score position in list 122 differs from the pronunciation occupying the highest score position in list 118. This occurs because phoneme-mixed tree score evaluator 120, using the phoneme-mixed trees 112, screens out pronunciations that do not contain self-consistent phoneme sequences or that otherwise represent pronunciations that would not occur in natural speech.

【００６８】 In the preferred embodiment, phoneme-mixed tree score evaluator 120 uses sentence rate calculator 152 to determine rate data for the pronunciations in list 122. In addition, evaluator 120 uses phoneme-mixed trees that allow dialect-related questions to be examined and that allow stress and other prosodic aspects to be determined by questions at the leaf nodes, in a manner similar to the approach described above.

【００６９】 Selector module 124 may access list 122 to retrieve one or more of the pronunciations in the list. Typically, selector 124 retrieves the pronunciation with the highest score and provides it as output pronunciation 126.
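The second stage (rescoring list 118 into list 122, then selecting the top entry) can be sketched as follows. The function `mixed_leaf_probability` is only a stand-in for walking a real mixed tree: it asks one character question and one phoneme question and returns invented probabilities. The character-to-phoneme alignment used for "Achilles" is likewise an assumption made for the example.

```python
import math

NULL = "-"  # null phoneme


def mixed_leaf_probability(letters, phonemes, i):
    """Stand-in for a mixed-tree walk at position i (probabilities are invented)."""
    doubled_letter = i > 0 and letters[i] == letters[i - 1]    # character question
    repeats_phoneme = (i > 0 and phonemes[i] != NULL
                       and phonemes[i] == phonemes[i - 1])     # phoneme question
    return 0.1 if (doubled_letter and repeats_phoneme) else 0.9


def rescore(letters, pronunciations):
    """Return (pronunciation, score) pairs sorted best-first, like list 122."""
    scored = [
        (p, math.prod(mixed_leaf_probability(letters, p, i)
                      for i in range(len(letters))))
        for p in pronunciations
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


letters = list("ACHILLES")
list_118 = [
    ["ah", "k", NULL, "ih", "l", "l", "iy", "z"],   # first-stage favourite
    ["ah", "k", NULL, "ih", "l", NULL, "iy", "z"],
]
list_122 = rescore(letters, list_118)
output_pronunciation = list_122[0][0]   # the selector takes the top entry
print(output_pronunciation)
```

Because the stand-in penalizes a repeated phoneme on a doubled letter, the candidate with a silent second l moves to the top of the rescored list.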

【００７０】 As noted above, the pronunciation generator shown in FIG. 8 represents only one possible embodiment that uses the mixed tree approach of the invention. In an alternate embodiment, the output pronunciation, i.e., the pronunciation or pronunciations selected from list 122, can be used to form pronunciation dictionaries for both speech recognition and speech synthesis applications. In the speech recognition context, the pronunciation dictionary may be used during the recognizer training phase by supplying pronunciations for words not yet found in the recognizer lexicon. In the synthesis context, the pronunciation dictionary may be used to generate phoneme sounds for concatenated playback. The system may be used, for example, to augment the features of an e-mail reader or other text-to-speech application.

【００７１】 The mixed-tree scoring system of the invention (i.e., characters, syntax, context, and phonemes) can be used in a variety of applications where a single pronunciation or a list of possible pronunciations is desired. For example, in a dynamic on-line language-learning system, a user types a sentence and the system provides a list of possible pronunciations for that sentence, in order of probability. The scoring system can also be used as a user feedback tool for language-learning systems. A language-learning system with speech recognition capability can be used to display a spelled sentence and to analyze the speaker's attempts at pronouncing that sentence in the new language. The system then indicates to the user how probable or improbable his or her pronunciation is for that sentence.

【００７２】 While the invention has been described in its presently preferred form, it will be understood that there are numerous applications for the mixed-tree pronunciation system. Accordingly, the invention is capable of certain modifications and changes without departing from the spirit of the invention as set forth in the appended claims.

[Brief description of the drawings]

【図１】 FIG. 1 is a block diagram illustrating the components and steps of the invention.

【図２】 FIG. 2 is a tree diagram illustrating a character-only tree.

【図３】 FIG. 3 is a tree diagram illustrating a mixed tree according to the invention.

【図４】 FIG. 4 is a block diagram illustrating a presently preferred system for generating mixed trees according to the invention.

【図５】 FIG. 5 is a flowchart illustrating a method of generating training data through an alignment process.

【図６】 FIG. 6 is a block diagram illustrating the use of the decision trees in an exemplary pronunciation generator.

【図７】 FIG. 7 illustrates the application of the Gini criterion in assessing which question to use in populating a node.

【図８】 FIG. 8 is a block diagram of a character-to-speech pronunciation generator according to the invention.

【図９】 FIG. 9 is a tree diagram illustrating a character-syntax-context-dialect mixed decision tree.

[Explanation of symbols]

10: character-only decision trees
12: mixed decision trees
14: input sequence
16: dynamic programming phoneme sequence generator
18: pronunciation list
20: mixed tree score evaluator
24: selector module
40: tree generator
42: training data corpus
48: dynamic programming alignment module
50: set of yes-no questions
52: rule set
53: pruning method
80: memory for character-only trees
82: memory for mixed tree set
84: dynamic programming sequence generator
86: input sequence
90: scorer
110: character-syntax-context-dialect decision trees
112: phoneme-mixed decision trees
114: input sequence
115: syntax data
116: text-based pronunciation generator
120: phoneme-mixed tree score evaluator
124: selector module
129: syntax tagger software module
138: internal node
140: character-syntax-context-dialect decision tree
141: internal node
144: leaf node
150: dialect selection device
152: sentence rate calculator software module

Continued from the front page: (72) Inventor: Matteo Contolini, 821 Cliff Drive, No. B-1, Santa Barbara, California 93109, United States of America

Claims

[Claims]

1. An apparatus for generating at least one phonetic pronunciation for an input sequence of characters selected from a predetermined alphabet, comprising: a memory for storing a plurality of character-only decision trees corresponding to the alphabet, the character-only decision trees having internal nodes representing yes-no questions about a given character and its neighboring characters in a given sequence; the memory further storing a plurality of mixed decision trees corresponding to the alphabet, the mixed decision trees having a first plurality of internal nodes representing yes-no questions about a given character and its neighboring characters in the given sequence and a second plurality of internal nodes representing yes-no questions about a phoneme and its neighboring phonemes in the given sequence; the character-only decision trees and the mixed decision trees further having leaf nodes representing probability data that associates the given character with a plurality of phoneme pronunciations; a phoneme sequence generator, coupled to the character-only decision trees, for processing an input sequence of characters and generating a first set of phonetic pronunciations corresponding to the input sequence of characters; and a score evaluator, coupled to the mixed decision trees, for processing the first set to generate a second set of scored phonetic pronunciations representing at least one phonetic pronunciation of the input sequence.

2. The apparatus of claim 1, wherein the second set comprises a plurality of pronunciations each having an associated score derived from the probability data, the apparatus further comprising a pronunciation selection unit that receives the second set and is operable to select one pronunciation from the second set based on the associated score.

3. The apparatus of claim 1, wherein the phoneme sequence generator generates a predetermined number of different pronunciations corresponding to a given input sequence.

4. The apparatus of claim 1, wherein the phoneme sequence generator generates a predetermined number of different pronunciations corresponding to a given input sequence, the pronunciations representing the n most suitable pronunciations according to the probability data.

5. The apparatus of claim 4, wherein the score evaluation unit re-scores the n most suitable pronunciations based on the mixed decision trees.

6. The apparatus of claim 1, wherein the phoneme sequence generator constructs a matrix of possible phoneme combinations representing different pronunciations.

7. The apparatus of claim 6, wherein the phoneme sequence generator selects the n most suitable phoneme combinations from the matrix using dynamic programming.

8. The apparatus of claim 6, wherein the phoneme sequence generator selects the n most suitable phoneme combinations from the matrix by iterative substitution.
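
The selection of the n most suitable phoneme combinations from a matrix, as recited in claims 6 to 8, can be illustrated with a short Python sketch. This is only one possible reading, not the patented implementation: it assumes the matrix has one column per character, that each column lists hypothetical `(phoneme, probability)` candidates with nonzero probabilities, and that a combination's score is the product of its entries. Starting from the best candidate in every column, it repeatedly replaces one phoneme with its next-best alternative and uses a priority queue to emit combinations in descending score order. The function name and the sample data are illustrative assumptions.

```python
import heapq
import math


def n_best_pronunciations(matrix, n):
    """Return up to n (phoneme_list, score) pairs, best first.

    matrix: one column per character; each column is a list of
    (phoneme, probability) candidates. Probabilities must be > 0
    (an assumption of this sketch, because logs are used).
    """
    # Sort every column so that rank 0 is the most probable phoneme.
    cols = [sorted(col, key=lambda pp: -pp[1]) for col in matrix]

    def log_score(ranks):
        return sum(math.log(cols[i][r][1]) for i, r in enumerate(ranks))

    start = tuple(0 for _ in cols)          # best phoneme in every column
    heap = [(-log_score(start), start)]     # min-heap on negative log score
    seen = {start}
    results = []
    while heap and len(results) < n:
        neg, ranks = heapq.heappop(heap)
        phonemes = [cols[i][r][0] for i, r in enumerate(ranks)]
        results.append((phonemes, math.exp(-neg)))
        # Substitute the next-best phoneme at exactly one position.
        for i in range(len(ranks)):
            if ranks[i] + 1 < len(cols[i]):
                nxt = ranks[:i] + (ranks[i] + 1,) + ranks[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-log_score(nxt), nxt))
    return results


# Hypothetical two-character matrix; "-" stands for a silent character.
example = [[("b", 0.9), ("-", 0.1)], [("ay", 0.6), ("iy", 0.4)]]
for phonemes, score in n_best_pronunciations(example, 3):
    print(phonemes, round(score, 2))
```

Because each substitution can only lower the score, popping from the queue yields the combinations in order, so the first n results are the n best under the product-of-probabilities assumption.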

9. The apparatus of claim 1, further comprising a speech recognition system having a pronunciation dictionary used for recognizer training, wherein at least a portion of the second set populates the dictionary to supply pronunciations of words based on the spelling of the words.

10. The apparatus of claim 1, further comprising a speech synthesis system that receives at least a portion of the second set and generates an audible synthesized pronunciation of words based on the spelling of the words.

11. The apparatus of claim 10, wherein the speech synthesis system is incorporated in a reader.

12. The apparatus of claim 10, wherein the speech synthesis system is incorporated into a dictionary to provide a list of possible pronunciations in order of probability.

13. The apparatus of claim 1, further comprising a language learning system that displays a spelled word, analyzes a speaker's attempt to pronounce the word using at least one of the character-only decision trees and the mixed decision trees, and indicates to the speaker how probable the speaker's pronunciation is for that word.

14. A method of processing spelling-to-pronunciation data, comprising the steps of: providing a first set of yes-no questions about characters in an input sequence and their relationship to neighboring characters; providing a second set of yes-no questions about phonemes in an input sequence and their relationship to neighboring phonemes; providing a collection of training data representing a plurality of different sets of pairs, each pair comprising a character sequence and a phoneme sequence selected from an alphabet; using the first and second sets and the training data to generate decision trees for at least a portion of the alphabet, each decision tree comprising a plurality of internal nodes and a plurality of leaf nodes; and populating the internal nodes with questions selected from the first and second sets, and populating the leaf nodes with probability data that associates the portion of the alphabet with a plurality of phoneme pronunciations based on the training data.

15. The method of claim 14, further comprising the step of providing the collection of training data as aligned pairs of character sequences and phoneme sequences.

16. The method of claim 14, wherein the step of providing the collection of training data comprises: providing a plurality of input sequences including phoneme sequences representing the pronunciation of words formed by the character sequences; and aligning selected ones of the phonemes with selected ones of the characters to define aligned character-phoneme pairs.

17. The method of claim 14, further comprising the step of supplying an input character string with at least one associated phoneme pronunciation and using the decision trees to score the pronunciation based on the probability data.

18. The method of claim 14, further comprising the step of supplying an input character string with a plurality of associated phoneme pronunciations and using the decision trees to select one of the plurality of pronunciations based on the probability data.

19. The method of claim 14, further comprising the step of supplying an input character string representing a word with a plurality of associated phoneme pronunciations and using the decision trees to generate a phonetic transcription of the word based on the probability data.

20. The method of claim 19, further comprising the step of using the phonetic transcription to populate a dictionary associated with a speech recognizer.

21. The method of claim 14, further comprising the step of supplying an input character string representing a word with a plurality of associated phoneme pronunciations and using the decision trees to assign a numerical score to each of the plurality of pronunciations.
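
The use of decision trees to score a pronunciation, as recited in claims 17 and 21, can be sketched in Python as follows. The sketch makes several assumptions that are not taken from the claims: the trees are written by hand rather than generated from training data, one phoneme (or the silent marker "-") is aligned with each character, and the score is the product of the leaf probabilities. All class names, question helpers, and probability values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

# A yes-no question sees the characters, the phonemes, and a position.
Question = Callable[[List[str], List[str], int], bool]


@dataclass
class Leaf:
    probs: Dict[str, float]  # probability data: phoneme -> probability


@dataclass
class Node:
    question: Question       # internal node holding a yes-no question
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def next_char_is(c: str) -> Question:
    """Character question: is the following character equal to c?"""
    return lambda chars, phonemes, i: i + 1 < len(chars) and chars[i + 1] == c


def next_phoneme_in(group) -> Question:
    """Phoneme question: is the following phoneme in the given group?"""
    return lambda chars, phonemes, i: (
        i + 1 < len(phonemes) and phonemes[i + 1] in group
    )


def find_leaf(tree, chars, phonemes, i) -> Leaf:
    """Answer questions from the root down until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.question(chars, phonemes, i) else tree.no
    return tree


def score_pronunciation(trees, chars, phonemes) -> float:
    """Multiply the leaf probability of each aligned phoneme."""
    total = 1.0
    for i, ch in enumerate(chars):
        leaf = find_leaf(trees[ch], chars, phonemes, i)
        total *= leaf.probs.get(phonemes[i], 0.0)
    return total


# Hand-made mixed tree for "c" (made-up numbers) plus trivial trees.
EXAMPLE_TREES = {
    "c": Node(
        next_char_is("h"),
        yes=Leaf({"ch": 0.8, "k": 0.2}),
        no=Node(
            next_phoneme_in({"iy", "eh"}),
            yes=Leaf({"s": 0.9, "k": 0.1}),
            no=Leaf({"k": 0.95, "s": 0.05}),
        ),
    ),
    "e": Leaf({"eh": 0.7, "iy": 0.3}),
    "h": Leaf({"-": 0.9, "hh": 0.1}),
}

for candidate in (["s", "eh"], ["k", "eh"]):
    print(candidate, score_pronunciation(EXAMPLE_TREES, ["c", "e"], candidate))
```

Scoring each candidate in a list this way and keeping the highest-scoring one corresponds to the selection step of claim 18, under the same assumptions.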

22. An apparatus for generating at least one phonetic pronunciation for an input character sequence selected from a predetermined alphabet, the character sequence forming words that adhere to a predetermined syntax, comprising: an input device that receives syntax data indicative of the syntax of the words in the input sequence; a computer storage device storing a plurality of text-based decision trees having questions indicative of predetermined characteristics of the input sequence, the predetermined characteristics including character-related questions about the input sequence and further including characteristics selected from the group consisting of syntax-related questions, context-related questions, dialect-related questions, and combinations thereof, the text-based decision trees having internal nodes representing questions about the predetermined characteristics of the input sequence, and further having leaf nodes indicating probability data that associates each of the characters with a plurality of phoneme pronunciations; and a text-based pronunciation generator, coupled to the text-based decision trees, that processes the input character sequence and generates a first set of phonetic pronunciations corresponding to the input character sequence based on the text-based decision trees.

23. The apparatus of claim 22, further comprising a phoneme-mixed tree score evaluation unit, coupled to the text-based pronunciation generator, that processes the first set and generates a second set of scored phonetic pronunciations, the scored phonetic pronunciations indicating at least one phonetic pronunciation of the input sequence.