JPH09504117A

JPH09504117A - Speech synthesis method by converting phonemes into digital waveforms

Info

Publication number: JPH09504117A
Application number: JP7506281A
Authority: JP
Inventors: ブリーン、アンドリュウ・ポール (Andrew Paul Breen)
Original assignee: ブリテイッシュ・テレコミュニケーションズ・パブリック・リミテッド・カンパニー (British Telecommunications public limited company)
Priority date: 1993-08-04
Filing date: 1994-08-01
Publication date: 1997-04-22
Also published as: WO1995004988A1; AU7270194A; HK1014431A1; AU674246B2; SG52347A1; EP0712529B1; EP0712529A1; DE69411275D1; DE69411275T2; ES2118424T3; DK0712529T3; CA2166883C

Abstract

(57)【要約】本発明は合成スピーチの発生、特に音素のテキストからのデジタル波形の発生に関する。音素の延長されたテキストと、デジタル波形の形態の等価物とを具備する連結したデータベースを使用する。データベースの２つの部分は音素テキストとデジタル波形との両者の等価点を設けるパラメータにより連結される。（音素）の入力テキストはデータベースの音素部分に整合部分を位置付けするために分析される。この整合はこれが可能な音素の正確な等価を利用する。そうでなければ音素間の関係が利用される。選択処理は前後関係の入力音素を弁別し、従って改良された会話が得られる。入力テキストをデータベースの入力形態の整合ストリングに分析すると、部分の開始および終了パラメータが設けられる。出力テキストはデジタル波形の接触部分により発生され、開始および終了パラメータにより限定される。 (57) [Abstract] The invention relates to the generation of synthetic speech, and in particular to the generation of digital waveforms from a text in phonemes. It uses a linked database comprising an extended text in phonemes and its equivalent in the form of a digital waveform. The two parts of the database are linked by parameters that establish equivalent points in both the phoneme text and the digital waveform. The input text (in phonemes) is analyzed to locate matching portions in the phoneme part of the database. The matching uses exact equivalence of phonemes where this is possible; otherwise a relationship between phonemes is used. The selection process identifies input phonemes in context, and improved speech is thereby obtained. Analyzing the input text into matching strings in the input form of the database establishes start and end parameters for the portions. The output is produced by abutting portions of the digital waveform, the portions being defined by the start and end parameters.

Description

【発明の詳細な説明】音素をデジタル波形へ変換することによるスピーチの合成方法本発明は合成スピーチ、特に音素を表わす信号からのデジタル波形を合成する方法に関する。例えば合成スピーチの使用が便利である電話システムのような多数の状況が存在する。ある応用では、開始点はワードプロセッサにより生成されるディスク等の一般的な印刷の電子表示である。処理の多数の段階はこのような開始点から合成スピーチを発生することを必要とするが、処理の予備部分として、一般的なテキストを音声テキストに変換することが通常である。この応用では、このような音声テキストを表わす信号は“音素”と呼ばれる。従って本発明は音素を表わす信号をデジタル波形に変換する問題をアドレスする。デジタル波形はオージオ技術で普通のものであり、デジタルアナログコンバータとスピーカはデジタル波形を音響波形に変換することを可能にするよく知られた装置であることが認められよう。音素をデジタル波形に変換する多数の処理が提案されており、これは通常多数のエントリーを有する連結したデータベースにより行われ、エントリーはそれぞれ音素で限定されたアクセス部分と、アクセス音素に対応するデジタル波形を含んだ出力部分を有する。明白に、全ての音素はアクセス部分で表示されるべきであるが、付加的に音素のストリングを有することも知られている。しかしながら、既存のシステムはアクセス部分に含まれる音素のストリングのみを考慮し、さらにストリングの前後関係を考慮しない。本発明は特許請求の範囲で限定されているように、音素のストリングをデジタル波形に変換するため連結したデータベースを使用するが、選択された音素のストリングの前後関係を考慮する。本発明はまた前後関係の考慮を容易にする新しい形態のデータベースを含み、また好ましいデータベースのストリングがそこに記憶されている選択肢から選択される方法を含む。本発明の好ましい実施例を例示により説明する。一般例の説明この一般的な説明は、本発明の好ましい実施例の幾つかの重要な完全なものを弁別する。これらの完全なもののそれぞれはこの一般的な説明後に詳細に説明する。本発明の方法は音素で表現されているテキストを表わす入力信号を、最終的に音響波に変換されるデジタル波形に変換する。変換前に、最初のデジタル波形はさらに当業者に知られている方法に応じて処理されてもよい。好ましい実施例で使用される音素セットはＳＡＭＰ−ＰＡ（Speech Assessmen t Methologies - Phonetic Alphabet ）の簡単なセット番号６に準じる。本発明の方法は電子装置で実行され、音素が信号形態で与えられ、従ってその方法は入力波形の出力波形への変換に対応することが理解されよう。本発明の好ましい実施例は１、２または３の音素のストリングを表わす波形をデジタル波形に変換するが、常に少なくとも１つの先行する音素と少なくとも１つの後続する音素が考慮されるように５つの音素のストリングで動作する。これは５の音素のストリングの選択肢が利用可能であるとき、“ 最良”の前後関係が選択される効果を有する。本発明は特に５の音素のストリングを使用し、このストリングは以下の説明において“前後関係窓”と呼ばれ、“前後関係窓”を構成する５つの音素は連続してＰ１、Ｐ２、Ｐ３、Ｐ４、Ｐ５として示される。入力信号からの５つの連続的な音素である“データ前後関係窓”は、データベースに含まれる５つの連続音素の連続である“アクセス前後関係窓”に一致されることが本発明の重要な特徴である。従来技術は可変の長さのストリングがデジタル波形に変換される技術を含んでいる。しかしながら、選択されたストリングの前後関係は考慮されない。選択されたストリングを構成している各音素は勿論ストリングの他の全ての音素との前後関係にあるが、ストリングの前後関係は全体としては考慮されない。本発明は選択されたストリング内の前後関係を考慮するだけでなくデータベースで有効なストリングから最良の整合のストリングを選択する。この明細書は以下の好ましい実施例の重要な事項を説明する。ｉ）選択で使用されるときの“最良”の定義 ii）対応するデジタル波形を伴ったデータ前後関係窓の信号表示を記憶するデータベース構成 iii ）（ｉ）を使用した（ii）の選択方法 iv）（iii 
）により与えられる種々の選択肢の１つの採用“最良”の定義本発明は入力前後関係窓と種々の記憶された前後関係窓との“最良”の整合に基いて選択肢の前後関係窓から選択する。例えば10⁸または10¹⁰の多数の可能な前後関係窓（それぞれ５音素）が存在するから、これらを全て記憶することはできず、即ちデータベースには可能な前後関係窓が幾分か不足している。全ての可能な前後関係窓が記憶されるならば、正確な対応が常に得られるので“最良”の整合を決定する必要はない。しかしながら、各個々の音素はデータベースに含まれるべきであり、常に少なくとも１つの音素に対して正確な整合を達成でき、好ましい実施例では、データ前後関係窓のＰ３を記憶された前後関係窓のＰ３に正確に整合することが常に可能であるが、通常、さらに正確に整合することは可能ではない。本発明は後述するように２つの音素の間の相関パラメータを定める。ここで各音素に対応して、係数の定められたリストからなるタイプベクトルが存在する。これらの各係数は音素の特徴を表し、例えばその音素は音声または非音声であるか、音素がシリバント（silibant）、破裂音、唇音であるか否かである。例えば音素が強勢または強勢のない音節にあるか否かの位置的な特徴を含むことも望ましい。従ってタイプベクトルは特有にその音素を特徴づけ、２つの音素は例えば排他的オアゲート（等価ゲートと呼ばれることもある）を使用することによってこれらのタイプベクトル係数を比較することにより比較されることができる。多数の整合は相関パラメータを決定する１方法である。所望ならば、これはパラメータの最大の可能な値により除算し、１００を乗算することにより、パーセンテージに変換されることができる。代りの例として、不整合パラメータは例えば２つのタイプのベクトルの相違数を数えることにより定められることができる。“最良”の整合を選択することは最低の不整合を選択することに等しいことが理解されよう。主要な決定は１対の音素の相関パラメータに関する。ストリングの相関パラメータは２つのストリングの対応する対のパラメータを合計または平均することにより得られる。加重された平均は適切な場合に利用されることができる。データベース好ましい実施例では、（文節の情報内容は重要ではないが）データベースは例えば英語等の選択言語の延長された文節に基づく。適切な文節は２または３分間継続し、約１０００乃至１５００の音素を含んでいる。あらゆる音素を含まなければならず、種々の前後関係のあらゆる音素を含むべきであるが、延長した文節の正確な特性は特に重要ではない。延長された文節は２つの異なったフォーマットで記憶されることができる。第１に、延長した文節は連結したデータベースのアクセス部分を提供するために音素で表されることができる。特に、延長した文節を表す音素はそれぞれ５音素を含んだ前後関係窓に分離される。本発明の方法はデータの前後関係窓を丁度弁別された記憶された前後関係窓に対して最良に整合することを得ることからなる。延長した文節はまたデジタル化波形の形態で与えられることもできる。予測されるように、これは設定された技術を使用してデジタル記録を行うため読者または暗唱者がマイクロホンに向って延長した文節を発することにより達成される。デジタル記録のあらゆる点は例えば開始からの時間等のパラメータにより定められることができる。記録の分析は等価テキストの各対の音素の間の中断に対応する時間パラメータに対する値を設定する。この装置は、ストリングの第１の音素に対応する時間パラメータの開始値とストリングの最後の音素に対応する時間パラメータの値の終了値とを設定し、データベースの等価部分即ち特定のデジタル波形を検索することにより含まれているストリングに対して音素と波形との変換を許容する。特に、１、２または３の音素のストリングの変換が達成されることができる。重要な必要条件は変換用の延長されたテキストの最良部分を選択することである。延長したテキストの音素部分はそれぞれ５の音素の前後関係窓形態で記憶されることを既に前述した。これは３階級レベルを有するツリーで音素を記憶することにより最も適切に達成される。第１のレベルの階級は各窓の音素Ｐ３により限定される。この効果として、あらゆる音素が前後関係窓のサブセットに直接アクセスを与え、即ち前後関係窓全体はサブセットに分割され、それぞれのサブセットは同一値のＰ３を有する。ツリーの次のレベルは音素Ｐ２、Ｐ４により限定され、この選択は前述のように定められたサブセットから行われるので、前後関係窓全体がさらに小さいサブセットに分割される効果が生じサブセットはそれぞれ音素Ｐ２、Ｐ３、Ｐ４を共通して有することにより限定される。
（約５０万のサブセットが存在するが、妥当なシーケンスＰ２、Ｐ３、Ｐ４は延長したテキストでは生じないので、そのほとんどは空白である）。空白のサブセットは全く記録されず、従ってデータベースは管理可能な大きさに留る。延長したテキストで生じる各３つのシーケンスＰ２、Ｐ３、Ｐ４下に対して、Ｐ２、Ｐ４下のデータベースの第２のレベルで記録されるサブセットが存在することが真であるにもかかわらず、このレベルはＰ３下の第１のレベルで指数化されている。正確な整合として、第２のレベルはＰ２、Ｐ３、Ｐ４を有するサブセットを含んだ第３のレベルへアクセスを与え、これはこれらの３つに対応してＰ１とＰ５の全ての値を含む。データＰ１とＰ５の最良の整合が選択される。この選択は延長したテキストに含まれている前後関係窓の１つを完全に弁別し、前記窓の時間パラメータにアクセスを提供する。特に、以下のように、４つの異なったストリングまでの開始および終了時間パラメータを与える。（ａ）Ｐ３自体；（ｂ）Ｐ２＋Ｐ３の音素対；（ｃ）Ｐ３＋Ｐ４の音素対；（ｄ）Ｐ２＋Ｐ３＋Ｐ４の音素からなる３つの音素第１の場合では、データベースは選択されたストリング（ａ）乃至（ｄ）のそれぞれ１つに対応している時間パラメータの開始値および終了値を提供する。前述したように、時間パラメータは等価波形が選択されるようにデジタル波形の関連部分を限定する。データベースに含まれるならば、項目（ｄ）が提供され、この場合項目（ａ）、（ｂ）、（ｃ）は全て選択された（ｄ）に組込まれ、それ故これらは選択肢として有効であることに留意すべきである。項目（ｄ）がデータベースに含まれていないならば、明白にこの選択は与えられることができない。項目（ｄ）がデータベースになくても、項目（ｂ）および／または（ｃ）はデータベースに存在する可能性がある。これらの両者の選択が提供されるとき、項目（ｄ）がないのでこれらはデータベースの異なった部分から生じる。それ故、データベースの内容に基づいて、選択は（ｂ）のみをまたは（ｃ）のみまたはその両者を与える。従って選択は選択肢を与え、いかなる場合でも項目（ａ）は対で組込まれるために利用可能である。最後に、（ｂ）、（ｃ）、（ｄ）が全てデータベースになくても、項目（ａ）は常に存在し、従って“最良の整合”は単一の音素に対して提供され、これは提供される唯一の可能性である。項目（ｂ）、（ｃ）、（ｄ）はストリングの重複を示唆していることが明白であろう。従って項目（ｃ）がいかなる音素用に対して選択されるときでも、項目（ｂ）は次の音素に対して利用できなければならない。より良好なものが提供されない場合に、データベースの同一部分は初期の音素で（ｃ）の要求を満たし、後の音素で（ｂ）の必要条件を満たすが、異なった相関が含まれるために、より良好な選択が選択されてもよい。項目（ｄ）が有効であるときには、項目（ｃ）が前の音素で有効であり、さらに項目（ｂ）は後の音素で有効であることが明白である。換言すると、幾つかのストリングは重複し、即ち同一の音素が異なったストリングの異なった位置で生じるようにいくつかの音素に対する選択肢が存在する。本発明のこの観点についてより詳細に後述する。好ましい実施例は５音素の長さである前後関係窓に基づいていることが強調された。しかしながら、５音素の十分なストリングが選択されることはない。幸い、入力テキストがデータベースで発見される５つのストリングを含んでいる場合に、３つのストリング、Ｐ２、Ｐ３、Ｐ４のみが使用される。このことは本発明の重要な特徴が前後関係からのストリングの選択であり、それ故、本発明は５つの音素の“最良”の前後関係窓を選択し、全ての選択されたストリングが前後関係に基づくことを確実にするためにその一部だけを使用する。“最良”窓の選択データベースに含まれる音素へのテキストの分析は音素により実行されるが、それぞれの音素はその前後関係窓で利用される。次の部分の説明はデータ音素の１つの選択処理に基づいているが、同一の処理が各データ音素で使用されるものと理解される。選択されたデータ音素は別々ではなく前後関係窓の一部分として利用される。より正確には、選択されたデータ音素は、関連する前後関係窓の５つの音素を与えるために選択された２つの先行音素と２つの後続音素を有するデータ窓の音素Ｐ３になる。前述のデータベースはこの前後関係窓で検索される。正確な窓が位置付けられることはほとんどないので、検索は記憶された前後関係窓の最良の適合のために行われる。検索の第１のステップは指数化素子として音素Ｐ３を使用して前述のツリーをアクセスすることを含む。前述したように、これは記憶された前後関係窓のサブセットに直接的なアクセスを与える。よ
り詳しく説明すると、音素Ｐ３によるアクセスレベルはデータ前後関係窓の可能な値Ｐ２とＰ４に対応する音素対のリストに対するアクセスを与える。最良の対は以下の４つの基準に従って選択される。第１の基準幸にも、サブセットの１つの対がデータＰ２とＰ４に対して正確な整合を与えることが生じる可能性がある。これが生じたとき、その対は選択され、検索は直ちにレベル３に進行する。詳細に前述したようにストリングＰ２、Ｐ３、Ｐ４は延長した文節に含まれないのでこの結果は起こりそうもない。第２の基準。３つの整合がない場合、これが生じた時、左対が選択される。Ｐ２に対する正確な整合が発見されたとき左側の整合が選択され、選択肢が提供されたならば、最高の相関パラメータを有するＰ４はツリーのレベル３にアクセスを与えるように選択される。第３の基準はこれがＰ４に対して発見された正確な整合に基づく右側の対である点を除いて第２の基準に類似している。この場合、レベル３へのアクセスは最高の相関パラメータを与えるＰ２の値により与えられる。最高の平均相関パラメータを有する対Ｐ２、Ｐ４がレベル３へのアクセスの基礎として選択される場合にＰ２、Ｐ３の一方に整合が存在しないとき基準４が生じる。基準１が成功したならば、選択肢として基準２、３、４に応じて、左側の対と右側の対と単一値を取ることが可能であることが留意されよう。基準１が失敗しても、左側の対は基準２により発見され、同時に右側の対は基準３により発見されることができる。しかしながら、基準１が失敗したため、これらはデータベースの異なった部分から選択され、これらはアクセスをレベル３のツリーの異なった部分に与える。最後に、基準４は基準１、２、３が全て失敗した時のみ、受けられ、その結果、他の前後関係窓で使用されるとき、音素Ｐ３は３つの音素または対で発見されることができない。従って、基準１または４が利用されるとき、第３のレベルでツリーの一部分のみにアクセスされ、基準２、３が使用されるとき、第３のレベルの２つの異なった部分にアクセスされる。前後関係窓の選択がツリーの第３のレベルの１または２の領域に対して行われる態様について説明する。それぞれの場合に、第３のレベルはデータ前後関係窓の音素１および５に対する幾つかの対を含んでもよい。最良の平均相関パラメータを有する対がデータベースのアクセス部分の前後関係窓として選択される。前述したように、この前後関係窓は時間パラメータを用いてデジタル波形形態に変換される。再度強調するが、基準１が使用されると、１つの前後関係窓だけが選択されるが、４つの可能性、即ち以下の時間パラメータ範囲を生じる。（ｉ）３つの音素Ｐ２＋Ｐ３＋Ｐ４；（ii）左側の対Ｐ２＋Ｐ３；（iii ）右側の対Ｐ３＋Ｐ４；（iv）単一のＰ３自体基準２が作用するとき、これは左側の対Ｐ２＋Ｐ３と単一のＰ３自体のみに対する時間パラメータ範囲を与える。基準３が作用するとき、類似の考察が適用されるが、パラメータ範囲は右側の対Ｐ２＋Ｐ３と単一のＰ４である。両者の基準が作用すると、これは単一のＰ３に対して２つの選択肢を提供し、Ｐ１＋Ｐ５に対するより高い相関パラメータを有する単一のものが選択される。最後に、基準４が作用するとき、１つのみの可能性、即ち音素Ｐ３自体が存在する。前述の説明は変換が入力テキストの各音素に対してどのように与えられるかを説明した。時には、この方法は単一の音素のみの変換を与えるが、この場合、選択肢は提供されない。ある場合には、その方法は２または３の隣接音素のストリングに対する変換を提供するが、これらの状況では、変換は少なくとも１つの音素に対する選択肢を提供する。選択を終了するため、選択肢の数を１つに減らすことが必要である。この減少を達成する好ましい方法を以下説明する。減少を行う好ましい方法は、入力テキストの短いセグメント、例えばサイレンスで開始し終了するセグメント等を処理することにより実行される。それ程長くないならば、センテンスは適切なセグメントを構成する。例えば３０ワード以上のようにセンテンスが非常に長いならば、例えば節と他のサブユニットの間に１以上の組込まれたサイレンスを含む。長いセンテンスの場合、このようなサブユニットはセグメントとしての使用に適している。各選択肢のセットを１つに減少するためのセグメントの処理を以下説明する。前述したように、いくつかの音素には選択肢は提供されず、それ故、これらの音素には選択は必要とされない。選択肢は他の音素で有効であり、全体的にセグメントに“最良”の結果を生じるような選択が行われる。これはセグメントの他の場所で“より良好な”選択を得るためセグメントの１点で局部的に“よくない” 
選択を行うことを含んでいる。“より良好な”基準は以下のことを含んでいる。（ｉ）短いストリングよりも長いストリングを採用し、（ii）単に接触しているストリングよりも重複するストリングから選択する。不所望な選択肢の排除によって各音素が１つおよび１つのみの変換を有する位置が発生する。換言すると、入力テキストはデータベースに整合する１、２または３の音素のサブストリングに分割され、それ故選択されたストリームの開始値および終了値が設定される。データベースの出力部分はデジタル波形の形態を取り、設定されたパラメータはこの波形のセグメントを決定する。それ故、入力テキストに対応するデジタル波形を発生するように指定されたセグメントが選択され、接触される。これは本発明の要求が完了する。デジタル波形が得られると、これは通常のデジタルアナログ変換技術と一般的なスピーカを使用して音響出力として与えられることができる。所望ならば一次的デジタル波形は当業者に知られている技術を使用して強化されることができる。本発明を添付図面を参照して例示によりさらに説明する。図１は本発明によるスピーチエンジンを概略的に示している。図２は電話回路網に取付けられた図１で示されているスピーチエンジンを示している。図１で示されているように、本発明によるスピーチエンジンは文字素でテキストを受け、それから音素で等価テキストを発生するように構成された１次プロセッサ11を具備している。このテキストは、本発明によりデータベース13と動作上関連しているコンバータ12へ送られる。コンバータ12は、音素テキストのセグメントとデータベース13のアクセス部分に記憶されたセグメントとを整合させる。従ってデジタル波形のセグメントは検索され、これらはもとの入力の延長された部分に対応しているデジタル波形の延長部分に組立てられる。デジタル波形のこれらの延長された部分は波形プロセッサ14に送られ、ここでこれらはスムースな出力を発生するためにさらに処理を受ける。最後に、デジタル出力はさらに伝送するために出力ポート15で与えられるアナログ波形に変換される。図１で示されているように、スピーチエンジンはテキストを一般的な正字法でテキストを保持する外部データベース16から入力を受信するように接続されている。外部データベース16はそこに記憶されたテキストを選択するためにキーボード17により動作されると便利である。このテキストは１次コンバータ11に与えられ、アナログ波形として出力ポート15に現れる。図２は公共アクセス電話回路網に取付けられた図１で示されているスピーチエンジンを示している。図２で示されているように一般的な音声電話20は切換えアクセス回路網21を経てステーション22に接続されている。ステーション22は図１で示されているようにスピーチエンジンを含んでおり、出力ポート15は、外部データベース16中で利用可能な情報がアナログ音響波形として電話20に与えられるように回路網に接続されている。所望ならば、電話20の（ダイヤル用に使用される）キーパッドは外部データベース16のキーパッド17として使用されることができる（この場合、外部データベース16は好ましくはスピーチエンジンにより読取られることができる指令を含んでいる）。より簡単な技術の装置ではステーション20で人間のオペレータが配置され、人間のオペレータは回路網21に通って受信される指令に応じてキーボード17を付勢する。オペレータがテキストの一部を選択したとき、これはスピーチエンジンにより読取られ、さらにオペレータの関与する必要はない。従って、オペレータは自由に問い合わせを補助し、スピーチエンジンの使用は動作の効率を強化する。例えば公共アドレスシステムへの接続に適切である等、本発明によるスピーチエンジンに対する多数の他の応用が存在することが認められよう。TECHNICAL FIELD The present invention relates to synthetic speech, and more particularly to a method of synthesizing a digital waveform from a signal representing a phoneme. There are numerous situations, such as telephone systems, where the use of synthetic speech is convenient. 
In some applications the starting point is an electronic representation of conventional print, for example a disk produced by a word processor. Many processing stages are needed to produce synthetic speech from such a starting point, but it is usual, as a preliminary part of the process, to convert the conventional text into a phonetic text. In this specification the signals representing such a phonetic text are called "phonemes". The present invention therefore addresses the problem of converting signals representing phonemes into a digital waveform. It will be appreciated that digital waveforms are commonplace in audio technology, and that digital-to-analog converters and loudspeakers are well-known devices that convert a digital waveform into an acoustic waveform.

Many processes for converting phonemes into digital waveforms have been proposed. The conversion is usually performed by means of a linked database having a large number of entries, each entry having an access portion defined in phonemes and an output portion containing the digital waveform that corresponds to the access phonemes. Clearly, every phoneme should be represented in the access portions, but it is also known to include strings of phonemes in addition. However, existing systems consider only the strings of phonemes contained in the access portions and do not take the context of those strings into account.

The present invention, as defined in the claims, uses a linked database to convert strings of phonemes into digital waveforms, but it takes the context of the selected strings of phonemes into account. The invention also includes a new form of database that facilitates taking context into account, and a method by which preferred database strings are selected from the alternatives stored there. A preferred embodiment of the invention will now be described by way of example.

General description

This general description identifies several important elements of the preferred embodiment of the invention.
Each of these elements is described in detail after this general description. The method of the invention converts an input signal representing a text expressed in phonemes into a digital waveform that is ultimately converted into an acoustic wave. Before that conversion, the initial digital waveform may be processed further by methods known to those skilled in the art. The phoneme set used in the preferred embodiment conforms to SAM-PA (Speech Assessment Methodologies - Phonetic Alphabet) simple set number 6. It will be appreciated that the method is carried out in electronic equipment and that the phonemes are supplied in signal form, so that the method corresponds to the conversion of an input waveform into an output waveform.

The preferred embodiment converts waveforms representing strings of one, two or three phonemes into digital waveforms, but it always operates on strings of five phonemes, so that at least one preceding phoneme and at least one following phoneme are taken into account. The effect is that, when alternative five-phoneme strings are available, the "best" context is selected.

The invention makes particular use of a string of five phonemes, referred to below as a "context window"; the five phonemes making up a context window are denoted, in order, P1, P2, P3, P4 and P5. It is an important feature of the invention that a "data context window", that is five consecutive phonemes from the input signal, is matched to an "access context window", that is a sequence of five consecutive phonemes contained in the database.

The prior art includes techniques in which strings of variable length are converted into digital waveforms, but the context of the selected strings is not taken into account. Each phoneme of a selected string is, of course, in the context of all the other phonemes of that string, but the context of the string as a whole is not considered.
The present invention not only takes into account the context within the selected string but also selects the best-matching string from the strings available in the database.

This specification describes the following important aspects of the preferred embodiment:

i) the definition of "best" as used in selection;
ii) the database structure that stores signal representations of context windows together with the corresponding digital waveforms;
iii) the method of selecting from (ii) using (i);
iv) the adoption of one of the various alternatives provided by (iii).

Definition of "best"

The invention selects from alternative context windows on the basis of the "best" match between the input context window and the various stored context windows. Because the number of possible context windows (each of five phonemes) is very large, for example 10⁸ or 10¹⁰, it is not possible to store them all; the database therefore holds only some of the possible context windows. If every possible context window were stored, an exact correspondence would always be available and there would be no need to determine a "best" match. Every individual phoneme should nevertheless be contained in the database, so an exact match can always be achieved for at least one phoneme. In the preferred embodiment it is always possible to match P3 of the data context window exactly to P3 of a stored context window, but it is usually not possible to match more than that exactly.

The invention defines a correlation parameter between two phonemes as follows. Corresponding to each phoneme there is a type vector consisting of a fixed list of coefficients. Each coefficient represents a feature of the phoneme, for example whether the phoneme is voiced or unvoiced, or whether it is a sibilant, a plosive or a labial. It is also desirable to include positional features, for example whether the phoneme is in a stressed or an unstressed syllable.
The type vector thus uniquely characterizes its phoneme, and two phonemes can be compared by comparing the coefficients of their type vectors, for example by using an exclusive-OR gate (sometimes called an equivalence gate). The number of matches is one way of determining the correlation parameter. If desired, this can be converted to a percentage by dividing by the maximum possible value of the parameter and multiplying by 100. Alternatively, a mismatch parameter can be defined, for example by counting the number of differences between the two type vectors. It will be appreciated that selecting the "best" match is equivalent to selecting the lowest mismatch.

The primary definition concerns the correlation parameter of a pair of phonemes. The correlation parameter of a string is obtained by summing or averaging the parameters of the corresponding pairs in the two strings. Weighted averages can be used where appropriate.

Database

In the preferred embodiment the database is based on an extended passage in the selected language, for example English (the information content of the passage is not important). A suitable passage lasts two or three minutes and contains about 1000 to 1500 phonemes. The exact nature of the extended passage is not particularly important, but it must contain every phoneme and it should contain every phoneme in a variety of contexts.

The extended passage is stored in two different formats. First, it is expressed in phonemes to provide the access portion of the linked database. In particular, the phonemes representing the extended passage are divided into context windows of five phonemes each. The method of the invention consists in obtaining the best match of a data context window to the stored context windows just identified. The extended passage is also provided in the form of a digitized waveform.
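As an illustration of the correlation parameter defined under "Definition of "best"", the following Python sketch compares two type vectors coefficient by coefficient. The feature list, the phoneme symbols and the coefficient values are assumptions made for the example; the text names voicing, sibilance, plosion, labiality and stress as possible features but gives no table. Positional features such as stress are left out for brevity.

```python
from typing import Sequence

# Hypothetical feature order for the type vectors (assumed for illustration).
FEATURES = ("voiced", "sibilant", "plosive", "labial")

# Illustrative type vectors; the values are assumptions, not patent data.
TYPE_VECTORS = {
    "p": (0, 0, 1, 1),
    "b": (1, 0, 1, 1),
    "s": (0, 1, 0, 0),
    "z": (1, 1, 0, 0),
}


def mismatch(a: str, b: str) -> int:
    """Number of coefficients on which two phonemes differ (XOR per pair)."""
    return sum(x ^ y for x, y in zip(TYPE_VECTORS[a], TYPE_VECTORS[b]))


def correlation(a: str, b: str) -> int:
    """Number of coefficients on which two phonemes agree."""
    return len(FEATURES) - mismatch(a, b)


def correlation_percent(a: str, b: str) -> float:
    """Correlation divided by its maximum possible value, times 100."""
    return 100.0 * correlation(a, b) / len(FEATURES)


def string_correlation(s1: Sequence[str], s2: Sequence[str]) -> float:
    """Average of the pairwise correlations of two equal-length strings."""
    if len(s1) != len(s2):
        raise ValueError("strings must have the same length")
    return sum(correlation(a, b) for a, b in zip(s1, s2)) / len(s1)
```

Choosing the highest `correlation` and choosing the lowest `mismatch` pick the same candidate, because the two values always add up to the number of features.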
As would be expected, this is achieved by having a reader or reciter speak the extended passage into a microphone so that a digital recording is made using established techniques. Any point in the digital recording can be defined by a parameter, for example the time from the start. Analysis of the recording establishes the values of the time parameter that correspond to the boundaries between each pair of adjacent phonemes in the equivalent text. This arrangement allows phoneme-to-waveform conversion for any string contained in the passage: the start value of the time parameter corresponding to the first phoneme of the string and the end value corresponding to the last phoneme of the string are established, and the equivalent part of the database, that is the specific digital waveform, is retrieved. In particular, conversion of strings of one, two or three phonemes can be achieved. The important requirement is to select the best part of the extended text for the conversion.

It has already been stated that the phoneme version of the extended text is stored in the form of context windows of five phonemes each. This is most suitably achieved by storing the phonemes in a tree with three hierarchical levels. The first level of the hierarchy is defined by the phoneme P3 of each window. The effect is that each phoneme gives direct access to a subset of the context windows; that is, the whole set of context windows is divided into subsets, each subset having the same value of P3. The next level of the tree is defined by the phonemes P2 and P4. Because this selection is made from a subset defined as above, the effect is that the whole set of context windows is divided into smaller subsets, each defined by having phonemes P2, P3 and P4 in common. (There are about half a million such subsets, but most of them are empty because the sequence P2, P3, P4 in question does not occur in the extended text.) Empty subsets are not recorded at all, so the database remains of manageable size.
For each triple sequence P2, P3, P4 that occurs in the extended text there is a subset recorded at the second level of the database under P2, P4, this level itself being indexed under P3 at the first level. For an exact match, the second level gives access to the third level, which contains the subset of windows having P2, P3 and P4 and which holds all the values of P1 and P5 that occur with that triple. The best match to the data P1 and P5 is selected. This selection identifies exactly one of the context windows contained in the extended text and gives access to the time parameters of that window. In particular, it gives start and end time parameters for up to four different strings:

(a) P3 itself;
(b) the phoneme pair P2 + P3;
(c) the phoneme pair P3 + P4;
(d) the phoneme triple P2 + P3 + P4.

In the first instance, the database provides the start and end values of the time parameter corresponding to each of the selected strings (a) to (d). As stated above, the time parameter defines the relevant part of the digital waveform, so that the equivalent waveform is selected.

Item (d) is offered if it is contained in the database. In that case items (a), (b) and (c) are all contained within the selected (d), so they are also available as alternatives. If item (d) is not contained in the database, that selection clearly cannot be offered. Even if item (d) is not in the database, items (b) and/or (c) may be. When both are offered they come from different parts of the database, because item (d) is absent. Depending on the content of the database, the selection therefore yields (b) only, (c) only, or both. The selection thus offers alternatives, and in every case item (a) is available because it is contained within the pairs. Finally, even if (b), (c) and (d) are all absent from the database, item (a) is always present; the "best match" is then provided for a single phoneme, and that is the only possibility offered.
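The three-level tree and the time ranges for strings (a) to (d) might be sketched in Python as below. This is a minimal illustration under stated assumptions: the function names, the use of nested dictionaries, the millisecond boundaries and the toy phoneme sequence (with "#" standing for silence) are all hypothetical. The sketch does not index the first two and last two phonemes of the passage as P3, which a real database would have to handle, for example by padding.

```python
from collections import defaultdict


def build_tree(phonemes, boundaries):
    """Index every five-phoneme window of the extended passage.

    phonemes:   phoneme symbols of the recorded passage, in order
    boundaries: len(phonemes) + 1 time values; phoneme i occupies
                boundaries[i] .. boundaries[i + 1] in the recording
    Returns tree[P3][(P2, P4)][(P1, P5)] -> positions of P3 in the passage.
    Only combinations that actually occur are stored, so the tree is sparse.
    """
    if len(boundaries) != len(phonemes) + 1:
        raise ValueError("need one more boundary than phonemes")
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for i in range(2, len(phonemes) - 2):
        p1, p2, p3, p4, p5 = phonemes[i - 2:i + 3]
        tree[p3][(p2, p4)][(p1, p5)].append(i)
    return tree


def time_spans(i, boundaries):
    """Start and end times of strings (a) to (d) around the P3 at position i."""
    return {
        "a": (boundaries[i], boundaries[i + 1]),      # P3 alone
        "b": (boundaries[i - 1], boundaries[i + 1]),  # P2 + P3
        "c": (boundaries[i], boundaries[i + 2]),      # P3 + P4
        "d": (boundaries[i - 1], boundaries[i + 2]),  # P2 + P3 + P4
    }


# Toy passage: nine phonemes with a boundary every 100 ms (assumed values).
phonemes = ["#", "k", "{", "t", "#", "s", "{", "t", "#"]
boundaries_ms = list(range(0, 1000, 100))
tree = build_tree(phonemes, boundaries_ms)
```

In this toy passage `tree["{"]` holds two second-level entries, `("k", "t")` and `("s", "t")`, each leading to one stored window.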
It will be apparent that items (b), (c) and (d) imply overlapping strings. Whenever item (c) is selected for any phoneme, item (b) must be available for the next phoneme. If nothing better is offered, the same part of the database satisfies (c) for the earlier phoneme and (b) for the later phoneme; because different correlations are involved, however, a better alternative may be selected. Likewise, when item (d) is available it is clear that item (c) is available for the preceding phoneme and item (b) for the following phoneme. In other words, some strings overlap: there are alternatives for some phonemes because the same phoneme occurs at different positions in different strings. This aspect of the invention is described in more detail below.

It has been emphasized that the preferred embodiment is based on a context window five phonemes long. However, a full string of five phonemes is never selected. Even if, by good fortune, the input text contains a string of five phonemes that is found in the database, only the triple P2, P3, P4 is used. This is because an important feature of the invention is the selection of strings from a context: the invention selects the "best" context window of five phonemes and uses only part of it, in order to ensure that every selected string is based on a context.

Selection of the "best" window

The analysis of the text into strings contained in the database is carried out phoneme by phoneme, but each phoneme is used in its context window. The description that follows is based on the selection process for one data phoneme; it will be understood that the same process is used for every data phoneme.

The selected data phoneme is used not in isolation but as part of a context window. More precisely, the selected data phoneme becomes phoneme P3 of a data window, and the two preceding and two following phonemes are selected with it to give the five phonemes of the relevant context window.
The database described above is searched for this context window. Because the exact window is rarely located, the search is for the best fit among the stored context windows.

The first step of the search is to access the tree described above using phoneme P3 as the indexing element. As stated above, this gives direct access to a subset of the stored context windows. More specifically, access at the level of phoneme P3 gives access to a list of phoneme pairs that correspond to possible values of P2 and P4 of the data context window. The best pair is selected according to the following four criteria.

First criterion: with good fortune, one pair in the subset gives an exact match to the data P2 and P4. When this happens that pair is selected and the search proceeds at once to level 3. As explained in detail above, this outcome is unlikely because the string P2, P3, P4 is often not contained in the extended passage.

Second criterion: when there is no match for the triple, a left pair is selected if one exists. A left-hand match is selected when an exact match is found for P2; if there is a choice, the P4 with the highest correlation parameter is selected to give access to level 3 of the tree.

Third criterion: this is similar to the second criterion, except that it is a right pair, based on an exact match found for P4. In this case access to level 3 is given by the value of P2 that gives the highest correlation parameter.

Fourth criterion: this applies when there is no exact match for either P2 or P4. The pair P2, P4 with the highest average correlation parameter is then selected as the basis for access to level 3.

It should be noted that if criterion 1 succeeds, it is also possible to take, as alternatives, the left pair, the right pair and the single value that criteria 2, 3 and 4 would provide. Even if criterion 1 fails, a left pair may be found by criterion 2 and, at the same time, a right pair may be found by criterion 3.
However, because criterion 1 has failed, these are selected from different parts of the database and they give access to different parts of level 3 of the tree. Finally, criterion 4 is accepted only when criteria 1, 2 and 3 have all failed; as a result, phoneme P3 cannot be found in a triple or a pair when it is used in other context windows. Thus, when criterion 1 or criterion 4 applies, only one part of the tree is accessed at the third level, whereas when criteria 2 and 3 apply, two different parts of the third level are accessed.

The manner in which a context window is selected from the one or two regions of the third level of the tree will now be described. In each case the third level may contain several pairs for phonemes P1 and P5 of the data context window. The pair with the best average correlation parameter is selected as the context window in the access portion of the database. As stated above, this context window is converted into digital waveform form by means of the time parameters.

To emphasize the point again, when criterion 1 applies only one context window is selected, but it gives rise to four possibilities, namely time parameter ranges for:

(i) the triple P2 + P3 + P4;
(ii) the left pair P2 + P3;
(iii) the right pair P3 + P4;
(iv) the single P3 itself.

When criterion 2 applies, it gives time parameter ranges only for the left pair P2 + P3 and for the single P3 itself. When criterion 3 applies, similar considerations hold, but the time parameter ranges are for the right pair P3 + P4 and the single P3. When both criteria apply, there are two alternatives for the single P3, and the one with the higher correlation parameter for P1 + P5 is selected. Finally, when criterion 4 applies there is only one possibility, namely phoneme P3 itself.

The description above explains how conversions are provided for each phoneme of the input text. Sometimes the method provides a conversion for a single phoneme only, and in that case no alternatives are offered.
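The four criteria for level 2 and the choice of P1, P5 at level 3 can be sketched in Python as follows. This is only an illustration: the tree layout `tree[P3][(P2, P4)][(P1, P5)]`, the function names, the toy database and the crude `corr` function (2 for identical phonemes, 1 for the same assumed class, 0 otherwise) are hypothetical stand-ins for the stored windows and the type-vector correlation parameter.

```python
# Assumed phoneme classes, used only by the toy correlation below.
CLASS = {"k": "plosive", "t": "plosive", "p": "plosive",
         "s": "fricative", "f": "fricative",
         "i": "vowel", "o": "vowel", "a": "vowel", "#": "silence"}


def corr(x, y):
    """Toy correlation parameter (an assumption, not the patent's values)."""
    if x == y:
        return 2
    return 1 if CLASS[x] == CLASS[y] else 0


def select_windows(tree, window, corr):
    """Return [(criterion, (P2, P4), (P1, P5)), ...] for a data window."""
    p1, p2, p3, p4, p5 = window
    level2 = tree[p3]  # P3 is always matched exactly

    def best_outer(pair):
        # Level 3: stored (P1, P5) with the best average correlation.
        return max(level2[pair],
                   key=lambda o: (corr(o[0], p1) + corr(o[1], p5)) / 2)

    if (p2, p4) in level2:                      # criterion 1: exact triple
        return [(1, (p2, p4), best_outer((p2, p4)))]

    results = []
    left = [pr for pr in level2 if pr[0] == p2]
    if left:                                    # criterion 2: exact P2
        pair = max(left, key=lambda pr: corr(pr[1], p4))
        results.append((2, pair, best_outer(pair)))
    right = [pr for pr in level2 if pr[1] == p4]
    if right:                                   # criterion 3: exact P4
        pair = max(right, key=lambda pr: corr(pr[0], p2))
        results.append((3, pair, best_outer(pair)))
    if not results:                             # criterion 4: best average
        pair = max(level2,
                   key=lambda pr: (corr(pr[0], p2) + corr(pr[1], p4)) / 2)
        results.append((4, pair, best_outer(pair)))
    return results


# Strings whose time ranges each criterion makes available.
STRINGS_OFFERED = {1: ("P2+P3+P4", "P2+P3", "P3+P4", "P3"),
                   2: ("P2+P3", "P3"), 3: ("P3+P4", "P3"), 4: ("P3",)}

# Toy database: positions of P3 = "a" in an imaginary passage.
toy_tree = {"a": {("k", "t"): {("#", "#"): [2]},
                  ("s", "p"): {("#", "#"): [7], ("i", "o"): [12]}}}
```

When criteria 2 and 3 both succeed the function returns two entries, which correspond to the two different parts of level 3 mentioned in the text.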
In other cases the method provides conversions for strings of two or three adjacent phonemes, and in these circumstances it offers alternatives for at least one phoneme. To complete the selection, the number of alternatives must be reduced to one. The preferred way of achieving this reduction is described below.

The preferred method of making the reduction operates on short segments of the input text, for example segments that begin and end with a silence. Provided it is not too long, a sentence constitutes a suitable segment. A very long sentence, for example one of more than 30 words, contains one or more embedded silences, for example between clauses and other sub-units. For long sentences such sub-units are suitable for use as segments.

The processing of a segment to reduce each set of alternatives to one will now be described. As stated above, no alternatives are offered for some phonemes, so no selection is required for them. Alternatives are available for other phonemes, and the selection is made so as to give the "best" result for the segment as a whole. This may involve making a locally "poor" selection at one point in the segment in order to obtain a "better" selection elsewhere in the segment. The criteria for "better" include:

(i) preferring longer strings to shorter strings;
(ii) preferring strings that overlap to strings that merely abut.

Eliminating the unwanted alternatives leaves a situation in which each phoneme has one, and only one, conversion. In other words, the input text is divided into substrings of one, two or three phonemes that match the database, and the start and end values for each selected string are thereby established. The output portion of the database takes the form of a digital waveform, and the established parameters define segments of that waveform. The designated segments are therefore selected and abutted so as to produce the digital waveform corresponding to the input text. This completes the requirements of the invention.
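One possible way to reduce the alternatives for a segment is sketched below in Python. It is a simplification under stated assumptions: it applies only criterion (i), by covering the segment with the fewest strings, and does not model the preference for overlapping strings in criterion (ii). The dynamic-programming approach, the function names, the candidate tuple layout and the use of sample indices in place of time parameters are all choices made for this illustration, not taken from the text.

```python
def tile_segment(n, candidates):
    """Cover phonemes 0 .. n-1 with the fewest candidate strings.

    candidates: list of (start, length, (s0, s1)), where the string covers
    input phonemes start .. start + length - 1 (length 1, 2 or 3) and
    (s0, s1) is the sample range of its waveform in the recording.
    Returns the chosen candidates in order; each phoneme is covered once.
    """
    inf = float("inf")
    best = [inf] * (n + 1)    # best[i]: fewest strings covering i .. n-1
    choice = [None] * (n + 1)
    best[n] = 0
    by_start = {}
    for cand in candidates:
        by_start.setdefault(cand[0], []).append(cand)
    for i in range(n - 1, -1, -1):
        for cand in by_start.get(i, []):
            j = i + cand[1]
            if j <= n and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
                choice[i] = cand
    if best[0] == inf:
        raise ValueError("segment cannot be covered by the candidates")
    chosen, i = [], 0
    while i < n:
        chosen.append(choice[i])
        i += choice[i][1]
    return chosen


def concatenate(waveform, chosen):
    """Abut the selected slices of the recorded waveform."""
    out = []
    for _start, _length, (s0, s1) in chosen:
        out.extend(waveform[s0:s1])
    return out


# Toy example: four input phonemes, a 100-sample "recording" (assumed data).
recording = list(range(100))
candidates = [(0, 1, (10, 20)), (1, 1, (30, 40)), (2, 1, (50, 60)),
              (3, 1, (70, 80)), (0, 3, (0, 30)), (1, 2, (40, 60)),
              (2, 2, (60, 80))]
chosen = tile_segment(4, candidates)
output = concatenate(recording, chosen)
```

In the toy example the triple starting at phoneme 0 is chosen together with the single phoneme 3, so the segment is built from two pieces rather than three or four.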
Once the digital waveform has been obtained, it can be provided as an acoustic output using conventional digital-to-analog conversion techniques and a conventional loudspeaker. If desired, the primary digital waveform can be enhanced using techniques known to those skilled in the art.

The invention will now be described further, by way of example, with reference to the accompanying drawings. FIG. 1 shows schematically a speech engine according to the invention. FIG. 2 shows the speech engine of FIG. 1 attached to a telephone network.

As shown in FIG. 1, a speech engine according to the invention comprises a primary processor 11 arranged to receive text in graphemes and to generate from it an equivalent text in phonemes. This text is passed to a converter 12 which, in accordance with the invention, is operatively associated with a database 13. The converter 12 matches segments of the phoneme text with segments stored in the access portion of the database 13. Segments of digital waveform are thereby retrieved and assembled into extended portions of digital waveform corresponding to extended portions of the original input. These extended portions of digital waveform are passed to a waveform processor 14, where they are processed further to produce a smooth output. Finally, the digital output is converted to an analog waveform, which is provided at an output port 15 for onward transmission.

As shown in FIG. 1, the speech engine is connected to receive its input from an external database 16 that holds text in conventional orthography. The external database 16 is conveniently operated by a keyboard 17 to select text stored in it. The selected text is passed to the primary processor 11 and appears at output port 15 as an analog waveform.

FIG. 2 shows the speech engine of FIG. 1 attached to a public access telephone network. As shown in FIG. 2, a conventional voice telephone 20 is connected to a station 22 via a switched access network 21.
Station 22 contains a speech engine as shown in FIG. 1, and output port 15 is connected to the network so that the information available in external database 16 is provided to telephone 20 as an analog acoustic waveform. If desired, the keypad of the telephone 20 (used for dialing) can serve as the keyboard 17 of the external database 16 (in which case the external database 16 preferably contains instructions that can be read out by the speech engine). In a simpler arrangement, a human operator is located at station 20 and operates keyboard 17 in response to requests received through network 21. Once the operator has selected a piece of text, it is read out by the speech engine and no further operator intervention is required. The operator is therefore free to deal with other queries, and the use of the speech engine improves the efficiency of the operation. It will be appreciated that there are numerous other applications for the speech engine according to the invention; for example, it is suitable for connection to public address systems.

[Procedural Amendment] Patent Law Article 184-8 [Submission date] July 12, 1995 [Content of amendment] These extended portions of digital waveform are passed to the waveform processor 14, where they are further processed to produce a smooth output. Finally, the digital output is converted to an analog waveform, which is provided at output port 15 for onward transmission. As shown in FIG. 1, the speech engine is connected to receive input from an external database 16, which holds text in conventional orthography. The external database 16 is conveniently operated by a keyboard 17 to select text stored in it. This text is provided to the primary processor 11 and appears at output port 15 as an analog waveform. FIG. 2 shows the speech engine of FIG. 1 installed in a public access telephone network. As shown in FIG. 2, a conventional voice telephone 20 is connected to a station 22 via a switched access network 21. Station 22 contains a speech engine as shown in FIG. 1, and output port 15 is connected to the network so that the information available in external database 16 is provided to telephone 20 as an analog acoustic waveform. If desired, the keypad of the telephone 20 (used for dialing) can serve as the keyboard 17 of the external database 16 (in which case the external database 16 preferably contains instructions that can be read out by the speech engine).
In a simpler arrangement, at station 20, a human ...
[Claims]
(1) A method of converting an input signal into an output signal, wherein the input signal represents text in phonemes and the output signal is a digital waveform convertible into an acoustic waveform corresponding to said text, the method using a two-part database having an access portion linked to an output portion, characterized in that the access portion defines access windows, each corresponding to a string of phonemes, and the output portion contains digital waveforms corresponding to the access windows, the method comprising: comparing windows of the input signal with the access windows; selecting in each case the access window that gives a best match, the best match including an exact match for at least one internal phoneme; discarding at least the first and last phonemes of the best match so as to identify a shorter string of phonemes that is an exact match for part of the input signal; retrieving from the output portion the digital waveforms corresponding to the selected exact matches; and thereafter concatenating the selected portions of digital waveform to produce the output signal.
(2) The method of claim 1, wherein the access portion is based on an extended text in phonemes, each access window corresponds to a string of phonemes contained in the extended text, the output portion contains an extended digital waveform corresponding to the extended phoneme text of the access portion, and each portion retrieved from the output portion is a segment of the extended digital waveform corresponding to an exact match.
(3) The method of claim 1 or 2, comprising forming a best match for a window of five phonemes of the input signal and discarding at least the first and last phonemes of the best match so as to identify an exact match for a string of one, two or three phonemes.
(4) The method of claim 3, wherein the input portion of the database is organized into three hierarchical levels, namely (i) a highest level containing single phonemes corresponding to the central phoneme of a window, (ii) a second level containing equivalents of the second and fourth phonemes of the window, and (iii) a lowest level containing equivalents of the first and fifth phonemes of the window, and wherein the matching comprises selecting from the highest level an exact match for the central phoneme of the input window, selecting from the part of the second level that corresponds to the selection made at the highest level a best match for the second and fourth phonemes, and selecting from the part of the lowest level that corresponds to the selection made at the second level a best match for the first and fifth phonemes.
(5) The method of any one of claims 1 to 4, wherein the digital output is converted into an analog signal.
(6) A database for use as a component of a speech engine, having an access portion that contains signals representing phonemes and is linked to an output portion that contains digital waveforms, characterized in that the access portion is based on an extended text divided into access windows each containing five phonemes, the output portion contains an extended digital waveform corresponding to the extended phoneme text of the access portion, and the access portion is organized into three hierarchical levels, namely (i) a highest level containing single phonemes corresponding to the central phoneme of an access window, (ii) a second level containing equivalents of the second and fourth phonemes of the access windows identified at the highest level, and (iii) a lowest level containing equivalents of the first and fifth phonemes of the access windows identified at the second level, the link between the access portion and the output portion being such that identifying an access window through levels (i), (ii) and (iii) identifies the corresponding window of the digital waveform.
(7) A speech engine comprising a primary processor (11) for converting text in graphemes into equivalent text in phonemes and a converter (12) for converting the phoneme text into a digital waveform, characterized in that the converter (12) includes a database (13) according to claim 6.
(8) A telephone network including a speech engine according to claim 7, the speech engine being connected to the network so as to transmit its output to a remote location.
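The three-level organization recited in the amended claims can be sketched as a nested index. The sketch below is an illustration under stated assumptions, not the database of the specification: the phoneme symbols are made up, and the "best match" at the second and lowest levels is approximated by counting identical phonemes, whereas the specification relies on relationships between phonemes. The names `build_index`, `pair_score` and `best_window` are hypothetical.

```python
from collections import defaultdict


def build_index(text):
    """Index each 5-phoneme window by centre, then (P2, P4), then (P1, P5)."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for i in range(len(text) - 4):
        p1, p2, p3, p4, p5 = text[i:i + 5]
        index[p3][(p2, p4)][(p1, p5)].append(i)   # i = start of the window
    return index


def pair_score(stored, wanted):
    """Assumed similarity: number of positions where the phonemes are identical."""
    return (stored[0] == wanted[0]) + (stored[1] == wanted[1])


def best_window(index, window):
    """Return (start position, stored window) of the best match, or None."""
    p1, p2, p3, p4, p5 = window
    if p3 not in index:                 # highest level: exact match on the centre
        return None
    level2 = index[p3]
    inner = max(level2, key=lambda k: pair_score(k, (p2, p4)))   # second level
    level3 = level2[inner]
    outer = max(level3, key=lambda k: pair_score(k, (p1, p5)))   # lowest level
    start = level3[outer][0]
    return start, (outer[0], inner[0], p3, inner[1], outer[1])


# Made-up extended phoneme text and a query window with the same centre "t".
extended = ["k", "a", "t", "s", "a", "t", "o", "n"]
idx = build_index(extended)
match = best_window(idx, ["m", "a", "t", "o", "p"])
```

Here the query shares its three inner phonemes with the stored window starting at position 3, so discarding the first and last phonemes of that best match would leave an exactly matching string of three.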

Claims

[Claims]
(1) A method of converting an input signal into an output signal, wherein the input signal represents text in phonemes and the output signal is a digital waveform that can be converted into an acoustic waveform corresponding to the input text, the method comprising the steps of: (a) dividing the input signal into contiguous segments that are stored in the access portion of a linked database; (b) for each segment identified in step (a), retrieving a segment of digital waveform from the output portion of the database, the output segment being linked to the input segment; and (c) concatenating the digital segments retrieved in step (b), the segments being kept in the same order as the equivalent input segments, so that the concatenated digital signal is a waveform corresponding to the input signal; wherein the output portion of the database contains an extended digital waveform with position parameters that identify every point in it, so that setting start and end position parameters defines a portion of the extended digital waveform; step (a) includes setting the start and end position parameters for the segments of the input signal; and step (c) uses the parameters set in step (a) to retrieve the recorded digital waveform portions.
(2) The method of claim 1, wherein step (a) comprises comparing windows of the input signal with windows of the input portion of the database in order to establish a close match for the input signal.
(3) The method of claim 2, wherein each window has a length equal to five phonemes.
(4) The method of claim 3, wherein the input portion of the database is organized into three hierarchical levels, namely (i) a highest level containing single phonemes corresponding to the central phoneme of a window, (ii) a second level containing equivalents of the second and fourth phonemes of the window, and (iii) a lowest level containing equivalents of the first and fifth phonemes of the window, a portion identified at the lowest level being a window of recorded phonemes, and wherein the matching comprises selecting from the highest level an exact match for the central phoneme of the input window, selecting from the part of the second level that corresponds to the selection made at the highest level a best match for the second and fourth phonemes, and selecting from the part of the lowest level that corresponds to the selection made at the second level a best match for the first and fifth phonemes.
(5) A database for use as a component of a speech engine, the database comprising an output portion containing an extended digital waveform and an access portion containing signals representing the phonemes of that waveform, the database having common address parameters that identify corresponding points in both portions, so that setting start and end values of the parameters to identify a segment in one portion identifies the corresponding segment in the other.
(6) The database of claim 5, wherein the access portion has windows five phonemes long; the access portion has a higher hierarchical level that is accessed by the central phoneme of a window in order to identify the second and fourth phonemes, the entries of the higher level being equivalent to strings of three phonemes; and the access portion also has a lower hierarchical level that is accessed by strings of three phonemes in order to identify the first and fifth phonemes, the entries of the lower level being equivalent to strings of five phonemes.
(7) A speech engine comprising a primary processor (11) for converting text in graphemes into equivalent text in phonemes and a converter (12) for converting the phoneme text into a digital waveform, characterized in that the converter (12) includes a database (13) according to claim 5 or 6.
(8) A telephone network including a speech engine according to claim 7, the speech engine being connected to the network so as to transmit its output to a remote location.
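The link between the two portions of the database, in which start and end parameters set on the phoneme side define a segment of the extended waveform, can be sketched as follows. This is a minimal illustration under assumptions that are not in the specification: the common address parameter is modelled as a hypothetical table of sample offsets at phoneme boundaries, and the class and function names (`LinkedDatabase`, `synthesise`) and all sample values are invented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LinkedDatabase:
    phonemes: List[str]     # access portion: the extended phoneme text
    boundaries: List[int]   # assumed link: sample offset where phoneme i begins,
                            # plus one final entry marking the end of the waveform
    waveform: List[int]     # output portion: the extended digital waveform

    def segment(self, start: int, end: int) -> List[int]:
        """Samples for phonemes start..end-1, selected by start/end parameters."""
        return self.waveform[self.boundaries[start]:self.boundaries[end]]


def synthesise(db: LinkedDatabase, spans: List[Tuple[int, int]]) -> List[int]:
    """Concatenate the waveform segments for the chosen spans, in input order."""
    output: List[int] = []
    for start, end in spans:
        output.extend(db.segment(start, end))
    return output


# Toy data: three phonemes occupying 2, 3 and 1 samples respectively.
db = LinkedDatabase(
    phonemes=["a", "b", "c"],
    boundaries=[0, 2, 5, 6],
    waveform=[10, 11, 20, 21, 22, 30],
)
signal = synthesise(db, [(1, 3), (0, 1)])
```

Because both portions share the same start and end indices, selecting a string of phonemes is enough to select its waveform segment; any smoothing at the joins would be left to a later stage such as the waveform processor 14.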