JPS62246097A

JPS62246097A - Word base form synthesizer for voice recognition

Info

Publication number: JPS62246097A
Application number: JP62053232A
Authority: JP
Inventors: Lalit Rai Bahl; Peter Vincent deSouza; Robert Leroy Mercer; Michael Alan Picheny
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1986-04-18
Filing date: 1987-03-10
Publication date: 1987-10-27
Also published as: EP0241768A3; DE3779170D1; US4882759A; EP0241768B1; EP0241768A2; JPH0372999B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] The invention is described in the following order.

A. Industrial Field of Application
B. Prior Art
C. Problems to Be Solved by the Invention
D. Means for Solving the Problems
E. Embodiment
E1. Speech Recognition System Environment
E11. General Description
E12. Construction of Phonetic Baseforms
E13. Construction of Fenemic Baseforms
E14. Training of Word Models
E2. Synthesis of Baseforms for Unuttered Words
F. Effects of the Invention

A. Industrial Field of Application

The present invention relates generally to speech recognition and, specifically, to synthesizing a given word model from other known word models.

B. Prior Art

In some speech recognition methods, each word in the vocabulary is represented by a word model. This word model is sometimes called the baseform of the word. For example, at IBM Corporation, an experimental speech recognizer represents each word as a sequence of Markov models. Note that each such sequence is itself a Markov model.

By using the word models together with the outputs generated in response to a speech input, the speech input is matched against the words in the vocabulary.

In one method, a set of models is defined. The baseform of every word in the vocabulary is constructed from models selected from that predefined set.

There also appear to be good reasons, however, for an alternative approach in which a word is represented by multiple baseforms.

In that case, each baseform is constructed from models selected from its own corresponding set of models. That is, a baseform constructed from models of a first model set can be used in the speech recognizer for one purpose, and a baseform constructed from models of a second model set can be used for another purpose. Moreover, several baseforms can be used together in performing acoustic matching or other functions.

In most large-vocabulary (e.g., 5000 words or more) speech recognition systems, the word baseforms are tailored to each user of the recognizer. That is, to determine the values of certain variables associated with each baseform, the user utters a (training) text consisting of known words.

Usually, each word baseform has its variables set directly from data generated during training. When words are represented by multiple baseforms (each constructed from models of a respective model set), a long training period has been required to provide enough data to "train" all the word baseforms, that is, to set their variable values.

A long training period is undesirable. The need to generate enough data to construct multiple baseforms for every word in a vocabulary has therefore come to be regarded as a problem to be overcome.

Furthermore, in some cases a baseform constructed from models of the second model set may already exist, or may be easier to form than a baseform constructed from models of the first model set. In addition, the baseform required by the speech recognition scheme may be one built from models selected from the first model set rather than a baseform of the second model set.

Conventionally, constructing all baseforms from the first model set has required training data for that purpose, regardless of whether the baseforms of the second set were known.

C. Problems to Be Solved by the Invention

It is therefore an object of the present invention to provide a technique for synthesizing baseforms of words that were not uttered during the training period.

D. Means for Solving the Problems

Specifically, the present invention assumes that certain known words are each represented by (a) a baseform constructed from word models of a first model set and (b) a baseform constructed from word models of a second model set. It is further assumed that, for the known words, these two baseforms can be aligned with each other. It is also assumed that other words are already represented, or can readily be represented, by baseforms constructed from models of the second model set. The present invention teaches a technique for synthesizing, after the training period, baseforms (constructed from models selected from the first model set) for such other words.

That is, from the baseforms generated during training, a correlation is established between models of the first model set and each given model of the second model set in a given context. When a given model in a given context appears in a "new" word that was not uttered during training, the corresponding segment of the "new" word is represented by the correlated models of the first model set. By representing each segment of the "new" word with correlated models of the first model set and concatenating the correlated models for successive segments of the "new" word, a baseform constructed from models of the first model set is synthesized.
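The correlation-and-concatenation procedure described above can be sketched roughly as follows. This is a minimal illustration, not the method as claimed: it assumes that each second-set model (here a phonetic phone) in the context of its left and right neighbors has already been correlated with a string of first-set models during training. The table contents, the `FP...` model names, the definition of context as immediate neighbors, and the fallback to a context-free entry are all hypothetical choices made for the example.

```python
from typing import Dict, List, Optional, Tuple

# Key: (left neighbor, phone, right neighbor). None marks a word boundary or,
# in the fallback entries, "any context" (an assumption of this sketch).
Context = Tuple[Optional[str], str, Optional[str]]


def synthesize_baseform(
    phones: List[str],
    correlated: Dict[Context, List[str]],
) -> List[str]:
    """Concatenate the first-set models correlated with each phone in context."""
    baseform: List[str] = []
    for i, phone in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i + 1 < len(phones) else None
        # Prefer the entry for the full context; otherwise fall back to a
        # context-free entry for the phone (hypothetical fallback rule).
        segment = correlated.get((left, phone, right))
        if segment is None:
            segment = correlated.get((None, phone, None))
        if segment is None:
            raise KeyError(f"no correlated models for {phone!r}")
        baseform.extend(segment)
    return baseform


# Hypothetical correlations derived from words that were uttered in training.
CORRELATED: Dict[Context, List[str]] = {
    (None, "DH", "UH1"): ["FP37", "FP38"],
    ("DH", "UH1", "XX"): ["FP152", "FP153", "FP153"],
    (None, "XX", None): ["FP176"],
}

new_word_baseform = synthesize_baseform(["DH", "UH1", "XX"], CORRELATED)
```

Here the last phone has no entry for its exact context, so the sketch uses the context-free entry; how unseen contexts are actually handled is not specified in this section.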

The present invention thereby achieves the object of synthesizing certain baseforms constructed from models of the first model set, on the basis of known baseforms, without requiring any further training.

The present invention also achieves the object of generating different baseforms for the same word without training each baseform independently.

The present invention also achieves the object of deriving baseforms of models that are preferred for use in speech recognition computations from baseforms that may be easy to generate or form but are believed to be computationally less efficient.

Furthermore, where phonetically based baseforms of the second model set are already known or readily determined for the words in a vocabulary, but baseforms of models related to the outputs of the acoustic processor would improve recognition accuracy, speed, or both, the present invention provides a technique for synthesizing output-related model baseforms for certain words without the need for training.

E. Embodiment

E1. Speech Recognition System Environment

E11. General Description

FIG. 1 shows a general block diagram of a speech recognition system 1000. The system 1000 includes a stack decoder 1002 and, connected to it, an acoustic processor 1004, an acoustic matching element 1006 (preferably an array processor), and a language model processor 1010. The language model processor 1010 determines word likelihoods based on some, preferably contextual, criteria. Acoustic matching techniques and language models are described in various articles. For example, the following articles, which are cited here, discuss various aspects of speech recognition and modeling techniques: L. R. Bahl, F. Jelinek, and R. L. Mercer, "Continuous Speech Recognition by Statistical Methods," Proceedings of the IEEE, Vol. 64, pp. 532-556 (1976); and "A Maximum Likelihood Approach to Continuous Speech Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, March 1983.

The acoustic processor 1004 is designed to convert a speech waveform into a string of outputs. Speech is characterized by a vector "space" whose vector components correspond to features selected by the processor 1004.

Conventionally, such features have included energy amplitudes at various frequencies of the speech spectrum. The acoustic processor 1004 stores a plurality of prototype vectors, each having a predetermined value for each component. A speech signal input enters the acoustic processor 1004 and is preferably divided into successive time intervals. Each time interval is assigned an output vector based on the values of the various features during that interval. The output vector for each time interval is compared with each prototype vector, and a distance is measured for each prototype vector. The distance can be measured by a conventional vector distance measure.

Each time interval is then associated with a particular prototype vector or with some other output-related function.

For example, each prototype vector can be identified by a label or symbol, referred to as a "feneme." The name derives from the minute sound units obtained at the front-end (FE) processor. In that case, the acoustic processor 1004 outputs one feneme for each time interval.

Thus, for a given speech input, the acoustic processor 1004 generates a string of fenemes. Preferably, an alphabet contains on the order of 200 different fenemes (or labels). In that case, one of the 200 fenemes is selected for each time interval.
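The labeling step described above amounts to vector quantization and can be sketched as follows. This is only an illustrative sketch: the text says a conventional vector distance measure may be used, so Euclidean distance is an assumption here, and the function names and the tiny two-dimensional prototypes are hypothetical. Labels are taken to be 1-based prototype indices.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance (assumed; any conventional vector distance could be used)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def label_intervals(
    outputs: Sequence[Sequence[float]],
    prototypes: Sequence[Sequence[float]],
) -> List[int]:
    """Return, for each time interval, the 1-based label of the closest prototype."""
    labels: List[int] = []
    for vector in outputs:
        distances = [euclidean(vector, proto) for proto in prototypes]
        labels.append(distances.index(min(distances)) + 1)
    return labels


# Hypothetical prototypes; a real alphabet would hold on the order of 200.
PROTOTYPES = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
# One output vector per time interval (hypothetical feature values).
OUTPUTS = [[0.1, 0.2], [4.0, 4.5], [0.9, 1.2]]

feneme_string = label_intervals(OUTPUTS, PROTOTYPES)
```

With an alphabet of about 200 prototypes, the same loop yields one of 200 labels per interval, giving the feneme string that is passed on for decoding.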

A specific type of acoustic processor is described in Japanese Unexamined Patent Publication No. 61-126600. In the invention of that publication, the features selected for the vector components are derived from a unique model of the human ear.

Each vector component corresponds to an estimated neural firing rate for a respective frequency band.

The fenemes from the acoustic processor 1004 enter the stack decoder 1002. The stack decoder 1002 defines one or more likely word paths and extends each likely word path with a likely next word. The likely word paths and likely next words are determined in part from the labels generated by the acoustic processor 1004. A novel type of stack decoder is disclosed in Japanese Patent Application No. 61-32049.

In determining a likely next word or, more specifically, a list of candidate words that have a relatively high likelihood of coming next on a path, the fenemes from the acoustic processor 1004 are sent to the acoustic matching element 1006.

Various types of acoustic matching are described in Japanese Patent Application No. 60-255205.

The acoustic matching element 1006 operates on the basis of word models. Specifically, as described in the patent application cited immediately above, acoustic matching is performed by characterizing words as sequences of probabilistic finite state machines.

These finite state machines are also called Markov models.

Generally, there is a set of Markov models in which each Markov model corresponds to a respective speech category. For example, each Markov model can correspond to an element of the International Phonetic Alphabet. The phonetic symbol AA0 then has a corresponding phonetic Markov model, as do AE0 and AE1, and so on through ZX.

When phonetic Markov models are used, each word is first defined by a sequence of phonetic elements.

The phonetic baseform of a word is constructed by concatenating the phonetic models corresponding to the phonetic elements of the word.

Each phonetic Markov model is preferably represented by a structure such as that shown in FIG. 2. Specifically, the phonetic Markov model of FIG. 2 includes (a) seven states S1 through S7; (b) thirteen transitions tr1 through tr13; (c) a probability for each transition, P(tr1) through P(tr13) (only P(tr1) is shown in the figure); and (d) label output probabilities at transitions tr1 through tr10. Each label output probability corresponds to the likelihood that a given label is generated at a given transition of the phonetic Markov model. This likelihood is determined during the training period. For example, based on the utterance of a known text corresponding to a prescribed sequence of phonetic elements, the likelihood that label 1 is generated at transition tr1 of a particular phonetic model (e.g., the phonetic model for AA0) is determined and identified as P1(1). The likelihood that label 200 is generated at transition tr3 is likewise determined and identified as P3(200).

Similarly, based on the training data, the label output probability of each label at each of transitions tr1 through tr10 is determined and identified for each phonetic model.

Transitions tr11 through tr13 are null transitions. No label is generated at a null transition, so no label output probabilities are assigned to them.

The transition probabilities of all transitions tr1 through tr13 are also derived from data generated during training, by applying the well-known forward-backward algorithm.

By way of brief explanation, FIG. 2 represents a phonetic element such as AA0 and shows how the utterance of the AA0 sound can follow various paths from state S1 to state S7.

If transition tr11 is followed, the AA0 phonetic element generates no label. Alternatively, a path from state S1 to state S2 or to state S4 can be followed.

If either of these paths is taken, a label is generated. These alternative paths are shown in FIG. 3.

In FIG. 3, the horizontal axis represents the time intervals during which labels are generated. Solid lines show the transitions that can occur within the model during a label interval. Dotted lines show the null transitions that can be followed.

FIG. 4 shows a trellis depicting the phonetic Markov model at successive label intervals beginning at start time t0. Time t0 corresponds to the time at which the first label in the string is generated by the acoustic processor 1004. With t0 corresponding to state S1, various paths leading to the final state S7 are illustrated by way of example. In one path, state S1 leads to state S2 and from there to state S3; that is, two non-null transitions are followed. From state S3 to state S7 there is a path along which no label is generated and a path that follows non-null transitions. It will be recognized that, for a given label string, there are various paths along the transitions of one or more phonetic models.

A phonetic Markov word model is shown in FIG. 5.

In FIG. 5(a), the word "THE" is shown, according to one of its pronunciations, as three phonetic elements in sequence. The phonetic elements are DH, UH1, and XX.

In FIG. 5(b), the phonetic Markov models for DH, UH1, and XX are concatenated to form the phonetic word baseform of the word "THE".

A trellis such as that of FIG. 4 can be extended to include all the labels generated in response to a given speech input (e.g., the word "THE"). Note that in the extended trellis, probabilities are assigned to the transitions between states, and label output probabilities are also assigned to the transitions.

The process of evaluating word likelihood includes determining which word model has the highest likelihood of producing the labels generated (by the acoustic processor 1004 in response to the speech input) at times t0, t1, and so on. A detailed description of how the acoustic matching element is used to determine word likelihoods is given in the aforementioned patent application relating to the acoustic matching element.

In addition to the phonetic models used to construct phonetic baseforms, fenemic Markov models have also been used in the acoustic matching element. Specifically, instead of relatively complex phonetic models such as that of FIG. 2, a set of Markov models based on fenemes has been used. A fenemic Markov model is shown in FIG. 6. It has a simple structure comprising two states, S1 and S2, and three transitions. One non-null transition extends from S1 to S2, and a second non-null transition is a self-loop from state S1 back to itself. A null transition extends from state S1 to state S2. Each of the three transitions is assigned a probability, and each of the two non-null transitions has label output probabilities derived from data generated during the training period. A trellis based on fenemic models is shown in FIG. 7.
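To make the two-state structure concrete, the following sketch computes the probability that a concatenation of fenemic models produces a given label string, by summing over all paths through the corresponding trellis (a standard forward pass). The class and function names and all numeric probabilities are hypothetical; only the topology (one self-loop and one forward non-null transition, plus one null transition from S1 to S2) is taken from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FenemicModel:
    p_loop: float                   # non-null self-loop S1 -> S1
    p_forward: float                # non-null transition S1 -> S2
    p_null: float                   # null transition S1 -> S2 (no label)
    out_loop: Dict[int, float]      # label output probabilities on the self-loop
    out_forward: Dict[int, float]   # label output probabilities on S1 -> S2


def forward_probability(baseform: List[FenemicModel], labels: List[int]) -> float:
    """Probability that the concatenated models generate exactly `labels`."""
    n, t_max = len(baseform), len(labels)
    # alpha[j][t]: probability of being at the start of model j after t labels.
    alpha = [[0.0] * (t_max + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for t in range(t_max + 1):
        for j, model in enumerate(baseform):
            a = alpha[j][t]
            if a == 0.0:
                continue
            # Null transition: advance to the next model without consuming a label.
            alpha[j + 1][t] += a * model.p_null
            if t < t_max:
                label = labels[t]
                alpha[j][t + 1] += a * model.p_loop * model.out_loop.get(label, 0.0)
                alpha[j + 1][t + 1] += (
                    a * model.p_forward * model.out_forward.get(label, 0.0)
                )
    return alpha[n][t_max]


# Hypothetical model that always emits label 1 on its non-null transitions.
MODEL = FenemicModel(0.2, 0.7, 0.1, {1: 1.0}, {1: 1.0})
single = forward_probability([MODEL], [1])        # forward, or loop then null
double = forward_probability([MODEL, MODEL], [1])
```

For the single model and the one-label string, the two contributing paths are the forward transition (0.7) and the self-loop followed by the null transition (0.2 x 0.1), which sum to 0.72.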

In FIG. 8, a plurality of fenemic Markov models are concatenated, as they would be in forming the fenemic baseform of a word.

The notation of FIG. 8 is briefly considered here. FP200 refers to the fenemic phone corresponding to the 200th feneme of the feneme alphabet (feneme set), which typically contains 200 distinct fenemes.

Similarly, FP10 corresponds to the 10th feneme of the feneme alphabet. Concatenating FP200, FP10, and so on yields the fenemic baseform of a word. Each feneme typically lasts 0.01 second, and a typical spoken word averages 80 to 100 fenemes in length. Moreover, because each fenemic model produces on average about one feneme, a typical fenemic baseform is about 80 to 100 fenemic models long. The probability of the second transition of FP200 is denoted P(tr2_F200).

The probability that the FP200 model generates label 1 at its second transition is denoted P'_F200(1). The FP200 model may in fact be skewed toward generating the 200th feneme. Because of variations in pronunciation, however, the FP200 model also has some probability of generating other fenemes.

The following two sections outline methods for constructing word baseforms from phonetic Markov models and from fenemic Markov models, respectively. Comparing the two types of baseform, the phonetic baseform contains fewer concatenated models, but the computation required for phonetic models is significantly greater than that required for fenemic models. Moreover, phonetic baseforms are specified by phoneticians, whereas fenemic baseforms have been constructed automatically, without the intervention of a phonetician, as described in the patent applications cited in Section E13.

E12. Construction of Phonetic Baseforms

For each word there is a sequence of phonetic sounds, each of which has a corresponding phonetic model (also called a phonetic "phone machine"). Preferably, at each non-null transition, some probability is associated with the generation of each feneme (the feneme alphabet is shown in Table 1). The transition probabilities and feneme probabilities of the various phonetic phone machines are determined during training by recording the feneme strings generated when known phonetic sounds are uttered at least once and by applying the well-known forward-backward algorithm.

TABLE 1: Feneme alphabet (200 fenemes numbered 001 through 200; only entries legible in the source are listed)

001 AA11   002 AA12   003 AA13   004 AA14   005 AA15
006 AE11   007 AE12   008 AE13   009 AE14   010 AE15
011 AW11   012 AW12   013 AW13   014 AX11   015 AX12
016 AX13   017 AX14   018 AX15   019 AX16   020 AX17
021 BQ1-   022 BQ2-   023 BQ3-   024 BQ4-   025 BX1-
026 BX10
037 DH1-   038 DH2-   039 DQ1-   040 DQ2-   041 DQ3-
042 DQ4-   043 DX1-   044 DX2-
148 TX5-   149 TX6-   150 UH01   151 UH02   152 UH11
153 UH12   154 UH13   155 UH14   156 UU11   157 UU12
158 UXG1   159 UXG2   160 UX11   161 UX12   162 UX13
163 VX1-   164 VX2-   165 VX3-   166 VX4-   167 WX1-
168 WX2-   169 WX3-   170 WX4-   171 WX5-   172 WX6-
173 WX7-

As an example, sample statistics for the phone DH are shown in Table 2. As an approximation, the label output probability distributions for transitions tr1, tr2, and tr8 of the phone machine of FIG. 2 are represented by a single distribution; those for transitions tr3, tr4, tr5, and tr9 by a single distribution; and those for transitions tr6, tr7, and tr10 by a single distribution. This is shown in Table 2 by the assignment of arcs (i.e., transitions) to columns 4, 5, or 6, respectively. Table 2 shows the probability of each transition and the probability that a given label (i.e., feneme) is generated at the beginning, middle, or end of the phonetic element (i.e., "phone") DH. For the phone DH, for example, the probability of the transition from state S1 to state S2 is counted as 0.07243. The probability of the transition from state S1 to state S4 is 0.92757 (these being the only two possible transitions from the initial state, their sum is 1). As for label output probabilities, the phone DH has a probability of 0.091 of generating the feneme AE13 (see Table 1) in its end portion, i.e., column 6 of Table 2. Also in Table 2, a count is associated with each node (or state). The node count indicates the number of times during training that the phone was in the corresponding state. Statistics such as those in Table 2 are found for each phonetic model, or phonetic phone machine.
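The sharing of one label output distribution among several transitions, as in the beginning, middle, and end columns described above, can be sketched as follows. The grouping of transitions is taken from the text. Everything else is an assumption for illustration: the function name, the event data, and the use of simple hard counts (the forward-backward algorithm mentioned in the text would normally yield fractional counts instead).

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Transition grouping stated in the text for the phone machine of FIG. 2.
GROUP_OF_TRANSITION: Dict[int, str] = {
    1: "begin", 2: "begin", 8: "begin",
    3: "middle", 4: "middle", 5: "middle", 9: "middle",
    6: "end", 7: "end", 10: "end",
}


def tied_output_distributions(
    events: Iterable[Tuple[int, str]],
) -> Dict[str, Dict[str, float]]:
    """Pool (transition, feneme) observations per group and normalize."""
    counts: Dict[str, Counter] = {
        "begin": Counter(), "middle": Counter(), "end": Counter()
    }
    for transition, feneme in events:
        counts[GROUP_OF_TRANSITION[transition]][feneme] += 1
    distributions: Dict[str, Dict[str, float]] = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        distributions[group] = (
            {feneme: n / total for feneme, n in counter.items()} if total else {}
        )
    return distributions


# Hypothetical observations for one phone: (transition number, feneme label).
EVENTS = [(6, "AE13"), (7, "AE13"), (10, "AX11"), (7, "AX11"), (1, "DH1-")]
tied = tied_output_distributions(EVENTS)
```

Tying reduces the number of distributions to estimate per phone from ten (one per non-null transition) to three, which is presumably why the approximation is used.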

The arrangement of phonetic phone machines into sequences forming word baseforms is usually performed by a phonetician and is normally not done automatically.

TABLE 2: Sample statistics for the phone DH

E13. Construction of Fenemic Baseforms

The probability associated with each transition, and the probability associated with each label at a transition of a fenemic model such as that shown in FIG. 6, are determined during the training period in a manner similar to that in which phonetic models are trained in phonetic baseforms.

Fenemic word baseforms are constructed by concatenating fenemic phones. One method is described in U.S. patent application Ser. No. 697,174, filed February 1, 1985. Preferably, the fenemic baseform of a word is grown from multiple utterances of that word.

This is described in U.S. patent application S.N. 06/738,933, the disclosure of which is incorporated herein by reference to the extent required for adequate disclosure of the present invention. Briefly, one method of growing a word baseform from multiple utterances includes the following steps.

(a) Transform each of multiple utterances of the word segment into a respective feneme string.

(b) Define a set of fenemic Markov model phone machines.

(c) Determine the single phone machine P1 that is best for producing the multiple feneme strings.

(d) Determine the best two-phone baseform, of the form P1P2 or P2P1, for producing the multiple feneme strings.

(e) Align the best two-phone baseform against each feneme string.

(f) Split each feneme string into a left portion, corresponding to the first phone machine of the two-phone baseform, and a right portion, corresponding to the second phone machine of the two-phone baseform.

(g) Identify each left portion as a left substring and each right portion as a right substring.

(h) Process the set of left substrings in the same manner as the set of feneme strings corresponding to the multiple utterances, with the further step of inhibiting additional splitting of a substring when its single-phone baseform has a higher probability of producing the substring than does the best two-phone baseform.

(j) Process the set of right substrings in the same manner as the set of feneme strings corresponding to the multiple utterances, with the further step of inhibiting additional splitting of a substring when its single-phone baseform has a higher probability of producing the substring than does the best two-phone baseform.

(k) Concatenate the unsplit single phones in an order corresponding to the order of the feneme substrings to which they correspond.
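The splitting procedure of steps (a) through (k) can be sketched roughly as follows. This is only an illustration under strong simplifying assumptions: the actual procedure scores candidate phone machines by Markov-model likelihood, whereas here each "phone machine" is simply a feneme label and the score is a plain match count. The function names, the toy score, and the decision to stop splitting on ties are all hypothetical and do not come from the source.

```python
from itertools import product


def best_single(strings):
    """Step (c): choose the single phone that best explains all strings.

    Toy score (an assumption): number of fenemes equal to the phone label.
    """
    alphabet = sorted({f for s in strings for f in s})
    return max((sum(s.count(p) for s in strings), p) for p in alphabet)


def best_cut(s, p, q):
    """Steps (e)-(f): split point giving phone p the left part and q the right part."""
    return max(range(1, len(s)), key=lambda c: s[:c].count(p) + s[c:].count(q))


def best_pair(strings):
    """Step (d): choose the best two-phone baseform (p, q) under the toy score."""
    alphabet = sorted({f for s in strings for f in s})

    def score(p, q):
        total = 0
        for s in strings:
            c = best_cut(s, p, q)
            total += s[:c].count(p) + s[c:].count(q)
        return total

    return max((score(p, q), p, q) for p, q in product(alphabet, repeat=2))


def grow_baseform(strings):
    """Steps (c)-(k): recursively split until one phone explains each part."""
    strings = [tuple(s) for s in strings]  # assumes every string is non-empty
    single_score, single = best_single(strings)
    if min(len(s) for s in strings) < 2:
        return [single]                    # a one-feneme string cannot be split
    pair_score, p, q = best_pair(strings)
    if single_score >= pair_score:         # steps (h)/(j): inhibit further splitting
        return [single]
    lefts, rights = [], []
    for s in strings:
        c = best_cut(s, p, q)
        lefts.append(s[:c])                # step (g): left substring
        rights.append(s[c:])               # step (g): right substring
    # Step (k): concatenate the unsplit phones in substring order.
    return grow_baseform(lefts) + grow_baseform(rights)
```

Under this toy score, `grow_baseform([("A", "A", "B", "B"), ("A", "B", "B")])` returns `["A", "B"]`. A likelihood-based scorer would replace `best_single`, `best_cut`, and `best_pair` while leaving the recursion unchanged.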

Baseform models are trained (or filled with statistics) by speaking known utterances into an acoustic processor, which generates a string of labels in response. Based on the known utterances and the generated labels, the statistics for the word models are derived by the forward-backward algorithm discussed in the articles cited above.

FIG. 7 shows a trellis corresponding to a fenemic phone. This trellis is much simpler than the trellis of FIG. 4, which relates to the phonetic model scheme.

Both phonetic baseforms and fenemic baseforms can be used in an acoustic matching element and for other speech recognition purposes.

E14. Training of Word Models

A preferred training method is taught in the co-pending U.S. patent application entitled "Improving the Training of Markov Models Used in a Speech Recognition System," invented by L. R. Bahl, P. F. Brown, P. V. DeSouza, and R. L. Mercer and assigned to IBM Corporation. That disclosure is incorporated herein by reference.

In that disclosure, training involves determining the statistics for each word baseform in a manner that increases the probability of the correct word relative to the probabilities associated with other words. The idea is to maximize the difference between the probability of the correct script of the uttered words given the label outputs and the probability of any other (incorrect) script, rather than to maximize the probability of the labels given the script, as other approaches do.

That method applies to a system for decoding a vocabulary word from outputs selected from an alphabet of outputs in response to a communicated speech input, in which each word in the vocabulary is represented by a baseform of at least one probabilistic finite-state model and each probabilistic finite-state model has transition probability items and output probability items. In such a system, the method determines probability item values and includes the step of biasing at least some of the stored probability item values so that the likelihood that the outputs generated in response to the communication of a known word are produced by the baseform of that known word is increased relative to the likelihood that the generated outputs are produced by the baseform of at least one other word.

Each word (or each distinct pronunciation of a word, referred to as a "lexeme") is preferably represented by one or more probabilistic finite-state machines (or models) in sequence. Each machine corresponds to a "phone" from a set of phones. Each phone correlates to a phonetic element, a label (or feneme), or some other predefined characterization of speech for which a Markov model or similar model can be specified.

The training script is typically composed of a series of known words.

According to the training method described here, the probability values associated with the probability items are evaluated as follows.

An estimated value $\theta'$ is set for each probability item.

Given the estimated values $\theta'$ and the labels generated during training, values called "single counts" are determined. A "single count" generally relates to the (expected) number of times an event occurs, based on the training data. One specific definition of a single count is the probability of a particular transition $\tau_i$ and state $S_j$ given (a) a certain string of labels $Y$, (b) the defined estimated values $\theta'$, and (c) a particular time $t$. The single counts are determined by applying the well-known forward-backward algorithm, also known as the Baum-Welch algorithm.

According to the above definition, the single count can be expressed as

$$\Pr(S_j, \tau_i \mid Y, \theta', t)$$

Summing the single counts for a specific $S_j$, $\tau_i$, $Y$, and $\theta'$ over each time $t$ gives the "transition cumulative count" for the corresponding transition probability item. Because the transition cumulative count is a sum of probabilities, its value may exceed 1. The respective cumulative count is preferably stored for each transition probability item. Dividing the cumulative count for a given transition by the sum of the cumulative counts of all transitions that can be taken from state $S_j$ gives the current probability value for the respective transition probability item. The current probability value is preferably stored in association with its transition probability item.
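The normalization just described can be illustrated with a short sketch. It assumes the single counts have already been produced by a forward-backward pass and are supplied as `(state, transition, time, value)` tuples; that input layout and the function name `transition_probabilities` are assumptions made for illustration, not part of the source.

```python
from collections import defaultdict


def transition_probabilities(single_counts):
    """Turn per-time single counts into transition probabilities.

    single_counts: iterable of (state, transition, time, value) tuples, where
    value stands for the single count Pr(S_j, tau_i | Y, theta', t).
    """
    # Transition cumulative count: sum of single counts over all times t.
    cumulative = defaultdict(float)
    for state, transition, _time, value in single_counts:
        cumulative[(state, transition)] += value

    # Sum of cumulative counts over all transitions leaving each state.
    per_state = defaultdict(float)
    for (state, _transition), count in cumulative.items():
        per_state[state] += count

    # Current probability value of each transition probability item.
    return {
        key: count / per_state[key[0]]
        for key, count in cumulative.items()
        if per_state[key[0]] > 0
    }
```

For instance, single counts of 0.5 and 1.0 for one transition and 0.5 for the only other transition out of the same state give cumulative counts of 1.5 and 0.5, and therefore probabilities of 0.75 and 0.25.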

For label output probability items, single counts are again summed. For each such item, the single counts are summed over all label times at which the generated label in the string is the label corresponding to that label output probability item. This sum is the "label output cumulative count" and is preferably stored in association with the corresponding label output probability item. Dividing this cumulative count by the sum of the single counts over all label times gives the current probability value for the respective label output probability item.
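A minimal sketch of the label output computation follows, again assuming the single counts are already available. The input layout (one list of per-time counts per transition, plus the label observed at each time) and the name `label_output_probabilities` are hypothetical; the label `DH2` used in the test data is likewise made up.

```python
from collections import defaultdict


def label_output_probabilities(single_counts, labels):
    """Estimate label output probabilities from single counts.

    single_counts: dict mapping a transition key to a list of single counts,
                   one per label time t.
    labels:        list giving the label generated at each time t.
    """
    probabilities = {}
    for transition, counts in single_counts.items():
        total = sum(counts)  # single counts summed over all label times
        if total == 0:
            continue         # nothing observed for this transition
        # Label output cumulative count: only times where the label matches.
        per_label = defaultdict(float)
        for t, value in enumerate(counts):
            per_label[labels[t]] += value
        for label, cumulative in per_label.items():
            probabilities[(transition, label)] = cumulative / total
    return probabilities
```

By construction, the label output probabilities for any one transition sum to 1.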

According to the method of the patent application cited above, a training script of uttered known words, an initial probability value for each probability item, and a list of candidate words for each word uttered during training are prescribed.

The list of candidate words is defined by a procedure such as fast approximate acoustic matching. For any known word that is uttered, there is the "correct" known word and an "incorrect" word (preferably the incorrect word having the highest likelihood of being wrongly decoded as the known word).

The current probability values of the probability items are determined by first computing a "plus count value" and a "minus count value" for each probability item in the correct word baseform or the incorrect word baseform.

The plus count value is added to the cumulative count of the corresponding probability item (for each probability item), and the minus count value is then subtracted from that cumulative count.

A plus count value is computed for each probability item in the baseform of the correct (i.e., known) word by applying the well-known forward-backward algorithm and preferably scaling the resulting statistics. Adding the plus count value biases the count values (and the probability items derived from them) in favor of the string Y, making Y appear to be a relatively more likely output of the correct word model.

The minus count value for a given probability item is computed by applying the forward-backward algorithm as if the incorrect word had been spoken to produce the string of labels. The minus count value derived from a single utterance of the known word is subtracted from the most recent value of the corresponding cumulative count (before or after the plus count value is added). This subtraction biases the cumulative counts used in computing the probability items of the incorrect word baseform away from the string Y.
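The count adjustment can be sketched as below. The plus and minus count values are taken as given (in practice each would come from a forward-backward pass over the correct or incorrect baseform). The `floor` parameter, which keeps adjusted counts from going negative, is an assumption of this sketch; the text above does not say how negative results are handled. The name `adjust_counts` is also hypothetical.

```python
def adjust_counts(cumulative, plus, minus, floor=0.0):
    """Apply plus and minus count values to the cumulative counts.

    cumulative: dict mapping probability item -> current cumulative count
    plus:       dict of plus count values (from the correct word baseform)
    minus:      dict of minus count values (from the incorrect word baseform)
    floor:      lower bound on the result (an assumption, not from the source)
    """
    adjusted = {}
    for item, value in cumulative.items():
        # Bias toward the label string for the correct word, away from it
        # for the incorrect word.
        new_value = value + plus.get(item, 0.0) - minus.get(item, 0.0)
        adjusted[item] = max(floor, new_value)
    return adjusted
```

Probability values would then be recomputed from the adjusted counts by the same normalization used for the original cumulative counts.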

Based on these adjusted cumulative counts, the probability values for the probability items are recomputed and stored so as to improve decoding accuracy.

By following the above steps for each word in the vocabulary, the stored count values and probability values are adjusted to improve decoding accuracy.

The method discussed above serves to improve count values determined by other techniques, so as to increase the accuracy with which speech is decoded into recognized words of the vocabulary.

E2. Synthesis of Baseforms for Unuttered Words

FIG. 9 illustrates the general method of the invention. In step 2002, the words in the training text are represented by phonetic baseforms. Specifically, each word uttered during the training period is characterized, normally by a phonetician, as a sequence of phonetic elements as defined by the International Phonetic Alphabet. Each phonetic element is represented by a corresponding phonetic model. Hence, for each word there is a corresponding sequence of phonetic models, as described above in section E12. This sequence represents the phonetic baseform.

As described above in section E13, a word can also be represented by a fenemic baseform constructed from a sequence of fenemic models. In step 2004, the words in the training text are represented by fenemic baseforms.

Fenemes are recognized as being "output-related"; that is, fenemes are outputs generated by an acoustic processor, such as processor 1004. Fenemic models are accordingly "output-related models". In this regard, it should be noted that other output-related models may be used instead.

For example, "output-related models" may be defined based on simple output vectors or on other selectable characteristic outputs of speech that an acoustic processor provides as output.

The phonetic models that occur in the training text occur in various phonetic model contexts. In the embodiment described here, the "phonetic model context" is defined by the phonetic model immediately preceding and the phonetic model immediately following the subject phonetic model. That is, for a sequence of phones, the context of the subject phonetic model at position Pi is determined by the phonetic models at positions P(i-1) and P(i+1). A particular subject phonetic model may occur in any of a plurality of contexts. Assume that there are 70 phonetic elements in the set of phonetic elements (which, for purposes of this discussion, includes one element corresponding to silence). Any (non-silence) phonetic model may then be preceded by any of the 70 phonetic models and followed by any of the 70 phonetic models.

For a given phonetic model there are therefore 70 × 70 = 4900 possible contexts.

According to one embodiment of the invention, each of the numerous possible contexts of each phonetic model is allocated a position in storage.

In the preferred embodiment discussed below, however, only selected contexts are entered into storage. In either case, a plurality of contexts can be identified for the m-th phonetic model Π_m of the set of phonetic models. In storage, the phonetic model and its context are noted as Π_m,c in step 2006.

Preferably, for every word uttered in the training text there is both a fenemic word baseform and a phonetic word baseform. In step 2008, the well-known Viterbi alignment procedure is applied. That is, each successive phonetic model in the phonetic baseform of a given word is correlated with a corresponding string of fenemic models in the fenemic baseform of that word. The Viterbi alignment procedure is described in detail in the article by F. Jelinek cited above.

If a phonetic model in a given context is uttered only once, one string of fenemic models is aligned against it. However, if, as is preferred in this embodiment, a phonetic model in a given context is uttered several times during the training period, different strings of fenemic models are likely to be aligned against the same phonetic model. The different strings corresponding to utterances of the same phonetic model in the same context arise from differences in pronunciation. That is, the acoustic processor 1004 (of FIG. 1) interprets the pronunciations as differing, so that different label outputs (i.e., fenemes), and hence different feneme strings, are generated.

To compensate for the different feneme strings that result from multiple utterances, an average or composite fenemic baseform is constructed.

The method of constructing a composite fenemic baseform from multiple utterances is readily understood from section E14 and the latter patent application cited therein.

Whether the contextualized phonetic model Π_m,c is uttered once or several times, a respective string of fenemic models is associated with Π_m,c. The association of a feneme string with the corresponding phonetic model Π_m,c in the text is performed in step 2010.

As noted above, if each phonetic model is uttered in every possible context, 4900 entries are stored for each phonetic model. With 70 phonetic models, this yields 4900 × 70 = 343,000 entries in storage. As noted below, so large a number of entries increases the time required for training, which is undesirable in a typical speech recognition environment.
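The arithmetic behind these figures is simple enough to check directly. The helper below is a hypothetical illustration, not part of the described system; it reproduces the counts for an alphabet of 70 phonetic elements.

```python
def context_table_size(num_phones):
    """Return (contexts per phonetic model, total table entries).

    Each phonetic model can be preceded and followed by any of the
    num_phones models, so there are num_phones ** 2 contexts per model.
    """
    contexts_per_phone = num_phones * num_phones
    total_entries = contexts_per_phone * num_phones
    return contexts_per_phone, total_entries


per_phone, total = context_table_size(70)
print(per_phone, total)  # 4900 343000
```

Because the total grows with the cube of the alphabet size, storing only the contexts actually observed in training keeps the table far smaller.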

Accordingly, rather than providing a fenemic model string for every possible context, the preferred mode provides fenemic model strings for only some of the 343,000 possible combinations.

Only selected contexts are uttered during the training period, and only the fenemic model strings for those contexts are associated with them. The fenemic model string associated with a contextualized phonetic model Π_m,c is stored as a table entry (see step 2012).

To construct fenemic baseforms for "new" words that were not uttered during the training period, steps 2014, 2016, and 2018 are performed. In step 2014, the new word is represented as a string of phonetic models, each in a defined context. Each phonetic model Π′_m,c of the new word is then correlated with a stored contextualized phonetic model Π_m,c.

If entries are stored for all 343,000 context combinations, the correlation is one-to-one. If only selected entries are stored, the correlation of step 2014 is a close-match process, as discussed in more detail below.

In step 2016, each phonetic model Π′_m,c of the new word is represented by the feneme string associated with the correlated phonetic model Π_m,c.

This procedure is carried out for each phonetic model of the new word, and the resulting feneme strings are concatenated in step 2018 to give the fenemic baseform of the new word.

FIG. 10 shows a specific method for breaking a fenemic baseform into pieces the size of phonetic elements (as required by the steps of FIG. 9).

In FIG. 10, the first word (I←1) is taken up (step 2100). The phonetic baseform PB_I of the first word is known, and one or more fenemic baseforms FB_I are generated for it during training. The well-known Viterbi alignment procedure is applied to align each fenemic baseform of the first word with the phonetic baseform (step 2102). In step 2104, the first phonetic element (j←1) is taken up. If there is more than one fenemic baseform for the word (step 2106), a single representative feneme string must be determined for the j-th phonetic element (see step 2108). Whether formed from one fenemic baseform or from several, the piece of the word (Pj) corresponding to the j-th phonetic element has a feneme string F(Pj) associated with it. F(Pj) is preferably represented by a number or other identifier that corresponds to the feneme model string in question. This is done in step 2110. In step 2112 the value of j is incremented. If j exceeds the number of phonetic elements in the phonetic baseform (step 2114), the next word is selected at step 2116 and the procedure resumes at step 2102. If j does not exceed the number of phonetic elements, steps 2106 through 2114 are repeated for the next phonetic element in the phonetic baseform.
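The loop of FIG. 10 can be sketched as a table-building routine. The sketch assumes the Viterbi alignment has already been carried out, so that each training word arrives as a list of phonetic elements together with the feneme segment aligned to each element. Choosing the most frequent segment as the "single representative" string is a stand-in for the composite-baseform procedure, and the `SILENCE` boundary marker, the function name, and the feneme identifiers in the test are all assumptions of this illustration.

```python
from collections import Counter, defaultdict


def build_context_table(aligned_words, boundary="SILENCE"):
    """Map (previous, subject, next) phonetic contexts to feneme strings.

    aligned_words: iterable of (phones, segments) pairs, where segments[j]
                   is the feneme sequence aligned to phones[j].
    """
    observed = defaultdict(Counter)
    for phones, segments in aligned_words:
        padded = [boundary] + list(phones) + [boundary]
        for j, segment in enumerate(segments, start=1):
            # Context of the j-th phonetic element: its two neighbors.
            context = (padded[j - 1], padded[j], padded[j + 1])
            observed[context][tuple(segment)] += 1
    # One representative feneme string per context (most frequent, assumed).
    return {ctx: seen.most_common(1)[0][0] for ctx, seen in observed.items()}
```

The resulting dictionary plays the role of the stored table: one entry per contextualized phonetic element that actually occurred in training.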

FIG. 11 shows the sample word "CAT" represented as a string of phonetic elements, which are computer-readable representations of symbols in the standard International Phonetic Alphabet. For purposes of this disclosure, assume that the word "CAT" was not uttered during the training period and that a fenemic baseform for "CAT" is sought. How the fenemic baseform for the word "CAT" is synthesized according to the invention is discussed below.
Assume that AT'' is not uttered during the training period and that we are exploring the feeneem base form of C T''. In the lower part, we consider how the feeneem base form of the word "'CAT IT" is synthesized according to the present invention.

For each phonetic element in the word "CAT" there is a corresponding phonetic model, such as the models shown in FIG. 2. The probabilities assigned to the various transitions and label outputs are derived from statistics generated during the training period, as outlined in section E14.

FIG. 12 shows a portion of a storage table containing four columns. The first column holds the subject phonetic element, denoted Π_m, where m ranges from 1 to 70 (for an alphabet of 70 phonetic elements). For each phonetic element there are a plurality of identified contexts. In this embodiment, the context of the phonetic element at position Pi is based on the phonetic element at the preceding position P(i-1) and the phonetic element at the following position P(i+1). The second column holds the stored phonetic element that precedes the subject phonetic element. The third column holds the stored phonetic element that follows the subject phonetic element.

Taking the phonetic element AE1 of "CAT" as an example, the first context can be written AA0-AE1-AA0. In this case, AE1 is preceded and followed by the first phonetic element of the alphabet. The second context is written AA0-AE1-AE0: AE1 is preceded by the first phonetic element and followed by the second phonetic element. After the various contexts that include AA0 as the preceding phonetic element, the contexts that include AE0 as the following phonetic element are listed.

The list includes various three-element combinations with AE1 as the subject (middle) phonetic element.

Examining the list entries for the phonetic element AE1, one context Π_m,c (enclosed by a dashed line) has KQ as the preceding phonetic element and TX as the following phonetic element. This context matches the context found in the word "CAT". Based on data obtained during training, a feneme string denoted f is associated with the KQ-AE1-TX context. As noted above, the string f may result from a single utterance or from multiple utterances of the KQ-AE1-TX context during training.

The string f corresponds to the phonetic element AE1 occurring in the KQ-AE1-TX context.

In forming the fenemic baseform of the word "CAT", the string f is associated with the piece of the word "CAT" that corresponds to the phonetic element AE1.

For the other phonetic elements in the word "CAT", the corresponding feneme model strings are derived in the same way. That is, the feneme model string associated with SILENCE-KQ-AE1 is noted, as is the feneme model string for TX flanked by AE1 and TQ, and so on. The feneme model strings derived for the phonetic elements in the word "CAT" are concatenated in the order in which the respective phonetic elements occur in the word. The concatenated feneme model strings form the synthesized fenemic baseform for the word "CAT".
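The lookup-and-concatenate step for "CAT" can be sketched as follows, for the simple case in which every context of the new word has a stored entry. The phone sequence KQ, AE1, TX, TQ is inferred from the contexts named in the text, and the feneme identifiers (`f1`, `f2`, ...), the `SILENCE` padding at both word boundaries, and the function name are assumptions made purely for illustration.

```python
def synthesize_baseform(phones, table, boundary="SILENCE"):
    """Concatenate the stored feneme strings for each contextualized phone.

    phones: phonetic elements of the new word, in order.
    table:  dict mapping (previous, subject, next) -> feneme string.
    Raises KeyError if a context has no stored entry.
    """
    padded = [boundary] + list(phones) + [boundary]
    baseform = []
    for i in range(1, len(padded) - 1):
        context = (padded[i - 1], padded[i], padded[i + 1])
        baseform.extend(table[context])  # append in order of occurrence
    return baseform


# Hypothetical table entries; the feneme identifiers are made up.
table = {
    ("SILENCE", "KQ", "AE1"): ("f1", "f2"),
    ("KQ", "AE1", "TX"): ("f3", "f4", "f5"),
    ("AE1", "TX", "TQ"): ("f6",),
    ("TX", "TQ", "SILENCE"): ("f7", "f8"),
}
cat_baseform = synthesize_baseform(["KQ", "AE1", "TX", "TQ"], table)
```

Here `cat_baseform` is the eight fenemes `f1` through `f8` in order. A context missing from the table raises `KeyError` in this sketch; the source handles that case with a close-match procedure instead.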
The feeneem model string associated with the EI is recorded. Also, the finem model string for TX narrowed to AEl and TQ is recorded, and so on. The various feeneem model strings derived for the phonetic elements in the word PACAT'' are concatenated in the order in which each phonetic element occurs in that word.The concatenated feeneem models - The string becomes the composite feeneem base form for the word "CAT".

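In this preferred embodiment, a phonetic element may occur in a "new" word in a context for which no associated feneme model string has been stored. The method of FIG. 13 makes it possible to use an abbreviated list of correspondences between feneme model strings and three-element phonetic contexts.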
Certain phonetic elements may occur in the context of "new words" for which no model string has been memorized. To enable the use of an abbreviated correspondence list between a phonetic model string and a three-element phonetic context, the method of Figure 13 is used.

According to FIG. 13, each "new" word is represented as a string of phonetic elements, each phonetic element representing a piece of the "new" word. The phonetic element corresponding to each word piece is then identified in its context Π′_m,c. Starting with the first word piece, i←1, in step 2400, a decision is made in step 2402 as to whether Π′_m,c at position Pi corresponds entirely to a contextualized phonetic element Π_m,c that has an associated feneme model string. If so, the entire associated feneme model string is included in the fenemic baseform in step 2404.

The notation used in step 2404 (and in the steps below) is a shorthand. The double vertical bars represent a concatenation operator: the "piece" to their right is appended to the portion of the baseform constructed so far. The "piece" to the right of the concatenation operator has three parameters. The leftmost parameter identifies the decision being made. The next parameter identifies the starting feneme model in the associated feneme string. The last parameter identifies the last feneme model in the associated feneme model string that is to be included in the concatenation. Thus "piece(g1, 1, Q(g1))" refers to the first through last feneme models associated with decision g1 (of step 2402).

That is, a "yes" at decision g1 indicates that the phonetic element Π′_m,c of the "new" word matches a stored Π_m,c (having the same three-element phonetic context) and that there is an associated feneme model string (starting with model 1 and ending with model Q(g1)). Given a "yes" at g1, the entire feneme model string is appended in step 2404 to the baseform constructed for the previous pieces of the "new" word. After step 2404, the next word piece is examined in step 2406.
The new word is added as a tag to the base form constructed for the previous segment. After step 2404, the next word segment is examined in step 2406.
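The piece notation and the concatenation operator can be pictured with a small sketch. It assumes that a fenemic model string is held as a Python list of model labels and that models are numbered from 1, as in the text; the helper name `piece` mirrors the notation, while the labels themselves are made up for illustration.

```python
from typing import List


def piece(models: List[str], first: int, last: int) -> List[str]:
    """Models numbered first..last (1-based, inclusive) of a stored model string."""
    return models[first - 1:last]


if __name__ == "__main__":
    # Hypothetical model string retrieved by a "yes" g1 decision.
    g1_string = ["F12", "F7", "F7", "F31"]
    # Baseform built so far for the earlier pieces of the word (hypothetical).
    bsf = ["F3", "F3"]
    # Concatenate the whole string: models 1 through len(g1_string).
    bsf = bsf + piece(g1_string, 1, len(g1_string))
    print(bsf)
```

List concatenation plays the role of the double-vertical-line operator here; taking only part of a string, as in the later steps, is the same call with different bounds.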

If the phonetic element Π'm,c in the "new" word does not map to a stored phonetic element having the same three-element phonetic context, a decision is made as to whether a similar two-element phonetic context exists. In step 2410, the phonetic element of the "new" word and the phonetic element preceding it are considered in decision g2. If a matching preceding-element/subject-element context is contained in any three-element context in the stored list, the corresponding fenemic model string is retrieved. Step 2412 then extracts the first half of that fenemic model string and concatenates it onto the fenemic baseform (bsf) under construction.

A similar examination determines whether the subject phonetic element and the phonetic element following it have a counterpart context in storage. This is referred to as the g4 decision and is performed in step 2414. The decision indicates whether the list contains a three-element context whose last two phonetic elements are the same as the subject phonetic element and the following phonetic element of the "new" word piece under consideration. If so, the second half of the fenemic model string (with its first model omitted) is appended to the baseform (bsf) under construction (see step 2416). If not, step 2418 concatenates onto the baseform (bsf) under construction the latter half of the fenemic string determined according to step 2420.

Step 2420 decides whether storage contains no phonetic-element context with the same phonetic elements P(i-1) Pi as the "new" word piece under consideration. If there is none, any stored phonetic context that includes the subject phonetic element (that is, the phonetic element at position Pi of the "new" word piece under consideration) is noted together with its associated fenemic model string. (If more than one string is noted, one of them may be selected arbitrarily.) In step 2422, half of the noted fenemic model string is concatenated onto the baseform under construction. Step 2422 is followed by step 2414.

After a fenemic model string has been added to the previously constructed portion of the baseform in step 2404 or 2416, the next word piece is considered, until all pieces of the "new" word have been considered. This is carried out by steps 2406 and 2424. The models derived for the individual pieces, once concatenated, form the fenemic-model baseform of the "new" word.
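The flow of FIG. 13 as described above can be summarized in a rough sketch. Several points are assumptions made only for illustration: the stored list is a dictionary keyed by (previous, subject, next) tuples, `None` marks a word boundary, "half" is taken by integer division, the fallback looks for the subject in the middle position of a stored context, the first match in dictionary order stands in for the arbitrary choice, and the omission of the first model from the second half is not modeled. All phone symbols and model labels are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Context = Tuple[Optional[str], str, Optional[str]]
Table = Dict[Context, List[str]]


def first_match(table: Table, pred: Callable[[Context], bool]) -> Optional[List[str]]:
    """Return the model string of the first stored context satisfying pred."""
    for ctx, models in table.items():
        if pred(ctx):
            return models
    return None


def synthesize(phones: List[str], table: Table) -> List[str]:
    """Build a baseform for a word from context-dependent model strings."""
    bsf: List[str] = []
    padded: List[Optional[str]] = [None] + list(phones) + [None]
    for i in range(1, len(padded) - 1):
        prev, subj, nxt = padded[i - 1], padded[i], padded[i + 1]

        # Exact three-element match (step 2402): append the whole string (step 2404).
        exact = table.get((prev, subj, nxt))
        if exact is not None:
            bsf += exact
            continue

        # Fallback (steps 2420/2422): any stored context containing the subject.
        fallback = first_match(table, lambda c: c[1] == subj)
        if fallback is None:
            raise KeyError(f"no stored context contains phone {subj!r}")

        # Left two-element match (step 2410): first half of the string (step 2412).
        left = first_match(table, lambda c: c[0] == prev and c[1] == subj)
        if left is None:
            left = fallback
        bsf += left[: len(left) // 2]

        # Right two-element match (step 2414): second half (step 2416 or 2418).
        right = first_match(table, lambda c: c[1] == subj and c[2] == nxt)
        if right is None:
            right = fallback
        bsf += right[len(right) // 2:]
    return bsf


if __name__ == "__main__":
    stored: Table = {
        (None, "K", "AE"): ["k1", "k2"],
        ("K", "AE", "T"): ["a1", "a2", "a3", "a4"],
        ("B", "AE", "D"): ["b1", "b2", "b3", "b4"],
        ("AE", "D", None): ["d1", "d2"],
    }
    print(synthesize(["K", "AE", "D"], stored))
```

In the example, the middle phone has no exact context in the table, so its models are assembled from the first half of one stored string and the second half of another.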

According to the invention, the synthesis of fenemic word baseforms based on phonetic-element context can be used for all, or only some, of the words for which no fenemic baseform has been trained. Where two words (each with a known fenemic baseform) are joined to form a single word, their baseforms are joined to form a composite baseform for that single word. For example, suppose the words HOUSE and BOAT are joined to form the single word HOUSEBOAT. The fenemic baseform of the single word HOUSEBOAT is formed simply by joining the fenemic baseform of the word HOUSE with the fenemic baseform of the word BOAT. The phonetic-context approach may therefore be used for such words, but need not be.
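The composite-baseform case amounts to plain concatenation, as the following minimal sketch suggests. The dictionary of known baseforms and the model labels are hypothetical placeholders, not data from the patent.

```python
from typing import Dict, List


def compound_baseform(parts: List[str], baseforms: Dict[str, List[str]]) -> List[str]:
    """Join the known baseforms of the component words, in order."""
    joined: List[str] = []
    for word in parts:
        joined.extend(baseforms[word])
    return joined


if __name__ == "__main__":
    known = {
        "HOUSE": ["F5", "F5", "F18", "F40"],  # hypothetical model labels
        "BOAT": ["F9", "F22", "F22", "F61"],
    }
    print(compound_baseform(["HOUSE", "BOAT"], known))
```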

Although the invention has been described with reference to its preferred embodiments, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of the invention. For example, the phonetic contexts relied upon need not be the three-element contexts described above.

Instead of the two neighboring elements, the highest-order context may include any number n (1 ≤ n) of neighboring phonetic elements. Moreover, the phonetic elements in a context need not be adjacent in position; they may be separated by one or more intervening phonetic elements.

Furthermore, although the description has been given in terms of phonetic Markov models and fenemic Markov models, the invention also contemplates the use of other types of models.

That is, the invention is intended to apply generally wherever a word can be represented both by a baseform of models from a first set of models and by a baseform of models from a second set of models, and the two baseforms can be aligned.

It should also be noted that the term "word" is used in this patent application in a broad sense and refers to dictionary words, lexemes (that is, specific pronunciations of dictionary words, as noted above), and parts of words (such as syllables) that can be used to define the speech to be recognized.

The method of FIG. 13 may also be varied if desired. For example, the portion of a fenemic model string that is concatenated onto the pre-existing fenemic models of the baseform under construction may be something other than one half.

It should further be noted that fenemic baseforms can be divided into word pieces of phonetic-element size in several ways. Besides the N-gram synthesis approach described above (in which overlapping phone strings may arise in the synthesized baseform), a longest-is-better synthesis can also be used. In the latter approach, the phone sequence is divided so that (a) the number of pieces used is minimized and (b) the length of the longest pieces used is maximized.

For example, in the longest-is-better scheme, all possible fenemic pieces corresponding to all possible sets of phone strings in the vocabulary can be computed. A criterion function can then be defined as

$f = \ell_1^2 + \ell_2^2 + \ell_3^2 + \cdots + \ell_n^2$

where $\ell_1$ is the length of phone string 1, $\ell_2$ is the length of phone string 2, and so on, so that $\ell_1 + \ell_2 + \cdots + \ell_n = L$, the length of the phone string corresponding to the desired new word.

A set of pieces is then chosen such that f is a maximum.

Note that in the ideal case this corresponds to

$\ell_1 = L, \quad \ell_2 = \ell_3 = \cdots = \ell_n = 0$

since this maximizes f subject to the constraint $\ell_1 + \ell_2 + \cdots + \ell_n = L$.
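One way to picture the longest-is-better selection is a small dynamic program. The sketch below reads the criterion f as the sum of squared piece lengths, which is an interpretation consistent with the ideal case stated above rather than something the garbled source formula confirms. It also assumes that the inventory of phone strings with known fenemic pieces is given as a set of tuples; the function name and phone symbols are hypothetical.

```python
from typing import List, Optional, Sequence, Set, Tuple

Piece = Tuple[str, ...]


def longest_is_better(phones: Sequence[str], known: Set[Piece]) -> Optional[List[Piece]]:
    """Split phones into known pieces so that the sum of squared lengths is maximal.

    Returns None when the phone string cannot be covered by known pieces.
    """
    n = len(phones)
    # best[k] holds (score, pieces) for the best split of phones[:k], or None.
    best: List[Optional[Tuple[int, List[Piece]]]] = [None] * (n + 1)
    best[0] = (0, [])
    for end in range(1, n + 1):
        for start in range(end):
            prev = best[start]
            seg = tuple(phones[start:end])
            if prev is None or seg not in known:
                continue
            score = prev[0] + len(seg) ** 2
            current = best[end]
            if current is None or score > current[0]:
                best[end] = (score, prev[1] + [seg])
    result = best[n]
    return None if result is None else result[1]


if __name__ == "__main__":
    inventory = {
        ("HH", "AW", "S"), ("B", "OW", "T"),
        ("HH",), ("AW",), ("S",), ("B",), ("OW", "T"),
    }
    print(longest_is_better(["HH", "AW", "S", "B", "OW", "T"], inventory))
```

Because squaring rewards long pieces, a split into two three-phone pieces scores higher than any split of the same string into more, shorter pieces.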

The invention has been implemented in the PL/I language on an IBM 3084 computer, embodying both the N-gram synthesis approach and the longest-is-better approach. In both cases, useful fenemic baseforms were synthesized.

Synthesized baseforms perform at least as well in a recognition task as the corresponding phonetic baseforms. For example, if the error rate with phonetic baseforms on a standard task is 4%, replacing all phonetic baseforms with synthesized fenemic baseforms should bring the error rate below 4%.

By recording baseforms for the 2000 most frequently occurring words and synthesizing those of the 3000 less frequently used words, the invention saved at least 150% in training time (to 2/3).

F. Effects of the invention

As described above, the invention provides a technique for synthesizing, after a training period, baseforms for such other words (words not uttered during training), the baseforms being constructed from models of a first set of models.

[Brief explanation of drawings]

FIG. 1 is a schematic diagram of a speech recognition system to which the invention can be applied.
FIG. 2 is a schematic diagram of a phonetic Markov model.
FIG. 3 is a schematic diagram of a lattice or trellis structure showing a label interval for the phonetic Markov model of FIG. 2.
FIG. 4 is a schematic diagram of a lattice or trellis structure, similar to FIG. 3, measured over several label output intervals starting with the first label in a string of labels generated by the acoustic processor.
FIG. 5 shows the phonetic representation of a given pronunciation of the word "THE" and the three concatenated phonetic Markov models that form the phonetic baseform of the word "THE".
FIG. 6 shows a fenemic Markov model.
FIG. 7 is an explanatory diagram of a lattice or trellis structure over several label output intervals corresponding to fenemic Markov models.
FIG. 8 shows fenemic Markov models concatenated to form a word.
FIG. 9 is a block diagram showing the method of the invention in general terms.
FIG. 10 is a flowchart showing how fenemic baseforms are divided into pieces corresponding in size to phonetic elements.
FIG. 11 shows the word "CAT" represented phonetically.
FIG. 12 is an explanatory diagram of a storage table showing the association between each fenemic string and its corresponding phonetic model in a given context.
FIG. 13 is a flowchart showing the synthesis of fenemic baseforms where the list of FIG. 12 contains some, but not all, of the possible phonetic contexts.

Applicant: International Business Machines Corporation
Sub-agent: Patent Attorney Toshio Sawada (and one other)

Claims

[Scope of Claims] A word baseform synthesizer for speech recognition, which synthesizes a first word baseform represented by a chain of models of a first set from a second word baseform represented by a chain of models of a second set, comprising the following means (a) through (d):
(a) storage means for storing, for each context, a chain of models of the first set corresponding to each of the models of the second set;
(b) means for determining each of the models of the second set that form the second word baseform of a word, together with its respective context;
(c) means for retrieving from the storage means the corresponding chain of models of the first set, based on the determined type and context of each model of the second set;
(d) means for combining the retrieved chains of models of the first set to generate the first word baseform.