JPS62178999A

JPS62178999A - Voice recognition system

Info

Publication number: JPS62178999A
Application number: JP61016993A
Authority: JP
Inventors: ピーター・ヴインセント・デソーザ; ラリツト・ライ・ボール; ロバート・レロイ・マーサー; マイケル・アラン・ピチエニイ
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1986-01-30
Filing date: 1986-01-30
Publication date: 1987-08-06
Also published as: JPH0372991B2

Abstract

(57) [Abstract] This publication contains application data filed before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] The invention is described in the following order.

A. Industrial Field of Application
B. Summary of the Disclosure
C. Prior Art
D. Problems to Be Solved by the Invention
E. Means for Solving the Problems
F. Embodiment
F1. Labeling of Speech Input Signals
F2. Generation of Feneme-Based Word Models
F3. Overview of Model Generation and Recognition Procedure
F4. Label Vocabulary Generation and Conversion of Speech into Label Strings
F5. Generation of Word Models Using Label Strings
F6. Recognition Process
F7. Changes of the Label Alphabet
F8. Tables
G. Effects of the Invention

A. Industrial Field of Application

The present invention relates to speech recognition, and more particularly to a speech recognition system that uses statistical Markov models for the words of a given vocabulary.

B. Summary of the Disclosure

An apparatus is disclosed for modeling words with label-based Markov models in a speech recognition system. The modeling includes: (1) sending a first speech input, corresponding to words in the vocabulary, to an acoustic processor that converts each spoken word into a sequence of standard labels, each standard label corresponding to a sound type assignable to a time interval; (2) representing each standard label as a probabilistic model that has a plurality of states, at least one transition from a state to a state, and at least one settable output probability at some transition; (3) sending selected acoustic inputs to an acoustic processor that converts them into personalized labels, each corresponding to a sound type assigned to a time interval; and (4) setting each output probability as the probability that the standard label represented by a given model produces a particular personalized label at a given transition in that model. The invention addresses the problem of generating word models simply and automatically in a speech recognition system.

C. Prior Art

In current speech recognition systems, two approaches to the acoustic modeling of words are in common use. One approach uses word templates, and the matching process for recognizing a word is based on a dynamic programming (DP) procedure. Examples of this approach are given in F. Itakura, "Minimum Prediction Residual Principle Applied to Speech Recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-23, 1975, pp. 67-72, and in U.S. Patent No. 4,181,821.

The other approach uses phoneme-based Markov models, which are suited to probabilistic training and decoding algorithms. This technique and related procedures are described in F. Jelinek, "Continuous Speech Recognition by Statistical Methods," Proceedings of the IEEE, Vol. 64, 1976, pp. 532-556.

The following three aspects of these models are particularly important.

(a) Word specificity: Word templates are built from actual samples of the word and therefore tend to give better recognition. Phonetically based models are derived from man-made phonetic baseforms, so they represent an idealized version of the word that may never actually occur.

(b) Trainability: Markov models are superior to templates because they can be trained, for example by the forward-backward algorithm described in the Jelinek paper cited above. Word templates use distance measures that are not trained, such as the Itakura distance (described in the Itakura paper cited above) or spectral distance. One exception is the method used in R. Bakis, "Continuous Speech Recognition via Centisecond Acoustic States," IBM Research Report RC 5971, April 1976, which allows word templates to be trained.

(c) Computational speed: Markov models that use a discrete alphabet of acoustic processor outputs are substantially faster to compute than dynamic-programming matching (as used by Itakura) or continuous-parameter word templates (as used by Bakis).

D. Problems to Be Solved by the Invention

An object of the invention is to provide a method of acoustic modeling that has the word specificity of word templates while also offering the trainability available with discrete-alphabet Markov models.

A further object of the invention is to provide acoustic word models for speech recognition that are simple yet allow high-speed operation in the recognition process.

E. Means for Solving the Problems

According to the invention, to generate a word model, the acoustic signal representing the word is first converted into a string of standard labels drawn from a discrete alphabet. Each label represents one time interval of the word. Each standard label is then replaced by a probabilistic (e.g., Markov) model, so that a baseform model is formed that consists of a number of successive models, one per standard label. At this point the probabilities have not yet been entered. The baseform models are then trained with sample utterances, which produce the statistics (probabilities) to be applied to the models. Thereafter, the models are used for actual speech recognition.

This method has several advantages. The word models generated are much more detailed than phoneme-based models and are nevertheless trainable. The number of parameters depends on the size of the standard label alphabet rather than on the size of the vocabulary. Statistical matching with these label-based models is computationally much faster than DP matching with word templates.

These advantages are substantial compared with the technique in the Bakis paper cited above. In the Bakis paper, a word is defined as a sequence of states, and every state is regarded as different. Hence, if each word typically extends over 60 states and the vocabulary is 5,000 words, the Bakis technique has to deal with 300,000 different states. According to the invention, each state is identified with one of on the order of 200 labels. The invention requires only enough storage to hold the words as simple sequences of numbers, each number identifying one of the 200 labels, rather than as 300,000 distinct states.
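
The storage comparison above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the figures (5,000 words, about 60 states per word, 200 labels) are the example values quoted in the text.

```python
# Illustrative arithmetic only; function names are hypothetical and the
# figures are the example values quoted in the text.

def distinct_states_per_word_approach(vocab_size, states_per_word):
    """Every state of every word is treated as distinct."""
    return vocab_size * states_per_word


def distinct_models_label_approach(alphabet_size):
    """Every state is one of a fixed set of label models, whatever the vocabulary size."""
    return alphabet_size


bakis_states = distinct_states_per_word_approach(5000, 60)
label_models = distinct_models_label_approach(200)
print(bakis_states, label_models)
```

Doubling the vocabulary doubles the first figure but leaves the second unchanged, which is the point the paragraph makes about parameter count.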

Furthermore, the label-based model approach of the invention requires less training data. With the technique of the Bakis paper, each speaker must utter every word for training, whereas with the invention a speaker need only utter enough words to set the values associated with the 200 standard label models. In addition, the technique of the Bakis paper treats two words such as "boy" and "boys" as unrelated, whereas in the invention the training relating to the standard labels of "boy" also applies to "boys".

F. Embodiment

F1. Labeling of Speech Input Signals

The preliminary step for speech recognition and model generation in this system is the conversion of the speech input signal into a coded representation. This is done by a procedure described, for example, in A. Nadas et al., "Continuous Speech Recognition with Automatically Selected Acoustic Prototypes Obtained by Either Bootstrapping or Clustering," Proceedings ICASSP 1981, pp. 1153-1155. In this conversion procedure, fixed-length intervals of the acoustic input signal, on the order of 1/100 second, are subjected to spectral analysis, and on the basis of the resulting information each interval is assigned a label, or "feneme", chosen from a finite set (alphabet) of labels. (The term feneme is used here for a very small phoneme-like unit obtained from the front end. In essence, anything that represents a minute sound type falls under this term, whatever process produces it.) Each label represents a sound type, more specifically a spectral pattern characteristic of a 10-millisecond speech interval. The initial selection of the characteristic spectral patterns, i.e., the generation of the label set, is also described in the Nadas et al. paper cited above.

In the invention, there are different sets of labels.

First, there is a finite alphabet of standard labels. The standard labels are generated when a first speaker speaks into a conventional acoustic processor, which applies conventional methods to perform clustering and labeling. The standard labels thus correspond to the first speaker's speech. Once the standard labels are established, the first speaker utters each word of the vocabulary, and the acoustic processor converts each word into a sequence of standard labels (as illustrated in Table 2 and Fig. 2). The sequence of standard labels for each word is entered into storage.

Second, there are sets of personalized labels.

These personalized labels are generated when a subsequent speaker (the first speaker again, or another speaker) provides speech input to the acoustic processor after the standard labels and their sequences have been established.

Each alphabet, i.e., the standard label set and each personalized label set, preferably contains 200 labels (this number may differ). The standard labels and the personalized labels are related to one another by the probabilistic model associated with each standard label. Specifically, each standard label is represented by a model having:
(a) a plurality of states and transitions extending from a state to a state;
(b) a probability for each transition in the model; and
(c) a plurality of label output probabilities, where each output probability at a given transition is the probability that the standard label's model produces a particular personalized label at that transition, based on acoustic input from a subsequent speaker during training.

The transition probabilities and label output probabilities are set during a training period in which the training speaker makes known utterances. Techniques for training Markov models are known and are briefly described below.

An important feature of this labeling technique is that it can be performed automatically on the basis of the acoustic signal and therefore requires no phonetic interpretation.

The labeling technique is described in more detail below with reference to Figs. 5 and 6.

F2. Generation of Feneme-Based Word Models

The invention presents a new way of generating word models that is simple and automatic and that yields a more exact representation than models based on phonemes. To generate the model for a word, the word is first spoken once and a string of standard labels is obtained from the acoustic processor (see Fig. 2 for the principle and Table 2 for samples). Each standard label is then replaced by an elementary Markov model (Fig. 3) that represents an initial state, a final state, and certain possible transitions between them. Concatenating the Markov models of the labels yields the model for the whole word.

This model can then be trained and used for speech recognition like other Markov models known from the literature, e.g., the Jelinek paper cited above. The statistical model of each standard label is preferably kept in storage, along with the word models formed from those labels, in the form of Table 3.

In Table 3, each of the standard label models M1 through MN has three possible transitions, each with an associated transition probability. In addition, each model M1 through MN has 200 output probabilities at each transition, one for each personalized label. Each output probability indicates the likelihood that the standard label corresponding to a model (for example, M1) produces the respective personalized label at a particular transition. Because the transition probabilities and output probabilities can vary from speaker to speaker, they are set during the training period. If desired, the transition probabilities and output probabilities can be combined into composite probabilities, each indicating the likelihood of producing a given output while taking a given transition.

For a given word, the utterance by the first speaker determines the order of the labels and hence the order of the corresponding models. The first speaker utters all the words of the vocabulary to establish the respective order of standard labels (and models) for each word. Thereafter, during training, a subsequent speaker utters known acoustic inputs, preferably words of the vocabulary.

From the utterances of these known acoustic inputs, the transition probabilities and output probabilities for the particular speaker are determined and stored, which trains the models. Note that the subsequent speaker need utter only as many acoustic inputs as are required to set the probabilities of the 200 models. That is, the subsequent speaker does not have to utter every word in the vocabulary, only enough acoustic inputs to train the 200 models.

The use of label-based models is of further significance when words are added to the vocabulary. All that is needed to add a word is to determine its sequence of labels, for example from an utterance of the new word. Because the label-based models, including the probabilities for the particular speaker, have already been entered into storage, the only data needed for the new word is its order of labels.

The advantage of this model is that it is very simple and can be generated easily. The preferred embodiment of the invention is described in detail below.

Although the preferred embodiment is shown with a single acoustic processor, the processing can also be distributed over a plurality of processors, for example: a first processor that initially determines the alphabet of standard labels; a second processor that generates the sequence of standard labels for each word from its initial single utterance; a third processor that selects an alphabet of personalized labels; and a fourth processor that converts training inputs, i.e., words, into strings of personalized labels.

F3. Overview of Model Generation and Recognition Procedure

Fig. 1 gives an overview of a speech recognition system that performs model generation and actual recognition according to the invention.

Input speech is converted into label strings in acoustic processor 1 using a previously generated standard label alphabet 2. In initial step 3, a Markov model is generated for each word from the string of standard labels produced by the initial single utterance of that word, together with the previously defined elementary feneme Markov models 4. The label-based Markov models are stored intermediately (feneme-based word models 5).

Thereafter, in training step 6, utterances of several words by a subsequent speaker are matched against the label-based models to generate the statistics on the probability values for the transitions and the personalized label outputs of each model.

In actual recognition, the string of personalized labels resulting from the utterance to be recognized is matched against the statistical label-based word models, and the identifier of the word, or words, with the highest probability of having generated that string of personalized labels is supplied at the output.

In Fig. 1, separate symbols mark the label strings of the single utterances used for model generation, the label strings of the several training utterances, and the label strings of the utterances actually to be recognized.

F4. Label Vocabulary Generation and Conversion of Speech into Label Strings

The procedure for generating an alphabet of labels, and for actually converting speech into label strings, is described with reference to Figs. 5 and 6. (A description of this procedure is also given, for example, in the Nadas et al. paper cited above.)

To generate the standard labels, which represent prototype vectors of sound types (more specifically, spectral parameters) of speech, a speaker talks for about five minutes to provide a speech sample (block 11 in Fig. 5).

In the acoustic processor (whose details are discussed in connection with Fig. 6), 30,000 vectors of speech parameters are obtained, one for each 10-millisecond interval of the sample (block 12). These vectors are then processed in an analysis or vector quantization operation that groups them into 200 clusters, each containing closely similar vectors (block 13). Such procedures are disclosed in the literature, e.g., R. M. Gray, "Vector Quantization," IEEE ASSP Magazine, April 1984, pp. 4-29.

For each cluster, one prototype vector is selected, and the resulting 200 prototype vectors are stored for later reference (block 14).
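
The clustering of blocks 13 and 14 can be sketched as follows. The text does not name a specific clustering algorithm; the sketch assumes a plain k-means (Lloyd) iteration, which is one common way to design a vector quantizer, and takes each cluster mean as its prototype. The function names, the iteration count, and the seed are all hypothetical. In the setting described in the text, `vectors` would hold the 30,000 spectral vectors and `k` would be 200.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_prototype(vector, prototypes):
    """Index of the prototype closest to the given vector."""
    return min(range(len(prototypes)),
               key=lambda i: squared_distance(vector, prototypes[i]))


def train_prototypes(vectors, k, iterations=20, seed=0):
    """Assumed k-means design of k prototype vectors (one per cluster)."""
    rng = random.Random(seed)
    prototypes = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest_prototype(v, prototypes)].append(v)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous prototype
                dim = len(members[0])
                prototypes[i] = [sum(m[d] for m in members) / len(members)
                                 for d in range(dim)]
    return prototypes


# Toy data: two well-separated groups of 2-dimensional vectors.
toy_vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
               [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
toy_prototypes = sorted(train_prototypes(toy_vectors, 2))
print(toy_prototypes)
```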

Each such vector represents one acoustic element, or label. A typical label alphabet is shown in Table 1.

Fig. 6 is a block diagram of the procedure by which speech is processed acoustically to obtain the label string of an utterance.

Speech from microphone 21 is converted into a digital representation by A/D converter 22. In block 23, windows of 20 milliseconds are extracted from the digital representation, with a new window taken every 10 milliseconds so that successive windows overlap somewhat. For each window, spectral analysis is performed with a fast Fourier transform (FFT) to obtain a vector representing the respective 10 milliseconds of speech (block 24 in Fig. 6). The parameters of these vectors are the energy values in a number of spectral bands. In block 25, each current vector obtained in this way is compared with the set of prototype vectors generated in the preliminary procedure described above (block 14). Block 25 determines the prototype vector closest to the current vector, and block 26 outputs the label, or identifier, of that prototype. Thus one label appears at the output every 10 milliseconds, and the speech signal is available in coded form, the coding alphabet being the 200 feneme labels. Note that the invention need not be limited to labels generated at periodic intervals; the description above requires only that each label correspond to a respective time interval.
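
The labeling steps of blocks 23 to 26 can be sketched as follows, under several simplifying assumptions that are not taken from the text: a naive discrete Fourier transform stands in for the FFT, no window weighting is applied, the bands are equal-width groups of frequency bins, and closeness is measured by squared Euclidean distance. All function names, the 1,000 Hz sampling rate, and the two toy prototypes are hypothetical.

```python
import cmath
import math


def split_into_windows(samples, rate_hz, window_ms=20, step_ms=10):
    """20 ms windows taken every 10 ms, so that neighbours overlap."""
    window = rate_hz * window_ms // 1000
    step = rate_hz * step_ms // 1000
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, step)]


def band_energies(window, n_bands):
    """Energy in n_bands equal-width bands (naive DFT standing in for the FFT)."""
    n = len(window)
    power = []
    for k in range(n // 2):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(window))
        power.append(abs(coeff) ** 2)
    width = len(power) // n_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]


def label_utterance(samples, rate_hz, prototypes):
    """One label (index of the closest prototype) per 10 ms step."""
    labels = []
    for window in split_into_windows(samples, rate_hz):
        vector = band_energies(window, len(prototypes[0]))
        distances = [sum((a - b) ** 2 for a, b in zip(vector, p))
                     for p in prototypes]
        labels.append(distances.index(min(distances)))
    return labels


RATE_HZ = 1000  # toy sampling rate, chosen only to keep the example small
low_tone = [math.sin(2 * math.pi * 50 * t / RATE_HZ) for t in range(100)]
high_tone = [math.sin(2 * math.pi * 400 * t / RATE_HZ) for t in range(100)]
toy_label_prototypes = [[100.0, 0.0], [0.0, 100.0]]  # low-band and high-band types
print(label_utterance(low_tone, RATE_HZ, toy_label_prototypes))
print(label_utterance(high_tone, RATE_HZ, toy_label_prototypes))
```

With 200 prototypes instead of two, the same loop yields a string over the 200-label alphabet described in the text.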

F5. Generation of Word Models Using Label Strings

To generate the word models, each word required in the vocabulary is uttered once and converted into a string of standard labels as described above.

A schematic representation of the string of standard labels for one word is shown in Fig. 2. It consists of the sequence y1, y2, ..., ym that appeared at the output of the acoustic processor when the word was uttered. This string is taken as the standard-label baseform of the respective word.
・ym Consists of 0 columns. Consider this string as the basic form of the standard label for each word.

ワードの基本モデルを、そのワードの発音の変化を考慮
に入れて生成するには、基本形式のストリングのフィー
ニームｙｉの各々を、そのフイーニームノ基本マルコフ
・モデルＭ（ｙｉ）と取替える。To generate a basic model of a word that takes into account changes in the pronunciation of that word, replace each fineme yi of a string in the basic form with its basic Markov model M(yi) of the finemem.

The elementary Markov model can take an extremely simple form, as shown in Fig. 3. It consists of an initial state Si, a final state Sf, transitions T1 and T2, and a null transition T0. Transition T1 leads from state Si to state Sf and represents the output of one personalized label. Transition T2 leaves the initial state Si and returns to it, and likewise represents the output of one personalized label. Transition T0 leads from state Si to state Sf, but no personalized label output is assigned to it. This elementary model accounts for:
(a) a single occurrence of a personalized label, by taking transition T1 once;
(b) several occurrences of personalized labels, by taking transition T2 several times; and
(c) a missing personalized label, by taking the null transition T0.
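
A small calculation shows how the three transitions cover the three cases listed above. The sketch below computes the probability that one pass through the elementary model emits exactly k personalized labels. The function name and the transition probabilities (0.5, 0.4, 0.1) are assumptions chosen for illustration, and output probabilities are ignored so that only the number of emitted labels is considered.

```python
# Sketch of the elementary label model of Fig. 3: states Si and Sf,
# T1 (Si -> Sf, emits one label), T2 (Si -> Si, emits one label),
# T0 (Si -> Sf, emits nothing). Probability values are assumed.

def prob_emits_k_labels(k, p_t1, p_t2, p_t0):
    """Probability that one pass through the model emits exactly k labels."""
    if k == 0:
        return p_t0  # only the null transition emits nothing
    # Either k - 1 loops followed by T1, or k loops followed by the null transition.
    return p_t2 ** (k - 1) * p_t1 + p_t2 ** k * p_t0


P_T1, P_T2, P_T0 = 0.5, 0.4, 0.1  # assumed values; they must sum to 1
for k in range(4):
    print(k, prob_emits_k_labels(k, P_T1, P_T2, P_T0))
```

Summed over all k, these probabilities add up to 1 whenever the three transition probabilities do, so the model defines a proper distribution over how many labels a single standard label gives rise to.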

The same elementary model shown in Fig. 3 can be selected for all 200 different standard labels. More complicated elementary models could of course be used, and different models could be assigned to the various standard labels, but in this embodiment the model of Fig. 3 is used for all standard labels.

Fig. 4 shows the complete baseform Markov model of a whole word whose label-based baseform is shown in Fig. 2. It is a simple concatenation of the elementary models M(yi) of all the standard labels of the word, with the final state of each elementary model joined to the initial state of the next. Thus the complete baseform Markov model of a word that produced a string of m standard labels comprises m elementary Markov models.
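
The concatenation just described can be sketched as a small data structure. The representation chosen here (integer state numbers and a list of arcs) and the function name are assumptions for illustration; the label identifiers in the usage line are hypothetical.

```python
def build_word_model(baseform):
    """Return (number_of_states, arcs) for the concatenated word model.

    baseform: list of standard-label identifiers, one per elementary model.
    Each arc is (from_state, to_state, label_id, transition_name).
    """
    arcs = []
    for i, label_id in enumerate(baseform):
        # The final state of model i is the initial state of model i + 1.
        arcs.append((i, i + 1, label_id, "T1"))  # emits one personalized label
        arcs.append((i, i, label_id, "T2"))      # self-loop, emits one label
        arcs.append((i, i + 1, label_id, "T0"))  # null transition, no output
    return len(baseform) + 1, arcs


n_states, arcs = build_word_model([17, 17, 42])  # hypothetical label identifiers
print(n_states, len(arcs))
```

A baseform of m labels therefore yields m + 1 states and 3m arcs, and every arc refers back to one of the shared label models rather than carrying parameters of its own.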

The typical number of standard labels per word (and hence the number of states per word model) is about 30 to 80. The labels for four words are shown in Table 2. For example, the word "thanks" begins with label PX5 and ends with label PX2.

Generating the baseform Markov model of a word means defining the different states and transitions of the word and their interrelations. To be useful for speech recognition, the baseform model of a word must be made into a statistical model by training it with several utterances, i.e., by accumulating statistics for each transition in the model.

Because the same elementary model appears in several different words, it is not necessary to utter every word several times in order to train the models.

Such training can be carried out with the so-called forward-backward algorithm, which is described in the literature, e.g., in the Jelinek paper cited above.

As a result of training, a probability value is assigned to each transition in the model. For one particular state, for example, T1 may have a probability of 0.5, T2 a probability of 0.4, and T0 a probability of 0.1. In addition, for each of the non-null transitions T1 and T2, a list of probabilities is given that indicates, for each of the 200 personalized labels, how likely that label is to appear when the respective transition is taken. The complete statistical model of a word takes the form of a list or table as shown in Table 3. Each elementary model, or standard label, occupies one section of the table, and each transition corresponds to a row, i.e., a vector whose elements are the probabilities of the individual personalized labels (plus the overall probability of taking that transition).

To actually store the models of all words, it is sufficient to store, for each word, one vector whose components are the identifiers of the basic Markov models that make up the word, and to store the probability values once, in a single statistical Markov model, for each of the 200 standard labels of the alphabet. The statistical word model shown in Table 3 therefore need not actually be stored as such; it can be stored in distributed form, and the data can be combined as needed.
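
The distributed storage scheme can be sketched as follows. This is an illustration only: the identifiers (`"L007"`, `"L042"`), the word names, and the function `assemble_word_model` are hypothetical, and the output-probability vectors are omitted for brevity. The transition values are taken from the examples in the text and Table 3.

```python
from typing import Dict, List

# Shared inventory: one entry per standard label, holding the trained
# statistics of the corresponding basic Markov model (stored only once).
elementary_models: Dict[str, dict] = {
    "L007": {"T1": 0.5, "T2": 0.4, "T0": 0.1},
    "L042": {"T1": 0.73, "T2": 0.22, "T0": 0.05},
}

# Per-word storage: only a vector of model identifiers.
word_baseforms: Dict[str, List[str]] = {
    "word_a": ["L007", "L007", "L042"],
    "word_b": ["L042", "L007"],
}


def assemble_word_model(word: str) -> List[dict]:
    """Combine the stored pieces into the full word model when it is needed."""
    return [elementary_models[model_id] for model_id in word_baseforms[word]]


full_model = assemble_word_model("word_a")
```

Because the assembled list refers to the shared entries rather than copying them, statistics updated for one basic model are seen by every word that contains it.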

F6. Recognition Process

For actual speech recognition, utterances are converted into strings of personalized labels, as explained in the earlier sections. Each personalized label string is then matched against each of the word models to obtain the probability that the string was produced by an utterance of the word that the model represents.

Specifically, matching is performed on the basis of a match score for each word. Each match score represents the "forward probability" described in the Jelinek paper cited above. The recognition process using label-based models is similar to the known process using phoneme-based models. The word or words with the highest probability are selected as output.
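
A forward-probability match score of this kind could be computed roughly as sketched below. The topology is an assumption made for illustration, since the text only states that T1 and T2 are non-null and T0 is null: here T1 is taken to be a self-loop that produces a label, T2 produces a label and advances to the next basic model, and T0 advances without producing a label. The function names, the three-label alphabet, and all probability values are hypothetical.

```python
from typing import Dict, List, Sequence


def forward_probability(word_model: Sequence[dict], labels: Sequence[int]) -> float:
    """Probability that the word model produces exactly the given label string."""
    n_models, n_labels = len(word_model), len(labels)
    # alpha[t][s]: probability of having produced the first t labels and being in state s.
    alpha = [[0.0] * (n_models + 1) for _ in range(n_labels + 1)]
    alpha[0][0] = 1.0
    for t in range(n_labels + 1):
        # Null transitions (T0) move one state forward without consuming a label.
        for s, m in enumerate(word_model):
            alpha[t][s + 1] += alpha[t][s] * m["T0"]
        if t == n_labels:
            break
        y = labels[t]
        for s, m in enumerate(word_model):
            a = alpha[t][s]
            alpha[t + 1][s] += a * m["T1"] * m["out1"][y]      # assumed self-loop
            alpha[t + 1][s + 1] += a * m["T2"] * m["out2"][y]  # assumed forward move
    return alpha[n_labels][n_models]


def recognize(vocabulary: Dict[str, List[dict]], labels: Sequence[int]) -> str:
    """Select the word whose model has the highest match score."""
    return max(vocabulary, key=lambda w: forward_probability(vocabulary[w], labels))


# Toy models over a 3-label alphabet (illustrative values only).
model_a = {"T1": 0.5, "T2": 0.4, "T0": 0.1, "out1": [0.8, 0.1, 0.1], "out2": [0.8, 0.1, 0.1]}
model_b = {"T1": 0.5, "T2": 0.4, "T0": 0.1, "out1": [0.1, 0.8, 0.1], "out2": [0.1, 0.8, 0.1]}
vocabulary = {"word_a": [model_a, model_a], "word_b": [model_b, model_b]}
best = recognize(vocabulary, [0, 0])
```

A practical recognizer would usually work with logarithms or scaling to avoid underflow on long label strings; that detail is left out of the sketch.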

F7. Changes in the Label Alphabet

The standard label alphabet used for initial word model generation and the sets of personalized labels used for training and for recognition need not all be identical. Even if the labels generated during recognition differ somewhat from the personalized labels generated during training, the probability values associated with them still yield generally correct recognition results. This is possible because of the adaptability of the Markov models.

However, an excessively large change between the training alphabet and the recognition alphabet may affect accuracy. Likewise, a subsequent speaker whose voice differs greatly from that of the first speaker will train the models with probabilities that may limit accuracy.

F8. Tables

Table 1 shows a representative label alphabet.

In the table, the two letters roughly indicate the sound of the element.

A two-digit number is associated with vowels: the first digit indicates the stress of the sound and the second digit is a current identification number.

A one-digit number is associated with consonants and is a current identification number.

Table 2 shows sample label strings for four words.

Table 3 shows the statistical Markov model of a word (feneme-based model).

[Tables 1 and 2 are illegible in the source text.]

Table 3. Each row lists the model, state, transition, transition probability, and the first output probabilities of the personalized labels.

Model  State  Transition  Probability  Label probabilities
M1     S1     T11         0.5          0.03  0.00  0.13  ...
              T12         0.4          0.02  0.01  0.00  ...
              T10         0.1          -     -     -
M2     S2     T21         0.73         ...
              T22         0.22         ...
              T20         0.05         -     -     -
M3     S3     T31         0.5          ...
              T32         0.45         ...
              T30         0.05         -     -     -
MN     SN     TN1         0.6          ...
              TN2         0.3          ...
              TN0         0.1          -     -     -

G. Effects of the Invention

The present invention provides the following advantages:

(a) Label-based Markov models are superior to phoneme-based models because they represent words at a much more detailed level.
(b) Unlike word templates that use DP matching, label-based word models can be trained with the forward-backward algorithm.
(c) Because the number of parameters is determined by the size of the feneme alphabet rather than by the size of the vocabulary, the required storage grows only slowly as the vocabulary grows.
(d) The recognition procedure using label-based models is computationally much faster than DP matching and the use of continuous-parameter word templates.
(e) Word modeling can be performed automatically.
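
The storage argument can be made concrete with a short calculation. It is a sketch under stated assumptions: each basic model is assumed to carry three transition probabilities and two output distributions over the personalized labels, as in the T1/T2/T0 example given earlier, and the vocabulary size and average word length used in the usage lines are hypothetical figures, not values from the text.

```python
def feneme_parameter_count(n_standard: int = 200,
                           n_personalized: int = 200,
                           n_transitions: int = 3,
                           n_emitting: int = 2) -> int:
    """Number of probability values shared by all words (independent of vocabulary size)."""
    per_model = n_transitions + n_emitting * n_personalized
    return n_standard * per_model


def baseform_entries(vocabulary_size: int, avg_labels_per_word: int) -> int:
    """Number of model identifiers stored for the word vectors."""
    return vocabulary_size * avg_labels_per_word


shared = feneme_parameter_count()            # 200 * (3 + 2 * 200) = 80600
small_vocab = baseform_entries(1000, 60)     # hypothetical vocabulary figures
large_vocab = baseform_entries(5000, 60)
```

Under these assumptions the statistical parameters stay fixed at 80,600 values however many words are added; only the per-word vectors of identifiers grow with the vocabulary.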

[Brief explanation of drawings]

FIG. 1 is a block diagram of the model generation and recognition procedure according to the present invention.
FIG. 2 shows the label string of a word obtained from the acoustic processor.
FIG. 3 shows the basic Markov model for one label.
FIG. 4 shows the basic form of a word generated by replacing each standard label of the string shown in FIG. 2 with a basic Markov model.
FIG. 5 is a block diagram of the process of initial generation of the standard label alphabet.
FIG. 6 is a block diagram of the operation of the acoustic processor in deriving the personalized label string of an uttered word.

1: acoustic processor; 2: label alphabet; 3: initial step; 4: feneme Markov model; 5: word model; 6: training step; 7: statistical Markov model; 8: recognition process; 12: acoustic processor.

Applicant: International Business Machines Corporation
Sub-agent: Patent Attorney 澤 … 俊夫 (name partly illegible in the source)

[FIG. 1: Model generation and recognition procedure]
[FIG. 2: y1 y2 ... yi ... ym, feneme string of word W]
[Figure: M(y1) M(y2) ... M(yi) ... M(ym), feneme Markov model of word W]

Claims

Claims: A speech recognition system comprising:
first processor means for generating, in response to a first speech input, a set of standard labels each representing a sound type assignable to a time interval;
second processor means for producing, in response to the utterance of each word in a word vocabulary, a sequence of standard labels for that word by selecting the corresponding standard labels from said set of standard labels;
third processor means for selecting, in response to a second speech input, a set of personalized labels each representing an acoustic type assigned to a time interval; and
means for forming a respective probabilistic model for each standard label, each probabilistic model comprising (a) a plurality of states, (b) at least one transition from one state to another, (c) a transition probability for each of the transitions, and (d) for the at least one transition, a plurality of output probabilities,
said forming means including means for associating with the at least one transition the plurality of output probabilities, each representing the probability of producing a respective personalized label at a given transition in the model of a given standard label.