JPS60201395A

JPS60201395A - Voice recognition

Info

Publication number: JPS60201395A
Application number: JP59057280A
Authority: JP
Inventors: 田部井　幸雄; 森戸　誠
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1984-03-27
Filing date: 1984-03-27
Publication date: 1985-10-11
Also published as: JPH0313599B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION (Technical Field) The present invention relates to a speech recognition method. More specifically, it relates to a speech recognition method in which recognition begins as soon as the start of the input speech is detected, before the end of the input word utterance has been confirmed.

(Background Art) One known form of speech recognition works as follows. The feature pattern of speech is expressed as a time series of frames of frequency components. The dissimilarity between the input pattern and each standard pattern is calculated, and the category of the input speech is recognized from information that includes at least that dissimilarity.

Two approaches are known for setting the matching path between the input pattern and a standard pattern when calculating the dissimilarity. One is DP matching based on dynamic programming. The other is an essentially linear matching method.

The latter, linear matching, is a simple method. It is used when the number of recognition categories is relatively small, as described for example in Oki Kenkyu Kaihatsu (Oki Research and Development) No. 118, December 1982.
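The following sketch illustrates conventional whole-utterance linear matching. Each pattern is assumed to be a list of frames, and each frame a list of channel values. The proportional mapping k = floor(j·K/J) and the sum-of-absolute-differences distance are assumptions; they are not details taken from the cited Oki reference. The mapping needs the total input length J, which is why this style of matching can start only after the end of the utterance has been confirmed.

```python
def frame_distance(w, s):
    """Sum over channels of |w_i - s_i| for one input frame and one standard frame."""
    return sum(abs(a - b) for a, b in zip(w, s))


def linear_match(inp, std):
    """Dissimilarity after proportional (linear) time normalization.

    Assumed mapping: input frame j -> standard frame floor(j * K / J).
    It needs J = len(inp), i.e. the whole utterance up to its confirmed end.
    """
    J, K = len(inp), len(std)
    total = 0
    for j in range(J):
        k = (j * K) // J
        total += frame_distance(inp[j], std[k])
    return total


def recognize(inp, standards):
    """Return the code of the standard pattern with the minimum dissimilarity."""
    return min(standards, key=lambda n: linear_match(inp, standards[n]))


inp = [[1, 1], [2, 2], [3, 3], [4, 4]]
standards = {"a": [[1, 1], [3, 3]], "b": [[4, 4], [4, 4]]}
print(recognize(inp, standards))  # "a"
```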

However, conventional linear matching begins the recognition operation only after the end of the input word utterance has been confirmed. This slows the recognition response and also requires a relatively large working memory.

Furthermore, various techniques have been devised in linear matching to absorb variations in speaking rate, but they are difficult to apply to real-time dissimilarity calculation.

(Object of the Invention) An object of the present invention is to increase recognition response speed and reduce working memory capacity by starting recognition immediately upon detecting the start of the input speech. A further object is to absorb variations in speaking rate by anticipating them and setting a plurality of essentially linear matching paths.

(Summary of the Invention) The present invention uses an essentially linear matching method. The pattern information of the last frame of a standard pattern may, for example, be associated with several frames of the input pattern. Even so, frames are associated linearly at least in the voiced state.

In the present invention, the start of the input speech is detected first. From then on, at least in the voiced state, two things happen each time the input frame is updated. The distance between the input pattern and each standard pattern is calculated, and the dissimilarity for each standard pattern is calculated, updated and stored.

In this specification, the ordinal number of the input pattern frames, at least in the voiced state, is called the audio frame number. The audio frame number is normally the same as the order of the input pattern frames. It differs when, for example, distance calculation is suspended during a power dip.

The dissimilarity is calculated by adding the distance at the current audio frame number to the dissimilarity accumulated along the matching path up to the previous audio frame. The alternative is to compute the dissimilarity over the entire length of the matching path only after the end has been confirmed. Compared with that, the recognition response is much faster.

In another aspect of the present invention, a plurality of matching paths are set so that the dissimilarity calculation absorbs variations in speaking rate. In the present invention, one standard pattern code is selected from the dissimilarities for each audio frame, at least in the voiced state. In another aspect, a dissimilarity memory of twice the capacity is provided to reduce the number of these selection operations. A standard pattern code is then selected only from the dissimilarities of the audio frame corresponding to the end of the input speech.

(Embodiment) FIG. 1 is a block diagram of the main part of one embodiment of the present invention, described below with reference to the figure.

52 is a frame counter that counts input frames from the start of the input speech and outputs the audio frame number j. When the start is detected, a reset pulse sets its contents to 0. After that, a count pulse increments it each time the input frame is updated.

The start of the input speech is detected when a predetermined threshold is exceeded. As a countermeasure against noise, the recognition processing is reset to its initial state if the threshold is not exceeded for 3 or more consecutive frames (one frame is 16 msec long).

60 is a standard pattern memory that stores all of the standard patterns Sn(i, j). It is addressed by the standard pattern code n (hereinafter called the standard pattern number), the channel number i and the audio frame number j, and is read out one element at a time.
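Here is a minimal sketch of the start-detection rule. It assumes a per-frame power sequence and a fixed threshold, both hypothetical inputs. The hardware begins recognition at the first threshold crossing and resets if the run is shorter than 3 frames. This offline version simply reports the frame at which a qualifying run begins, which gives the same start frame.

```python
FRAME_MS = 16  # one frame is 16 msec long, per the text


def detect_start(powers, threshold, min_run=3):
    """Return the index of the frame taken as the start of speech, or None.

    A frame whose power exceeds `threshold` is a tentative start. If the
    threshold is not exceeded for `min_run` consecutive frames, the tentative
    start is discarded (reset to the initial state) as a noise guard.
    """
    run_start, run_len = None, 0
    for idx, p in enumerate(powers):
        if p > threshold:
            if run_len == 0:
                run_start = idx
            run_len += 1
            if run_len >= min_run:
                return run_start
        else:
            run_start, run_len = None, 0  # too short: treat as noise and reset
    return None
```

In this model the frame counter 52 would be reset at the returned index, so the audio frame number is j = idx - start.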

61 is an input pattern memory that holds one frame. One frame of the input pattern W(i, j) is written into it for each analysis frame of the input speech. It is addressed by channel number i and read out one element at a time.

63 is a distance calculator. It computes the one-frame distance dn(j) between the input pattern W(i, j) and each standard pattern Sn(i, j) by calculating the absolute difference for each channel number i and accumulating the results. This calculation is performed each time the audio frame number is updated.
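The distance calculator 63 computes a per-frame distance as a sum of absolute differences over channels. The sketch assumes frames are equal-length lists of channel values. When j runs past the end of a standard pattern, the sketch reuses that pattern's last frame. This is an assumption suggested by the Summary, not something stated for FIG. 1.

```python
def frame_distance(w_frame, s_frame):
    """dn(j): sum over channels i of |W(i, j) - Sn(i, j)|."""
    if len(w_frame) != len(s_frame):
        raise ValueError("frames must have the same number of channels")
    return sum(abs(w - s) for w, s in zip(w_frame, s_frame))


def frame_distances(w_frame, standards, j):
    """dn(j) for every standard pattern n at audio frame number j.

    standards: dict mapping pattern number n -> list of frames.
    If j is past the end of a pattern, its last frame is reused (assumption).
    """
    return {
        n: frame_distance(w_frame, std[min(j, len(std) - 1)])
        for n, std in standards.items()
    }
```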

64 is an adder, and 65 is a dissimilarity memory that stores the dissimilarity for each standard pattern. The following happens each time the distance dn(j) of any standard pattern is calculated:

1. The dissimilarity memory 65 is addressed by the standard pattern number.
2. The dissimilarity Dn(j-1) for the immediately preceding audio frame number is read out.
3. The adder 64 adds Dn(j-1) and the distance dn(j).
4. The sum is written back to the dissimilarity memory 65.

In this way, the dissimilarity Dn(j) of each standard pattern Sn(i, j) is updated and stored each time the audio frame number j is updated.

70 is a comparator, and 71 is a minimum dissimilarity memory. At the beginning of each update of the audio frame number j, the minimum dissimilarity memory 71 is set to the maximum value. Each time the dissimilarity Dn(j) for the current audio frame number is calculated, it is read out and compared with the value in the minimum dissimilarity memory 71. The minimum dissimilarity memory 71 is then overwritten with the smaller of the two values.

Therefore, when processing for audio frame number j is finished, the minimum dissimilarity along the matching paths from the start of the input speech up to frame number j has been detected.

72 is a frame code memory that stores a standard pattern number. Each time the comparator 70 detects a smaller dissimilarity, this memory is overwritten with the corresponding standard pattern number. It therefore holds the recognition result as of audio frame number j.
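The FIG. 1 data path can be modelled in software roughly as below. The class and method names are hypothetical, and a dict stands in for the dissimilarity memory 65. Clipping j to the last standard frame is an assumption, not something the text states for FIG. 1. Because each call adds only one frame's distance, no input frames need to be buffered.

```python
class IncrementalRecognizer:
    """Sketch of FIG. 1: one linear path per standard pattern.

    standards: dict mapping standard pattern number n -> list of frames.
    """

    def __init__(self, standards):
        self.standards = standards
        self.D = {n: 0 for n in standards}  # dissimilarity memory 65
        self.j = 0                          # frame counter 52 (reset at start)
        self.best_code = None               # frame code memory 72

    def push_frame(self, w_frame):
        """Process one input frame W(., j) and return the current best code."""
        min_d = float("inf")  # minimum dissimilarity memory 71, preset to max
        for n, std in self.standards.items():
            k = min(self.j, len(std) - 1)  # assumed: reuse last standard frame
            d = sum(abs(w - s) for w, s in zip(w_frame, std[k]))  # calculator 63
            self.D[n] += d                 # adder 64: Dn(j) = Dn(j-1) + dn(j)
            if self.D[n] < min_d:          # comparator 70
                min_d = self.D[n]
                self.best_code = n
        self.j += 1
        return self.best_code


rec = IncrementalRecognizer({"a": [[0, 0], [1, 1]], "b": [[5, 5], [5, 5]]})
rec.push_frame([0, 0])
print(rec.push_frame([1, 1]))  # "a"
```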

75 is a code memory that stores a standard pattern number and is provided as a countermeasure against power dips. An end-candidate clock, which signals that a silent state has been entered, transfers the contents of the frame code memory 72 into it. When the end of the input speech is confirmed, the standard pattern number in this code memory 75 is taken as the recognition result for the input speech.

Depending on the words to be recognized, a word may contain a low-power power dip, for example between the "to" and the "p" of "sutoppu" (stop). For this reason the end can be confirmed only when the silent state (frame power below the threshold) has continued for about 30 frames.

Therefore, recognition must continue until the end of the input speech is confirmed. The recognition result at the end candidate, where the frame power fell below the threshold, must also be saved. In this example, the code memory 75 does this.
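The end-candidate and power-dip logic can be sketched as a small state machine. It assumes the per-frame code held in frame code memory 72 is already available as a list. It uses the 30-frame silence hangover from the text, which at 16 msec per frame is 480 msec. The function name and inputs are hypothetical.

```python
def find_end_and_result(powers, codes, threshold, hangover=30):
    """Return (end_frame, recognized_code), or None if no end is confirmed.

    powers[j]: frame power at audio frame j
    codes[j]:  standard pattern number in frame code memory 72 after frame j
    """
    candidate = None  # code memory 75
    silent = 0        # consecutive frames below the threshold
    for j, p in enumerate(powers):
        if p < threshold:
            if silent == 0 and j > 0:
                # end-candidate clock: latch the result of the last voiced frame
                candidate = codes[j - 1]
            silent += 1
            if silent >= hangover:
                return j, candidate  # end confirmed
        else:
            silent = 0  # it was only a power dip; recognition continues
    return None
```

A power dip only resets the silence counter. Recognition keeps running, so a later end candidate overwrites the latched code.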

With this configuration, a recognition response can be made immediately after the end is confirmed, and the input pattern memory need only hold one frame.

It may be desirable to avoid deciding the recognition category at every audio frame. In that case, a separate memory identical to the dissimilarity memory 65 may be provided. When an end candidate of the input speech is detected, the dissimilarities at that moment are stored in this separate memory. When the end is confirmed, those dissimilarities are fed to the comparator 70 and the minimum dissimilarity memory 71 to detect the standard pattern number. Here too, the dissimilarity is calculated by adding the distance at the current frame to the previous dissimilarity each time the audio frame is updated. A recognition response can therefore be made within a few msec after the end is confirmed.

FIGS. 2 and 3 are explanatory diagrams of another embodiment of the present invention. In this example, three frame numbers are generated for each standard pattern at each audio frame, which sets three matching paths. Each time the audio frame number is updated, a dissimilarity is calculated, updated and stored for each path. The decision is based on the dissimilarity of the path selected according to a certain criterion.

FIG. 2(a) plots the audio frame number j of the input pattern on the horizontal axis and the frame number of the standard pattern on the vertical axis. It shows three matching paths 101 to 103, which assume speech 20% faster than standard, speech at the standard rate, and speech 20% slower than standard. FIG. 2(b) shows the frame power corresponding to audio frame number j of the input pattern.

SL(n) in FIG. 2 is the frame length of the standard pattern with standard pattern number n.

The following equations give the one-frame distance between an arbitrary standard pattern and the input pattern at audio frame number j. Here k, k' and k'' are the frame numbers generated for that standard pattern.

Distance for path 101: $d_n(j) = \sum_i |W(i,j) - S_n(i,k)|$ ... (Equation 1)
Distance for path 102: $d'_n(j) = \sum_i |W(i,j) - S_n(i,k')|$ ... (Equation 2)
Distance for path 103: $d''_n(j) = \sum_i |W(i,j) - S_n(i,k'')|$ ... (Equation 3)

Each path pairs the j-th frame of the input pattern with a different frame of the standard pattern:
- Path 101 uses the k-th frame.
- Path 102 uses the k'-th frame.
- Path 103 uses the k''-th frame.

If any of the standard-pattern frame numbers k, k' or k'' would exceed the length SL(n) of that standard pattern, it is limited to SL(n). The brackets [ ] in Equations 1 to 3 denote the Gauss symbol (integer part).
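The three frame numbers k, k' and k'' can be generated with integer (floor) arithmetic and clipped to the pattern length. Equations 1 to 3 are not fully reproduced in this text. The slopes 6/5, 1 and 4/5 below are therefore assumptions, chosen only to represent rate changes of roughly ±20%. The assignment of slopes to path labels is also assumed, and frame indices are 0-based, so the clip is to SL(n) - 1.

```python
# Hypothetical slopes (numerator, denominator) for paths 101-103.
PATH_SLOPES = {101: (6, 5), 102: (1, 1), 103: (4, 5)}


def path_frame_numbers(j, sl_n, slopes=PATH_SLOPES):
    """Standard-pattern frame index on each path at input frame j.

    Integer division plays the role of the Gauss (floor) bracket; the result
    is limited to the last frame of a pattern of length sl_n.
    """
    return {p: min((j * a) // b, sl_n - 1) for p, (a, b) in slopes.items()}


def path_distances(w_frame, std, j, slopes=PATH_SLOPES):
    """dn(j), d'n(j), d''n(j): sum of absolute differences on each path."""
    ks = path_frame_numbers(j, len(std), slopes)
    return {
        p: sum(abs(w - s) for w, s in zip(w_frame, std[k]))
        for p, k in ks.items()
    }
```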

Next, the dissimilarities Dn(j), D'n(j) and D''n(j) up to audio frame number j are given.

Dissimilarity for path 101: $D_n(j) = d_n(j) + D_n(j-1)$ ... (Equation 4)
Dissimilarity for path 102: $D'_n(j) = d'_n(j) + D'_n(j-1)$ ... (Equation 5)
Dissimilarity for path 103: $D''_n(j) = d''_n(j) + D''_n(j-1)$ ... (Equation 6)

On each path, the dissimilarity at audio frame number j is the dissimilarity value for audio frame number j-1 (for example Dn(j-1)) plus the per-channel distances (for example |W(i, j) - Sn(i, k)|) summed over all channels. These calculations are performed each time the audio frame number j is updated. There is one such set of dissimilarities for each standard pattern, N sets in total.

The category decision uses these dissimilarities. First, one of the per-path dissimilarities Dn(j), D'n(j) and D''n(j) is selected for the n-th standard pattern. The selection uses the values L, L' and L'' given by the following equation (Equation 7) for the audio frame number j at which the end of the speech is detected.

These values L, L' and L'' resemble the expressions giving the standard-pattern frame number for an audio frame. Unlike those expressions, they are not limited by the length SL(n) of the standard pattern, so L, L' and L'' do not depend on the type of standard pattern.

From L, L' and L'', the path whose value is closest to the standard pattern length SL(n) is chosen, and only its dissimilarity is kept. For example, if L' is closest to SL(n), path 102 is chosen and its dissimilarity D'n(j) is selected. The selected dissimilarity is denoted DDn.

This selection is made for each standard pattern.

The minimum of the dissimilarities DDn obtained for the standard patterns in the above step is then found.

The category attached to the standard pattern giving this minimum is the recognition result at audio frame number j.
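Path selection and the final decision can be sketched as follows. Equation 7 is not reproduced in this text. L, L' and L'' are therefore assumed to be the unclipped frame counts floor(slope × j) at the end frame, with hypothetical slopes 6/5, 1 and 4/5. Each standard pattern keeps only the dissimilarity of the path whose value is closest to SL(n), and the smallest DDn wins.

```python
# Hypothetical slopes for paths 101-103 (assumption, not from the source).
PATH_SLOPES = {101: (6, 5), 102: (1, 1), 103: (4, 5)}


def select_and_decide(D, lengths, j_end, slopes=PATH_SLOPES):
    """Return (best pattern number, {n: DDn}).

    D[n][p]:    dissimilarity of path p for standard pattern n at the end frame
    lengths[n]: SL(n), the frame length of standard pattern n
    """
    # Assumed form of Equation 7: unclipped, so independent of the pattern.
    L = {p: (j_end * a) // b for p, (a, b) in slopes.items()}
    DD = {}
    for n, per_path in D.items():
        p_sel = min(L, key=lambda p: abs(L[p] - lengths[n]))  # closest to SL(n)
        DD[n] = per_path[p_sel]
    best = min(DD, key=DD.get)
    return best, DD


D = {"a": {101: 5, 102: 9, 103: 9}, "b": {101: 1, 102: 1, 103: 7}}
print(select_and_decide(D, {"a": 12, "b": 8}, 10))  # ('a', {'a': 5, 'b': 7})
```

Pattern "b" has smaller values on its non-selected paths, but only the path consistent with its length takes part in the decision.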

As shown in FIG. 2(b), the frame power may fall below the threshold at audio frame number j1 + 1, so that a silent state is detected. The standard pattern number decided at audio frame number j1 is then stored as the recognition category candidate. The recognition operation is not interrupted until the silent state has continued for 30 frames.

In FIG. 2(b), continued recognition causes one standard pattern number at audio frame number j2 to be updated and stored as the recognition category candidate. The end is confirmed at audio frame number j3, and that standard pattern number is taken as the decision for the input speech.

FIG. 3 is a block diagram of this embodiment, described below with reference to this figure.

52 is an audio frame counter that outputs the audio frame number j.

54 is a ROM that stores values corresponding to standard-pattern frame numbers for each audio frame number. It is addressed by a path signal indicating the type of path and by the audio frame number j, and outputs a value t that can correspond to the standard-pattern frame numbers k, k' and k''.

56 is a ROM that stores the frame lengths SL(n) of all standard patterns. When addressed by standard pattern number n, it outputs the corresponding length.

57 is a comparator that compares the outputs t and SL(n) of the two ROMs, and 58 is a selector.

ROMs 54 and 56, comparator 57 and selector 58 work together for a given audio frame number j, standard pattern number and path signal. If t ≤ SL(n), the selector 58 outputs the standard-pattern frame number t stored in ROM 54. If t > SL(n), it outputs the final frame number, which equals the frame length SL(n).

In this way, three frame numbers k, k' and k'' are generated for each standard pattern each time the audio frame is updated, which sets the matching paths described with reference to FIG. 2.

60 is a standard pattern memory, 61 is an input pattern memory, 62 is an absolute-difference calculator, 64 is an adder, and 65 is a dissimilarity memory. Each time the audio frame number j is updated, the following is done for each of the paths 101 to 103 of each standard pattern:

1. One frame of the pattern is read from the standard pattern memory 60, one element at a time, in synchronization with the channel number i.
2. The absolute-difference calculator 62 computes |W(i, j) - Sn(i, k)| element by element.
3. The adder 64 and the dissimilarity memory 65 update the dissimilarity element by element, in synchronization with channel number i, as shown by the following equation.

(Contents of the dissimilarity memory) ← (contents of the dissimilarity memory after the preceding channel number) + $|W(i,j) - S_n(i,k)|$ ... (Equation 8)
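Equation 8 updates the dissimilarity memory one channel at a time instead of forming the frame distance first. The short sketch below shows that the two orders of summation give the same Dn(j). The function names are hypothetical.

```python
def accumulate_channel_serial(D_prev, w_frame, s_frame):
    """Equation 8: update the memory contents once per channel i."""
    mem = D_prev                          # contents of dissimilarity memory 65
    for w, s in zip(w_frame, s_frame):    # channels read out in sync with i
        mem = mem + abs(w - s)            # calculator 62 feeding adder 64
    return mem


def accumulate_frame_level(D_prev, w_frame, s_frame):
    """FIG. 1 style: form dn(j) first, then Dn(j) = Dn(j-1) + dn(j)."""
    return D_prev + sum(abs(w - s) for w, s in zip(w_frame, s_frame))


print(accumulate_channel_serial(7, [1, 4, 2], [3, 1, 2]))  # 12
print(accumulate_frame_level(7, [1, 4, 2], [3, 1, 2]))     # 12
```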

This example has no separate step that sums the distances for the channels over one frame. The result is nevertheless substantially the same as the following: calculate the one-frame distance between the input pattern W(i, j) and the standard pattern Sn(i, k), then add it to the dissimilarity Dn(j-1) of the immediately preceding audio frame number j-1 to obtain the dissimilarity Dn(j) at frame number j. The dissimilarity memory 65 updates and stores the dissimilarities Dn(j), D'n(j) and D''n(j) for each path of each standard pattern.

67 is a ROM addressed by the audio frame number j and the standard pattern number n. It outputs a path selection signal P that specifies the path to be selected at that audio frame number j. As explained with Equation 7 above, it stores the type of path to be selected as path selection information, for each standard pattern and each audio frame number.

68 is a comparator that compares the path selection signal P with the path signal. 69 is a converter that outputs either a prestored maximum dissimilarity or the dissimilarity read from the dissimilarity memory 65.

After the dissimilarity calculation for each audio frame number j is finished, the dissimilarities are read out in turn from the dissimilarity memory 65, addressed by the standard pattern number n and the path signal. The ROM 67 and the comparator 68 produce the path selection signal P. The converter 69 outputs the dissimilarity corresponding to P, and outputs the maximum dissimilarity during the periods of the other path signals. This performs the path selection, so that one dissimilarity is selected for each standard pattern.
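The converter lets the unchanged minimum-search circuit of FIG. 1 also perform path selection: non-selected paths are simply presented as the maximum dissimilarity. Here is a minimal sketch. It assumes dicts stand in for memory 65 and ROM 67, and uses float("inf") in place of the prestored maximum value.

```python
MAX_DISSIMILARITY = float("inf")  # stands in for the prestored maximum value


def converter_outputs(D, path_select):
    """Yield (n, value) in the read-out order of dissimilarity memory 65.

    D[n][p]:        dissimilarity of path p for standard pattern n
    path_select[n]: path P chosen for pattern n (the role of ROM 67)
    """
    for n, per_path in D.items():
        for p, value in per_path.items():
            # comparator 68 + converter 69: pass only the selected path's value
            yield n, (value if p == path_select[n] else MAX_DISSIMILARITY)


def recognize(D, path_select):
    """Comparator 70, memory 71 and memory 72: a plain running minimum."""
    best_n, best_v = None, MAX_DISSIMILARITY
    for n, v in converter_outputs(D, path_select):
        if v < best_v:
            best_n, best_v = n, v
    return best_n
```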

70 is a comparator, 71 is a minimum dissimilarity memory, 72 is a frame code memory, and 75 is a code memory. Together they recognize one standard pattern number as the category of the input speech. The converter 69 outputs the maximum dissimilarity for the non-selected paths. As a result, this part of the configuration, including its timing relationships, is the same as in the embodiment of FIG. 1.

Table 1 shows the results of recognition experiments that used the configuration corresponding to FIG. 3. The experiments tested which matching paths are effective for overcoming differences in speaking rate. The results showed that three matching paths assuming speaking-rate differences of ±20% are effective.

Table 1

For the evaluation speech, the voices of 20 male and 20 female speakers were recorded through a microphone onto a tape recorder.

The words used in this recognition experiment are the eight words shown in Table 2.

Table 2

The analysis logarithmically transformed the frequency analysis results and used the frame average power, the low-band average power and the high-band average power of each frame.

(Effects of the Invention) As explained above, the present invention calculates the distance for one frame each time the audio frame number is updated and adds it to the dissimilarity at the previous audio frame number. The dissimilarity is therefore up to date at every audio frame number. This has two advantages: the recognition result is obtained immediately after the end is confirmed, and the input pattern memory and other working memories can be relatively small.

Furthermore, the speaking rate is anticipated and a plurality of linear paths are set for each standard pattern. This lets variations in speaking rate be absorbed with a relatively simple configuration.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing one embodiment of the present invention. FIGS. 2 and 3 illustrate another embodiment of the present invention: FIG. 2 is an explanatory diagram of the dissimilarity calculation, and FIG. 3 is a block diagram.

52: frame counter
54: ROM storing values corresponding to standard-pattern frame numbers for each audio frame number
56: ROM storing the frame lengths of the standard patterns
57: comparator
58: selector
60: standard pattern memory
61: input pattern memory
62: absolute-difference calculator
63: distance calculator
64: adder
65: dissimilarity memory
67: ROM storing the path selection information of each standard pattern for each audio frame number
68: comparator
69: converter
70: comparator
71: minimum dissimilarity memory
72: frame code memory
75: code memory

Patent applicant: Oki Electric Industry Co., Ltd.

[Amendment]
1. Indication of the case: Patent Application No. 057280 of 1984 (Showa 59)
2. Title of the invention: Speech recognition method
3. Person making the amendment: relationship to the case: patent applicant; address: 1-7-12 Toranomon, Minato-ku, Tokyo (〒105)
6. Contents of the amendment: (1) On page 19, lines 9 to 10 of the specification, "the process of summing the distances corresponding to channel number i over one frame" is corrected to "the process of summing the distances for the channels over one frame".

Claims

[Claims]

(1) A speech recognition method in which standard patterns are stored, each expressed as a frame time series of frequency components corresponding to a standard speech sound, and characterized in that:
a) an input pattern is extracted from the input speech as a frame time series of frequency components;
b) the start of the input speech is detected and frame counting begins, the audio frame number being updated each time the frame is updated, at least while the voiced state is detected;
c) a matching path is set by generating frame numbers of the standard pattern in an essentially linear relationship to the audio frame number;
d) each time the audio frame number is updated, the distance between the input pattern and each standard pattern is calculated between the frames associated by the matching path;
e) the cumulative distance along the matching path from the start of the input speech to an arbitrary audio frame is defined as the dissimilarity, and the immediately preceding dissimilarity and the distance at the current audio frame number are added and stored, so that the dissimilarity is updated and stored for each standard pattern;
f) all of the dissimilarities are stored separately, or at least one standard pattern code selected from the dissimilarities is updated and stored; and
g) after the end of the input speech is confirmed, either the stored standard pattern code or one standard pattern code selected from the separately stored dissimilarities is recognized as the category of the input speech.

(2) A speech recognition method in which standard patterns are stored, each expressed as a frame time series of frequency components corresponding to a standard speech sound, and characterized in that:
a) an input pattern is extracted from the input speech as a frame time series of frequency components;
b) the start of the input speech is detected and frame counting begins, the audio frame number being updated each time the frame is updated, at least while the voiced state is detected;
c) a plurality of matching paths are set by generating a plurality of frame numbers of the standard pattern in an essentially linear relationship to the audio frame number;
d) each time the audio frame number is updated, the distance between the input pattern and each standard pattern is calculated between the frames associated by each matching path;
e) the cumulative distance along a matching path from the start of the input speech to an arbitrary audio frame is defined as the dissimilarity, and the immediately preceding dissimilarity and the distance at the current audio frame number are added and stored, so that the dissimilarity is updated and stored for each matching path of each standard pattern;
f) the dissimilarities are stored separately, or at least one standard pattern code selected from the dissimilarities is updated and stored; and
g) after the end of the input speech is confirmed, either the stored standard pattern code or one standard pattern code selected from the separately stored dissimilarities is recognized as the category of the input speech.