JPS6356700A - Continuous voice recognition equipment

Info

Publication number: JPS6356700A
Application number: JP61200898A
Authority: JP
Inventors: 三木　敬
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1986-08-27
Filing date: 1986-08-27
Publication date: 1988-03-11

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) This invention relates to a continuous speech recognition device suited to recognizing continuously uttered speech.

(Prior Art) Various methods have been proposed for recognizing a continuously uttered word string (hereinafter also called continuous words); these methods are referred to below as continuous word recognition techniques. They fall into two broad classes. One takes a unit smaller than a word (for example, a syllable or a phoneme) as the basic unit of recognition and uses the recognition results to identify progressively larger units: words, phrases, and sentences. The other takes the word itself as the basic unit. When the word is the basic unit, the first stage of recognition is to determine the interval of the continuous speech in which a given word exists. Relatively long speech intervals such as words are generally detected and identified by pattern matching, in particular by matching based on dynamic programming (hereinafter DP). A technique called continuous DP is especially suited to continuous word recognition. It searches for the presence of predetermined words by performing pattern matching continuously against the input speech, and is known as word spotting. One way to realize continuous word recognition with continuous DP is a two-stage process: first detect and identify words in the input speech to obtain word interval candidates (continuous DP), then determine the optimal word string from those candidates.

A conventional technique that realizes this approach is disclosed in the document "Acoustical Society of Japan, Speech Research Group Materials" (S82-16, June 28, 1982). Continuous DP is described first below, followed by the basic idea of the conventional continuous word recognition technique that uses it.

Continuous DP proceeds as follows.

For the input speech, a feature parameter series (the input pattern) $V(p, i)$, with $p = 1, 2, 3, \ldots, P$ and $i = 1, 2, 3, \ldots, I$, is computed at a fixed time interval called the frame period. In general, a speech feature pattern is a two-dimensional pattern whose axes are time and spectral frequency.

Here $V(p, i)$ denotes the $p$-th spectral component in the $i$-th frame. For each word $n$ subject to word spotting, a feature parameter series $S_n(p, j)$, with $j = 1, 2, \ldots, J_n$ and $p = 1, 2, 3, \ldots, P$, is likewise computed in advance at frame-period intervals, as for the input speech. This series $S_n(p, j)$ is called the standard pattern.

The following symbols are defined.

$D_n(i, j)$: the smallest value, as $m$ is varied, of the minimum cumulative distance between frames $m$ through $i$ of the input pattern and frames $1$ through $j$ of the standard pattern $S_n(p, j)$.

$B_n(i, j)$: the position $m$ that corresponds to $D_n(i, j)$.

$d_n(i, j)$: the distance between the $i$-th frame of the input pattern and the $j$-th frame of the standard pattern $S_n(p, j)$.

Fig. 3 shows the DP path of the DP matching, based on dynamic programming, that this word spotting method uses.

With the DP path of Fig. 3, $D_n(i, j)$ and $B_n(i, j)$ are determined by

$$D_n(i, j) = D_n(\hat{i}, j-1) + d_n(i, j) \tag{1}$$

$$B_n(i, j) = B_n(\hat{i}, j-1) \tag{2}$$

where $\hat{i} = \arg\min_{i-2 \le i' \le i} D_n(i', j-1)$. Equations (1) and (2) are recurrences, so $D_n(i, j)$ and $B_n(i, j)$ can be computed for any $i$ and $j$ by the following procedure.

Step 1: Initialize for the start of the input pattern:

$$D_n(-1, j) = D_n(0, j) = M, \quad j = 1, 2, \ldots, J_n$$

where $M$ is a sufficiently large positive value and $J_n$ is the length of the standard pattern.

Step 2: Execute Steps 3 and 4 for $i = 1, 2, \ldots, I$, where $I$ is the length of the input pattern.

Step 3: Initialize for the case $j = 1$, that is, for a match that starts at the $i$-th frame:

$$D_n(i, 1) = d_n(i, 1) \tag{3}$$

$$B_n(i, 1) = i \tag{4}$$

Step 4: Execute equations (1) and (2) for $j = 2, 3, \ldots, J_n$.

The processing above yields $D_n(i, J_n)$ and $B_n(i, J_n)$. $D_n(i, J_n)$ is the time-normalized cumulative inter-pattern distance between frames $B_n(i, J_n)$ through $i$ of the input pattern and frames $1$ through $J_n$ of the standard pattern.

Therefore, if $D_n(i, J_n)$ is smaller than a predetermined value, the same word as the standard pattern is taken to have been detected between frame $B_n(i, J_n)$ and frame $i$ of the input pattern.

The processing described above thus detects a specific word interval located at any position in continuous speech.
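The recurrences (1) to (4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the cited technique: the function name `continuous_dp`, the 0-based frame indices, and the use of an absolute difference between scalar features as the local distance are all chosen for the example. It also assumes that the DP path lets the input advance by 0, 1, or 2 frames for each frame of the standard pattern.

```python
INF = float("inf")  # plays the role of the large constant M


def continuous_dp(dist):
    """Word spotting for one standard pattern.

    dist[i][j] is the local distance between input frame i and
    standard-pattern frame j (both 0-based; an assumption of this sketch).
    Returns two lists indexed by input frame i:
      D_end[i]: cumulative distance of the best match ending at frame i
      B_end[i]: input frame at which that best match starts
    """
    num_in = len(dist)
    num_ref = len(dist[0])
    D = [[INF] * num_ref for _ in range(num_in)]
    B = [[-1] * num_ref for _ in range(num_in)]
    for i in range(num_in):
        # Equations (3) and (4): a match may start at any input frame.
        D[i][0] = dist[i][0]
        B[i][0] = i
        for j in range(1, num_ref):
            # Choose the best predecessor among input frames i-2, i-1, i.
            best_i, best = None, INF
            for ip in (i - 2, i - 1, i):
                if ip >= 0 and D[ip][j - 1] < best:
                    best_i, best = ip, D[ip][j - 1]
            if best_i is not None:
                D[i][j] = best + dist[i][j]      # equation (1)
                B[i][j] = B[best_i][j - 1]       # equation (2)
    D_end = [D[i][num_ref - 1] for i in range(num_in)]
    B_end = [B[i][num_ref - 1] for i in range(num_in)]
    return D_end, B_end


# Hypothetical scalar features: the template 1, 2, 3 is embedded in the signal.
template = [1, 2, 3]
signal = [0, 0, 1, 2, 3, 0, 0]
dist = [[abs(x - t) for t in template] for x in signal]
D_end, B_end = continuous_dp(dist)
print(D_end[4], B_end[4])  # 0 2: a perfect match ending at frame 4, starting at frame 2
```

Only the last column of `D` and `B` is returned because the later stages use just the values at $j = J_n$.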

Next, a conventional continuous word recognition technique that uses this result is described briefly.

The processing described here follows the document "A Large-Vocabulary Continuous Speech Recognition System Based on Continuous DP" (Acoustical Society of Japan, Speech Research Group Materials S82-11).

First, the condition for detecting a word $n$ from the continuous DP result is fixed, for example as follows.

"If word $n$ satisfies $E_n(i) < \lambda$ continuously for $a$ or more frames from time $k_1$, and word $n$ has not been detected within the $b$ frames preceding time $k_1$, then word $n$ is taken to have been detected at time $i$." Here $E_n(i)$ is given by the following equation.

$$E_n(i) = D_n(i, J_n) / J_n \tag{5}$$

However, this detection condition takes no account of how the spectral pattern of a word is deformed by its connection to the preceding and following words, which is a characteristic of continuous speech, so the detected start and end positions of a word are often displaced. Errors in such word detection fall into three types.

<1> False detection: a word that does not actually exist in the interval is detected.

<2> Missed detection: a word that exists is not detected.

<3> Segmentation error: the detected word interval deviates substantially from the true word interval.

The example just described is a segmentation error of type <3>, which, depending on the subsequent processing, can cause a fatal error. Detection of word intervals in continuous speech should properly evaluate how each interval connects to the preceding and following word intervals and take as its start or end the position at which that evaluation is highest. Incorporating such processing greatly reduces segmentation errors of type <3>.

Fig. 4 shows an example of word interval candidates detected by the conventional detection algorithm. As the figure shows, the word interval candidates A to D often overlap one another on the time axis. That is, the word interval candidates A to D obtained by continuous DP form a lattice, as indicated by the arrows in the figure. This lattice is hereinafter called the word lattice. To find the optimal word string from the word lattice, the conventional example uses the following algorithm.

(a) Creation of the group of word sequences consisting of mutually exclusive frame intervals. Process (a) rests on the idea that, if the correctly recognized sequence exists in the word lattice, the intervals of its words are mutually exclusive.

(b) Determination of the most likely sequence among the above sequences. Next, the likelihood of each whole sequence is obtained from the $E_n(i)$ values of the word interval candidates it contains.

Suppose a word sequence $V_j$ consists of $I_j$ word interval candidates, and let $F_k^j$ be the value of $E_n(i)$ for its $k$-th interval. The likelihood $L_j$ of the whole sequence is then defined as follows.

$$L_j = -[\ldots] \sum_{k} \alpha(k) \cdot F_k^j - \beta \cdot I_j \tag{6}$$

Here $\alpha(k)$ and $\beta$ are arbitrary values determined in advance by experience. The leading coefficient, marked $[\ldots]$, is not legible in the source.

Accordingly, the likelihood $L_j$ of the whole sequence is larger the smaller the sum of the $F$ values, that is, of the inter-pattern distances obtained by continuous DP, and larger the more words $V_j$ contains. The word sequence $V_j$ with the largest likelihood $L_j$ is the recognition result.

(Problems to be Solved by the Invention) In the conventional technique, however, word intervals are detected independently for each word $n$ from the continuous DP result, so the deformation of the spectral pattern caused by connection to the preceding and following words, itself a characteristic of continuous speech, is not taken into account. Moreover, the premise used to find the optimal word string from the word lattice, namely that "the intervals of a correctly recognized word sequence are mutually exclusive," does not hold perfectly. For example, when "ichi" is followed by "ichi", the vowel "i" of the final "chi" of the first word runs into the initial "i" of the second, and the boundary between them cannot be fixed clearly. Interval candidates often overlap, particularly when continuous DP is used to detect word intervals. Segmentation at such a boundary normally requires either considering the context or allowing some overlap at word boundaries, so the premise above does not apply.

The object of this invention is to provide a continuous speech recognition device that, without relying on the premise described above, applies a further stage of DP processing to the distances calculated by continuous DP and can thereby find the optimal word sequence with context also taken into account.

(Means for Solving the Problems) To achieve this object, the continuous speech recognition device of this invention comprises:

a) a preprocessing unit that extracts a feature parameter series from the input speech at fixed time intervals called frames;

b) a standard dictionary unit that stores predetermined standard patterns of the words to be recognized;

c) a continuous DP unit that performs continuous DP matching between the standard patterns and the feature parameter series and calculates, for each frame, the minimum cumulative distance between the feature parameter series and each standard pattern together with the matching start frame position that gives that minimum cumulative distance; and

d) an optimal word sequence determination unit that determines the word minimizing the sum of, for each word to be recognized, the minimum cumulative distance and the cumulative inter-pattern distance of the optimal word sequence up to the matching start frame position.

In implementing this invention, it is preferable that a) the minimum cumulative distance be calculated along a matching path normalized by the number of frames on the standard pattern side in the continuous DP matching between the feature parameter series and the standard pattern, and b) the cumulative inter-pattern distance of the optimal word sequence be expressed as a sum of the minimum cumulative distances.

(Operation) Thus, in continuous word recognition, the continuous speech recognition device of this invention performs word spotting word by word and then applies a further dynamic programming step to the cumulative inter-pattern distances calculated for each frame, so as to calculate the optimal word string. This allows real-time operation, reduces the amount of memory required, and improves the recognition rate.

(Embodiment) An embodiment of the continuous speech recognition device of this invention is described below with reference to the drawings.

Fig. 1 is a functional block diagram of an embodiment of this invention, in which 100 is the speech input terminal. Speech entered at input terminal 100 is A/D converted in preprocessing unit 101, and a feature parameter series $V(p, i)$ is calculated for each frame period. $V(p, i)$ is transferred to speech interval detection unit 102 and also to continuous DP unit 103. Continuous DP unit 103 calculates the inter-pattern distances between the feature parameter series and the patterns held in standard pattern dictionary unit 104, which stores the group of standard speech patterns.

Optimal word sequence determination unit 105 then uses these inter-pattern distances to determine the optimal word sequence. On receiving the end-of-speech detection signal issued by speech interval detection unit 102, it outputs the final recognized word sequence to output terminal 106 as the recognition result.

The operations of components 101 to 105 can each be carried out in software, for example on a microcomputer.

The speech signal entered at input terminal 100 is converted in preprocessing unit 101 into a series of feature vectors, the feature parameter series $V(p, i)$. This series is generally obtained by sampling, at every frame period, the in-band frequency components extracted by a bank of $P$ band-pass filters with different center frequencies. Alternatively, a different method called linear prediction analysis may be used to compute and sample analysis results at every frame period to form the feature parameter series $V(p, i)$. The feature parameter series is sent to continuous DP unit 103 and to speech interval detection unit 102. The configuration and operation of preprocessing unit 101 have already been proposed elsewhere, so further detailed description is omitted.

Speech interval detection unit 102 detects the speech interval, that is, the start and end of the speech, from the feature parameter series $V(p, i)$. Available detection algorithms include an elaborate algorithm that performs detection using the speech power obtained from $V(p, i)$ (Japanese Patent Application No. Sho 59-108668), a simple algorithm that takes the point at which the speech power rises to or above a predetermined threshold as the start of speech and the point at which it falls below the threshold as the end of speech, and others; any suitable algorithm may be used. The configuration and operation of speech interval detection unit 102 have likewise already been proposed, so further detailed description is omitted.
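The simple threshold rule mentioned above can be sketched as follows. This is an illustration only: the function name `detect_speech_sections`, the per-frame `power` list, and the choice to close an interval at the last frame when the signal ends while still above the threshold are assumptions of the sketch, not details given in the text.

```python
def detect_speech_sections(power, threshold):
    """Return (start_frame, end_frame) pairs found by a simple power threshold.

    start_frame: first frame whose power is at or above the threshold.
    end_frame:   first later frame whose power falls below the threshold.
    Frames are 0-based (an assumption of this sketch).
    """
    sections = []
    start = None  # None means "currently outside a speech interval"
    for i, p in enumerate(power):
        if start is None and p >= threshold:
            start = i
        elif start is not None and p < threshold:
            sections.append((start, i))
            start = None
    if start is not None:
        # Assumption: an interval still open at the end of the data
        # is closed at the last available frame.
        sections.append((start, len(power) - 1))
    return sections


# Hypothetical per-frame power values.
print(detect_speech_sections([0, 1, 5, 6, 7, 1, 0, 4, 4, 0], threshold=3))
# [(2, 5), (7, 9)]
```

A practical detector would usually add hangover or smoothing so that a brief dip in power does not split one utterance in two; the sketch omits this to stay close to the rule as stated.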

Next, the operation of the continuous DP unit. Its configuration and operation follow the algorithm described for the conventional example, with one exception. Only the part that differs from the conventional example is therefore described for this embodiment.

In evaluating the recurrences (1) and (2), the conventional example, as already explained, first performs initialization for the start of the input pattern. In this embodiment, the frame $I_S$ that speech interval detection unit 102 judges to be the start of speech is adopted as the start of the input pattern. That is, the input frame number $i$ used in the conventional example becomes, in this embodiment, a frame number relative to $I_S$. Further, the interval up to the frame $I_E$ that speech interval detection unit 102 judges to be the end of speech is taken as the length $I$ of the input pattern.

Therefore,

$$I = I_E - I_S + 1 \tag{7}$$

After the end of speech has been detected, if the start of the next speech is detected, that frame number is again taken as $I_S$ and the same processing is repeated. The continuous DP unit is thus provided with means (not shown) for detecting the frame $I_S$, which is used in place of the conventional input frame number $i$.

Hereinafter, unless otherwise stated, every input frame number $i$ is a frame number relative to $I_S$. The final results calculated here, namely the inter-pattern distance $D_n(i, J_n)$ to each standard pattern $n$ when frame $i$ is taken as the end, and the matching start frame number $B_n(i, J_n)$, are transferred unchanged to optimal word sequence determination unit 105, which is the characteristic part of this invention.

Next, the operation of optimal word sequence determination unit 105, the main part of this embodiment, is described following the flowchart of Fig. 2. In the description below, processing steps are denoted by S.

The following symbols are defined here.

$DD(i)$: the cumulative inter-pattern distance of the optimal word sequence up to frame $i$.

$X(i)$: the number of words in the optimal word sequence that gives $DD(i)$.

$N(i)$: the name of the $X(i)$-th word that gives $DD(i)$.

$BB(i)$: the start frame number of the $X(i)$-th word that gives $DD(i)$.

The optimal word sequence that gives $DD(I_E)$ is the recognition result. With dynamic programming, $DD(I_E)$ is obtained by solving the following recurrences.

$$DD(i) = \min_n \left[ DD(m-1) + D_n(i, J_n) \right], \quad m = B_n(i, J_n) \tag{8}$$

$$X(i) = X(\hat{m}) + 1 \tag{9}$$

$$N(i) = \hat{n} \tag{10}$$

$$BB(i) = \hat{m} \tag{11}$$

Here $\hat{n}$ is the word name $n$ that gives the minimum in equation (8), and $\hat{m}$ is the value of $m$ when $n = \hat{n}$.

Computing equations (8) to (11) in order for $1 \le i \le I$ yields $DD(I_E)$.
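The forward pass over equations (8), (10), and (11) can be sketched as below. This is a hedged illustration rather than the embodiment's code: the function name `forward_pass`, the list layout of `D_end` and `B_end`, and the convention that frames are numbered from 1 with index 0 holding the initial state $DD(0) = 0$ are assumptions. The word count of equation (9) is left out, since it is not needed to recover the word sequence.

```python
INF = float("inf")


def forward_pass(D_end, B_end):
    """Second-stage DP over the continuous DP outputs.

    D_end[n][i-1]: D_n(i, J_n), distance of word n ending at frame i (i from 1).
    B_end[n][i-1]: B_n(i, J_n), start frame of that match (1-based, <= i).
    Returns DD, N, BB as lists indexed by frame 0..I, where index 0 is the
    initial state before any speech frame.
    """
    num_frames = len(D_end[0])
    DD = [0.0] + [INF] * num_frames   # DD(0) = 0
    N = [None] * (num_frames + 1)     # best last word ending at each frame
    BB = [0] * (num_frames + 1)       # start frame of that word
    for i in range(1, num_frames + 1):
        for n in range(len(D_end)):
            m = B_end[n][i - 1]
            candidate = DD[m - 1] + D_end[n][i - 1]   # bracket of equation (8)
            if candidate < DD[i]:
                DD[i] = candidate    # equation (8)
                N[i] = n             # equation (10)
                BB[i] = m            # equation (11)
    return DD, N, BB


# Hypothetical data: word 0 fits frames 1-2 well, word 1 fits frames 3-4 well.
D_end = [[5, 0.1, 5, 5],
         [5, 5, 5, 0.2]]
B_end = [[1, 1, 3, 4],
         [1, 2, 3, 3]]
DD, N, BB = forward_pass(D_end, B_end)
print(N[4], BB[4])  # 1 3: the best sequence ends with word 1 starting at frame 3
```

Each frame needs only values from earlier frames, so the inner loop can run as soon as the continuous DP results for frame `i` arrive.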

Accordingly, the following Steps 1 to 4 are performed.

Step 1: Fetch $D_n(i, J_n)$ and $B_n(i, J_n)$ from continuous DP unit 103 (S1).

Step 2: Initialize for the start of the speech interval (S2):

$$DD(0) = X(0) = BB(0) = 0$$

Step 3: Execute equations (8) to (11) above over the words $n = 1, 2, \ldots$ (S4 to S7).

Specifically, the bracketed quantity $DD(m-1) + D_n(i, J_n)$ in equation (8) is computed first, and then $DD(i)$ of equation (8) is obtained (S4).

Next, the word name $n$ that gives this minimum is taken as $\hat{n}$, and the name of the $X(i)$-th word is set to $N(i) = \hat{n}$ (S5).

Next, the value of $m$ for $n = \hat{n}$ is taken as $\hat{m}$, and the start frame number of the $X(i)$-th word is set to $BB(i) = \hat{m}$ (S6).

Next, the number of words $X(i)$ in the optimal word sequence that gives $DD(i)$ is obtained from equation (9) (S7).

The processing above (S2 to S7) is repeated up to the final frame $I$ (S8), after which the next step follows.

Step 4: Backtrace $N(i)$ and $BB(i)$ to obtain the optimal word sequence. Starting from the final frame $i = I$ (S9), the start $BB(I)$ of the word $N(I)$ in the $X(I)$-th position is used to find the name $N(BB(I) - 1)$ of the word in the $(X(I) - 1)$-th position together with its start $BB(BB(I) - 1)$ (S10, S11).

The same processing is repeated down to the first position to obtain the result (S12).
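The backtrace can be sketched as a short loop. The function name `backtrace`, the 1-based frame convention with index 0 as the initial state, and the returned `(word, start_frame, end_frame)` tuples are assumptions made for this illustration; the arrays in the example are hypothetical values of $N(i)$ and $BB(i)$.

```python
def backtrace(N, BB, last_frame):
    """Recover the optimal word sequence from N(i) and BB(i).

    N[i]:  word chosen as the last word of the best sequence ending at frame i.
    BB[i]: start frame of that word (1-based, 1 <= BB[i] <= i).
    Returns a list of (word, start_frame, end_frame) in spoken order.
    """
    words = []
    i = last_frame
    while i >= 1:
        words.append((N[i], BB[i], i))
        i = BB[i] - 1          # jump to the frame just before this word starts
    words.reverse()            # collected last-to-first, so restore the order
    return words


# Hypothetical tables for a 4-frame utterance (index 0 is the initial state).
N = [None, 0, 0, 0, 1]
BB = [0, 1, 1, 3, 3]
print(backtrace(N, BB, last_frame=4))
# [(0, 1, 2), (1, 3, 4)]
```

The loop terminates because each jump moves to a strictly earlier frame, and it visits one table entry per recognized word.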

The flow of processing described above is only an example, and the invention is not limited to it.

As is clear from the above, once $D_n(i, J_n)$ and $B_n(i, J_n)$ arrive from continuous DP unit 103 for a frame, optimal word sequence determination unit 105 has every value needed to evaluate equations (8) to (11), so it can begin that computation at once. In other words, processing based on the algorithm of this embodiment can be carried out entirely frame by frame.

When the end of the speech interval is detected, all processing except the backtrace of Step 4 is already complete, which makes the method well suited to real-time processing. The memory that optimal word sequence determination unit 105 needs for intermediate results is also very small: only the four values $DD(i)$, $X(i)$, $N(i)$, and $BB(i)$ for each frame of the speech interval. Furthermore, the word count $X(i)$ is not actually required when only the optimal word sequence is wanted. The computation of equation (9) may then be omitted, reducing the memory further.

(Effects of the Invention) As is clear from the description above, the continuous speech recognition device of this invention calculates the optimal word sequence by applying dynamic programming once more, using only the per-frame results calculated by continuous DP matching. It can therefore operate in real time and needs very little memory.

Moreover, it does not use an algorithm of the conventional kind, which compresses information at a lower level: deciding word intervals directly from the per-frame continuous DP matching results without considering their relation to the preceding and following word intervals, and then finding the optimal word string from those word intervals alone. Errors made in such lower-level processing therefore do not accumulate, which gives the advantage of a high recognition rate.

[Brief Description of the Drawings]

Fig. 1 is a functional block diagram for explaining an embodiment of the continuous speech recognition device of this invention; Fig. 2 is a flowchart of the operation of the optimal word sequence determination unit, used to explain this invention; Fig. 3 shows the continuous DP path, used to explain this invention and the conventional continuous speech recognition device; and Fig. 4 is an explanatory diagram of word interval candidates, used to explain this invention and the conventional continuous speech recognition device.

100: speech input terminal; 101: preprocessing unit; 102: speech interval detection unit; 103: continuous DP unit; 104: standard pattern dictionary unit; 105: optimal word sequence determination unit; 106: output terminal.

Patent applicant: Oki Electric Industry Co., Ltd.

Claims

[Claims]

(1) A continuous speech recognition device comprising: a) a preprocessing unit that extracts a feature parameter series from input speech at fixed time intervals called frames; b) a standard dictionary unit that stores predetermined standard patterns of the words to be recognized; c) a continuous DP unit that performs continuous DP matching between the standard patterns and the feature parameter series and calculates, for each frame, the minimum cumulative distance between the feature parameter series and each standard pattern and the matching start frame position that gives the minimum cumulative distance; and d) an optimal word sequence determination unit that determines the word minimizing the sum of, for each word to be recognized, the minimum cumulative distance and the cumulative inter-pattern distance of the optimal word sequence up to the matching start frame position.

(2) The continuous speech recognition device according to claim 1, wherein a) the minimum cumulative distance is calculated along a matching path normalized by the number of frames on the standard pattern side in the continuous DP matching between the feature parameter series and the standard pattern, and b) the cumulative inter-pattern distance of the optimal word sequence is expressed as a sum of the minimum cumulative distances.