JPH01321498A - Speech recognizing device

Info

Publication number: JPH01321498A
Application number: JP63155259A
Authority: JP
Inventors: Takeshi Norimatsu; 武志則松
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1988-06-23
Filing date: 1988-06-23
Publication date: 1989-12-27

Abstract

PURPOSE: To suppress misrecognition and to improve rejection performance when a word that is not a recognition target is input, by detecting the voiceless sections of the standard patterns in advance and carrying out pattern matching while storing the previous passing point in a back pointer. CONSTITUTION: A speech analysis part 2 extracts a time series of feature vectors from an input speech signal, and a voiceless section detection part 3 detects voiceless parts from the time series of speech energy values and stores the following frame positions where the energy rises. A similarity calculation part 4 then matches the input pattern against the standard patterns with the paths limited at the intersections of the voiceless section points. Further, a back pointer storage part 6 stores, at the time of cumulative distance calculation, the intersections of voiceless section points passed before each point is reached, and a recognition and decision part 7 takes the standard pattern giving the shortest of the distance values obtained by the similarity calculation part 4 as the recognition candidate speech. Consequently, misrecognition among similar words is suppressed, and a word that is not a recognition target is rejected as far as possible when it is input.

Description

DETAILED DESCRIPTION OF THE INVENTION

Field of Industrial Application

The present invention relates to a speech recognition device that derives a recognition result by pattern matching between an input speech pattern and each standard pattern.

Prior Art

In general, a speech recognition device calculates the similarity between the input speech pattern and each standard pattern stored in a dictionary, and takes the standard pattern with the maximum similarity as the recognition result. To calculate the similarity between two speech patterns, pattern matching that non-linearly expands and contracts the time axes of the two patterns by dynamic programming (hereinafter, DP matching) is used. Word speech recognition devices in particular achieve high recognition rates with this DP matching method (see, for example, H. Sakoe and S. Chiba, "Dynamic programming optimization for spoken word recognition," IEEE Trans. Acoustics, Speech, Signal Processing, vol. ASSP-27, pp. 336-349).

Problems to be Solved by the Invention

However, because the above speech recognition device performs pattern matching based only on spectral shape, DP matching between two different speech patterns can still yield a small distance when the time axis is expanded or contracted to an extreme degree, and this causes misrecognition. For example, viewed as energy sequences, "Osaka" (大阪) has three energy peaks and "Oita" (大分) has two, so the two are clearly different patterns. Because they are phonetically similar, however, DP matching can produce a small distance between them and cause misrecognition.

In addition, in DP matching between similar speech patterns, the matching is performed over the entire speech pattern, so the difference between the two is buried. As a result, the distance between the patterns becomes small and misrecognition is likely to occur.

Furthermore, even when a word outside the recognition vocabulary is input, it is matched to one of the standard patterns, so there is a limit to the rejection performance for such out-of-vocabulary words.

In view of the above problems, the present invention provides a speech recognition device that suppresses, as far as possible, misrecognition between similar speech patterns and misrecognition caused by extreme DP matching, and that also improves rejection performance when a word outside the recognition vocabulary is input.

Means for Solving the Problems

The speech recognition device of the present invention comprises: a speech analysis unit that outputs, from input speech, a time series of feature vectors including an energy series; a silent section detection unit that detects silent sections from the energy series output by the speech analysis unit; and a similarity calculation unit that calculates, over all possible paths, a pattern matching in which the matching path is limited at the intersections of the silent-section points of the input pattern and the standard pattern obtained by the silent section detection unit, and that, when calculating the cumulative distance at each point, stores as a back pointer which intersection of silent-section points the path has previously passed through.

Operation

With the configuration described above, the present invention detects the silent sections of the standard patterns in advance and executes pattern matching while storing previous passing points in back pointers. A single DP matching run therefore computes a distance that covers every way of pairing silent-section points, including pairings that allow for detection errors in those points. As a result, misrecognition between similar patterns and misrecognition caused by extreme matching can be suppressed as far as possible without increasing the processing time required for pattern matching, and a word outside the recognition vocabulary can be rejected as far as possible when it is input.

Embodiment

A speech recognition device according to an embodiment of the present invention is described below with reference to the drawings.

Fig. 1 is a block diagram of a speech recognition device according to an embodiment of the present invention. In Fig. 1, 1 is a microphone for inputting a speech signal (a telephone handset or the like may also be used). 2 is a speech analysis unit, which extracts a time series of feature vectors from the input speech signal. 3 is a silent section detection unit, which detects silent portions from the time series of speech energy values and stores the frame position of the subsequent rise in energy. 4 is a similarity calculation unit, which performs pattern matching with the path limited at the intersections of the silent-section points of the input pattern and the standard pattern. 5 is a storage unit, which stores the time series of feature vectors and the silent-section points of the input pattern and of all standard patterns. 6 is a back pointer storage unit, which stores, when the cumulative distance at each point is calculated, the intersection of silent-section points passed before that point was reached. 7 is a recognition decision unit, which takes the standard pattern giving the minimum of the distance values obtained by similarity calculation unit 4 as the recognition candidate speech.

Fig. 2 is an explanatory diagram of the device shown in Fig. 1.

Next, the operation of the above speech recognition device is described.

First, speech is input from microphone 1. Speech analysis unit 2 converts the input speech signal from analog to digital, computes a time series of speech feature vectors (for example, 10th-order linear prediction coefficients) and an energy series, and stores them in storage unit 5. Next, silent section detection unit 3 examines the stored energy series and detects as a silent section any interval in which the energy value stays below a predetermined threshold for longer than a fixed time $T_0$. It stores the number of silent sections and, for each one, the frame position at which the energy subsequently rises (hereinafter, this frame position is called a Q point). The time series of feature vectors of each standard pattern, the number of silent sections in each standard pattern, and the frame positions of the subsequent energy rises are assumed to be stored in storage unit 5 in advance. Next, similarity calculation unit 4 performs pattern matching (for example, DP matching) between the input speech pattern and each standard pattern, with the matching path constrained by the Q points detected by silent section detection unit 3. For simplicity, the pattern matching operation is described below for the case where the input pattern and the standard pattern each contain two Q points.
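The Q-point detection described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `find_q_points`, the 0-based frame indices, and the assumption that the energy series has already been trimmed to the utterance (so it starts with speech) are all choices made for the example.

```python
from typing import List


def find_q_points(energy: List[float], threshold: float, t0: int) -> List[int]:
    """Return the frames where energy rises again after a silent section.

    A silent section is a run of frames with energy below `threshold`
    that lasts longer than `t0` frames. Frame indices are 0-based
    (an assumption of this sketch; the text counts frames from 1).
    """
    q_points: List[int] = []
    run = 0  # length of the current below-threshold run
    for frame, value in enumerate(energy):
        if value < threshold:
            run += 1
        else:
            if run > t0:
                # First frame at or above threshold after a long enough silence.
                q_points.append(frame)
            run = 0
    return q_points


energy = [5, 5, 0, 0, 0, 5, 5, 0, 5, 0, 0, 0, 0, 6]
print(find_q_points(energy, threshold=1.0, t0=2))
```

In this usage example the single low frame at index 7 is shorter than `t0` and therefore does not produce a Q point.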

Here, let the start and end points of the input pattern be $Q_{I0}$ $(= 1)$ and $Q_{I3}$, and the start and end points of the standard pattern be $Q_{R0}$ $(= 1)$ and $Q_{R3}$. Let the Q points of the input pattern and the standard pattern, in order from the start point, be $Q_{I1}$, $Q_{I2}$ and $Q_{R1}$, $Q_{R2}$ respectively, and let the intersections of the Q points on the two-dimensional plane be $(Q_{I1}, Q_{R1})$, $(Q_{I1}, Q_{R2})$, $(Q_{I2}, Q_{R1})$ and $(Q_{I2}, Q_{R2})$.
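As a small illustration of the four intersections just listed, the sketch below forms every pairing of input and standard Q points. The helper name `q_intersections` and the sample frame numbers are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def q_intersections(q_in: Sequence[int], q_ref: Sequence[int]) -> List[Tuple[int, int]]:
    """Pair every input-pattern Q point with every standard-pattern Q point."""
    return list(product(q_in, q_ref))


# Two Q points on each pattern give four intersections on the (i, j) plane.
print(q_intersections([12, 30], [10, 27]))
```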

This arrangement is shown in Fig. 2. For simplicity, restrictions imposed by a matching window are not considered here.

Pattern matching is performed with the matching path restricted to the intersections that pair Q points detected on the input pattern with Q points detected on the standard pattern. Allowing for the case where a Q point is detected incorrectly, the three types of path shown in Fig. 2 must be considered. The following recurrence is used here for pattern matching.

$$g(1,1) = d(1,1) \qquad (1)$$

$$g(i,j) = \min\{\, g(i-1,j),\ g(i-1,j-1),\ g(i-1,j-2) \,\} + d(i,j) \qquad (2)$$

$$g(1,j) = g(i,1) = \infty \qquad (i, j = 2, 3, \ldots)$$

Here $g(i,j)$ and $d(i,j)$ denote the cumulative distance and the inter-vector distance at point $(i,j)$, respectively.
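Recurrences (1) and (2) can be sketched in code as below, without any Q-point constraint. The function name `dp_distance` is hypothetical, and the Euclidean distance used for $d(i,j)$ is an assumption, since the text does not specify the inter-vector distance. Indices are 0-based, so `g[0][0]` corresponds to $g(1,1)$.

```python
import math
from typing import List, Sequence


def dp_distance(x: Sequence[Sequence[float]], y: Sequence[Sequence[float]]) -> float:
    """Cumulative distance at the end point under recurrences (1) and (2)."""
    n_i, n_j = len(x), len(y)
    # Every cell starts at infinity, which also covers g(1, j) = g(i, 1) = inf.
    g: List[List[float]] = [[math.inf] * n_j for _ in range(n_i)]
    g[0][0] = math.dist(x[0], y[0])  # equation (1); Euclidean d is an assumption
    for i in range(1, n_i):
        for j in range(1, n_j):
            best = min(
                g[i - 1][j],
                g[i - 1][j - 1],
                g[i - 1][j - 2] if j >= 2 else math.inf,
            )
            g[i][j] = best + math.dist(x[i], y[j])  # equation (2)
    return g[-1][-1]


x = [[0.0], [1.0], [2.0]]
y = [[0.0], [1.0], [2.0]]
print(dp_distance(x, y))
```

Because the first row and column stay at infinity, every path must leave the start point along one of the two diagonal steps.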

Let the back pointer at point $(i,j)$ be $b(i,j)$, with initial value $b(1,1) = (1,1)$.

First, with $i = 1$ and $j = 1$, the cumulative distance $g$ is calculated according to the above recurrence while $j$ is incremented. When processing has reached $j = Q_{R3}$, the end point of the standard pattern, $i$ is incremented and processing continues in the same way. The back pointer at each point $(i,j)$ is obtained as follows.

When $i = Q_{Ik}$ and $j = Q_{Rl}$ (where $k, l = 1, 2$):

$$b(i,j) = (Q_{Ik}, Q_{Rl})$$

When $i \neq Q_{Ik}$ or $j \neq Q_{Rl}$ (where $k, l = 1, 2$):

When $(i,j)$ lies in the range $Q_{Ik} < i \le Q_{I,k+1}$ and $Q_{Rl} < j \le Q_{R,l+1}$, $b(i,j)$ is set to the content of the back pointer belonging to the $g(\cdot)$ that gives the minimum in the first term on the right side of recurrence (2). However, a point $(Q_{Im}, Q_{Rn})$ that can be taken as $b(i,j)$ here must satisfy ($m = k$ and $n = l$) or ($n = l$ and $m = k$).
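The back-pointer bookkeeping can be sketched as below. This is a simplified illustration only: it shows how $b(i,j)$ could be set at an intersection and otherwise inherited from the minimising predecessor of recurrence (2), and it deliberately omits the final admissibility condition on $(Q_{Im}, Q_{Rn})$, so it does not by itself restrict the path. The function name `dp_with_back_pointers`, the precomputed local-distance matrix `d`, and the 0-based indices are assumptions of the example.

```python
import math
from typing import List, Sequence, Tuple

Cell = Tuple[int, int]


def dp_with_back_pointers(
    d: Sequence[Sequence[float]], q_in: Sequence[int], q_ref: Sequence[int]
) -> Tuple[List[List[float]], List[List[Cell]]]:
    """Compute g(i, j) and, for each cell, the last Q intersection passed.

    d[i][j] is the local distance; q_in and q_ref hold 0-based Q-point frames.
    """
    n_i, n_j = len(d), len(d[0])
    g = [[math.inf] * n_j for _ in range(n_i)]
    b: List[List[Cell]] = [[(0, 0)] * n_j for _ in range(n_i)]
    g[0][0] = d[0][0]  # equation (1); b[0][0] is the start point itself
    q_in_set, q_ref_set = set(q_in), set(q_ref)
    for i in range(1, n_i):
        for j in range(1, n_j):
            # Predecessors allowed by recurrence (2).
            preds = [(i - 1, j), (i - 1, j - 1)]
            if j >= 2:
                preds.append((i - 1, j - 2))
            pi, pj = min(preds, key=lambda c: g[c[0]][c[1]])
            g[i][j] = g[pi][pj] + d[i][j]
            if i in q_in_set and j in q_ref_set:
                b[i][j] = (i, j)  # this cell is an intersection of Q points
            else:
                b[i][j] = b[pi][pj]  # inherit from the minimising predecessor
    return g, b


d = [[0.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
g, b = dp_with_back_pointers(d, q_in=[2], q_ref=[2])
print(g[3][3], b[3][3])
```

Carrying the back pointer with each cell is what lets a single pass over the grid know which intersection a partial path last went through, so an admissibility test can be applied locally without rerunning the matching for each pairing.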

The back pointer $b(i,j)$ obtained in this way is stored in back pointer storage unit 6. The cumulative distance finally obtained at the point $(Q_{I3}, Q_{R3})$ is then the similarity for the optimal constrained path, chosen with all three types of path in Fig. 2 taken into account.

After the similarity to every standard pattern has been calculated, recognition decision unit 7 selects the standard pattern with the maximum similarity as the recognition candidate speech and outputs it externally.
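The decision step can be sketched as picking the standard pattern with the smallest distance. The function name `decide` and the `reject_above` threshold are assumptions added for illustration; the text does not state how a rejection is triggered. Here a candidate is rejected when no finite-distance path exists or when the best distance exceeds the hypothetical threshold.

```python
import math
from typing import Dict, Optional


def decide(distances: Dict[str, float], reject_above: float = math.inf) -> Optional[str]:
    """Return the word with the minimum distance, or None to reject the input."""
    if not distances:
        return None
    word = min(distances, key=distances.get)
    best = distances[word]
    # Reject if no admissible path was found or the match is too poor
    # (the threshold is a hypothetical parameter, not taken from the text).
    if math.isinf(best) or best > reject_above:
        return None
    return word


print(decide({"osaka": 3.2, "oita": 7.5}))
```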

As described above, this embodiment provides silent section detection unit 3, which detects the silent sections of the speech; similarity calculation unit 4, which calculates the similarity between the input pattern and a standard pattern by pattern matching with the path restricted by the intersections of the silent-section points; and back pointer storage unit 6, which stores the intersection of Q points passed previously. With these, a single pattern matching run can take every possible path restriction into account, so misrecognition is prevented without increasing calculation time, and out-of-vocabulary speech input is rejected as far as possible.

Effects of the Invention

As described above, the present invention provides a silent section detection unit that detects silent sections in a speech pattern and stores their positions, and a similarity calculation unit that calculates pattern matching restricted to pass through the intersections of the silent-section points of the input pattern and the standard pattern while using back pointers, during cumulative distance calculation, to store the optimal intersection of silent-section points passed so far. This makes pattern matching that follows the energy envelope possible. Because a single pattern matching calculation takes every way of pairing silent-section points into account, the invention can provide a speech recognition device that suppresses misrecognition between similar words without increasing processing time and that rejects out-of-vocabulary words as far as possible when they are input.

[Brief explanation of the drawing]

Fig. 1 is a block diagram of a speech recognition device according to an embodiment of the present invention, and Fig. 2 is an explanatory diagram of the operation of the device.

1: microphone
2: speech analysis unit
3: silent section detection unit
4: similarity calculation unit
5: storage unit
6: back pointer storage unit
7: recognition decision unit

Claims

[Claims]

(1) A speech recognition device comprising: a speech analysis unit that outputs, from input speech, a time series of feature vectors including an energy series; a silent section detection unit that detects silent portions in the speech pattern from the energy series output by the speech analysis unit; and a similarity calculation unit that calculates the similarity between the input pattern and each standard pattern stored in advance in a dictionary by a pattern matching distance calculation in which the matching path is limited at the intersections of the silent-section points of the input pattern, obtained by the silent section detection unit, and those of the standard pattern, the distance being calculated on the matching paths limited by all possible intersections of silent-section points so as to allow for errors in detecting the silent-section points.

(2) The speech recognition device according to claim 1, wherein the similarity calculation unit has a back pointer that stores, when the cumulative distance at each point is calculated, which intersection of silent-section points the matching path passed through before reaching that point.