JPH05165491A

JPH05165491A - Voice recognizing device

Info

Publication number: JPH05165491A
Application number: JP3336679A
Authority: JP
Inventors: Tsuneo Nitta; 恒雄新田; Akira Nakayama; 昭中山
Original assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Current assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Priority date: 1991-12-19
Filing date: 1991-12-19
Publication date: 1993-07-02

Abstract

PURPOSE: To detect speech sections with high accuracy even under noise, and to improve recognition performance and reliability. CONSTITUTION: An acoustic analysis unit 1 acoustically analyzes the input speech and computes its feature parameters. From these feature parameters, a partial pattern detection unit 2 detects a partial pattern of the input speech by computing the similarity between a portion of the input speech and standard speech patterns registered in advance in a keyword standard pattern memory 21. With the position of the partial pattern as a reference, a word boundary hypothesis generation unit 3 obtains a plurality of section candidates for the input speech and generates a feature pattern for each section. A recognition unit 4 recognizes the input speech by computing the similarity between each feature pattern generated by the word boundary hypothesis generation unit 3 and standard speech patterns registered in advance in a word standard pattern memory 41. High recognition performance can thus be achieved even under noise.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device.

[0002]

[Prior Art] Speech recognition technology plays an important role in realizing a good man-machine interface. Speech section detection is an important preprocessing step for improving recognition performance, and it has long been the subject of research and development. For a practical speech recognition device, improving noise robustness is also an important issue, but what is especially problematic in speech recognition under noise is speech section detection.

[0003] In conventional speech recognition devices, speech section detection relies solely on the power time series of the input speech: the time at which the speech power rises above a predetermined threshold T1 is detected as the start point S, and the time at which, after the start point has been detected, the speech power falls below a predetermined threshold T2 is detected as the end point E.
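The conventional detection described in this paragraph can be sketched as follows. This is a minimal illustration, not code from the patent: the function name `detect_section`, the frame-indexed list `power`, and the choice to close the section at the last frame when the power never falls below T2 are all assumptions made for the example.

```python
def detect_section(power, t1, t2):
    """Return (start, end) frame indices of the detected speech section.

    start: first frame whose power is greater than t1 (start point S).
    end:   first later frame whose power is smaller than t2 (end point E).
    Returns None when the power never exceeds t1.
    """
    start = None
    for j, p in enumerate(power):
        if start is None:
            if p > t1:
                start = j
        elif p < t2:
            return start, j
    if start is None:
        return None
    # Assumption: if the power never drops below t2, close at the last frame.
    return start, len(power) - 1
```

Because the first threshold crossing fixes the result, a short burst of noise ahead of the word is enough to open (and close) the section early, for example `detect_section([0, 9, 0, 0, 5, 6, 1], 4, 3)` yields the burst at frames 1 to 2 rather than the later speech.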

[0004]

[Problems to Be Solved by the Invention] However, because this kind of detection determines the speech section uniquely, any breath noise or tongue-click noise present just before or after the actual speech section is also detected as part of the speech section.

[0005] Conversely, for spoken words whose initial or final syllable tends to be devoiced, the power of the devoiced syllable becomes extremely small, so that portion is easily dropped from the detected speech section.

[0006] Such errors in speech section detection cause fatal misrecognition in speech recognition.

[0007] The present invention has been made in view of the above circumstances, and its object is to provide a speech recognition device that can detect speech sections with high accuracy even under noise and thereby improves recognition performance and reliability.

[0008]

[Means for Solving the Problems] To achieve the above object, the present invention is characterized by comprising: a partial pattern detection unit that detects a partial pattern of the input speech by computing, on the basis of the acoustic analysis result of the input speech, the similarity between a portion of the input speech and a corresponding standard speech pattern; a word boundary hypothesis generation unit that obtains a plurality of speech section candidates for the input speech with reference to the position of the partial pattern detected by the partial pattern detection unit and generates a feature pattern for each section; and a recognition unit that recognizes the input speech by computing the similarity between each feature pattern generated by the word boundary hypothesis generation unit and a corresponding standard speech pattern.

[0009]

[Operation] The operation of the invention configured as above is described below.

[0010] The partial pattern detection unit detects a partial pattern of the input speech by a similarity computation based on the acoustic analysis result of the input speech. The word boundary hypothesis generation unit obtains a plurality of speech section candidates for the input speech with reference to the position of the partial pattern detected by the partial pattern detection unit, and then generates a feature pattern for each candidate section. The recognition unit recognizes the input speech by a similarity computation based on each feature pattern generated by the word boundary hypothesis generation unit.

[0011] Because a plurality of speech section candidates is obtained from the partial pattern found by acoustic analysis of the input speech, and the input speech is recognized by similarity computation, high recognition performance can be achieved even under noise.

[0012]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings.

[0013] FIG. 1 is a schematic block diagram of a speech recognition device according to one embodiment of the present invention. The device comprises: an acoustic analysis unit 1 that acoustically analyzes the input speech to obtain its feature parameters; a partial pattern detection unit 2 that detects a partial pattern of the input speech from the feature parameters obtained by the acoustic analysis unit 1; a word boundary hypothesis generation unit 3 that, with reference to the position of the partial pattern detected by the partial pattern detection unit 2, obtains word section candidates (L3) as the plurality of speech section candidates for the input speech and generates a feature pattern for each section; and a recognition unit 4 that recognizes the input speech from the feature patterns generated by the word boundary hypothesis generation unit 3.

[0014] FIG. 2 shows an example of the word set in this embodiment. Every word in the set contains the same keyword "kai" (階, "floor"). FIG. 3 shows, in gray scale, an example of the feature parameters obtained when "nikai" is uttered. Each unit is described below.

[0015] The acoustic analysis unit 1 obtains the speech power time series from the input speech as the feature used for speech section detection and obtains, as the feature used for matching against the keyword standard patterns, for example the outputs of a bank of band-pass filters that perform frequency analysis. It sends these features, as the feature parameters of the input speech, to the keyword detection unit 20 of the partial pattern detection unit 2 (described later) and to the other units.

[0016] The partial pattern detection unit 2 comprises a keyword detection unit 20 and a keyword standard pattern memory 21.

[0017] The keyword standard pattern memory 21 holds, registered in advance, the keyword standard patterns that the keyword detection unit 20 uses when detecting keywords. They are obtained by acoustically analyzing known input speech that contains the keyword; for example, as shown in FIG. 9 described later, the keyword "kai" (/kai/) is divided into the two elements /ka/ and /ai/, which serve as the keyword standard patterns.

[0018] Using the feature parameters obtained from the acoustic analysis unit 1, the keyword detection unit 20 detects the rough utterance section L1 that contains the input speech and the keyword contained in that section L1. It then outputs the detection results, together with the feature parameters of the input speech from the acoustic analysis unit 1, to the word boundary hypothesis generation unit 3 described later.

[0019] Specifically, the keyword detection unit 20 finds the utterance section L1 by, for example, the conventional speech section detection method, in which L1 is the span from the time the speech power exceeds a predetermined threshold T1 to the time it falls below a threshold T2. In FIG. 3, the utterance section L1 corresponds to the span from S to E.

[0020] The keyword detection unit 20 detects the keyword contained in the utterance section L1 by searching the obtained section L1 for a keyword (for example, "kai"). Specifically, as shown in FIG. 8 described later, the section of segment 22 is cut out successively from the utterance section L1 while being shifted along the time axis (for each frame j), and the section L2 of the keyword "kai" is detected from the similarity values between the feature pattern of that section and the keyword standard patterns (/ka/, /ai/) and from the difference between those similarity values. In FIG. 3, the section L2 from KS to KE corresponds to this.
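The frame-by-frame cutting out of segment 22 and the similarity computation can be sketched as below. The patent does not state which similarity measure is used for keyword matching, so cosine similarity is substituted here purely as a stand-in; the function names, the flattened `template` vector, and the parameter `seg_len` (segment length in frames) are likewise assumptions of this example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (stand-in measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def sliding_similarities(frames, template, seg_len):
    """Similarity between the template and each seg_len-frame segment.

    frames:   list of per-frame feature vectors (e.g. filter-bank outputs).
    template: flattened standard pattern of seg_len * len(frames[0]) values.
    Returns one similarity per segment start frame j.
    """
    sims = []
    for j in range(len(frames) - seg_len + 1):
        # Cut out the segment starting at frame j and flatten it.
        segment = [v for frame in frames[j:j + seg_len] for v in frame]
        sims.append(cosine(segment, template))
    return sims
```

Running this once with a /ka/ template and once with an /ai/ template would give the two per-frame similarity series from which the difference between the similarity values can then be taken.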

[0021] With reference to the position of the keyword (for example, "kai") detected by the keyword detection unit 20, the word boundary hypothesis generation unit 3 adaptively sets various word section detection parameters for the feature parameters of the input speech obtained by the acoustic analysis unit 1, and sets the plurality of word section candidates (L3) mentioned above. That is, it takes the start position of the keyword "kai" found by the keyword detection unit 20 as the end-point candidate of the word section L3, obtains a plurality of start-point candidates (S1, S2, ..., SM) at earlier times, and thereby sets a plurality of word section candidates (L3). In FIG. 3, L30 (S1, KS) and L31 (S2, KS) are word section candidates (L3). The generation unit 3 also generates a time-normalized feature pattern for each word section candidate (L3) and sends it to the similarity computation unit 40. Depending on the keyword found by the keyword detection unit 20, the range in which the word section candidates (L3) are set may instead lie on both sides of the keyword or after it.
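A minimal sketch of the hypothesis generation described in this paragraph follows. The patent does not say how the start-point candidates S1 to SM are chosen or how time normalization is carried out, so two assumptions are made here and should be read as placeholders: candidates are spread evenly between the rough start S and the keyword start KS, and normalization picks frames at evenly spaced positions. All function and parameter names are hypothetical.

```python
def start_candidates(s, ks, count, min_len=1):
    """Spread up to `count` start candidates evenly over [s, ks - min_len].

    Assumption: even spacing; the patent only says the candidates lie
    before the keyword start KS.
    """
    last = ks - min_len
    if last < s:
        return []
    if count <= 1 or last == s:
        return [s]
    step = (last - s) / (count - 1)
    return sorted({s + round(i * step) for i in range(count)})


def time_normalize(frames, start, end, n_out):
    """Resample frames[start:end] (end exclusive) to exactly n_out frames."""
    length = end - start
    return [frames[start + min(length - 1, (i * length) // n_out)]
            for i in range(n_out)]


def word_hypotheses(frames, s, ks, count, n_out):
    """Return [((start, ks), normalized_pattern), ...], one per candidate."""
    return [((s_m, ks), time_normalize(frames, s_m, ks, n_out))
            for s_m in start_candidates(s, ks, count)]
```

Every hypothesis shares the end point KS and differs only in its start point, so each normalized pattern has the same length and can be compared directly against the word standard patterns.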

[0022] The recognition unit 4 comprises a similarity computation unit 40, a word standard pattern memory 41, and a recognition result output unit 42.

[0023] The similarity computation unit 40 computes the similarity between each feature pattern of the word section candidates (L3) obtained as described above and each standard pattern of the recognition target words registered in advance in the word standard pattern memory 41, and sends the results to the recognition result output unit 42. The similarity computation uses one of the various previously proposed methods, such as the multiple similarity method or the mixed similarity method.

[0024] The word standard pattern memory 41 holds, registered in advance, the standard patterns of the words to be recognized. Each standard pattern is created by acoustically analyzing input speech whose category name is known, extracting a feature pattern after processing such as normalizing the utterance duration of its word section L3, and associating the pattern with the category name of the spoken word. In this embodiment, the feature patterns of the speech corresponding to portions such as "ni" of "nikai" and "san" of "sankai" are registered as word standard patterns, but the feature patterns of entire word sections L3, such as "nikai" and "sankai", may be used as word standard patterns instead. In that case, the end position of the keyword "kai" is taken as the end-point candidate of the word section L3.

[0025] The recognition result output unit 42 compares all the similarities obtained by the similarity computation unit 40 and outputs, as the result, the category name or category number of the top-ranked category or of several top-ranked categories.
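The comparison performed by the recognition result output unit can be illustrated with a short sketch. It assumes the similarities arrive as (category, similarity) pairs covering every word section candidate, and that a category is ranked by its best similarity over all candidates; that per-category maximum is one reading of "compares all the similarities" and is an assumption, as are the names used.

```python
def rank_categories(scores, top_n=1):
    """Return the top_n (category, similarity) pairs, best first.

    scores: iterable of (category, similarity) pairs, one per combination
            of word section candidate and word standard pattern.
    """
    best = {}
    for category, sim in scores:
        # Keep only the highest similarity seen for each category.
        if category not in best or sim > best[category]:
            best[category] = sim
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]
```

With `top_n=1` this corresponds to outputting only the top-ranked category; a larger `top_n` corresponds to outputting several top-ranked categories.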

[0026] The keyword detection that characterizes this device is now described in more detail with reference to FIGS. 4 to 10. FIGS. 4 to 7 are flowcharts of the processing that detects the keyword "kai". FIG. 8 shows the positions of segment 22 when the keyword standard patterns /ka/ and /ai/ are created. FIG. 9 illustrates how the section of segment 22 is cut out successively from the utterance section L1 while being shifted along the time axis, and how the similarities S^(ka)_j and S^(ai)_j between the feature pattern of that section and the keyword standard patterns /ka/ and /ai/ are computed for each frame j. FIG. 10 shows the course of the processing of FIGS. 4 to 7 for the feature pattern of "nikai".

The operation of the keyword detection unit 20 is described with reference to FIGS. 4 to 7. First, as preparation, the parameters are initialized (step S1). That is, as shown in FIG. 5, kacnt = 0, kaatta = 0, kaj = 0, aicnt = 0, aiatta = 0, aij = 0, and kaiatta = 0 are executed. Next, the start frame j of the detection processing is set to S, the start point of the rough utterance section L1 (step S2), and the /ka/ detection processing is performed (step S3). In this processing, as shown in FIGS. 6 and 9, the similarities S^(ka)_j and S^(ai)_j between the input feature pattern at frame j and the keyword standard patterns /ka/ and /ai/ are computed. If S^(ka)_j is greater than the constant S_KA and SD_j (= |S^(ka)_j - S^(ai)_j|) is greater than the constant S_DF (step S301), kacnt is incremented (step S303); otherwise kacnt is cleared to zero (step S302) and the processing returns. When kacnt has been incremented and is greater than the constant KA, the flag kaatta is set to 1 (steps S304, S305), /ka/ is regarded as detected, and the center frame position kaj (= j - kacnt/2) of the /ka/ segment section is computed (step S306).

If, after this processing, kaatta = 1 and kacnt = 0 (detection of the /ka/ section has finished) (step S4), the /ai/ detection processing is performed in the same way as the /ka/ detection processing (step S5, FIG. 7). That is, the similarities S^(ka)_j and S^(ai)_j between the input feature pattern at frame j and the keyword standard patterns /ka/ and /ai/ are computed. If S^(ai)_j is greater than the constant S_AI and SD_j (= |S^(ka)_j - S^(ai)_j|) is greater than the constant S_DF (step S501), aicnt is incremented (step S503); otherwise aicnt is cleared to zero (step S502) and the processing returns. When aicnt has been incremented and is greater than the constant AI, the flag aiatta is set to 1 (steps S504, S505), /ai/ is regarded as detected, and the center frame position aij (= j - aicnt/2) of the /ai/ segment section is computed (step S506).

If the condition kaatta = 1 and kacnt = 0 (detection of the /ka/ section has finished) is not satisfied in step S4, the frame is incremented (step S9) and the processing from step S3 is repeated until j = E (step S10). After step S5, if /ai/ has been detected and aiatta = 1, kaiatta is set to 1 (steps S6, S7). If, further, kaiatta = 1 and aicnt = 0 (detection of the /ai/ section has finished) (step S8), the center frame positions of the detected /ka/ and /ai/ sections of segment 22 are held in kaj and aij, respectively, and the processing ends. Otherwise, detection of "kai" is regarded as having failed, and processing such as rejection is performed before the processing ends (step S11).
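The flow described in this paragraph can be sketched as a small state machine over precomputed per-frame similarities. The sketch below is one reading of the text, not the flowcharts of FIGS. 4 to 7 themselves, which are not reproduced here. Assumptions: the similarity series are indexed by frame, kacnt/2 and aicnt/2 use integer division, the /ai/ test is applied from the frame on which the /ka/ run ends, and the threshold values in the usage note are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Thresholds:
    s_ka: float  # S_KA: S^(ka)_j must exceed this
    s_ai: float  # S_AI: S^(ai)_j must exceed this
    s_df: float  # S_DF: SD_j = |S^(ka)_j - S^(ai)_j| must exceed this
    ka: int      # KA: run length that kacnt must exceed
    ai: int      # AI: run length that aicnt must exceed


def detect_kai(sim_ka: Sequence[float], sim_ai: Sequence[float],
               start: int, end: int,
               th: Thresholds) -> Optional[Tuple[int, int]]:
    """Return (kaj, aij), the center frames of /ka/ and /ai/, or None."""
    kacnt = aicnt = 0              # step S1: initialization
    kaatta = aiatta = False
    kaj = aij = 0
    for j in range(start, end + 1):            # steps S2, S9, S10
        sd = abs(sim_ka[j] - sim_ai[j])        # SD_j
        if not kaatta or kacnt > 0:
            # /ka/ detection, steps S301 to S306
            if sim_ka[j] > th.s_ka and sd > th.s_df:
                kacnt += 1
                if kacnt > th.ka:
                    kaatta = True
                    kaj = j - kacnt // 2
            else:
                kacnt = 0
            if not (kaatta and kacnt == 0):    # step S4 not satisfied
                continue
        # /ai/ detection, steps S501 to S506
        if sim_ai[j] > th.s_ai and sd > th.s_df:
            aicnt += 1
            if aicnt > th.ai:
                aiatta = True
                aij = j - aicnt // 2
        else:
            aicnt = 0
        if aiatta and aicnt == 0:              # steps S6 to S8
            return kaj, aij
    return None                                # step S11: reject
```

For example, with thresholds `Thresholds(0.5, 0.5, 0.2, 2, 2)` and series in which both similarities are high for two frames, /ka/ dominates for four frames, and /ai/ dominates for the next four, the function returns the centers of the two dominant runs; the leading frames are skipped because their SD_j is small. Note that, as in the text, a run that is still in progress at frame E is not accepted.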

[0027] FIG. 10 shows the result of actually carrying out the keyword "kai" detection along the above flow: the positions of the segments 22a and 22b at which /ka/ and /ai/ are detected, together with the per-frame similarities S^(ka)_j and S^(ai)_j and SD_j (= |S^(ka)_j - S^(ai)_j|).

[0028] The effectiveness of the keyword "kai" detection processing that characterizes the present invention is explained with reference to FIG. 10. The values of S^(ka)_j and S^(ai)_j are fairly large even at frame positions outside /kai/. It is therefore easy to see that stable, highly accurate detection of /kai/ is difficult from the values of S^(ka)_j and S^(ai)_j alone. However, the difference SD_j between S^(ka)_j and S^(ai)_j takes values at the positions of /ka/ and /ai/ that are larger and more stable than elsewhere. This is because the /ka/ and /ai/ patterns are trained against each other when the keyword standard patterns are created, so that in the /ka/ section (22a) S^(ka)_j not only becomes large but S^(ai)_j is also made small, which makes SD_j stably large. Likewise, in the /ai/ section (22b) S^(ai)_j not only becomes large but S^(ka)_j is also made small, which again makes SD_j stably large. In sections outside /kai/, for example the /ni/ section (22c), S^(ai)_j is large but S^(ka)_j is also large, so SD_j is small and false detection is avoided. In other words, using the difference SD_j (= |S^(ka)_j - S^(ai)_j|) in addition to S^(ka)_j and S^(ai)_j themselves makes stable, highly accurate detection of /kai/ readily achievable.

[0029] According to the speech recognition device of this embodiment, when the vocabulary set to be recognized contains a common keyword, or a common partial pattern that forms part of the speech, that keyword or partial pattern is detected first and the desired speech section is then detected with reference to its position. This gives far more accurate speech section detection than conventional speech section detection methods. The device can also cope with noise and extraneous words added before or after the desired speech section, so it is robust to noise as well. Because the underlying keyword detection is itself stable and highly accurate, substantial practical benefits follow, including improved recognition performance and reliability of the device.

[0030] The present invention is not limited to the above embodiment and can be modified in various ways without departing from its gist. For example, although only "kai" is used as a keyword in this embodiment, "gai" and the like may be added as keywords where necessary, and adding keywords such as "juu" (ten) allows words for the tenth floor and above to be handled; applications with multiple keywords are thus conceivable. Also, although "kai" (/kai/) is divided here into the two elements /ka/ and /ai/, it may be divided into any number of elements, such as /ka/, /a/, and /ai/.

[0031] Furthermore, even when the vocabulary set to be recognized does not contain a common keyword or a common partial pattern that forms part of the speech, the keyword detection method of the present invention is effective as a means of detecting one or more specific keywords.

[0032]

[Effects of the Invention] As described above, according to the present invention, a plurality of speech section candidates is obtained from the partial pattern found by acoustic analysis of the input speech, and the input speech is recognized by similarity computation. Speech sections can therefore be detected with high accuracy even under noise, and a speech recognition device with improved recognition performance and reliability can be provided.

[Brief description of drawings]

FIG. 1 is a schematic block diagram showing one embodiment of the speech recognition device of the present invention.

FIG. 2 is a diagram showing an example of the word set in this embodiment.

FIG. 3 is a diagram showing, in gray scale, an example of the speech feature parameters in this embodiment.

FIG. 4 is a flowchart showing the detection processing for the keyword "kai".

FIG. 5 is a flowchart showing the detection processing for the keyword "kai".

FIG. 6 is a flowchart showing the detection processing for the keyword "kai".

FIG. 7 is a flowchart showing the detection processing for the keyword "kai".

FIG. 8 is a diagram showing the segment positions used when the keyword standard patterns (/ka/ and /ai/) are created.

FIG. 9 is a diagram showing how the similarities S^(ka)_j and S^(ai)_j are computed for each frame j.

FIG. 10 is a diagram showing an example of the "kai" detection processing.

[Explanation of symbols]

1: acoustic analysis unit
2: partial pattern detection unit
3: word boundary hypothesis generation unit
4: recognition unit

Claims

[Claims]

1. A speech recognition device comprising: a partial pattern detection unit that detects a partial pattern of input speech by computing, on the basis of an acoustic analysis result of the input speech, the similarity between a portion of the input speech and a corresponding standard speech pattern; a word boundary hypothesis generation unit that obtains a plurality of speech section candidates for the input speech with reference to the position of the partial pattern detected by the partial pattern detection unit and generates a feature pattern for each section; and a recognition unit that recognizes the input speech by computing the similarity between each feature pattern generated by the word boundary hypothesis generation unit and a corresponding standard speech pattern.