JPH05210396A

JPH05210396A - Voice recognizing device

Info

Publication number: JPH05210396A
Application number: JP4014399A
Authority: JP
Inventors: Satoshi Tsukada; 塚田聡
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1992-01-30
Filing date: 1992-01-30
Publication date: 1993-08-20
Anticipated expiration: 2015-05-08
Also published as: JP3039095B2

Abstract

PURPOSE: To detect speech stably by applying a similarity correction method to word spotting, so that similarities can be compared on the same scale regardless of differences in speaker or environment. CONSTITUTION: An inter-vector similarity calculation part 2 calculates, for each frame, the inter-vector similarity D between a feature vector V of an input voice I and the standard patterns registered in advance. Next, a similarity correction part 3 corrects the inter-vector similarity D using the maximum value of D in the same frame to obtain the frame-corrected similarity F. A similarity accumulation part 4 accumulates the frame-corrected similarity F to obtain the corrected similarity C. An identification part 5 outputs, as the recognition result A, the standard pattern that gives the maximum corrected similarity C.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to speech recognition devices, and in particular to improvements in the rejection of utterances other than the recognition-target words, the detection of unknown words, and word spotting, which detects the presence of speech belonging to a standard pattern category within part of unknown input speech.

[0002]

[Prior Art] Conventionally, as described on pages 149-177 of "Digital Speech Processing" published by Tokai University Press (hereinafter, reference [1]), speech recognition has been performed by comparing the unknown input speech against each pre-registered standard pattern of the recognition targets to obtain a similarity, and selecting the category of the standard pattern that gives the maximum similarity. The similarity used here may be based on the distance between feature vectors, on the appearance probability of feature vectors, and so on. Using the similarity obtained in this way, an input word that is not registered as a standard pattern of the recognition targets can be judged to be an unknown word. For example, as described in "Development of a Speaker-Independent Word Speech Recognition Device for Automatic Ticket Vending Machines under High Noise", IEICE Technical Report, Vol. 89, No. 91, June 1989, pp. 1-8 (hereinafter, reference [2]), there is a method that judges the input to be an unknown word and rejects it when the obtained similarity is smaller than a predetermined threshold. However, the magnitude of the similarity obtained here varies greatly with the speaker and the speaking environment. Consequently, to achieve accurate unknown-word detection and rejection across different speakers and environments, a different threshold must be set for each speaker and each environment, which requires a great deal of labor. To address this, as described in Japanese Patent Application No. Hei 3-60786 (hereinafter, reference [3]), there is a method that obtains the similarity between the unknown input speech and the pre-registered standard patterns of the recognition targets, also obtains as a reference similarity the similarity when the restriction to the recognition targets is removed and the linguistic constraints are weakened, and corrects the former similarity using the reference similarity. Because the influence of the speaker and environment on the input speech appears equally in the similarity to the standard patterns and in the reference similarity, that influence can be said to cancel out in the similarity corrected by this method. Accordingly, judging the input to be an unknown word when the corrected similarity is smaller than a predetermined threshold achieves highly accurate rejection.

[0003] Regarding word spotting, which detects the presence of speech belonging to a standard pattern category within part of unknown input speech, pages 176-177 of reference [1] and pages 87-89 of reference [6] describe the following method. Partial segments of the unknown input speech are compared against pre-registered standard patterns to obtain a similarity, and when the obtained similarity is larger than a predetermined threshold, speech of that standard pattern's category is judged to be present and is recognized.

[0004]

[Problem to Be Solved by the Invention] In word spotting as well, highly accurate speech detection requires a great deal of labor because the magnitude of the similarity varies with the speaker and the speaking environment. To solve this problem, applying the similarity correction method described above and performing word spotting on the corrected similarity would be expected to enable highly accurate word spotting. However, that correction method cannot be applied as is to a speech recognition method such as word spotting, in which the end points are not fixed. The object of the present invention is to apply to word spotting a method of correcting the similarity so that similarities can be compared on the same scale even when the speaker or speaking environment differs, and thereby to perform stable speech detection.

[0005]

[Means for Solving the Problem] The speech recognition device of the first invention comprises: an analysis unit that converts an input speech signal into a time series of feature vector frames; an inter-vector similarity calculation unit that obtains, for each frame of the input speech signal, the inter-vector similarity between the feature vectors constituting the pre-registered standard patterns and the feature vector of the input speech signal; a similarity correction unit that obtains a frame-corrected similarity by correcting the inter-vector similarity using its maximum value in each frame of the input speech signal; a similarity accumulation unit that accumulates the frame-corrected similarity to obtain a corrected similarity; and an identification unit that identifies the speech on the basis of the corrected similarity.

[0006] The speech recognition device of the second invention comprises: an analysis unit that converts an input speech signal into a time series of feature vector frames; a comparison and matching unit that compares the feature vector time series against pre-registered standard patterns for a given end frame, obtains as the recognition result the standard pattern that gives the maximum similarity, and obtains the start frame corresponding to that maximum similarity; a unit standard pattern storage unit that holds standard patterns of recognition units; a reference similarity calculation unit that obtains, as a reference similarity, the maximum similarity between standard patterns formed by joining the recognition-unit standard patterns in a predetermined order and the feature vector time series from the start frame to the end frame; and a similarity correction unit that obtains a corrected similarity by correcting the similarity obtained by the comparison and matching unit using the reference similarity.

[0007] The speech recognition device of the third invention comprises a reject unit that outputs a reject signal when the corrected similarity obtained by the similarity correction unit is smaller than a predetermined threshold.

[0008] The speech recognition device of the fourth invention comprises a speech detection unit that detects speech when the corrected similarity obtained by the similarity correction unit is larger than a predetermined threshold.

[0009] The speech recognition device of the fifth invention comprises: an analysis unit that converts an input speech signal into a time series of feature vector frames; a standard pattern storage unit that holds the standard patterns of the recognition targets; a unit standard pattern storage unit that holds standard patterns of recognition units; a reference pattern generation unit that joins the recognition-unit standard patterns in a predetermined order to form a reference pattern; a recognition pattern generation unit that joins the reference pattern and the recognition-target standard patterns to form a recognition pattern; a comparison and matching unit that compares the feature vector time series against the recognition pattern for a given end frame and obtains as the recognition result the standard pattern constituting the recognition pattern that gives the maximum similarity; a reference similarity calculation unit that obtains, as a reference similarity, the maximum similarity when the feature vector time series is compared against the reference pattern with that same end frame as its end; a similarity correction unit that obtains a corrected similarity by correcting the maximum similarity obtained by the comparison and matching unit using the reference similarity; and a speech detection unit that detects speech when the corrected similarity obtained by the similarity correction unit is larger than a predetermined threshold.

[0010]

[Operation] For the feature vector time series of the input speech, the present invention obtains the similarity to the pre-registered standard patterns, also obtains the similarity when the word dictionary restriction is removed and the linguistic constraints are weakened, and uses the latter to correct the former. By obtaining these similarities for each frame of the input speech, the correction becomes applicable to word spotting.

[0011] Consider the case where the speech recognition device according to the first invention recognizes speech using pre-registered standard patterns. Here, each standard pattern consists of a sequence of feature vectors.

[0012] First, the analysis unit converts the input speech signal into a time series of feature vector frames. For this analysis, methods such as the mel-cepstrum method or LPC analysis described on pages 32-98 of reference [1] can be used.

[0013] Next, the inter-vector similarity calculation unit calculates, for each frame of the input speech signal, the similarity between the feature vector obtained by the analysis unit and the feature vectors constituting the pre-registered standard patterns. The inter-vector similarity can be obtained by a method based on the distance between vectors, as described on pages 154-161 of reference [1], or by a method using vector appearance probabilities based on a hidden Markov model (hereinafter, HMM). HMMs are described in S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition", The Bell System Technical Journal, Vol. 62, No. 4, April 1983, pp. 1035-1074 (hereinafter, reference [4]).
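As a hedged illustration of a distance-based inter-vector similarity, the sketch below uses the negative Euclidean distance so that a larger value means a closer match. The function name `inter_vector_similarity` and the choice of Euclidean distance are assumptions made for this example; the text only calls for some distance-based or probability-based measure.

```python
import math


def inter_vector_similarity(v, pattern_vectors):
    """Similarity D between one input frame and each standard-pattern vector.

    Assumption: similarity is the negative Euclidean distance, so larger
    values mean "more similar". Any distance- or probability-based measure
    could be substituted here.
    """
    return [-math.dist(v, p) for p in pattern_vectors]


# One input feature vector compared with two hypothetical pattern vectors.
d = inter_vector_similarity([0.0, 0.0], [[3.0, 4.0], [0.0, 1.0]])
print(d)
```

Running this once per input frame gives the per-frame similarities D that the later correction step operates on.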

[0014] The similarity correction unit obtains the maximum value of the inter-vector similarity in each frame and uses it to correct all the inter-vector similarities of that frame, yielding the frame-corrected similarity. The correction can be, for example, a value based on the difference between the inter-vector similarity and its maximum, or a value based on the ratio of the inter-vector similarity to its maximum.
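The two correction options named above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the `mode` parameter, and the restriction of the ratio form to a positive frame maximum are assumptions introduced here.

```python
def frame_corrected_similarity(d_frame, mode="difference"):
    """Correct one frame's inter-vector similarities D by that frame's maximum.

    d_frame: similarities between this input frame and every standard-pattern
             feature vector (larger = more similar).
    mode:    "difference" gives D - max(D); "ratio" gives D / max(D).
    """
    d_max = max(d_frame)
    if mode == "difference":
        return [d - d_max for d in d_frame]
    if mode == "ratio":
        # Assumption: the ratio form is only meaningful for positive similarities.
        if d_max <= 0:
            raise ValueError("ratio correction here assumes a positive frame maximum")
        return [d / d_max for d in d_frame]
    raise ValueError(f"unknown mode: {mode}")


print(frame_corrected_similarity([1.0, 3.0, 2.0]))
print(frame_corrected_similarity([2.0, 4.0, 1.0], mode="ratio"))
```

With the difference form, the best-matching vector in each frame always scores 0 and every other vector scores below 0, which is what puts different speakers and environments on a common scale.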

[0015] The similarity accumulation unit obtains the corrected similarity by accumulating the frame-corrected similarity over the input speech signal. When matching against a word pattern, the frame-corrected similarity can be accumulated by dynamic programming, as described on pages 154-165 of reference [1], or on the basis of an HMM, as described in reference [4].
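A minimal dynamic-programming sketch of the accumulation step is given below. It assumes, purely for illustration, that both end points are fixed and that each input frame either stays on the current pattern vector or advances by one; the path constraints actually used in reference [1] may differ, and the name `accumulate` is hypothetical.

```python
NEG_INF = float("-inf")


def accumulate(f):
    """Accumulate frame-corrected similarities along the best alignment.

    f[t][j]: frame-corrected similarity F between input frame t and
             standard-pattern vector j.
    Returns the corrected similarity C of the best monotonic path from
    (first frame, first vector) to (last frame, last vector).
    """
    n_frames, n_vectors = len(f), len(f[0])
    c = [[NEG_INF] * n_vectors for _ in range(n_frames)]
    c[0][0] = f[0][0]
    for t in range(1, n_frames):
        for j in range(n_vectors):
            best_prev = c[t - 1][j]                           # stay on vector j
            if j > 0:
                best_prev = max(best_prev, c[t - 1][j - 1])   # advance by one
            c[t][j] = best_prev + f[t][j]
    return c[-1][-1]


# Three input frames against a two-vector pattern (hypothetical values).
f = [[0.0, -1.0], [-2.0, 0.0], [-3.0, -1.0]]
print(accumulate(f))
```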

[0016] The identification unit identifies the speech on the basis of the corrected similarity and obtains the recognition result. For example, the standard pattern with the largest corrected similarity can be taken as the recognition result.

[0017] The influence of the speaker and environment on the input speech appears equally in the similarity itself and in the maximum similarity used for the correction, so it cancels out in the corrected similarity obtained in this way.
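The cancellation argument can be checked numerically under a simple assumed model, in which a speaker or environment change shifts every similarity in a frame by the same additive offset (as a gain change would for log-likelihood scores). That model, and all values below, are assumptions for illustration and are not stated in the source.

```python
def frame_corrected(d_frame):
    """Difference-based correction: subtract the frame's maximum similarity."""
    d_max = max(d_frame)
    return [d - d_max for d in d_frame]


# Hypothetical inter-vector similarities D: two frames, three pattern vectors.
clean = [[-5.0, -2.0, -7.0], [-1.0, -4.0, -3.0]]

# Assumed speaker/environment effect: one additive offset per frame.
offsets = [-3.0, 1.5]
shifted = [[d + b for d in row] for row, b in zip(clean, offsets)]

corrected_clean = [frame_corrected(row) for row in clean]
corrected_shifted = [frame_corrected(row) for row in shifted]

# The raw similarities differ, but the frame-corrected similarities coincide.
print(clean == shifted, corrected_clean == corrected_shifted)
```

Effects that do not shift all similarities in a frame equally would cancel only approximately.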

[0018] Reference [3] also describes a method based on the per-frame maximum of the inter-vector similarity. In that method, the similarity to the standard pattern, obtained by accumulating the per-frame inter-vector similarity, and the reference similarity, obtained by accumulating the per-frame maximum of the inter-vector similarity, are calculated separately, and the similarity to the standard pattern is then corrected by the reference similarity. The segment over which the similarity to the standard pattern is obtained and the segment over which the reference similarity is obtained must be the same. In word spotting, however, the end points are not determined in advance, so with the method of reference [3] the reference similarity for the same segment has to be calculated after the similarity to the standard pattern has been obtained. This increases the amount of computation after the similarity to the standard pattern is obtained, which makes the method unsuitable for word spotting.

[0019] In contrast, the speech recognition device according to the first invention accumulates similarities that have already been corrected in each frame. In the resulting corrected similarity, the segment used for the similarity to the standard pattern and the segment used for the correcting similarity therefore always coincide. There is no need to calculate a reference similarity after the per-frame similarities have been accumulated.
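For the difference-based correction, this equivalence can be written out explicitly. The symbols D, F, and C follow the abstract; the alignment index j_t, the segment bounds s and e, and the names S (uncorrected accumulated similarity) and R (reference similarity in the sense of reference [3]) are notation introduced here for illustration.

```latex
\begin{aligned}
C &= \sum_{t=s}^{e} F_t(j_t)
   = \sum_{t=s}^{e} \Bigl( D_t(j_t) - \max_{k} D_t(k) \Bigr) \\
  &= \underbrace{\sum_{t=s}^{e} D_t(j_t)}_{S}
   \;-\; \underbrace{\sum_{t=s}^{e} \max_{k} D_t(k)}_{R}
\end{aligned}
```

Because R depends only on the segment from s to e and not on the alignment j_t, the path that maximizes C over a given segment also maximizes S, and both sums cover exactly the same frames by construction.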

[0020] For these reasons, the similarity correction scheme of the first invention suppresses the differences in similarity magnitude caused by differences in speaker and environment, and is well suited to word spotting.

[0021] In this way, the first invention yields a corrected similarity that is not affected by the speaker or the environment.

[0022] In the speech recognition device according to the second invention, the comparison and matching unit calculates, for a given end frame, the similarity between the time series of feature vector frames obtained by the analysis unit and a plurality of pre-registered standard patterns, obtains as the recognition result the standard pattern that gives the maximum similarity, and obtains the start frame corresponding to that maximum similarity. Methods of matching against a word pattern include, as described on pages 154-165 of reference [1], holding a feature vector time series as the standard pattern, calculating the similarity from the distance between feature vectors, and matching by dynamic programming, and, as described in reference [4], matching based on an HMM. The start frame corresponding to the maximum similarity can be obtained, for example, as described in "Speeding Up DP Matching by Integrating Frame Synchronization, Beam Search, and Vector Quantization", Transactions of the IEICE, Vol. J71-D, No. 9, pp. 1650-1659 (hereinafter, reference [5]): the start frame is passed along each time a matching path is selected, so that the start frame is known as soon as the maximum similarity is determined.
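The idea of carrying the start frame along the matching path can be sketched as below. This is an illustrative simplification, not the algorithm of reference [5]: the stay-or-advance path rule, the zero entry score for a path that starts at any frame, and the function name are all assumptions.

```python
NEG_INF = float("-inf")


def spot_with_start_frames(d):
    """For each end frame t, return (similarity, start_frame) of the best
    alignment of the whole standard pattern that ends at frame t.

    d[t][j]: similarity between input frame t and pattern vector j.
    Assumed path rule: stay on the same pattern vector or advance by one;
    a new path may begin on vector 0 at any frame.
    """
    n_vectors = len(d[0])
    prev = [(NEG_INF, -1)] * n_vectors
    results = []
    for t, row in enumerate(d):
        cur = []
        for j in range(n_vectors):
            if j == 0:
                candidates = [(0.0, t), prev[0]]      # start here, or stay
            else:
                candidates = [prev[j], prev[j - 1]]   # stay, or advance
            score, start = max(candidates, key=lambda c: c[0])
            # The start frame travels with the score along the chosen path.
            cur.append((score + row[j], start))
        results.append(cur[-1])
        prev = cur
    return results


# Three input frames against a two-vector pattern (hypothetical values).
d = [[-3.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
results = spot_with_start_frames(d)
print(results[2])  # best path ending at the last frame, with its start frame
```

Since each cell stores its own start frame, no backtracking pass is needed once the best score at an end frame is known.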

[0023] The reference similarity calculation unit compares the feature vector time series of the input speech, in the range from the start frame obtained by the comparison and matching unit to the end frame, against a plurality of standard patterns formed by joining, in a predetermined order, the recognition-unit standard patterns stored in the unit standard pattern storage unit, and takes the maximum of the resulting similarities as the reference similarity. To match against patterns formed by joining recognition-unit patterns in a predetermined order, the order can be described in advance in the form of a network. The corrected similarity can then be accumulated by continuous speech recognition using frame-synchronous DP matching, as described in reference [5], or by an HMM-based continuous speech recognition algorithm, as described on pages 40-50 of "Speech Recognition by Probabilistic Models" published by the IEICE (hereinafter, reference [6]). Syllables, phonemes, words, and the like, as described on page 5 of reference [1], can be used as the recognition-unit standard patterns.

[0024] Next, the similarity correction unit corrects the similarity to the standard pattern obtained by the comparison and matching unit using the reference similarity, yielding the corrected similarity. The correction can be, for example, a value based on the difference between the similarity to the standard pattern and the reference similarity, or a value based on their ratio.
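The reference similarity and the subsequent correction can be sketched as follows. To keep the example short it assumes that every recognition unit is a single feature vector and that units may be joined in any order; under that assumption the best unit sequence simply picks the best unit in each frame, so the reference similarity is the sum of per-frame maxima. A real unit network, as in references [5] and [6], would need a DP or HMM search instead. All names and values are hypothetical.

```python
def reference_similarity(unit_sims, start, end):
    """Reference similarity over input frames start..end (inclusive).

    unit_sims[t][u]: similarity between input frame t and recognition unit u.
    Assumption: single-vector units joined in any order, so the maximum over
    all unit sequences equals the sum of the per-frame maxima.
    """
    return sum(max(unit_sims[t]) for t in range(start, end + 1))


def correct(pattern_sim, ref_sim, mode="difference"):
    """Correct the similarity to the standard pattern by the reference similarity."""
    if mode == "difference":
        return pattern_sim - ref_sim
    if mode == "ratio":
        # Assumption: ref_sim is non-zero when the ratio form is used.
        return pattern_sim / ref_sim
    raise ValueError(f"unknown mode: {mode}")


unit_sims = [[1.0, 2.0], [3.0, 0.0], [0.5, 4.0]]
ref = reference_similarity(unit_sims, start=1, end=2)
print(ref, correct(5.0, ref), correct(3.5, ref, mode="ratio"))
```

The `start` and `end` arguments are the frames delivered by the comparison and matching unit, which is why the reference similarity can only be computed after that unit has finished.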

[0025] The influence of the speaker and environment on the input speech appears equally in the similarity to the standard pattern and in the reference similarity used for the correction, so it cancels out in the corrected similarity obtained in this way.

[0026] Moreover, even when the end points are not fixed in advance, the method first calculates the similarity to the standard pattern, then determines the end points, calculates the reference similarity, and performs the correction, so it can be applied to word spotting.

[0027] Thus the speech recognition device according to the second invention can also suppress the differences in similarity magnitude caused by differences in speaker and environment.

[0028] In the third invention, the reject unit generates a reject signal when the corrected similarity is smaller than a predetermined threshold. Because differences in similarity magnitude due to speaker and environment have been corrected in the corrected similarity, rejection can use a fixed threshold.
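A minimal sketch of identification followed by rejection is shown below. The function name, the dictionary input, the use of `None` as the reject signal, and the threshold value are assumptions for illustration.

```python
def recognize_or_reject(corrected, threshold):
    """Pick the category with the largest corrected similarity C, or reject.

    corrected: dict mapping category name -> corrected similarity C.
    Returns the best category, or None (standing in for the reject signal)
    when the best corrected similarity is below the threshold.
    """
    best = max(corrected, key=corrected.get)
    if corrected[best] < threshold:
        return None
    return best


THRESHOLD = -2.0  # one fixed threshold, shared by all speakers and environments
print(recognize_or_reject({"yes": -1.0, "no": -4.0}, THRESHOLD))
print(recognize_or_reject({"yes": -5.0, "no": -4.0}, THRESHOLD))
```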

[0029] In reference [3], the correction by the reference similarity had to be performed after the similarity to the standard pattern was obtained. In the third invention, the corrected similarity is already obtained in each frame by the first invention, so the corrected similarity results from the same computation as the similarity calculation against the standard pattern in reference [3]. The advantage is that no later correction is needed.

[0030] In the fourth invention, the speech detection unit detects speech when the corrected similarity is larger than a predetermined threshold. Because differences in similarity magnitude due to speaker and environment have been corrected in the corrected similarity, speech can be detected using a fixed threshold.

[0031] In the fifth invention, the reference pattern generation unit joins the recognition-unit standard patterns stored in the unit standard pattern storage unit in a predetermined order to generate the reference pattern. Syllables, phonemes, words, and the like, as described on page 5 of reference [1], can be used as the recognition-unit standard patterns.

[0032] The recognition pattern generation unit generates the recognition pattern by joining the reference pattern and the standard patterns.

[0033] The comparison and matching unit compares the feature vector time series against the recognition pattern for a given end frame and obtains as the recognition result the standard pattern constituting the recognition pattern that gives the maximum similarity. Methods of matching against a word pattern include, as described on pages 154-165 of reference [1], holding a feature vector time series as the standard pattern, calculating the similarity from the distance between feature vectors, and matching by dynamic programming, and, as described in reference [4], matching based on an HMM. For matching against the reference pattern, continuous speech recognition using frame-synchronous DP matching, as described in reference [5], or an HMM-based continuous speech recognition algorithm, as described on pages 40-50 of reference [6], can be used.

[0034] The reference similarity calculation unit compares the feature vector time series against the reference pattern for a given end frame, calculates the similarity, and obtains the maximum similarity as the reference similarity. For matching against the reference pattern, continuous speech recognition using frame-synchronous DP matching, as described in reference [5], or an HMM-based continuous speech recognition algorithm, as described on pages 40-50 of reference [6], can be used.

[0035] Next, the similarity correction unit obtains the corrected similarity by correcting the maximum similarity obtained by the comparison and matching unit using the reference similarity. The correction can be, for example, a value based on the difference between the similarity to the standard pattern and the reference similarity, or a value based on their ratio.

[0036] Next, the speech detection unit detects speech when the corrected similarity obtained by the similarity correction unit is larger than a predetermined threshold.

[0037] The influence of the speaker and environment on the input speech appears equally in the similarity to the standard pattern and in the reference similarity used for the correction, so it cancels out in the corrected similarity obtained in this way.

【００３８】第２の発明では、特徴ベクトル時系列とあ
らかじめ登録しておいた複数の標準パターンとの類似度
を終端フレームを定めて計算し、最大の類似度に対応す
る始端フレームを求めた後で、特徴ベクトル時系列から
始端フレームと終端フレームの間の部分を取り出して、
参照類似度の計算を行なっている。終端フレームごとに
参照類似度をやり直すことになるため、計算量が多くな
るという問題があった。第５の発明においては、参照パ
ターンと標準パターンを結合した認識パターンとの比較
照合による類似度を、参照パターンとの比較照合による
参照類似度を用いて補正している。入力音声と認識パタ
ーンとの比較照合の結果は、認識対象とする標準パター
ンに対応する部分と、参照パターンに対応する部分とに
分かれており、始端を固定して比較照合しても、入力音
声の部分区間と標準パターンとを対応づけるワードスポ
ッティングが行われていることになる。参照パターンに
対応する部分については、参照用類似度の計算において
も、ほぼ同じ参照パターンが対応づけられ、類似度につ
いてもほぼ同じ値が求められると考えられる。このた
め、認識パターンとの類似度を参照パターンとの類似度
で補正することにより、標準パターンに対応する区間に
ついて、標準パターンとの類似度を参照パターンとの類
似度を用いて補正を行なった補正類似度が得られている
ことになる。その結果、始端を固定した比較照合で、類
似度の補正を適応したワードスポッティングが可能とな
り、計算量が少なく、高精度なワードスポッティングを
行なうことができる。In the second invention, the similarity between the feature vector time series and the plurality of standard patterns registered in advance is calculated with the end frame fixed, the start frame corresponding to the maximum similarity is obtained, and the portion between the start frame and the end frame is then extracted from the feature vector time series to calculate the reference similarity. Because the reference similarity must be recalculated for every end frame, the amount of calculation becomes large. In the fifth invention, the similarity obtained by matching against a recognition pattern, which combines the reference pattern and a standard pattern, is corrected using the reference similarity obtained by matching against the reference pattern. The result of matching the input voice against the recognition pattern is divided into a part corresponding to the standard pattern to be recognized and a part corresponding to the reference pattern, so even when matching is performed with the start point fixed, word spotting that associates a partial section of the input voice with the standard pattern is in effect carried out. For the part corresponding to the reference pattern, almost the same reference pattern is expected to be aligned in the calculation of the reference similarity as well, giving almost the same similarity value. Therefore, correcting the similarity with the recognition pattern by the similarity with the reference pattern yields, for the section corresponding to the standard pattern, a corrected similarity in which the similarity with the standard pattern has been corrected using the similarity with the reference pattern. As a result, word spotting with similarity correction becomes possible using matching with a fixed start point, so that accurate word spotting can be performed with a small amount of calculation.
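
The cancellation argument of the fifth invention can be written out as a short worked equation. This is a sketch under assumptions not stated in the source: accumulated similarities are taken to be additive over frames, the recognition pattern is taken to be the reference pattern followed by the standard pattern, and the symbols q_t (per-frame similarity to the reference pattern), w_t (per-frame similarity to the standard pattern) and T (boundary frame) are introduced only for illustration.

```latex
\begin{align*}
S(E) &= \sum_{t=1}^{T-1} q_t + \sum_{t=T}^{E} w_t
  && \text{(match against the recognition pattern, end frame } E\text{)} \\
R(E) &\approx \sum_{t=1}^{T-1} q_t + \sum_{t=T}^{E} q_t
  && \text{(match against the reference pattern alone)} \\
C(E) &= S(E) - R(E) \approx \sum_{t=T}^{E} \left( w_t - q_t \right)
  && \text{(difference-based corrected similarity)}
\end{align*}
```

Under these assumptions the reference part before frame T drops out, so the corrected similarity depends only on the section aligned to the standard pattern.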

【００３９】[0039]

【実施例】本発明の実施例について図面を参照して説明
する。Embodiments of the present invention will be described with reference to the drawings.

【００４０】図１は第１，３及び４の発明の一実施例の
ブロック図である。FIG. 1 is a block diagram of an embodiment of the first, third and fourth inventions.

【００４１】この音声認識装置は、分析部１、ベクトル
間類似度計算部２、類似度補正部３、類似度累積部４、
識別部５、リジェクト部６、音声検出部７、を備えてい
る。This voice recognition device comprises an analysis unit 1, an inter-vector similarity calculation unit 2, a similarity correction unit 3, a similarity accumulation unit 4, an identification unit 5, a reject unit 6, and a voice detection unit 7.

【００４２】分析部１は、入力された音声信号Ｉの特徴
分析を行ない、特徴ベクトル時系列Ｖに変換するもので
ある。The analysis unit 1 performs feature analysis on the input voice signal I and converts it into a feature vector time series V.

【００４３】ベクトル間類似度計算部２は、特徴ベクト
ル時系列Ｖの１フレームの特徴ベクトルとあらかじめ登
録しておいた標準パターンを構成する特徴ベクトルとの
ベクトル間類似度Ｄが求め出力するものである。The inter-vector similarity calculation unit 2 obtains and outputs the inter-vector similarity D between the feature vector of one frame of the feature vector time series V and the feature vectors constituting the standard patterns registered in advance.

【００４４】類似度補正部３は、ベクトル間類似度Ｄが
入力され、１フレームについてのベクトル間類似度の最
大値を求めて、各フレームについてベクトル間類似度を
求めた最大値で補正しフレーム補正類似度Ｆとして出力
するものである。The similarity correction unit 3 receives the inter-vector similarity D, finds the maximum value of the inter-vector similarity within one frame, corrects the inter-vector similarity of each frame with that maximum value, and outputs the result as the frame-corrected similarity F.

【００４５】類似度累積部４は、フレーム補正類似度Ｆ
を累積した補正類似度Ｃとして出力するものである。The similarity accumulation unit 4 accumulates the frame-corrected similarity F and outputs the result as the corrected similarity C.

【００４６】識別部５は、補正類似度Ｃを用いて音声を
識別し、その結果を認識結果Ａとして出力するものであ
る。The identification section 5 identifies the voice using the corrected similarity C and outputs the result as the recognition result A.

【００４７】リジェクト部６は、補正類似度Ｃがあらか
じめ定めておいた閾値より小さかった時に、リジェクト
信号Ｊを出力するものである。The reject unit 6 outputs the reject signal J when the corrected similarity C is smaller than a predetermined threshold value.

【００４８】音声検出部７は、補正類似度Ｃがあらかじ
め定めておいた閾値より大きかった時に、認識結果Ａを
検出したとするものである。The voice detection unit 7 detects the recognition result A when the corrected similarity C is larger than a predetermined threshold value.

【００４９】次に、図１の実施例の動作について説明す
る。Next, the operation of the embodiment shown in FIG. 1 will be described.

【００５０】入力された音声信号Ｉは分析部１に入力さ
れ、特徴分析によって特徴ベクトル時系列Ｖに変換され
る。ここでの分析は、例えば、文献［１］の１５５ペー
ジで示されているようなメルケプストラムによる方法を
用いることができる。The input voice signal I is input to the analysis unit 1 and converted into a feature vector time series V by feature analysis. For the analysis here, for example, the method by the mel cepstrum as shown on page 155 of the document [1] can be used.

【００５１】特徴ベクトル時系列Ｖは、ベクトル間類似
度計算部２へ入力され、あらかじめ登録しておいた複数
の標準パターンを構成する特徴ベクトルとの類似度が計
算され、ベクトル間類似度Ｄとなる。ベクトル間類似度
を求める方法としては、例えば、文献［１］の１５４−
１６１ページに述べられているようなベクトル間距離に
基づく方法を用いることができる。The feature vector time series V is input to the inter-vector similarity calculation unit 2, where its similarity to the feature vectors constituting the plurality of standard patterns registered in advance is calculated to give the inter-vector similarity D. As a method of obtaining the inter-vector similarity, for example, a method based on inter-vector distance as described on pages 154-161 of document [1] can be used.
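
A distance-based inter-vector similarity can be sketched as follows. The source only says the similarity is based on inter-vector distance; using the negated Euclidean distance is an assumption made here for illustration, and both function names are hypothetical.

```python
import math


def inter_vector_similarity(frame_vector, reference_vector):
    """Assumed similarity: negated Euclidean distance (larger means more similar)."""
    return -math.dist(frame_vector, reference_vector)


def similarity_table(frames, reference_vectors):
    """D[t][k]: similarity of input frame t to reference feature vector k."""
    return [
        [inter_vector_similarity(v, r) for r in reference_vectors]
        for v in frames
    ]
```

Each row of the resulting table holds the similarities of one input frame to every registered feature vector, which is the form the per-frame correction step needs.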

【００５２】ベクトル間類似度Ｄは、類似度補正部３へ
入力され、１フレームについてのベクトル間類似度の最
大値を求めて、各フレームについてベクトル間類似度を
求めた最大値を用いて補正され、フレーム補正類似度Ｆ
として出力される。補正の方法としては、ベクトル間類
似度とベクトル間類似度の最大値の差に基づいた値を求
める方法を用いることができる。The inter-vector similarity D is input to the similarity correction unit 3, which finds the maximum inter-vector similarity within each frame, corrects the inter-vector similarities of that frame using this maximum, and outputs the result as the frame-corrected similarity F. As a correction method, a value based on the difference between each inter-vector similarity and the maximum inter-vector similarity can be used.
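
The difference-based frame correction can be sketched in a few lines. The function name is hypothetical, and the input is assumed to be the list of inter-vector similarities of a single frame against all registered feature vectors.

```python
def frame_corrected_similarity(frame_similarities):
    """Subtract the frame's maximum inter-vector similarity from every entry."""
    best = max(frame_similarities)
    return [d - best for d in frame_similarities]


# Assumed example: an offset shared by all similarities of a frame (for
# instance from a noisy environment) does not change the corrected values.
quiet = frame_corrected_similarity([-5.0, -2.0, -3.0])
noisy = frame_corrected_similarity([-15.0, -12.0, -13.0])
```

With this correction the best-matching vector in each frame always scores 0 and all others score negative, so frames recorded under different conditions are placed on a common scale.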

【００５３】フレーム補正類似度Ｆは、類似度累積部４
に入力され、フレーム補正類似度を累積した補正類似度
Ｃが出力される。フレーム補正類似度の累積方法として
は、例えば、文献［１］の１５４−１６５ページに示さ
れているように、動的計画法に基づいて補正類似度を累
積する方法を用いることができる。The frame-corrected similarity F is input to the similarity accumulation unit 4, which outputs the corrected similarity C obtained by accumulating the frame-corrected similarities. As a method of accumulation, for example, the frame-corrected similarities can be accumulated by dynamic programming, as shown on pages 154-165 of document [1].
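
One way to accumulate frame-corrected similarities by dynamic programming is sketched below. The path constraints are an assumption, since the source defers to document [1]: each input frame is aligned to exactly one frame of the standard pattern, and from one input frame to the next the alignment either stays on the same pattern frame or advances by one. The function name is hypothetical.

```python
def accumulate_similarity(F):
    """Return the best accumulated similarity over all allowed alignments.

    F[t][j] is the frame-corrected similarity between input frame t and
    frame j of one standard pattern. The path starts at (0, 0) and must
    end at the last input frame and the last pattern frame.
    """
    num_input, num_pattern = len(F), len(F[0])
    neg_inf = float("-inf")
    acc = [[neg_inf] * num_pattern for _ in range(num_input)]
    acc[0][0] = F[0][0]
    for t in range(1, num_input):
        for j in range(num_pattern):
            # Predecessors: same pattern frame, or the previous pattern frame.
            prev = acc[t - 1][j]
            if j > 0:
                prev = max(prev, acc[t - 1][j - 1])
            if prev > neg_inf:
                acc[t][j] = prev + F[t][j]
    return acc[num_input - 1][num_pattern - 1]
```

With these constraints the function returns negative infinity when the input has fewer frames than the pattern, because no valid alignment exists. Other slope constraints are equally possible.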

【００５４】補正類似度Ｃは識別部５に入力され、補正
類似度を用いて音声が識別され、その結果が認識結果Ａ
として出力される。識別の方法としては、例えば、補正
類似度が最大となった標準パターンを選ぶ方法を用いる
ことができる。The corrected similarity C is input to the identification unit 5, which identifies the voice using the corrected similarity and outputs the result as the recognition result A. As a method of identification, for example, the standard pattern with the maximum corrected similarity can be selected.

【００５５】このようにして、第１の発明によって、話
者や環境に影響されない補正された類似度を得ることが
できる。As described above, according to the first aspect of the present invention, it is possible to obtain the corrected similarity not influenced by the speaker or the environment.

【００５６】第３の発明によれば、リジェクト部６によ
って、補正類似度Ｃがあらかじめ定めておいた閾値より
小さい場合には、リジェクト信号Ｊが出力される。According to the third invention, the reject unit 6 outputs the reject signal J when the corrected similarity C is smaller than a predetermined threshold value.

【００５７】第４の発明によれば、音声検出部７によっ
て、補正類似度Ｃがあらかじめ定めておいた閾値より大
きい場合には、音声として検出し認識結果Ａが出力され
る。According to the fourth invention, when the corrected similarity C is larger than a predetermined threshold value, the voice detection unit 7 detects the input as voice and outputs the recognition result A.
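
The identification, reject, and detection decisions of the first embodiment can be sketched as follows. The function names, the dictionary layout, and the threshold values are all hypothetical; the source only states that the maximum corrected similarity is selected and compared against predetermined thresholds.

```python
def identify(corrected_similarities):
    """Return the standard pattern label with the maximum corrected similarity."""
    return max(corrected_similarities, key=corrected_similarities.get)


def reject_signal(corrected_similarity, reject_threshold):
    """True when the corrected similarity is smaller than the reject threshold."""
    return corrected_similarity < reject_threshold


def voice_detected(corrected_similarity, detect_threshold):
    """True when the corrected similarity is larger than the detection threshold."""
    return corrected_similarity > detect_threshold


# Assumed example with two registered words and invented scores.
scores = {"yes": -1.0, "no": -4.0}
result = identify(scores)
rejected = reject_signal(scores[result], -2.0)
detected = voice_detected(scores[result], -2.0)
```

Because the corrected similarity is already normalized per frame, a single fixed threshold is meant to work across speakers and environments, which is the point of the correction.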

【００５８】図２は、第２及び４の発明の一実施例のブ
ロック図である。FIG. 2 is a block diagram of an embodiment of the second and fourth inventions.

【００５９】この音声認識装置は、分析部１１、比較照
合部１２、単位標準パターン記憶部１３、参照類似度計
算部１４、類似度補正部１５、音声検出部１７、を備え
ている。This voice recognition device comprises an analysis unit 11, a comparison and collation unit 12, a unit standard pattern storage unit 13, a reference similarity calculation unit 14, a similarity correction unit 15, and a voice detection unit 17.

【００６０】分析部１１では、入力された音声信号Ｉの
特徴分析を行ない、特徴ベクトル時系列Ｖに変換するも
のである。The analysis unit 11 performs feature analysis on the input voice signal I and converts it into a feature vector time series V.

【００６１】比較照合部１２は、特徴ベクトル時系列Ｖ
とあらかじめ登録された標準パターンとを終端フレーム
Ｅを定めて比較照合し、最大の類似度Ｓを与える標準パ
ターンを認識結果Ａとして求めるとともに、最大の類似
度に対応する始端フレームＴを求めるものである。The comparison and collation unit 12 compares and collates the feature vector time series V with the standard patterns registered in advance, with the end frame E fixed, obtains the standard pattern giving the maximum similarity S as the recognition result A, and also obtains the start frame T corresponding to the maximum similarity.

【００６２】単位標準パターン記憶部１３は、単位標準
パターンを記憶しておくものである。The unit standard pattern storage unit 13 stores the unit standard pattern.

【００６３】参照類似度計算部１４は、単位標準パター
ン記憶部１３に登録しておいた単位標準パターンをあら
かじめ決めておいた順序で結合した標準パターンと、特
徴ベクトル時系列Ｖから始端フレームＴから終端フレー
ムＥまでの区間を取り出した特徴ベクトル時系列とを比
較照合し、各々の標準パターンについて類似度を計算
し、最大の類似度を参照類似度Ｒとして求めるものであ
る。The reference similarity calculation unit 14 compares and collates standard patterns, formed by combining the unit standard patterns registered in the unit standard pattern storage unit 13 in a predetermined order, with the feature vector time series obtained by extracting the section from the start frame T to the end frame E of the feature vector time series V, calculates the similarity for each of these standard patterns, and obtains the maximum similarity as the reference similarity R.

【００６４】類似度補正部１５は、類似度Ｓを参照類似
度Ｒを用いて補正し、補正類似度Ｃを求めるものであ
る。The similarity correction unit 15 corrects the similarity S using the reference similarity R to obtain the corrected similarity C.

【００６５】音声検出部１７は、補正類似度Ｃがあらかじ
め定めておいた閾値より大きかった時に、認識結果Ａを
検出したとするものである。The voice detection unit 17 regards the recognition result A as detected when the corrected similarity C is larger than a predetermined threshold value.

【００６６】次に、図２の実施例の動作について説明す
る。Next, the operation of the embodiment shown in FIG. 2 will be described.

【００６７】入力された音声信号Ｉは分析部１１に入力
され、特徴分析によって特徴ベクトル時系列Ｖに変換さ
れる。The input voice signal I is input to the analysis unit 11 and converted into a feature vector time series V by feature analysis.

【００６８】特徴ベクトル時系列Ｖは、比較照合部１２
へ入力され、あらかじめ登録しておいた複数の標準パタ
ーンと終端フレームＥを定めて比較照合した類似度が計
算され、最大の類似度Ｓと最大の類似度を与える標準パ
ターンが認識結果Ａとして求められるとともに、最大の
類似度に対応する始端フレームＴが求められる。ここで
類似度の計算方法としては文献［１］、文献［５］に示
されているようなＤＰマッチングに基づく方法や文献
［４］、文献［６］に示されているようなＨＭＭに基づ
く方法を用いることができる。The feature vector time series V is input to the comparison and collation unit 12, where its similarity to the plurality of standard patterns registered in advance is calculated with the end frame E fixed; the maximum similarity S is obtained, the standard pattern giving it is obtained as the recognition result A, and the start frame T corresponding to the maximum similarity is obtained. As the similarity calculation method, a method based on DP matching as shown in documents [1] and [5], or a method based on HMM as shown in documents [4] and [6], can be used.

【００６９】また、特徴ベクトル時系列Ｖは、参照類似
度計算部１４に入力され特徴ベクトル時系列から始端フ
レームＴと終端フレームＥとの間を取り出した部分と認
識単位標準パターン記憶部１３に記憶されている単位標
準パターンをある定められた順序で結合した複数の標準
パターンとが比較照合され、参照類似度Ｒが出力され
る。ここでの認識単位標準パターンとしては、文献
［１］の５ページに示されているような音節を用いるこ
とができる。The feature vector time series V is also input to the reference similarity calculation unit 14, where the portion extracted between the start frame T and the end frame E is compared and collated with a plurality of standard patterns formed by combining, in a predetermined order, the unit standard patterns stored in the unit standard pattern storage unit 13, and the reference similarity R is output. As the recognition-unit standard patterns, syllables as shown on page 5 of document [1] can be used.

【００７０】次に、類似度Ｓと参照類似度Ｒは類似度補
正部１５に入力され、類似度Ｓを参照類似度Ｒを用いて
補正された補正類似度Ｃが出力される。ここでの補正方
法としては、ＳとＲの差に基づいた値を求める方法を用
いることができる。Next, the similarity S and the reference similarity R are input to the similarity correction unit 15, and the corrected similarity C obtained by correcting the similarity S using the reference similarity R is output. As the correction method here, a method of obtaining a value based on the difference between S and R can be used.
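
The data flow of the second embodiment for a single end frame can be sketched as follows. The matchers are passed in as functions because the source leaves the matching method open (DP matching or HMM). Every name here is hypothetical, and the toy matchers return invented values only so that the sketch runs.

```python
def corrected_similarity_at_end_frame(V, end, match_standard, match_reference):
    """Return (recognition result, start frame, corrected similarity C = S - R).

    match_standard(V, end) -> (word, S, start) plays the role of the
    comparison and collation unit; match_reference(segment) -> R plays the
    role of the reference similarity calculation unit.
    """
    word, S, start = match_standard(V, end)
    segment = V[start:end + 1]          # frames from start frame T to end frame E
    R = match_reference(segment)
    return word, start, S - R


# Toy stand-ins for the two matchers (invented values, not real matching).
def toy_match_standard(V, end):
    return "yes", -3.0, 1


def toy_match_reference(segment):
    return -1.5 * len(segment)


V = [[0.0], [0.1], [0.2], [0.3]]
outcome = corrected_similarity_at_end_frame(V, 3, toy_match_standard, toy_match_reference)
```

Note that the reference match has to be repeated for every end frame, which is the computational cost that the fifth invention is described as avoiding.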

【００７１】このようにして、第２の発明によって、話
者や環境に影響されない補正された類似度を得ることが
できる。In this way, according to the second invention, it is possible to obtain the corrected similarity not influenced by the speaker or the environment.

【００７２】第４の発明によれば、音声検出部１７によ
って、補正類似度Ｃがあらかじめ定めておいた閾値より
大きい場合には、音声として検出し認識結果Ａが出力さ
れる。According to the fourth invention, when the corrected similarity C is larger than a predetermined threshold value, the voice detection unit 17 detects the input as voice and outputs the recognition result A.

【００７３】図３は、第５の発明の一実施例のブロック
図である。FIG. 3 is a block diagram of an embodiment of the fifth invention.

【００７４】この音声認識装置は、分析部２１、比較照
合部２２、単位標準パターン記憶部２３、参照類似度計
算部２４、類似度補正部２５、音声検出部２７、標準パ
ターン記憶部２８、認識パターン生成部２９、参照パタ
ーン生成部３０、を備えている。This voice recognition device includes an analysis unit 21, a comparison / collation unit 22, a unit standard pattern storage unit 23, a reference similarity calculation unit 24, a similarity correction unit 25, a voice detection unit 27, a standard pattern storage unit 28, and recognition. A pattern generation unit 29 and a reference pattern generation unit 30 are provided.

【００７５】分析部２１では、入力された音声信号Ｉの
特徴分析を行ない、特徴ベクトル時系列Ｖに変換するも
のである。The analysis unit 21 performs feature analysis on the input voice signal I and converts it into a feature vector time series V.

【００７６】標準パターン記憶部２８は、標準パターン
を記憶しておくものである。The standard pattern storage unit 28 stores standard patterns.

【００７７】単位標準パターン記憶部２３は、単位標準
パターンを記憶しておくものである。The unit standard pattern storage unit 23 stores the unit standard pattern.

【００７８】参照パターン生成部３０は、単位標準パタ
ーン部に記憶されている単位標準パターンをある定めら
れた順序で結合した参照パターンＱを生成するものであ
る。認識パターン生成部２９は、参照パターンＱと標準
パターン記憶部２８に記憶された標準パターンとを結合
し、認識パターンＰを生成するものである。The reference pattern generation unit 30 generates a reference pattern Q by combining the unit standard patterns stored in the unit standard pattern storage unit in a predetermined order. The recognition pattern generation unit 29 combines the reference pattern Q and the standard pattern stored in the standard pattern storage unit 28 to generate a recognition pattern P.

【００７９】比較照合部２２は、特徴ベクトル時系列Ｖ
と認識パターンＰとを比較照合し、最大の類似度Ｓを与
える認識パターンを構成する標準パターンを認識結果Ａ
として求めるものである。The comparison and collation unit 22 compares and collates the feature vector time series V with the recognition pattern P, and obtains, as the recognition result A, the standard pattern that constitutes the recognition pattern giving the maximum similarity S.

【００８０】参照類似度計算部２４は、参照パターンＱ
と特徴ベクトル時系列Ｖとを比較照合し、各々の参照パ
ターンについて類似度を計算し、最大の類似度を参照類
似度Ｒとして求めるものである。The reference similarity calculation unit 24 compares and collates the reference pattern Q with the feature vector time series V, calculates the similarity for each reference pattern, and obtains the maximum similarity as the reference similarity R.

【００８１】類似度補正部２５は、類似度Ｓを参照類似
度Ｒを用いて補正し、補正類似度Ｃを求めるものであ
る。The similarity correction unit 25 corrects the similarity S using the reference similarity R to obtain the corrected similarity C.

【００８２】音声検出部２７は、補正類似度Ｃがあらか
じめ定めておいた閾値より大きかった時に、認識結果Ａ
を検出したとするものである。The voice detection unit 27 regards the recognition result A as detected when the corrected similarity C is larger than a predetermined threshold value.

【００８３】次に、図３の実施例の動作について説明す
る。Next, the operation of the embodiment shown in FIG. 3 will be described.

【００８４】入力された音声信号Ｉは分析部２１に入力
され、特徴分析によって特徴ベクトル時系列Ｖに変換さ
れる。The input voice signal I is input to the analysis unit 21 and converted into a feature vector time series V by feature analysis.

【００８５】特徴ベクトル時系列Ｖは、比較照合部２２
へ入力され、認識パターンＰと比較照合した類似度が計
算され、最大の類似度Ｓと最大の類似度を与える認識パ
ターンを構成する標準パターンが認識結果Ａとして求め
られる。ここで類似度の計算方法としては文献［１］、文
献［５］に示されているようなＤＰマッチングに基づく
方法や文献［４］、文献［６］に示されているようなＨ
ＭＭに基づく方法を用いることができる。The feature vector time series V is input to the comparison and collation unit 22, where its similarity to the recognition pattern P is calculated; the maximum similarity S is obtained, and the standard pattern constituting the recognition pattern that gives it is obtained as the recognition result A. As the similarity calculation method, a method based on DP matching as shown in documents [1] and [5], or a method based on HMM as shown in documents [4] and [6], can be used.

【００８６】また、特徴ベクトル時系列Ｖは、参照類似
度計算部２４に入力され複数の参照パターンＱとが比較
照合され、参照類似度Ｒが出力される。Further, the feature vector time series V is input to the reference similarity calculation section 24 and compared and collated with a plurality of reference patterns Q, and the reference similarity R is output.

【００８７】次に、類似度Ｓと参照類似度Ｒは類似度補
正部２５に入力され、類似度Ｓを参照類似度Ｒを用いて
補正された補正類似度Ｃが出力される。ここでの補正方
法としては、例えば、ＳとＲの差に基づいた値を求める
方法を用いることができる。次に、音声検出部２７によ
って、補正類似度Ｃがあらかじめ定めておいた閾値より
大きい場合には、音声として検出し認識結果Ａが出力さ
れる。Next, the similarity S and the reference similarity R are input to the similarity correction unit 25, which outputs the corrected similarity C obtained by correcting the similarity S using the reference similarity R. As the correction method, for example, a value based on the difference between S and R can be used. Next, when the corrected similarity C is larger than a predetermined threshold value, the voice detection unit 27 detects the input as voice and outputs the recognition result A.
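
The fifth embodiment can be illustrated with a deliberately simplified toy. It assumes that per-frame similarities to one standard pattern (`word_scores`) and to the reference pattern (`ref_scores`) are already available and simply add up, and that the recognition pattern is the reference pattern followed by the standard pattern. Real DP or HMM time alignment is omitted, and all names and numbers are hypothetical.

```python
def spot_per_end_frame(word_scores, ref_scores):
    """For every end frame E, return (corrected similarity C, word start frame T).

    S(E) is the best score of "reference pattern up to T-1, then the word
    from T to E" with the start of the utterance fixed at frame 0.
    R(E) is the score of the reference pattern alone over frames 0..E.
    """
    results = []
    for end in range(len(ref_scores)):
        R = sum(ref_scores[:end + 1])
        best_S, best_T = None, None
        for start in range(end + 1):
            S = sum(ref_scores[:start]) + sum(word_scores[start:end + 1])
            if best_S is None or S > best_S:
                best_S, best_T = S, start
        results.append((best_S - R, best_T))
    return results


def detect(results, threshold):
    """Return (end frame, start frame) pairs whose corrected similarity exceeds the threshold."""
    return [(end, start) for end, (c, start) in enumerate(results) if c > threshold]


# Invented scores: the word matches well only in frames 1 and 2.
word_scores = [-5.0, -1.0, -1.0, -5.0]
ref_scores = [-2.0, -2.0, -2.0, -2.0]
spotted = detect(spot_per_end_frame(word_scores, ref_scores), 1.5)
```

In this toy the reference-pattern portion before the word contributes equally to S and R and so cancels in C, leaving a score that reflects only the section aligned to the word.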

【００８８】このようにして、第５の発明によって、話
者や環境に影響されない補正された類似度を得て、ワー
ドスポッティングが可能になる。As described above, according to the fifth aspect of the invention, it is possible to obtain the corrected similarity not affected by the speaker or the environment and perform the word spotting.

【００８９】[0089]

【発明の効果】以上説明したように本発明によれば、入
力音声の特徴ベクトル時系列と登録した標準パターンと
の類似度を各フレームのベクトル間類似度の最大値や入
力音声と参照用の標準パターンとの参照類似度を用いて
補正することにより、話者や発声環境が異なった場合で
も同一の尺度で類似度を比較でき、安定したリジェクト
やワードスポッティングを実現することができる。As described above, according to the present invention, the similarity between the feature vector time series of the input voice and the registered standard patterns is corrected using either the maximum inter-vector similarity in each frame or the reference similarity between the input voice and standard patterns for reference. As a result, similarities can be compared on the same scale even when the speaker or the speaking environment differs, and stable rejection and word spotting can be realized.

[Brief description of drawings]

【図１】第１，３及び４の発明の一実施例のブロック
図。FIG. 1 is a block diagram of an embodiment of first, third and fourth inventions.

【図２】第２及び４の発明の一実施例のブロック図。FIG. 2 is a block diagram of an embodiment of the second and fourth inventions.

【図３】第５の発明の一実施例のブロック図。FIG. 3 is a block diagram of an embodiment of the fifth invention.

[Explanation of symbols]

１ 分析部 (analysis unit)
２ ベクトル間類似度計算部 (inter-vector similarity calculation unit)
３ 類似度補正部 (similarity correction unit)
４ 類似度累積部 (similarity accumulation unit)
５ 識別部 (identification unit)
６ リジェクト部 (reject unit)
７ 音声検出部 (voice detection unit)
１１ 分析部 (analysis unit)
１２ 比較照合部 (comparison and collation unit)
１３ 単位標準パターン記憶部 (unit standard pattern storage unit)
１４ 参照類似度計算部 (reference similarity calculation unit)
１５ 類似度補正部 (similarity correction unit)
１７ 音声検出部 (voice detection unit)
２１ 分析部 (analysis unit)
２２ 比較照合部 (comparison and collation unit)
２３ 単位標準パターン記憶部 (unit standard pattern storage unit)
２４ 参照類似度計算部 (reference similarity calculation unit)
２５ 類似度補正部 (similarity correction unit)
２７ 音声検出部 (voice detection unit)
２８ 標準パターン記憶部 (standard pattern storage unit)
２９ 認識パターン生成部 (recognition pattern generation unit)
３０ 参照パターン生成部 (reference pattern generation unit)

Claims

[Claims]

1. A voice recognition device comprising: an analysis unit that converts an input voice signal into a time series of feature vectors, one per frame; an inter-vector similarity calculation unit that obtains, for each frame of the input voice signal, the inter-vector similarity between the feature vectors constituting standard patterns registered in advance and the feature vector of the input voice signal; a similarity correction unit that obtains, for each frame of the input voice signal, a frame-corrected similarity corrected using the maximum value of the inter-vector similarity in that frame; a similarity accumulation unit that accumulates the frame-corrected similarities to obtain a corrected similarity; and an identification unit that identifies the voice based on the corrected similarity.

2. A voice recognition device comprising: an analysis unit that converts an input voice signal into a time series of feature vectors, one per frame; a comparison and collation unit that compares and collates the feature vector time series with standard patterns registered in advance, with an end frame fixed, and obtains the standard pattern giving the maximum similarity as a recognition result together with the start frame corresponding to the maximum similarity; a unit standard pattern storage unit that holds standard patterns of recognition units; a reference similarity calculation unit that obtains, as a reference similarity, the maximum value of the similarity between standard patterns formed by combining the standard patterns of the recognition units in a predetermined order and the feature vector time series from the start frame to the end frame; and a similarity correction unit that obtains a corrected similarity by correcting the similarity obtained by the comparison and collation unit with the reference similarity.

3. The voice recognition device according to claim 1, further comprising a reject unit that outputs a reject signal when the corrected similarity obtained by the similarity correction unit is smaller than a predetermined threshold value.

4. The voice recognition device according to claim 1, further comprising a voice detection unit that detects a voice when the corrected similarity obtained by the similarity correction unit is larger than a predetermined threshold value.

5. A voice recognition device comprising: an analysis unit that converts an input voice signal into a time series of feature vectors, one per frame; a standard pattern storage unit that holds standard patterns to be recognized; a unit standard pattern storage unit that holds standard patterns of recognition units; a reference pattern generation unit that combines the standard patterns of the recognition units in a predetermined order to form a reference pattern; a recognition pattern generation unit that combines the reference pattern and a standard pattern to be recognized to form a recognition pattern; a comparison and collation unit that compares and collates the feature vector time series with the recognition pattern, with an end frame fixed, and obtains as a recognition result the standard pattern constituting the recognition pattern that gives the maximum similarity; a reference similarity calculation unit that obtains, as a reference similarity, the maximum value of the similarity when the reference pattern and the feature vector time series are compared and collated with the end frame as the end; a similarity correction unit that obtains a corrected similarity by correcting the maximum similarity obtained by the comparison and collation unit using the reference similarity; and a voice detection unit that detects a voice when the corrected similarity obtained by the similarity correction unit is larger than a predetermined threshold value.