JP3039095B2 - Voice recognition device

Info

Publication number: JP3039095B2
Application number: JP4014399A
Authority: JP
Inventors: 塚田聡
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1992-01-30
Filing date: 1992-01-30
Publication date: 2000-05-08
Anticipated expiration: 2015-05-08
Also published as: JPH05210396A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to speech recognition apparatus, and in particular to improvements in the rejection of utterances other than recognition target words, the detection of unknown words, and word spotting, which detects the presence of speech belonging to a standard-pattern category within part of an unknown input utterance.

[0002]

[Prior Art] Conventionally, as described on pages 149-177 of "Digital Speech Processing" published by Tokai University Press (hereinafter, reference [1]), speech recognition has been performed by comparing the input unknown speech against each pre-registered standard pattern of the recognition targets to obtain a similarity, and selecting the category of the standard pattern that gives the maximum similarity. The similarity used here may be based on the distance between feature vectors, on the appearance probability of feature vectors, and so on. The similarity obtained in this way can also be used to judge an input to be an unknown word when a word not registered as a standard pattern is spoken. For example, as described in "Development of a Speaker-Independent Word Speech Recognition Device for Automatic Ticket Vending Machines under High Noise", IEICE Technical Report, Vol. 89, No. 91, June 1989, pp. 1-8 (hereinafter, reference [2]), there is a method that judges the input to be an unknown word and rejects it when the obtained similarity is smaller than a predetermined threshold. However, the magnitude of this similarity varies greatly with the speaker and the speaking environment. To keep unknown-word detection and rejection accurate across different speakers and environments, a different threshold must therefore be set for each speaker and each environment, which requires a great deal of labor. To address this, Japanese Patent Laid-Open No. 4-255900 (hereinafter, reference [3]) describes a method that obtains the similarity between the input unknown speech and the pre-registered standard patterns of the recognition targets, also obtains, as a reference similarity, the similarity computed with the restriction to the recognition targets removed and the linguistic constraints weakened, and corrects the former similarity using the reference similarity. The influence of the speaker and the environment on the input speech appears equally in the similarity to the standard patterns and in the reference similarity, so that influence can be said to cancel out in the corrected similarity. Consequently, judging an input to be an unknown word when the corrected similarity is smaller than a predetermined threshold achieves accurate rejection.

[0003] For word spotting, which detects the presence of speech belonging to a standard-pattern category within part of an unknown input utterance, there is a method, described on pages 176-177 of reference [1] and on pages 87-89 of "Speech Recognition by Probabilistic Models" published by the Institute of Electronics, Information and Communication Engineers (IEICE) (reference [6]), in which a partial interval of the input unknown speech is compared against a pre-registered standard pattern to obtain a similarity and, when the obtained similarity is larger than a predetermined threshold, speech of that standard pattern's category is judged to be present and is recognized.

[0004]

[Problem to Be Solved by the Invention] In word spotting as well, the similarity varies in magnitude with the speaker and the speaking environment, so accurate speech detection requires a great deal of labor. If the similarity correction method described above were applied and word spotting were performed on the corrected similarity, highly accurate word spotting should be possible. However, that correction method cannot be applied as it stands to a recognition method such as word spotting, in which the endpoints are not fixed. The object of the present invention is to apply to word spotting a method of correcting the similarity so that similarities can be compared on the same scale even when the speaker or the speaking environment differs, and thereby to achieve stable speech detection.

[0005]

[Means for Solving the Problem] The speech recognition apparatus of the first invention comprises: an analysis unit that converts an input speech signal into a time series of feature-vector frames; an inter-vector similarity calculation unit that obtains, for each frame of the input speech signal, the inter-vector similarity between the feature vectors constituting pre-registered standard patterns and the feature vector of the input speech signal; a similarity correction unit that obtains frame-corrected similarities by correcting with the maximum of the inter-vector similarities in each frame of the input speech signal; a similarity accumulation unit that accumulates the frame-corrected similarities into a corrected similarity; and an identification unit that identifies the speech on the basis of the corrected similarity.

[0006] The speech recognition apparatus of the second invention comprises: an analysis unit that converts an input speech signal into a time series of feature-vector frames; a comparison and matching unit that matches the feature-vector time series against pre-registered standard patterns with the end frame fixed, obtains the standard pattern giving the maximum similarity as the recognition result, and also obtains the start frame corresponding to that maximum similarity; a unit standard pattern storage unit that holds standard patterns of recognition units; a reference similarity calculation unit that obtains, as a reference similarity, the maximum similarity between standard patterns formed by concatenating the recognition-unit standard patterns in a predetermined order and the feature-vector time series from the start frame to the end frame; and a similarity correction unit that obtains a corrected similarity by correcting, with the reference similarity, the similarity obtained by the comparison and matching unit.

[0007] The speech recognition apparatus of the third invention has a reject unit that outputs a reject signal when the corrected similarity obtained by the similarity correction unit is smaller than a predetermined threshold.

[0008] The speech recognition apparatus of the fourth invention has a speech detection unit that detects speech when the corrected similarity obtained by the similarity correction unit is larger than a predetermined threshold.

[0009] The speech recognition apparatus of the fifth invention comprises: an analysis unit that converts an input speech signal into a time series of feature-vector frames; a standard pattern storage unit that holds the standard patterns of the recognition targets; a unit standard pattern storage unit that holds standard patterns of recognition units; a reference pattern generation unit that concatenates the recognition-unit standard patterns in a predetermined order to form a reference pattern; a recognition pattern generation unit that concatenates the reference pattern and a standard pattern of a recognition target to form a recognition pattern; a comparison and matching unit that matches the feature-vector time series against the recognition patterns with the end frame fixed and obtains, as the recognition result, the standard pattern contained in the recognition pattern giving the maximum similarity; a reference similarity calculation unit that obtains, as a reference similarity, the maximum similarity when the reference pattern is matched with that end frame as its end; a similarity correction unit that obtains a corrected similarity by correcting, with the reference similarity, the maximum similarity obtained by the comparison and matching unit; and a speech detection unit that detects speech when the corrected similarity obtained by the similarity correction unit is larger than a predetermined threshold.

[0010]

[Operation] The present invention obtains the similarity between the feature-vector time series of the input speech and pre-registered standard patterns, also obtains a similarity with the word-dictionary restriction removed and the linguistic constraints weakened, and corrects the former similarity with the latter. By computing these similarities for each frame of the input speech, it makes the correction applicable to word spotting.

[0011] Consider recognizing speech with pre-registered standard patterns in the speech recognition apparatus of the first invention. Each standard pattern consists of a sequence of feature vectors.

[0012] First, the analysis unit converts the input speech signal into a time series of feature-vector frames. Methods such as mel-cepstrum analysis or LPC analysis, described on pages 32-98 of reference [1], can be used for this analysis.

[0013] Next, the inter-vector similarity calculation unit computes, for each frame of the input speech signal, the similarity between the feature vector obtained by the analysis unit and the feature vectors constituting the pre-registered standard patterns. The inter-vector similarity can be obtained by a method based on the distance between vectors, as described on pages 154-161 of reference [1], or by a method using vector appearance probabilities based on a hidden Markov model (hereinafter, HMM). HMMs are described in S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition", The Bell System Technical Journal, Vol. 62, No. 4, April 1983, pp. 1035-1074 (hereinafter, reference [4]).

[0014] The similarity correction unit finds, for each frame, the maximum of the inter-vector similarities and uses that maximum to correct every inter-vector similarity of the frame, yielding the frame-corrected similarities. The correction may, for example, take a value based on the difference between each inter-vector similarity and the maximum, or a value based on their ratio.
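The per-frame correction of paragraph [0014] can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `frame_corrected_similarities` and the `mode` parameter are assumptions introduced here, and the ratio variant additionally assumes that the frame maximum is non-zero.

```python
def frame_corrected_similarities(sims, mode="difference"):
    """Correct one frame's inter-vector similarities by the frame maximum.

    sims: similarities between the current input frame and every reference
          vector of the registered standard patterns (hypothetical layout).
    mode: "difference" subtracts the maximum, "ratio" divides by it.
    """
    best = max(sims)
    if mode == "difference":
        return [s - best for s in sims]
    if mode == "ratio":
        # Assumes best != 0; the source does not discuss this case.
        return [s / best for s in sims]
    raise ValueError(f"unknown mode: {mode}")


# A constant offset, standing in for a speaker or environment effect,
# disappears under the difference-based correction.
quiet = frame_corrected_similarities([2.0, 5.0, 3.0])
noisy = frame_corrected_similarities([12.0, 15.0, 13.0])
```

With the difference-based correction, the best-matching reference vector of each frame always scores 0 and every other vector scores below 0, regardless of the absolute level of the raw similarities.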

[0015] The similarity accumulation unit accumulates the frame-corrected similarities over the input speech signal to obtain the corrected similarity. For matching against word patterns, the accumulation can use dynamic programming, as shown on pages 154-165 of reference [1], or an HMM, as described in reference [4].
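One way to picture the accumulation of paragraph [0015] is a small dynamic-programming pass over frame-corrected scores. The sketch below makes several assumptions that are not in the source: similarities are stored as `sims[t][k]` (frame `t`, reference vector `k`), a standard pattern is a list of reference-vector indices, the correction is the difference-based one, the path may only stay in a state or advance by one, and both endpoints are fixed.

```python
NEG_INF = float("-inf")


def accumulate_corrected(sims, template):
    """Accumulate frame-corrected similarities along the best DP path.

    sims[t][k]: similarity between input frame t and reference vector k.
    template:   reference-vector indices forming one standard pattern.
    Returns the corrected similarity of the whole input against the template.
    """
    prev = None
    for frame in sims:
        best = max(frame)  # maximum over all reference vectors in this frame
        local = [frame[k] - best for k in template]  # frame-corrected scores
        if prev is None:
            # The path must begin in the first state of the template.
            cur = [local[0]] + [NEG_INF] * (len(template) - 1)
        else:
            cur = []
            for j, score in enumerate(local):
                stay = prev[j]
                advance = prev[j - 1] if j > 0 else NEG_INF
                cur.append(score + max(stay, advance))
        prev = cur
    return prev[-1]


sims = [[5, 1, 1], [1, 5, 1], [1, 1, 5]]
matching = accumulate_corrected(sims, [0, 1, 2])
mismatching = accumulate_corrected(sims, [2, 1, 0])
shifted = accumulate_corrected([[s + 10 for s in row] for row in sims], [0, 1, 2])
```

The identification step of the source then amounts to choosing the standard pattern with the largest accumulated value.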

[0016] The identification unit identifies the speech on the basis of the corrected similarity and produces the recognition result. For example, the candidate with the largest corrected similarity can be taken as the recognition result.

[0017] The influence of the speaker and the environment on the input speech appears equally in the similarity itself and in the maximum similarity used for the correction, so it cancels out in the corrected similarity obtained in this way.

[0018] Reference [3] also describes a method based on the per-frame maximum of the inter-vector similarity. In that method, the similarity to a standard pattern, obtained by accumulating the per-frame inter-vector similarities, and the reference similarity, obtained by accumulating the per-frame maxima of the inter-vector similarity, are computed separately, after which the similarity to the standard pattern is corrected with the reference similarity. The interval over which the similarity to the standard pattern is computed and the interval over which the reference similarity is computed must be the same. In word spotting, however, the endpoints are not determined in advance, so with the method of reference [3] the reference similarity for the same interval has to be computed after the similarity to the standard pattern has been obtained. This increases the computation required after the similarity to the standard pattern is obtained, which makes the method unsuitable for word spotting.

[0019] In the speech recognition apparatus of the first invention, by contrast, the corrected similarity is accumulated frame by frame, so the interval used for the similarity to the standard pattern and the interval used for the correcting similarity always coincide. There is therefore no need to compute a reference similarity after the per-frame similarities have been accumulated.
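The coincidence of the two intervals can be written out for the difference-based correction. The symbols are illustrative and do not appear in the original: $s_t$ is the inter-vector similarity chosen by the matching path at frame $t$, $m_t$ is the maximum inter-vector similarity in frame $t$, and $t_s$ and $t_e$ are the start and end frames of the matched interval, assuming the path visits each frame once.

```latex
\tilde{S}(t_s, t_e)
  = \sum_{t=t_s}^{t_e} \bigl( s_t - m_t \bigr)
  = \sum_{t=t_s}^{t_e} s_t \; - \sum_{t=t_s}^{t_e} m_t
```

Under these assumptions the second sum plays the role of the reference similarity of reference [3], but it is accumulated together with the first sum over exactly the frames the path covers, so it does not need to be recomputed once the endpoints are known.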

[0020] For these reasons, the speech recognition apparatus of the first invention provides a similarity correction scheme that suppresses the variation in similarity magnitude caused by differences in speaker and environment and that is well suited to word spotting.

[0021] In this way, the first invention yields a corrected similarity that is not affected by the speaker or the environment.

[0022] In the speech recognition apparatus of the second invention, the comparison and matching unit computes, with the end frame fixed, the similarity between the time series of feature-vector frames obtained by the analysis unit and a plurality of pre-registered standard patterns. The standard pattern giving the maximum similarity is obtained as the recognition result, together with the start frame corresponding to that maximum similarity. Matching against word patterns can be done, as shown on pages 154-165 of reference [1], by holding a feature-vector time series as each standard pattern, computing the similarity from the distance between feature vectors, and matching by dynamic programming, or by HMM-based matching as described in reference [4]. The start frame corresponding to the maximum similarity can be obtained, as shown in "Speeding Up DP Matching by Integrating Frame Synchronization, Beam Search, and Vector Quantization", IEICE Transactions, Vol. J71-D, No. 9, pp. 1650-1659 (hereinafter, reference [5]), by passing the start frame along each time a matching path is selected, so that the start frame is known as soon as the maximum similarity is determined.
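The idea of passing the start frame along the matching path, as described in paragraph [0022], can be sketched as below. This is an illustrative reading rather than the algorithm of reference [5]: the function name, the `local[t][j]` layout (frame `t`, template state `j`), and the stay-or-advance path constraint are all assumptions made for the example.

```python
NEG_INF = float("-inf")


def match_with_start_frames(local):
    """Free-start DP that carries the start frame along each path.

    local[t][j]: local similarity between input frame t and template state j.
    Returns, for every end frame t, a pair (score, start_frame) for the best
    path ending in the last state at t; the score is -inf if none exists.
    """
    n_states = len(local[0])
    prev_score = [NEG_INF] * n_states
    prev_start = [None] * n_states
    results = []
    for t, row in enumerate(local):
        score = [NEG_INF] * n_states
        start = [None] * n_states
        for j in range(n_states):
            # Each candidate carries the start frame it was opened at.
            candidates = [(prev_score[j], prev_start[j])]  # stay in state j
            if j > 0:
                candidates.append((prev_score[j - 1], prev_start[j - 1]))  # advance
            else:
                candidates.append((0.0, t))  # open a new path at frame t
            best_score, best_start = max(candidates, key=lambda c: c[0])
            if best_score > NEG_INF:
                score[j] = best_score + row[j]
                start[j] = best_start
        results.append((score[-1], start[-1]))
        prev_score, prev_start = score, start
    return results


local = [[-3, -3], [2, -3], [-3, 2], [-3, -3]]
results = match_with_start_frames(local)
best_end = max(range(len(results)), key=lambda t: results[t][0])
```

Because the start frame travels with the score, no backtracking pass is needed once the best end frame has been chosen.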

[0023] The reference similarity calculation unit matches the feature-vector time series of the input speech between the start frame and the end frame obtained by the comparison and matching unit against a plurality of standard patterns formed by concatenating, in a prescribed order, the recognition-unit standard patterns stored in the unit standard pattern storage unit, and takes the maximum of the resulting similarities as the reference similarity. By describing the prescribed order in advance as a network, the matching against these concatenated patterns can use continuous speech recognition by frame-synchronous DP matching, as shown in reference [5], or the HMM-based continuous speech recognition algorithm shown on pages 40-50 of "Speech Recognition by Probabilistic Models" published by the IEICE (hereinafter, reference [6]). The recognition-unit standard patterns may be syllables, phonemes, words, or the like, as shown on page 5 of reference [1].

[0024] Next, the similarity correction unit corrects the similarity to the standard pattern obtained by the comparison and matching unit using the reference similarity, yielding the corrected similarity. The correction may, for example, take a value based on the difference between the similarity to the standard pattern and the reference similarity, or a value based on their ratio.
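The reference similarity and the correction of paragraphs [0023] and [0024] can be illustrated under a strong simplifying assumption that is not in the source: every recognition unit is a single-state pattern and any unit may follow any other, so the best concatenation of units is simply the best unit in each frame. The names `reference_similarity`, `correct`, and `unit_sims` are hypothetical.

```python
def reference_similarity(unit_sims, start, end):
    """Reference similarity over frames start..end (inclusive).

    unit_sims[t][u]: similarity between frame t and recognition unit u.
    With single-state units in free order (an assumption of this sketch),
    the best unit sequence picks the best unit independently per frame.
    """
    return sum(max(unit_sims[t]) for t in range(start, end + 1))


def correct(similarity, reference, mode="difference"):
    """Correct a standard-pattern similarity by the reference similarity."""
    if mode == "difference":
        return similarity - reference
    if mode == "ratio":
        # Assumes reference != 0; the source does not discuss this case.
        return similarity / reference
    raise ValueError(f"unknown mode: {mode}")


unit_sims = [[1, 2], [3, 0], [0, 4], [5, 5]]
reference = reference_similarity(unit_sims, 1, 2)  # frames 1 and 2 only
corrected = correct(5, reference)
```

A real unit network with multi-state syllable or phoneme patterns would replace the per-frame maximum with a continuous-recognition DP over the same interval.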

[0025] The influence of the speaker and the environment on the input speech appears equally in the similarity to the standard pattern and in the reference similarity used for the correction, so it cancels out in the corrected similarity obtained in this way.

[0026] Moreover, even when the endpoints are not fixed in advance, the method computes the similarity to the standard pattern, then determines the endpoints, computes the reference likelihood, and performs the correction, so it can be applied to word spotting.

[0027] Thus the speech recognition apparatus of the second invention can also suppress the variation in similarity magnitude caused by differences in speaker and environment.

[0028] In the third invention, the reject unit generates a reject signal when the corrected similarity is smaller than a predetermined threshold. Because the corrected similarity has been compensated for the variation in similarity magnitude caused by differences in speaker and environment, a single fixed threshold can be used for rejection.

[0029] In reference [3], the correction by the reference similarity had to be performed after the similarity to the standard pattern had been obtained. In the third invention, the corrected similarity is obtained at each frame by the first invention, so the corrected similarity results from a computation of the same form as the similarity computation against the standard pattern in reference [3], with the advantage that no correction is needed afterwards.

[0030] In the fourth invention, the speech detection unit detects speech when the corrected similarity is larger than a predetermined threshold. Because the corrected similarity has been compensated for the variation in similarity magnitude caused by differences in speaker and environment, a single fixed threshold can be used to detect speech.
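The reject and detection decisions of the third and fourth inventions reduce to comparisons against one fixed threshold. The sketch below is only an illustration: the function names and the threshold value of -5.0 are hypothetical, and a corrected similarity exactly equal to the threshold is neither rejected nor detected, following the "smaller than" and "larger than" wording of the source.

```python
def reject(corrected_similarity, threshold):
    """Third invention: signal a reject when the corrected similarity is too small."""
    return corrected_similarity < threshold


def detect(corrected_similarity, threshold):
    """Fourth invention: report speech when the corrected similarity is large enough."""
    return corrected_similarity > threshold


def detected_end_frames(corrected_by_end_frame, threshold):
    """Word spotting: end frames whose corrected similarity exceeds the threshold."""
    return [t for t, s in enumerate(corrected_by_end_frame) if detect(s, threshold)]


THRESHOLD = -5.0  # hypothetical value, shared by all speakers and environments
hits = detected_end_frames([-8.0, -2.0, -6.0, -1.0], THRESHOLD)
```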

[0031] In the fifth invention, the reference pattern generation unit concatenates, in a predetermined order, the recognition-unit standard patterns stored in the unit standard pattern storage unit to generate a reference pattern. The recognition-unit standard patterns may be syllables, phonemes, words, or the like, as shown on page 5 of reference [1].

[0032] The recognition pattern generation unit generates recognition patterns by concatenating the reference pattern and the standard patterns.

[0033] The comparison and matching unit matches the feature-vector time series against the recognition patterns with the end frame fixed and obtains, as the recognition result, the standard pattern contained in the recognition pattern giving the maximum similarity. Matching against word patterns can be done, as shown on pages 154-165 of reference [1], by holding a feature-vector time series as each standard pattern, computing the similarity from the distance between feature vectors, and matching by dynamic programming, or by HMM-based matching as described in reference [4]. Matching against the reference pattern can use continuous speech recognition by frame-synchronous DP matching, as shown in reference [5], or the HMM-based continuous speech recognition algorithm shown on pages 40-50 of reference [6].

[0034] The reference similarity calculation unit matches the feature-vector time series against the reference pattern with the end frame fixed, computes the similarities, and takes the maximum as the reference similarity. Matching against the reference pattern can use continuous speech recognition by frame-synchronous DP matching, as shown in reference [5], or the HMM-based continuous speech recognition algorithm shown on pages 40-50 of reference [6].

[0035] Next, the similarity correction unit obtains the corrected similarity by correcting, with the reference similarity, the maximum similarity obtained by the comparison and matching unit. The correction may, for example, take a value based on the difference between the similarity to the standard pattern and the reference similarity, or a value based on their ratio.

[0036] Next, the speech detection unit detects speech when the corrected similarity obtained by the similarity correction unit is larger than a predetermined threshold.

[0037] The influence of the speaker and the environment on the input speech appears equally in the similarity to the standard pattern and in the reference similarity used for the correction, so it cancels out in the corrected similarity obtained in this way.

[0038] In the second invention, the similarity between the feature vector time series and a plurality of pre-registered standard patterns is calculated for a given end frame, the start frame corresponding to the maximum similarity is obtained, and the portion between the start frame and the end frame is then extracted from the feature vector time series to calculate the reference similarity. Because the reference similarity has to be recalculated for every end frame, the amount of computation is large. In the fifth invention, the similarity obtained by matching against a recognition pattern, which is formed by combining a reference pattern with a standard pattern, is corrected using the reference similarity obtained by matching against the reference pattern. The result of matching the input speech against the recognition pattern divides into a part corresponding to the standard pattern to be recognized and a part corresponding to the reference pattern. Even when the start is fixed, the matching therefore amounts to word spotting that associates a partial section of the input speech with the standard pattern. For the part corresponding to the reference pattern, nearly the same reference pattern is expected to be aligned in the reference similarity calculation as well, and nearly the same similarity value is expected. Correcting the similarity to the recognition pattern by the similarity to the reference pattern therefore yields, for the section corresponding to the standard pattern, a corrected similarity in which the similarity to the standard pattern has been corrected by the similarity to the reference pattern. As a result, word spotting with similarity correction applied becomes possible using start-fixed matching, and high-accuracy word spotting can be performed with a small amount of computation.
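
The cancellation argument above can be checked with a small worked example. This is a sketch under simplifying assumptions that are not in the original text: similarities are sums of per-frame scores, the boundary between the reference part and the standard-pattern part (`split`) is taken as given rather than found by matching, and all names and numbers are hypothetical.

```python
def corrected_from_fixed_start(filler_scores, keyword_scores, split):
    """Corrected similarity C = S - R for start-fixed matching.

    filler_scores[t]: per-frame score of the reference pattern at frame t.
    keyword_scores:   per-frame scores of the standard pattern, aligned to
                      frames split .. end of the input.
    """
    # S: recognition pattern = reference part (frames before split)
    #    followed by the standard pattern (frames from split onward).
    s = sum(filler_scores[:split]) + sum(keyword_scores)
    # R: reference pattern alone over the whole input.
    r = sum(filler_scores)
    return s - r


filler = [-1.0, -2.0, -1.5, -1.0, -0.5, -0.5]
keyword = [-0.25, -0.125, -0.25]

c = corrected_from_fixed_start(filler, keyword, split=3)
# Correction computed only over the section covered by the standard pattern.
segment_only = sum(keyword) - sum(filler[3:])
print(c, segment_only)
```

The scores of the frames before the boundary appear in both S and R and drop out, so the two values agree without re-running the reference matching for each end frame.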

[0039]

[Embodiments] Embodiments of the present invention will be described with reference to the drawings.

[0040] FIG. 1 is a block diagram of an embodiment of the first, third, and fourth inventions.

[0041] This speech recognition device comprises an analysis unit 1, an inter-vector similarity calculation unit 2, a similarity correction unit 3, a similarity accumulation unit 4, an identification unit 5, a reject unit 6, and a speech detection unit 7.

[0042] The analysis unit 1 performs feature analysis on the input speech signal I and converts it into a feature vector time series V.

[0043] The inter-vector similarity calculation unit 2 calculates and outputs the inter-vector similarity D between the feature vector of one frame of the feature vector time series V and the feature vectors constituting the pre-registered standard patterns.

[0044] The similarity correction unit 3 receives the inter-vector similarity D, finds the maximum inter-vector similarity for each frame, corrects the inter-vector similarities of that frame using this maximum, and outputs the result as the frame-corrected similarity F.

[0045] The similarity accumulation unit 4 accumulates the frame-corrected similarity F and outputs the result as the corrected similarity C.

[0046] The identification unit 5 identifies the speech using the corrected similarity C and outputs the result as the recognition result A.

[0047] The reject unit 6 outputs a reject signal J when the corrected similarity C is smaller than a predetermined threshold.

[0048] The speech detection unit 7 regards the recognition result A as detected when the corrected similarity C is larger than a predetermined threshold.

[0049] Next, the operation of the embodiment of FIG. 1 will be described.

[0050] The input speech signal I is fed to the analysis unit 1 and converted into the feature vector time series V by feature analysis. For this analysis, for example, the mel-cepstrum method shown on page 155 of reference [1] can be used.

[0051] The feature vector time series V is input to the inter-vector similarity calculation unit 2, where its similarity to the feature vectors constituting the plurality of pre-registered standard patterns is calculated to give the inter-vector similarity D. As a method of obtaining the inter-vector similarity, for example, a method based on inter-vector distance as described on pages 154-161 of reference [1] can be used.
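
One possible distance-based similarity is sketched below. The choice of negative Euclidean distance is an assumption made for illustration (the text only says the similarity is based on inter-vector distance), and the function name is hypothetical.

```python
import math


def inter_vector_similarity(frame, template_vectors):
    """Similarity D of one input frame to each template vector.

    Uses the negative Euclidean distance, so a larger value means a
    closer match. Returns one value per template vector.
    """
    return [-math.dist(frame, v) for v in template_vectors]


# One 2-dimensional input frame against two template vectors.
d = inter_vector_similarity([0.0, 0.0], [[3.0, 4.0], [0.0, 0.0]])
print(d)
```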

[0052] The inter-vector similarity D is input to the similarity correction unit 3, which finds the maximum inter-vector similarity for each frame, corrects the inter-vector similarities of that frame using this maximum, and outputs the result as the frame-corrected similarity F. As the correction method, a value based on the difference between each inter-vector similarity and the maximum inter-vector similarity can be used.
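
The difference-based frame correction can be sketched as follows. The function name and the numbers are hypothetical; the sketch assumes the similarities of one frame to all template vectors are held in a plain list.

```python
def frame_corrected_similarity(d_frame):
    """Frame-corrected similarity F for ONE input frame.

    d_frame holds the inter-vector similarities D of this frame to every
    template vector of every standard pattern. Each value is corrected by
    subtracting the frame's maximum, so the best match becomes 0.
    """
    best = max(d_frame)
    return [d - best for d in d_frame]


clean = [-5.0, -2.0, -3.0]
# Same frame under a condition that lowers every similarity by 10.
shifted = [d - 10.0 for d in clean]

print(frame_corrected_similarity(clean))
print(frame_corrected_similarity(shifted))
```

A uniform shift of all similarities in a frame, which is how a speaker or environment effect is modeled in this sketch, leaves F unchanged.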

[0053] The frame-corrected similarity F is input to the similarity accumulation unit 4, which outputs the corrected similarity C obtained by accumulating the frame-corrected similarities. As the accumulation method, for example, accumulation based on dynamic programming as shown on pages 154-165 of reference [1] can be used.
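
A dynamic-programming accumulation can be sketched as below. The path constraints are an assumption chosen for brevity (each input frame either stays on the same template vector or advances by one); the cited reference may use different local constraints. The function name is hypothetical.

```python
def accumulate(frame_scores):
    """Corrected similarity C for one standard pattern.

    frame_scores[t][j] is the frame-corrected similarity F between input
    frame t and template vector j. Returns the maximum accumulated score
    over monotone alignments from (0, 0) to (last frame, last vector),
    or -inf when no such alignment exists (input shorter than template).
    """
    neg_inf = float("-inf")
    n_vectors = len(frame_scores[0])
    prev = [neg_inf] * n_vectors
    prev[0] = frame_scores[0][0]
    for t in range(1, len(frame_scores)):
        cur = [neg_inf] * n_vectors
        for j in range(n_vectors):
            # Best predecessor: stay on vector j or advance from j - 1.
            best = prev[j]
            if j >= 1:
                best = max(best, prev[j - 1])
            if best > neg_inf:
                cur[j] = best + frame_scores[t][j]
        prev = cur
    return prev[-1]


# Three input frames against a two-vector template.
c = accumulate([[0.0, -1.0], [-1.0, 0.0], [-2.0, 0.0]])
print(c)
```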

[0054] The corrected similarity C is input to the identification unit 5, which identifies the speech using the corrected similarity and outputs the result as the recognition result A. As the identification method, for example, the standard pattern with the largest corrected similarity can be selected.

[0055] In this way, the first invention provides a corrected similarity that is not affected by the speaker or the environment.

[0056] According to the third invention, the reject unit 6 outputs the reject signal J when the corrected similarity C is smaller than a predetermined threshold.
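
Identification followed by rejection can be sketched as follows. The function name, the category labels, and the threshold value are hypothetical, and returning `None` stands in for the reject signal J.

```python
def identify(corrected, reject_threshold):
    """Pick the category with the largest corrected similarity C.

    corrected maps each standard-pattern category to its corrected
    similarity. Returns None (standing in for the reject signal J) when
    even the best corrected similarity is below the threshold.
    """
    label = max(corrected, key=corrected.get)
    if corrected[label] < reject_threshold:
        return None
    return label


print(identify({"yes": -1.5, "no": -4.0}, reject_threshold=-3.0))
print(identify({"yes": -5.0, "no": -4.0}, reject_threshold=-3.0))
```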

[0057] According to the fourth invention, when the corrected similarity C is larger than a predetermined threshold, the speech detection unit 7 detects the input as speech and the recognition result A is output.

[0058] FIG. 2 is a block diagram of an embodiment of the second and fourth inventions.

[0059] This speech recognition device comprises an analysis unit 11, a comparison/collation unit 12, a unit standard pattern storage unit 13, a reference similarity calculation unit 14, a similarity correction unit 15, and a speech detection unit 17.

[0060] The analysis unit 11 performs feature analysis on the input speech signal I and converts it into a feature vector time series V.

[0061] The comparison/collation unit 12 compares the feature vector time series V with the pre-registered standard patterns for a given end frame E, obtains the standard pattern giving the maximum similarity S as the recognition result A, and obtains the start frame T corresponding to the maximum similarity.

[0062] The unit standard pattern storage unit 13 stores unit standard patterns.

[0063] The reference similarity calculation unit 14 compares standard patterns, formed by combining the unit standard patterns registered in the unit standard pattern storage unit 13 in a predetermined order, with the feature vector time series extracted from the feature vector time series V over the section from the start frame T to the end frame E. It calculates the similarity for each of these standard patterns and obtains the maximum similarity as the reference similarity R.

[0064] The similarity correction unit 15 corrects the similarity S using the reference similarity R to obtain the corrected similarity C.

[0065] The speech detection unit 17 regards the recognition result A as detected when the corrected similarity C is larger than a predetermined threshold.

[0066] Next, the operation of the embodiment of FIG. 2 will be described.

[0067] The input speech signal I is fed to the analysis unit 11 and converted into the feature vector time series V by feature analysis.

[0068] The feature vector time series V is input to the comparison/collation unit 12, where its similarity to the plurality of pre-registered standard patterns is calculated for a given end frame E. The maximum similarity S and the standard pattern giving it are obtained, the latter as the recognition result A, together with the start frame T corresponding to the maximum similarity. The similarity can be calculated by a method based on DP matching as shown in references [1] and [5], or by a method based on HMMs as shown in references [4] and [6].

[0069] The feature vector time series V is also input to the reference similarity calculation unit 14. The portion between the start frame T and the end frame E is extracted from the feature vector time series and compared with a plurality of standard patterns formed by combining, in a predetermined order, the unit standard patterns stored in the unit standard pattern storage unit 13, and the reference similarity R is output. As the recognition-unit standard patterns, syllables as shown on page 5 of reference [1] can be used.

[0070] Next, the similarity S and the reference similarity R are input to the similarity correction unit 15, which outputs the corrected similarity C obtained by correcting S using R. As the correction method, a value based on the difference between S and R can be used.

[0071] In this way, the second invention provides a corrected similarity that is not affected by the speaker or the environment.

[0072] According to the fourth invention, when the corrected similarity C is larger than a predetermined threshold, the speech detection unit 17 detects the input as speech and the recognition result A is output.
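
The correction and detection steps of this embodiment can be sketched as follows. The sketch assumes the matching has already produced, for each end frame E, a start frame T, a recognition result, a similarity S, and a reference similarity R; the tuple layout, the function name, and all values are hypothetical.

```python
def spot(candidates, threshold):
    """Return detections whose corrected similarity C = S - R exceeds the threshold.

    candidates: iterable of (end_frame, start_frame, label, s, r) tuples,
    one per end frame, where r was computed over frames start..end only.
    """
    hits = []
    for end, start, label, s, r in candidates:
        c = s - r  # difference-based correction
        if c > threshold:
            hits.append((start, end, label, c))
    return hits


candidates = [
    (10, 4, "tokyo", -30.0, -31.0),   # close to the reference: kept
    (20, 12, "osaka", -50.0, -40.0),  # far below the reference: dropped
]
print(spot(candidates, threshold=-2.0))
```

Note that R has to be recomputed for every end frame in this scheme, which is the computational cost that the fifth invention is intended to avoid.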

[0073] FIG. 3 is a block diagram of an embodiment of the fifth invention.

[0074] This speech recognition device comprises an analysis unit 21, a comparison/collation unit 22, a unit standard pattern storage unit 23, a reference similarity calculation unit 24, a similarity correction unit 25, a speech detection unit 27, a standard pattern storage unit 28, a recognition pattern generation unit 29, and a reference pattern generation unit 30.

[0075] The analysis unit 21 performs feature analysis on the input speech signal I and converts it into a feature vector time series V.

[0076] The standard pattern storage unit 28 stores standard patterns.

[0077] The unit standard pattern storage unit 23 stores unit standard patterns.

[0078] The reference pattern generation unit 30 generates reference patterns Q by combining the unit standard patterns stored in the unit standard pattern storage unit 23 in a predetermined order. The recognition pattern generation unit 29 combines a reference pattern Q with a standard pattern stored in the standard pattern storage unit 28 to generate a recognition pattern P.
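
Pattern generation can be sketched as simple concatenation of vector sequences. Two points are assumptions made for illustration: patterns are represented as lists of feature vectors, and the reference pattern is placed before the standard pattern (the text states only that the two are combined). All names and values are hypothetical.

```python
def build_reference_pattern(unit_patterns, order):
    """Reference pattern Q: unit standard patterns joined in the given order."""
    q = []
    for name in order:
        q.extend(unit_patterns[name])
    return q


def build_recognition_pattern(reference_pattern, standard_pattern):
    """Recognition pattern P: reference pattern followed by a standard pattern."""
    return reference_pattern + standard_pattern


# Hypothetical one-dimensional unit patterns keyed by syllable.
units = {"a": [[0.0], [0.1]], "ka": [[1.0]]}
q = build_reference_pattern(units, ["ka", "a"])
p = build_recognition_pattern(q, [[5.0]])
print(q, p)
```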

[0079] The comparison/collation unit 22 compares the feature vector time series V with the recognition patterns P and obtains, as the recognition result A, the standard pattern constituting the recognition pattern that gives the maximum similarity S.

[0080] The reference similarity calculation unit 24 compares the reference patterns Q with the feature vector time series V, calculates the similarity for each reference pattern, and obtains the maximum similarity as the reference similarity R.

[0081] The similarity correction unit 25 corrects the similarity S using the reference similarity R to obtain the corrected similarity C.

[0082] The speech detection unit 27 regards the recognition result A as detected when the corrected similarity C is larger than a predetermined threshold.

[0083] Next, the operation of the embodiment of FIG. 3 will be described.

[0084] The input speech signal I is fed to the analysis unit 21 and converted into the feature vector time series V by feature analysis.

[0085] The feature vector time series V is input to the comparison/collation unit 22, where its similarity to the recognition patterns P is calculated. The maximum similarity S is obtained, and the standard pattern constituting the recognition pattern that gives the maximum similarity is obtained as the recognition result A. The similarity can be calculated by a method based on DP matching as shown in references [1] and [5], or by a method based on HMMs as shown in references [4] and [6].

[0086] The feature vector time series V is also input to the reference similarity calculation unit 24, where it is compared with the plurality of reference patterns Q, and the reference similarity R is output.

[0087] Next, the similarity S and the reference similarity R are input to the similarity correction unit 25, which outputs the corrected similarity C obtained by correcting S using R. As the correction method, for example, a value based on the difference between S and R can be used. Then, when the corrected similarity C is larger than a predetermined threshold, the speech detection unit 27 detects the input as speech and the recognition result A is output.

[0088] In this way, the fifth invention provides a corrected similarity that is not affected by the speaker or the environment, and makes word spotting possible.

[0089]

[Effects of the Invention] As described above, according to the present invention, the similarity between the feature vector time series of the input speech and the registered standard patterns is corrected using either the maximum inter-vector similarity of each frame or the reference similarity between the input speech and standard patterns used for reference. Similarities can thus be compared on the same scale even when the speaker or the speaking environment differs, and stable rejection and word spotting can be achieved.

[Brief description of the drawings]

FIG. 1 is a block diagram of an embodiment of the first, third, and fourth inventions.

FIG. 2 is a block diagram of an embodiment of the second and fourth inventions.

FIG. 3 is a block diagram of an embodiment of the fifth invention.

[Explanation of symbols]

1 analysis unit
2 inter-vector similarity calculation unit
3 similarity correction unit
4 similarity accumulation unit
5 identification unit
6 reject unit
7 speech detection unit
11 analysis unit
12 comparison/collation unit
13 unit standard pattern storage unit
14 reference similarity calculation unit
15 similarity correction unit
17 speech detection unit
21 analysis unit
22 comparison/collation unit
23 unit standard pattern storage unit
24 reference similarity calculation unit
25 similarity correction unit
27 speech detection unit
28 standard pattern storage unit
29 recognition pattern generation unit
30 reference pattern generation unit

Continuation of the front page

(56) References:
JP-A-4-369696 (JP, A)
JP-A-63-56700 (JP, A)
JP-A-4-255900 (JP, A)
JP-A-4-188200 (JP, A)
JP-A-1-100600 (JP, A)
JP-A-5-19786 (JP, A)
Japanese Patent No. 2808906 (JP, B2)
Proceedings of the Acoustical Society of Japan Spring Meeting, fiscal year Heisei 3 (1991), I, 3-P-28, "Likelihood correction of speech recognition for unknown word detection and rejection," pp. 203-204 (received by the National Diet Library on August 5, 1991)
Proceedings of the Acoustical Society of Japan Spring Meeting, fiscal year Heisei 4 (1992), I, 1-P-11, "Evaluation of adaptive likelihood correction for speech recognition," pp. 125-126 (issued March 17, 1992)

(58) Fields searched (Int. Cl.7, DB name): G10L 15/10, G10L 15/08, G10L 15/28, JICST file (JOIS)

Claims

(57) [Claims]

1. A speech recognition device comprising: an analysis unit that converts an input speech signal into a time series of feature vectors, one per frame; an inter-vector similarity calculation unit that obtains, for each frame of the input speech signal, the inter-vector similarity between the feature vectors constituting pre-registered standard patterns and the feature vector of the input speech signal; a similarity correction unit that obtains a frame-corrected similarity by correcting the inter-vector similarity using the maximum inter-vector similarity in each frame of the input speech signal; a similarity accumulation unit that accumulates the frame-corrected similarity to obtain a corrected similarity; and an identification unit that identifies the speech based on the corrected similarity.

2. A speech recognition device comprising: an analysis unit that converts an input speech signal into a feature vector time series; a comparison/collation unit that compares the feature vector time series with pre-registered standard patterns for a given end frame, obtains the standard pattern giving the maximum similarity as a recognition result, and obtains the start frame corresponding to the maximum similarity; a unit standard pattern storage unit that holds standard patterns of recognition units; a reference similarity calculation unit that obtains, as a reference similarity, the maximum value of the similarity between standard patterns formed by combining the recognition-unit standard patterns in a predetermined order and the feature vector time series from the start frame to the end frame; and a similarity correction unit that obtains a corrected similarity by correcting the similarity obtained by the comparison/collation unit using the reference similarity.

3. The speech recognition device according to claim 1, further comprising a reject unit that outputs a reject signal when the corrected similarity obtained by said similarity correction unit is smaller than a predetermined threshold.

4. The speech recognition device according to claim 1, further comprising a speech detection unit that detects the input as speech when the corrected similarity obtained by the similarity correction unit is larger than a predetermined threshold.

5. A speech recognition device comprising: an analysis unit that converts an input speech signal into a feature vector time series; a standard pattern storage unit that stores standard patterns to be recognized; a unit standard pattern storage unit that holds standard patterns of recognition units; a reference pattern generation unit that combines the recognition-unit standard patterns in a predetermined order to form a reference pattern; a recognition pattern generation unit that combines the reference pattern with the standard pattern to be recognized to form a recognition pattern; a comparison/collation unit that compares the feature vector time series with the recognition pattern for a given end frame and obtains, as a recognition result, the standard pattern constituting the recognition pattern that gives the maximum similarity; a reference similarity calculation unit that obtains, as a reference similarity, the maximum value of the similarity when the reference pattern and the feature vector time series are compared for a given end frame; a similarity correction unit that obtains a corrected similarity by correcting the maximum similarity obtained by the comparison/collation unit using the reference similarity; and a speech detection unit that detects the input as speech when the corrected similarity obtained by the similarity correction unit is larger than a predetermined threshold.