JPH0744188A

JPH0744188A - Speech recognition device

Info

Publication number: JPH0744188A
Application number: JP5190089A
Authority: JP
Inventors: Shinji Koga; 真二古賀
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1993-07-30
Filing date: 1993-07-30
Publication date: 1995-02-14
Anticipated expiration: 2014-11-22
Also published as: JP2979912B2

Abstract

PURPOSE: To detect similar words with high precision by artificially generating an input pattern from a standard model, performing recognition on that input pattern, and taking the resulting similarity as the similarity between words. CONSTITUTION: In the similar word detection mode, detection/recognition changeover switch 19 is connected to a pseudo input pattern creation unit 13, and detection/recognition changeover switch 20 is connected to a similar word determination unit 15. The pseudo input pattern creation unit 13 reads the phoneme sequence of a word from a word dictionary unit 12 and reads the mean vectors of the standard models for the respective phonemes from a standard model storage unit 11 to generate a pseudo input pattern. The pattern is input to a recognition unit 14, which finds its similarity to each word in the word dictionary unit using the standard models in the standard model storage unit 11. When the similarity exceeds a preset threshold, the similar word determination unit 15 outputs the words and the similarity as information on similar words and informs the user.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device suited to fields that require high recognition performance.

[0002]

[Prior Art] Conventional speech recognition generally uses standard models created from training data uttered in advance: the similarity between each model and the speech pattern obtained from unknown speech is computed, and the category giving the maximum similarity is taken as the recognition result. When the categories are words and the vocabulary contains words that sound alike, such as "Chiba" (千葉, chiba) and "Shiga" (志賀, shiga), these words are easily misrecognized when they are spoken at recognition time. The problem is especially pronounced in large-vocabulary speech recognition, which covers many words. Large-vocabulary methods generally use a recognition unit smaller than a word, such as a phoneme. In what follows, "phoneme" means not only the smallest basic unit of speech in the phonological sense but a broader class of speech units that also includes syllables and concatenations of several phonemes. One method that uses phonemes as the recognition unit is described in the paper by Watanabe, Yoshida, Koga et al., "Large-vocabulary speech recognition using HMMs with semi-syllable units," Transactions of the IEICE D-II, Vol. J72-D-II, No. 8, pp. 1264-1269, August 1989 (hereinafter, Document 1). In this method, standard models are created per semi-syllable (a kind of phoneme; hereinafter simply called a phoneme) from multiple training utterances spoken word by word. At recognition time, a word dictionary written in phoneme notation is used to concatenate the standard models into word-level models, and unknown word speech is recognized with these word models.
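The two conventional steps described above, concatenating phoneme models into a word model and picking the category with the maximum similarity, can be sketched as follows. This is only an illustration: the phoneme inventory, the state labels, and the function names `build_word_model` and `recognize` are hypothetical and do not come from Document 1.

```python
# Hypothetical phoneme-level "models": each phoneme maps to a list of state labels.
PHONEME_MODELS = {
    "chi": ["chi_1", "chi_2"],
    "shi": ["shi_1", "shi_2"],
    "ba": ["ba_1", "ba_2"],
    "ga": ["ga_1", "ga_2"],
}

# Hypothetical word dictionary written in phoneme notation.
WORD_DICTIONARY = {"chiba": ["chi", "ba"], "shiga": ["shi", "ga"]}


def build_word_model(word):
    """Concatenate the phoneme models listed for `word` into one word model."""
    states = []
    for phoneme in WORD_DICTIONARY[word]:
        states.extend(PHONEME_MODELS[phoneme])
    return states


def recognize(similarities):
    """Return the category (word) whose similarity is the largest."""
    return max(similarities, key=similarities.get)
```

With similarities such as `{"chiba": -10.0, "shiga": -12.5}`, `recognize` returns `"chiba"`; when two words score almost equally, a small change in the input flips the result, which is the misrecognition risk the paragraph describes.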

[0003] Besides improving the accuracy of the standard models or the recognition method, one way to deal with misrecognition caused by similar words is to compute the similarity between the words of the vocabulary before recognition, detect highly similar word combinations, and report them to the user, who can then exclude some or all of those words from the vocabulary or replace them with other words. An example of such a similar-word detection method is described in Japanese Examined Patent Publication No. 4-62595 (hereinafter, Document 2). Document 2 prepares two tables, one defining distances between vowels and one defining distances between consonants. For every pair of words in the vocabulary, it aligns the words syllable by syllable, looks up the distances between corresponding syllables in the two tables, and uses those distances to check the similarity between the words.
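A rough sketch of a table-based check of the kind attributed to Document 2 is given below. The table values, the position-by-position alignment, and the summing of consonant and vowel distances are all assumptions made for illustration; Document 2's actual tables and combination rule are not given in this text.

```python
# Illustrative distance tables; the values are made up for this sketch.
VOWEL_DIST = {("a", "a"): 0.0, ("i", "i"): 0.0, ("a", "i"): 1.0}
CONS_DIST = {
    ("ch", "ch"): 0.0, ("sh", "sh"): 0.0, ("b", "b"): 0.0, ("g", "g"): 0.0,
    ("ch", "sh"): 0.3, ("b", "g"): 0.4,
}


def lookup(table, x, y):
    """Symmetric table lookup: try (x, y), then (y, x)."""
    return table[(x, y)] if (x, y) in table else table[(y, x)]


def word_distance(word_a, word_b):
    """Sum consonant and vowel distances over syllables paired by position."""
    if len(word_a) != len(word_b):
        raise ValueError("this sketch only aligns words with equal syllable counts")
    total = 0.0
    for (c1, v1), (c2, v2) in zip(word_a, word_b):
        total += lookup(CONS_DIST, c1, c2) + lookup(VOWEL_DIST, v1, v2)
    return total


# Each word is a list of (consonant, vowel) syllables.
chiba = [("ch", "i"), ("b", "a")]
shiga = [("sh", "i"), ("g", "a")]
```

Note that nothing in this computation depends on the recognizer's standard models, which is the limitation the next section raises.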

[0004]

[Problems to Be Solved by the Invention] The method of Document 2 checks the similarity between words without reference to the recognition unit or the standard models, so it is likely to detect similar words whose tendencies differ from the misrecognitions that actually occur at recognition time. It also cannot handle words that become similar through pronunciation variants such as the lengthening of vowel sequences into long vowels or the devoicing of vowels.

[0005] The object of the present invention is to detect similar words with high accuracy in speech recognition that uses phonemes as the recognition unit, by creating a pseudo input pattern from the standard models, performing recognition on that pattern, and taking the resulting similarity as the similarity between words.

[0006]

[Means for Solving the Problems] The speech recognition device of the present invention comprises: a feature analysis unit that analyzes a speech signal and outputs a feature vector time series; a standard model storage unit that stores standard models created in advance for each phoneme, each expressed as a network of states in which the appearance probability of a feature vector is defined in the form of an arbitrary number of probability distributions; a word dictionary unit that stores the phoneme information of the words to be recognized; a detection/recognition changeover switch for switching between the similar word detection mode and the recognition mode; a pseudo input pattern creation unit that, in the similar word detection mode, creates a pseudo input pattern from the phoneme information of any word in the word dictionary unit and the standard models; a recognition unit that uses the standard models and the phoneme information stored in the word dictionary unit to perform recognition on the pseudo input pattern in the similar word detection mode, or on the feature vector time series output by the feature analysis unit in the recognition mode, and obtains the similarity to each word to be recognized; a similar word determination unit that, from the similarities output by the recognition unit, outputs highly similar words among the words to be recognized as similar words; a recognition result determination unit that, from the similarities output by the recognition unit, examines the similarity between the speech signal and the words to be recognized, makes the recognition decision, and outputs the recognition result; and a result display unit that displays the similar words in the similar word detection mode and the recognition result in the recognition mode. In a further aspect, transition probabilities are defined for the transitions between states in the standard models stored in the standard model storage unit, and the pseudo input pattern creation unit creates the pseudo input pattern taking those transition probabilities into account.

[0007]

[Embodiments] The present invention will now be described with reference to the drawings.

[0008] FIG. 1 is a block diagram showing a first embodiment of the present invention. The standard model storage unit 11 stores in advance the phoneme-level standard models P_k (k = 1, 2, ..., K, where K is the number of phoneme types). The HMM described in Document 1 can be used as the standard model P_k, and it can be created by the training method described in Document 1. An HMM is a kind of state transition network in which each state i (i = 1, 2, ..., I^k, where I^k is the number of states of the standard model P_k) has a state transition probability

[0009]

[Equation 1]

[0010] and a feature vector appearance probability defined. As the feature vector appearance probability, a Gaussian distribution

[0011]

[Equation 2]

[0012] can be used. The word dictionary unit 12 stores in advance the phoneme information of the words W_m (m = 1, 2, ..., M, where M is the vocabulary size) to be recognized. As the phoneme information, the phoneme string of the word W_m

[0013]

[Equation 3]

[0014] can be used.
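One way to picture the contents of the standard model storage unit 11 and the word dictionary unit 12 is the following sketch. The class names, the use of a single self-loop probability per state, and the diagonal variance are assumptions for illustration; the text only states that each state carries a state transition probability and a feature vector appearance probability such as a Gaussian.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class State:
    self_loop_prob: float   # assumed form of the state transition probability
    mean: List[float]       # mean vector of the Gaussian
    variance: List[float]   # diagonal variance (an assumption of this sketch)


@dataclass
class PhonemeModel:
    states: List[State]     # states i = 1 .. I^k of standard model P_k


def missing_phonemes(word_dictionary, standard_models):
    """List phonemes used in the dictionary that have no standard model."""
    used = {p for phonemes in word_dictionary.values() for p in phonemes}
    return sorted(used - set(standard_models))


# Hypothetical contents: two phoneme models and a one-word dictionary.
standard_models: Dict[str, PhonemeModel] = {
    "chi": PhonemeModel([State(0.6, [1.0, 0.0], [1.0, 1.0])]),
    "ba": PhonemeModel([State(0.5, [0.0, 1.0], [1.0, 1.0])]),
}
word_dictionary: Dict[str, List[str]] = {"chiba": ["chi", "ba"]}
```

A consistency check such as `missing_phonemes` is not part of the described device; it is included only because every phoneme in the dictionary must have a model before word models can be assembled.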

[0015] Next, the operation of this embodiment is described with reference to FIGS. 1 and 2.

[0016] In the similar word detection mode, detection/recognition changeover switch 19 is connected to the pseudo input pattern creation unit 13, and detection/recognition changeover switch 20 is connected to the similar word determination unit 15.

[0017] The pseudo input pattern creation unit 13 reads from the word dictionary unit 12 the phoneme string for the word W_m

[0018]

[Equation 4]

[0019] reads the mean vectors of the standard models corresponding to the individual phonemes from the standard model storage unit 11, and creates a pseudo input pattern (step A1). If the standard model corresponding to the phoneme

[0020]

[Equation 5]

[0021] is

[0022]

[Equation 6]

[0023] and the mean vectors it contains are

[0024]

[Equation 7]

[0025] then, as the pseudo input pattern V_m for the word W_m, the pattern

[0026]

[Equation 8]

[0027] in which the mean vectors are arranged in a row can be used. Each mean vector may be placed not just once but several times in succession.
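Step A1 can be sketched as follows, under the assumption that each phoneme model is reduced to the ordered list of its state mean vectors. The function name `make_pseudo_input_pattern`, the `repeat` parameter, and the sample vectors are hypothetical.

```python
def make_pseudo_input_pattern(phonemes, mean_vectors, repeat=1):
    """Arrange the state mean vectors of each phoneme in a row.

    phonemes     -- phoneme string of the word W_m
    mean_vectors -- maps a phoneme to the list of its state mean vectors
    repeat       -- how many times each mean vector is placed in succession
    """
    pattern = []
    for phoneme in phonemes:
        for mean in mean_vectors[phoneme]:
            # Copy the vector so later edits to the pattern cannot alter the model.
            pattern.extend(list(mean) for _ in range(repeat))
    return pattern


# Hypothetical two-dimensional mean vectors.
mean_vectors = {
    "chi": [[1.0, 0.0], [0.8, 0.2]],
    "ba": [[0.0, 1.0]],
}
pattern = make_pseudo_input_pattern(["chi", "ba"], mean_vectors, repeat=2)
```

Here `pattern` holds six vectors: each of the three state means appears twice, in phoneme order.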

[0028] The pseudo input pattern V_m so created is input to the recognition unit 14, which uses the standard models in the standard model storage unit 11 to compute the similarity R_{mn} to each word W_n (n = 1, 2, ..., M) in the word dictionary unit (step A2). The recognition method described in Document 1 can be used to calculate the similarity. When the similarity R_{mn} exceeds a preset threshold, the similar word determination unit 15 outputs the words W_m and W_n and the similarity R_{mn} as information about similar words, and the result display unit 16 displays this information to inform the user (steps A3 and A4). On the basis of that information, the user can improve recognition performance by excluding one or both of the similar words from the vocabulary or by replacing them with other words.
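Steps A2 and A3 can be illustrated as below. The recognition method of Document 1 is not reproduced in this text, so the sketch substitutes a plain Viterbi log-likelihood over a left-to-right HMM with diagonal Gaussians; that choice, the `(self_loop_prob, mean, variance)` state layout, and all names are assumptions.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def viterbi_log_likelihood(pattern, states):
    """Best-path log-likelihood; the path starts in the first state and ends in the last.

    states -- list of (self_loop_prob, mean, var), with 0 < self_loop_prob < 1
    """
    neg_inf = float("-inf")
    score = [neg_inf] * len(states)
    score[0] = log_gauss(pattern[0], states[0][1], states[0][2])
    for x in pattern[1:]:
        new = [neg_inf] * len(states)
        for j, (a_jj, mean, var) in enumerate(states):
            stay = score[j] + math.log(a_jj)
            move = score[j - 1] + math.log(1.0 - states[j - 1][0]) if j > 0 else neg_inf
            best = max(stay, move)
            if best > neg_inf:
                new[j] = best + log_gauss(x, mean, var)
        score = new
    return score[-1]


def similar_words(pattern, word_models, threshold):
    """Return (word, R) pairs whose similarity R exceeds the threshold."""
    scored = {w: viterbi_log_likelihood(pattern, s) for w, s in word_models.items()}
    return [(w, r) for w, r in scored.items() if r > threshold]
```

Because the pattern is scored by the same models that are used at recognition time, words that the recognizer itself would confuse receive high similarities, which is the point of the scheme.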

[0029] When the word W_n is the last word W_M in the word dictionary unit (step A5), that is, when the similarity check for the word W_m has been run against every word in the word dictionary unit, the similarity check is carried out for the word W_{m+1} in the same way through steps A1 to A5. When the word W_m is the last word W_M in the word dictionary unit (step A6), that is, when the similarity check has been completed for every word in the word dictionary unit, processing ends.
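The control flow of steps A1 to A6 amounts to a double loop over the vocabulary, which might be sketched as follows. The pattern builder and the similarity function are passed in as parameters; the toy versions used here (the word itself as its "pattern", and the fraction of matching letters as the "similarity") are stand-ins for illustration only. Skipping the comparison of a word with itself is also an assumption, since the text does not say how that case is handled.

```python
def detect_similar_words(words, make_pattern, similarity, threshold):
    """Run the similar word detection loop over the whole vocabulary."""
    results = []
    for w_m in words:                              # outer loop, ends at step A6
        pattern = make_pattern(w_m)                # step A1
        for w_n in words:                          # inner loop, ends at step A5
            if w_n == w_m:
                continue                           # assumption: skip self-comparison
            r_mn = similarity(pattern, w_n)        # step A2
            if r_mn > threshold:                   # step A3
                results.append((w_m, w_n, r_mn))   # step A4
    return results


def toy_similarity(pattern, word):
    """Toy stand-in: fraction of positions at which the letters agree."""
    matches = sum(1 for a, b in zip(pattern, word) if a == b)
    return matches / max(len(pattern), len(word))


found = detect_similar_words(
    ["chiba", "shiga", "osaka"], lambda w: w, toy_similarity, threshold=0.5
)
```

The loop visits every ordered pair, so a similar pair is reported once in each direction; with real models the two directions need not give the same similarity.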

[0030] In the recognition mode, detection/recognition changeover switch 19 is connected to the feature analysis unit 17, and detection/recognition changeover switch 20 is connected to the recognition result determination unit 18.

[0031] In the feature analysis unit 17, the unknown speech signal is converted into a feature vector time series using a mel-cepstrum method such as that described in Furui, "Digital Speech Processing," Tokai University Press, 1985. In the recognition unit 14, just as for a pseudo input pattern, the similarity R_{mn} between this feature vector time series and each word W_n (n = 1, 2, ..., M) in the word dictionary unit is computed. The recognition result determination unit 18 selects an arbitrary number of the similarities R_{mn} in descending order and outputs those values and the corresponding words as the recognition result, and the result display unit 16 displays this information.
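The selection performed by the recognition result determination unit 18 can be sketched with the standard library as follows. The function name `top_candidates` and the sample scores are hypothetical.

```python
import heapq


def top_candidates(similarities, n_best):
    """Return the n_best (word, similarity) pairs in descending order of similarity."""
    return heapq.nlargest(n_best, similarities.items(), key=lambda kv: kv[1])


# Hypothetical similarities of one unknown utterance to three vocabulary words.
scores = {"chiba": -10.0, "shiga": -12.5, "osaka": -40.0}
best_two = top_candidates(scores, 2)
```

Returning several candidates rather than only the best one lets a later stage or the user choose among close alternatives.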

[0032] FIG. 3 is a block diagram showing a second embodiment of the present invention.

[0033] Referring to FIG. 3, the second embodiment of the present invention differs from the first embodiment shown in FIG. 1 in that a vector number determination unit 21 is added between the standard model storage unit 11 and the pseudo input pattern creation unit 13.

[0034] The operation of this embodiment differs from that of the first embodiment shown in FIG. 2 only in how the pseudo input pattern is created in step A1; the other operations are identical. In the first embodiment, the pseudo input pattern is created by arranging a fixed number of mean vectors, and the state transition probabilities in the standard models

[0035]

[Equation 9]

[0036] are not used. In this embodiment, the vector number determination unit 21 reads from the standard model storage unit 11 the state transition probabilities of the standard models

[0037]

[Equation 10]

[0038] and uses their values to determine the number of mean vectors to arrange

[0039]

[Equation 11]

[0040] The number

[0041]

[Equation 12]

[0042] can be obtained with the following equation.

[0043]

[Equation 13]

[0044] The pseudo input pattern creation unit 13 creates the pseudo input pattern from the numbers so obtained, the phoneme string in the word dictionary unit 12, and the mean vectors of the standard models in the standard model storage unit 11.
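The equation used by the vector number determination unit 21 is not reproduced in this text, so the following is only one plausible choice and should not be read as the patent's formula. It assumes each state has a self-loop probability a and places as many copies of the mean vector as the expected state duration 1 / (1 - a), rounded and clamped to a hypothetical upper bound.

```python
def vectors_per_state(self_loop_prob, max_count=10):
    """Assumed rule: expected state duration 1 / (1 - a), rounded, within [1, max_count]."""
    if not 0.0 <= self_loop_prob < 1.0:
        raise ValueError("self-loop probability must be in [0, 1)")
    expected_duration = 1.0 / (1.0 - self_loop_prob)
    return max(1, min(max_count, round(expected_duration)))


def build_pattern(states):
    """states -- list of (self_loop_prob, mean_vector) pairs in temporal order."""
    pattern = []
    for self_loop_prob, mean in states:
        count = vectors_per_state(self_loop_prob)
        pattern.extend(list(mean) for _ in range(count))
    return pattern


# Hypothetical states: the second one tends to last longer than the first.
pattern = build_pattern([(0.5, [0.0]), (0.75, [1.0])])
```

Under this assumed rule the first state contributes two vectors and the second four, so states in which the model tends to dwell take up a correspondingly longer stretch of the pseudo input pattern.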

[0045] Besides the second embodiment, there is another way to use the state transition probabilities when creating the pseudo input pattern. When the phoneme information of a word W_m in the word dictionary unit 12 is a phoneme string with branches as in FIG. 4, for example to represent a word whose final vowel may be devoiced, the phonemes on the branch with the larger state transition probability can be used to create the pseudo input pattern.
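Selecting the more probable branch can be sketched as follows. The network representation (a list whose items are either a phoneme name or a list of `(transition_prob, phonemes)` branches), the phoneme names, and the probabilities are all hypothetical; FIG. 4 is not reproduced in this text.

```python
def choose_branch(branches):
    """Return the phoneme list of the branch with the largest transition probability."""
    return max(branches, key=lambda branch: branch[0])[1]


def linearize(network):
    """Turn a branching phoneme network into one phoneme string.

    network -- list whose items are either a phoneme name (str) or a list of
               (transition_prob, phoneme_list) branches
    """
    phonemes = []
    for item in network:
        if isinstance(item, str):
            phonemes.append(item)
        else:
            phonemes.extend(choose_branch(item))
    return phonemes


# Hypothetical word whose final vowel is devoiced more often than not.
network = ["de", [(0.7, ["su_devoiced"]), (0.3, ["su"])]]
phoneme_string = linearize(network)
```

The resulting linear phoneme string can then be fed to the same pattern creation step used for words without branches.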

[0046]

[Effects of the Invention] As described above, the similar word detection scheme of the present invention performs recognition on an input pattern created artificially from the standard models and takes the resulting similarity as the similarity between words, and therefore has the effect of detecting similar words with high accuracy.

[Brief description of drawings]

FIG. 1 is a block diagram showing a first embodiment of the present invention.

FIG. 2 is a flowchart of the similar word detection mode in an embodiment of the present invention.

FIG. 3 is a block diagram showing a second embodiment of the present invention.

FIG. 4 is a diagram showing an example of the phoneme information of a word in the word dictionary unit of the present invention.

[Explanation of symbols]

11: standard model storage unit
12: word dictionary unit
13: pseudo input pattern creation unit
14: recognition unit
15: similar word determination unit
16: result display unit
17: feature analysis unit
18: recognition result determination unit
19, 20: detection/recognition changeover switches
21: vector number determination unit

Claims

[Claims]

1. A speech recognition device comprising: a feature analysis unit that analyzes a speech signal and outputs a feature vector time series; a standard model storage unit that stores standard models created in advance for each phoneme, each expressed as a network of states in which the appearance probability of a feature vector is defined in the form of an arbitrary number of probability distributions; a word dictionary unit that stores the phoneme information of the words to be recognized; a detection/recognition changeover switch for switching between a similar word detection mode and a recognition mode; a pseudo input pattern creation unit that, in the similar word detection mode, creates a pseudo input pattern from the phoneme information of any word in the word dictionary unit and the standard models; a recognition unit that uses the standard models and the phoneme information stored in the word dictionary unit to perform recognition on the pseudo input pattern in the similar word detection mode, and on the feature vector time series output by the feature analysis unit in the recognition mode, and obtains the similarity to each word to be recognized; a similar word determination unit that, from the similarities output by the recognition unit, outputs highly similar words among the words to be recognized as similar words; a recognition result determination unit that, from the similarities output by the recognition unit, examines the similarity between the speech signal and the words to be recognized, makes a recognition decision, and outputs a recognition result; and a result display unit that displays the similar words in the similar word detection mode and the recognition result in the recognition mode.

2. The speech recognition device according to claim 1, wherein transition probabilities are defined for the transitions between states in the standard models stored in the standard model storage unit, and the pseudo input pattern creation unit creates the pseudo input pattern taking the transition probabilities into account.