JPH04324499A

JPH04324499A - Speech recognition device

Info

Publication number: JPH04324499A
Application number: JP3094422A
Authority: JP
Inventors: Satoru Nakamura; 哲中村; Toshio Akaha; 俊夫赤羽
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1991-04-24
Filing date: 1991-04-24
Publication date: 1992-11-13

Abstract

PURPOSE: To provide a speech recognition device that can recognize the voice of an unknown user by selecting an appropriate standard speaker cluster. CONSTITUTION: The speech recognition device comprises an acoustic analysis unit 13 that acoustically analyzes the input voice to extract its features; a speaker cluster selection unit 14 that selects the standard speaker cluster corresponding to the analyzed voice according to the similarity obtained from speaker models; a pattern matching unit 18 that performs pattern matching against the speech standard patterns of the selected standard speaker cluster; and a recognition result determination unit 19 that determines the recognition result from the pattern matching and outputs it. The speaker cluster selection unit 14 determines the number of clusters according to the mutual information coefficient (MIC) of the selected standard speaker clusters.

Description

[Detailed description of the invention]

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device that recognizes the speech of unspecified speakers using clustered standard patterns.

[0002]

[Prior Art] Conventional speech recognition devices have recognized the speech of unspecified speakers by creating standard patterns from speech data uttered by a large number of speakers. Two kinds of standard pattern have been tried: an average pattern computed from the patterns of unspecified speakers, and the use of all the patterns as multiple templates. Methods that cluster the standard patterns of a large number of speakers have also been tried.

[0003]

[Problem to be Solved by the Invention] The conventional speech recognition device that averages the standard patterns of all speakers, and the conventional device that uses all the standard patterns as multiple templates, both have the problem that they cannot avoid recognition errors caused by different phonemes of different speakers overlapping one another.

[0004] Furthermore, the conventional speech recognition device that clusters the standard patterns has the problem that the clustering itself depends on the recognition vocabulary, so the optimal number of clusters cannot be determined.

[0005] In view of the above problems of conventional speech recognition devices, the present invention provides a speech recognition device that can determine the optimal number of clusters of standard patterns without overlapping different phonemes of different speakers and without depending on the recognition vocabulary.

[0006]

[Means for Solving the Problems] The object of the present invention is achieved by a speech recognition device comprising: acoustic analysis means for acoustically analyzing input speech to extract speech features; selection means for selecting the standard speaker cluster corresponding to the acoustically analyzed speech based on the similarity obtained from speaker models; pattern matching means for performing pattern matching based on the speech standard patterns of the selected standard speaker cluster; and recognition result determination means for determining the recognition result of the pattern matching and outputting the determination result, wherein the selection means is configured to determine the number of clusters based on the mutual information coefficient of the selected standard speaker clusters.

[0007]

[Operation] The acoustic analysis means acoustically analyzes the input speech to extract speech features. The selection means selects the standard speaker cluster corresponding to the acoustically analyzed speech based on the similarity obtained from the speaker models, and determines the number of clusters based on the mutual information coefficient of the selected standard speaker clusters. The pattern matching means performs pattern matching based on the speech standard patterns of the selected standard speaker cluster, and the recognition result determination means determines the recognition result of the pattern matching and outputs the determination result.

[0008]

[Embodiment] An embodiment of the speech recognition device of the present invention is described below with reference to the drawings.

[0009] FIG. 1 is a flowchart for explaining the operation of the speech recognition device of the present invention. FIG. 2 shows the configuration of an embodiment of the speech recognition device of the present invention that operates according to the flowchart of FIG. 1.

[0010] First, the configuration of the speech recognition device of FIG. 2 is described.

[0011] The speech recognition device of FIG. 2 comprises a microphone 10, an amplifier/filter 11, an analog/digital (A/D) conversion unit 12, an acoustic analysis unit 13, a speaker cluster selection unit 14, speech standard patterns 15 to 17, a pattern matching unit 18, and a recognition result determination unit 19.

[0012] Next, the operation of each of these components is described.

[0013] The microphone 10 receives the voice of an unknown user, and the amplifier/filter 11 amplifies the speech input from the microphone 10. The A/D conversion unit 12 converts the analog speech signal amplified by the amplifier/filter 11 into a digital speech signal.

[0014] The acoustic analysis unit 13 extracts the features of the speech signal digitized by the A/D conversion unit 12.

[0015] The speaker cluster selection unit 14 uses the similarity obtained from the speaker models to judge which standard speaker cluster the user's voice belongs to, and selects, from the speech standard patterns 15 to 17, those of the corresponding standard speaker cluster.
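The selection performed by the speaker cluster selection unit 14 could be sketched as follows. This is only an illustration under assumptions that the specification does not state: each standard speaker cluster is represented by a vector-quantization codebook (vector quantization being one option the description names), and "similarity" is taken to mean low average quantization distortion. The function names `distortion` and `select_cluster`, the cluster names, and all numeric values are hypothetical.

```python
def distortion(frames, codebook):
    """Average squared Euclidean distance from each feature frame
    to its nearest code vector (lower means more similar)."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return sum(min(sqdist(f, c) for c in codebook) for f in frames) / len(frames)


def select_cluster(frames, codebooks):
    """Return the name of the codebook with the lowest distortion,
    i.e. the standard speaker cluster judged most similar to the input."""
    return min(codebooks, key=lambda name: distortion(frames, codebooks[name]))


# Hypothetical two-dimensional features and codebooks, for illustration only.
codebooks = {
    "cluster_a": [(0.0, 0.0), (1.0, 1.0)],
    "cluster_b": [(5.0, 5.0), (6.0, 6.0)],
}
frames = [(0.9, 1.1), (0.1, -0.2)]
selected = select_cluster(frames, codebooks)
```

The speech standard patterns of the selected cluster would then be passed on to pattern matching.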

[0016] The pattern matching unit 18 performs pattern matching using the speech standard patterns selected by the speaker cluster selection unit 14. The recognition result determination unit 19 uses the resulting distances to the standard patterns to judge whether the selected speech standard patterns are appropriate, and outputs the recognition result.

[0017] The features extracted by the acoustic analysis unit 13 may be obtained by Fourier spectrum analysis or by band-pass filters; any features that capture the characteristics of speech effectively may be used. The text-independent speaker model used in the speaker cluster selection unit 14 may be any model for which a similarity between speakers can be defined, for example a method based on vector quantization. The pattern matching method used in the pattern matching unit 18 may be dynamic programming (DP matching), a hidden Markov model, or a method based on a neural network.
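Of the pattern matching methods named above, DP matching is the simplest to illustrate. The following is a minimal sketch, not the implementation of the specification: it assumes a Euclidean local distance, the basic three-way step pattern, and word-unit templates. The names `dtw_distance` and `recognize` and the toy one-dimensional templates are hypothetical.

```python
def dtw_distance(a, b):
    """DP-matching (dynamic time warping) distance between two
    sequences of feature vectors, using Euclidean local distance."""
    def local(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v)) ** 0.5

    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] holds the best accumulated cost aligning a[:i] with b[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = local(a[i - 1], b[j - 1]) + min(
                d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]
            )
    return d[n][m]


def recognize(features, templates):
    """Return the label of the standard pattern with the smallest distance."""
    return min(templates, key=lambda w: dtw_distance(features, templates[w]))


# Hypothetical standard patterns of one speaker cluster, for illustration only.
templates = {
    "yes": [(0,), (1,), (2,)],
    "no": [(5,), (5,), (4,)],
}
features = [(0,), (0,), (1,), (2,)]
result = recognize(features, templates)
```

The distance returned for the winning template is the kind of value the recognition result determination unit 19 could compare against a threshold when judging whether the selected standard patterns are appropriate.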

[0018] When word recognition is performed, the standard patterns for the recognition vocabulary are stored in word units or as combinations of smaller units. When continuous speech recognition is performed, the standard patterns are stored in units such as syllables or phonemes.

[0019] Next, the procedure for clustering the standard speakers is described with reference to the flowchart of FIG. 1.

[0020] First, speech from a large number of standard speakers is prepared (step S1). Text-independent speaker models are then created by a method based on vector quantization, using part or all of the speech for standard pattern creation (step S2). The speech used to create the speaker models, or other speech, is input to each speaker model to obtain an inter-speaker similarity matrix (step S3). Based on this similarity matrix, two speakers or speaker clusters are merged (step S4). Step S4 is repeated until the mutual information coefficient MIC reaches its maximum, so that similar speakers are merged and clustered (step S5).
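Steps S3 to S5 can be sketched as follows. This is a minimal illustration under assumptions the specification does not state: the similarity matrix is non-negative and is normalized to sum to 1 so that it can be treated as a joint probability table, merging two clusters adds their rows and columns, each step greedily takes the merge that gives the largest coefficient, and the search runs down to two clusters and returns the grouping with the highest MIC. The names `cluster_speakers`, `merge` and `mic` are hypothetical, and at least two speakers are assumed.

```python
import math
from itertools import combinations


def entropy(probs):
    """Shannon entropy (natural log); zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mic(joint):
    """Mutual information of the table divided by log M (M = cluster count)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = entropy(px) + entropy(py) - entropy([p for row in joint for p in row])
    return mi / math.log(len(joint))


def merge(joint, a, b):
    """Merge clusters a and b (a < b): add row b to row a, column b to column a."""
    rows = [list(r) for r in joint]
    rows[a] = [u + v for u, v in zip(rows[a], rows[b])]
    del rows[b]
    for r in rows:
        r[a] += r[b]
        del r[b]
    return rows


def cluster_speakers(similarity):
    """Greedy bottom-up clustering; returns (groups, MIC) at the best MIC found."""
    total = sum(map(sum, similarity))
    joint = [[s / total for s in row] for row in similarity]
    groups = [[i] for i in range(len(similarity))]
    best_score, best_groups = mic(joint), [g[:] for g in groups]
    while len(groups) > 2:
        # Step S4: choose the pair whose merge yields the largest coefficient.
        a, b = max(combinations(range(len(groups)), 2),
                   key=lambda ab: mic(merge(joint, *ab)))
        joint = merge(joint, a, b)
        groups[a] = groups[a] + groups[b]
        del groups[b]
        score = mic(joint)
        if score > best_score:
            best_score, best_groups = score, [g[:] for g in groups]
    return best_groups, best_score


# Hypothetical similarity matrix: speakers 0 and 1 resemble each other, as do 2 and 3.
similarity = [
    [8, 6, 1, 1],
    [6, 8, 1, 1],
    [1, 1, 8, 6],
    [1, 1, 6, 8],
]
groups, score = cluster_speakers(similarity)
```

With this toy matrix the two similar pairs are merged and the coefficient is highest at two clusters. Because every candidate merge at a given step produces the same number of clusters, ranking candidates by MIC is equivalent to ranking them by mutual information.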

[0021] The mutual information coefficient MIC is now explained.

[0022] The mutual information MI is a quantity used in information theory and is defined by equation (1).

[0023]

$$\mathrm{MI}(x, y) = H(x) + H(y) - H(x, y) \tag{1}$$

Here H(x) is the entropy of the probability p(x) of the input speaker x, H(y) is the entropy of the probability p(y) of the speaker y to whom the input is assigned, and H(x, y) is the entropy of the joint probability p(x, y) of the input speaker x and the assigned speaker y.
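Equation (1) can be evaluated directly from a joint probability table. The sketch below assumes that such a table `joint[x][y]` is already available (for instance, a normalized similarity matrix, which is an assumption rather than something the specification prescribes) and uses the natural logarithm. The names `entropy` and `mutual_information` and the two example tables are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log); zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mutual_information(joint):
    """MI(x, y) = H(x) + H(y) - H(x, y) for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal over input speakers
    py = [sum(col) for col in zip(*joint)]      # marginal over assigned speakers
    pxy = [p for row in joint for p in row]     # flattened joint distribution
    return entropy(px) + entropy(py) - entropy(pxy)


# Two speakers that are never confused, and two that are indistinguishable.
separated = [[0.5, 0.0], [0.0, 0.5]]
confused = [[0.25, 0.25], [0.25, 0.25]]
mi_separated = mutual_information(separated)
mi_confused = mutual_information(confused)
```

For the fully separated table the three entropies are all log 2, so MI equals log 2; for the fully confused table the joint entropy is log 4 and MI is zero.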

[0024] The mutual information MI is computed from the inter-speaker similarity matrix. In that matrix, the lower the similarity between different speakers, that is, the more independent the speakers are, the larger MI becomes; the more biased the distribution of occurring speakers, the smaller it becomes.

[0025] Bottom-up clustering can therefore be realized by merging whichever pair of speaker clusters makes the mutual information of the inter-speaker similarity matrix large.

[0026] However, the mutual information MI tends to grow as the number of clusters increases. The value is therefore normalized by the maximum mutual information attainable for the given number of clusters, and the result is used as a measure of how well the clustering performs relative to that maximum.

[0027] The maximum mutual information $\mathrm{MI}_{\max}$ is defined in terms of the number of clusters M by equation (2).

[0028]

$$\mathrm{MI}_{\max} = \log M \tag{2}$$

The mutual information coefficient MIC is then defined as the ratio of the mutual information to the maximum mutual information $\mathrm{MI}_{\max}$, as in equation (3).

[0029]

$$\mathrm{MIC} = \frac{\mathrm{MI}(x, y)}{\mathrm{MI}_{\max}} \tag{3}$$

Using the mutual information coefficient MIC, clustering efficiency can be evaluated independently of the number of clusters.
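As a worked illustration (not part of the original specification) of why the normalization removes the dependence on the number of clusters, consider two extreme cases with M clusters, assuming that the input speaker x and the assigned speaker y each take M values:

$$
\begin{aligned}
\text{perfect separation:}\quad & p(x,y)=\tfrac{1}{M}\,\delta_{xy}
 \;\Rightarrow\; H(x)=H(y)=H(x,y)=\log M,\quad \mathrm{MI}=\log M,\quad \mathrm{MIC}=1,\\
\text{complete confusion:}\quad & p(x,y)=\tfrac{1}{M^{2}}
 \;\Rightarrow\; H(x)=H(y)=\log M,\; H(x,y)=2\log M,\quad \mathrm{MI}=0,\quad \mathrm{MIC}=0.
\end{aligned}
$$

In both cases MIC does not depend on M, whereas the attainable MI grows as log M.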

[0030]

[Effects of the Invention] The speech recognition device of the present invention comprises: acoustic analysis means for acoustically analyzing input speech to extract speech features; selection means for selecting the standard speaker cluster corresponding to the acoustically analyzed speech based on the similarity obtained from speaker models; pattern matching means for performing pattern matching based on the speech standard patterns of the selected standard speaker cluster; and recognition result determination means for determining the recognition result of the pattern matching and outputting the determination result. The selection means is configured to determine the number of clusters based on the mutual information coefficient of the selected standard speaker clusters. Consequently, the optimal speaker cluster for an unknown user can be selected using more efficiently clustered standard speakers and standard patterns, and as a result the speech of unspecified speakers can be recognized with high accuracy.

[Brief explanation of drawings]

FIG. 1 is a flowchart for explaining the operation of the speech recognition device of the present invention.

FIG. 2 is a diagram showing the configuration of an embodiment of a speech recognition device that performs the operation of FIG. 1.

[Explanation of symbols]

10 Microphone
11 Amplifier/filter
12 A/D conversion unit
13 Acoustic analysis unit
14 Speaker cluster selection unit
15 to 17 Speech standard patterns
18 Pattern matching unit
19 Recognition result determination unit

Claims

[Claims]

1. A speech recognition device comprising: acoustic analysis means for acoustically analyzing input speech to extract speech features; selection means for selecting the standard speaker cluster corresponding to the acoustically analyzed speech based on the similarity obtained from speaker models; pattern matching means for performing pattern matching based on the speech standard patterns of the selected standard speaker cluster; and recognition result determination means for determining the recognition result of the pattern matching and outputting the determination result, wherein the selection means is configured to determine the number of clusters based on the mutual information coefficient of the selected standard speaker clusters.