JP3176210B2

JP3176210B2 - Voice recognition method and voice recognition device

Info

Publication number: JP3176210B2
Application number: JP05029694A
Authority: JP
Inventors: 昭一松永; 茂樹嵯峨山
Original assignee: 株式会社エイ・ティ・アール音声翻訳通信研究所
Priority date: 1994-03-22
Filing date: 1994-03-22
Publication date: 2001-06-11
Anticipated expiration: 2016-06-11
Also published as: JPH07261785A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition method and a speech recognition device, and more particularly to a speech recognition method and a speech recognition device that use tree-structured speaker clustering.

[0002]

[Prior Art] A method for performing speech recognition in a continuous speech recognition device using a tree-structured speaker clustering algorithm, and for applying that algorithm to fast speaker adaptation, is disclosed, for example, in Kosaka et al., "Tree-structured speaker clustering for speaker adaptation," IEICE Technical Report, SP93-110, December 1993. In this conventional method, speaker adaptation proceeds from the global characteristics of a speaker to the local characteristics by tracing the clustering tree from the upper levels to the lower levels. The tree is searched using the likelihood of each model for the input speech as the criterion, and speaker adaptation is performed by selecting the model at the node with the maximum likelihood. In other words, supervised speaker adaptation (adaptation with a teacher signal) based on the uttered speech is performed. Because this method only selects branches of the clustering tree and does not modify any model parameters, it has the advantage that speaker adaptation can be performed with a small number of samples. As the phoneme model set, this method can use, for example, a hidden Markov network, which represents phoneme environments efficiently.

[0003]

[Problems to Be Solved by the Invention] However, the conventional method described above still requires samples of adaptation speech data uttered according to an utterance list, and its speech recognition rate is relatively low. An object of the present invention is to solve these problems and to provide a speech recognition method and a speech recognition device that do not require samples of adaptation speech data and that improve the speech recognition rate compared with the conventional method.

[0004]

[Means for Solving the Problems] The speech recognition method according to claim 1 of the present invention recognizes an input uttered speech sentence consisting of a character string by using a plurality of speaker models stored in advance in a speaker model storage device, and is characterized in that: the plurality of speaker models stored in advance in the speaker model storage device have their clusters hierarchically classified and represented as a tree structure, such that the speaker model in the uppermost layer is a speaker-independent model and the speaker models in the lowermost layer are models of a small number of speakers or of a specific speaker; speech recognition is performed on the input uttered speech sentence using the speaker-independent model composed of the plurality of speaker models; at least one more suitable speaker model is selected from among the speaker models in the uppermost layer, in the intermediate layers located between the uppermost and lowermost layers, and in the lowermost layer, by tracing the tree structure from the uppermost layer toward the lowermost layer on the basis of the speech recognition result and the input uttered speech sentence; and the uttered speech sentence is recognized again on the basis of the selected speaker model, and the resulting speech recognition result is output.

[0005] The speech recognition device according to claim 2 of the present invention comprises: a storage device that stores a plurality of speaker models whose clusters are hierarchically classified and represented as a tree structure, such that the speaker model in the uppermost layer is a speaker-independent model and the speaker models in the lowermost layer are models of a small number of speakers or of a specific speaker; speech recognition means for performing speech recognition on an input uttered speech sentence consisting of a character string, using the speaker-independent model composed of the plurality of speaker models stored in the storage device; and selection means for selecting at least one more suitable speaker model from among the speaker models in the uppermost layer, in the intermediate layers located between the uppermost and lowermost layers, and in the lowermost layer, by tracing the tree structure from the uppermost layer toward the lowermost layer on the basis of the speech recognition result obtained by the speech recognition means and the input uttered speech sentence. The speech recognition means recognizes the uttered speech sentence again on the basis of the speaker model selected by the selection means and outputs the resulting speech recognition result.

[0006]

[0007]

[0008]

[Operation] In the speech recognition device configured as described above, the speech recognition means first performs speech recognition on the input uttered speech sentence consisting of a character string, using the speaker-independent model composed of the plurality of speaker models stored in the storage device. Next, the selection means selects at least one more suitable speaker model from among the speaker models in the uppermost layer, in the intermediate layers located between the uppermost and lowermost layers, and in the lowermost layer, by tracing the tree structure from the uppermost layer toward the lowermost layer on the basis of the speech recognition result obtained by the speech recognition means and the input uttered speech sentence. The speech recognition means then recognizes the uttered speech sentence again on the basis of the speaker model selected by the selection means and outputs the resulting speech recognition result.

[0009]

[0010]

[Embodiments] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram of a speech recognition device according to one embodiment of the present invention, and FIG. 2 is a perspective view showing the structure of the tree-structured speaker clustering used in the speech recognition device of FIG. 1.

[0011] The speech recognition device of this embodiment performs speech recognition using the conventional tree-structured speaker clustering shown in FIG. 2. In particular, as shown in FIG. 1, it is characterized by the following sequence. First, the phoneme matching unit 4 and the phoneme-context-dependent LR parser (hereinafter, LR parser) 5 perform speech recognition by a known method on the basis of the speaker-independent phoneme model stored in the hidden Markov network memory (hereinafter, HM network memory) 11. Next, the speech recognition result data output from the LR parser 5 at this time is input to the speaker model selection unit 30 as a teacher signal. On the basis of the feature parameters of the uttered speech input from the buffer memory 3 and the speech recognition result data, the speaker model selection unit 30 selects at least one more suitable speaker model from among the plurality of speaker models in the tree-structured speaker model of hidden Markov models (hereinafter, HMMs) stored in advance in the HM network memory 11, and outputs a selection signal to the HM network memory 11. In response, the phoneme matching unit 4 performs phoneme matching, and thereby speech recognition, using the HMMs of the speaker model in the HM network memory 11 that corresponds to the selection signal.

[0012] First, the principle of tree-structured speaker clustering is described with reference to FIG. 2. FIG. 2 shows an example of speaker models arranged in a hierarchical tree structure with three levels. At level 0, the uppermost layer, there is only one speaker model, that of cluster $m^{0}_{0}(0)$, and this model is the so-called speaker-independent model. At level 1, the intermediate layer, there are two speaker models that each belong to the level-0 speaker model $m^{0}_{0}(0)$: the model $m^{0}_{0}(1)$ of speaker cluster 1 and the model $m^{0}_{1}(1)$ of speaker cluster 2. At level 2, the lowermost layer, there are the speaker models of the two speaker clusters $m^{0}_{0}(2)$ and $m^{0}_{1}(2)$ belonging to level-1 speaker cluster 1, and the speaker models of the two speaker clusters $m^{1}_{0}(2)$ and $m^{1}_{1}(2)$ belonging to level-1 speaker cluster 2. In FIG. 2, hatching marks an example of selected speaker clusters. The speaker model data of the level-2 speaker cluster $m^{0}_{1}(2)$ is contained in the speaker model data of $m^{0}_{0}(1)$, which is speaker cluster 1 at level 1, and the level-0 speaker model data contains the data of all the speaker models. In other words, as the level number increases, the speaker models are divided into speaker clusters of progressively finer classification. Models in the upper layers of the tree therefore cover the characteristics of a large, unspecified population of speakers, whereas models in the lower layers have the characteristics of a few speakers or of one specific speaker. Although the example of FIG. 2 has three levels, the present invention is not limited to this and may use any plural number of levels.
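The containment relation between layers of the tree can be illustrated with a small data structure. This is only a minimal sketch: the `ClusterNode` class, the node names, and the speaker labels `s1` to `s4` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterNode:
    """One speaker cluster in the tree (hypothetical illustration)."""
    name: str
    speakers: frozenset
    children: list = field(default_factory=list)


def covers_children(node: ClusterNode) -> bool:
    """True if every node's speaker set equals the union of its children's sets."""
    if not node.children:
        return True
    union = frozenset().union(*(c.speakers for c in node.children))
    return union == node.speakers and all(covers_children(c) for c in node.children)


# A three-layer tree shaped like the one described for FIG. 2 (labels are invented).
cluster_1 = ClusterNode("cluster 1", frozenset({"s1", "s2"}),
                        [ClusterNode("1a", frozenset({"s1"})),
                         ClusterNode("1b", frozenset({"s2"}))])
cluster_2 = ClusterNode("cluster 2", frozenset({"s3", "s4"}),
                        [ClusterNode("2a", frozenset({"s3"})),
                         ClusterNode("2b", frozenset({"s4"}))])
root = ClusterNode("speaker-independent", frozenset({"s1", "s2", "s3", "s4"}),
                   [cluster_1, cluster_2])
```

Calling `covers_children(root)` checks the property stated in the text: the top node covers all speakers, and each lower node covers a subset of its parent.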

[0013] In this way, hierarchical speaker clustering creates a tree of speaker models by successively dividing the speaker characteristics level by level. Adaptation by speaker selection is performed by searching this tree using the likelihood of each model for the input speech as the criterion. That is, speaker adaptation becomes possible by tracing the tree from the upper layers to the lower layers and selecting at least one more suitable speaker model. When the input speech resembles one of the reference speakers that make up the tree, a lower-layer model is expected to be selected. When it does not resemble any reference speaker, an upper-layer model is expected to be selected. When an upper-layer model is selected, an interpolation-like effect across the characteristics of several speakers is expected.
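The top-down search can be sketched as follows. This is a minimal sketch under stated assumptions: each node is assumed to already carry the log-likelihood of the input speech under its model, and the descent is assumed to stop as soon as no child scores higher than the current node. The `Node` class and the stopping rule are illustrative choices; the text itself only states that the tree is traced from top to bottom with likelihood as the criterion.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    loglik: float                      # log-likelihood of the input under this node's model
    children: list = field(default_factory=list)


def select_model(root: Node) -> Node:
    """Descend from the root, moving to the best child while it beats the current node."""
    current = root
    while current.children:
        best_child = max(current.children, key=lambda n: n.loglik)
        if best_child.loglik <= current.loglik:
            break                      # no child explains the input better: keep the upper model
        current = best_child
    return current
```

With this rule, an input that resembles no single reference speaker tends to stop at an upper node, while an input close to one reference speaker tends to reach a leaf, which matches the behavior the paragraph expects.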

[0014] To create the tree, a speaker-dependent phoneme model set is first created for each speaker from the data of a plurality of speakers. These model sets are then grouped by a clustering algorithm. Each resulting cluster is clustered further to create sub-clusters. This is repeated until each cluster contains a single speaker, which yields the tree. After the tree has been created, a statistical phoneme model set is trained for each cluster from the speech data of the speakers belonging to it. Because statistical models are used, the likelihood output by a model can serve as the criterion for selecting the best model. Selecting the model at the node with the maximum likelihood prevents a loss of robustness. In addition, conventional speaker adaptation by speaker model selection needs more training speakers to improve performance, whereas the method of this embodiment represents the speaker clusters as a tree and therefore also avoids an increase in the computation required for adaptation as the number of speakers grows.

[0015] This embodiment uses an HM network as the statistical phoneme model set for speech recognition. The HM network is an efficiently represented phoneme-environment-dependent model, and a single HM network contains a large number of phoneme-environment-dependent models. The HM network is built from interconnected states, each containing a Gaussian distribution, and states are shared among the individual phoneme-environment-dependent models. A robust model can therefore be created even when the amount of data available for parameter estimation is small. The HM network is generated automatically using the successive state splitting method (hereinafter, SSS). SSS simultaneously determines the topology of the HM network, determines the allophone classes, and estimates the parameters of the Gaussian distribution in each state. In this embodiment, the parameters of the HM network are output probabilities represented by Gaussian distributions and transition probabilities. At recognition time the HM network can therefore be handled in the same way as an ordinary HMM.

[0016] Next, the clustering algorithm applied at each node of the tree is described. The method used here is based on the clustering algorithm of the SPLIT method. Unlike the conventional LBG algorithm, which generally produces a number of clusters equal to a power of two, this method successively splits the cluster with the largest distortion, so any number of clusters can be created. Before clustering, a table of distances between the elements is computed in advance. This has the advantage that the initial cluster centers need not be supplied heuristically (that is, by chance or by trial and error). In the end, the only value that has to be supplied in advance is either a distance threshold or the number of clusters; once that value is given, the result is obtained fully automatically.
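The idea of splitting the highest-distortion cluster from a precomputed distance table can be sketched as below. This is not the SPLIT method itself: the distortion measure (total distance to the best medoid) and the split rule (seed with the two members that are farthest apart, then assign each member to the nearer seed) are assumptions made for illustration, since the text does not specify them. The sketch also assumes distinct elements, so that distances between different elements are positive.

```python
def medoid_distortion(members, dist):
    """Smallest total distance from one member to all members of the cluster."""
    return min(sum(dist[i][j] for j in members) for i in members)


def split_cluster(members, dist):
    """Split around the two members that are farthest apart (assumed rule)."""
    a, b = max(((i, j) for i in members for j in members if i < j),
               key=lambda p: dist[p[0]][p[1]])
    left = [m for m in members if dist[m][a] <= dist[m][b]]
    right = [m for m in members if dist[m][a] > dist[m][b]]
    return left, right


def cluster(dist, k):
    """Repeatedly split the highest-distortion cluster until k clusters exist."""
    clusters = [list(range(len(dist)))]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break                      # nothing left to split
        worst = max(candidates, key=lambda c: medoid_distortion(c, dist))
        clusters.remove(worst)
        clusters.extend(split_cluster(worst, dist))
    return clusters
```

Because one cluster is split per iteration, `k` can be any number up to the number of elements rather than only a power of two, and no initial cluster centers are needed beyond the distance table.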

[0017] A method of creating the tree of speaker clusters using the clustering method described above is now explained. The tree creation algorithm proposed here creates the clusters automatically, given only the number of clusters K at each node. The algorithm is as follows.

<Step 1> From the speech data of N speakers, create N speaker-dependent HM networks.

<Step 2> Using the clustering algorithm, cluster the N speaker-dependent HM networks to create K clusters. Then retrain the HM networks using the data of the speakers belonging to each cluster, creating the K HM networks given by Equation 1 below.

[0018]

[Equation 1]
$$M^{0}(j) = \{\, m^{0}_{0}(j), \ldots, m^{0}_{K-1}(j) \,\}, \quad j = 1, 2, \ldots, J$$

[0019] Here, j is the level number indicating the depth in the tree hierarchy, and J is the number of levels.

[0020] <Step 3> When the number of speakers s satisfying $s \in S^{l}(j)$ becomes K or fewer, terminate the clustering of cluster l. Here, $S^{l}(j)$ denotes the l-th cluster at level j.

<Step 4> For every l-th cluster other than those terminated in Step 3, cluster the speakers belonging to the l-th cluster $S^{l}(j)$ to create K sub-clusters. Then retrain the HM networks using the data of the speakers belonging to each sub-cluster, creating the K HM networks given by Equation 2 below.

[0021]

[Equation 2]
$$M^{l}(j+1) = \{\, m^{l}_{0}(j+1), \ldots, m^{l}_{K-1}(j+1) \,\}, \quad j = 1, 2, \ldots, J$$

[0022] <Step 5> Increment j by 1.

<Step 6> Return to Step 3.
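The level-by-level loop of Steps 2 to 6 can be sketched as follows. This is a minimal sketch, assuming two hypothetical callables that stand in for the real components: `cluster_fn(speakers, k)` returns up to `k` groups of speakers, and `train_fn(speakers)` returns a retrained model for a group. The dictionary layout and the `chunk` stub are illustrative only, and unlike Step 2 the sketch applies the Step 3 stopping test to the root as well.

```python
def build_tree(speakers, k, cluster_fn, train_fn):
    """Build the speaker-cluster tree one level at a time."""
    root = {"speakers": list(speakers), "model": train_fn(speakers), "children": []}
    frontier = [root]
    while frontier:                                        # one pass per level j
        next_frontier = []
        for node in frontier:
            if len(node["speakers"]) <= k:                 # Step 3: stop splitting this cluster
                continue
            for group in cluster_fn(node["speakers"], k):  # Step 4: create sub-clusters
                child = {"speakers": group, "model": train_fn(group), "children": []}
                node["children"].append(child)
                next_frontier.append(child)
        frontier = next_frontier                           # Steps 5 and 6: next level, repeat
    return root


def chunk(speakers, k):
    """Stub clustering: cut the list into k contiguous groups (illustration only)."""
    size = -(-len(speakers) // k)                          # ceiling division
    return [speakers[i:i + size] for i in range(0, len(speakers), size)]
```

For example, `build_tree([f"s{i}" for i in range(8)], 2, chunk, lambda group: "model:" + ",".join(group))` produces a root with two children of four speakers each, which are split once more into leaves of two speakers.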

[0023] Next, the principle of speaker-independent speech recognition by tree-structured speaker clustering according to the present invention is described. In this embodiment, unsupervised speaker adaptation is performed using the evaluation data of a single utterance only. The algorithm of this speaker-independent speech recognition method includes the following steps.

<Step 1> The phoneme matching unit 4 and the LR parser 5 recognize the input speech using the speaker-independent phoneme model. The speech recognition in this step is hereinafter called the first speech recognition process.

<Step 2> The phoneme sequence of the recognition result is fed back from the LR parser 5 to the speaker model selection unit 30, and the speaker model selection unit 30 performs speaker selection using as inputs the input speech used in Step 1 and this phoneme sequence.

<Step 3> The phoneme matching unit 4 and the LR parser 5 recognize the input speech again using the selected phoneme model and output the result data. The speech recognition in this step is hereinafter called the second speech recognition process.
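The control flow of the three steps can be sketched as a small function. The callables `decode` and `select_model` are hypothetical placeholders for the phoneme matching unit plus LR parser and for the speaker model selection unit; their names and signatures are assumptions made for illustration, not an interface defined by the patent.

```python
def recognize_two_pass(features, si_model, select_model, decode):
    """Two-pass decoding with unsupervised speaker selection.

    decode(features, model) -> recognition result       (hypothetical callable)
    select_model(features, result) -> speaker model     (hypothetical callable)
    """
    first_pass = decode(features, si_model)          # Step 1: speaker-independent model
    adapted = select_model(features, first_pass)     # Step 2: first result acts as the label
    return decode(features, adapted)                 # Step 3: decode again with the selection
```

The point of the design is that no separate adaptation utterance is needed: the same feature sequence is decoded twice, and the first result supplies the transcription that supervised adaptation would otherwise require.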

[0024] As described above, the final speech recognition result is determined through two speech recognition processes, the first and the second. For the method of this embodiment to raise the speech recognition rate, it must improve recognition of data that is misrecognized in the first pass. This raises an essential problem: learning has to move in the correct direction even though an incorrect recognition result is fed back. However, the speech recognition result data has already been corrected to some extent by knowledge such as the grammar, and a result that counts as an error when evaluated per phrase does not have an entirely wrong phoneme sequence. An examination of actual recognition errors shows that in many cases only the particle is wrong. For this reason, speaker adaptation is considered to be quite feasible even when misrecognized results are fed back.

[0025] Next, an SSS-LR (left-to-right rightmost) speaker-independent continuous speech recognition device that uses the speech recognition method of this embodiment is described. This device uses an efficient phoneme-environment-dependent HMM representation called the HM network, which is stored in the memory 11. The SSS operates on a probabilistic model in which the temporal evolution of the speech parameters is represented by probabilistic transitions between probabilistic stationary signal sources (states) assigned in the phoneme feature space. It refines the model step by step by repeatedly splitting individual states in the context direction or the time direction according to a likelihood-maximization criterion.

[0026] In FIG. 1, the speaker's uttered speech is input to the microphone 1, converted into a speech signal, and then input to the feature extraction unit 2. The feature extraction unit 2 A/D-converts the input speech signal and then performs, for example, LPC analysis to extract 34-dimensional feature parameters consisting of the log power, 16th-order cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients. The time series of extracted feature parameters is input to the phoneme matching unit 4 and the speaker model selection unit 30 via the buffer memory 3.
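The 34 dimensions follow from 1 + 16 + 1 + 16. The sketch below assembles one such frame vector; it assumes that the delta parameters are simple differences between consecutive frames, which is an illustrative simplification (the patent does not state how the Δ terms are computed), and the function names are hypothetical.

```python
def delta(curr, prev):
    """First-order difference between consecutive frames (assumed delta definition)."""
    return [c - p for c, p in zip(curr, prev)]


def frame_features(log_power, cepstrum, prev_log_power, prev_cepstrum):
    """Concatenate static and delta parameters into one feature vector."""
    if len(cepstrum) != 16 or len(prev_cepstrum) != 16:
        raise ValueError("expected 16 cepstral coefficients per frame")
    return ([log_power] + list(cepstrum)
            + [log_power - prev_log_power] + delta(cepstrum, prev_cepstrum))
```

The resulting layout is: index 0 for log power, indices 1 to 16 for the cepstrum, index 17 for Δ log power, and indices 18 to 33 for the Δ cepstrum.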

[0027] The HM network in the HM network memory 11, which is connected to the phoneme matching unit 4, is represented as a set of networks whose nodes are states. Each state holds the following information.

(a) State number
(b) Acceptable context class
(c) Lists of preceding states and succeeding states
(d) Parameters of the output probability density distribution
(e) Self-transition probability and transition probabilities to succeeding states
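The per-state information (a) to (e) maps naturally onto a record type. The following is a minimal sketch; the class name `HMNetState`, the field names, and the choice of containers are assumptions made for illustration and are not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class HMNetState:
    state_id: int                       # (a) state number
    context_class: dict                 # (b) acceptable context class
    predecessors: list                  # (c) preceding states
    successors: list                    # (c) succeeding states
    means: list                         # (d) output density parameters: means
    variances: list                     # (d) output density parameters: variances
    self_loop_prob: float               # (e) self-transition probability
    next_probs: dict = field(default_factory=dict)   # (e) successor id -> probability

    def transitions_sum_to_one(self, tol: float = 1e-9) -> bool:
        """Check that the outgoing transition probabilities form a distribution."""
        total = self.self_loop_prob + sum(self.next_probs.values())
        return abs(total - 1.0) <= tol
```

A state with a self-loop probability of 0.6 and a single successor reached with probability 0.4 passes the `transitions_sum_to_one` check.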

[0028] Because it must be possible to identify which speaker each distribution comes from, the HM network used in this embodiment is created by converting a given speaker-mixture HM network. The output probability density function is a Gaussian mixture with 34-dimensional diagonal covariance matrices, and each component distribution is trained on the samples of one particular speaker.
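An output density of this form, a Gaussian mixture with diagonal covariance, can be evaluated in the log domain as sketched below. The function names and the argument layout are assumptions for illustration; the mixture weights in particular are a generic parameter here, since the text does not say how the per-speaker components are weighted.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def log_mixture(x, weights, means, variances):
    """Log density of a Gaussian mixture, using the log-sum-exp trick for stability."""
    terms = [math.log(w) + log_gauss_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

With a diagonal covariance the density factorizes over dimensions, so the cost per component is linear in the 34 feature dimensions.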

[0029] In the first speech recognition process, the phoneme matching unit 4 performs phoneme matching in response to a phoneme matching request from the phoneme-context-dependent LR parser 5. The likelihood of the data within the phoneme matching interval is computed using the level-0 speaker-independent model of the uppermost layer shown in FIG. 2, and this likelihood value is returned to the LR parser 5 as the phoneme matching score. Because the model used here is equivalent to an HMM, the likelihood is computed with the forward path algorithm used for ordinary HMMs, without modification.
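A likelihood computation of the forward type can be sketched in the log domain as follows. This is a generic textbook forward recursion, not the device's implementation: the argument layout is an assumption, and the sketch sums over all states at the last frame, whereas a real phoneme model would normally require ending in a designated final state.

```python
import math


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def forward_loglik(log_init, log_trans, log_emit):
    """Forward recursion; log_init[i], log_trans[i][j], log_emit[t][j] are log probabilities."""
    n = len(log_init)
    alpha = [log_init[j] + log_emit[0][j] for j in range(n)]
    for t in range(1, len(log_emit)):
        alpha = [logsumexp([alpha[i] + log_trans[i][j] for i in range(n)]) + log_emit[t][j]
                 for j in range(n)]
    return logsumexp(alpha)
```

Here `log_emit[t][j]` would be the log output density of state `j` for frame `t`, and the returned value plays the role of the phoneme matching score.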

[0030] Meanwhile, a given context-free grammar (CFG) in the context-free grammar database memory 20 is automatically converted, in a known manner, into an LR table, which is stored in the LR table memory 13. The LR parser 5 refers to the LR table in the LR table memory 13 and processes the input phoneme prediction data from left to right without backtracking. When there is syntactic ambiguity, the stack is split and all candidates are analyzed in parallel. The LR parser 5 predicts the next phoneme from the LR table in the LR table memory 13 and outputs phoneme prediction data to the phoneme matching unit 4. In response, the phoneme matching unit 4 performs matching by referring to the information in the HM network memory 11 that corresponds to that phoneme, and returns the likelihood to the LR parser 5 as the speech recognition score. Continuous speech is recognized by concatenating phonemes one after another in this way, and the speech recognition result data is fed back to the speaker model selection unit 30. When a plurality of phonemes are predicted during continuous speech recognition, all of them are checked, and high-speed processing is achieved by beam-search pruning, which keeps only the partial trees whose partial recognition likelihood is high.
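The pruning step can be sketched as follows. This is a minimal sketch that assumes a score-based beam: a partial hypothesis survives if its log score lies within a fixed width of the best one. The text does not say whether the beam is defined by a score margin or by a fixed number of hypotheses, so both the rule and the function name are assumptions.

```python
def prune(hypotheses, beam_width):
    """Keep (hypothesis, log_score) pairs whose score is within beam_width of the best."""
    if not hypotheses:
        return []
    best = max(score for _, score in hypotheses)
    return [(h, s) for h, s in hypotheses if s >= best - beam_width]
```

A wider beam keeps more partial parses alive and costs more computation; a narrower beam is faster but risks discarding the hypothesis that would have won in the end.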

[0031] In response, the speaker model selection unit 30 selects a speaker model from the group of speaker models having the tree-structured speaker clustering structure shown in FIG. 2, on the basis of the feature parameter data input from the buffer memory 3 and the speech recognition result data of the first speech recognition process fed back from the LR parser 5. Preferably, it selects the speaker model of a lower-layer speaker cluster whose likelihood is equal to or greater than a predetermined threshold; more preferably, it selects the speaker model of the lowermost-layer speaker cluster with the maximum likelihood. It then outputs a selection signal indicating the speaker cluster of the selected speaker model to the HM network memory 11, thereby designating the speaker model to be used by the phoneme matching unit 4 (hereinafter, the designated speaker model).
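The preference for a deeper cluster whose likelihood clears a threshold can be sketched as below. This is a minimal sketch under stated assumptions: nodes are given as `(name, depth, log_likelihood)` tuples, ties in depth are broken by likelihood, and when no node reaches the threshold the best-scoring node is returned. The tuple layout, the tie-break, and the fallback are illustrative choices that the text does not specify.

```python
def select_by_threshold(nodes, threshold):
    """Prefer the deepest node whose log-likelihood is at or above the threshold."""
    eligible = [n for n in nodes if n[2] >= threshold]
    if not eligible:
        return max(nodes, key=lambda n: n[2])        # fallback: best-scoring node overall
    return max(eligible, key=lambda n: (n[1], n[2]))  # deepest first, then highest likelihood
```

Raising the threshold makes the selection more conservative, so that a lower-layer model is used only when it fits the input well.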

[0032] In the second speech recognition process, the phoneme matching unit 4 performs phoneme matching in response to a phoneme matching request from the phoneme-context-dependent LR parser 5. At this time, the LR parser 5 passes the phoneme matching interval together with phoneme context information consisting of the phoneme to be matched and the phonemes before and after it. On the basis of the received phoneme context information, the phoneme matching unit 4 computes the likelihood of the data within the phoneme matching interval using the designated speaker model, and returns this likelihood value to the LR parser 5 as the phoneme matching score. In response, the LR parser 5 refers to the LR table in the LR table memory 13, as in the first speech recognition process, and processes the input phoneme prediction data from left to right without backtracking. When there is syntactic ambiguity, the stack is split and all candidates are analyzed in parallel. The LR parser 5 predicts the next phoneme from the LR table in the LR table memory 13 and outputs phoneme prediction data to the phoneme matching unit 4. In response, the phoneme matching unit 4 performs matching by referring to the information in the HM network memory 11 for the designated speaker model that corresponds to that phoneme, and returns the likelihood to the LR parser 5 as the speech recognition score. Continuous speech is recognized by concatenating phonemes one after another in this way. As in the first speech recognition process, when a plurality of phonemes are predicted, all of them are checked, and high-speed processing is achieved by beam-search pruning, which keeps only the partial trees whose partial recognition likelihood is high. After the input speech has been processed to the end, the candidate with the maximum overall likelihood, or a predetermined number of top candidates, is output to an external device as the recognition result data of the device.

【００３３】本発明者による本実施例の音声認識装置を
用いたシミュレーション結果を次の表１に示す。The following Table 1 shows the results of a simulation performed by the inventor using the speech recognition apparatus of this embodiment.

【００３４】[0034]

【表１】 [Table 1]

【００３５】表１から明らかなように、話者によってば
らつきはあるが、いずれの話者でも木構造話者クラスタ
リングを用いた実施例の方法が、話者混合法（例えば、
小坂ほか，”話者混合ＳＳＳによる不特定話者音声認識
と話者適応”，電子情報通信学会技術報告，ＳＰ９２−
５２，１９９２年９月参照。）による不特定話者音声認
識に比較して音声認識率が高くなっている。ここでは評
価対象として文節を用いているが、文節の平均時間長は
約０．９秒であり、この程度の長さの入力音声で教師な
し話者適応の効果が出なければ、不特定話者モードでの
認識率の向上は期待できない。実験によると音声認識率
は向上しているため、１秒以下の、文節中に誤りを含む
情報をフィードバックしても話者適応の効果が出ている
と考えられる。As is clear from Table 1, although the results vary from speaker to speaker, for every speaker the method of the embodiment using tree-structured speaker clustering achieves a higher speech recognition rate than speaker-independent (unspecified speaker) speech recognition by the speaker mixture method (see, for example, Kosaka et al., "Speaker-Independent Speech Recognition and Speaker Adaptation by Speaker-Mixture SSS," IEICE Technical Report, SP92-52, September 1992). Phrases are used as the evaluation unit here, and the average phrase duration is about 0.9 seconds. If unsupervised speaker adaptation had no effect on input speech of this length, no improvement of the recognition rate in speaker-independent mode could be expected. Because the experiments show that the speech recognition rate did improve, speaker adaptation is considered to be effective even when the information fed back comes from a phrase of one second or less and contains recognition errors.

【００３６】以上説明したように、本実施例によれば、
第１の音声認識プロセスによる音声認識結果のデータを
話者モデル選択部３０にフィードバックし、当該フィー
ドバックされた音声認識結果のデータと、バッファメモ
リ３から入力される特徴パラメータとに基づいて木構造
話者クラスタリングを有するＨＭ網内の複数の話者モデ
ルからより最適な少なくとも１つの話者モデルを選択し
て、当該選択された話者モデルに基づいて第２の音声認
識プロセスが実行され、この音声認識の結果が当該装置
の最終の認識結果データとして出力される。これによっ
て、より最適な話者モデルが選択されて、音声認識が実
行されるので、当該音声認識率が大幅に増大する。そし
て、発声話者リストに従った発声は不要となる。すなわ
ち、発声する話者に依存せず、従来例に比較して高い音
声認識率で音声認識を行うことができる。言い替えれ
ば、適応用音声データのサンプルを必要とせず、従来例
の方法に比較して音声認識率を改善することができる音
声認識装置を提供することができる。As described above, according to this embodiment, the speech recognition result data from the first speech recognition process is fed back to the speaker model selection unit 30. Based on the fed-back recognition result data and the feature parameters input from the buffer memory 3, at least one more suitable speaker model is selected from the plurality of speaker models in the HM network having tree-structured speaker clustering. The second speech recognition process is then executed based on the selected speaker model, and its result is output as the final recognition result data of the device. Because speech recognition is performed with a more suitable speaker model, the speech recognition rate increases significantly. Utterances made according to an utterance speaker list also become unnecessary. That is, speech recognition can be performed at a higher recognition rate than in the conventional example, regardless of who is speaking. In other words, a speech recognition device can be provided that requires no samples of adaptation speech data and improves the speech recognition rate compared with the conventional method.
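The two-pass flow summarized above can be sketched as follows. This is only an illustration under stated assumptions, not the embodiment itself: `two_pass_recognize`, `recognize`, and `likelihood` are hypothetical names, with `recognize` standing in for the recognizer formed by the phoneme matching unit 4 and the LR parser 5, and `likelihood` for scoring the feature parameters against a speaker model given the fed-back first-pass result. For brevity the candidates are compared in a flat list, which ignores the tree structure.

```python
def two_pass_recognize(features, independent_model, speaker_models,
                       recognize, likelihood):
    """Recognize once with the speaker-independent model, pick the speaker
    model that best explains the fed-back result, then recognize again."""
    # First pass: speaker-independent recognition.
    first_result = recognize(features, independent_model)

    # Feedback: use the first-pass result as the teacher signal and select
    # the candidate model with the highest likelihood for these features.
    selected = max(
        speaker_models,
        key=lambda model: likelihood(features, first_result, model),
    )

    # Second pass: recognize the same utterance again with the selected model.
    final_result = recognize(features, selected)
    return final_result, selected
```

Because the same buffered feature parameters are used in both passes, the speaker does not need to provide a separate adaptation utterance.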

【００３７】本発明に係る音声認識方法は、少なくと
も、話者モデル記憶装置に予め格納された複数の話者モ
デルを用いて、入力された文字列からなる発声音声文を
音声認識する音声認識方法において、入力された発声音
声文に基づいて、上記複数の話者モデルからなる不特定
話者モデルを用いて音声認識し、その音声認識結果と上
記入力された発声音声文とに基づいて、上記複数の話者
モデルのうちより最適な少なくとも１つの話者モデルを
選択し、上記選択した話者モデルに基づいて音声認識
し、その音声認識結果を出力することを特徴としてい
る。そして、上記ＨＭ網メモリ１１である上記話者モデ
ル記憶装置に予め記憶された複数の話者モデルは、好ま
しくは、少なくとも、そのクラスタが階層化されて分類
されていればよい。さらには、より好ましくは、上記話
者モデル記憶装置に予め記憶された複数の話者モデル
は、そのクラスタが木構造で表現されていればよい。The speech recognition method according to the present invention recognizes an uttered speech sentence consisting of an input character string by using at least a plurality of speaker models stored in advance in a speaker model storage device. The method is characterized by recognizing the input uttered speech sentence using a speaker-independent model composed of the plurality of speaker models; selecting, based on that speech recognition result and the input uttered speech sentence, at least one more suitable speaker model from among the plurality of speaker models; performing speech recognition based on the selected speaker model; and outputting the speech recognition result. The plurality of speaker models stored in advance in the speaker model storage device, which is the HM network memory 11, preferably have at least their clusters classified hierarchically. More preferably, the clusters of the plurality of speaker models stored in advance in the speaker model storage device are represented by a tree structure.

【００３８】本実施例において、以下に示す教師なし話
者適応の方法を用いて音声認識処理を実行するように構
成してもよい。すなわち、適応用の音声の入力に対し、
一旦認識系により音声認識を行い、その結果出力される
音素系列をフィードバックし、話者適応時の教師信号と
して用いることにより、見かけ上の教師なし話者適応を
実現することができる。この場合、木構造話者クラスタ
リングによる話者適応では、木構造の枝の選択のみを行
ない、平均値や分散などのパラメータの変更は行なわな
いため、少ないデータで教師なし学習が実現することが
できる。In this embodiment, the speech recognition processing may also be configured to use the following unsupervised speaker adaptation method. The adaptation speech input is first recognized by the recognition system, and the resulting phoneme sequence is fed back and used as the teacher signal for speaker adaptation, which realizes what is in effect unsupervised speaker adaptation. In this case, speaker adaptation by tree-structured speaker clustering only selects branches of the tree and does not change parameters such as means or variances, so unsupervised learning can be achieved with a small amount of data.

【００３９】[0039]

【発明の効果】以上詳述したように本発明によれば、話
者モデル記憶装置に予め格納された複数の話者モデルを
用いて、入力された文字列からなる発声音声文を音声認
識する音声認識方法及び装置において、上記話者モデル
記憶装置に予め記憶された複数の話者モデルは、最上層
の話者モデルが不特定話者モデルであり、最下層の話者
モデルが少数又は特定話者モデルであるように、そのク
ラスタが階層化されて分類されかつ木構造で表現され、
入力された発声音声文に基づいて、上記複数の話者モデ
ルからなる不特定話者モデルを用いて音声認識し、その
音声認識結果と上記入力された発声音声文とに基づい
て、上記木構造を最上層から最下層に向かって辿ること
により、最上層と、最上層と最下層の間に位置する中間
層と、最下層とにおける複数の話者モデルのうち、より
最適な少なくとも１つの話者モデルを選択し、上記選択
した話者モデルに基づいて上記発声音声文を再び音声認
識し、その音声認識結果を出力する。従って、複数の話
者モデルの木構造を最上層から最下層に向かって辿るこ
とにより、最上層と、最上層と最下層の間に位置する中
間層と、最下層とにおける複数の話者モデルのうち、よ
り最適な話者モデルが選択されて、音声認識が実行され
るので、当該音声認識率が大幅に高くなる。そして、発
声話者リストに従った発声は不要となる。すなわち、発
声する話者に依存せず、従来例に比較して高い音声認識
率で音声認識を行うことができるという特有の効果があ
る。As described in detail above, the present invention provides a speech recognition method and apparatus for recognizing an uttered speech sentence consisting of an input character string by using a plurality of speaker models stored in advance in a speaker model storage device. The clusters of the stored speaker models are classified hierarchically and represented by a tree structure such that the speaker model in the top layer is a speaker-independent model and the speaker models in the bottom layer are few-speaker or specific-speaker models. The input uttered speech sentence is recognized using the speaker-independent model composed of the plurality of speaker models. Based on that speech recognition result and the input uttered speech sentence, the tree structure is traced from the top layer toward the bottom layer to select at least one more suitable speaker model from among the speaker models in the top layer, in the intermediate layer located between the top layer and the bottom layer, and in the bottom layer. The uttered speech sentence is then recognized again based on the selected speaker model, and the speech recognition result is output. Because speech recognition is performed with a more suitable speaker model selected by tracing the tree structure in this way, the speech recognition rate becomes significantly higher. Utterances made according to an utterance speaker list also become unnecessary. That is, the invention has the particular effect that speech recognition can be performed at a higher recognition rate than in the conventional example, regardless of who is speaking.
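Tracing the tree from the top layer toward the bottom layer can be sketched as below. The descent rule is an assumption made for illustration: at each layer the sketch follows the child with the highest score and returns the best-scoring node visited, whereas the text above only states that a more suitable model is selected from the top, intermediate, and bottom layers. `Node`, `select_by_descent`, and `score` are hypothetical names; `score` stands in for the likelihood of a node's speaker model given the input speech and the first-pass recognition result.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str                                     # label of the speaker cluster
    children: list = field(default_factory=list)  # finer clusters one layer down


def select_by_descent(root, score):
    """Walk from the root toward the leaves along the best-scoring child and
    return the visited node with the highest score."""
    best, best_score = root, score(root)
    current = root
    while current.children:
        # Assumed rule: follow only the most likely branch at each layer.
        current = max(current.children, key=score)
        current_score = score(current)
        if current_score > best_score:
            best, best_score = current, current_score
    return best
```

Only one branch per layer is evaluated further, so the number of models scored grows with the depth of the tree rather than with the total number of clusters, and no model parameters are modified.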

[Brief description of the drawings]

【図１】本発明に係る一実施例である音声認識装置の
ブロック図である。FIG. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.

【図２】図１の音声認識装置において用いる木構造話
者クラスタリングの構成を示す斜視図である。FIG. 2 is a perspective view showing the configuration of the tree-structured speaker clustering used in the speech recognition apparatus of FIG. 1.

[Explanation of symbols]

１…マイクロホン、２…特徴抽出部、３…バッファメモリ、４…音素照合部、５…ＬＲパーザ、１１…隠れマルコフ網メモリ、１３…ＬＲテーブルメモリ、２０…文脈自由文法データベースメモリ、３０…話者モデル選択部。 1: microphone, 2: feature extraction unit, 3: buffer memory, 4: phoneme matching unit, 5: LR parser, 11: hidden Markov network memory, 13: LR table memory, 20: context-free grammar database memory, 30: speaker model selection unit.

フロントページの続き (56)参考文献 特開平４−324499（ＪＰ，Ａ) 特開昭61−73199（ＪＰ，Ａ) 特開平２−220099（ＪＰ，Ａ) 特開昭62−245294（ＪＰ，Ａ) 特開平２−226200（ＪＰ，Ａ) 特開昭59−146099（ＪＰ，Ａ) 特開平４−20999（ＪＰ，Ａ) (58)調査した分野(Int.Cl.⁷，ＤＢ名) G10L 3/00 521 G10L 3/00 531
Continuation of the front page
(56) References: JP-A-4-324499 (JP, A); JP-A-61-73199 (JP, A); JP-A-2-220099 (JP, A); JP-A-62-245294 (JP, A); JP-A-2-226200 (JP, A); JP-A-59-146099 (JP, A); JP-A-4-20999 (JP, A)
(58) Fields searched (Int. Cl.⁷, DB name): G10L 3/00 521; G10L 3/00 531

Claims

(57) [Claims]

1. A speech recognition method for recognizing an uttered speech sentence consisting of an input character string by using a plurality of speaker models stored in advance in a speaker model storage device, wherein the clusters of the plurality of speaker models stored in advance in the speaker model storage device are classified hierarchically and represented by a tree structure such that the speaker model in the top layer is a speaker-independent model and the speaker models in the bottom layer are few-speaker or specific-speaker models, the method comprising: recognizing the input uttered speech sentence using the speaker-independent model composed of the plurality of speaker models; selecting, based on the speech recognition result and the input uttered speech sentence, at least one more suitable speaker model from among the plurality of speaker models in the top layer, in the intermediate layer located between the top layer and the bottom layer, and in the bottom layer, by tracing the tree structure from the top layer toward the bottom layer; recognizing the uttered speech sentence again based on the selected speaker model; and outputting the speech recognition result.

2. A speech recognition device comprising: a storage device that stores a plurality of speaker models whose clusters are classified hierarchically and represented by a tree structure such that the speaker model in the top layer is a speaker-independent model and the speaker models in the bottom layer are few-speaker or specific-speaker models; speech recognition means for recognizing an uttered speech sentence consisting of an input character string using the speaker-independent model composed of the plurality of speaker models stored in the storage device; and selection means for selecting, based on the speech recognition result from the speech recognition means and the input uttered speech sentence, at least one more suitable speaker model from among the plurality of speaker models in the top layer, in the intermediate layer located between the top layer and the bottom layer, and in the bottom layer, by tracing the tree structure from the top layer toward the bottom layer, wherein the speech recognition means recognizes the uttered speech sentence again based on the speaker model selected by the selection means and outputs the speech recognition result.