JPH0430599B2

Info

Publication number: JPH0430599B2
Application number: JP58067321A
Authority: JP
Priority date: 1983-04-15
Filing date: 1983-04-15
Publication date: 1992-05-22
Also published as: JPS59192299A

Description

[Detailed description of the invention]

Field of industrial application

The present invention relates to a speech recognition method that uses a plurality of standard patterns for each word, syllable, or phoneme.

Structure of the conventional example and its problems

Using a plurality of standard patterns for each word, syllable, or phoneme is an effective way to make a speech recognition method handle unspecified speakers. In such a method the number of standard patterns can be increased to cover more speakers, but the larger number also increases the variation among the standard patterns, which causes misrecognition.

Table 1 shows an example of the relationship between the number of standard patterns and the recognition rate when the five vowels are used as the standard patterns.

[Table 1]

In a speech recognition method that uses a plurality of standard patterns, learning can adapt the system to a speaker by selecting, from among the plurality of standard patterns, those that fit that speaker. This reduces the misrecognition caused by variation among the standard patterns and improves the recognition rate for that speaker.

Conventionally, learning aimed at such speaker adaptation has been carried out separately for each word, syllable, or phoneme.

With this conventional approach, however, adapting all of the standard patterns to a speaker requires a certain number of learning samples for every word, syllable, or phoneme, so adaptation to the speaker, that is, learning, is slow.

In addition, the samples used for learning must be sufficiently reliable for each word, syllable, or phoneme. For this reason an operator (hereinafter called a teacher) is needed during learning, and automatic learning without a teacher is difficult.

Object of the invention

The present invention was made in view of these problems of the conventional method. It concerns a speech recognition method that uses a plurality of standard patterns and improves the recognition rate for a speaker by selecting, through learning, the standard patterns that fit that speaker. Its object is to provide a learning method for this purpose that is fast and efficient, does not depend heavily on the reliability of individual learning samples, and therefore permits automatic learning without a teacher.

Structure of the invention

The words, syllables, and phonemes uttered by a particular speaker are related to one another. For example, Fig. 1 shows the positions of the first formant F₁ and the second formant F₂ of the five vowels for two speakers, A and B. As the figure shows, some phonemes of the two speakers lie close together, such as |a| of speaker A and |o| of speaker B, but for each speaker taken alone the phonemes are well separated. Moreover, for a particular speaker the phonemes stand in a fixed positional relationship, so once the positions of one or a few phonemes are known, the positions of the remaining phonemes can be narrowed down to some extent.

Based on this property of speech, the present invention concatenates the parameter vectors of a plurality of words, syllables, or phonemes for each speaker into a single parameter vector, clusters the set of concatenated parameter vectors, and takes the representative vector of each cluster as a set of standard patterns. Learning for a speaker is carried out on the basis of these sets: a set of standard patterns that fits the speaker is selected based on the usage frequency of the standard patterns and the reliability obtained when the learning samples are identified, and speech is identified using the selected standard patterns.

Description of the embodiment

An embodiment of the present invention is described in detail below in comparison with the conventional example.

The methods for creating the standard patterns and for learning are described first, followed by the recognition results obtained during learning.

Conventional method

Creating the standard patterns

The set of parameter vectors is clustered separately for each phoneme category and divided into a suitable number of clusters, and the representative vector of each cluster is taken as a phoneme standard pattern for that category. Alternatively, the parameter vectors of all phoneme categories are pooled into one set, which is clustered and divided into a suitable number of clusters, and the representative vector of each cluster is taken as a phoneme standard pattern for the phoneme category to which that cluster belongs.
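The per-category clustering used by the conventional method can be sketched as follows. This is only an illustration under stated assumptions: the patent says "clustering" without naming an algorithm, so plain k-means is swapped in here, the cluster mean is assumed to be the representative vector, and the function names and the toy (F₁, F₂) values are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means (an assumed stand-in for the unspecified clustering).
    Returns k centroids, used here as the representative vectors."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda j: sq_dist(v, centroids[j]))
            groups[nearest].append(v)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean(g) if g else centroids[j]
                     for j, g in enumerate(groups)]
    return centroids

def conventional_patterns(samples_by_phoneme, k):
    """Cluster each phoneme category on its own; every centroid becomes
    one standard pattern of that category."""
    return {ph: kmeans(vecs, k) for ph, vecs in samples_by_phoneme.items()}

# Toy (F1, F2) vectors in Hz; the values are made up for illustration.
samples = {
    "a": [[700, 1200], [720, 1250], [850, 1400], [870, 1380]],
    "i": [[280, 2300], [300, 2250], [350, 2700], [340, 2750]],
}
patterns = conventional_patterns(samples, k=2)
```

Because each category is clustered independently, nothing in this scheme ties the pattern chosen for one phoneme to the pattern chosen for another, which is the gap the method of the invention addresses.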
Learning method

For each phoneme, the standard patterns that fit the speaker are selected based on how often each of that phoneme's standard patterns was referred to when the learning samples were identified and on the reliability of those identifications. The number of patterns selected should be reduced as the number of learning samples increases.

Method according to the present invention

Creating the sets of standard patterns

For each speaker, the parameter vectors (n dimensions) of all phonemes (m phonemes) are concatenated into a single parameter vector (m × n dimensions). Fig. 2 shows an example with five vowels as the phonemes and two-dimensional parameter vectors. Next, the set of concatenated parameter vectors is clustered, and the representative vector of each cluster is taken as a set of phoneme standard patterns. In this clustering, each type of phoneme may be weighted according to its importance, or the clustering may place more weight on vectors near phoneme boundaries, where misidentification is likely.

Learning method

From the reliability with which each learning sample was identified and the usage frequency of the standard patterns referred to during identification, a reliability-weighted usage frequency is computed for each set of standard patterns as a whole, and the sets of standard patterns that fit the speaker are selected on that basis. The number of sets selected should be reduced as the number of learning samples increases.

Fig. 3 shows an example in which four standard patterns were created for each phoneme, with the five vowels as the phonemes, in the two-dimensional plane spanned by F₁ and F₂ as the parameter vector. Filled circles mark the selected standard patterns. Fig. 3a shows the conventional method and Fig. 3b the method of the present invention.

The recognition results obtained during learning with the method of the invention are described below in comparison with the conventional method.

The recognition experiment covered the five vowels and was run under the following conditions. Fifteen male speakers each uttered /ieaou/ continuously five times, and the speech was analyzed by linear prediction. LPC cepstrum coefficients were obtained for 600 frames (15 speakers × 5 vowels × 8 frames) from two of the utterances, used to create the phoneme standard patterns, and for 675 frames (15 speakers × 5 vowels × 9 frames) from the remaining three utterances, used as recognition samples. The standard-pattern samples were then used to find the projection matrix onto the Fisher space that maximizes the Fisher ratio, and both the standard-pattern samples and the recognition samples were projected onto a four-dimensional Fisher space. The projected data were used in the recognition experiment.

Fig. 4a shows the misrecognition rate as learning proceeds with each of the two methods described above, starting from 10 standard patterns per phoneme. Because the misrecognition rate before learning differs between the two methods, Fig. 4b shows the same results as a proportion of the pre-learning misrecognition rate, taken as 100. In Fig. 4, × marks the conventional method and ○ the method of the present invention.
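The concatenation step described above (m phonemes of n dimensions each joined into one m × n vector per speaker) can be illustrated with a short sketch. It assumes the vowel order /ieaou/ from the experiment, assumes the cluster mean serves as the representative vector, and uses hypothetical helper names and made-up (F₁, F₂) values. The clustering itself is left out, and the two toy speakers are simply treated as one cluster.

```python
PHONEMES = ["i", "e", "a", "o", "u"]  # m = 5, in the order /ieaou/

def concat_speaker(vectors_by_phoneme):
    """Join one speaker's per-phoneme vectors (n dims each) into a single
    m*n-dimensional vector, always in the same phoneme order."""
    joined = []
    for ph in PHONEMES:
        joined.extend(vectors_by_phoneme[ph])
    return joined

def representative(joined_vectors):
    """Component-wise mean of the joined vectors in one cluster
    (assumed here to be the cluster's representative vector)."""
    count = len(joined_vectors)
    return [sum(col) / count for col in zip(*joined_vectors)]

def split_set(joined, n):
    """Split an m*n representative vector back into one n-dimensional
    standard pattern per phoneme, giving one set of standard patterns."""
    return {ph: joined[i * n:(i + 1) * n] for i, ph in enumerate(PHONEMES)}

# Toy (F1, F2) values in Hz for two speakers; numbers are illustrative only.
speaker_a = {"i": [300, 2300], "e": [500, 1900], "a": [750, 1200],
             "o": [500, 900], "u": [350, 1300]}
speaker_b = {"i": [320, 2500], "e": [540, 2100], "a": [850, 1400],
             "o": [560, 1000], "u": [370, 1500]}

joined = [concat_speaker(speaker_a), concat_speaker(speaker_b)]
# Pretend both speakers fell into the same cluster.
pattern_set = split_set(representative(joined), n=2)
```

Because every component of a representative vector comes from the same group of speakers, the five patterns in `pattern_set` move together: choosing a set fixes all phonemes at once, even those not yet seen in the learning samples.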
The figure shows results for two phoneme learning orders: one in descending order of identification errors (o→u→a→i→e) and the other in the reverse order. In the conventional method, each time a phoneme is learned, one standard pattern is selected from the 10 phoneme standard patterns belonging to that phoneme category. In the method of the present invention, 10, 6, 4, 3, 2, and 1 sets are selected in that order according to the number of phonemes learned. In the end both methods select one standard pattern per phoneme. Fig. 4 shows that with the conventional method the reduction in misrecognition rate depends strongly on the order in which the phonemes are learned, whereas the method of the present invention shows a nearly uniform reduction and is little affected by which phonemes have been learned.

Fig. 5a shows how the misrecognition rate falls with the number of learning samples used, starting from 10 standard patterns per phoneme and ending when one standard pattern per phoneme is finally selected. Fig. 5b shows the same results as a proportion of the pre-learning misrecognition rate, taken as 100. In Fig. 5, × marks the conventional method and ○ the method of the present invention. The figure shows that the method of the invention reaches a misrecognition rate close to its final value with fewer learning samples than the conventional method, which indicates faster adaptation.

Table 2 shows, for the single standard pattern finally selected for each phoneme, whether the speaker belonged to the cluster whose representative vector became that standard pattern when the standard patterns were created. According to the table, the method of the invention selects, for every speaker, standard patterns that are the representative vectors of the clusters to which that speaker belonged, whereas the conventional method selects a different pattern in a large proportion of cases. This indicates that learning with the method of the invention is highly reliable.
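The set selection by reliability-weighted usage frequency can be sketched roughly as below. The patent does not define how reliability is computed, so the margin score used here (one minus the ratio of the best distance to the distance of the nearest pattern of a different phoneme) is purely an assumption, as are the function names and toy data. The text's schedule of 10, 6, 4, 3, 2, 1 sets is shortened to 2, 1 for the three toy sets.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def identify(sample, sets):
    """Find the nearest standard pattern over all remaining sets.
    Returns (set_id, phoneme, reliability). The reliability is an assumed
    margin score in [0, 1], not a formula taken from the patent."""
    scored = sorted(
        (sq_dist(sample, vec), set_id, ph)
        for set_id, patterns in sets.items()
        for ph, vec in patterns.items()
    )
    d_best, set_id, ph = scored[0]
    rivals = [d for d, _, p in scored if p != ph]
    d_rival = rivals[0] if rivals else 0.0
    reliability = 1.0 - d_best / d_rival if d_rival > 0 else 0.0
    return set_id, ph, reliability

def select_sets(samples, sets, keep):
    """Accumulate reliability per set (a reliability-weighted usage
    frequency) and keep the `keep` best-scoring sets."""
    scores = {set_id: 0.0 for set_id in sets}
    for sample in samples:
        set_id, _, reliability = identify(sample, sets)
        scores[set_id] += reliability
    ranked = sorted(sets, key=lambda sid: scores[sid], reverse=True)
    return {sid: sets[sid] for sid in ranked[:keep]}

# Three toy sets of (F1, F2) patterns for two vowels; values are made up.
sets = {
    0: {"a": [700, 1200], "i": [280, 2300]},
    1: {"a": [850, 1400], "i": [350, 2700]},
    2: {"a": [600, 1000], "i": [250, 2000]},
}
# Unlabeled learning samples from a speaker who resembles set 1.
samples = [[840, 1390], [360, 2680], [845, 1410]]

remaining = sets
for keep in [2, 1]:  # shortened stand-in for the 10, 6, 4, 3, 2, 1 schedule
    remaining = select_sets(samples, remaining, keep)
```

No phoneme label is supplied for the samples, which mirrors the teacherless setting: a single unreliable identification adds little to a set's score, so the choice rests on the accumulated evidence for the set as a whole.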

[Table 2]

Although the present invention has been described with reference to a specific embodiment, it is not limited to that embodiment. For example, phonemes other than vowels may be used as the standard patterns, or syllables or words may be used in place of phonemes.

Effects of the invention

As described above, the present invention provides the following effects.

Adaptation to a speaker is possible with a small number of learning samples (learning is fast).

The misrecognition rate can be reduced without much dependence on which type of standard pattern the learning samples belong to.

Because the reliability of the set as a whole is what matters, an unsuitable standard pattern is unlikely to be selected even if little attention is paid to the reliability of the identification result for each individual learning sample.

In short, with the aim of improving the recognition rate for a particular speaker, the present invention makes it possible to provide a speech recognition method whose learning method selects the standard patterns that fit the speaker quickly, efficiently, and automatically without a teacher.

[Brief explanation of drawings]

Fig. 1 shows the positions of the standard patterns of the five vowels for two speakers. Fig. 2 is a schematic diagram of how phonemes are combined into a set. Fig. 3a shows an example of standard patterns created by the conventional method, and Fig. 3b shows an example of standard patterns created according to an embodiment of the present invention. Figs. 4a and 4b show the relationship between the phoneme learning order and the misrecognition rate. Figs. 5a and 5b show the relationship between the number of learning samples and the misrecognition rate.

Claims

[Claims]

1. A speech recognition method characterized by concatenating the parameter vectors of a plurality of words, syllables, or phonemes for each speaker to create a single parameter vector; clustering the set of concatenated parameter vectors; taking the representative vector of each cluster as a set of standard patterns; performing learning for a speaker based on the sets of standard patterns; selecting a set of standard patterns that fits the speaker; and identifying speech using the selected standard patterns.

2. The speech recognition method according to claim 1, wherein the standard patterns formed into sets are vowels.