JPH0594198A

JPH0594198A - Method and device for recognizing voice

Info

Publication number: JPH0594198A
Application number: JP25362291A
Authority: JP
Inventors: Tetsuo Kosaka; 哲夫小坂
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1991-10-01
Filing date: 1991-10-01
Publication date: 1993-04-16

Abstract

PURPOSE: To improve the recognition accuracy of phonemes with transient features, such as plosives and nasals, by deriving the covariance matrix that represents the variance of a normal distribution separately for every state. CONSTITUTION: The device comprises a microphone 1 for inputting voice information, an A/D converter 2 that performs analog/digital (A/D) conversion of the voice information input from the microphone 1, and a CPU (central processing unit) 3 that controls each part. A read-only memory (ROM) 4 stores the mean and covariance of each standard pattern as well as the processing program, a random access memory (RAM) 5 serves as working memory, and the parts are connected by a system bus 6. When analysis parameters are extracted from the sound signal, both dynamic and static feature parameters are derived, and during DP matching, pattern matching is performed based on their respective output probabilities.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition method and apparatus that recognize phonemes, syllables, words, and the like from parameters obtained by analyzing speech.

[0002]

[Prior Art] Conventionally, statistical techniques such as the HMM method have been widely used to recognize speech. One such technique is the stochastic DP method, a recognition technique based on DP matching with a statistical distance measure. The stochastic DP method is described in detail in Nakagawa, "Recognition of English Consonants of Unspecified Speakers by the Stochastic DP Method and Statistical Methods," Transactions of the Institute of Electronics and Communication Engineers (D), J70-D, 1, pp. 155-163 (Sho 62-01, i.e., January 1987). It uses a quantity corresponding to a probability as the distance measure and uses transition probabilities in place of path costs.

[0003]

[Problems to Be Solved by the Invention] However, because the conventional method described above uses one covariance matrix per phoneme category, it cannot fully represent the features of phonemes that have transient characteristics.

[0004] Furthermore, because static features, which are vector quantities representing the absolute position of the speech spectrum, are used as the speech parameters, phonemes with transient characteristics are again difficult to recognize.

[0005] Because this technique is fundamentally based on DP matching, it can be converted into a spotting algorithm by using an asymmetric DP path whose basic axis is the standard pattern. This approach is described in detail in Nakagawa, "Speech Recognition by Probabilistic Models," edited by the Institute of Electronics, Information and Communication Engineers, pp. 87-89 (Sho 63-7, i.e., July 1988). However, because the number of output-probability evaluations is not constant, when the standard patterns differ in length the input tends to be recognized as one of the shorter patterns.

[0006]

[Means for Solving the Problems] To solve the above problems, the present invention provides a speech recognition apparatus based on dynamic programming that uses state transition probabilities and, as the spectral distance measure, a statistical distance measure based on the assumption of a normal distribution, characterized in that the covariance matrix representing the variance of the normal distribution is obtained separately for each state.

[0007] To solve the above problems, the present invention is the speech recognition apparatus according to claim 1, characterized in that the apparatus has analysis means for analyzing input speech to obtain parameters, and in that the analysis uses static features and dynamic features together.

[0008] To solve the above problems, the present invention is a speech recognition method and apparatus based on dynamic programming that use state transition probabilities and, as the spectral distance measure, a statistical distance measure based on the assumption of a normal distribution, characterized in that the covariance matrix representing the variance of the normal distribution is obtained separately for each state.

[0009] To solve the above problems, in the present invention the speech recognition uses parameters obtained by analyzing input speech; static features and dynamic features are derived by the analysis and are used together during recognition.

[0010] To solve the above problems, the present invention places the integration axis of the dynamic programming on the standard-pattern side and normalizes by the number of states.

[0011]

[Embodiment] A preferred embodiment of the present invention is described in detail below with reference to the drawings.

[0012] FIG. 1 is a block diagram showing the configuration of the speech recognition apparatus of this embodiment. In the figure, 1 is a microphone for inputting voice information; 2 is an A/D converter that performs analog/digital conversion (A/D conversion) of the voice information input from the microphone 1; and 3 is a CPU (Central Processing Unit), which controls these parts. 4 is a read-only memory (ROM), which stores the mean and covariance of each standard speech pattern and also stores the program for the processing shown in the flowchart described later. 5 is a random access memory (RAM), which is used as working memory. 6 is a system bus, by which the above parts are connected.

[0013] The following describes how the mean values of the standard patterns stored in the ROM 4 are obtained.

[0014] To train a standard pattern, DP matching is first performed between the standard pattern and an input parameter sequence, using speech parameters such as the LPC cepstrum. As the initial value of the standard pattern, an arbitrary speech parameter sequence belonging to the class to be modeled statistically is chosen. Once DP matching has produced an alignment, each frame of the standard pattern is regarded as a state, and for each state the sum and the sum of squares of the input parameters aligned to it are computed. The mean for each state is computed from the sum of the input parameters, and it is updated after every input. When all the data have been input, the covariance matrix is computed from the sums of squares. In addition, each time an alignment is obtained, the number of times each DP path was chosen is counted for each state, and the state transition probabilities are computed once all the data have been input.
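The accumulation described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function names, the representation of an alignment as `(state, vector)` pairs, and the use of the population (divide-by-count) covariance are all assumptions, and the sketch computes the mean once at the end instead of updating it after every input.

```python
def accumulate_state_stats(aligned, num_states, dim):
    """Accumulate per-state counts, sums, and sums of products.

    aligned: iterable of (state_index, feature_vector) pairs taken from
    DP alignments (hypothetical representation, 0-based state index).
    """
    counts = [0] * num_states
    sums = [[0.0] * dim for _ in range(num_states)]
    sq_sums = [[[0.0] * dim for _ in range(dim)] for _ in range(num_states)]
    for j, x in aligned:
        counts[j] += 1
        for p in range(dim):
            sums[j][p] += x[p]
            for q in range(dim):
                sq_sums[j][p][q] += x[p] * x[q]
    return counts, sums, sq_sums


def mean_and_covariance(count, s, ss):
    """Mean vector and covariance matrix of one state from its statistics."""
    dim = len(s)
    mean = [v / count for v in s]
    # Assumption: population covariance, E[x x^t] - mean mean^t.
    cov = [[ss[p][q] / count - mean[p] * mean[q] for q in range(dim)]
           for p in range(dim)]
    return mean, cov


def transition_probs(path_counts):
    """Relative frequency of each DP path chosen when entering one state."""
    total = sum(path_counts)
    return [c / total for c in path_counts]
```

Keeping only sums and sums of products means each training utterance can be discarded after it has been aligned, which matches the description of computing the covariance once all data have been input.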

[0015] Because the alignment between the input parameter sequence and the states is obtained by DP matching, each input data item always contributes at least one computation to every state. Therefore, if computing a covariance matrix requires a number of data items on the order of the square of the parameter order, then for a parameter order of 12, for example, 144 data items are enough to design a covariance matrix for each state.

[0016] The mean value and state transition probability obtained in this way for each state of the standard pattern are stored in the ROM 4.

[0017] Next, the processing that recognizes speech using this standard pattern data is described with reference to the flowchart of FIG. 2. The processing in the flowchart of FIG. 2 takes place after the voice information input from the microphone 1 has been converted into a digital signal by the A/D converter 2.

[0018] The speech signal is converted into static feature parameters (for example, the LPC cepstrum), which are vector quantities representing the absolute position of the speech spectrum, and dynamic feature parameters (for example, the delta cepstrum), which are vector quantities representing the temporal movement of the speech spectrum (S1).
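The text names the delta cepstrum as an example of a dynamic feature but does not say how it is computed. The sketch below uses one common choice, the regression slope of each coefficient over a few neighboring frames; that formula, the function name `delta_features`, the `width` parameter, and the edge handling (repeating the first and last frames) are assumptions made for illustration.

```python
def delta_features(frames, width=2):
    """Regression-slope deltas over +/- `width` frames.

    frames: list of static feature vectors (e.g. cepstra), one per frame.
    Edge frames are repeated where a neighbor does not exist (assumption).
    """
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, width + 1))
    deltas = []
    for t in range(n):
        d = [0.0] * dim
        for k in range(1, width + 1):
            nxt = frames[min(t + k, n - 1)]
            prv = frames[max(t - k, 0)]
            for p in range(dim):
                d[p] += k * (nxt[p] - prv[p])
        deltas.append([v / denom for v in d])
    return deltas
```

For a coefficient that rises by one unit per frame, the interior deltas are 1.0, and for a constant coefficient they are 0.0, which is the "temporal movement" the dynamic feature is meant to capture.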

[0019] In S2, the initial values for DP matching are given by equations (1-1) to (1-3).

[0020]

$$Q(-1, j) = Q(0, j) = -\infty \tag{1-1}$$

$$Q(i, 1) = \log P(a_i \mid 1) \tag{1-2}$$

$$Q(i, 0) = 0 \tag{1-3}$$

(Here $a_i$ denotes a vector.) In S3, it is determined whether the DP calculation has been completed for all classes (words, phonemes, etc.).

[0021] If it is determined in S3 that the calculation has not been completed, processing proceeds to S4, where the DP calculation is performed using the recurrence equations (2-1) to (2-3). The DP value sought (the cumulative DP distance) Q(i, j) is the maximum of the values given by equations (2-1) to (2-3).

[0022]

$$Q(i-2, j-1) + 0.5 \log P(a_{i-1} \mid j) + 0.5 \log P(a_i \mid j) + \log P_1(j) \tag{2-1}$$

$$Q(i-1, j-1) + \log P(a_i \mid j) + \log P_2(j) \tag{2-2}$$

$$Q(i-1, j-2) + \log P(a_i \mid j-1) + \log P(a_i \mid j) + \log P_3(j) \tag{2-3}$$

where $P_k(j)$ ($k = 1, 2, 3$) is the transition probability to state $j$.
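The recurrence in equations (2-1) to (2-3), together with the initial values (1-1) to (1-3), can be sketched as below. This is one reading under stated assumptions: frames run from 1 to `num_frames`, states from 1 to `num_states`, Q(i, 0) = 0 is taken to hold for every i (so a match may start at any frame), and the callables `log_out(i, j)` for log P(a_i | j) and `log_trans(k, j)` for the transition term of path k into state j are hypothetical interfaces, not names from the source.

```python
NEG_INF = float("-inf")


def stochastic_dp(num_frames, num_states, log_out, log_trans):
    """Fill the cumulative score table Q(i, j) for one standard pattern.

    log_out(i, j):   log output probability of frame i in state j.
    log_trans(k, j): log transition probability of path k (1, 2, 3) into j.
    Returns a dict keyed by (i, j) with 1 <= i <= num_frames, 1 <= j.
    """
    Q = {}

    def q(i, j):
        if j == 0:
            return 0.0          # (1-3): free start at any frame (assumed)
        if i <= 0:
            return NEG_INF      # (1-1)
        return Q[(i, j)]

    for i in range(1, num_frames + 1):
        Q[(i, 1)] = log_out(i, 1)                      # (1-2)
        for j in range(2, num_states + 1):
            c1 = NEG_INF
            if i >= 3:  # for i < 3, Q(i-2, j-1) is -inf by (1-1)
                c1 = (q(i - 2, j - 1) + 0.5 * log_out(i - 1, j)
                      + 0.5 * log_out(i, j) + log_trans(1, j))   # (2-1)
            c2 = q(i - 1, j - 1) + log_out(i, j) + log_trans(2, j)  # (2-2)
            c3 = (q(i - 1, j - 2) + log_out(i, j - 1)
                  + log_out(i, j) + log_trans(3, j))             # (2-3)
            Q[(i, j)] = max(c1, c2, c3)
    return Q
```

Note that each of the three paths advances the state index by at least one, and each adds exactly one unit of output-probability weight per state advanced, which is what makes dividing the final score by the number of states a sensible normalization.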

[0023] The output probability P(a_i | j) used in equations (2-1) to (2-3) can be expressed as in equation (3) below.

[0024]

$$P(a_i \mid j) = \lambda P_{CEP}(a_i \mid j) + (1 - \lambda) P_{DCEP}(a_i \mid j) \quad (0 \le \lambda \le 1) \tag{3}$$

where $P_{CEP}(a_i \mid j)$ is the output probability from the static feature parameters and $P_{DCEP}(a_i \mid j)$ is the output probability from the dynamic feature parameters.
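Because the DP recurrence works with log P(a_i | j), the weighted sum in equation (3) has to be evaluated from two log probabilities. A minimal sketch follows; the function name and the log-sum-exp formulation are assumptions chosen for numerical stability, not something the source specifies.

```python
import math


def combined_log_output(log_p_cep, log_p_dcep, lam):
    """log(lam * P_CEP + (1 - lam) * P_DCEP) from the two log probabilities."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    terms = []
    if lam > 0.0:
        terms.append(math.log(lam) + log_p_cep)
    if lam < 1.0:
        terms.append(math.log(1.0 - lam) + log_p_dcep)
    m = max(terms)
    if m == float("-inf"):
        return m
    # Log-sum-exp: factor out the largest term to avoid underflow.
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

With `lam = 1.0` the function returns the static score unchanged, and with `lam = 0.0` it returns the dynamic score, so the weight interpolates between the two feature streams.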

[0025] In other words, because P(a_i | j) is used in equations (2-1) to (2-3), the values obtained from these equations reflect both the static features and the dynamic features.

[0026] The output probabilities used as P_CEP(a_i | j) and P_DCEP(a_i | j) in equation (3) are given by equation (4) below.

[0027]

$$(2\pi)^{-d/2} \, |\Sigma_j|^{-1/2} \exp\{ -(a_i - \mu_j)^{t} \, \Sigma_j^{-1} (a_i - \mu_j) \} \tag{4}$$

where $d$ is the number of dimensions of the parameter, $\mu_j$ is the mean vector in state $j$, $\Sigma_j$ is the covariance matrix in state $j$, and $a_i$ denotes a vector.
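A per-state output probability of this kind can be evaluated in the log domain as sketched below. One caveat is labeled loudly: the sketch uses the textbook multivariate normal density, whose exponent carries a factor of 1/2, whereas equation (4) as printed shows no such factor. The function names and the Cholesky-based evaluation are likewise assumptions for illustration.

```python
import math


def cholesky(a):
    """Lower-triangular L with L L^t = a, for a symmetric positive definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = a[r][c] - sum(L[r][k] * L[c][k] for k in range(c))
            if r == c:
                if s <= 0.0:
                    raise ValueError("covariance must be positive definite")
                L[r][c] = math.sqrt(s)
            else:
                L[r][c] = s / L[c][c]
    return L


def gaussian_log_density(x, mean, cov):
    """Log of the textbook normal density (includes the 1/2 in the exponent,
    which equation (4) as printed does not show)."""
    d = len(x)
    L = cholesky(cov)
    diff = [xi - mi for xi, mi in zip(x, mean)]
    # Forward substitution solves L z = diff, so z.z = diff^t cov^-1 diff.
    z = []
    for r in range(d):
        z.append((diff[r] - sum(L[r][k] * z[k] for k in range(r))) / L[r][r])
    maha = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(L[r][r]) for r in range(d))
    return -0.5 * d * math.log(2.0 * math.pi) - 0.5 * log_det - 0.5 * maha
```

Calling this with the mean and covariance of state j gives a separate distribution for every state, which is the point of the invention as opposed to sharing one covariance matrix across a phoneme category.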

[0028] The above calculation is repeated until it is determined in S3 that the DP calculation has been completed for all classes. Once S3 determines that the DP calculation has been completed for all classes, processing proceeds to S4.

[0029] In S4, to normalize by the number of states, Q(i, J)/J (where J is the number of states of the model) is computed for each standard pattern, and the class to which the standard pattern giving the maximum belongs is taken as the recognized class (phoneme, word, etc.) at time i. The calculations of equations (1) to (4) above are performed by the CPU 3. The mean and covariance of each standard pattern are stored in the ROM 4.
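The normalization and decision step can be sketched as follows. The data layout (a dict from a pattern's class label to its final score Q(i, J) and state count J) and the function name are assumptions for illustration; the source does not describe how multiple standard patterns per class are organized.

```python
def recognize(final_scores):
    """Pick the class whose pattern maximizes Q(i, J) / J.

    final_scores: dict mapping class label -> (Q(i, J), J), where J is the
    number of states of that standard pattern (hypothetical layout).
    """
    best_label, best_value = None, float("-inf")
    for label, (q_value, num_states) in final_scores.items():
        normalized = q_value / num_states
        if normalized > best_value:
            best_label, best_value = label, normalized
    return best_label, best_value
```

For example, with scores `{"long": (-30.0, 10), "short": (-20.0, 5)}` the raw maximum would favor the short pattern (-20 > -30), whereas the normalized scores are -3.0 and -4.0, so the long pattern is chosen. This illustrates the bias toward short patterns that the normalization is intended to remove.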

[0030]

[Effects of the Invention] As described above, holding a covariance for each state and using parameters that represent static features together with parameters that represent dynamic features improves the recognition accuracy for phonemes with transient characteristics, such as plosives and nasals. In addition, normalizing by the number of states improves performance when spotting is performed.

[Brief description of drawings]

FIG. 1 is a flowchart showing the processing of an embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of the speech recognition apparatus of the present invention.

[Explanation of symbols]

1 Microphone
2 A/D converter
3 CPU
4 ROM
5 RAM
6 System bus

Claims

1. A speech recognition apparatus based on dynamic programming that uses state transition probabilities and, as a spectral distance measure, a statistical distance measure based on the assumption of a normal distribution, characterized in that a covariance matrix representing the variance of the normal distribution is obtained separately for each state.

2. The speech recognition apparatus according to claim 1, characterized in that it has analysis means for analyzing input speech to obtain parameters, and in that static features and dynamic features are derived by the analysis and used together during recognition.

3. The speech recognition apparatus according to claim 1, characterized in that the integration axis of the dynamic programming is placed on the standard-pattern side and normalization by the number of states is performed.

4. A speech recognition method based on dynamic programming that uses state transition probabilities and, as a spectral distance measure, a statistical distance measure based on the assumption of a normal distribution, characterized in that a covariance matrix representing the variance of the normal distribution is obtained separately for each state.

5. The voice recognition is performed by using a parameter obtained by analyzing an input voice, deriving a static feature and a dynamic feature by the analysis, and using them together for recognition. Speech recognition method described in.

6. An integration axis of the dynamic programming is placed on the side of a standard pattern and is normalized by the number of states.
Speech recognition method described in.