JPH0241760B2 - Google Patents

Info

Publication number
JPH0241760B2
JPH0241760B2 (application JP57009267A / JP926782A)
Authority
JP
Japan
Prior art keywords
power
distance
speech
standard pattern
difference
Prior art date
Legal status
Expired - Lifetime
Application number
JP57009267A
Other languages
Japanese (ja)
Other versions
JPS58126600A (en)
Priority date
Filing date
Publication date
Application filed
Priority to JP926782A
Publication of JPS58126600A
Publication of JPH0241760B2
Granted

Links

Description

[Detailed Description of the Invention] This invention relates to a speech recognition device that determines the similarity between a time series of feature values of input speech and a corresponding standard time series by computing distances between the elements of these time series.

In conventional speech recognition devices, for a time series A = a1, a2, …, al of input-speech features and a standard time series B = b1, b2, …, bn, the distance measure between the elements ai and bj was computed from spectra normalized so that the total power of each element is constant. This had the drawback that the distance could become quite small even between elements of extremely different speech power, for example ai being a vowel and bj a consonant. In speech recognition it is desirable that a single standard pattern recognize the speech of as many people as possible, but because the conventional method used a distance based on the spectrum, which is strongly affected by individual differences, errors such as confusing vowels with consonants occurred easily, and the recognition rate dropped sharply when the input speech was not from the same speaker as the standard pattern.

A distance measure between elements that combines a spectral distance and a power distance is also known. As shown, for example, in the Transactions of the IECE, Vol. J64-A, No. 5, pp. 409-416 (May 25, 1981), in particular in Sections 2 and 3 on the Weighted Likelihood Ratio on page 411, this measure gives the spectral distance and the power distance the same weight, and that weight is fixed. Because the weighting of the power distance is fixed, the measure is exposed to utterance-to-utterance variation in speech intensity, deformation of the power pattern due to differences in S/N ratio, speech-segmentation errors, devoicing, and the like; if the influence of the power weighting factor is too strong, it can actually cause mismatching.

This invention enables more accurate speech recognition than the conventional methods, even for unspecified speakers, by using a matching distance measure between elements of speech time series that combines spectral information with power information, which shows little individual variation. The invention is described in detail below with reference to the drawings.

Fig. 1 shows an embodiment of this invention. A speech signal applied to input terminal 1 is digitized in the A/D conversion and speech-segmentation section 2, where only the speech portion is extracted; the autocorrelation analysis section 3 then computes the normalized power and the autocorrelation coefficients at fixed time intervals. As an example of normalized power, the logarithmic power normalized so that its maximum value is 1 and its minimum value is 0 is used. The spectrum calculation section 4 extracts the parameters needed for spectrum calculation, and the spectral distance calculation section 6 computes the spectral distance between elements of the input speech and the standard patterns stored in the standard pattern storage section 5. In the standard patterns the logarithmic power is normalized word by word so that its maximum is 1 and its minimum is 0. As the spectral distance, for example, the WLR measure can be used (Sugiyama and Shikano, "LPC peak-weighted spectral matching measures", Transactions of the IECE, Vol. J64-A, No. 5, pp. 409-416, 1981).

The power distance calculation section 7 then computes the power distance between elements of the input speech and the standard pattern. The distance calculation section 8 computes an inter-element distance from the spectral distance and the power distance obtained in sections 6 and 7. Examples of this distance are

D = WLR + 6.0(P − P*)²  (1)
D = WLR·(1 + 4.0(P − P*)²)²  (2)

where P and P* are the normalized powers, P being that of the input speech and P* that of the standard pattern, and WLR denotes the WLR spectral matching distance. Since the WLR measure performs better with low-frequency weighting, this weighting is carried out in the spectrum calculation section 4 (Shikano and Sugiyama, "Large-vocabulary word recognition experiments using the WLR measure", Proceedings of the Spring Meeting of the Acoustical Society of Japan, 3-1-19, 1981).

Using the inter-element distance obtained from equation (1) or (2), time-normalized matching is performed in the time-normalized matching section 9 (see, for example, Shikano, "Evaluation of LPC spectral matching measures for large-vocabulary word recognition", Acoustical Society of Japan, Speech Study Group Material S80-60, 1980). The resulting similarity between the input speech and a standard pattern is calculated against several standard patterns for each input utterance; the similarity comparison section 10 compares them and outputs the pattern with the highest similarity to the recognition result output terminal 11.
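The time-normalized matching in section 9 is a dynamic-programming alignment of the two frame sequences. The following is a bare-bones sketch under simplifying assumptions (a basic symmetric local path, no slope constraints or adjustment window, which actual implementations of the period typically added); `frame_dist` would be the inter-element distance of equation (1) or (2).

```python
def dtw(seq_a, seq_b, frame_dist):
    """Minimal time-normalized (DP) matching sketch.

    seq_a, seq_b: sequences of frames (any objects);
    frame_dist(a, b): non-negative inter-element distance,
    e.g. equation (1) or (2). Returns the accumulated distance
    along the best warping path."""
    INF = float("inf")
    n, m = len(seq_a), len(seq_b)
    g = [[INF] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(seq_a[i - 1], seq_b[j - 1])
            # local path: diagonal match, or a step in either sequence
            g[i][j] = d + min(g[i - 1][j - 1], g[i - 1][j], g[i][j - 1])
    return g[n][m]
```

For recognition, this accumulated distance is computed against each candidate standard pattern and the smallest one wins, mirroring the comparison in section 10.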

The time-series pattern of speech power has the basic property of being high in vowel segments and low in consonant segments, and this property is invariant across speakers. Fig. 2 shows the power patterns of the word "Sapporo" uttered twice by each of four speakers; in this figure the logarithmic power pattern has been normalized so that its maximum and minimum values are constant. As Fig. 2 shows, the power pattern differs little from speaker to speaker. It follows that, as in this invention, also matching the power time-series pattern of the input speech against that of the standard speech, and taking it into account, improves the recognition rate.

When the input speech and the standard speech are also compared through their power time-series patterns, let AL denote the logarithmic maximum max(log p(t)) and IL the logarithmic minimum min(log p(t)), normalized to constant values, for example 1 and 0 respectively. The normalized pattern P(t) of the power pattern p(t) of one word is then given by

P(t) = (log p(t) − IL) / (AL − IL)

Normalizing in this way makes the unvoiced segments take the same value even if the intensity ratio of speech to background noise changes (background noise fills the unvoiced segments). Moreover, because of the nature of the logarithmic function, the compression is strongest where the power is large, so all vowel segments take values close to 1, and the power pattern deforms little under fluctuations of the maximum power.
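The normalization above can be sketched in a few lines. This is an illustrative sketch, not the patent's circuit: the function name is hypothetical, and a small floor `eps` is assumed to keep the logarithm defined for zero-power samples.

```python
import math

def normalize_power(p, eps=1e-10):
    """Normalize a word's power contour p(t) to
    P(t) = (log p(t) - IL) / (AL - IL),
    where AL = max log-power and IL = min log-power of the word,
    so the maximum maps to 1 and the minimum to 0."""
    logs = [math.log(max(v, eps)) for v in p]  # eps floors silent samples
    AL, IL = max(logs), min(logs)
    if AL == IL:               # degenerate case: constant power
        return [0.0] * len(logs)
    return [(lv - IL) / (AL - IL) for lv in logs]
```

Because the log is taken before the linear rescaling, high-power (vowel) samples are compressed toward 1 while the floor value pins unvoiced samples near 0, matching the behavior described in the text.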

For speech data of 641 city names, the power patterns were normalized as described above, and the average power distances between identical words and between different words were obtained by the conventional time-normalized matching method. Within the same speaker, the distance between identical words was 0.0374 (with maximum 1 and minimum 0), while the distance between different words was as large as 0.1451. Between different speakers, the distance between identical words was 0.0530, smaller than the within-speaker distance between different words, while the distance between different words was 0.147, about as large as the within-speaker distance between different words. This shows that also referring to the power time-series pattern, as in this invention, can improve recognition in speech recognition across multiple speakers.

Accordingly, using the spectral matching distance Ds(i,j) and the power distance Dp(i,j) between input speech frame i and standard pattern frame j, the distance D(i,j) between these speech frames is given by

D(i,j) = F(Ds(i,j), Dp(i,j)),
∂D/∂Ds ≥ 0, ∂D/∂Dp ≥ 0, where Ds ≥ 0, Dp ≥ 0.

If time-normalized matching of the speech frame sequences is performed using this distance D(i,j), segments that are similar in both spectrum and power level are matched to each other, and matching a vowel against a consonant can be avoided. The simplest forms of the function F are

D(i,j) = Ds(i,j) + Dp(i,j)
D(i,j) = Ds(i,j)·Dp(i,j)

In this invention, to avoid the effects of utterance-to-utterance variation in speech intensity, deformation of the power pattern due to S/N-ratio differences, speech-segmentation errors, and so on, the conventional spectral matching distance excluding the power component is used as Ds(i,j); and, considering that the power pattern is always subject to some variability, Dp(i,j) is chosen as a function that is insensitive to the power difference between the input speech and the standard pattern when that difference is small, and sharply sensitive when the difference exceeds a certain level. The simplest such function is the square function, which can be used as shown in equations (1) and (2).
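A quick numerical check shows why the square function has the required sensitivity profile: its slope with respect to the power difference Δ is 2aΔ, nearly flat near Δ = 0 and steep for large Δ. The snippet below is only a demonstration of this property; the function name is hypothetical.

```python
def power_penalty(delta, a=6.0):
    """Squared power term a * (P - P*)^2 of equation (1):
    insensitive near delta = 0, sharply rising for larger differences."""
    return a * delta ** 2

# Marginal cost of an extra 0.1 of power difference, near zero vs. far out:
small = power_penalty(0.2) - power_penalty(0.1)   # = 6.0 * (0.04 - 0.01) = 0.18
large = power_penalty(0.9) - power_penalty(0.8)   # = 6.0 * (0.81 - 0.64) = 1.02
```

The same 0.1 increment costs almost six times more when the power patterns already disagree, so ordinary power-pattern jitter is tolerated while vowel-to-consonant mismatches are heavily penalized.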

With the cepstrum-based WLR measure used as Ds(i,j), the optimal weighting of the power distance Dp(i,j) was determined for 100 city-name data by computing within-speaker and between-speaker inter-frame matching distances with

D = WLR + a(P − P*)²
D = WLR·(1 + a(P − P*)²)²

varying the coefficient a and examining the changes in recognition rate and error rate; the optimal coefficients a = 6.0 and a = 4.0 were obtained, as shown in equations (1) and (2) respectively.

A large-vocabulary word recognition experiment on the 641 city names was carried out using equations (1) and (2), and the results were compared with those of the conventional method using the spectral distance alone. There was no substantial difference in the top-candidate recognition rate between equations (1) and (2). With either equation, recognition within the same speaker was about the same as with the conventional method, but between different speakers the recognition rate improved substantially, from 82.4% to 88.0%, confirming the superiority of this invention.

As described above, according to this invention the input speech is matched against the standard patterns using a distance composed of a power distance and a spectral distance, so that for a standard pattern whose content differs from the input speech the power time-series patterns differ and the similarity degrades. This has the advantage of improving recognition performance in multi-speaker speech recognition, where recognition errors occur easily with the spectral distance alone. Furthermore, since this invention uses an inter-element matching distance that incorporates the power distance with a sensitivity that varies with the power difference, it is not affected by utterance-to-utterance variation in speech intensity, deformation of the power pattern due to S/N-ratio differences, speech-segmentation errors, and the like.

[Brief Description of the Drawings]

Fig. 1 is a block diagram showing the concept of the speech recognition device according to this invention, and Fig. 2 is a diagram showing the speech-power time-series patterns of "Sapporo". 1: speech input terminal; 2: A/D conversion and speech-segmentation section; 3: autocorrelation analysis section; 4: spectrum calculation section; 5: standard pattern storage section; 6: spectral distance calculation section; 7: power distance calculation section; 8: distance calculation section; 9: time-normalized matching section; 10: similarity comparison section; 11: recognition result output section.

Claims (1)

[Claims] 1. A speech recognition device comprising: means for determining the spectral distance Ds between elements of the feature time series of the input speech and of a standard pattern speech; means for determining the power difference between the input speech and the standard pattern using speech power normalized so that the maximum and minimum values of each become constant values; means for determining an inter-element matching distance from Ds and the power difference, using a function that varies the relative weighting of Ds and the power difference and whose sensitivity to the power difference is dull when the power difference is smaller than a predetermined value and sharp when it is larger; and means for calculating the similarity between the input speech and the standard pattern speech by time-normalized matching using the inter-element matching distance.
JP926782A 1982-01-22 1982-01-22 Voice recognition equipment Granted JPS58126600A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP926782A JPS58126600A (en) 1982-01-22 1982-01-22 Voice recognition equipment


Publications (2)

Publication Number Publication Date
JPS58126600A JPS58126600A (en) 1983-07-28
JPH0241760B2 true JPH0241760B2 (en) 1990-09-19

Family

ID=11715663

Family Applications (1)

Application Number Title Priority Date Filing Date
JP926782A Granted JPS58126600A (en) 1982-01-22 1982-01-22 Voice recognition equipment

Country Status (1)

Country Link
JP (1) JPS58126600A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS56144477A (en) * 1980-04-11 1981-11-10 Matsushita Electric Ind Co Ltd Language traning machine



Similar Documents

Publication Publication Date Title
Zhao et al. CASA-based robust speaker identification
US20160071520A1 (en) Speaker indexing device and speaker indexing method
US8655656B2 (en) Method and system for assessing intelligibility of speech represented by a speech signal
WO2014153800A1 (en) Voice recognition system
JP2745535B2 (en) Voice recognition device
JP2001166789A (en) Method and device for voice recognition of chinese using phoneme similarity vector at beginning or end
CN109272996B (en) Noise reduction method and system
Abka et al. Speech recognition features: Comparison studies on robustness against environmental distortions
JPH0449952B2 (en)
JPH0241760B2 (en)
Li et al. Multi-feature combination for speaker recognition
Bose et al. Robust speaker identification using fusion of features and classifiers
Kyriakides et al. Isolated word endpoint detection using time-frequency variance kernels
Pradhan et al. Significance of speaker information in wideband speech
JPH01255000A (en) Apparatus and method for selectively adding noise to template to be used in voice recognition system
Marković et al. The LPCC-DTW analysis for whispered speech recognition
JP2834471B2 (en) Pronunciation evaluation method
JPH07210197A (en) Method of identifying speaker
Kuah et al. A neural network-based text independent voice recognition system
JPH0426479B2 (en)
Upadhyay et al. Analysis of different classifier using feature extraction in speaker identification and verification under adverse acoustic condition for different scenario
JP2006078654A (en) Voice authenticating system, method, and program
Jing et al. Auditory-modeling inspired methods of feature extraction for robust automatic speech recognition
Segărceanu et al. Speaker Verification using the dynamic time warping
Mut et al. Improved Weighted Matching for Speaker Recognition.