JPH0451036B2

Info

Publication number: JPH0451036B2
Application number: JP59170655A
Authority: JP
Inventors: Katsuyuki Futayada; Ikuo Inoe; Masakatsu Hoshimi
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1984-08-16
Filing date: 1984-08-16
Publication date: 1992-08-17
Also published as: JPS6148896A

Description

DETAILED DESCRIPTION OF THE INVENTION

Field of Industrial Application

The present invention relates to a speech segmentation method for use in a speech recognition device.

Configuration of the Conventional Example and Its Problems

In recent years, speech recognition methods that use phonemes or syllables as the basic unit have been under active development. In such methods, dividing speech into phoneme or syllable units (segmentation) is a key technical element for improving the recognition rate.

Conventionally, methods that use the full-band power or the band power of the spectrum are known for segmenting phonemes or syllables. As one example of the prior art, a consonant segmentation method is described here that uses the temporal movement of the spectral band power and relies on temporal depressions in the power value (power dips).

The conventional method is described below with reference to the drawings. FIG. 1 is a functional block diagram of the conventional segmentation method. An AD converter 1 samples the input speech at 12 kHz, and a band power calculation unit 2 uses band-pass filters to obtain the high-band power and the low-band power for each frame (10 msec). A power value buffer 3 accumulates the high-band and low-band power to obtain time-series information of the power values. A power dip extraction unit 4 then extracts power dips from this time-series information, and a phoneme interval determination unit 5 performs segmentation by treating each power dip interval as a consonant interval.

The conventional method exploits the fact that consonants have lower power than vowels, so that power depressions tend to form in consonant portions. That is, in FIG. 2, when the time series of power values shown at a takes a value lower than its surroundings, the span from the fall of the power value to around its rise is segmented as a consonant, as shown at b. High-band (1500 to 4000 Hz) power readily captures the dips of voiced consonants, and low-band (250 to 600 Hz) power readily captures the dips of unvoiced consonants, so using both together allows a wide range of consonants to be segmented.
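The power-dip idea can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the procedure of the prior art itself: the function name `find_power_dips`, the `min_depth` threshold, and the sample values are hypothetical. The function marks a dip from the frame where the power starts to fall to the frame where it has finished rising again.

```python
def find_power_dips(power, min_depth=3.0):
    """Return (start, end) frame-index pairs of dips in a band-power sequence.

    A dip runs from the frame where the power starts falling to the frame
    where it stops rising again. min_depth is a hypothetical threshold (same
    unit as the power values) that rejects shallow fluctuations.
    """
    dips = []
    n = len(power)
    i = 0
    while i < n - 1:
        if power[i + 1] < power[i]:
            start = i
            j = i
            # Walk down to the bottom of the depression.
            while j < n - 1 and power[j + 1] <= power[j]:
                j += 1
            bottom = j
            # Walk up until the power stops rising.
            while j < n - 1 and power[j + 1] > power[j]:
                j += 1
            end = j
            depth = min(power[start], power[end]) - power[bottom]
            if end > bottom and depth >= min_depth:
                dips.append((start, end))
            i = max(end, i + 1)
        else:
            i += 1
    return dips


# Hypothetical frame powers: vowel, consonant-like dip, vowel.
frame_power = [10, 10, 9, 5, 4, 6, 10, 10, 10]
dips = find_power_dips(frame_power)
```

In a setup like the one described, the same function would be run separately on the high-band and low-band power sequences and the resulting intervals combined.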

However, the drawback of the conventional example is its low detection rate for phonemes whose spectra resemble those of vowels and whose power differs little from that of vowels, in particular nasals (/m/, /n/, /〓/, and the moraic nasal). One method compensates for this by using nasality information (Hoshimi, Niyada: "Segmentation method for word-initial consonants," Acoustical Society of Japan lecture proceedings, March 1984), but nasality information is susceptible to noise and coarticulation, so stable segmentation cannot be achieved.

Object of the Invention

The present invention eliminates the above drawbacks of the prior art and provides a speech segmentation method that accurately segments all kinds of phonemes, including nasals.

Structure of the Invention

To achieve the above object, the present invention provides a method that calculates the similarity between the feature parameters and a stationarity pattern for each frame, and segments phoneme intervals by capturing changes in the time course of the similarity.

Description of the Embodiment

An embodiment of the present invention is described below.

The principle of the present invention is to capture temporal changes in the input parameters by comparing them with a stationarity pattern. First, the method of creating the temporal-stationarity standard pattern is explained. The stationarity pattern is created from many samples, using multiple frames (m frames; m = 3 in this embodiment) taken from temporally stationary parts of the speech signal, for example the central portions of vowels and of the moraic nasal. Let n be the number of feature parameters per frame. In this embodiment, the low-order LPC cepstrum coefficients (C₀ to C₄) are used as the feature parameters, so the number of feature parameters is n = 5.

The m × n (15) parameters are arranged as follows to form the feature parameter vector C.

$$C = (C_0^1, C_1^1, \ldots, C_4^1,\; C_0^2, C_1^2, \ldots, C_4^2,\; C_0^3, C_1^3, \ldots, C_4^3) \qquad \text{(Equation 1)}$$

Here, in $C_i^j$, i is the order number and j is the frame number. For convenience, C is written as follows.

$$C = (C^1, C^2, C^3, \ldots, C^{15}) \qquad \text{(Equation 2)}$$

The mean μ and the variance-covariance matrix W of C are calculated from many samples. Let the elements of μ be $\mu_i$ and the elements of W be $W_{i,j}$. With N samples, where $C_k^i$ is the i-th element of the k-th sample, the stationarity pattern (standard pattern) is obtained from

$$\mu_i = \frac{1}{N} \sum_{k=1}^{N} C_k^i \qquad \text{(Equation 3)}$$

$$W_{i,j} = \frac{1}{N-1} \sum_{k=1}^{N} (C_k^i - \mu_i)(C_k^j - \mu_j) \qquad \text{(Equation 4)}$$
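The pattern-creation step described above can be sketched in Python. This is a minimal illustration, assuming plain Python lists for the data; the function names `stack_frames` and `mean_and_covariance` and the toy numbers are hypothetical and do not come from the original. The first function concatenates m consecutive frames into one vector as in (Equation 1); the second computes the mean and the variance-covariance matrix of (Equation 3) and (Equation 4).

```python
def stack_frames(frames, m=3):
    """Concatenate m consecutive frames into one vector, sliding by one frame.

    frames is a list of per-frame coefficient lists (for example C0..C4).
    Each returned vector has m * n elements, ordered frame by frame.
    """
    if m < 1 or len(frames) < m:
        return []
    vectors = []
    for t in range(len(frames) - m + 1):
        vec = []
        for j in range(m):              # j: frame position inside the window
            vec.extend(frames[t + j])   # append that frame's coefficients
        vectors.append(vec)
    return vectors


def mean_and_covariance(samples):
    """Sample mean vector and unbiased variance-covariance matrix."""
    N = len(samples)
    if N < 2:
        raise ValueError("at least two samples are required")
    d = len(samples[0])
    mu = [sum(s[i] for s in samples) / N for i in range(d)]
    W = [[sum((s[i] - mu[i]) * (s[j] - mu[j]) for s in samples) / (N - 1)
          for j in range(d)]
         for i in range(d)]
    return mu, W


# Toy data: 4 frames with 5 coefficients each (values are arbitrary).
toy_frames = [[i * 10 + k for k in range(5)] for i in range(4)]
toy_vectors = stack_frames(toy_frames, m=3)       # two 15-dimensional vectors
mu, W = mean_and_covariance([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
```

With real data, many more samples than dimensions are needed; otherwise the variance-covariance matrix is singular and cannot be inverted for the similarity calculation.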

Next, the method of calculating the similarity between the input feature parameters and the stationarity pattern is explained.

The feature parameters of the input speech (LPC cepstrum coefficients) are arranged in time-series order in the same way as in (Equation 1), and the result is denoted X.

$$X = (X_1, X_2, X_3, \ldots, X_{15}) \qquad \text{(Equation 5)}$$

The probability density P of X with respect to the stationarity pattern is expressed by the following equation.

$$P = (2\pi)^{-15/2}\, |W|^{-1/2} \exp\left\{ -\frac{1}{2} (X - \mu)' \, W^{-1} (X - \mu) \right\} \qquad \text{(Equation 6)}$$

Here, ′ denotes the transpose.

Taking the logarithm of (Equation 6) and doubling it gives L:

$$L = -(X - \mu)' \, W^{-1} (X - \mu) + A \qquad \text{(Equation 7)}$$

where A is a constant given by

$$A = 2 \log\left\{ (2\pi)^{-15/2}\, |W|^{-1/2} \right\} \qquad \text{(Equation 8)}$$
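The similarity of (Equation 7) can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the implementation of the embodiment: the names `similarity` and `_solve_and_logdet` are hypothetical, the natural logarithm is assumed, and the dimension 15 is generalized to the length of the input vector. Rather than forming the inverse matrix explicitly, the sketch solves a linear system by Gaussian elimination, which also yields the log-determinant needed for the constant A.

```python
import math


def _solve_and_logdet(W, d):
    """Solve W y = d by Gaussian elimination with partial pivoting.

    Returns (y, log|det W|). W is assumed square and non-singular.
    """
    n = len(W)
    aug = [list(W[i]) + [d[i]] for i in range(n)]   # augmented copy [W | d]
    logdet = 0.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[p][c]) < 1e-12:
            raise ValueError("variance-covariance matrix is singular")
        aug[c], aug[p] = aug[p], aug[c]
        logdet += math.log(abs(aug[c][c]))
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        s = aug[r][n] - sum(aug[r][k] * y[k] for k in range(r + 1, n))
        y[r] = s / aug[r][r]
    return y, logdet


def similarity(x, mu, W):
    """L = -(x - mu)' W^-1 (x - mu) + A, with A = -dim*log(2*pi) - log|W|."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    y, logdet = _solve_and_logdet(W, d)
    quad = sum(di * yi for di, yi in zip(d, y))      # (x - mu)' W^-1 (x - mu)
    A = -len(x) * math.log(2.0 * math.pi) - logdet   # constant of Equation 8
    return -quad + A
```

Since A depends only on W, it can be computed once when the stationarity pattern is created and stored alongside μ and W.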

X is obtained over the speech interval while shifting one frame at a time, and the similarity is calculated from it with (Equation 7). In stationary portions the value of (Equation 7) (the similarity) is large, and where the spectrum or the power changes, the value is small. Portions of low similarity correspond to phoneme boundaries or word boundaries, so segmentation can be performed by detecting them.
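Detecting the low-similarity portions can be sketched as a search for local minima in the similarity time series. The following Python sketch is one possible reading, not the detection rule of the embodiment: the function name `find_boundaries`, the `min_drop` threshold, and the example values are assumptions. A frame counts as a boundary candidate when it is a local minimum that lies at least `min_drop` below the nearest peak on each side.

```python
def find_boundaries(sim, min_drop=1.0):
    """Return frame indices of sufficiently deep local minima of sim.

    min_drop is a hypothetical threshold: the minimum must lie at least this
    far below the nearest peak on both its left and its right.
    """
    boundaries = []
    n = len(sim)
    for t in range(1, n - 1):
        if sim[t] < sim[t - 1] and sim[t] <= sim[t + 1]:
            left = t
            while left > 0 and sim[left - 1] >= sim[left]:
                left -= 1                 # climb to the nearest left peak
            right = t
            while right < n - 1 and sim[right + 1] >= sim[right]:
                right += 1                # climb to the nearest right peak
            if min(sim[left], sim[right]) - sim[t] >= min_drop:
                boundaries.append(t)
    return boundaries


# Hypothetical similarity values with two clear minima.
example = [-20, -21, -40, -22, -20, -19, -35, -21, -20]
candidates = find_boundaries(example)
```

Because the test compares each minimum with its own neighborhood rather than with a fixed level, it uses the similarity as a relative value. Minima at the first and last frame are not reported by this sketch.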

FIG. 3 shows, as an example, the change in similarity b when the word "ōsama" (/oosama/, "king") is uttered. For reference, the figure also shows the power change a of the conventional example and the phoneme labels c assigned by visual inspection. According to FIG. 3, the similarity b forms local minima at the word boundaries and the phoneme boundaries, which makes phoneme segmentation straightforward. Comparison with the visual labels c shows that the intervals are detected well. In contrast, the power change a of the conventional example detects /s/ but fails to detect the nasal /m/.

FIG. 4 shows another example, in which the word "inaho" (/inaho/, "ear of rice") is uttered. Here too, local minima appear in the similarity b at the word boundaries and phoneme boundaries, and segmentation is performed accurately, including for the nasal. With the conventional power change a, segmentation is not possible.

Next, FIG. 5 shows a functional block diagram for implementing the method described above.

In FIG. 5, the AD converter 1 has the same function as in the conventional example, so its description is omitted. An acoustic analysis unit 10 analyzes the speech signal; this embodiment uses LPC analysis. The analysis window is a Hamming window, the frame period is 10 msec, and the analysis order is 15. A feature parameter extraction unit 11 calculates the power term C₀ and the four low-order parameters (C₁ to C₄). A similarity calculation unit 12 calculates the similarity between the input feature parameters and the stationarity pattern using (Equation 7).
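The framing that precedes the LPC analysis can be sketched in Python. This is a rough illustration only: the text gives the sampling rate (12 kHz), the frame period (10 msec), and the window type (Hamming), but not the window length, so the 240-sample (20 msec) window below is an assumption, as are the function names `hamming` and `windowed_frames`. The LPC analysis and the cepstrum computation themselves are not shown.

```python
import math


def hamming(length):
    """Hamming window coefficients: 0.54 - 0.46 * cos(2*pi*i / (length - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def windowed_frames(samples, sample_rate=12000, frame_period_ms=10,
                    window_len=240):
    """Cut samples into overlapping Hamming-windowed frames.

    The hop between frames is the frame period (120 samples at 12 kHz).
    window_len is an assumed value; the source does not specify it.
    """
    hop = sample_rate * frame_period_ms // 1000
    win = hamming(window_len)
    frames = []
    for start in range(0, len(samples) - window_len + 1, hop):
        segment = samples[start:start + window_len]
        frames.append([s * w for s, w in zip(segment, win)])
    return frames


# 50 msec of a constant dummy signal at 12 kHz.
dummy = [1.0] * 600
dummy_frames = windowed_frames(dummy)
```

Each windowed frame would then be passed to an order-15 LPC analysis, from which the coefficients C₀ to C₄ are derived.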

A stationarity pattern storage unit 13 holds the values of (Equation 3), (Equation 4), and (Equation 8). A time-series buffer 14 accumulates the similarity information as a time series. A phoneme interval determination unit 15 detects portions of low similarity from the time course of the similarity and determines the phoneme intervals as illustrated in FIGS. 3 and 4.

As described, the segmentation method of this embodiment captures the temporal change of the spectrum at phoneme boundaries as similarity information, so it can accurately segment even phonemes such as nasals, whose power differs little from that of vowels. In addition, because it uses the temporal change in similarity as a relative value (that is, by detecting local minima), it is less susceptible to noise and coarticulation.

In the above example, LPC cepstrum coefficients are used as the feature parameters, but other feature parameters such as band spectral power, PARCOR coefficients, the autocorrelation function, or autocorrelation coefficients may be used instead. The LPC cepstrum coefficients also need not be limited to orders C₀ to C₄. Although three frames are used in the above example, the number of frames is not limited to three as long as multiple frames are used.

As for the distance measure, other statistical distance measures, for example the Mahalanobis distance, may also be used. In that case, the phoneme interval determination unit 15 of FIG. 5 performs segmentation by detecting local maxima.
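The switch from minima to maxima follows directly from (Equation 7). As a short check, writing $D^2$ for the squared Mahalanobis distance (the symbol D is introduced here for illustration and does not appear in the original):

$$D^2 = (X - \mu)' \, W^{-1} (X - \mu), \qquad L = -D^2 + A$$

Because A is a constant, L is small exactly where $D^2$ is large, so a local minimum of the similarity L corresponds to a local maximum of the distance.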

It was stated above that the temporal stationarity pattern is created from samples taken at the centers of vowels and of the moraic nasal, but in practice it may also be created over all voiced intervals or all speech intervals while shifting one frame at a time. (In ordinary speech, stationary portions outnumber portions where the spectrum is changing.)

Effects of the Invention

In summary, the present invention provides a method that calculates the similarity between the feature parameters and a stationarity pattern for each frame and segments phoneme intervals by capturing changes in the time course of the similarity. It can accurately segment even phonemes, such as nasals, that conventional methods could not segment correctly. Because it detects spectral change from the relative value of the similarity information, it also has the advantage of being less susceptible to sources of variation such as noise and coarticulation. Furthermore, since the similarity calculation consists entirely of sum-of-products operations, it has the advantage of being easy to implement in hardware.
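The remark that the similarity calculation reduces to sums of products can be made explicit. As a sketch, assuming the inverse matrix is computed once in advance and writing its elements as $V_{i,j}$ (a symbol introduced here, not used in the original), (Equation 7) expands to

$$L = A - \sum_{i=1}^{15} \sum_{j=1}^{15} (X_i - \mu_i)\, V_{i,j}\, (X_j - \mu_j), \qquad V = W^{-1}$$

so each frame requires only subtractions and multiply-accumulate operations on stored constants.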

[Brief explanation of the drawing]

FIG. 1 is a functional block diagram for explaining the conventional speech segmentation method; FIG. 2 is a diagram for explaining the conventional method of segmentation based on power changes; FIGS. 3 and 4 show concrete examples demonstrating the effectiveness of the speech segmentation method in an embodiment of the present invention; and FIG. 5 is a functional block diagram for implementing this embodiment.

1: AD converter; 10: acoustic analysis unit; 11: feature parameter extraction unit; 12: similarity calculation unit; 13: stationarity pattern storage unit; 14: time-series buffer; 15: phoneme interval determination unit.

Claims

1. A speech segmentation method characterized by: analyzing input speech for each analysis interval (frame) to obtain feature parameters; calculating, with a statistical distance measure, the similarity between the temporal pattern of the feature parameters and a standard pattern expressing temporal stationarity; creating time-series information of the similarity over the speech interval; and detecting speech boundaries from the temporal movement of that time-series information, thereby performing segmentation.

2. The speech segmentation method according to claim 1, wherein the standard pattern expressing temporal stationarity consists of a mean and a variance-covariance matrix computed from the feature parameters of multiple frames taken from many samples.

3. The speech segmentation method according to claim 1, wherein the feature parameters are one selected from LPC cepstrum coefficients, band spectral power, PARCOR coefficients, and the autocorrelation function.

4. The speech segmentation method according to claim 1, wherein the statistical distance measure is any one of probability density, log likelihood, and Mahalanobis distance.

5. The speech segmentation method according to claim 1 or 2, wherein the standard pattern of temporal stationarity is created from any one of the stationary portions of speech, the voiced intervals, and the entire speech interval.