JPH0311399A

JPH0311399A - Voice segmentation method

Info

Publication number: JPH0311399A
Application number: JP1145064A
Authority: JP
Inventors: Keisuke Oda; 啓介小田; Yumi Takizawa; 滝沢　由美; Kiyohito Tokuda; 清仁徳田; Atsushi Fukazawa; 深沢　敦司
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1989-06-09
Filing date: 1989-06-09
Publication date: 1991-01-18
Anticipated expiration: 2012-04-09
Also published as: JP2598518B2

Abstract

PURPOSE: To perform high-precision segmentation independently of voice power by using the information entropy of voice, obtained by regarding a voice signal as a time-series signal. CONSTITUTION: A square value calculating means 2, an average power calculating means 3, a prediction error power calculating means 4, a normalized entropy calculating means 5, and a syllable or phoneme section detecting means 6 are provided. An input signal 1 is, for example, a time-series signal of voice that has been A/D-converted at a given sampling frequency. Attention is paid to the information entropy of the voice regarded as this time-series signal, and each section in which the normalized entropy changes in time order from a maximum value to a minimum value and back to a maximum value is detected as one unit of syllable or phoneme section to perform voice segmentation. Since the segmentation does not depend on voice power, it is performed with a measure that is independent of individual differences between speakers.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a speech segmentation method that divides a continuously uttered speech signal into syllables or phonemes.

[Conventional technology]

Speech segmentation methods in wide use to date have focused on voice power. For example, as disclosed in Takeshi Agui and Masayuki Nakajima, "Computer Speech Processing" (コンピュータ音声処理), Akiba Shuppan, June 1980, P17B, there is a method that detects an interval containing a single local maximum, bounded by the time points at which the voice power is at a local minimum, as one syllable or phoneme interval.

[Problems to be Solved by the Invention]

However, because such a method segments speech on the measure of voice power, the segmentation result depends on the voice power.

Considering that voice power differs greatly between individuals, and that it is not constant even for the same person, segmentation needs to be performed with a new measure that does not depend on individual differences.

The present invention was made to solve the problem that the result of power-based segmentation depends on the voice power. Its object is to provide a speech segmentation method that achieves highly accurate segmentation without depending on voice power, by segmenting with the information entropy of speech obtained when the speech signal is regarded as a time-series signal.

[Means for Solving the Problems]

The speech segmentation method according to the present invention comprises: a step of obtaining the average power and the prediction error power of a speech input; a step of obtaining a normalized entropy based on the average power and the prediction error power; and a step of performing speech segmentation by detecting each interval in which the normalized entropy changes in time order from a local maximum to a local minimum to a local maximum as one unit of syllable or phoneme interval.

[Function]

In the present invention, attention is paid to the information entropy of speech when the speech is regarded as a time-series signal, and each interval in which the normalized entropy changes in time order from a local maximum to a local minimum to a local maximum is detected as one unit of syllable or phoneme interval.

[Embodiment]

FIG. 1 is a block diagram showing the configuration of an apparatus for carrying out a method according to one embodiment of the present invention.

In the figure, (1) is an input signal, (2) is a square value calculation means, (3) is an average power calculation means, (4) is a prediction error power calculation means, (5) is a normalized entropy calculation means, (6) is a syllable or phoneme interval detection means, and (7) is a detection signal.

Next, the operation will be described. The input signal (1) is, for example, a speech time-series signal $x(n)$ $\{n = 0, \pm\Delta t, \pm 2\Delta t, \pm 3\Delta t, \ldots;\ \Delta t = 1/8000\ \text{(sec)}\}$ that has been A/D-converted at a sampling frequency of 8 kHz. The square value calculation means (2) receives this input signal $x(n)$ and produces the squared signal $(x(n))^2$.

Next, this squared signal is input to the average power calculation means (3) to obtain the average power $P_0(n)$. Here, $P_0(n)$ is defined by the following equation.

... (1)

(Here, $L$ is the length of the averaging interval.)

Next, this $P_0(n)$ is input to the prediction error power calculation means (4) to obtain the prediction error power $P_M(n)$.
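The squaring and averaging steps described above can be sketched as follows. This is only an illustration: the body of equation (1) is not available in this text, so the trailing window used here (the most recent L samples, with a shorter window at the start of the signal) is an assumption, and the function names `squared_signal` and `average_power` are hypothetical.

```python
def squared_signal(x):
    """Square each sample, as the square value calculation means (2) does."""
    return [s * s for s in x]


def average_power(x, L):
    """Mean of the squared signal over the most recent L samples.

    Assumption: a trailing window. The source only states that L is the
    averaging interval length. The first L - 1 outputs use a shorter window.
    """
    sq = squared_signal(x)
    out = []
    running = 0.0
    for n, v in enumerate(sq):
        running += v
        if n >= L:
            running -= sq[n - L]  # drop the sample that left the window
        out.append(running / min(n + 1, L))
    return out
```

For instance, `average_power([1, -1, 2, -2], 2)` averages the squared samples `[1, 1, 4, 4]` two at a time. A centered window would serve equally well for illustration; the choice only shifts the power curve in time.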

The prediction error power is calculated as follows. The input signal $x(n)$ is predicted as a linear combination of the past $m$ sample values, as in the following equation.

... (2)

(Here, $a_i^{(m)}$ is the $m$-th order linear prediction coefficient, that is, the reflection coefficient.)

Using the Levinson-Durbin algorithm, the $m$-th order prediction error power when the $m$-th order reflection coefficient $a_m^{(m)}$ is given is

... (3)

(Here, $m = 1, 2, 3, \ldots, M$, and $M$ is the maximum prediction order.)

Therefore, the output $P_M(n)$ of the prediction error power calculation means (4) is the $M$-th order prediction error power obtained by increasing $m$ in order from 1 to $M$ in equation (3).

The normalized entropy calculation means (5) receives $P_0(n)$ and $P_M(n)$ and calculates the information entropy (hereinafter simply "entropy"). Here, the entropy $H(n)$ of the time-series spectrum $S(f, n)$ is

... + $\tfrac{1}{2}\log 2 f_N$   ... (4)

In the above equation, $f_N$ is the Nyquist frequency, and $S(f, n)$ is

... (5)

Substituting equation (5) into equation (4), the integral of the denominator term of equation (5) becomes 0, so that

... (6)

Carrying out the integration in equation (6) and ignoring constants then gives the following equation.

$$H(n) = \log P_M(n) \qquad (7)$$

Furthermore, because $P_M(n)$ is obtained by solving equation (3) recursively, the entropy of equation (7) is a quantity that depends on the average power $P_0(n)$. Taking the logarithm of $P_M(n)$ normalized by $P_0(n)$, the normalized entropy $H_M(n)$ is therefore

$$H_M(n) = \log P_M(n) - \log P_0(n) \qquad (8)$$

and the normalized entropy is calculated according to equation (8).
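As an illustration of how the prediction error power and equation (8) might be evaluated for one analysis frame, the sketch below uses the textbook Levinson-Durbin recursion on biased autocorrelation estimates. Several points are assumptions rather than details recovered from the source: the update $P_m = P_{m-1}(1 - k_m^2)$ stands in for equation (3), the frame's zero-lag autocorrelation stands in for $P_0(n)$, and the names `autocorrelation`, `prediction_error_power`, and `normalized_entropy` are hypothetical.

```python
import math


def autocorrelation(frame, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] of one frame."""
    N = len(frame)
    return [
        sum(frame[i] * frame[i + k] for i in range(N - k)) / N
        for k in range(max_lag + 1)
    ]


def prediction_error_power(r, M):
    """Levinson-Durbin recursion; returns the order-M error power.

    Assumes the standard update P_m = P_{m-1} * (1 - k_m**2),
    starting from P_0 = r[0] (the frame's average power).
    """
    a = [1.0] + [0.0] * M  # predictor polynomial, a[0] = 1
    P = r[0]
    for m in range(1, M + 1):
        if P <= 0.0:
            break  # silent or perfectly predictable frame
        acc = sum(a[i] * r[m - i] for i in range(m))
        k = -acc / P  # m-th reflection coefficient
        prev = a[:]
        for i in range(1, m + 1):
            a[i] = prev[i] + k * prev[m - i]
        P *= 1.0 - k * k
    return P


def normalized_entropy(frame, M=10):
    """H_M = log P_M - log P_0 for one frame, following equation (8).

    Raises ValueError for frames whose power or error power is zero.
    """
    r = autocorrelation(frame, M)
    return math.log(prediction_error_power(r, M)) - math.log(r[0])
```

Calling `normalized_entropy` on successive frames of the input yields the time series that the detection stage operates on. Under these assumptions the value is never positive, and multiplying every sample of a frame by a constant gain leaves it unchanged up to rounding.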

The syllable or phoneme interval detection means (6) treats the normalized entropy calculated by the normalized entropy calculation means (5) as a time-series signal, detects each interval in which the normalized entropy goes from a local maximum to a local minimum to a local maximum as a syllable or phoneme interval, performs the segmentation, and outputs the result as the detection signal (7). This detection signal (7) is sent to a matching device (not shown), where its similarity to pre-stored reference patterns is computed, and the most similar pattern is output as the syllable or phoneme.

FIG. 2 is an explanatory diagram showing the method of detecting syllable or phoneme intervals; the horizontal axis is time and the vertical axis is the value of the normalized entropy. Here, the local maxima and local minima of the normalized entropy are defined as follows.

a) At time $m_i$, the normalized entropy has a local maximum $H_M(m_i)$.

b) At time $n_i$, the normalized entropy has a local minimum $H_M(n_i)$.

Basically, the interval from time $m_i$ → $n_i$ → $m_{i+1}$ is taken as one unit of syllable or phoneme interval. This is the interval over which the normalized entropy repeats in the order local maximum → local minimum → local maximum, that is, $H_M(m_i) \to H_M(n_i) \to H_M(m_{i+1})$.
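The maximum → minimum → maximum rule can be sketched as a simple scan over a sequence of normalized entropy values. This is a minimal illustration, not the detection means of the embodiment: the function names are hypothetical, only strict extrema are detected, endpoints and plateaus are ignored, and no smoothing is applied, all of which are assumptions.

```python
def local_extrema(h):
    """Return (maxima, minima) index lists for a sequence h.

    Only strict interior extrema are reported; endpoints and
    plateaus are ignored (an assumption of this sketch).
    """
    maxima, minima = [], []
    for i in range(1, len(h) - 1):
        if h[i - 1] < h[i] > h[i + 1]:
            maxima.append(i)
        elif h[i - 1] > h[i] < h[i + 1]:
            minima.append(i)
    return maxima, minima


def segment(h):
    """List (m_i, n_i, m_next) index triples: maximum -> minimum -> maximum."""
    maxima, minima = local_extrema(h)
    segments = []
    for left, right in zip(maxima, maxima[1:]):
        inside = [n for n in minima if left < n < right]
        if inside:
            # keep the deepest minimum if more than one lies between the maxima
            n_min = min(inside, key=lambda n: h[n])
            segments.append((left, n_min, right))
    return segments
```

For the sequence `[-3, -1, -4, -2, -5, -1.5, -3]`, `segment` returns two triples, `(1, 2, 3)` and `(3, 4, 5)`, each naming the indices of a bounding maximum, the minimum between them, and the next maximum. On real speech some smoothing of the entropy curve would probably be needed before such a scan.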

FIG. 3 is an explanatory diagram showing an example of the normalized entropy output; it plots the average power $P_0(n)$ and the corresponding normalized entropy $H_M(n)$ for a prediction order of 10 (that is, $M = 10$). In this example the input word is "asahi" (あさひ). As shown, the normalized entropy changes with the changes in syllable or phoneme, and, taking local maximum → local minimum → local maximum as the unit, the normalized entropy is divided in correspondence with a-s-a-h-i.

Given the nature of this normalized entropy, the more strongly the entropy value tends to decrease, the better the input speech fits the prediction model.

Therefore, the local minimum of the entropy value can be regarded as the most stable part of the syllable or phoneme.

[Effect of the Invention]

As described above, the present invention makes it possible to perform segmentation in syllable or phoneme units using normalized entropy. This normalized entropy is an evaluation measure of how well the speech signal can be predicted when an AR (autoregressive) model is applied to it, and it derives from the vocal tract information of the speech. Because it does not depend on voice power, segmentation with a measure that is independent of individual differences between speakers becomes possible. Moreover, segmentation also becomes possible for speech without vocal cord vibration, such as voiceless consonants and whispered speech, which has conventionally been difficult to segment.
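The claimed independence from voice power can be checked with a short worked equation. The sketch below assumes that the reflection coefficients depend only on ratios of autocorrelation values (as in the standard Levinson-Durbin recursion), so that a constant gain $g$ applied to the input scales $P_0(n)$ and $P_M(n)$ by the same factor $g^2$. The gain $g$ and the primed symbols are introduced here for illustration only.

```latex
\begin{aligned}
x'(n) &= g\,x(n) \;\Rightarrow\; P_0'(n) = g^2 P_0(n), \quad P_M'(n) = g^2 P_M(n), \\
H_M'(n) &= \log\bigl(g^2 P_M(n)\bigr) - \log\bigl(g^2 P_0(n)\bigr)
         = \log P_M(n) - \log P_0(n) = H_M(n).
\end{aligned}
```

Under that assumption the normalized entropy of equation (8) is unchanged by the overall level of the speech, whereas the unnormalized entropy $\log P_M(n)$ of equation (7) would shift by $\log g^2$.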

[Brief explanation of the drawing]

FIG. 1 is a block diagram showing the configuration of an apparatus that carries out a method according to one embodiment of the present invention; FIG. 2 is an explanatory diagram showing the method of detecting syllable or phoneme intervals; and FIG. 3 is an explanatory diagram showing an example of the normalized entropy output.

(2): square value calculation means
(3): average power calculation means
(4): prediction error power calculation means
(5): normalized entropy calculation means
(6): syllable or phoneme interval detection means

Claims

1. A speech segmentation method comprising: a step of obtaining the average power and the prediction error power of a speech input; a step of obtaining a normalized entropy based on the average power and the prediction error power; and a step of performing speech segmentation by detecting each interval in which the normalized entropy changes in time order from a local maximum to a local minimum to a local maximum as one unit of syllable or phoneme interval.