JPH0311399A - Voice segmentation method - Google Patents

Voice segmentation method

Info

Publication number
JPH0311399A
JPH0311399A JP1145064A JP14506489A
Authority
JP
Japan
Prior art keywords
voice
segmentation
entropy
power
syllable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP1145064A
Other languages
Japanese (ja)
Other versions
JP2598518B2 (en)
Inventor
Keisuke Oda
啓介 小田
Yumi Takizawa
滝沢 由美
Kiyohito Tokuda
清仁 徳田
Atsushi Fukazawa
深沢 敦司
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oki Electric Industry Co Ltd
Original Assignee
Oki Electric Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oki Electric Industry Co Ltd filed Critical Oki Electric Industry Co Ltd
Priority to JP1145064A priority Critical patent/JP2598518B2/en
Publication of JPH0311399A publication Critical patent/JPH0311399A/en
Application granted granted Critical
Publication of JP2598518B2 publication Critical patent/JP2598518B2/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Abstract

PURPOSE: To perform high-precision segmentation independently of voice power, by using the information entropy of speech obtained when the speech signal is regarded as a time-series signal.

CONSTITUTION: A square value calculating means 2, an average power calculating means 3, a prediction error power calculating means 4, a normalized entropy calculating means 5, and a syllable or phoneme section detecting means 6 are provided. The input signal 1 is, for example, a speech time-series signal A/D-converted at a given sampling frequency. Attention is paid to the information entropy of the speech regarded as such a time-series signal, and an interval over which the normalized entropy changes in time sequence from a local maximum through a local minimum to a local maximum is detected as one syllable or phoneme section to perform speech segmentation. Since the segmentation does not depend on voice power, it is performed with a measure independent of individual differences between speakers.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a speech segmentation method for segmenting a continuously uttered speech signal into syllables or phonemes.

[Prior Art]

Speech segmentation methods in wide use have conventionally focused on voice power. For example, as disclosed in Takeshi Agui and Masayuki Nakajima, "Computer Speech Processing" (Akiba Shuppan, June 1980), p. 178, there is a method that detects, as one syllable or phoneme section, an interval containing a single local maximum of voice power bounded by the time points at which the voice power reaches local minima.

[Problems to Be Solved by the Invention]

However, because such a method performs segmentation against the measure of voice power, the segmentation result depends on the voice power.

Considering that voice power differs greatly between individuals, and that even for the same speaker it is not constant, segmentation needs to be performed with a new measure that does not depend on individual differences.

The present invention was made to solve the problem that power-based segmentation depends on voice power. Its object is to provide a speech segmentation method that performs segmentation using the information entropy of speech, obtained by regarding the speech signal as a time-series signal, thereby enabling high-precision segmentation independent of voice power.

[Means for Solving the Problems]

The speech segmentation method according to the present invention comprises: a step of obtaining the average power and the prediction error power of a speech input; a step of obtaining a normalized entropy based on the average power and the prediction error power; and a step of performing speech segmentation by detecting, as one syllable or phoneme section, an interval in which the normalized entropy changes in time sequence through a local maximum, a local minimum, and a local maximum.

[Operation]

In the present invention, attention is paid to the information entropy of speech regarded as a time-series signal, and an interval in which the normalized entropy changes in time sequence through a local maximum, a local minimum, and a local maximum is detected as one syllable or phoneme section.

[Embodiment]

Fig. 1 is a block diagram showing the configuration of an apparatus for carrying out a method according to one embodiment of the present invention.

In the figure, (1) is an input signal, (2) is a square value calculating means, (3) is an average power calculating means, (4) is a prediction error power calculating means, (5) is a normalized entropy calculating means, (6) is a syllable or phoneme section detecting means, and (7) is a detection signal.

Next, the operation will be described. The input signal (1) is assumed to be a speech time-series signal x(n), n = 0, ±Δt, ±2Δt, ±3Δt, ..., A/D-converted at a sampling frequency of, for example, 8 kHz, so that Δt = 1/8000 (sec). The square value calculating means (2) receives this input signal x(n) and obtains the squared signal (x(n))².

Next, this squared signal is input to the average power calculating means (3) to obtain the average power P_0(n), defined by

$$P_0(n) = \frac{1}{L} \sum_{i=0}^{L-1} \bigl(x(n - i\Delta t)\bigr)^2 \qquad \cdots (1)$$

(where L is the length of the averaging interval). Next, this P_0(n) is input to the prediction error power calculating means (4) to obtain the prediction error power P_M(n).

The prediction error power is calculated by predicting the input signal x(n) as a linear combination of the past m sample values,

$$\hat{x}(n) = -\sum_{i=1}^{m} a_m(i)\, x(n - i\Delta t) \qquad \cdots (2)$$

(where a_m(i) is the i-th linear prediction coefficient of order m, and a_m(m) is the reflection coefficient). Using the Levinson-Durbin algorithm, the m-th order prediction error power, given the m-th order reflection coefficient a_m(m), is obtained recursively as

$$P_m(n) = P_{m-1}(n)\,\bigl(1 - a_m(m)^2\bigr) \qquad \cdots (3)$$

(where m = 1, 2, 3, ..., M, and M is the maximum prediction order). The output of the prediction error power calculating means (4) is thus the M-th order prediction error power P_M(n) obtained by increasing m successively from 1 to M in equation (3).

The normalized entropy calculating means (5) receives P_0(n) and P_M(n) and computes the information entropy (hereinafter simply called entropy). The entropy H(n) of the time-series spectrum S(f, n) is

$$H(n) = \frac{1}{4 f_N} \int_{-f_N}^{f_N} \log_2 S(f, n)\, df + \frac{1}{2} \log_2 2 f_N \qquad \cdots (4)$$

where f_N is the Nyquist frequency and S(f, n) is the spectrum of the prediction model,

$$S(f, n) = \frac{P_M(n)}{2 f_N \left| 1 + \sum_{m=1}^{M} a_M(m)\, e^{-j 2 \pi f m \Delta t} \right|^{2}} \qquad \cdots (5)$$

Substituting equation (5) into equation (4), the integral of the denominator term of equation (5) becomes 0, so that

$$H(n) = \frac{1}{4 f_N} \int_{-f_N}^{f_N} \log_2 P_M(n)\, df + \text{const.} \qquad \cdots (6)$$

Carrying out the integration in equation (6) and ignoring constants, we obtain

$$H(n) = \log P_M(n) \qquad \cdots (7)$$

Because P_M(n) is obtained by solving equation (3) recursively from the average power P_0(n), the entropy of equation (7) is a quantity that depends on P_0(n). Taking the logarithm of P_M(n) normalized by P_0(n), the normalized entropy Ĥ(n) becomes

$$\hat{H}(n) = \log P_M(n) - \log P_0(n) \qquad \cdots (8)$$

and the normalized entropy is calculated according to equation (8).

The syllable or phoneme section detecting means (6) regards the normalized entropy calculated by the normalized entropy calculating means (5) as a time-series signal, detects an interval in which the normalized entropy changes through local maximum → local minimum → local maximum as a syllable or phoneme section, and outputs the result of this segmentation as the detection signal (7). The detection signal (7) is sent to a matching device (not shown), where its similarity to reference patterns stored in advance is computed, and the most similar pattern is output as the syllable or phoneme.
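The local maximum → local minimum → local maximum scan of the detecting means (6) might look as follows. The bare three-point extremum test is a simplification introduced here for illustration; a practical implementation would smooth the entropy contour or add hysteresis, which the patent does not detail.

```python
import numpy as np

def detect_sections(H: np.ndarray) -> list[tuple[int, int]]:
    """Return (start, end) frame-index pairs where H runs max -> min -> max."""
    # Interior local maxima and minima of the normalized entropy contour.
    maxima = [i for i in range(1, len(H) - 1) if H[i - 1] < H[i] >= H[i + 1]]
    minima = {i for i in range(1, len(H) - 1) if H[i - 1] > H[i] <= H[i + 1]}
    sections = []
    for m1, m2 in zip(maxima, maxima[1:]):
        # One syllable/phoneme section: a local minimum between adjacent maxima.
        if any(m1 < n < m2 for n in minima):
            sections.append((m1, m2))
    return sections
```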

Fig. 2 is an explanatory diagram showing the method of detecting a syllable or phoneme section; the horizontal axis is time and the vertical axis is the value of the normalized entropy. The local maxima and minima of the normalized entropy are defined as follows.

(a) At time m_1, the normalized entropy takes the local maximum value H_M(m_1).

(b) At time n_1, the normalized entropy takes the local minimum value H_m(n_1).

Basically, the interval from time m_1 through n_1 to m_2 is taken as one syllable or phoneme section. This is an interval over which the normalized entropy repeats the sequence local maximum → local minimum → local maximum, that is, H_M(m_i) → H_m(n_i) → H_M(m_{i+1}).

Fig. 3 is an explanatory diagram showing an example of the normalized entropy output: the average power P_0(n) and the corresponding normalized entropy H(n) for a prediction order of 10 (that is, M = 10) are shown. In this example the input word is "asahi" (あさひ); as illustrated, the value of the normalized entropy changes in correspondence with the syllable or phoneme changes, and, taking local maximum → local minimum → local maximum as the unit, the normalized entropy is partitioned into sections corresponding to a-s-a-h-i.
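Tying the sketches together in the spirit of Fig. 3, a frame-by-frame driver could look as follows. The frame length and hop are illustrative assumptions (the text fixes only the 8 kHz sampling rate and, in this example, a prediction order of M = 10), and the driver reuses the hypothetical normalized_entropy and detect_sections functions from the earlier sketches.

```python
import numpy as np

def entropy_contour(x: np.ndarray, frame_len: int = 256,
                    hop: int = 80, M: int = 10) -> np.ndarray:
    """Normalized entropy per frame; an 80-sample hop is 10 ms at 8 kHz."""
    starts = range(0, len(x) - frame_len, hop)
    return np.array([normalized_entropy(x[s:s + frame_len], M) for s in starts])

# H = entropy_contour(speech)      # speech: 8 kHz mono samples, e.g. of "asahi"
# sections = detect_sections(H)    # candidate syllable/phoneme intervals
```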

Given the nature of this normalized entropy, the larger the decreasing tendency of the entropy value, the better the input speech fits the prediction model.

Accordingly, a local minimum of the entropy value can be regarded as the most stable syllable or phoneme portion.

[Effects of the Invention]

As explained above, the present invention makes it possible to perform segmentation in units of syllables or phonemes using the normalized entropy. The normalized entropy is an evaluation measure of how well the speech signal is predicted when an AR (autoregressive) model is applied to it, and it derives from the vocal tract information of the speech. Because it does not depend on voice power, segmentation can be performed with a measure independent of individual differences between speakers; moreover, segmentation becomes possible even for speech without vocal cord vibration, such as unvoiced consonants and whispers, which were conventionally difficult to segment.

[Brief Description of the Drawings]

Fig. 1 is a block diagram showing the configuration of an apparatus implementing a method according to one embodiment of the present invention, Fig. 2 is an explanatory diagram showing the method of detecting a syllable or phoneme section, and Fig. 3 is an explanatory diagram showing an example of the normalized entropy output.

(2): square value calculating means; (3): average power calculating means; (4): prediction error power calculating means; (5): normalized entropy calculating means; (6): syllable or phoneme section detecting means.

Claims (1)

[Claim 1] A speech segmentation method characterized by comprising: a step of obtaining the average power and the prediction error power of a speech input, respectively; a step of obtaining a normalized entropy based on the average power and the prediction error power; and a step of performing speech segmentation by detecting, as one unit syllable or phoneme section, an interval in which the normalized entropy changes in time sequence through a local maximum, a local minimum, and a local maximum.
JP1145064A 1989-06-09 1989-06-09 Voice segmentation method Expired - Lifetime JP2598518B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP1145064A JP2598518B2 (en) 1989-06-09 1989-06-09 Voice segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP1145064A JP2598518B2 (en) 1989-06-09 1989-06-09 Voice segmentation method

Publications (2)

Publication Number Publication Date
JPH0311399A true JPH0311399A (en) 1991-01-18
JP2598518B2 JP2598518B2 (en) 1997-04-09

Family

ID=15376547

Family Applications (1)

Application Number Title Priority Date Filing Date
JP1145064A Expired - Lifetime JP2598518B2 (en) 1989-06-09 1989-06-09 Voice segmentation method

Country Status (1)

Country Link
JP (1) JP2598518B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5204536A (en) * 1991-06-14 1993-04-20 Shlomo Vardi Electro-optical monitoring system utilizing optical signal transmitters in predetermined geometrical patterns


Also Published As

Publication number Publication date
JP2598518B2 (en) 1997-04-09

Similar Documents

Publication Publication Date Title
Singh et al. Speaker's voice characteristics and similarity measurement using Euclidean distances
US8942977B2 (en) System and method for speech recognition using pitch-synchronous spectral parameters
EP1693826A1 (en) Vocal tract resonance tracking using a nonlinear predictor and a target-guided temporal constraint
US7908142B2 (en) Apparatus and method for identifying prosody and apparatus and method for recognizing speech
JP3673507B2 (en) APPARATUS AND PROGRAM FOR DETERMINING PART OF SPECIFIC VOICE CHARACTERISTIC CHARACTERISTICS, APPARATUS AND PROGRAM FOR DETERMINING PART OF SPEECH SIGNAL CHARACTERISTICS WITH HIGH RELIABILITY, AND Pseudo-Syllable Nucleus Extraction Apparatus and Program
Dusan et al. Acoustic-to-articulatory inversion using dynamical and phonological constraints
Yadav et al. Epoch detection from emotional speech signal using zero time windowing
Suzuki et al. Determination of articulatory positions from speech acoustics by applying dynamic articulatory constraints
Claes et al. SNR-normalisation for robust speech recognition
CN116230018A (en) Synthetic voice quality evaluation method for voice synthesis system
JPH0311399A (en) Voice segmentation method
Karabetsos et al. One-class classification for spectral join cost calculation in unit selection speech synthesis
US20060150805A1 (en) Method of automatically detecting vibrato in music
Zaidi et al. Automatic recognition system for dysarthric speech based on MFCC’s, PNCC’s, JITTER and SHIMMER coefficients
Dubey et al. Hypernasality detection using zero time windowing
Arroabarren et al. Glottal spectrum based inverse filtering.
JPS5972500A (en) Voice recognition system
Chen et al. Teager Mel and PLP fusion feature based speech emotion recognition
JP4576612B2 (en) Speech recognition method and speech recognition apparatus
Hussain et al. Endpoint detection of speech signal using neural network
JPH04130499A (en) Segmentation of voice
Vikram et al. Acoustic analysis of misarticulated trills in cleft lip and palate children
KR930010398B1 (en) Transfer section detecting method on sound signal wave
JP2001083978A (en) Speech recognition device
JPH05173594A (en) Voiced sound section detecting method