JPS61190400A - Enunciation speed estimate apparatus - Google Patents
Enunciation speed estimate apparatus - Info
- Publication number
- JPS61190400A (Application JP60030183A)
- Authority
- JP
- Japan
- Prior art keywords
- unit
- speech
- similarity
- autocorrelation
- calculation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Abstract
(57) [Abstract] This publication contains application data filed before the electronic filing system; therefore, no abstract data is recorded.
Description
[Detailed Description of the Invention]

[Overview]

This invention estimates the utterance rate of speech, in particular of the five vowels and the syllabic nasal (撥音). From the similarity time series of the five vowels and the syllabic nasal of the input speech, the maximum value at each time is selected; the maximum of the autocorrelation of the selected time series is extracted; and the reciprocal of the lag at that maximum is taken as the utterance rate, giving the per-unit-time utterance rate of the five vowels and the syllabic nasal.
The present invention belongs to the field of speech processing devices, and in particular relates to an utterance rate estimating device that estimates the per-unit-time utterance rate of the five vowels and the syllabic nasal.
A conventional method of estimating utterance rate is explained with reference to Figs. 3(a) to 3(e). In Fig. 3(a) the vertical axis is speech energy (E) and the horizontal axis is time (t). When, for example, 「アカダ (AKADA)」 is uttered, the energy of the vowel portions rises within the speech interval (T) and three peaks appear. Taking this speech waveform as a function E = g(t), cosine-series expansion coefficients are fitted using sinusoidal functions f(cos θ) of, for example, 1 to 4 cycles as shown in Figs. 3(b) to 3(e), and the correlation between the speech-energy waveform and each sinusoidal function is obtained. That is, over small subintervals of the speech interval T, the products of the function g(t) and the function f(cos θ) are formed in time sequence and accumulated; this inner product is called the expansion coefficient. When the inner product of the 1-cycle waveform of Fig. 3(b) with g(t) is computed, the waveform deviates considerably at portion A, as is clear from the figure, so the absolute value of the expansion coefficient becomes small. In this way, for Figs. 3(b) and 3(e) as well, the waveform deviates at portion A and the absolute value of the expansion coefficient becomes small. In the 3-cycle case of Fig. 3(d), on the other hand, the waveform at A nearly coincides with a negative peak, so the absolute value becomes large.
Inner products are computed in the same way for 4 cycles and beyond. Since the absolute value becomes maximum when the sinusoid is synchronized with g(t) in either the positive or the negative direction, the cycle count (frequency) at that maximum indicates the number of vowels per second in the speech interval T (morae/second).
Thus, conventionally, for the time series of energy values taken every several tens of milliseconds as small subintervals of the speech interval, the frequency of the cosine expansion function that maximizes the absolute value of the cosine-series expansion coefficient was used as the utterance rate.
However, this conventional approach exploits the property that each mora (the five vowels and the syllabic nasal) forms a peak in the energy time series, whereas actual speech does not always form clear peaks and valleys. In 「アイ (AI)」, for example, the two successive vowels produce one continuous peak, so the sinusoidal function fits poorly, the frequency cannot be determined, and the utterance rate may not be extracted accurately.
[Means and Operation for Solving the Problems]

The present invention is an utterance rate estimating device that solves the problems described above. The maximum-value time series of the similarity time series of each mora-forming phoneme (the five vowels and the syllabic nasal) is taken as a mora-position time series; the maximum point of its autocorrelation function is extracted as the mora repetition time; and the reciprocal of that time is taken as the utterance rate. The means for this is an utterance rate estimating device in a speech processing apparatus comprising: an acoustic-electric conversion unit that converts a speech wave into an electric signal; an input unit that converts the electric signal into a digital signal; a feature extraction unit that extracts features from the digital signal as a time series; a dictionary unit in which the features of the five vowels and the syllabic nasal are stored in advance; a similarity calculation unit that computes, as a time series, the similarity between the output of the feature extraction unit and the features of each sound read from the dictionary unit; an optimal value selection unit that selects the optimal similarity value among the sounds at each time; an autocorrelation calculation unit that computes the autocorrelation of the output time series of the optimal value selection unit over the entire speech interval; a maximum extraction unit that extracts the maximum point of the autocorrelation; and a reciprocal calculation unit that computes the reciprocal of that maximum point, the reciprocal being taken as the utterance rate per unit time.
Fig. 1 is a block diagram of one embodiment of the utterance rate estimating device according to the present invention. In Fig. 1, speech input through a microphone 1 serving as the acoustic-electric conversion unit is digitized by an input unit 2, and a feature extraction unit 3 performs frequency analysis every several tens of milliseconds, extracting the energies of about 16 frequency bands as a feature vector. A dictionary unit 4 stores, for each phoneme, feature vectors obtained by extracting the features of the five vowels and the syllabic nasal in the same manner as above. A similarity calculation unit 5 computes the similarity (ρ) between the feature vector (x_i) extracted by the feature extraction unit 3 and the per-phoneme feature vectors (y_i) in the dictionary unit 4, for example according to equation (1) below. Here ρ_A is a value indicating the similarity to 「ア (A)」.
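The per-frame similarity computation can be sketched as follows. Equation (1) is not legible in this text, so a normalized inner product (cosine similarity) is assumed here, and the 16-band feature vector and the dictionary templates are synthetic placeholders rather than real speech features.

```python
import numpy as np

# Synthetic stand-ins: one 16-band energy feature vector x for a single
# analysis frame, and one stored template y per phoneme (5 vowels + N).
rng = np.random.default_rng(0)
frame = rng.random(16)
dictionary = {p: rng.random(16) for p in "AIUEON"}

def similarity(x, y):
    # Normalized inner product (cosine similarity); the exact form of
    # equation (1) is an assumption, not taken from the patent text.
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# One similarity value per phoneme for this frame; repeating this for
# every frame yields the similarity time series rho_A ... rho_N.
rho = {p: similarity(frame, y) for p, y in dictionary.items()}
```

Repeating this per analysis frame produces the six similarity tracks that the later stages operate on.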
Since the feature vector (x_i) is obtained every several tens of milliseconds, at each small subinterval of the analysis period, the similarities ρ_A to ρ_N are likewise computed each time a feature vector is obtained, so that at time j the similarity time series of each phoneme,

ρ_j^A, ρ_j^I, ρ_j^U, ρ_j^E, ρ_j^O, ρ_j^N,

is obtained.
The optimal value selection unit 6 then performs the following calculation:

ρ_j = max(ρ_j^A, ρ_j^I, ρ_j^U, ρ_j^E, ρ_j^O, ρ_j^N) ... (2)

where max denotes selecting the maximum value of its arguments as the optimal value.
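A minimal sketch of this optimal-value selection, using toy similarity tracks in place of real ones (the values and the 10 ms frame count are placeholders):

```python
import numpy as np

# Toy similarity time series for the six phonemes, one value per frame.
rng = np.random.default_rng(1)
n_frames = 200
rho = {p: rng.random(n_frames) for p in "AIUEON"}

# At each frame j, keep the largest similarity over the five vowels and
# the syllabic nasal -- the max-selection described for unit 6.
rho_max = np.max(np.stack(list(rho.values())), axis=0)
```

The resulting single track rho_max is the mora-position time series handed to the autocorrelation stage.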
The autocorrelation calculation unit 7 computes the autocorrelation function of the ρ_j obtained in the preceding stage by the following equation, the summation being taken over the speech interval:

ν_k = Σ_j ρ_j · ρ_(j+k) ... (3)
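The autocorrelation stage can be sketched as below. The exact form of equation (3) is not legible in the source, so a plain lag-product sum after mean removal is assumed; on a series with one peak every 25 frames, the largest non-zero-lag peak of the autocorrelation recovers that period.

```python
import numpy as np

def autocorr(x):
    # nu_k = sum_j x[j] * x[j + k] after mean removal; this exact form
    # of equation (3) is an assumption.
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) for k in range(n)])

# A periodic "mora position" series: one peak every 25 frames.
t = np.arange(400)
rho_max = 1.0 + np.sin(2 * np.pi * t / 25)
nu = autocorr(rho_max)

# Take the largest local maximum with k != 0. (A plain argmax over
# k >= 1 would land next to the lag-0 peak on a smooth signal, so a
# local-peak search is used here.)
peaks = [k for k in range(1, len(nu) - 1) if nu[k - 1] < nu[k] > nu[k + 1]]
k_max = max(peaks, key=lambda k: nu[k])   # recovers the 25-frame period
```

The local-peak search is a practical detail: excluding only the single point k = 0, as the text states, is not enough on smooth data, since ν_1 is nearly as large as ν_0.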
The maximum extraction unit 8 extracts from (ν_k), k = 1, ..., N, the largest peak point other than k = 0 as k_max, and multiplies it by the analysis period T (sec) according to the following equation (4) to obtain the average duration L of one mora:

L = T · k_max ... (4)

The reciprocal calculation unit 9 computes the reciprocal of L as in the following equation to obtain the utterance rate S (morae/second).
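Equations (4) and (5) reduce to a multiplication and a reciprocal; a numeric sketch, assuming a 10 ms analysis period and a best lag of 25 frames (both illustrative values, not from the patent):

```python
T = 0.01       # assumed analysis period in seconds (10 ms per frame)
k_max = 25     # assumed largest non-zero autocorrelation lag

L = T * k_max  # eq. (4): average duration of one mora, here 0.25 s
S = 1.0 / L    # eq. (5): utterance rate, here 4 morae per second
```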
S = 1/L ... (5)

Figs. 2(a) to 2(h) illustrate an example of the processing procedure described above, taking 「ギンザの (GINZANO)」 as an example. In the figures the vertical axis is the speech energy value and the horizontal axis is time (for example, at 10 msec intervals).
In Fig. 2(a) it can be seen that the similarity ρ_j^A at time j is high at the 「ア (A)」 portion of the utterance 「ギンザの (GINZANO)」. Similarly, where the similarities for 「イ (I)」, 「ウ (U)」, ..., 「ン (N)」 are high, the waveforms form peaks as shown in Figs. 2(b) to 2(f). These similarity calculations are performed in the similarity calculation unit 5 according to equation (1) above.
The similarity calculation described above is then performed with j varied successively from the 1st to the n-th frame, that is, every 10 msec, and the maximum value over 「ア (A)」, 「イ (I)」, ..., 「ン (N)」 is obtained. This calculation, expressed by equation (2), is performed in the optimal value selection unit 6, and the result is the waveform shown in Fig. 2(g). Computing the autocorrelation function of this waveform according to equation (3) yields the waveform shown in Fig. 2(h). Here the vertical axis (ν_k) is the autocorrelation value and the horizontal axis (k) is the shift amount: a shift k means shifting the waveform of Fig. 2(g) by k in time. Computing in this way the autocorrelation between the waveform of Fig. 2(g) shifted successively by k and the unshifted waveform at k = 0 gives the waveform shown in Fig. 2(h). Naturally the correlation is largest at k = 0, that is, with no shift; excluding k = 0, the next largest point k_max is obtained. The reciprocal calculation unit 9 multiplies this k_max by the analysis period T as in equation (4) to obtain the average duration L per mora, and computing the reciprocal S of this L gives the utterance rate.
According to the present invention, the utterance rate of speech can be estimated accurately without explicitly performing phoneme recognition.
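Under the same assumptions as the sketches above (cosine-style similarities, plain lag-product autocorrelation, a hypothetical 10 ms frame period), the whole pipeline of Fig. 1 can be strung together on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 0.01                    # assumed 10 ms analysis period
n = 300
t = np.arange(n)
mora_len = 20               # ground truth: one mora per 200 ms -> 5 morae/s

# Synthetic similarity tracks: all six phonemes share the mora-rate
# envelope, with mild random modulation standing in for real speech.
envelope = 0.5 + 0.5 * np.cos(2 * np.pi * t / mora_len)
rho = {p: envelope * rng.uniform(0.9, 1.0, n) for p in "AIUEON"}

# Optimal value selection (eq. 2), autocorrelation (eq. 3), largest
# non-zero-lag peak, then eqs. (4)-(5).
rho_max = np.max(np.stack(list(rho.values())), axis=0)
x = rho_max - rho_max.mean()
nu = np.array([np.dot(x[:n - k], x[k:]) for k in range(n)])
peaks = [k for k in range(1, n - 1) if nu[k - 1] < nu[k] > nu[k + 1]]
k_max = max(peaks, key=lambda k: nu[k])
S = 1.0 / (T * k_max)       # estimated utterance rate, close to 5 morae/s
```

Note that nothing in the chain identifies which phoneme produced each peak: the estimate comes purely from the repetition period of the maxima, which is why no explicit phoneme recognition is needed.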
Fig. 1 is a block diagram of one embodiment of the utterance rate estimating device according to the present invention; Figs. 2(a) to 2(h) are diagrams explaining the processing procedure of the device of Fig. 1; and Figs. 3(a) to 3(e) are diagrams explaining a conventional method of estimating utterance rate.

(Explanation of reference numerals)
1: microphone, 2: input unit, 3: feature extraction unit, 4: dictionary unit, 5: similarity calculation unit, 6: optimal value selection unit, 7: autocorrelation calculation unit, 8: maximum extraction unit, 9: reciprocal calculation unit.
Claims (1)

1. An utterance rate estimating device in a speech processing apparatus, comprising: an acoustic-electric conversion unit that converts a speech wave into an electric signal; an input unit that converts the electric signal into a digital signal; a feature extraction unit that extracts features from the digital signal as a time series; a dictionary unit in which the features of the five vowels and the syllabic nasal are stored in advance; a similarity calculation unit that computes, as a time series, the similarity between the output of the feature extraction unit and the features of each sound read from the dictionary unit; an optimal value selection unit that selects the optimal similarity value of each sound at each time; an autocorrelation calculation unit that computes the autocorrelation of the output time series of the optimal value selection unit over the entire speech interval; a maximum extraction unit that extracts the maximum point of the autocorrelation; and a reciprocal calculation unit that computes the reciprocal of that maximum point, the reciprocal being taken as the utterance rate per unit time.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP60030183A JPS61190400A (en) | 1985-02-20 | 1985-02-20 | Enunciation speed estimate apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP60030183A JPS61190400A (en) | 1985-02-20 | 1985-02-20 | Enunciation speed estimate apparatus |
Publications (2)
Publication Number | Publication Date |
---|---|
JPS61190400A true JPS61190400A (en) | 1986-08-25 |
JPH0588478B2 JPH0588478B2 (en) | 1993-12-22 |
Family
ID=12296643
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
JP60030183A Granted JPS61190400A (en) | 1985-02-20 | 1985-02-20 | Enunciation speed estimate apparatus |
Country Status (1)
Country | Link |
---|---|
JP (1) | JPS61190400A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH05289691A (en) * | 1992-04-10 | 1993-11-05 | Nippon Telegr & Teleph Corp <Ntt> | Speech speed measuring instrument |
JP2002221976A (en) * | 2001-01-24 | 2002-08-09 | Yamaha Corp | Speech speed detecting method and voice signal processor |
Also Published As
Publication number | Publication date |
---|---|
JPH0588478B2 (en) | 1993-12-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Talkin et al. | A robust algorithm for pitch tracking (RAPT) | |
JP2763322B2 (en) | Audio processing method | |
US8889976B2 (en) | Musical score position estimating device, musical score position estimating method, and musical score position estimating robot | |
EP1667108B1 (en) | Speech synthesis system, speech synthesis method, and program product | |
JP2009042716A (en) | Cyclic signal processing method, cyclic signal conversion method, cyclic signal processing apparatus, and cyclic signal analysis method | |
CN104123934A (en) | Speech composition recognition method and system | |
JP2005266797A (en) | Method and apparatus for separating sound-source signal and method and device for detecting pitch | |
US11557287B2 (en) | Pronunciation conversion apparatus, pitch mark timing extraction apparatus, methods and programs for the same | |
JP5325130B2 (en) | LPC analysis device, LPC analysis method, speech analysis / synthesis device, speech analysis / synthesis method, and program | |
JPS61190400A (en) | Enunciation speed estimate apparatus | |
JPH0777979A (en) | Speech-operated acoustic modulating device | |
CN103778914A (en) | Anti-noise voice identification method and device based on signal-to-noise ratio weighing template characteristic matching | |
CN109697985B (en) | Voice signal processing method and device and terminal | |
JP3500690B2 (en) | Audio pitch extraction device and audio processing device | |
JPS58108590A (en) | Voice recognition equipment | |
d’Alessandro et al. | Phase-based methods for voice source analysis | |
JP4313740B2 (en) | Reverberation removal method, program, and recording medium | |
JP4882152B2 (en) | Speech speed detection method and audio signal processing apparatus | |
JPS61128300A (en) | Pitch extractor | |
Razak et al. | A preliminary speech analysis for recognizing emotion | |
JPH03216699A (en) | Sound source data generating method of sound synthesizer | |
CN113450768A (en) | Speech synthesis system evaluation method and device, readable storage medium and terminal equipment | |
Toma et al. | Recognition of English vowels in isolated speech using characteristics of Bengali accent | |
JPH04253100A (en) | Sound source data generating method of voice synthesizer | |
Mamat et al. | Mandarin syllables speech trainer based on F1 and F2 formant frequencies |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
LAPS | Cancellation because of no payment of annual fees |