JPS5925237B2 - Speech segment determination method using speech analysis and synthesis method - Google Patents

Speech segment determination method using speech analysis and synthesis method

Info

Publication number
JPS5925237B2
JPS5925237B2 JP54157123A JP15712379A
Authority
JP
Japan
Prior art keywords
speech
zero
sections
crossing rate
determined
Prior art date
Legal status (assumed; not a legal conclusion): Expired
Application number
JP54157123A
Other languages
Japanese (ja)
Other versions
JPS5678899A (en)
Inventor
浩二 浮穴
Current Assignee (listed assignee may be inaccurate): Panasonic Mobile Communications Co Ltd
Original Assignee
Matsushita Communication Industrial Co Ltd
Application filed by Matsushita Communication Industrial Co Ltd filed Critical Matsushita Communication Industrial Co Ltd
Priority to JP54157123A priority Critical patent/JPS5925237B2/en
Publication of JPS5678899A publication Critical patent/JPS5678899A/en
Publication of JPS5925237B2 publication Critical patent/JPS5925237B2/en
Expired legal-status Critical Current

Landscapes

  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Description

DETAILED DESCRIPTION OF THE INVENTION The present invention relates to a method for rapidly determining the voiced sections, unvoiced sections, and silent sections of speech in a speech analysis-synthesis system.

In general, a speech analysis-synthesis system compresses speech information by first classifying the signal into voiced, unvoiced, and silent sections, and then extracting, for each section, the minimum amount of information that best represents it.

The method used to determine these sections is therefore an important problem in speech analysis-synthesis systems. In conventional speech analysis-synthesis equipment, for example equipment using the PARCOR method, a residual signal is formed by removing the frequency-spectrum envelope components, such as formants, from the speech signal; the modified correlation function, i.e. the autocorrelation function of that residual signal, is then computed, and the voiced/unvoiced decision is made from its maximum value together with the first-order PARCOR coefficient k1.

In practice these computations are usually carried out on a computer, but obtaining the residual signal and the modified correlation function requires considerable computation time. As one way of speeding up the processing, a method has been proposed that obtains the modified correlation function by a weighted moving-average operation in which a digital filter is applied to the autocorrelation function of the speech waveform, but this method too remains computationally expensive. The present invention determines the voiced, unvoiced, and silent sections of speech efficiently, quickly, and accurately by combining the zero-crossing rate with the autocorrelation function of a speech waveform that has undergone simple preprocessing, thereby remedying the computation-time drawback of the prior methods.

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a flowchart outlining the process. In the figure, 101 is the speech waveform data; this waveform is divided into frames of a fixed duration (for example 30 ms), and a voiced/unvoiced/silent decision is made for each frame. In step 102 the speech is clipped at the silent level, and any frame whose entire span lies within the silent level is declared a silent section and excluded from all further processing.

Ideally this silent level would be zero, but in practice it is set to a small nonzero level (for example ±3 on integer data scaled to ±2048) in order to cut out hum, offset drift in the A/D converter, and the like. Step 103 counts the number of zero crossings.
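The silent-level test of step 102 can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation; the ±3 level and the function name are assumptions taken from the example values in the text.

```python
import numpy as np

SILENT_LEVEL = 3  # example from the text: ±3 on integer samples scaled to ±2048


def is_silent_frame(frame, silent_level=SILENT_LEVEL):
    """Step 102: a frame whose every sample lies within ±silent_level
    is declared a silent section and skipped by all later processing."""
    return bool(np.all(np.abs(frame) <= silent_level))
```

A frame of low-level hum such as `[1, -2, 0]` is declared silent, while any sample exceeding the level makes the frame pass on to the zero-crossing stage.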

FIG. 2 illustrates this processing with an example of speech data: here there are 7 zero crossings in 22 data points, so the zero-crossing rate is 7/22. In unvoiced sections this rate rises, so it holds one key to the voiced/unvoiced decision. Because the frame length is long, a frame may straddle an unvoiced and a voiced portion when this rate is computed; FIG. 3 shows such a case, in which the left part of the frame is considered to lie in an unvoiced section and the right part in a voiced section. In Japanese, roughly 70% of speech is voiced and the remaining 30% is unvoiced or silent, and it is difficult to decide whether such a mixed frame should be labeled voiced or unvoiced. In the present invention such a frame is judged unvoiced, which emphasizes the unvoiced sound and prevents loss of intelligibility of unvoiced consonants. To that end, when the zero-crossing rate is computed, the frame is split in two, and the larger of the zero-crossing rates of the first and second halves is adopted as the representative value for the frame. Step 104 tests whether the zero-crossing rate is at or below a threshold for declaring a silent section.
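The half-frame zero-crossing computation described above can be sketched as below. The function names are assumptions; the rate is taken as crossings divided by samples, matching the 7/22 example of FIG. 2.

```python
import numpy as np


def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose signs differ,
    i.e. crossings / samples (FIG. 2: 7 crossings in 22 samples -> 7/22)."""
    s = np.sign(x)
    crossings = np.count_nonzero(s[:-1] * s[1:] < 0)
    return crossings / len(x)


def frame_zcr(frame):
    """Step 103: split the frame in two and adopt the larger half-frame
    rate as the representative value, so a frame that straddles an
    unvoiced/voiced boundary (FIG. 3) is biased toward unvoiced."""
    half = len(frame) // 2
    return max(zero_crossing_rate(frame[:half]),
               zero_crossing_rate(frame[half:]))
```

Taking the maximum of the two halves keeps a high-rate unvoiced half from being diluted by a low-rate voiced half.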

This threshold is a value (for example 1/100) low enough that a frame at or below it can be declared a silent section, avoiding the effects of hum and noise not removed by step 102. Step 105 declares as an unvoiced section any frame whose zero-crossing rate is at or above a value (for example 1/3) at which the frame can be judged unvoiced with certainty.

FIG. 4 plots the zero-crossing rate Z on the x axis and, on the y axis, the ratio Ψ(T) = φ(T)/φ(0) of the maximum value φ(T) of the autocorrelation function to its value φ(0) at zero delay. FIG. 4 is a graph explaining how each frame is classified, from the relationship between Z and Ψ(T), as a voiced section or an unvoiced section; the processing of step 105 corresponds to region 401 in FIG. 4. (In FIG. 4, unvoiced regions are marked UV and voiced regions V.) If a frame is voiced, the autocorrelation function must be computed to find the fundamental frequency of its speech; since that computation takes a long time, the zero-crossing rate is used as a preprocessing step to decide unvoiced sections first and so save as much of the computation as possible. That is, for a section that is definitely unvoiced or silent, the autocorrelation function is not computed at all, since no fundamental frequency is required for that section. Step 106 is applied only to sections not already decided silent or unvoiced by step 105; it reduces the influence of formants and removes the effect that high-frequency components in near-zero portions of the speech signal would otherwise have on the autocorrelation function.

The specific method is explained with FIG. 5, which shows the speech signal within one frame. As shown, the frame is divided into three parts; the maximum absolute value 501 in the first third and the maximum absolute value 502 in the last third are found, and the signal is clipped at a level 503 equal to N% (for example 30%) of the smaller value 502 of the two. After this processing, the autocorrelation function φ(τ) is computed in step 107, and when its value is extremely small (for example, when the maximum of a 300-point autocorrelation function of integer data scaled to ±2048 is 5 or less), step 108 declares the section silent. In step 109, the ratio Ψ(T) = φ(T)/φ(0) is computed from the maximum value φ(T) of the autocorrelation function within the pitch-frequency search range.
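The clipping preprocessing of step 106 might look like the sketch below. The patent specifies only the clip level 503 (N% of the smaller of the two peak values), not the exact clipper shape, so the center-clipping form used here, where samples inside the level become zero and samples outside keep only their excess, is an assumption drawn from common practice.

```python
import numpy as np


def preprocess_clip(frame, n_percent=30):
    """Step 106 (FIG. 5): clip at a level equal to n_percent% of the
    smaller of the peak absolute values in the first and last thirds
    of the frame, to reduce formant and high-frequency influence."""
    third = len(frame) // 3
    peak_front = np.max(np.abs(frame[:third]))   # value 501
    peak_back = np.max(np.abs(frame[-third:]))   # value 502
    c = (n_percent / 100.0) * min(peak_front, peak_back)  # level 503

    out = frame.astype(float).copy()
    out[np.abs(out) <= c] = 0.0   # low-level detail is removed
    out[out > c] -= c             # keep only the excess above the level
    out[out < -c] += c
    return out
```

The low-level portions that would dominate the zero-crossing behaviour near zero amplitude are removed, while the pitch-bearing peaks survive for the autocorrelation of step 107.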

It is known that, in general, Ψ(T) takes a fairly large value (for example 0.4 or more) when the signal is periodic. In the plane of the zero-crossing rate Z and the ratio Ψ(T) = φ(T)/φ(0), step 110 declares unvoiced any frame falling in region 402 of FIG. 4, that is, any frame satisfying the linear inequality Ψ(T) < aZ, where a is a statistically determined constant (for example 1.5). Step 111 determines the unvoiced region 403 of FIG. 4: frames with Ψ(T) = φ(T)/φ(0) at or below a fixed value (for example 0.3) are judged unvoiced, and the remaining region 404 of FIG. 4 is judged a voiced section.
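The decision chain of steps 104, 105, 110, and 111 can be sketched as one classification function. The thresholds are the example values quoted in the description (1/100, 1/3, a = 1.5, 0.3) and are illustrative, not claimed limits.

```python
# Example thresholds from the description (all illustrative):
ZCR_SILENT = 1 / 100    # step 104: at or below this, the frame is silent
ZCR_UNVOICED = 1 / 3    # step 105: at or above this, definitely unvoiced (region 401)
A = 1.5                 # step 110: statistically determined slope (region 402)
PSI_MIN = 0.3           # step 111: minimum periodicity for voiced (region 403)


def classify(z, psi):
    """Classify one frame from its zero-crossing rate z and the ratio
    psi = phi(T)/phi(0) of FIG. 4, returning 'silent', 'unvoiced', or 'voiced'."""
    if z <= ZCR_SILENT:
        return "silent"        # step 104
    if z >= ZCR_UNVOICED:
        return "unvoiced"      # step 105, region 401
    if psi < A * z:
        return "unvoiced"      # step 110, region 402
    if psi <= PSI_MIN:
        return "unvoiced"      # step 111, region 403
    return "voiced"            # remaining region 404
```

Note that for frames caught by steps 104 and 105 the autocorrelation ratio `psi` is never needed, which is exactly the computational saving the text describes.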

FIG. 6 is an example graph of the autocorrelation function φ(τ) against the delay time τ; it goes without saying that the delay τ = T (601 in FIG. 6) corresponding to the peak within the pitch-period search range is the fundamental period of the speech. As is clear from the above description, the present invention, by computing the autocorrelation function and zero-crossing rate of the speech waveform and combining them, makes possible highly accurate detection of the voiced, unvoiced, and silent sections of the excitation-source component in a short time, and high-quality synthesized speech can be obtained using this excitation component.
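The pitch extraction of FIG. 6 amounts to finding the peak of φ(τ) over the search range and forming the ratio Ψ(T) = φ(T)/φ(0). A minimal sketch, assuming a nonzero frame and an inclusive integer lag range:

```python
import numpy as np


def pitch_period(frame, lag_min, lag_max):
    """Find the delay tau = T that maximizes phi(tau) over the pitch
    search range (point 601 in FIG. 6); T is the fundamental period
    in samples. Also return psi = phi(T)/phi(0) used in FIG. 4."""
    x = frame.astype(float)
    phi0 = float(np.dot(x, x))                      # phi(0), zero-delay value
    phis = [float(np.dot(x[:-k], x[k:]))            # phi(k) for each lag k
            for k in range(lag_min, lag_max + 1)]
    best = int(np.argmax(phis))
    T = lag_min + best
    return T, phis[best] / phi0
```

For a periodic waveform the peak falls at the period, e.g. a signal repeating every 4 samples yields T = 4 with a large ratio, landing it in the voiced region 404 of FIG. 4.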

As a further consequence, the method is also highly effective in that it can easily be incorporated into the excitation-source signal analysis section of existing speech analysis-synthesis equipment.

BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a flowchart of a speech-section determination method for a speech analysis-synthesis system according to one embodiment of the present invention; FIG. 2 is a waveform diagram showing zero crossings in a speech waveform; FIG. 3 is a waveform diagram of a frame containing both unvoiced and voiced sections; FIG. 4 is a conceptual diagram showing how unvoiced and voiced sections are decided in sequence according to the flowchart of FIG. 1; FIG. 5 is an explanatory diagram of the preprocessing that removes the influence of high-frequency components and formants from the speech signal; and FIG. 6 is an explanatory diagram of the method of finding the pitch period.

Claims (1)

CLAIMS 1. A speech-section determination method for a speech analysis-synthesis system, characterized in that: after a first-level clip that removes hum components and the like from the speech signal, the zero-crossing rate of the signal is obtained, and from its value the definitely unvoiced sections and silent sections are decided in advance; the remaining sections of the speech signal are then subjected to a second clipping process that reduces the influence of their high-frequency components and formants, after which the autocorrelation function φ(τ) is obtained; and voiced sections are determined by combining two parameters, namely the ratio Ψ(T) of its maximum value φ(T) to the value φ(0) of the autocorrelation function of the speech waveform at zero delay, and the zero-crossing rate, and applying certain thresholds. 2. A speech-section determination method for a speech analysis-synthesis system as claimed in claim 1, characterized in that, when voiced and unvoiced portions are mixed within a frame, the frame is divided and the largest of the zero-crossing rates among the divisions is substituted as the representative zero-crossing rate of that frame, so that unvoiced sections are not lost.
JP54157123A 1979-12-03 1979-12-03 Speech segment determination method using speech analysis and synthesis method Expired JPS5925237B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP54157123A JPS5925237B2 (en) 1979-12-03 1979-12-03 Speech segment determination method using speech analysis and synthesis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP54157123A JPS5925237B2 (en) 1979-12-03 1979-12-03 Speech segment determination method using speech analysis and synthesis method

Publications (2)

Publication Number Publication Date
JPS5678899A JPS5678899A (en) 1981-06-29
JPS5925237B2 true JPS5925237B2 (en) 1984-06-15

Family

ID=15642707

Family Applications (1)

Application Number Title Priority Date Filing Date
JP54157123A Expired JPS5925237B2 (en) 1979-12-03 1979-12-03 Speech segment determination method using speech analysis and synthesis method

Country Status (1)

Country Link
JP (1) JPS5925237B2 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6151039U (en) * 1984-09-07 1986-04-05
JPS6313699Y2 (en) * 1984-09-03 1988-04-18

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS5846396A (en) * 1981-09-16 1983-03-17 株式会社日立製作所 Voice signal detector
JPH06274198A (en) * 1993-03-17 1994-09-30 Fuji Xerox Co Ltd Sound processor


Also Published As

Publication number Publication date
JPS5678899A (en) 1981-06-29

Similar Documents

Publication Publication Date Title
Talkin et al. A robust algorithm for pitch tracking (RAPT)
CA2257298C (en) Non-uniform time scale modification of recorded audio
US7065485B1 (en) Enhancing speech intelligibility using variable-rate time-scale modification
Kubin et al. Performance of noise excitation for unvoiced speech
JP4906230B2 (en) A method for time adjustment of audio signals using characterization based on auditory events
US8489404B2 (en) Method for detecting audio signal transient and time-scale modification based on same
Quatieri et al. Phase coherence in speech reconstruction for enhancement and coding applications
JPH06161494A (en) Automatic extracting method for pitch section of speech
JPS5925237B2 (en) Speech segment determination method using speech analysis and synthesis method
Samad et al. Pitch detection of speech signals using the cross-correlation technique
JPS5925238B2 (en) Speech segment determination method using speech analysis and synthesis method
Ohtsuka et al. Aperiodicity control in ARX-based speech analysis-synthesis method
Banbrook et al. Dynamical modelling of vowel sounds as a synthesis tool
JP2588963B2 (en) Speech synthesizer
KR100359988B1 (en) real-time speaking rate conversion system
US20060149539A1 (en) Method for separating a sound frame into sinusoidal components and residual noise
JPS63143598A (en) Voice feature parameter extraction circuit
JP3308847B2 (en) Pitch waveform extraction reference position determination method and device
JPS60262200A (en) Expolation of spectrum parameter
JPS6217800A (en) Voice section decision system
Gil Moreno Speech/music audio classification for publicity insertion and DRM
JPS59149400A (en) Syllable boundary selection system
JPS58162999A (en) Drive wave extraction for voice synthesization
Sakurai Generalized envelope matching technique for time-scale modification of speech (GEM-TSM).
Faycal et al. Pitch modification of speech signal using source filter model by linear prediction for prosodic transformations