JPS5925237B2 - Speech segment determination method using speech analysis and synthesis method - Google Patents

Speech segment determination method using speech analysis and synthesis method

Info

Publication number
JPS5925237B2
JPS5925237B2 JP54157123A JP15712379A
Authority
JP
Japan
Prior art keywords
speech
zero
sections
crossing rate
determined
Prior art date
Legal status (assumed; not a legal conclusion): Expired
Application number
JP54157123A
Other languages
Japanese (ja)
Other versions
JPS5678899A (en)
Inventor
浩二 浮穴
Current Assignee (listed assignee may be inaccurate): Panasonic Mobile Communications Co Ltd
Original Assignee
Matsushita Communication Industrial Co Ltd
Application filed by Matsushita Communication Industrial Co Ltd filed Critical Matsushita Communication Industrial Co Ltd
Priority to JP54157123A priority Critical patent/JPS5925237B2/en
Publication of JPS5678899A publication Critical patent/JPS5678899A/en
Publication of JPS5925237B2 publication Critical patent/JPS5925237B2/en
Expired legal-status Critical Current

Landscapes

  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Description

DETAILED DESCRIPTION OF THE INVENTION The present invention relates to a method for rapidly determining the voiced sections, unvoiced sections, and silent sections of speech in a speech analysis-synthesis system.

In general, a speech analysis-synthesis system compresses speech information by first classifying the signal into voiced, unvoiced, and silent sections, and then extracting, for each section, the minimum amount of information that best represents it.

The method used to determine these sections is therefore an important problem in speech analysis-synthesis systems. In conventional speech analysis-synthesis equipment, for example equipment using the PARCOR method, a residual signal is formed by removing the frequency-spectrum envelope components, such as formants, from the speech signal; the modified correlation function, i.e. the autocorrelation function of that residual signal, is then computed, and the voiced/unvoiced decision is made from its maximum value together with the first-order PARCOR coefficient k1.

In practice these computations are usually carried out on a computer, but obtaining the residual signal and the modified correlation function requires considerable computation time. As one way of speeding up the processing, a method has been proposed that obtains the modified correlation function by a weighted moving-average operation in which a digital filter is applied to the autocorrelation function of the speech waveform, but this method too remains computationally expensive. The present invention determines the voiced, unvoiced, and silent sections of speech efficiently, quickly, and accurately by combining the zero-crossing rate with the autocorrelation function of a speech waveform that has undergone simple preprocessing, thereby remedying the computation-time drawback of the prior methods.

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a flowchart outlining the process. In the figure, 101 is the speech waveform data; this waveform is divided into frames of a fixed duration (for example 30 ms), and a voiced/unvoiced/silent decision is made for each frame. In step 102 the speech is clipped at the silent level, and any frame whose entire span lies within the silent level is declared a silent section and excluded from all further processing.

Ideally this silent level would be zero, but in practice it is set to a small nonzero level (for example ±3 on integer data scaled to ±2048) in order to cut out hum, offset drift in the A/D converter, and the like. Step 103 counts the number of zero crossings.
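The silent-level test of step 102 can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation; the ±3 level and the function name are assumptions taken from the example values in the text.

```python
import numpy as np

SILENT_LEVEL = 3  # example from the text: ±3 on integer samples scaled to ±2048


def is_silent_frame(frame, silent_level=SILENT_LEVEL):
    """Step 102: a frame whose every sample lies within ±silent_level
    is declared a silent section and skipped by all later processing."""
    return bool(np.all(np.abs(frame) <= silent_level))
```

A frame of low-level hum such as `[1, -2, 0]` is declared silent, while any sample exceeding the level makes the frame pass on to the zero-crossing stage.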

FIG. 2 illustrates this processing with an example of speech data: here there are 7 zero crossings in 22 data points, so the zero-crossing rate is 7/22. In unvoiced sections this rate rises, so it holds one key to the voiced/unvoiced decision. Because the frame length is long, a frame may straddle an unvoiced and a voiced portion when this rate is computed; FIG. 3 shows such a case, in which the left part of the frame is considered to lie in an unvoiced section and the right part in a voiced section. In Japanese, roughly 70% of speech is voiced and the remaining 30% is unvoiced or silent, and it is difficult to decide whether such a mixed frame should be labeled voiced or unvoiced. In the present invention such a frame is judged unvoiced, which emphasizes the unvoiced sound and prevents loss of intelligibility of unvoiced consonants. To that end, when the zero-crossing rate is computed, the frame is split in two, and the larger of the zero-crossing rates of the first and second halves is adopted as the representative value for the frame. Step 104 tests whether the zero-crossing rate is at or below a threshold for declaring a silent section.
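The half-frame zero-crossing computation described above can be sketched as below. The function names are assumptions; the rate is taken as crossings divided by samples, matching the 7/22 example of FIG. 2.

```python
import numpy as np


def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose signs differ,
    i.e. crossings / samples (FIG. 2: 7 crossings in 22 samples -> 7/22)."""
    s = np.sign(x)
    crossings = np.count_nonzero(s[:-1] * s[1:] < 0)
    return crossings / len(x)


def frame_zcr(frame):
    """Step 103: split the frame in two and adopt the larger half-frame
    rate as the representative value, so a frame that straddles an
    unvoiced/voiced boundary (FIG. 3) is biased toward unvoiced."""
    half = len(frame) // 2
    return max(zero_crossing_rate(frame[:half]),
               zero_crossing_rate(frame[half:]))
```

Taking the maximum of the two halves keeps a high-rate unvoiced half from being diluted by a low-rate voiced half.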

This threshold is a value (for example 1/100) low enough that a frame at or below it can be declared a silent section, avoiding the effects of hum and noise not removed by step 102. Step 105 declares as an unvoiced section any frame whose zero-crossing rate is at or above a value (for example 1/3) at which the frame can be judged unvoiced with certainty.

FIG. 4 plots the zero-crossing rate Z on the x axis and, on the y axis, the ratio Ψ(T) = φ(T)/φ(0) of the maximum value φ(T) of the autocorrelation function to its value φ(0) at zero delay. FIG. 4 is a graph explaining how each frame is classified, from the relationship between Z and Ψ(T), as a voiced section or an unvoiced section; the processing of step 105 corresponds to region 401 in FIG. 4. (In FIG. 4, unvoiced regions are marked UV and voiced regions V.) If a frame is voiced, the autocorrelation function must be computed to find the fundamental frequency of its speech; since that computation takes a long time, the zero-crossing rate is used as a preprocessing step to decide unvoiced sections first and so save as much of the computation as possible. That is, for a section that is definitely unvoiced or silent, the autocorrelation function is not computed at all, since no fundamental frequency is required for that section. Step 106 is applied only to sections not already decided silent or unvoiced by step 105; it reduces the influence of formants and removes the effect that high-frequency components in near-zero portions of the speech signal would otherwise have on the autocorrelation function.

The specific method is explained with FIG. 5, which shows the speech signal within one frame. As shown, the frame is divided into three parts; the maximum absolute value 501 in the first third and the maximum absolute value 502 in the last third are found, and the signal is clipped at a level 503 equal to N% (for example 30%) of the smaller value 502 of the two. After this processing, the autocorrelation function φ(τ) is computed in step 107, and when its value is extremely small (for example, when the maximum of a 300-point autocorrelation function of integer data scaled to ±2048 is 5 or less), step 108 declares the section silent. In step 109, the ratio Ψ(T) = φ(T)/φ(0) is computed from the maximum value φ(T) of the autocorrelation function within the pitch-frequency search range.
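The clipping preprocessing of step 106 might look like the sketch below. The patent specifies only the clip level 503 (N% of the smaller of the two peak values), not the exact clipper shape, so the center-clipping form used here, where samples inside the level become zero and samples outside keep only their excess, is an assumption drawn from common practice.

```python
import numpy as np


def preprocess_clip(frame, n_percent=30):
    """Step 106 (FIG. 5): clip at a level equal to n_percent% of the
    smaller of the peak absolute values in the first and last thirds
    of the frame, to reduce formant and high-frequency influence."""
    third = len(frame) // 3
    peak_front = np.max(np.abs(frame[:third]))   # value 501
    peak_back = np.max(np.abs(frame[-third:]))   # value 502
    c = (n_percent / 100.0) * min(peak_front, peak_back)  # level 503

    out = frame.astype(float).copy()
    out[np.abs(out) <= c] = 0.0   # low-level detail is removed
    out[out > c] -= c             # keep only the excess above the level
    out[out < -c] += c
    return out
```

The low-level portions that would dominate the zero-crossing behaviour near zero amplitude are removed, while the pitch-bearing peaks survive for the autocorrelation of step 107.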

It is known that, in general, Ψ(T) takes a fairly large value (for example 0.4 or more) when the signal is periodic. In the plane of the zero-crossing rate Z and the ratio Ψ(T) = φ(T)/φ(0), step 110 declares unvoiced any frame falling in region 402 of FIG. 4, that is, any frame satisfying the linear inequality Ψ(T) < aZ, where a is a statistically determined constant (for example 1.5). Step 111 determines the unvoiced region 403 of FIG. 4: frames with Ψ(T) = φ(T)/φ(0) at or below a fixed value (for example 0.3) are judged unvoiced, and the remaining region 404 of FIG. 4 is judged a voiced section.
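The decision chain of steps 104, 105, 110, and 111 can be sketched as one classification function. The thresholds are the example values quoted in the description (1/100, 1/3, a = 1.5, 0.3) and are illustrative, not claimed limits.

```python
# Example thresholds from the description (all illustrative):
ZCR_SILENT = 1 / 100    # step 104: at or below this, the frame is silent
ZCR_UNVOICED = 1 / 3    # step 105: at or above this, definitely unvoiced (region 401)
A = 1.5                 # step 110: statistically determined slope (region 402)
PSI_MIN = 0.3           # step 111: minimum periodicity for voiced (region 403)


def classify(z, psi):
    """Classify one frame from its zero-crossing rate z and the ratio
    psi = phi(T)/phi(0) of FIG. 4, returning 'silent', 'unvoiced', or 'voiced'."""
    if z <= ZCR_SILENT:
        return "silent"        # step 104
    if z >= ZCR_UNVOICED:
        return "unvoiced"      # step 105, region 401
    if psi < A * z:
        return "unvoiced"      # step 110, region 402
    if psi <= PSI_MIN:
        return "unvoiced"      # step 111, region 403
    return "voiced"            # remaining region 404
```

Note that for frames caught by steps 104 and 105 the autocorrelation ratio `psi` is never needed, which is exactly the computational saving the text describes.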

FIG. 6 is an example graph of the autocorrelation function φ(τ) against the delay time τ; it goes without saying that the delay τ = T (601 in FIG. 6) corresponding to the peak within the pitch-period search range is the fundamental period of the speech. As is clear from the above description, the present invention, by computing the autocorrelation function and zero-crossing rate of the speech waveform and combining them, makes possible highly accurate detection of the voiced, unvoiced, and silent sections of the excitation-source component in a short time, and high-quality synthesized speech can be obtained using this excitation component.
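The pitch extraction of FIG. 6 amounts to finding the peak of φ(τ) over the search range and forming the ratio Ψ(T) = φ(T)/φ(0). A minimal sketch, assuming a nonzero frame and an inclusive integer lag range:

```python
import numpy as np


def pitch_period(frame, lag_min, lag_max):
    """Find the delay tau = T that maximizes phi(tau) over the pitch
    search range (point 601 in FIG. 6); T is the fundamental period
    in samples. Also return psi = phi(T)/phi(0) used in FIG. 4."""
    x = frame.astype(float)
    phi0 = float(np.dot(x, x))                      # phi(0), zero-delay value
    phis = [float(np.dot(x[:-k], x[k:]))            # phi(k) for each lag k
            for k in range(lag_min, lag_max + 1)]
    best = int(np.argmax(phis))
    T = lag_min + best
    return T, phis[best] / phi0
```

For a periodic waveform the peak falls at the period, e.g. a signal repeating every 4 samples yields T = 4 with a large ratio, landing it in the voiced region 404 of FIG. 4.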

As a further consequence, the method is also highly effective in that it can easily be incorporated into the excitation-source signal analysis section of existing speech analysis-synthesis equipment.

BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a flowchart of a speech-section determination method for a speech analysis-synthesis system according to one embodiment of the present invention; FIG. 2 is a waveform diagram showing zero crossings in a speech waveform; FIG. 3 is a waveform diagram of a frame containing both unvoiced and voiced sections; FIG. 4 is a conceptual diagram showing how unvoiced and voiced sections are decided in sequence according to the flowchart of FIG. 1; FIG. 5 is an explanatory diagram of the preprocessing that removes the influence of high-frequency components and formants from the speech signal; and FIG. 6 is an explanatory diagram of the method of finding the pitch period.

Claims (1)

CLAIMS 1. A speech-section determination method for a speech analysis-synthesis system, characterized in that: after a first-level clip that removes hum components and the like from the speech signal, the zero-crossing rate of the signal is obtained, and from its value the definitely unvoiced sections and silent sections are decided in advance; the remaining sections of the speech signal are then subjected to a second clipping process that reduces the influence of their high-frequency components and formants, after which the autocorrelation function φ(τ) is obtained; and voiced sections are determined by combining two parameters, namely the ratio Ψ(T) of its maximum value φ(T) to the value φ(0) of the autocorrelation function of the speech waveform at zero delay, and the zero-crossing rate, and applying certain thresholds. 2. A speech-section determination method for a speech analysis-synthesis system as claimed in claim 1, characterized in that, when voiced and unvoiced portions are mixed within a frame, the frame is divided and the largest of the zero-crossing rates among the divisions is substituted as the representative zero-crossing rate of that frame, so that unvoiced sections are not lost.
JP54157123A 1979-12-03 1979-12-03 Speech segment determination method using speech analysis and synthesis method Expired JPS5925237B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP54157123A JPS5925237B2 (en) 1979-12-03 1979-12-03 Speech segment determination method using speech analysis and synthesis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP54157123A JPS5925237B2 (en) 1979-12-03 1979-12-03 Speech segment determination method using speech analysis and synthesis method

Publications (2)

Publication Number Publication Date
JPS5678899A JPS5678899A (en) 1981-06-29
JPS5925237B2 true JPS5925237B2 (en) 1984-06-15

Family

ID=15642707

Family Applications (1)

Application Number Title Priority Date Filing Date
JP54157123A Expired JPS5925237B2 (en) 1979-12-03 1979-12-03 Speech segment determination method using speech analysis and synthesis method

Country Status (1)

Country Link
JP (1) JPS5925237B2 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6151039U (en) * 1984-09-07 1986-04-05
JPS6313699Y2 (en) * 1984-09-03 1988-04-18

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS5846396A (en) * 1981-09-16 1983-03-17 株式会社日立製作所 Voice signal detector
JPH06274198A (en) * 1993-03-17 1994-09-30 Fuji Xerox Co Ltd Sound processor


Also Published As

Publication number Publication date
JPS5678899A (en) 1981-06-29

Similar Documents

Publication Publication Date Title
Talkin et al. A robust algorithm for pitch tracking (RAPT)
CA2257298C (en) Non-uniform time scale modification of recorded audio
US7065485B1 (en) Enhancing speech intelligibility using variable-rate time-scale modification
Kubin et al. Performance of noise excitation for unvoiced speech
JP4906230B2 (en) A method for time adjustment of audio signals using characterization based on auditory events
US8489404B2 (en) Method for detecting audio signal transient and time-scale modification based on same
Quatieri et al. Phase coherence in speech reconstruction for enhancement and coding applications
JPH06161494A (en) Automatic extracting method for pitch section of speech
JPS5925237B2 (en) Speech segment determination method using speech analysis and synthesis method
Samad et al. Pitch detection of speech signals using the cross-correlation technique
JPS5925238B2 (en) Speech segment determination method using speech analysis and synthesis method
Ohtsuka et al. Aperiodicity control in ARX-based speech analysis-synthesis method
Banbrook et al. Dynamical modelling of vowel sounds as a synthesis tool
JP2588963B2 (en) Speech synthesizer
KR100359988B1 (en) real-time speaking rate conversion system
US20060149539A1 (en) Method for separating a sound frame into sinusoidal components and residual noise
JPS63143598A (en) Voice feature parameter extraction circuit
JP3308847B2 (en) Pitch waveform extraction reference position determination method and device
JPS60262200A (en) Expolation of spectrum parameter
JPS6217800A (en) Voice section decision system
Gil Moreno Speech/music audio classification for publicity insertion and DRM
JPS59149400A (en) Syllable boundary selection system
JPS58162999A (en) Drive wave extraction for voice synthesization
Sakurai Generalized envelope matching technique for time-scale modification of speech (GEM-TSM).
Faycal et al. Pitch modification of speech signal using source filter model by linear prediction for prosodic transformations