JPH09179586A

JPH09179586A - Setting method for voice pitch mark

Info

Publication number: JPH09179586A
Application number: JP7333852A
Authority: JP
Inventors: Yukio Tabei; 幸雄田部井
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1995-12-22
Filing date: 1995-12-22
Publication date: 1997-07-11
Anticipated expiration: 2015-12-22
Also published as: JP3358139B2

Abstract

PROBLEM TO BE SOLVED: To set speech pitch marks automatically and at accurate positions. SOLUTION: Original speech is input (step S101) and its pitch period is extracted (step S102). The fundamental wave component is extracted by passing the speech waveform through a low-pass filter whose cutoff frequency is a prescribed constant multiple of the reciprocal of the detected pitch period (step S103). Next, the time value corresponding to a local maximum of this fundamental wave component is calculated (step S104), and that time value is corrected for the group delay of the filter to obtain a pitch mark candidate point (step S105). The local maximum of the speech waveform near this candidate point is then found (step S106), and the time value corresponding to it is set as the pitch mark (step S107).

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to a voice pitch mark setting method for setting pitch marks on the waveform of input speech.

[0002]

[Prior art] Conventionally, a text-to-speech conversion device that outputs text as speech comprises a text analysis unit that analyzes the input text, a prosody control unit, and a speech synthesis unit.

[0003] The text analysis unit performs morphological analysis on text containing a mixture of kanji and kana, entered from a keyboard or the like, determines the reading, accent, and intonation, and outputs the result as an intermediate language (a phonetic symbol string).

[0004] The prosody control unit sets the pitch frequency pattern, the phoneme durations, and so on. The speech synthesis unit then applies linear prediction, a waveform-based method, or the like, and synthesizes and outputs speech based on the intermediate language, the pitch frequency pattern, the phoneme durations, and so on.

[0005] Linear prediction is widely used because it can handle vocal tract information and sound source information separately and is easy to control. However, the vocal tract information and the sound source information of speech are inherently interrelated. Because linear prediction models the sound source as impulses and white noise, it is a cause of degradation in the synthesized speech.

[0006] In recent years, using the residual or the like as the sound source information has been considered as an improvement. However, the interrelation between the vocal tract information and the sound source information produces a mismatch between the residual and the spectrum, and this too degrades the synthesized speech.

[0007] Methods have therefore been considered that do not separate the vocal tract information from the sound source information and that use the original speech waveform as-is at synthesis time, so that the synthesized speech does not degrade. FIG. 4 is a flowchart explaining a conventional speech synthesis method that uses the original speech waveform.

[0008] That is, the original speech is input in step S401 and A/D converted in step S402. In step S403, pitch marks are then set, visually or automatically, at the local maximum positions of the original speech waveform. In step S404, the original speech with its pitch marks is stored in a file as phoneme segment waveforms.

[0009] Next, when speech is synthesized, a target pitch frequency pattern is input. In step S405, a time window is applied to the phoneme segment waveforms stored in the waveform segment file, and in step S406 the windowed waveforms are superimposed. The superposition uses a time window function matched to the target pitch period: each waveform is cut out by multiplying the phoneme segment waveform by the window, with the pitch mark position at the center of the window, and the cut-out waveforms are then overlap-added in accordance with the target pitch frequency pattern.
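The windowing and superposition of steps S405 and S406 can be illustrated with a minimal sketch. It assumes a Hann window, integer sample indices, and a constant target period; the function name `overlap_add` and its parameters are hypothetical, since the source does not specify the window shape or the data layout.

```python
import math


def overlap_add(waveform, pitch_marks, target_period, half_width):
    """Cut one windowed segment per pitch mark and re-space the segments.

    waveform      : list of samples (a stored phoneme segment waveform)
    pitch_marks   : sample indices of the pitch marks
    target_period : desired spacing between segments in samples (assumed constant)
    half_width    : half-length of the time window in samples
    """
    out_len = target_period * (len(pitch_marks) - 1) + 2 * half_width + 1
    out = [0.0] * out_len
    for k, mark in enumerate(pitch_marks):
        centre = half_width + k * target_period  # output position of this segment
        for off in range(-half_width, half_width + 1):
            src = mark + off
            if 0 <= src < len(waveform):
                # Hann window centred on the pitch mark (window shape is an assumption).
                w = 0.5 * (1.0 + math.cos(math.pi * off / (half_width + 1)))
                out[centre + off] += w * waveform[src]
    return out
```

Because each window is centred on a pitch mark, a mark placed away from the true waveform peak shifts the whole segment, which is why the accuracy of the marks matters for the synthesized result.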

[0010] Then, in step S407, the superimposed synthesized waveform is D/A converted, and in step S408 the synthesized speech is output as an analog signal.

[0011] One technique for automatically locating the local maxima of the original speech waveform in step S403 is the speech synthesis method and apparatus described in Japanese Patent Laid-Open No. 5-217337. In that technique, the original speech waveform is passed through a low-pass filter whose cutoff frequency is at or above the fundamental frequency, namely about 256 Hz, and the local maxima of the filtered waveform are set as pitch marks.

[0012]

[Problem to be solved by the invention] However, when the local maxima of the original speech waveform are extracted and the pitch marks are set visually, the sheer quantity of work demands great effort, and it is very difficult to set the pitch marks to a consistent standard. The pitch period is about 3 to 4 msec for female speech and about 6 to 10 msec for male speech, so accurately extracting the local maximum in every period and setting a pitch mark there is very difficult.

[0013] When pitch marks are set automatically with the technique described in Japanese Patent Laid-Open No. 5-217337, the local maxima of the waveform after the low-pass filter do not necessarily coincide with the local maxima of the original speech waveform. This mismatch produces roughness and noise in the synthesized speech and is a cause of its degradation.

[0014]

[Means for solving the problem] The present invention is a voice pitch mark setting method devised to solve these problems. In this method, the pitch period is first detected for each frame of the voiced sound waveform of the input speech. The voiced sound waveform is then passed through a low-pass filter whose cutoff frequency is the reciprocal of the detected pitch period multiplied by a predetermined constant, and the fundamental wave component is extracted. Next, the time value corresponding to a local maximum of the fundamental wave component is calculated, and that time value is corrected for the group delay of the low-pass filter to give a pitch mark candidate point. Finally, the local maximum of the voiced sound waveform near the candidate point is found, and the time value corresponding to it is set as the pitch mark.

[0015] In this voice pitch mark setting method, the cutoff frequency of the low-pass filter used to extract the fundamental wave component of the input speech is changed for each frame of the voiced sound waveform, so the fundamental wave component can be extracted stably while tracking changes in the pitch frequency. In addition, because the group delay of the low-pass filter is corrected and both the local maxima of the fundamental wave component and the local maxima of the voiced sound waveform of the input speech are taken into account, the correct local maximum of the voiced sound waveform, as guided by the fundamental wave component, can be set as the pitch mark.

[0016]

[Embodiments of the invention] Embodiments of the voice pitch mark setting method of the present invention are described below with reference to the drawings. FIG. 1 is a flowchart explaining the first embodiment of the voice pitch mark setting method of the present invention. The first embodiment is applied mainly when pitch marks are set at the maximum-value positions of the original speech waveform in order to store in a file the phoneme segment waveforms used for speech output such as speech synthesis. In the first embodiment, pitch marks are set only on the voiced sound waveform portions of the input original speech.

[0017] First, the original speech is input in step S101. Then, in step S102, the pitch period is extracted for each frame of the voiced sound waveform of the input original speech. The pitch period is extracted with the cepstrum method shown in FIG. 2.

[0018] In the cepstrum method, a time waveform is first input in step S201 of FIG. 2 and windowed in step S202. The windowed time waveform is then subjected to the discrete Fourier transform (DFT) in step S203, and in step S204 the square root of the sum of the squares of its real and imaginary parts, that is, its magnitude, is log-transformed. Applying the inverse discrete Fourier transform (IDFT) in step S205 then yields the cepstrum component output in step S206.
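Steps S202 to S205 can be sketched as follows. This is a minimal illustration using a naive DFT from the standard library; the Hamming window and the small `eps` guard against `log(0)` are assumptions, as the source names neither the window nor any numerical safeguards, and the function names are hypothetical.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform, or its inverse."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(frame, eps=1e-12):
    """Window -> DFT -> log magnitude -> IDFT (steps S202 to S205)."""
    n = len(frame)
    # Hamming window (an assumption; the source does not name the window).
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = dft(windowed)
    # abs() of a complex number is sqrt(re^2 + im^2), as in step S204.
    log_mag = [math.log(abs(c) + eps) for c in spectrum]
    return [c.real for c in dft(log_mag, inverse=True)]
```

For a periodic frame, the resulting sequence shows a peak at the index equal to the period in samples; a practical implementation would use an FFT rather than this quadratic-time transform.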

[0019] In other words, the cepstrum method converts a convolution into an addition. When the input speech waveform is a voiced sound waveform with pitch period T₀, the sound source component appears as a component near T₀, and the vocal tract component appears as a component in the short-time region.

[0020] In this embodiment, a pitch period range of 5.0 to 11.67 msec for male speech and 2.5 to 5.83 msec for female speech is set in advance. The time value at the peak of the cepstrum component within this range is extracted and taken as the pitch period T₀.

[0021] When the pitch period T₀ is extracted, a quadratic curve is fitted to three points, namely the peak of the cepstrum component and the points on either side of it, and T₀ is obtained from the fitted curve. More than three points may be used for this fit.
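The range-limited peak search and the three-point quadratic fit can be sketched as below. The sampling rate `fs` is an assumption (the source does not state one), the function name is hypothetical, and the closed-form vertex of a parabola through three equally spaced points is used for the fit.

```python
import math

# Pitch period search ranges from the text, in seconds.
MALE_RANGE = (5.0e-3, 11.67e-3)
FEMALE_RANGE = (2.5e-3, 5.83e-3)


def pitch_period_from_cepstrum(cepstrum, fs, t_min, t_max):
    """Return the pitch period T0 in seconds from a cepstrum sequence.

    cepstrum     : list of cepstrum values indexed by sample lag
    fs           : sampling rate in Hz (assumed known; not given in the source)
    t_min, t_max : search range for the pitch period in seconds
    """
    lo = max(1, int(math.ceil(t_min * fs)))
    hi = min(len(cepstrum) - 2, int(math.floor(t_max * fs)))
    peak = max(range(lo, hi + 1), key=lambda i: cepstrum[i])
    # Fit a parabola through the peak and its two neighbours and take its vertex.
    y0, y1, y2 = cepstrum[peak - 1], cepstrum[peak], cepstrum[peak + 1]
    denom = y0 - 2.0 * y1 + y2
    offset = 0.0 if denom == 0 else 0.5 * (y0 - y2) / denom
    return (peak + offset) / fs
```

The interpolation lets T₀ fall between sample lags, which matters because the low-pass cutoff in the next step is derived directly from 1/T₀.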

[0022] Next, in step S103 of FIG. 1, the reciprocal of the pitch period T₀ is multiplied by a predetermined constant c for each frame, a low-pass filter with cutoff frequency f_c = c/T₀ is set up, and the voiced sound waveform is passed through this filter to extract the fundamental wave component. The constant c is set to a value of at least 1 and less than 2, with about 1.1 being preferable. The low-pass filter is preferably an FIR filter, which can have linear phase and therefore does not distort the waveform in time.
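One way to realize a linear-phase FIR low-pass filter with f_c = c/T₀ is a Hamming-windowed sinc design, sketched below. The design method, the tap count of 201, and all function names are assumptions for illustration; the source only asks for a linear-phase FIR filter and a constant c of about 1.1.

```python
import math


def lowpass_for_pitch(t0, fs, c=1.1, num_taps=201):
    """Design a symmetric (linear-phase) FIR low-pass with cutoff c / t0.

    t0       : pitch period of the frame in seconds
    fs       : sampling rate in Hz (assumed; not given in the source)
    c        : cutoff multiplier, 1 <= c < 2 per the text
    num_taps : odd filter length (assumed value)
    """
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd for an integer group delay")
    fc = (c / t0) / fs  # cutoff in cycles per sample
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalise to unity gain at DC


def fir_filter(x, taps):
    """Causal FIR filtering: y[n] = sum_k taps[k] * x[n - k]."""
    return [sum(h * x[n - k] for k, h in enumerate(taps) if n - k >= 0)
            for n in range(len(x))]


def magnitude_response(taps, freq_hz, fs):
    """Magnitude of the filter's frequency response at freq_hz."""
    w = 2.0 * math.pi * freq_hz / fs
    re = sum(h * math.cos(w * n) for n, h in enumerate(taps))
    im = sum(h * math.sin(w * n) for n, h in enumerate(taps))
    return math.hypot(re, im)
```

With a cutoff this close to the fundamental, the tap count governs how sharply the second harmonic is rejected, so a short filter may leave harmonic energy in the extracted component.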

[0023] Next, in the local peak time extraction of step S104, local maxima are found in the fundamental wave component output by the low-pass filter, and the corresponding time values are calculated. Then, in step S105, the time values of the local maxima found in step S104 are corrected for the group delay of the low-pass filter, and the corrected time values are taken as pitch mark candidate points.
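Steps S104 and S105 can be sketched as follows, assuming the low-pass filter is a symmetric FIR filter with an odd number of taps N applied causally, in which case its group delay is (N - 1)/2 samples at every frequency. The function names are hypothetical.

```python
def local_maxima(x):
    """Indices i where x rises into i and does not rise after it."""
    return [i for i in range(1, len(x) - 1) if x[i - 1] < x[i] >= x[i + 1]]


def group_delay_samples(num_taps):
    """Group delay of a symmetric FIR filter with an odd number of taps."""
    if num_taps % 2 == 0:
        raise ValueError("expected an odd number of taps")
    return (num_taps - 1) // 2


def pitch_mark_candidates(fundamental, num_taps):
    """Local-maximum times of the filtered signal, shifted back by the filter delay."""
    delay = group_delay_samples(num_taps)
    return [i - delay for i in local_maxima(fundamental) if i - delay >= 0]
```

Without this shift, every candidate would sit (N - 1)/2 samples later than the corresponding feature in the unfiltered waveform.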

[0024] Taken as they are, these pitch mark candidate points fluctuate and do not necessarily coincide with the local maxima of the original speech waveform. Therefore, in step S106, the local maximum of the voiced sound waveform (the original speech waveform) near the time value of each pitch mark candidate point is extracted, and in step S107 this local maximum is adopted as the final pitch mark. This removes the fluctuation of the candidate points and allows pitch marks to be set accurately on the voiced sound waveform.
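The refinement of steps S106 and S107 can be sketched as a search for the largest sample of the original waveform within a small window around each candidate. The window size `search_radius` is an assumption, since the source says only "near" the candidate point, and the function name is hypothetical.

```python
def refine_pitch_marks(waveform, candidates, search_radius):
    """Move each candidate to the largest original-waveform sample nearby.

    waveform      : list of samples of the voiced (original) waveform
    candidates    : candidate sample indices after group delay correction
    search_radius : how many samples to look on each side (assumed parameter)
    """
    marks = []
    for c in candidates:
        lo = max(0, c - search_radius)
        hi = min(len(waveform) - 1, c + search_radius)
        marks.append(max(range(lo, hi + 1), key=lambda i: waveform[i]))
    return marks
```

A radius well below half the pitch period keeps each search within a single period, so that a candidate cannot be pulled onto the peak of a neighbouring period.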

[0025] Next, the second embodiment of the voice pitch mark setting method of the present invention is described. FIG. 3 is a flowchart explaining the second embodiment. Like the first embodiment, the second embodiment is applied mainly when pitch marks are set at the maximum-value positions of the original speech waveform in order to store in a file the phoneme segment waveforms used for speech output such as speech synthesis. It differs in that voiced and unvoiced sound waveforms are discriminated within the input original speech before processing.

[0026] In particular, the second embodiment is characterized in that it discriminates voiced from unvoiced sound components by exploiting a property of the cepstrum component calculated during pitch period extraction: because the sound source of an unvoiced sound component is random, no sharp peak corresponding to the pitch appears in its cepstrum component.

[0027] First, the original speech is input in step S301. Then, in step S302, the pitch period is extracted for each frame of the input original speech. As in the first embodiment, the pitch period is extracted with the cepstrum method shown in FIG. 2.

[0028] Next, in step S303, the cepstrum component is used to determine whether the original speech is a voiced sound component. For example, if the peak of the cepstrum component is at or above a certain threshold, the frame is judged to be a voiced sound component; if the peak is below the threshold, the frame is judged to be an unvoiced sound component.

[0029] In other words, an unvoiced sound component shows no sharp peak corresponding to the pitch in its cepstrum component because its sound source is random, whereas a voiced sound component shows a sharp peak. Exploiting this, a predetermined threshold test on the peak determines whether the frame in question is a voiced or an unvoiced sound component.
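The threshold test of step S303 reduces to a comparison on the cepstrum peak within the pitch search range, as in the minimal sketch below. The threshold value is left as a parameter because the source does not give one, and the function name and index-based range are hypothetical.

```python
def is_voiced_frame(cepstrum, lo, hi, threshold):
    """Judge a frame voiced if its cepstrum peak in [lo, hi] reaches the threshold.

    cepstrum  : list of cepstrum values indexed by sample lag
    lo, hi    : inclusive index range corresponding to plausible pitch periods
    threshold : decision level (assumed parameter; not specified in the source)
    """
    return max(cepstrum[lo:hi + 1]) >= threshold
```

Since the cepstrum is already computed for pitch extraction, this decision adds almost no extra work per frame.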

[0030] If the frame is judged to be an unvoiced sound component, the determination in step S303 is No and processing returns to step S301. If the frame is judged to be a voiced sound component, the determination in step S303 is Yes and processing proceeds to step S304. The processing from step S304 onward is the same as in the first embodiment.

[0031] That is, in step S304, the reciprocal of the pitch period T₀ is multiplied by a predetermined constant c for each frame, a low-pass filter with cutoff frequency f_c = c/T₀ is set up, and the voiced sound waveform is passed through this filter to extract the fundamental wave component. The constant c is set to a value of at least 1 and less than 2, with about 1.1 being preferable. The low-pass filter is preferably an FIR filter, which can have linear phase and therefore does not distort the waveform in time.

[0032] In the local peak time extraction of step S305, local maxima are found in the fundamental wave component output by the low-pass filter, and the corresponding time values are calculated. In step S306, the time values of the local maxima found in step S305 are corrected for the group delay of the low-pass filter, and the corrected time values are taken as pitch mark candidate points.

[0033] Further, in step S307, the local maximum of the voiced sound waveform (the original speech waveform) near the time value of each pitch mark candidate point is extracted, which removes the fluctuation of the candidate points, and in step S308 this local maximum is adopted as the final pitch mark.

[0034] In the second embodiment, the voiced and unvoiced sound components of the original speech input do not need to be separated in advance, so the phoneme segment waveform file can be created efficiently.

[0035] The voice pitch mark setting method of either embodiment can be applied as the pitch mark setting method used in the speech synthesis unit of a text-to-speech conversion device. It can also be applied to processing in various other speech output devices, such as pitch mark setting in a voice pitch conversion device that changes the pitch of original speech.

[0036]

[Effect of the invention] As described above, the voice pitch mark setting method of the present invention has the following effects. Because the cutoff frequency of the low-pass filter is changed for each frame when the fundamental wave is extracted, a stable fundamental wave component can be extracted while tracking changes in the pitch frequency, and its local maxima can be detected accurately. In addition, because the correct local maximum of the voiced sound waveform is detected by taking into account both the local maxima of the fundamental wave component and the local maxima of the original speech waveform, accurate pitch marks can be set automatically. This makes it possible to output synthesized speech that is of high quality at the waveform level.

[Brief description of the drawings]

FIG. 1 is a flowchart illustrating the first embodiment of the present invention.

FIG. 2 is a flowchart illustrating the cepstrum method.

FIG. 3 is a flowchart illustrating the second embodiment of the present invention.

FIG. 4 is a flowchart illustrating a conventional example.

Claims

[Claims]

1. A voice pitch mark setting method comprising: a step of detecting a pitch period for each frame of a voiced sound waveform of input speech; a step of passing the voiced sound waveform through a low-pass filter whose cutoff frequency is a value obtained by multiplying the reciprocal of the detected pitch period by a predetermined constant, and extracting a fundamental wave component; a step of calculating a time value corresponding to a local maximum of the fundamental wave component and correcting that time value for the group delay of the low-pass filter to calculate a pitch mark candidate point; and a step of calculating a local maximum of the voiced sound waveform in the vicinity of the pitch mark candidate point and setting the time value corresponding to that local maximum as a pitch mark.

2. A voice pitch mark setting method comprising: a step of detecting a pitch period for each frame of input speech and, when detecting the pitch period, discriminating between a voiced sound waveform and an unvoiced sound waveform of the input speech; a step of passing the voiced sound waveform through a low-pass filter whose cutoff frequency is a value obtained by multiplying the reciprocal of the detected pitch period by a predetermined constant, and extracting a fundamental wave component; a step of calculating a time value corresponding to a local maximum of the fundamental wave component and correcting that time value for the group delay of the low-pass filter to calculate a pitch mark candidate point; and a step of calculating a local maximum of the voiced sound waveform in the vicinity of the pitch mark candidate point and setting the time value corresponding to that local maximum as a pitch mark.

3. The voice pitch mark setting method according to claim 2, wherein the voiced sound waveform and the unvoiced sound waveform of the input speech are discriminated based on a comparison between a cepstrum component of the waveform of the input speech and a predetermined threshold.

4. The voice pitch mark setting method according to any one of claims 1 to 3, wherein the pitch period for each frame is detected using interpolation based on the peak value of a cepstrum component of the waveform of the input speech and the values around that peak value.