JP3030869B2

JP3030869B2 - Sound source data generation method for speech synthesizer

Info

Publication number: JP3030869B2
Application number: JP2408727A
Authority: JP
Inventors: 清石田; 喜正沢田
Original assignee: Meidensha Corp
Current assignee: Meidensha Corp
Priority date: 1990-12-28
Filing date: 1990-12-28
Publication date: 2000-04-10
Anticipated expiration: 2015-04-10
Also published as: JPH04253100A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Industrial Field of Application] The present invention relates to a speech synthesizer based on the rule-based speech synthesis method, and more particularly to a method of generating sound source data.

[0002]

[Prior Art] A speech synthesizer based on the rule-based speech synthesis method divides an input character string into words and phrases by syntactic analysis and determines the intonation and accent of each. It then decomposes the words and phrases into syllables and further into phonemes, obtains the sound source wave and the articulation filter parameters for each syllable or phoneme, and obtains synthesized speech as the response output of the articulation filter to the sound source wave.

[0003] Such speech synthesizers use either impulses and noise as the sound source information, or residual information. In the residual-based method, the speech waveform is subjected to linear prediction analysis to obtain articulation parameters, the speech waveform is input to an articulation filter based on these parameters, and a residual waveform is obtained at its output. This residual waveform is sampled and encoded to serve as the sound source information. When a speech waveform is cut out, the original waveform is multiplied by a window function (Hamming window, Hanning window, etc.) so that no abrupt change occurs at either end of the cut-out section.
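The residual-based analysis described above can be illustrated with a minimal sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion, which the text does not specify; the function names `hamming`, `lpc` and `residual`, the predictor order, and the test signal are all hypothetical choices made for illustration.

```python
import math

def hamming(n):
    """Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def autocorr(x, max_lag):
    """Autocorrelation values r[0..max_lag] of a finite sequence."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def lpc(frame, order):
    """Predictor coefficients a[1..order] (a[0] unused) by Levinson-Durbin.

    Model: x[n] is approximated by sum_k a[k] * x[n-k].
    The frame is Hamming-windowed before the autocorrelation is taken.
    """
    windowed = [s * w for s, w in zip(frame, hamming(len(frame)))]
    r = autocorr(windowed, order)
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: leave remaining coefficients at 0
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a

def residual(frame, a):
    """Inverse-filter the frame: e[n] = x[n] - sum_k a[k] * x[n-k]."""
    order = len(a) - 1
    out = []
    for n in range(len(frame)):
        pred = sum(a[k] * frame[n - k] for k in range(1, order + 1) if n - k >= 0)
        out.append(frame[n] - pred)
    return out

# Illustrative input: a pure sinusoid, which a second-order predictor models well.
frame = [math.sin(2.0 * math.pi * n / 20.0) for n in range(200)]
coeffs = lpc(frame, order=2)
res = residual(frame, coeffs)
print(sum(v * v for v in res) / sum(v * v for v in frame))  # residual-to-signal energy ratio
```

For real speech the residual retains a sharp peak near each glottal excitation, which is the feature the later paragraphs rely on.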

[0004]

[Problems to Be Solved by the Invention] In the conventional means of obtaining sound source information from a residual waveform, the cut-out reference point (pitch synchronization point) of each pitch waveform is extracted, on the basis of the speech waveform, from several candidate zero-crossing points. For example, in the utterance of "u" by a female voice shown in FIG. 4A, there are only two or three candidate zero-crossing points, and in a waveform whose pitch waveform has a small amplitude the zero-crossing point can be determined accurately.

[0005] However, in the utterance of "u" by a male voice shown in FIG. 4B, where there are many zero-crossing points and the waveform amplitude decays strongly within each pitch, the zero-crossing point near where the residual peak appears cannot be obtained accurately, and a point with a shifted phase is often selected as the reference. If the data is stored in the sound source file with the phase still shifted, waveform distortion occurs at synthesis time. In FIGS. 4A and 4B, a to e and a' to c' are the pitch synchronization points.
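To see why the number of zero-crossing candidates matters, the following sketch counts sign changes within one pitch period for two synthetic waveforms. The waveforms are invented for illustration and are not the ones in FIG. 4; `zero_crossings`, `simple` and `ringing` are hypothetical names.

```python
import math

def zero_crossings(x):
    """Indices i at which the sign changes between x[i-1] and x[i]."""
    return [i for i in range(1, len(x)) if (x[i - 1] < 0.0) != (x[i] < 0.0)]

period = 80  # samples in one pitch period (assumed)

# A period dominated by the fundamental: very few candidate points.
simple = [math.sin(2.0 * math.pi * (n + 0.5) / period) for n in range(period)]

# A period dominated by a decaying higher resonance: many candidate points,
# and the later ones belong to low-amplitude parts of the waveform.
ringing = [math.exp(-n / 30.0) * math.sin(2.0 * math.pi * (n + 0.5) / 10.0)
           for n in range(period)]

print(len(zero_crossings(simple)), len(zero_crossings(ringing)))
```

With many candidates per period, picking the one that lies next to the residual peak becomes ambiguous, which is the failure described for the male voice.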

[0006] When a female voice is analyzed by the method described above, the zero-crossing points are determined accurately, so the pitch synchronization means can divide the waveform accurately into individual pitch-unit waveforms. When a male voice is analyzed, however, the difference in pitch waveform makes the zero-crossing error large, and the division reference cannot be obtained at the correct position.

[0007] The present invention has been made in view of the above circumstances, and its object is to provide a sound source data generation method for a speech synthesizer that can determine an accurate peak position even for a waveform with many zero crossings and large waveform attenuation within a pitch.

[0008]

[Means for Solving the Problems] To achieve the above object, the present invention uses original waveform data obtained by sampling speech; calculates, over the entire section in which speech is present, all the pitch sections corresponding to the pitch period, that is, the repetition period of the speech waveform seen in voiced sections; analyzes the waveform of each frame to obtain a residual; calculates the autocorrelation of the residual selected for each frame; then, for each frame, finds the point on the autocorrelation pattern at which the correlation value near the pitch period is a local maximum and, using this value, detects the relative peak position of the residual waveform amplitude within the central pitch section of the frame; takes the frame with the largest energy in a vowel section as the reference frame for the residual peak position calculation, and calculates the peak position of the residual obtained by analyzing the central pitch waveform of this maximum-energy frame; next, obtains the correlation between adjacent frames going forward along the time axis, and then the correlation between adjacent frames going backward along the time axis; using the residual peak positions thus obtained for the voiced speech section, determines the pitch period for each frame; averages the pattern of the pitch period sequence over all the frames obtained; using this averaged pitch period sequence, determines new residual peak positions; compares each of these peak positions with the peak position of the actual residual waveform and, if an actual residual waveform peak exists near the peak position, shifts the peak position; analyzes again at the finely adjusted residual peak positions and extracts the residual amplitude maximum near the previous residual peak position as the final residual peak; and, with the phases of the peak points aligned, cuts out a residual waveform of one pitch period for each frame to generate the sound source data.
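The per-frame part of this procedure (autocorrelation of the residual, a maximum near the pitch period, then the peak inside the central pitch section) could be sketched as follows. This is an illustration under assumptions: the search window of plus or minus 20 percent, the use of the largest value in that window as the local maximum, and the names `refine_pitch_lag` and `central_peak_offset` are not from the source.

```python
def autocorr_at(x, lag):
    """Autocorrelation of x at a single lag."""
    return sum(x[i] * x[i + lag] for i in range(len(x) - lag))

def refine_pitch_lag(res, expected_lag, search=0.2):
    """Lag near expected_lag at which the residual autocorrelation is largest."""
    lo = max(1, int(expected_lag * (1.0 - search)))
    hi = min(len(res) - 1, int(expected_lag * (1.0 + search)) + 1)
    return max(range(lo, hi + 1), key=lambda k: autocorr_at(res, k))

def central_peak_offset(res, lag):
    """Offset of the largest |residual| inside the central pitch section.

    The frame is assumed to hold three pitch sections, so the central one
    covers samples [lag, 2 * lag).
    """
    section = res[lag:2 * lag]
    return max(range(len(section)), key=lambda i: abs(section[i]))

# Illustrative frame: an idealized residual with one impulse per 50-sample period.
frame_res = [0.0] * 150
for p in (10, 60, 110):
    frame_res[p] = 1.0

lag = refine_pitch_lag(frame_res, expected_lag=45)
print(lag, central_peak_offset(frame_res, lag))
```

The rough value `expected_lag` stands in for the pitch period obtained from the initial pitch-section detection.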

[0009]

[Operation] After the pitch sections are divided and detected from the original waveform, the residual of all frames is calculated. The residual peak positions are then extracted. After this extraction, the point of maximum residual is extracted as the peak position of the section. Meanwhile, cross-correlation and related calculations are performed between that section and the sections adjacent to it forward and backward along the time axis. From these calculations the peak positions are extracted one after another, the interval between neighbouring peaks is taken as the pitch, and the sequence of pitches is averaged. The peak positions are then extracted again and finely adjusted, and the final peak positions are extracted.

[0010]

[Embodiment] An embodiment of the present invention will be described below with reference to the drawings.

[0011] In FIG. 1, step S₁ is the process of dividing and detecting pitch sections from the original waveform shown in FIG. 2. When the reference point is detected in step S₁, the waveform itself is not used as the reference; instead, the residual of all frames is first calculated. This is the process of step S₂. When step S₂ is complete, step S₃ is performed. Step S₃ calculates the correlation between the residual sequences obtained for each pitch section. Based on the result of this correlation calculation, the residual peak positions are extracted in step S₄.
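The claim describes each frame as spanning three pitch sections, with the frame advanced by one pitch section at a time. A minimal sketch of that framing, assuming the pitch-section boundaries from step S₁ are available as a list of sample indices (the names `frames_of_three` and `central_section` are hypothetical):

```python
def frames_of_three(boundaries):
    """(start, end) sample ranges, each covering three consecutive pitch
    sections and advancing by one pitch section per frame."""
    return [(boundaries[i], boundaries[i + 3]) for i in range(len(boundaries) - 3)]

def central_section(boundaries, frame_index):
    """(start, end) of the middle pitch section of the given frame."""
    return boundaries[frame_index + 1], boundaries[frame_index + 2]

# Illustrative boundaries with slightly varying pitch periods.
boundaries = [0, 50, 101, 150, 202, 250]
print(frames_of_three(boundaries))
print(central_section(boundaries, 1))
```

The residual of step S₂ would then be computed once per frame over each of these ranges.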

[0012] Once the residual peak positions have been extracted, processing moves to step S₅. In step S₅, in the frame of maximum energy within a vowel section (which generally has strong peaks), the maximum point of the residual is extracted as the peak position of that section. In step S₆, the cross-correlation between that section and the neighbouring sections, forward and backward along the time axis, is calculated. From the width at which the correlation is maximal, the peak positions are determined one after another in step S₇ from FIG. 2, as shown in FIG. 3.
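Steps S₅ to S₇ could be sketched as follows. This is one possible reading, not the patented implementation: the comparison window `half`, the search tolerance `tol`, the stopping rule (stop when no positive correlation is found), and all function names are assumptions.

```python
def energy(x):
    return sum(v * v for v in x)

def reference_frame(frames):
    """Index of the frame with the largest energy (step S5)."""
    return max(range(len(frames)), key=lambda i: energy(frames[i]))

def best_shift(res, pos, lag, half, tol):
    """Shift near `lag` (negative to look backward) that maximizes the
    cross-correlation between the neighbourhood of `pos` and the
    neighbourhood of `pos + shift` (step S6). Returns (shift, score)."""
    def xcorr(shift):
        total = 0.0
        for i in range(pos - half, pos + half + 1):
            j = i + shift
            if 0 <= i < len(res) and 0 <= j < len(res):
                total += res[i] * res[j]
        return total
    shift = max(range(lag - tol, lag + tol + 1), key=xcorr)
    return shift, xcorr(shift)

def track(res, start, lag, half, tol):
    """Follow peaks in one direction until the signal or the correlation ends."""
    found, pos = [], start
    while True:
        shift, score = best_shift(res, pos, lag, half, tol)
        nxt = pos + shift
        if score <= 0.0 or not 0 <= nxt < len(res):
            return found
        found.append(nxt)
        pos = nxt

def track_peaks(res, start, lag, half=20, tol=5):
    """Peak positions in time order, grown outward from `start` (step S7)."""
    backward = track(res, start, -lag, half, tol)
    forward = track(res, start, lag, half, tol)
    return backward[::-1] + [start] + forward

# Illustrative residual with slightly irregular peak spacing.
res = [0.0] * 200
for p in (10, 58, 110, 160):
    res[p] = 1.0
print(track_peaks(res, start=110, lag=50))
```

Here `start` plays the role of Peak(M), the residual maximum in the maximum-energy frame, and `tol` must be smaller than `lag` so that each step moves in the intended direction.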

[0013] For example, as shown in the sound source waveform diagram of FIG. 3, once the peak positions have been determined as Peak(M) → Peak(M-1) → Peak(M-2) ... and Peak(M) → Peak(M+1) → Peak(M+2) ..., the intervals between neighbouring peaks are taken in step S₈ as the pitches P(1), P(2), .... The pitch sequence from step S₈ is averaged in step S₉ so that it changes smoothly. Next, the peaks are determined again in step S₁₀ using the averaged pitches P(1)′, P(2)′, .... After the peaks are determined, the peak positions are finely adjusted within their neighbourhood in step S₁₁; for example, if a larger peak exists nearby, the position is shifted to it. The final peak positions are then determined in step S₁₂. Based on these final peak positions, steps S₁₃ to S₁₅ are performed so that the phase is aligned in each pitch section and the result is stored in the sound source file. Step S₁₃ performs the cut-out based on the final peak positions, step S₁₄ cuts out and re-analyzes on the basis of these peak positions to generate the sound source data, and step S₁₅ aligns the phase in each pitch section and stores the data in the sound source file.
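Steps S₈ to S₁₃ can be illustrated with a short sketch. Several details are assumptions because the text does not fix them: a three-point moving average for the averaging, rebuilding the peaks from the first peak, a search radius of three samples for the fine adjustment, and a fixed `offset` that places every peak at the same index of its cut-out segment. All function names are hypothetical.

```python
def pitch_intervals(peaks):
    """Pitches P(1), P(2), ... as distances between neighbouring peaks (S8)."""
    return [b - a for a, b in zip(peaks, peaks[1:])]

def smooth(pitches, width=3):
    """Centred moving average; the window shrinks at both ends (S9)."""
    half = width // 2
    out = []
    for i in range(len(pitches)):
        window = pitches[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out

def rebuild_peaks(first_peak, pitches):
    """Peak positions implied by the averaged pitch sequence (S10)."""
    peaks, pos = [first_peak], float(first_peak)
    for p in pitches:
        pos += p
        peaks.append(int(round(pos)))
    return peaks

def fine_tune(res, peak, radius=3):
    """Move the peak to the largest |residual| within `radius` samples (S11)."""
    lo, hi = max(0, peak - radius), min(len(res), peak + radius + 1)
    return max(range(lo, hi), key=lambda i: abs(res[i]))

def cut_segments(res, peaks, pitches, offset):
    """One pitch period per peak, each starting `offset` samples before its
    peak so that all peaks share the same phase (S13)."""
    segments = []
    for peak, pitch in zip(peaks, pitches):
        start = peak - offset
        end = start + int(round(pitch))
        if start >= 0 and end <= len(res):
            segments.append(res[start:end])
    return segments

# Illustrative residual and tracked peaks.
res = [0.0] * 200
for p in (10, 58, 110, 160):
    res[p] = 1.0
raw = pitch_intervals([10, 58, 110, 160])
avg = smooth(raw)
tuned = [fine_tune(res, p) for p in rebuild_peaks(10, avg)]
segments = cut_segments(res, tuned, avg, offset=5)
print(raw, avg, tuned, [len(s) for s in segments])
```

The re-analysis of step S₁₄ and the file storage of step S₁₅ are left out; the segments returned here would be their input.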

[0014]

[Effects of the Invention] As described above, according to the present invention, an accurate peak position can always be obtained even for a waveform with many zero crossings and large waveform attenuation within a pitch.

[Brief description of the drawings]

FIG. 1 is a flowchart showing an embodiment of the present invention.

FIG. 2 is an original waveform diagram.

FIG. 3 is a sound source waveform diagram.

FIG. 4A is a female voice waveform diagram, and FIG. 4B is a male voice waveform diagram.

Continuation of the front page
(58) Fields searched (Int. Cl.⁷, DB name): G10L 11/00 - 21/06; JICST file (JOIS)

Claims

(57) [Claims]

1. In a method in which the speech waveform of a section extracted from speech as one unit of analysis is treated as a frame, the section corresponding to the pitch period, that is, the repetition period of the speech waveform, is called a pitch section, one frame, being the section from one point in time to another, is cut out as the speech waveform of three pitch sections, a Hamming window is applied to the cut-out waveform, speech parameters are obtained while the cut-out section is shifted little by little along the speech section, and the shift width is set so that one set of speech parameters is calculated for each pitch section, a sound source data generation method for a speech synthesizer, characterized by:
using original waveform data obtained by sampling speech, calculating, over the entire section in which speech is present, all the pitch sections corresponding to the pitch period, that is, the repetition period of the speech waveform seen in voiced sections;
analyzing the waveform of each frame to obtain a residual, and calculating the autocorrelation of the residual selected for each frame;
then, for each frame, finding the point on the autocorrelation pattern at which the correlation value near the pitch period is a local maximum and, using this value, detecting the relative peak position of the residual waveform amplitude within the central pitch section of the frame;
taking the frame with the largest energy in a vowel section as the reference frame for the residual peak position calculation, and calculating the peak position of the residual obtained by analyzing the central pitch waveform of this maximum-energy frame;
next, obtaining the correlation between adjacent frames going forward along the time axis, and further the correlation between adjacent frames going backward along the time axis;
using the residual peak positions thus obtained for the voiced speech section, determining the pitch period for each frame, and averaging the pattern of the pitch period sequence over all the frames obtained;
using this averaged pitch period sequence, determining new residual peak positions, then comparing each of these peak positions with the peak position of the actual residual waveform and, if an actual residual waveform peak exists near the peak position, shifting the peak position;
analyzing again at the finely adjusted residual peak positions, and extracting the residual amplitude maximum near the previous residual peak position as the final residual peak; and
with the phases of the peak points aligned, cutting out a residual waveform of one pitch period for each frame to generate the sound source data.