JPH10240295A

JPH10240295A - Voice synthesizing method and voice synthesizer

Info

Publication number: JPH10240295A
Application number: JP9047592A
Authority: JP
Inventors: Takeshi Iwaki; 健岩木
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1997-03-03
Filing date: 1997-03-03
Publication date: 1998-09-11

Abstract

PROBLEM TO BE SOLVED: To achieve high-quality speech synthesis by using the center of gravity of the data as the pitch mark, so that pitch marks with little fluctuation can be set with relatively simple processing. SOLUTION: The device comprises a text analysis unit 101, a word dictionary 102, a parameter generation unit 103, a speech signal input unit 104, a segment creation unit 105, a segment dictionary 106, a windowing unit 107, and a speech synthesis unit 108. The segment creation unit 105 detects the minimum value of the speech signal and adds an offset so that the minimum value becomes non-negative. A window of a prescribed length is prepared and applied to the offset-added data while its center is shifted; the area of the windowed data is calculated at each center position, and the time-axis coordinate of the window center that gives the maximum area is detected. The speech signal is then cut out centered on that time-axis coordinate, and the cut-out speech signals are windowed and overlap-added while being shifted by one pitch period at a time, with that coordinate as the center of superposition.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

Technical field of the invention: The present invention relates to a speech synthesis method and a speech synthesis apparatus that synthesize arbitrary speech by rule, and more particularly to a speech synthesis method and a speech synthesis apparatus that obtain synthesized speech by concatenating speech waveforms.

[0002]

Description of the related art: Speech synthesis schemes include those that record speech waveforms themselves and combine them to produce a speech waveform, and those that analyze speech into parameters representing its characteristics, record those parameters, and use a synthesizer at output time. Furthermore, depending on how the control section and the waveform generation section are combined, there are various speech synthesis schemes such as recording-and-editing, speech segment editing, parameter editing, and synthesis by rule.

[0003] Of these, speech segment editing synthesis (where a segment is mainly a speech waveform of one pitch period) allows pitch control and, depending on how the segments are chosen, can be extended to output arbitrary words. Parameter editing synthesis encodes and records, for each speech unit, the sound source parameters and the vocal transfer characteristic (spectrum) parameters, and obtains synthesized speech from the parameter time series. Synthesis by rule converts a sequence expressed in discrete symbols, such as characters or phonetic symbols, into continuous speech.

[0004] Text-to-speech synthesis converts sentences (text data) into speech, and the text usually does not correspond one-to-one with a phonetic transcription. Linguistic processing such as morphological analysis and syntactic analysis is therefore needed to convert the input text into a sequence of phonetic symbols and to generate prosodic features automatically.

[0005] Conventionally, a text-to-speech converter that outputs text as speech consists of a text analysis unit, a parameter generation unit, and a speech synthesis unit. The text analysis unit takes mixed kanji-kana text as input, performs morphological analysis with reference to a word dictionary, determines the reading, accent, and intonation, and outputs phonetic symbols with prosodic marks (an intermediate language). The parameter generation unit sets the pitch frequency pattern, phoneme durations, and so on, and the speech synthesis unit performs the speech synthesis processing.

[0006] Speech synthesis units formerly used linear prediction and similar methods. However, because vocal tract information and sound source information are inherently interrelated, treating them separately unavoidably degrades quality. In recent years, methods have therefore come into use that do not separate vocal tract information from sound source information and instead use the original speech waveform as is, yielding high-quality synthesized speech with little degradation.

[0007] One method that uses the speech waveform as is is described in F. J. Charpentier and M. G. Stella, "Diphone synthesis using an overlap-add technique for speech waveforms concatenation", Proc. Int. Conf. ASSP, 2015-2018, Tokyo, 1986. As shown in that paper, pitch marks (reference points) are attached to the speech waveform in advance, and segments are cut out centered on those positions. At synthesis time, the segments are overlap-added with the pitch mark positions shifted by the synthesis pitch period. This synthesis method is known as PSOLA (pitch-synchronous overlap-add method).
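The overlap-add step described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of the cited paper: the function name `psola_overlap_add`, the Hann window, and the half-width equal to the shorter of the two pitch periods are choices made for the example, and pitch marks and periods are given in samples.

```python
import numpy as np

def psola_overlap_add(signal, pitch_marks, analysis_period, synthesis_period):
    """Overlap-add segments cut around pitch marks at a new pitch period.

    All arguments are in samples. Window shape and width are assumptions.
    """
    half = min(analysis_period, synthesis_period)
    window = np.hanning(2 * half + 1)  # symmetric, peak value 1 at the centre
    out = np.zeros(len(pitch_marks) * synthesis_period + 2 * half + 1)
    for n, mark in enumerate(pitch_marks):
        # Cut a segment centred on the pitch mark, zero-padding at the edges.
        segment = np.zeros(2 * half + 1)
        lo, hi = mark - half, mark + half + 1
        src_lo, src_hi = max(lo, 0), min(hi, len(signal))
        segment[src_lo - lo:src_hi - lo] = signal[src_lo:src_hi]
        # Place the windowed segment so its centre lands on the n-th
        # synthesis pitch mark, spaced by the synthesis pitch period.
        centre = half + n * synthesis_period
        out[centre - half:centre + half + 1] += segment * window
    return out
```

Calling it with a synthesis period longer than the analysis period spreads the segments further apart, which lowers the pitch.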

[0008] FIG. 15 is a schematic diagram, quoted from the above paper, showing the conventional PSOLA speech synthesis method, in which segments are overlap-added while the pitch is changed.

[0009] This is an example in which the pitch period at synthesis time is made longer (the pitch is lowered) than at analysis time (when the segments were created).

[0010] Because PSOLA can change the pitch, it has come to be widely used as the speech synthesis unit in text-to-speech conversion.

[0011] A pitch mark must be attached for every pitch period. The following positions (1) to (5) have been proposed so far as pitch mark positions.

[0012] (1) The peak of the speech waveform: for example, the method described in Japanese Patent Application Laid-Open No. 4-372999 (NTT Data Communication Co., Ltd.).

[0013] (2) The peak of the short-time power: for example, the method described in Kawai, Higuchi, Shimizu, and Yamamoto, "Study of a waveform-segment concatenation speech synthesis system", IEICE Technical Report SP93-9 (1993-05).

[0014] (3) The peak after pitch filtering: for example, the method described in Japanese Patent Application Laid-Open No. 7-72897 (Nippon Telegraph and Telephone Corporation).

[0015] (4) The point delayed by 15% from the impulse drive point: for example, the method described in Arai, Nishimura, Yoshida, and Minowa, "Study of pitch waveform extraction positions", IEICE Technical Report SP95-8 (1995-05).

[0016] (5) The glottal closure point: for example, the method of Sakamoto, Saito, Suzuki, Hashimoto, and Kobayashi, "On a Japanese text-to-speech synthesis system using the waveform superposition method", IEICE Technical Report SP95-6 (1995-05).

[0017] Regarding (1), energy is concentrated at the local peak positions of the speech waveform, so the waveform peak is considered well suited to preserving the spectrum of the cut-out waveform.

[0018] The short-time power peak of (2) is used for the same reason.

[0019] The peak after pitch filtering of (3) is the peak of the vocal cord driving waveform for one pitch period and, according to the cited publication, represents the pitch interval well.

[0020] Regarding (4), Arai et al. report that spectral distortion is minimized at the point delayed by 15% from the impulse drive point.

[0021] The glottal closure point of (5) is considered to be equivalent to the impulse drive point (the excitation point of a one-pitch waveform). Sakamoto et al. use the Dynamic Wavelet transform to extract glottal closure points stably.

[0022]

Problems to be solved by the invention: However, conventional speech synthesis methods using these pitch mark positions each have the following problems.

[0023] With the waveform peak of (1), the high-frequency (white noise) component becomes large in voiced sounds before and after unvoiced consonants and in voiced sounds that include plosives or affricates. The pitch mark then fluctuates from one synthesis unit (one frame) to the next, resulting in a rumbling sound with poor continuity.

[0024] With the short-time power peak of (2), local maxima and local minima are weighted equally, so the pitch fluctuates for some speakers.

[0025] With the peak after pitch filtering of (3), the inventor's experiments showed that the mark deviates from the waveform peak, that this deviation causes large pitch fluctuation, and that the peak of (1) gave better results.

[0026] The 15% delay point of (4) requires a large amount of processing, which delays the processing.

[0027] The glottal closure point of (5) requires a large amount of processing to extract.

[0028] Thus, with any of the above speech synthesis methods, it has been difficult to set pitch mark positions by a simple method.

[0029] An object of the present invention is to provide a pitch mark setting method with relatively simple processing and little fluctuation, and thereby to provide a speech synthesis method and a speech synthesis apparatus capable of high-quality speech synthesis.

[0030]

Means for solving the problems: A speech synthesis method according to the present invention obtains synthesized speech by concatenating or superimposing speech waveforms, and is characterized by: detecting the minimum value of a speech signal; adding an offset to the speech signal so that the minimum value becomes non-negative, thereby calculating offset-added data; preparing a window of a predetermined length; applying the window to the offset-added data while moving the center of the window; calculating the area of the windowed data at each center position of the window; detecting the time-axis coordinate of the window center that gives the maximum area; cutting out the speech signal centered on that time-axis coordinate; and windowing and overlap-adding the cut-out speech signals while shifting them by one pitch period at a time, with that time-axis coordinate as the center of superposition.

[0031] A speech synthesis method according to the present invention obtains synthesized speech by concatenating or superimposing speech waveforms, and is characterized by: detecting the pitch period from a speech signal; detecting the minimum value of the speech signal; adding an offset to the speech signal so that the minimum value becomes non-negative, thereby calculating offset-added data; preparing a window whose length is the pitch period; applying the window to the offset-added data while moving the center of the window; calculating the area of the windowed data at each center position of the window; detecting the time-axis coordinate of the window center that gives the maximum area; cutting out the speech signal centered on that time-axis coordinate; and windowing and overlap-adding the cut-out speech signals while shifting them by one pitch period at a time, with that time-axis coordinate as the center of superposition.

[0032] A speech synthesis method according to the present invention obtains synthesized speech by concatenating or superimposing speech waveforms, and is characterized by: detecting the pitch period from a speech signal; detecting the minimum value of the speech signal; adding an offset to the speech signal so that the minimum value becomes non-negative, thereby calculating offset-added data; preparing a window whose length is the pitch period; applying the window, while moving its center, to data obtained by raising the offset-added data to a predetermined power; calculating the area of the windowed data at each center position of the window; detecting the time-axis coordinate of the window center that gives the maximum area; cutting out the speech signal centered on that time-axis coordinate; and windowing and overlap-adding the cut-out speech signals while shifting them by one pitch period at a time, with that time-axis coordinate as the center of superposition.
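The pitch mark detection shared by the three method variants above can be sketched as below. This is a minimal illustration, not the patented implementation: the function name `find_pitch_mark`, its arguments, and the Hann window shape are assumptions (the text only fixes the window length), and the `power` argument is a hypothetical way to cover both the plain variant (`power=1`) and the variant that raises the offset-added data to a power. The offset is applied by subtracting the detected minimum, which is one way to make the minimum non-negative. The final windowed overlap-add step is not included.

```python
import numpy as np

def find_pitch_mark(signal, search_start, search_end, window_len, power=1):
    """Return the sample index in [search_start, search_end] whose centred
    window covers the largest area of the offset-added data."""
    half = window_len // 2
    window = np.hanning(2 * half + 1)  # window shape is an assumption
    lo, hi = search_start - half, search_end + half + 1
    if lo < 0 or hi > len(signal):
        raise ValueError("search range plus half a window must fit in the signal")
    region = np.asarray(signal[lo:hi], dtype=float)
    # Shift the data so its minimum is 0, then optionally raise it to a power.
    shifted = (region - region.min()) ** power
    # areas[m] is the windowed sum with the window centred at search_start + m.
    areas = np.correlate(shifted, window, mode="valid")
    return search_start + int(np.argmax(areas))
```

Because the data is non-negative after the offset, the windowed sum behaves like a local mass, so its maximum marks where the waveform is concentrated within the search range.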

[0033] In the speech synthesis method according to the present invention, the detected pitch period may be an accurate pitch period obtained by cepstrum analysis, and the predetermined power may be the second, fourth, or eighth power.

[0034] A speech synthesis apparatus according to the present invention obtains synthesized speech by concatenating or superimposing speech waveforms, and comprises: means for detecting the minimum value of a speech signal and adding an offset to the speech signal so that the minimum value becomes non-negative, thereby calculating offset-added data; means for preparing a window of a predetermined length, applying the window to the offset-added data while moving the center of the window, calculating the area of the windowed data at each center position of the window, and detecting the time-axis coordinate of the window center that gives the maximum area; and means for cutting out the speech signal centered on that time-axis coordinate and windowing and overlap-adding the cut-out speech signals while shifting them by one pitch period at a time, with that time-axis coordinate as the center of superposition.

[0035] A speech synthesis apparatus according to the present invention obtains synthesized speech by concatenating or superimposing speech waveforms, and comprises: means for detecting the pitch period from a speech signal; means for detecting the minimum value of the speech signal and adding an offset to the speech signal so that the minimum value becomes non-negative, thereby calculating offset-added data; means for preparing a window whose length is the pitch period, applying the window to the offset-added data while moving the center of the window, calculating the area of the windowed data at each center position of the window, and detecting the time-axis coordinate of the window center that gives the maximum area; and means for cutting out the speech signal centered on that time-axis coordinate and windowing and overlap-adding the cut-out speech signals while shifting them by one pitch period at a time, with that time-axis coordinate as the center of superposition.

[0036] A speech synthesis apparatus according to the present invention obtains synthesized speech by concatenating or superimposing speech waveforms, and comprises: means for detecting the pitch period from a speech signal; means for detecting the minimum value of the speech signal and adding an offset to the speech signal so that the minimum value becomes non-negative, thereby calculating offset-added data; means for preparing a window whose length is the pitch period, applying the window, while moving its center, to data obtained by raising the offset-added data to a predetermined power, calculating the area of the windowed data at each center position of the window, and detecting the time-axis coordinate of the window center that gives the maximum area; and means for cutting out the speech signal centered on that time-axis coordinate and windowing and overlap-adding the cut-out speech signals while shifting them by one pitch period at a time, with that time-axis coordinate as the center of superposition.

[0037] The speech synthesis apparatus according to the present invention may determine the window length based on the pitch period obtained by cepstrum analysis.

[0038]

Embodiments of the invention: The speech synthesis method and speech synthesis apparatus according to the present invention can be applied to speech synthesis methods that take text data as input.

[0039] FIG. 1 is a block diagram showing the configuration of a speech synthesis method and speech synthesis apparatus according to a first embodiment of the present invention. The speech synthesis method according to this embodiment is effective for any speech synthesis method that takes text data as input.

[0040] In FIG. 1, 101 is a text analysis unit, 102 a word dictionary, 103 a parameter generation unit, 104 a speech signal input unit, 105 a segment creation unit, 106 a segment dictionary, 107 a windowing unit, and 108 a speech synthesis unit.

[0041] The text analysis unit 101 takes mixed kanji-kana text as input, performs morphological analysis with reference to the word dictionary 102, determines the reading, accent, and intonation, and outputs phonetic symbols with prosodic marks (an intermediate language). Accent and intonation are most closely related to the temporal pattern of the pitch frequency. The pitch frequency pattern not only gives the speech a natural, easy-to-hear tone but also marks the grouping of words and phrases, making spoken sentences easier to understand.

[0042] The word dictionary storage unit 102 is composed of, for example, ROM or RAM, and stores a word dictionary and a word lookup table that specifies the kinds of following words that can be grammatically connected. The parameter generation unit 103 sets the pitch frequency pattern, phoneme durations, and so on.

[0043] The speech signal input unit 104 is composed of a communication port such as RS-232C, an FDD, and an internal buffer for storing data. The text data to be synthesized is input through the communication port or the FDD.

[0044] The segment creation unit 105 performs the processing that takes the center of gravity of the data as the pitch mark. It is the main part of this speech synthesis method and is described in detail later with reference to FIGS. 2 to 6.

[0045] The segment dictionary 106 is created by the segment creation unit 105 after a speech signal has been input.

[0046] The speech synthesis unit 108 selects a segment from the segment dictionary 106, has the windowing unit 107 apply a time window of length Tp1 to the selected segment so that the pitch mark is at the center, and synthesizes speech by the PSOLA (pitch-synchronous overlap-add) method.

[0047] Here, the time window length Tp1 is controlled as in equation (1), where Tpa is the pitch period at analysis time and Tps is the pitch period at synthesis time.

[0048] Tp1 = C0 × min{Tpa, Tps} … equation (1)

where C0 is a value of about 2.0 and × denotes multiplication.
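As a small worked example of equation (1), the helper below computes the time window length from the two pitch periods. The function name `synthesis_window_length` and the default `c0=2.0` are assumptions for illustration; the text only says that C0 is about 2.0.

```python
def synthesis_window_length(tpa, tps, c0=2.0):
    """Equation (1): Tp1 = C0 * min(Tpa, Tps), in the same unit as the inputs."""
    return c0 * min(tpa, tps)

# Lowering the pitch (longer synthesis period): the analysis period limits the window.
tp1_lower = synthesis_window_length(tpa=8.0, tps=10.0)
# Raising the pitch (shorter synthesis period): the synthesis period limits the window.
tp1_higher = synthesis_window_length(tpa=8.0, tps=6.0)
```

Taking the shorter of the two periods keeps each window to roughly two periods of whichever spacing is tighter, so neighboring segments do not overlap by more than about half a window.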

[0049] The operation of the speech synthesis method configured as described above is explained below.

[0050] First, text data, such as a text file from PC communications or a text file on a floppy disk (FD), is input via a communication port such as RS-232C or via the FDD and is temporarily stored in the internal buffer. Once a certain amount has accumulated, it is sent to the text analysis unit 101 one unit at a time (for example, one sentence at a time).

[0051] The text analysis unit 101 checks the text data against the word dictionary 102, which is held in ROM or RAM, generates phonetic-prosodic symbols that describe information such as reading, accent, intonation, and pauses as a character string, and sends them to the parameter generation unit 103.

[0052] Based on this phonetic-prosodic symbol string, the synthesis parameter generation unit 103 determines the position of the speech segment data stored in the speech segment data storage unit 106, the duration of each phoneme (the number of repetitions of the segment used), the pitch of the voice (the repetition interval of the segment used), and the loudness of the voice (the gain applied to the segment used). It generates synthesis parameters made up of this information and sends them to the speech synthesis unit 108.

[0053] Based on the generated synthesis parameters, the speech synthesis unit 108 generates speech waveform data while reading the pitch mark positions output from the windowing unit 107, and sends the data to a D/A converter (not shown).

[0054] Next, the operation of the segment creation unit 105 is described following the flowchart shown in FIG. 2.

[0055] FIG. 2 is a flowchart showing the operation of the segment creation unit 105, and FIG. 3 is a flowchart showing the details of the center-of-gravity detection processing of FIG. 2. In the figures, ST denotes a step of the flow.

[0056] The speech signal is assumed to be input from a disk or the like by the speech signal input unit 104.

[0057] First, in step ST101, the speech signal data is divided into sections called analysis frames. In this embodiment, one frame is 32 ms long, and each successive frame is shifted by 8 ms. Let N be the total number of frames.
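The framing of step ST101 can be sketched as follows. The function name `split_into_frames` and the 16 kHz sampling rate used in the usage lines are assumptions; the text gives only the 32 ms frame length and the 8 ms shift. Frames that would run past the end of the signal are dropped, which is also an assumption.

```python
def split_into_frames(num_samples, sample_rate, frame_ms=32, shift_ms=8):
    """Return (start, end) sample indices of each analysis frame."""
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    if num_samples < frame_len:
        return []
    # Consecutive frames overlap because the shift is shorter than the frame.
    return [(start, start + frame_len)
            for start in range(0, num_samples - frame_len + 1, shift)]

# One second of audio at an assumed 16 kHz sampling rate.
frames = split_into_frames(num_samples=16000, sample_rate=16000)
total_frames = len(frames)  # this is N in the text
```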

[0058] Next, in step ST102, the frame number i to be processed is initialized.

[0059] Next, the window length for center-of-gravity detection is determined in step ST103, and the center of gravity is detected in step ST104.

[0060] Steps ST103 and ST104 are the core of this speech synthesis method; the pitch mark is detected in these steps. Details are given later with reference to FIG. 3.

[0061] In this embodiment, a pitch mark search range is first assigned to the data of the i-th frame. The minimum value is detected over the speech signal used in the pitch mark search (the search range itself plus half a window length beyond its start and end points on the time axis), and an offset is added so that this minimum value becomes 0. In this embodiment the minimum is detected for each frame, but the minimum over a longer time span may be used instead or, more simply, the smallest value the speech data can take (-32768 for 16-bit two's complement).
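The three choices of minimum described above can be compared with a short sketch. The function name `add_offset`, the `mode` strings, and the `global_min` argument are hypothetical names introduced for the example. The offset is applied by subtracting the chosen minimum, so that the minimum maps to 0 as the text requires.

```python
import numpy as np

INT16_MIN = -32768  # smallest 16-bit two's complement value

def add_offset(samples, mode="frame", global_min=None):
    """Shift the samples so that the chosen minimum maps to 0."""
    samples = np.asarray(samples, dtype=float)
    if mode == "frame":
        minimum = samples.min()   # minimum detected for this frame
    elif mode == "global":
        minimum = global_min      # minimum over a longer time span
    elif mode == "int16":
        minimum = INT16_MIN       # smallest value the data can take
    else:
        raise ValueError(f"unknown mode: {mode}")
    return samples - minimum
```

With the per-frame minimum the shifted data touches 0 inside the frame; with the other two choices it is merely guaranteed to be non-negative, which is all the area calculation needs.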

[0062] A fixed-length window is applied to the offset-added data. The center of the window is shifted one point at a time, the area of the windowed data is calculated at each center position, and the time-axis coordinate of the window center that gives the maximum area is detected. This coordinate is called the center of gravity of the data of the i-th frame and is used as the pitch mark of the data of the i-th frame.

[0063] Once the pitch mark of the data of the i-th frame has been detected, in step ST105 the speech data before and after the pitch mark is cut out and centered so that the pitch mark lies in the middle. In this embodiment, based on preliminary experiments, the cut-out length was set to 12 ms, which is the longest male pitch period plus a margin.
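A possible sketch of the cut-out and centering in step ST105 is shown below. The function name `cut_segment`, the zero padding at the signal edges, and the 16 kHz sampling rate in the usage lines are assumptions; only the 12 ms cut-out length comes from the text.

```python
import numpy as np

def cut_segment(signal, pitch_mark, sample_rate, cut_ms=12):
    """Cut about cut_ms of signal so the pitch mark sits at the middle sample."""
    half = sample_rate * cut_ms // 1000 // 2
    segment = np.zeros(2 * half + 1)
    lo, hi = pitch_mark - half, pitch_mark + half + 1
    # Copy only the part that exists; samples outside the signal stay 0.
    src_lo, src_hi = max(lo, 0), min(hi, len(signal))
    segment[src_lo - lo:src_hi - lo] = signal[src_lo:src_hi]
    return segment

ramp = np.arange(1000, dtype=float)
seg = cut_segment(ramp, pitch_mark=500, sample_rate=16000)
middle = len(seg) // 2  # index of the pitch mark inside the segment
```

Storing every segment with the pitch mark at a fixed index means the synthesis side can align segments without keeping a separate mark position per entry.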

[0064] Next, in step ST106, the cut-out data is written as the segment for the i-th frame, and in step ST107 the cut-out speech data is written sequentially to a storage medium such as a disk as the segment dictionary.

[0065] In step ST108, the frame number i is compared with the total number of frames N (is i < N?) to determine whether all frames have been processed. If i < N, processing is judged to be unfinished; the frame number is updated in step ST109 (i = i + 1), and the flow returns to step ST103 to continue. If step ST108 determines that all frames have been processed, disk close processing and the like (not shown) are performed and the operation of the segment creation unit 105 ends.

[0066] FIG. 3 is a flowchart showing the detailed operation of the center-of-gravity detection processing (step ST104) of FIG. 2.

[0067] First, in step ST201, the data data[k] (k = 1, 2, …) of the i-th frame of the divided speech signal is cut out. Next, in step ST202, the window length is set and the fixed-length window data window[k] (k = -window length/2 to window length/2) is created.

【００６８】次いで、重心点検出処理（ステップＳＴ１
０４）動作として、ステップＳＴ２０３でピッチマーク
探索範囲（始点・終点）を設定する。Next, the center-of-gravity point detection processing (step ST1)
04) As an operation, a pitch mark search range (start point / end point) is set in step ST203.

[0069] In step ST204, the minimum value minimum[i] of the speech signal is detected over the interval that begins half the window length (set in step ST202) before the start point of the search range set in step ST203 and ends half the window length after the end point of that search range.

[0070] In step ST205, an offset is added to the data in the frame so that the minimum value detected in step ST204 becomes 0. This offset addition is expressed by equation (2).

[0071]
Off_Add_Data(k) = Data[k] + minimum[i]   ... Equation (2)
FIG. 4 shows the minimum value detection range and the window function; magnitude 0 on the vertical axis (the reference point) corresponds to the above minimum value.
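Steps ST204 and ST205 can be sketched as follows. The function name `add_offset` and the clamping of the detection range to the available data are assumptions. Note also that equation (2) is written with a plus sign, whereas the sketch subtracts the detected minimum, which is the operation that brings the minimum to 0 when minimum[i] is a signed sample value; that reading of the offset is an assumption.

```python
def add_offset(data, search_start, search_end, window_length):
    """Shift the data so that the minimum in the detection range becomes 0."""
    half = window_length // 2
    # Detection range: half a window before the search start
    # to half a window after the search end (clamped to the data).
    first = max(0, search_start - half)
    last = min(len(data) - 1, search_end + half)
    minimum = min(data[first:last + 1])                 # ST204
    shifted = [sample - minimum for sample in data]     # ST205
    return shifted, minimum


shifted, minimum = add_offset(
    [3.0, -5.0, 2.0, -1.0, 4.0], search_start=1, search_end=3, window_length=2
)
```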

[0072] Next, let j be the time-axis coordinate of the center of the pitch mark search window. In step ST206, j is initialized to the time-axis coordinate of the pitch mark search start point set in step ST203. In steps ST207 to ST210, the fixed-length window is applied to window-length samples of data, beginning at the search start point, and the area S(j) of the windowed data is calculated.

[0073] Specifically, step ST207 sets k to -window length/2 and the area S(j) to 0, and step ST208 applies the window to the data and accumulates the area. The area S(j) of the windowed data is calculated by equation (3).

[0074]
S(j) = S(j) + Off_Add_Data(k) × window[k]   ... Equation (3)
Step ST209 determines whether k is greater than window length/2 (k > window length/2). If k is not greater (k ≤ window length/2), the calculation over the window length is judged incomplete, k is incremented by 1 in step ST210 (k = k + 1), and processing returns to step ST208.

[0075] If k is greater than window length/2 (k > window length/2), the calculation over the window length is judged complete, and step ST211 determines whether j is greater than the search end point (j > search end point). If j is not greater (j ≤ search end point), the calculation of the windowed-data area S(j) is judged incomplete, j is incremented by 1 in step ST212 (j = j + 1), and processing returns to step ST207.

[0076] In this way, the area S(j) of the windowed data is calculated while the window center is shifted one point at a time in step ST212. When j exceeds the search end point in step ST211 (j > search end point), the window center is judged to have reached the search end point, processing advances to step ST213, and the calculation of S(j) ends.

[0077] Finally, in step ST213, the time-axis coordinate argmax[S(j)] of the window center that gives the maximum area is detected. This coordinate is called the center of gravity of the window and is taken as the pitch mark, and the center-of-gravity point detection process ends.
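The whole flow of steps ST203 to ST213 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the implementation of the specification: the name `detect_pitch_mark` is hypothetical, a Hanning window is assumed, the data sample used inside the window is taken at position j + k (the window center plus the window index), the detected minimum is subtracted so that the shifted minimum is 0, and samples outside the frame are skipped.

```python
import math


def detect_pitch_mark(data, search_start, search_end, window_length):
    """Return the window center j that maximizes the windowed area S(j)."""
    half = window_length // 2
    window = [0.5 + 0.5 * math.cos(math.pi * k / half) for k in range(-half, half + 1)]
    first = max(0, search_start - half)
    last = min(len(data) - 1, search_end + half)
    minimum = min(data[first:last + 1])                 # ST204
    best_j, best_area = search_start, float("-inf")
    for j in range(search_start, search_end + 1):       # ST206, ST211, ST212
        area = 0.0                                      # ST207
        for k in range(-half, half + 1):                # ST208 to ST210
            if 0 <= j + k < len(data):
                # Equation (3), with the offset of ST205 applied on the fly.
                area += (data[j + k] - minimum) * window[k + half]
        if area > best_area:                            # ST213: keep the argmax
            best_j, best_area = j, area
    return best_j


# Synthetic frame: a triangular pulse centered on sample 50 over a -1.0 floor.
frame = [max(0.0, 10.0 - abs(n - 50)) - 1.0 for n in range(100)]
mark = detect_pitch_mark(frame, search_start=20, search_end=80, window_length=16)
```

For the symmetric pulse in the example, the windowed area is largest when the window is centered on the pulse, so the detected mark coincides with the pulse center.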

[0078] When this flow has detected the pitch mark of the i-th frame, processing moves to step ST105 of FIG. 2, where a segment is created by cutting out the speech data before and after the pitch mark.

[0079] FIG. 5 shows how the area of the windowed data changes for window lengths of 64, 96, and 128 points. FIG. 6 shows how the area of the windowed data changes when the center of gravity is detected on a waveform with a large high-frequency component.

[0080] FIG. 5 is a schematic of the stationary part of a waveform uttered as "a" (/a/; the slashes are syllable boundary symbols), and FIG. 6 is a schematic of the /z/ part of a waveform uttered as "uza" (/uza/).

[0081] The solid lines in FIGS. 5 and 6 show the waveform data after the offset has been added. The horizontal axis is time and the vertical axis is the magnitude of the waveform signal; because the offset has been added, 0 on the vertical axis corresponds to the minimum value of the waveform (minimum[i]). The broken lines show the area of the windowed data when the window center is at the corresponding time point on the horizontal axis. The point on the time axis that gives the maximum of this area, marked with *, is the pitch mark position (the center of gravity of the data) obtained by the present speech synthesis method.

[0082] As described above, the speech synthesis method and speech synthesis apparatus according to the first embodiment comprise the text analysis unit 101, word dictionary 102, parameter generation unit 103, speech signal input unit 104, segment creation unit 105, segment dictionary 106, windowing unit 107, and speech synthesis unit 108. The segment creation unit 105 has: means for detecting the minimum value of the speech signal and adding an offset to the speech signal so that the minimum value becomes non-negative (mainly step ST205); means for preparing a window of predetermined length, applying the window to the offset-added data while moving the window center, calculating the area of the windowed data at each center position, and detecting the time-axis coordinate of the window center that gives the maximum area (mainly step ST208); and means for cutting out the speech signal centered on that time-axis coordinate and performing windowed overlap-add of the cut-out speech signals, shifted by the pitch period, with the center time-axis coordinate of each cut-out signal as the center of superposition (mainly steps ST105 to ST109). Because the center of gravity of the data is used as the pitch mark, pitch marks can be set with simple processing, and high-quality speech synthesis becomes possible.

[0083] Pitch marks can also be extracted stably from waveforms in which the power of the high-frequency component (random noise) is considerably large (the waveform of FIG. 6), as seen in voiced sounds that contain plosives or affricates and in voiced sounds immediately before unvoiced consonants. In listening tests of the synthesized speech, the poor connection between phonemes (roughness) heard with segments whose pitch marks were assigned at waveform peaks was eliminated.

[0084] Furthermore, because the center of gravity of the waveform is used as the pitch mark, the pitch waveform is deformed little, the drop in formant Q is suppressed, and the sound is brighter.

[0085] FIG. 7 is a block diagram showing the configuration of a speech synthesis method and speech synthesis apparatus according to a second embodiment of the present invention. In describing this embodiment, components identical to those of the speech synthesis method and apparatus shown in FIG. 1 are given the same reference numerals, and duplicate description is omitted.

[0086] In FIG. 7, 101 is the text analysis unit, 102 the word dictionary, 103 the parameter generation unit, 104 the speech signal input unit, 200 the segment creation unit, 106 the segment dictionary, and 108 the speech synthesis unit. The segment creation unit 200 consists of a pitch mark calculation unit 210 and a windowing unit 211.

[0087] The segment creation unit 200 performs the processing that takes the center of gravity of the data as the pitch mark. It is the main part of the present speech synthesis method and is described in detail later with reference to FIGS. 8 to 10.

[0088] The segment dictionary 106 is created by the segment creation unit 200 after the speech signal has been input.

[0089] The speech synthesis unit 108 selects segments from the segment dictionary 106 and synthesizes speech by the PSOLA (pitch-synchronous overlap-add) method.

[0090] The operation of the speech synthesis method configured as described above is explained below.

[0091] The second embodiment differs from the first in that, when the center of gravity of the data is detected, a highly reliable pitch period is obtained beforehand by cepstrum analysis and that pitch period is used for the window function during center-of-gravity extraction.

[0092] FIG. 8 is a flowchart showing the operation of the segment creation unit 200, FIG. 9 is a flowchart showing details of the pitch period detection process of FIG. 8, and FIG. 10 is a flowchart showing details of the center-of-gravity point detection process of FIG. 8. Processing steps identical to those of the segment creation processing shown in FIGS. 2 and 3 are given the same step numbers.

[0093] The speech signal is assumed to be input from a disk or the like by the speech signal input unit 104.

[0094] First, in step ST101, the speech signal data is divided into sections called analysis frames. In this embodiment one frame is 32 ms long, and each successive frame is shifted by 8 ms. The total number of frames is N.
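The framing of step ST101 can be sketched as below. The 32 ms frame length and the 8 ms shift come from the text; the function name `split_into_frames`, the 8 kHz sampling rate in the example, and the decision to drop an incomplete final frame are assumptions.

```python
def split_into_frames(signal, sample_rate_hz, frame_ms=32.0, shift_ms=8.0):
    """Split a sample sequence into overlapping analysis frames."""
    frame_length = int(round(sample_rate_hz * frame_ms / 1000.0))
    shift = int(round(sample_rate_hz * shift_ms / 1000.0))
    frames = []
    start = 0
    # Only full-length frames are kept; how the tail is handled is an assumption.
    while start + frame_length <= len(signal):
        frames.append(signal[start:start + frame_length])
        start += shift
    return frames


frames = split_into_frames(list(range(1000)), sample_rate_hz=8000)
total_frames = len(frames)   # N in the text
```

At 8 kHz each frame holds 256 samples and consecutive frames start 64 samples apart, so neighboring frames overlap by three quarters of their length.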

[0095] Next, in step ST102, the frame number i to be processed is initialized.

[0096] Next, in step ST301, the pitch period Tp of the speech in the i-th frame is detected. A simple way to detect the pitch period would be to use, for example, the interval between waveform peaks, but this embodiment uses the cepstrum method in order to calculate the pitch period more precisely. Pitch period detection by the cepstrum method is described later with reference to FIG. 9.

[0097] Next, in step ST302, the window length for center-of-gravity point detection is determined from the pitch period obtained by the cepstrum method, and in step ST303 the center of gravity is detected.

[0098] The pitch period obtained by the cepstrum method is used as the length of the window for center-of-gravity point detection, and the window data is created accordingly. Preliminary experiments by the inventor confirmed that pitch marks are obtained stably with windows whose side lobes are strongly attenuated, such as the Hanning window and the Blackman window.

[0099] Steps ST302 and ST303 are the core of the present speech synthesis method; the pitch mark is detected in these steps. Details are described later with reference to FIG. 10.

[0100] In this embodiment, as in the first embodiment, a pitch mark search range is first given for the data of the i-th frame. The minimum value of the speech signal is detected over the data used in the pitch mark search (the search range itself plus half a window length beyond its start and end points), and an offset is added so that this minimum becomes 0. A fixed-length window is applied to the offset-added data, the area of the windowed data is calculated at each window center position while the center is shifted one point at a time, and the time-axis coordinate of the window center that gives the maximum area is detected. This coordinate is called the center of gravity of the data of the i-th frame and is taken as the pitch mark of that frame.

[0101] When the pitch mark of the i-th frame has been detected, step ST304 cuts out the speech data before and after the pitch mark and centers it so that the pitch mark lies at the middle. In this embodiment the cut-out length is 12 ms, a value chosen through preliminary experiments as the longest male pitch period plus a margin.

[0102] Next, in step ST306, the cut-out data is windowed using the window data C0 from step ST305; in step ST307 the windowed data is written to the segment dictionary; and in step ST107 the cut-out speech data is written sequentially, as the segment dictionary, to a storage medium such as a disk. The windowing in step ST306 corresponds to the operation of the windowing unit 107 (FIG. 1) of the first embodiment.

[0103] In this embodiment, the write to the segment dictionary takes place in step ST307. Because the accurate pitch period of the i-th frame has already been obtained in step ST301, what is written to the segment dictionary is the data already windowed in step ST306. As a result, the per-pitch windowing multiplications that the first embodiment required at synthesis time become unnecessary; only the superposition has to be performed, and the amount of speech synthesis processing is greatly reduced. An apparatus using the present speech synthesis method can therefore be implemented with an ordinary CPU, without an arithmetic processor such as a DSP.
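The saving described above can be illustrated with a small overlap-add sketch in which the stored segments are already windowed, so that synthesis needs additions only. The name `overlap_add`, the use of a single constant pitch period, and the placement of segment m at sample m × pitch period are assumptions made for illustration.

```python
def overlap_add(windowed_segments, pitch_period, output_length):
    """Superimpose pre-windowed segments, one pitch period apart."""
    output = [0.0] * output_length
    for m, segment in enumerate(windowed_segments):
        start = m * pitch_period
        for n, value in enumerate(segment):
            if 0 <= start + n < output_length:
                # Addition only: no window multiplication at synthesis time.
                output[start + n] += value
    return output


segments = [[1.0, 2.0, 1.0], [1.0, 2.0, 1.0], [1.0, 2.0, 1.0]]
synthesized = overlap_add(segments, pitch_period=2, output_length=7)
```

If the windowing were done at synthesis time instead, the inner loop would carry one extra multiplication per sample for every pitch period.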

[0104] In step ST108, the frame number i is compared with the total number of frames N (is i < N?) to determine whether all frames have been processed. If i < N, processing is judged incomplete; step ST109 updates the frame number to be processed (i = i + 1) and processing returns to step ST301 and continues from there. When step ST108 determines that all frames have been processed, disk close processing and the like (not shown) is performed and the operation of the segment creation unit 200 ends.

[0105] FIG. 9 is a flowchart showing the detailed operation of the pitch period detection process (step ST301) of FIG. 8. The cepstrum method, which allows the pitch period to be calculated precisely, is used as the pitch period detection method.

[0106] First, a time waveform is input in step ST401 and windowed in step ST402.

[0107] Next, in step ST403, a discrete Fourier transform (DFT) is applied to the windowed time waveform, and in step ST404 the square root of the sum of the squares of its real and imaginary parts is logarithmically transformed.

[0108] Next, an inverse discrete Fourier transform (IDFT) is performed in step ST405, the cepstrum components are obtained from it in step ST406, and this flow ends.

[0109] In other words, the cepstrum method converts a convolution into an additive operation. A voiced speech signal is the source component convolved with the vocal tract information, so the method is well suited to separating the two.

[0110] When the input signal is voiced speech with pitch period T0, the source component appears in the vicinity of T0, and the vocal tract component appears as a component in the short-time region.
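Steps ST401 to ST406 and the peak picking near T0 can be sketched with the standard library alone. This is a rough sketch under assumptions: the function names, the Hanning analysis window, the small constant added before the logarithm, and the lag search range are not taken from the text, and the naive O(n²) transform is used only to keep the example self-contained.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    result = []
    for m in range(n):
        total = 0j
        for t in range(n):
            total += values[t] * cmath.exp(sign * 2j * math.pi * m * t / n)
        result.append(total / n if inverse else total)
    return result


def cepstrum_pitch_period(frame, min_lag, max_lag):
    """Return the lag (in samples) of the largest cepstrum value in the range."""
    n = len(frame)
    windowed = [frame[t] * (0.5 - 0.5 * math.cos(2.0 * math.pi * t / (n - 1)))
                for t in range(n)]                                  # ST402
    spectrum = dft(windowed)                                        # ST403
    log_magnitude = [math.log(abs(c) + 1e-12) for c in spectrum]    # ST404
    cepstrum = [c.real for c in dft(log_magnitude, inverse=True)]   # ST405, ST406
    return max(range(min_lag, max_lag + 1), key=lambda q: cepstrum[q])


# Synthetic voiced frame: an impulse every 32 samples through a one-pole filter.
frame = [0.0] * 256
for t in range(256):
    excitation = 1.0 if t % 32 == 0 else 0.0
    previous = frame[t - 1] if t > 0 else 0.0
    frame[t] = excitation + 0.8 * previous
period = cepstrum_pitch_period(frame, min_lag=16, max_lag=48)
```

Restricting the search to lags above the short-time region keeps the vocal tract component out of the peak search, which is the separation the method relies on.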

[0111] The pitch period obtained by the above method is used as the length of the window for center-of-gravity point detection, and the window data is created accordingly. Preliminary experiments by the inventor confirmed that pitch marks are obtained stably with windows whose side lobes are strongly attenuated, such as the Hanning window and the Blackman window.

[0112] FIG. 10 is a flowchart showing the detailed operation of the center-of-gravity point detection process (step ST303) of FIG. 8.

[0113] First, in step ST201, the data data[k] (k = 1, 2, ...) of the i-th frame of the divided speech signal is cut out.

[0114] Next, in step ST501, the pitch period is detected, the detected pitch period is set as the window length, and window data window[k] (k = -window length/2 to window length/2) is created.
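A pitch-length window as created in step ST501 might look like the sketch below. The function name `pitch_length_window`, the expression of the pitch period in samples, and the textbook Hanning and Blackman coefficients are assumptions; the text only names these window families as examples that worked well.

```python
import math


def pitch_length_window(pitch_period_samples, shape="hanning"):
    """Return window[k], k = -L/2 .. L/2, with L set to the pitch period."""
    half = pitch_period_samples // 2   # assumes a pitch period of at least 2 samples
    window = []
    for k in range(-half, half + 1):
        phase = math.pi * k / half
        if shape == "hanning":
            window.append(0.5 + 0.5 * math.cos(phase))
        else:
            # Blackman window with the common 0.42 / 0.5 / 0.08 coefficients.
            window.append(0.42 + 0.5 * math.cos(phase) + 0.08 * math.cos(2.0 * phase))
    return window


blackman = pitch_length_window(64, shape="blackman")
hanning = pitch_length_window(64)
```

Because the window length follows the detected pitch period, the window is rebuilt for each frame rather than once for the whole utterance.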

[0115] Next, as part of the center-of-gravity point detection process (step ST303), the pitch mark search range (start point and end point) is set in step ST203.

[0116] In step ST502, the minimum value minimum[i] of the speech signal is detected over the interval that begins half the window length (set in step ST501) before the start point of the search range set in step ST203 and ends half the window length after the end point of that search range.

[0117] In step ST205, an offset is added to the data in the frame so that the minimum value detected in step ST502 becomes 0. This offset addition is expressed by equation (2) above.

[0118] Next, let j be the time-axis coordinate of the center of the pitch mark search window. In step ST206, j is initialized to the time-axis coordinate of the pitch mark search start point set in step ST203. In steps ST207 to ST210, the fixed-length window described above is applied to window-length samples of data, beginning at the search start point, and the area S(j) of the windowed data is calculated.

[0119] Specifically, step ST207 sets k to -window length/2 and the area S(j) to 0, and step ST503 applies the pitch-length window to the data and calculates the area S(j) of the windowed data by equation (3) above.

[0120] Step ST209 determines whether k is greater than window length/2 (k > window length/2). If k is not greater (k ≤ window length/2), the calculation over the window length is judged incomplete, k is incremented by 1 in step ST210 (k = k + 1), and processing returns to step ST503.

[0121] If k is greater than window length/2 (k > window length/2), the calculation over the window length is judged complete, and step ST211 determines whether j is greater than the search end point (j > search end point). If j is not greater (j ≤ search end point), the calculation of the windowed-data area S(j) is judged incomplete, j is incremented by 1 in step ST212 (j = j + 1), and processing returns to step ST207.

[0122] In this way, the area S(j) of the windowed data is calculated while the window center is shifted one point at a time in step ST212. When j exceeds the search end point in step ST211 (j > search end point), the window center is judged to have reached the search end point, processing advances to step ST504, and the calculation of S(j) ends.

[0123] Finally, in step ST504, the time-axis coordinate argmax[S(j)] of the window center that gives the maximum area is detected. This coordinate is called the center of gravity of the window, the pitch mark is obtained from it according to equation (4), and this flow ends.

[0124]
Pitch mark(i) = argmax[S(j)] + window length/2   ... Equation (4)
When this flow has detected the pitch mark of the i-th frame, processing moves to step ST304 of FIG. 8, where a segment is created by cutting out the speech data before and after the pitch mark.

[0125] As described above, the speech synthesis method and speech synthesis apparatus according to the second embodiment use the accurate pitch period obtained by cepstrum analysis as the window length. The cutoff frequency of the window can therefore be set to the pitch frequency, and changes of the pitch period within a phoneme can be followed, so more stable pitch marks are obtained and pitch mark errors decrease. In listening tests of the synthesized speech, the connectivity between phonemes was improved.

[0126] FIG. 11 is a block diagram showing the configuration of a speech synthesis method and speech synthesis apparatus according to a third embodiment of the present invention. In describing this embodiment, components identical to those of the speech synthesis method and apparatus shown in FIG. 7 are given the same reference numerals, and duplicate description is omitted.

[0127] In FIG. 11, 101 is the text analysis unit, 102 the word dictionary, 103 the parameter generation unit, 104 the speech signal input unit, 300 the segment creation unit, 106 the segment dictionary, and 108 the speech synthesis unit. The segment creation unit 300 consists of a pitch mark calculation unit 310 and a windowing unit 311.

[0128] The segment creation unit 300 performs the processing that takes the center of gravity of the data as the pitch mark. It is the main part of the present speech synthesis method and is described in detail later with reference to FIGS. 12 and 13.

[0129] The segment dictionary 106 is created by the segment creation unit 300 after the speech signal has been input.

[0130] The speech synthesis unit 108 selects segments from the segment dictionary 106 and synthesizes speech by the PSOLA (pitch-synchronous overlap-add) method.

[0131] The operation of the speech synthesis method configured as described above is explained below.

[0132] In the third embodiment, as in the second embodiment, a highly reliable pitch period is obtained beforehand by cepstrum analysis when the center of gravity of the data is detected, and that pitch period is used for the window function during center-of-gravity extraction.

[0133] FIG. 12 is a flowchart showing the operation of the segment creation unit 300, and FIG. 13 is a flowchart showing details of the center-of-gravity point detection process of FIG. 12. Processing steps identical to those of the segment creation processing shown in FIGS. 8 and 10 are given the same step numbers.

[0134] The speech signal is assumed to be input from a disk or the like by the speech signal input unit 104.

[0135] First, in step ST101, the speech signal data is divided into sections called analysis frames. In this embodiment one frame is 32 ms long, and each successive frame is shifted by 8 ms. The total number of frames is N.

[0136] Next, in step ST102, the frame number i to be processed is initialized.

[0137] Next, in step ST301, the pitch period Tp of the speech in the i-th frame is detected. As in the second embodiment, the cepstrum method is used as the pitch period detection method in order to calculate the pitch period more precisely.

[0138] Next, in step ST302, the window length for center-of-gravity point detection is determined from the pitch period obtained by the cepstrum method, and in step ST601 the center of gravity is detected.

[0139] Steps ST302 and ST601 are the core of the present speech synthesis method; the pitch mark is detected in these steps. Details are described later with reference to FIG. 13.

[0140] In this embodiment, as in the first and second embodiments, a pitch mark search range is first given for the data of the i-th frame. The minimum value of the speech signal is detected over the data used in the pitch mark search (the search range itself plus half a window length beyond its start and end points), and an offset is added so that this minimum becomes 0. A fixed-length window is applied to the offset-added data, the area of the windowed data is calculated at each window center position while the center is shifted one point at a time, and the time-axis coordinate of the window center that gives the maximum area is detected. This coordinate is called the center of gravity of the data of the i-th frame and is taken as the pitch mark of that frame.

[0141] When the pitch mark of the i-th frame has been detected, step ST304 cuts out the speech data before and after the pitch mark and centers it so that the pitch mark lies at the middle. In this embodiment the cut-out length is 12 ms, a value chosen through preliminary experiments as the longest male pitch period plus a margin.

[0142] Next, in step ST306, the cut-out data is windowed using the window data C0 from step ST305; in step ST307 the windowed data is written to the segment dictionary; and in step ST107 the cut-out speech data is written sequentially, as the segment dictionary, to a storage medium such as a disk. The windowing in step ST306 corresponds to the operation of the windowing unit 107 (FIG. 1) of the first embodiment.

[0143] As described above, in the present embodiment, as in the second embodiment, the accurate pitch period of the i-th frame has already been obtained in step ST301, so the data windowed in step ST306 can be written to the segment dictionary. At synthesis time, the per-pitch windowing multiplication required in the first embodiment becomes unnecessary and only the superposition needs to be performed, which greatly reduces the amount of speech synthesis processing. An apparatus using this speech synthesis method can therefore be implemented with an ordinary CPU, without an arithmetic processor such as a DSP.
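Why pre-windowed segments make synthesis cheap can be illustrated with a minimal sketch. The function name `overlap_add` and the assumption that all segments have equal length are hypothetical; the point is only that the inner loop contains additions and no window multiplication.

```python
def overlap_add(segments, pitch_period):
    """Superimpose pre-windowed segments, one pitch period apart.

    segments: list of equal-length lists already multiplied by the window
    when the segment dictionary was built (assumed).
    """
    if not segments:
        return []
    seg_len = len(segments[0])
    out = [0.0] * ((len(segments) - 1) * pitch_period + seg_len)
    for m, seg in enumerate(segments):
        start = m * pitch_period          # m-th segment is shifted by m pitch periods
        for k, value in enumerate(seg):
            out[start + k] += value       # addition only, no per-pitch multiply
    return out
```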

[0144] In step ST108, the total number of frames N is compared with the frame number i (is i < N?) to determine whether all frames have been processed. If i < N, processing is judged incomplete, the frame number is updated in step ST109 (i = i + 1), and control returns to step ST301 to continue processing. When step ST108 determines that all frames have been processed, disk close processing and the like (not shown) are performed, and the operation of the segment creation unit 200 ends.

[0145] FIG. 13 is a flowchart showing the detailed operation of the center-of-gravity point detection process (step ST601) of FIG. 10.

[0146] First, in step ST201, the speech signal is divided into sections called analysis frames, and the data data[k] (k = 1, 2, ...) of the i-th frame of the divided speech signal is cut out. In the present embodiment, one frame is 32 ms long, and the frame position advances by 8 ms to reach the next frame.
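The framing of step ST201 can be sketched as below. The function name, the `sample_rate_hz` parameter, and the decision to drop a trailing partial frame are assumptions made for illustration; the text specifies only the 32 ms frame length and the 8 ms shift.

```python
def split_into_frames(signal, sample_rate_hz, frame_ms=32.0, shift_ms=8.0):
    """Split a signal into overlapping analysis frames (hypothetical helper)."""
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    shift = int(round(sample_rate_hz * shift_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(signal):   # a trailing partial frame is dropped
        frames.append(signal[start:start + frame_len])
        start += shift
    return frames
```

With these values, consecutive frames overlap by 24 ms, so each sample is covered by up to four frames.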

[0147] Next, in step ST501, the pitch period is detected, the detected pitch period is set as the window length, and window data window[k] (k = −window length/2 to window length/2) is created.
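A pitch-length window indexed from −window length/2 to +window length/2 might be built as follows. The Hanning shape is an assumption taken from the preliminary experiment mentioned later in the description; this paragraph itself does not fix the window function, and the function name is hypothetical.

```python
import math

def make_pitch_window(pitch_period):
    """Return Hanning weights for k = -half .. half, stored at index k + half."""
    half = pitch_period // 2
    if half < 1:
        raise ValueError("pitch_period must be at least 2 samples")
    return [0.5 * (1.0 + math.cos(math.pi * k / half))
            for k in range(-half, half + 1)]
```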

[0148] Next, as the center-of-gravity point detection process (step ST601 above), the pitch mark search range (start point and end point) is set in step ST203.

[0149] In step ST701, the minimum value minimum[i] of the speech signal is detected over the interval from the time point half a window length before the start point of the search range set in step ST203 to the time point half a window length after its end point.

[0150] In step ST205, an offset is added to the data in the frame so that the minimum value detected in step ST701 becomes 0. This offset addition is expressed by equation (2) above.
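Steps ST701 and ST205 can be sketched together. Equation (2) is not reproduced in this part of the document, so the form used here, Off_Add_Data(k) = data[k] − minimum[i], is an assumption consistent with "the minimum becomes 0"; the function name and the clamping of the range to the frame boundaries are likewise hypothetical.

```python
def add_offset(data, search_start, search_end, window_length):
    """Shift data so its minimum over the extended search range becomes 0."""
    half = window_length // 2
    lo = max(0, search_start - half)              # half a window before the start point
    hi = min(len(data) - 1, search_end + half)    # half a window after the end point
    minimum = min(data[lo:hi + 1])                # ST701
    return [x - minimum for x in data], minimum   # ST205 (assumed form of eq. (2))
```

Making the data non-negative over the range means the windowed sum behaves like an area, which the later maximization relies on.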

[0151] Next, let j be the time-axis coordinate of the center of the pitch mark search window. In step ST206, j is initialized to the time-axis coordinate of the pitch mark search start point set in step ST203. In steps ST207 to ST210, the fixed-length window described above is applied to one window length of data starting from the search start point, and the area S(j) of the windowed data is calculated.

[0152] That is, in step ST207, k is set to −window length/2 and the area S(j) is set to 0. In step ST702, the pitch-length window described in the second embodiment is applied to the offset-added data raised to a predetermined power, and the area S(j) of the windowed offset data is calculated in the manner of equation (3) above. With the power applied, the area S(j) of the windowed offset data is calculated by equation (5).

[0153]
S(j) = S(j) + [(Off_Add_Data(k))^n × window[k]]   ... Equation (5)
In step ST209, it is determined whether k is greater than window length/2 (k > window length/2). If k is less than or equal to window length/2 (k ≤ window length/2), the calculation over the window length is judged incomplete, k is incremented by 1 in step ST210 (k = k + 1), and control returns to step ST702.

[0154] When k is greater than window length/2 (k > window length/2), the calculation over the window length is judged complete, and step ST211 determines whether j is greater than the search end point (j > search end point). If j is less than or equal to the search end point (j ≤ search end point), the calculation of the windowed-data area S(j) is judged incomplete, j is incremented by 1 in step ST212 (j = j + 1), and control returns to step ST207.

[0155] In this way, the area S(j) of the windowed data is calculated while the window center is shifted one point at a time in step ST212. When j exceeds the search end point in step ST211 (j > search end point), the window center is judged to have reached the search end point, control proceeds to step ST703, and the calculation of S(j) ends.

[0156] Finally, in step ST703, the time-axis coordinate argmax[S(j)] of the window center that gives the maximum area is detected. This coordinate is called the center-of-gravity point of the window and is taken as the pitch mark, and the center-of-gravity point detection process ends.
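The loop of steps ST206 to ST703 can be written out as a short sketch. It assumes that the offset-added data is sampled at j + k when the window is centered at j (equation (5) writes the index simply as k), that the window has an odd number of taps stored at index k + half, and that samples outside the data are skipped. The function name is hypothetical.

```python
def find_pitch_mark(offset_data, search_start, search_end, window, power=1):
    """Return the window center j that maximizes the windowed area S(j)."""
    half = len(window) // 2
    best_j, best_area = search_start, float("-inf")
    for j in range(search_start, search_end + 1):        # ST206, ST211, ST212
        area = 0.0                                       # ST207: S(j) = 0
        for k in range(-half, half + 1):                 # ST209, ST210
            t = j + k
            if 0 <= t < len(offset_data):
                # ST702 / eq. (5): accumulate powered data times window weight
                area += (offset_data[t] ** power) * window[k + half]
        if area > best_area:                             # ST703: argmax S(j)
            best_j, best_area = j, area
    return best_j
```

The cost is proportional to (search range) × (window length) per frame, which is small because this runs only when the segment dictionary is built, not at synthesis time.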

[0157] When the pitch mark of the i-th frame data has been detected by this flow, control moves to step ST304 of FIG. 12, and segment creation is performed by cutting out the speech data before and after the pitch mark.

[0158] In the third embodiment, steps ST207 to ST210 above obtain the area of the windowed offset-added data at each window center position; the processing of step ST702 is the newly proposed processing.

[0159] That is, in step ST702, the offset-added data is raised to a predetermined power, the window set in step ST501 is applied to the powered data to obtain its area, and the center-of-gravity point is detected in step ST703.

[0160] With this center-of-gravity point as the pitch mark, the speech signal is cut out again and centered, and a window based on the pitch period obtained in step ST501 is applied before the result is written to the segment dictionary. This processing may be the same as in the second embodiment.

[0161] The third embodiment is characterized in that the offset-added data is raised to a predetermined power, a window of pitch-period length is applied, and the center-of-gravity point is detected.

[0162] FIG. 14 shows the transition of the center-of-gravity point of predetermined-power data of an offset-added waveform. The waveform shown is that of a female voice uttering "a" (/a/). As in FIG. 5, the vertical axis is the magnitude of the offset-added data and the horizontal axis is time. The points at the arrow tips are the center-of-gravity points (pitch marks) of the data raised to the 1st, 2nd, 4th, and 8th powers, respectively.

[0163] As described above, the speech synthesis method and speech synthesis apparatus according to the third embodiment raise the offset-added data to a predetermined power, apply a window of pitch-period length, and detect the center-of-gravity point. A more appropriate pitch mark can therefore be set with simple processing, which enables high-quality speech synthesis.

[0164] The effects of the present embodiment are described concretely below.

[0165] FIG. 14 shows how the center-of-gravity position behaves when the offset-added data is raised to a predetermined power. When the pitch-length window is applied to the offset-added data as is (1st power), the result is the same as in the second embodiment. When the pitch-length window is instead applied to the offset-added data raised to the 2nd, 4th, and 8th powers, and the window center giving the maximum area is detected for each, the center-of-gravity point is seen to approach the waveform peak as the exponent increases. The waveform peak is emphasized in pitch marking because doing so causes less distortion of the formants and yields more natural synthesized speech.
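The effect of the exponent can be illustrated on toy data. Everything in this sketch is hypothetical: the signal (a sharp peak at t = 20 and a low, broad plateau at t = 40 to 49) is constructed for illustration and is not the waveform of FIG. 14, and the Hanning window and helper names are assumptions.

```python
import math

def hanning(half):
    # Weights for k = -half .. half, stored at index k + half.
    return [0.5 * (1.0 + math.cos(math.pi * k / half))
            for k in range(-half, half + 1)]

def center_of_gravity(data, start, end, window, power):
    # Window center j in [start, end] that maximizes the windowed area S(j).
    half = len(window) // 2
    areas = {}
    for j in range(start, end + 1):
        areas[j] = sum((data[j + k] ** power) * window[k + half]
                       for k in range(-half, half + 1))
    return max(areas, key=areas.get)

# Toy, already non-negative data: sharp peak at 20, broad plateau at 40..49.
data = [0.0] * 70
data[20] = 4.0
for t in range(40, 50):
    data[t] = 1.0

window = hanning(10)
mark_power_1 = center_of_gravity(data, 10, 59, window, 1)
mark_power_8 = center_of_gravity(data, 10, 59, window, 8)
```

On this toy signal the 1st-power result falls on the broad plateau, because its total area exceeds that of the single peak, while the 8th power places the mark on the peak at t = 20.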

[0166] With simple waveform peak detection, the pitch mark position fluctuates somewhat from one waveform to the next, so the synthesized sound becomes rough, with poor connections. In the present method, the power of the offset-added windowed waveform appears as a wave close to a gentle sine wave with one cycle per pitch period, so the vicinity of the waveform peak can be detected stably and smooth synthesized speech is obtained.

[0167] Preliminary experiments by the present inventor have shown that, with an exponent of 8 and a pitch-length Hanning window as the window function, the center-of-gravity point nearly coincides with the waveform peak.

[0168] In each of the above embodiments, center-of-gravity point detection is performed only on the voiced portions of the speech signal. For unvoiced portions, the speech data is used as is.

[0169] The cepstrum method is used for pitch period detection in the second and third embodiments, but other methods may also be used, for example the autocorrelation method or the modified autocorrelation method, which takes the autocorrelation of the linear prediction residual.
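As an illustration of one of the alternatives named above, a plain autocorrelation pitch estimator can be sketched as follows. This is not the cepstrum method used in the embodiments; the function name and the `min_lag`/`max_lag` search bounds are assumptions.

```python
import math

def estimate_pitch_period(frame, min_lag, max_lag):
    """Return the lag (in samples) with the largest autocorrelation."""
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation at this lag over the overlapping part of the frame.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag

# Usage: a pure tone with a period of 50 samples.
tone = [math.sin(2.0 * math.pi * n / 50.0) for n in range(400)]
period = estimate_pitch_period(tone, 20, 80)
```

Restricting the lag range to plausible pitch periods keeps the estimator from locking onto multiples of the true period.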

[0170] The segment creation unit of the speech synthesis method and apparatus in each of the above embodiments can also be applied to processing in various speech output devices, such as pitch mark setting in a so-called voice pitch conversion device, which changes the pitch of the original speech to alter the height of the voice.

[0171] The speech synthesis method and apparatus according to each of the above embodiments can be applied to any speech synthesis method that takes text data as input. They may also be applied to any speech synthesis method that synthesizes speech by converting digital speech waveform data into an analog signal, and may be part of a circuit incorporated in various terminals.

[0172] Furthermore, it goes without saying that the number, types, and connection states of the dictionaries and circuit units constituting the speech synthesis method and apparatus according to each of the above embodiments are not limited to those of the embodiments described. It also goes without saying that the various processing means in the segment creation unit, for example the pitch mark search range and the method of calculating the area of the windowed data, may be changed as appropriate within the scope of the technical idea of the present invention.

【０１７３】[0173]

[Effects of the Invention] The speech synthesis method and speech synthesis apparatus according to the present invention have: means for detecting the minimum value of a speech signal and adding an offset to the speech signal so that the minimum value becomes non-negative, thereby calculating addition data; means for preparing a window of predetermined length, applying the window to the addition data while moving the center of the window, calculating the area of the windowed data at each center position of the window, and detecting the time-axis coordinate of the window center that gives the maximum area; and means for cutting out the speech signal centered on that time-axis coordinate and performing windowed superposition while shifting by the pitch period, with the time-axis coordinate at the center of the cut-out speech signal as the center of superposition. Pitch marks with little fluctuation can therefore be set with simple processing, and high-quality speech synthesis is realized.

[0174] The speech synthesis method and speech synthesis apparatus according to the present invention have: means for detecting a pitch period from a speech signal; means for detecting the minimum value of the speech signal and adding an offset to the speech signal so that the minimum value becomes non-negative, thereby calculating addition data; means for preparing a window of the pitch period length, applying the window to the addition data while moving the center of the window, calculating the area of the windowed data at each center position of the window, and detecting the time-axis coordinate of the window center that gives the maximum area; and means for cutting out the speech signal centered on that time-axis coordinate and performing windowed superposition while shifting by the pitch period, with the time-axis coordinate at the center of the cut-out speech signal as the center of superposition. Pitch marks with little fluctuation can therefore be set with simple processing.

[0175] In addition, because an accurate pitch period is used as the window length, the cutoff frequency of the window can be set to the pitch frequency, and changes in the pitch period within a phoneme can be followed, so more stable pitch marks are obtained.

[0176] The speech synthesis method and speech synthesis apparatus according to the present invention have: means for detecting a pitch period from a speech signal; means for detecting the minimum value of the speech signal and adding an offset to the speech signal so that the minimum value becomes non-negative, thereby calculating addition data; means for preparing a window of the pitch period length, applying the window to the addition data raised to a predetermined power while moving the center of the window, calculating the area of the windowed data at each center position of the window, and detecting the time-axis coordinate of the window center that gives the maximum area; and means for cutting out the speech signal centered on that time-axis coordinate and performing windowed superposition while shifting by the pitch period, with the time-axis coordinate at the center of the cut-out speech signal as the center of superposition. Pitch marks with little fluctuation can therefore be set with simple processing. Furthermore, because the power of the windowed waveform appears as a wave close to a gentle sine wave with one cycle per pitch period, the vicinity of the waveform peak can be detected stably and smooth synthesized speech is obtained.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a speech synthesis method and speech synthesis apparatus according to a first embodiment of the present invention.

FIG. 2 is a flowchart showing the operation of the segment creation unit of the above speech synthesis method and apparatus.

FIG. 3 is a flowchart showing details of the center-of-gravity point detection process of the above speech synthesis method and apparatus.

FIG. 4 is a diagram showing the minimum value detection range and the window function of the above speech synthesis method and apparatus.

FIG. 5 is a diagram showing the transition of the area of the windowed data for window lengths of 64, 96, and 128 points in the above speech synthesis method and apparatus.

FIG. 6 is a diagram showing the transition of the area of the windowed data obtained by center-of-gravity point detection on a waveform with large high-frequency components in the above speech synthesis method and apparatus.

FIG. 7 is a block diagram showing the configuration of a speech synthesis method and speech synthesis apparatus according to a second embodiment of the present invention.

FIG. 8 is a flowchart showing the operation of the segment creation unit of the above speech synthesis method and apparatus.

FIG. 9 is a flowchart showing details of the pitch period detection process of the above speech synthesis method and apparatus.

FIG. 10 is a flowchart showing details of the center-of-gravity point detection process of the above speech synthesis method and apparatus.

FIG. 11 is a block diagram showing the configuration of a speech synthesis method and speech synthesis apparatus according to a third embodiment of the present invention.

FIG. 12 is a flowchart showing the operation of the segment creation unit of the above speech synthesis method and apparatus.

FIG. 13 is a flowchart showing details of the center-of-gravity point detection process of the above speech synthesis method and apparatus.

FIG. 14 is a diagram showing the transition of the center-of-gravity point of predetermined-power data of an offset-added waveform in the above speech synthesis method and apparatus.

FIG. 15 is a schematic diagram showing the conventional PSOLA speech synthesis method, which superimposes waveforms while changing the pitch.

[Explanation of symbols]

101: text analysis unit; 102: word dictionary; 103: parameter generation unit; 104: speech signal input unit; 105, 200, 300: segment creation unit; 106: segment dictionary; 107, 211, 311: windowing unit; 108: synthesized speech unit; 210, 310: pitch mark calculation unit

Claims

[Claims]

1. A speech synthesis method for obtaining synthesized speech by connecting or overlapping speech waveforms, characterized by: detecting a minimum value of a speech signal and adding an offset to the speech signal so that the minimum value becomes non-negative, thereby calculating addition data; preparing a window of a predetermined length, applying the window to the addition data while moving the center of the window, calculating the area of the windowed data at each center position of the window, and detecting the time-axis coordinate of the window center that gives the maximum of the area; and cutting out the speech signal centered on the time-axis coordinate, and performing windowed superposition of the cut-out speech signal while shifting by a pitch period, with the time-axis coordinate as the center of superposition.

2. A speech synthesis method for obtaining synthesized speech by connecting or overlapping speech waveforms, characterized by: detecting a pitch period from a speech signal; detecting a minimum value of the speech signal and adding an offset to the speech signal so that the minimum value becomes non-negative, thereby calculating addition data; preparing a window of the pitch period length, applying the window to the addition data while moving the center of the window, calculating the area of the windowed data at each center position of the window, and detecting the time-axis coordinate of the window center that gives the maximum of the area; and cutting out the speech signal centered on the time-axis coordinate, and performing windowed superposition of the cut-out speech signal while shifting by a pitch period, with the time-axis coordinate as the center of superposition.

3. A speech synthesis method for obtaining synthesized speech by connecting or overlapping speech waveforms, characterized by: detecting a pitch period from a speech signal; detecting a minimum value of the speech signal and adding an offset to the speech signal so that the minimum value becomes non-negative, thereby calculating addition data; preparing a window of the pitch period length, applying the window to data obtained by raising the addition data to a predetermined power while moving the center of the window, calculating the area of the windowed data at each center position of the window, and detecting the time-axis coordinate of the window center that gives the maximum of the area; and cutting out the speech signal centered on the time-axis coordinate, and performing windowed superposition of the cut-out speech signal while shifting by a pitch period, with the time-axis coordinate as the center of superposition.

4. The speech synthesis method according to claim 2, wherein the detection of the pitch period is detection of an accurate pitch period obtained by a cepstrum analysis method.

5. The speech synthesis method according to claim 3, wherein the predetermined power is any one of the second, fourth, and eighth powers.

6. A speech synthesis apparatus for obtaining synthesized speech by connecting and overlapping speech waveforms, characterized by comprising: means for detecting a minimum value of a speech signal and adding an offset to the speech signal so that the minimum value becomes non-negative, thereby calculating addition data; means for preparing a window of a predetermined length, applying the window to the addition data while moving the center of the window, calculating the area of the windowed data at each center position of the window, and detecting the time-axis coordinate of the window center that gives the maximum of the area; and means for cutting out the speech signal centered on the time-axis coordinate, and performing windowed superposition of the cut-out speech signal while shifting by a pitch period, with the time-axis coordinate as the center of superposition.

7. A speech synthesis apparatus for obtaining synthesized speech by connecting or overlapping speech waveforms, characterized by comprising: means for detecting a pitch period from a speech signal; means for detecting a minimum value of the speech signal and adding an offset to the speech signal so that the minimum value becomes non-negative, thereby calculating addition data; means for preparing a window of the pitch period length, applying the window to the addition data while moving the center of the window, calculating the area of the windowed data at each center position of the window, and detecting the time-axis coordinate of the window center that gives the maximum of the area; and means for cutting out the speech signal centered on the time-axis coordinate, and performing windowed superposition of the cut-out speech signal while shifting by a pitch period, with the time-axis coordinate as the center of superposition.

8. A speech synthesis apparatus for obtaining synthesized speech by connecting or overlapping speech waveforms, characterized by comprising: means for detecting a pitch period from a speech signal; means for detecting a minimum value of the speech signal and adding an offset to the speech signal so that the minimum value becomes non-negative, thereby calculating addition data; means for preparing a window of the pitch period length, applying the window to data obtained by raising the addition data to a predetermined power while moving the center of the window, calculating the area of the windowed data at each center position of the window, and detecting the time-axis coordinate of the window center that gives the maximum of the area; and means for cutting out the speech signal centered on the time-axis coordinate, and performing windowed superposition of the cut-out speech signal while shifting by a pitch period, with the time-axis coordinate as the center of superposition.

9. The speech synthesizer according to claim 6, wherein a window length of the window is determined based on a pitch period obtained by a cepstrum analysis method.