JP2001100777A

JP2001100777A - Method and device for voice synthesis

Info

Publication number: JP2001100777A
Application number: JP27463299A
Authority: JP
Inventors: Yoshinori Shiga; 芳則志賀
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1999-09-28
Filing date: 1999-09-28
Publication date: 2001-04-13

Abstract

PROBLEM TO BE SOLVED: To provide a method and a device for voice synthesis which synthesizes a voice of natural intonation which is easy to hear and does not fatigue a listener when being heard for a long time. SOLUTION: A pitch pattern generation means generates a first pitch pattern or its feature quantity from analysis information of an input text in accordance with a prescribed rule and generates a second pitch pattern, which is obtained by correction based on a pitch pattern generation process model, from the first pitch pattern or its feature quantity and outputs this second pitch pattern.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech synthesis method and apparatus that generate the pitch pattern of the speech to be synthesized from the analysis result of an input text.

[0002]

2. Description of the Related Art

A typical conventional speech synthesizer is the rule-based synthesizer, which stores speech in subdivided units and can synthesize arbitrary speech by combining them. The rule-based synthesizer performs text-to-speech conversion: it converts input text data into a symbol string consisting of phonemic and prosodic information and generates speech from that symbol string. Its text-to-speech conversion mechanism is broadly divided into a language processing unit and a speech synthesis unit. Taking Japanese rule-based synthesis as an example, it operates as follows.

[0003] The language processing unit performs linguistic processing, such as morphological analysis and syntactic analysis, on text data consisting of mixed kanji-kana sentences input from a text file. It decomposes the text into morphemes and estimates the dependency relations, and at the same time assigns a reading and an accent type to each morpheme.

[0004] The language processing unit then applies accent shift rules, such as those for compound words, to determine the accent type of each phrase that forms an accentual unit when read aloud (hereinafter, an accent phrase).

[0005] The morphemes, dependency relations, accent types, and readings obtained by the language processing unit are hereinafter collectively called linguistic information. Next, the speech synthesis unit determines the duration of each phoneme contained in the obtained reading.

[0006] Phoneme durations are generally determined on the basis of the isochrony of morae that is characteristic of Japanese. For example, the duration of a consonant is fixed according to the consonant type, and the duration of each vowel is determined so that the interval between consonant-to-vowel transitions, which serve as the reference time of each mora, is constant.
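The isochrony-based rule above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the consonant duration table, the 150 ms mora interval, and the treatment of the last mora (no following consonant) are all assumptions.

```python
def mora_durations(morae, consonant_ms, mora_interval_ms):
    """Return (consonant_ms, vowel_ms) for each mora under mora isochrony.

    morae: list of (consonant, vowel) pairs; consonant is "" for a bare vowel.
    consonant_ms: hypothetical table of fixed durations per consonant type.
    """
    cons = [consonant_ms.get(c, 0.0) for c, _ in morae]
    result = []
    for i, c_len in enumerate(cons):
        # A vowel lasts from its consonant-to-vowel transition until the next
        # consonant starts, so successive transitions stay one interval apart.
        next_c = cons[i + 1] if i + 1 < len(cons) else 0.0
        result.append((c_len, mora_interval_ms - next_c))
    return result


# Hypothetical values, for illustration only.
durations = mora_durations(
    [("k", "a"), ("s", "a"), ("", "i")], {"k": 60.0, "s": 90.0}, 150.0
)
```

With these values the consonant-to-vowel transitions fall at 60 ms, 210 ms, and 360 ms, that is, exactly one mora interval apart.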

[0007] Then, following the reading obtained as described above, the phoneme parameter generation processing unit in the speech synthesis unit reads the required speech units from the speech unit memory, stretches or compresses them along the time axis according to the phoneme durations determined by the method described above, and concatenates them to generate the feature parameter sequence of the speech to be synthesized.

[0008] The speech unit memory in the speech synthesis unit stores a large number of speech units prepared in advance. The speech units are created by analyzing speech uttered by an announcer or the like to obtain predetermined speech feature parameters that represent the spectral envelope, and then cutting out from those feature parameters every syllable that occurs in Japanese speech, in a predetermined synthesis unit, which in this example is the Japanese syllable (consonant + vowel).

[0009] Low-order cepstrum coefficients are used as the feature parameters generated by the phoneme parameter generation processing unit. They can be obtained as follows.

[0010] First, speech data uttered by an announcer or the like is cut out with a window function (for example, a Hanning window) of fixed width at a fixed period, and the speech waveform in each window is Fourier transformed to compute the short-time spectrum of the speech. Next, the logarithm of the power of the short-time spectrum is taken to obtain the log power spectrum, which is then inverse Fourier transformed. The result is the cepstrum coefficients.
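The three steps above (windowing, log power spectrum, inverse transform) can be sketched for a single frame as follows. This is a minimal sketch using a naive DFT from the standard library so that it is self-contained; the function names and the small power floor that avoids log(0) are assumptions, and a practical implementation would use an FFT.

```python
import cmath
import math


def hanning(n):
    """Hanning window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse scaled by 1/n)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        s = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        out.append(s / n if inverse else s)
    return out


def cepstrum(frame, floor=1e-12):
    """Cepstrum of one frame: window -> DFT -> log power -> inverse DFT."""
    windowed = [s * w for s, w in zip(frame, hanning(len(frame)))]
    spectrum = dft(windowed)
    # The floor is an assumption that keeps log() defined for empty bins.
    log_power = [math.log(max(abs(c) ** 2, floor)) for c in spectrum]
    # The log power spectrum of a real signal is real and even,
    # so the inverse transform is real up to rounding error.
    return [c.real for c in dft(log_power, inverse=True)]
```

Keeping only the first few coefficients of the returned list gives the low-order cepstrum coefficients used as feature parameters.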

[0011] It is well known that the high-order cepstrum coefficients carry the fundamental frequency information of speech, whereas the low-order coefficients carry its spectral envelope information. Separately, pitch patterns are extracted from human speech using the autocorrelation method or the cepstrum method and cut into accent-phrase units, and the resulting unit patterns are stored in a unit pattern storage unit in the speech synthesis unit.

[0012] Each unit pattern is associated with linguistic information by a statistical method such as quantification type I. Following this association rule, the unit pattern selection/connection unit in the speech synthesis unit determines the unit patterns to use from the linguistic information output by the language processing unit.

[0013] Furthermore, the unit pattern selection/connection unit smoothly connects, with interpolation, the unit patterns determined for the accent phrases that make up the sentence, and finally generates the pitch pattern of the sentence.

[0014] The synthesis filter processing unit in the speech synthesis unit uses periodic pulses based on the pitch pattern as the excitation source in voiced segments and white noise in unvoiced segments, filters the excitation with filter coefficients computed from the speech feature parameter sequence, and synthesizes the desired speech. A log magnitude approximation (LMA) filter, which uses the cepstrum coefficients directly as its filter coefficients, is used as the synthesis filter.
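The excitation source described above can be sketched as follows. This covers only the source signal (pulses in voiced segments, white noise in unvoiced segments) and not the LMA filter itself; the function name, the convention that an F0 of zero marks an unvoiced sample, and the unit-variance Gaussian noise are assumptions.

```python
import random


def excitation(f0_per_sample, sample_rate, seed=0):
    """Build an excitation signal from a per-sample F0 track (0 = unvoiced)."""
    rng = random.Random(seed)
    out = []
    phase = 0.0  # fraction of the current pitch period that has elapsed
    for f0 in f0_per_sample:
        if f0 > 0.0:
            phase += f0 / sample_rate
            if phase >= 1.0:
                phase -= 1.0
                out.append(1.0)  # emit one pulse per pitch period
            else:
                out.append(0.0)
        else:
            phase = 0.0
            out.append(rng.gauss(0.0, 1.0))  # white noise sample
    return out


# 640 voiced samples at 125 Hz followed by 100 unvoiced samples, at 8 kHz.
signal = excitation([125.0] * 640 + [0.0] * 100, 8000)
```

At 8 kHz a 125 Hz pitch gives one pulse every 64 samples, so the voiced part of `signal` contains ten pulses.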

[0015]

PROBLEMS TO BE SOLVED BY THE INVENTION

Conventional speech synthesizers, typified by the rule-based synthesizer described above, have the following problem with the pitch patterns they generate. For pitch pattern control, the pitch pattern generation process model described in the Journal of the Acoustical Society of Japan, vol. 27 (1971), pp. 445-453, is widely used. It describes the behavior of the pitch as the superposition of a global declining component and local rise-fall components, and it has been shown to approximate the pitch patterns of actual human speech very closely.

[0016] However, it is difficult to separate the two components, the global declining component and the local rise-fall component, automatically. For this reason, speech synthesis that controls pitch with generation rules built automatically by statistical methods from a large amount of speech data in a speech corpus uses, as described above, modeling that does not take the superpositional generation process model into account.

[0017] Modeling that ignores the superpositional generation process model, however, focuses only on the shape and rate of change of the pitch pattern. Compared with the pitch pattern generation process model, it therefore fails to reflect in the synthesized speech the constraints of the human vocal mechanism and the linguistic constraints that appear in human pitch patterns, and the resulting intonation is unnatural.

[0018] In view of the above problem, an object of the present invention is to enable the generation of natural, human-like pitch patterns in a speech synthesis method and apparatus that perform pitch control based on a speech corpus, by performing pitch control that can take the constraints of the vocal mechanism and linguistic constraints into account when the pitch pattern is generated.

[0019]

SUMMARY OF THE INVENTION

To solve the above problem and achieve the object, the speech synthesis method and apparatus of the present invention are configured as follows. The speech synthesis method of the present invention generates a first pitch pattern or its feature quantity from the analysis information of an input text according to a predetermined rule, generates from this first pitch pattern or its feature quantity a second pitch pattern corrected on the basis of a pitch pattern generation process model, and synthesizes speech on the basis of the second pitch pattern.

[0020] The speech synthesis apparatus of the present invention comprises: text analysis means for analyzing an input text and generating text analysis information; pitch pattern generation means for generating a pitch pattern on the basis of the text analysis information output by the text analysis means, which generates a first pitch pattern or its feature quantity from the text analysis information according to a predetermined rule, generates from this first pitch pattern or its feature quantity a second pitch pattern corrected on the basis of a pitch pattern generation process model, and outputs the second pitch pattern; and speech synthesis means for synthesizing speech on the basis of the second pitch pattern output by the pitch pattern generation means.

[0021]

DESCRIPTION OF THE EMBODIMENTS

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the schematic configuration of a rule-based speech synthesizer according to one embodiment of the present invention. This rule-based speech synthesizer is realized by running, on an information processing apparatus such as a personal computer, dedicated software supplied on a recording medium such as a CD-ROM, floppy disk, or memory card, or over a communication medium such as a network.

[0022] This rule-based speech synthesizer has a text-to-speech conversion function that generates speech from text data, and consists of a text analysis unit 101 and a speech synthesis unit 102. The text analysis unit 101 analyzes the input sentence, written in mixed kanji and kana, to identify its words (morphological analysis), estimates the structure of the sentence from the part-of-speech and other information obtained (dependency analysis), and outputs the result.

[0023] The speech synthesis unit 102 generates speech from the text analysis result output by the text analysis unit 101. The text data to be converted to speech (read aloud), here a Japanese document, is stored as a text file 103.

[0024] In this apparatus, under the text-to-speech conversion software, a mixed kanji-kana sentence is read from the text file 103, and the text analysis unit 101 and the speech synthesis unit 102 perform the text-to-speech conversion described below to synthesize speech.

[0025] First, the mixed kanji-kana sentence (input sentence) read from the text file 103 is input to the morphological analysis unit 104 in the text analysis unit 101. The morphological analysis unit 104 performs morphological analysis on the input sentence and generates reading information and accent information. Morphological analysis is the task of determining which character strings in a given sentence form words and what grammatical attributes those words have.

[0026] The morphological analysis unit 104 matches the input sentence against the Japanese analysis dictionary 105 to find all candidate morpheme sequences, and outputs from among them a combination that can be connected grammatically. The Japanese analysis dictionary 105 registers the reading and accent type of each morpheme together with the information used in morphological analysis. Consequently, once the morphemes are determined by morphological analysis, their readings and accent types are obtained at the same time.

[0027] Next, the grammatical attributes of the individual words in the sentence, as determined by the morphological analysis unit 104, are input to the dependency analysis unit 106, which analyzes the sentence structure to estimate the dependency relations between the words.

[0028] In this way, the text analysis unit 101 outputs parts of speech and dependency relations, together with word reading and accent information, to the speech synthesis unit 102. In the speech synthesis unit 102, the phoneme duration determination processing unit 107 is started first. As in the related art described above, this embodiment determines phoneme durations on the basis of the isochrony of morae characteristic of Japanese, so the description is omitted.

[0029] Next, the pitch pattern generation processing unit 108 is started. It has a pitch pattern outline estimation processing unit 109, a model parameter calculation processing unit 110, and a pitch control processing unit 111.

[0030] The pitch pattern outline estimation processing unit 109 uses a pitch outline estimation rule prepared in advance to estimate two point pitches that represent the pitch outline (hereinafter, representative pitches). The pitch outline estimation rule is created as follows.

[0031] First, speech of a person (for example, an announcer) reading a predetermined text aloud is recorded. The recorded speech is loaded into a computer and analyzed to extract the pitch (or fundamental frequency) pattern, which represents the pattern of voice pitch.

[0032] Various methods have been proposed for extracting pitch patterns, such as methods using autocorrelation and methods using the cepstrum, and any suitable one may be used. From the pitch pattern obtained in this way, two points are extracted for each accent phrase, as shown in FIG. 2: the maximum value and the pitch value of the final mora of the accent phrase. These are the representative pitches described above.
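Extracting the two representative pitches from one accent phrase can be sketched as follows. The function name and the data layout are assumptions, and the text does not say which frame of the final mora supplies its pitch value, so the sketch takes the last voiced frame inside the final mora.

```python
def representative_pitches(f0_contour, final_mora_span):
    """Return ((t_max, f0_max), (t_end, f0_end)) for one accent phrase.

    f0_contour: list of (time_s, f0_hz) frames, with f0_hz == 0 where unvoiced.
    final_mora_span: (start_s, end_s) of the phrase-final mora.
    """
    voiced = [(t, f) for t, f in f0_contour if f > 0.0]
    peak = max(voiced, key=lambda frame: frame[1])  # maximum pitch in the phrase
    start, end = final_mora_span
    in_final_mora = [(t, f) for t, f in voiced if start <= t <= end]
    tail = in_final_mora[-1]  # assumption: last voiced frame of the final mora
    return peak, tail


# Hypothetical contour, for illustration only.
contour = [(0.00, 0.0), (0.05, 180.0), (0.10, 220.0), (0.15, 200.0),
           (0.20, 150.0), (0.25, 140.0), (0.30, 0.0)]
peak, tail = representative_pitches(contour, (0.20, 0.30))
```

The sketch assumes that the phrase and its final mora each contain at least one voiced frame.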

[0033] Next, the correspondence between the results of text analysis of the text prepared for the recording and the representative point pitches is formulated as rules. If the recorded material is large enough, rules for estimating the representative point pitches can be built automatically using statistical methods.

[0034] In this embodiment, the accent type, number of morae, part of speech, and dependency relation of each accent phrase in the sentence, obtained as text analysis results, are used as explanatory variables, and each representative point pitch is used as the external criterion. Quantification type I, described in "Quantification Theory and Data Processing" by Chikio Hayashi et al., Asakura Shoten (1982), is applied to build the rules for estimating the representative point pitches automatically.
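Quantification type I is commonly described as least-squares regression on dummy-coded categorical variables, and the rule building above can be sketched on that basis. This is an illustration under that assumption, not the procedure of the cited book: the function names, the reference-category coding, and the toy data are all hypothetical.

```python
def encode(sample, columns):
    """Dummy-code one sample; columns[0] is None and stands for the intercept."""
    return [1.0 if col is None or sample[col[0]] == col[1] else 0.0
            for col in columns]


def fit_quantification_1(samples, targets):
    """Least-squares category scores for categorical explanatory variables."""
    columns = [None]
    for item in sorted(samples[0]):
        categories = sorted({s[item] for s in samples})
        columns += [(item, c) for c in categories[1:]]  # first one is the reference
    rows = [encode(s, columns) for s in samples]
    n = len(columns)
    # Augmented normal equations (X^T X | X^T y).
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * y for r, y in zip(rows, targets))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    weights = [a[i][n] / a[i][i] for i in range(n)]
    return columns, weights


def predict(sample, columns, weights):
    """Estimated representative pitch: intercept plus matching category scores."""
    return sum(w * x for w, x in zip(weights, encode(sample, columns)))


# Toy data: peak pitch in Hz as an additive function of two categories.
samples = [{"accent_type": a, "mora_count": m}
           for a in ("flat", "head-high") for m in (2, 3, 4)]
targets = [200.0 + (30.0 if s["accent_type"] == "head-high" else 0.0)
           + {2: 0.0, 3: 10.0, 4: 15.0}[s["mora_count"]] for s in samples]
columns, weights = fit_quantification_1(samples, targets)
```

The sketch requires enough data for the normal equations to be non-singular, which corresponds to the requirement in the text that the recorded material be sufficiently large.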

[0035] Using the representative point pitch estimation rules built in this way, the pitch pattern outline estimation processing unit 109 generates the representative pitches for each accent phrase from the accent information, part-of-speech information, and dependency information passed from the text analysis unit 101.

[0036] Next, the model parameter calculation processing unit 110 calculates the model parameters of the generation process model from the representative point pitches. One pitch pattern generation process model is the critically damped second-order linear system model described in the Journal of the Acoustical Society of Japan, vol. 27 (1971), pp. 445-453. The model used in this embodiment simplifies the phrase component as follows, so that the model parameters can be estimated from two representative pitches per accent phrase.

[0037] The phrase component is expressed using a gentle declining component represented by an exponential function and a voice onset component at the beginning of the accent phrase. Let $G_{pi}$ and $Z_{pi}$ denote the declining component and the voice onset component of the i-th accent phrase, respectively. As functions of time $t$ they are

$$G_{pi}(t) = \begin{cases} 0 & \text{(outside the $i$-th accent phrase)} \\ e^{-\alpha t} & \text{(inside the $i$-th accent phrase)} \end{cases} \tag{1}$$

$$Z_{pi}(t) = \begin{cases} 0 & (t < 0,\ t > 1/\gamma) \\ \gamma t\, e^{1-\gamma t} - 1 & (0 \le t \le 1/\gamma) \end{cases} \tag{2}$$

[0038] The accent component, on the other hand, is approximated by the step response function of a critically damped second-order linear system. Let $G_{ai}$ denote the step response for the i-th accent phrase. As a function of time $t$ it is

$$G_{ai}(t) = \begin{cases} 0 & (t < 0) \\ \min\left[1 - (1 + \beta t)\, e^{-\beta t},\ \theta\right] & (t \ge 0) \end{cases} \tag{3}$$

[0039] In equations (1) to (3), α, β, and γ are the natural angular frequencies of the respective control mechanisms, and θ ≤ 1, corresponding to the fact that in actual speech the accent component reaches its upper limit in a finite time.
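The three component functions of equations (1) to (3) can be written directly as code. This is a minimal sketch: the function names are hypothetical, time is measured from the start of the respective command, and the `duration` argument is an assumed way of expressing "inside the i-th accent phrase".

```python
import math


def phrase_decline(t, alpha, duration):
    """Eq. (1): e^(-alpha t) inside the accent phrase, 0 outside it."""
    return math.exp(-alpha * t) if 0.0 <= t <= duration else 0.0


def voice_onset(t, gamma):
    """Eq. (2): rises from -1 at t = 0 to 0 at t = 1/gamma, 0 elsewhere."""
    if 0.0 <= t <= 1.0 / gamma:
        return gamma * t * math.exp(1.0 - gamma * t) - 1.0
    return 0.0


def accent_step(t, beta, theta):
    """Eq. (3): critically damped step response, clipped at theta."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), theta)
```

For example, with the hypothetical values β = 20 and θ = 0.9, `accent_step(1.0, 20.0, 0.9)` has already reached the ceiling θ.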

[0040] The F₀ pattern is obtained by superimposing these components, giving equation (4) (the equation itself is not reproduced in this text).

[0041] Here, $A_{pi}$ and $A_{ai}$ correspond to the magnitudes of the phrase command and the accent command, respectively; $T_{0i}$ is the time of the phrase command; and $T_{1i}$ and $T_{2i}$ are the start and end times of the accent command.

[0042] If $F_{min}$, α, β, and γ are fixed, $A_{pi}$ and $A_{ai}$ can be determined uniquely by applying the point pitches $(t_{F0mx}, F_{0mx})$ and $(t_{F0e}, F_{0e})$ to the above equation. When the accent type is head-high or undulating,

$$A_{pi} = e^{\alpha (t_{F0e} - T_{0i})} \ln(F_{0e} / F_{min}),$$

and when the accent type is flat,

$$A_{pi} = \frac{\ln(F_{0mx} / F_{0e})}{e^{-\alpha (t_{F0mx} - T_{0i})} - e^{-\alpha (t_{F0e} - T_{0i})}}.$$

Using $A_{pi}$, $A_{ai}$ is then

$$A_{ai} = \ln(F_{0mx} / F_{min}) - A_{pi}\, e^{-\alpha (t_{F0mx} - T_{0i})}.$$
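The closed-form solution for the command magnitudes can be sketched as follows. It assumes that the exponents carry the decay constant α of the declining component, which is what makes the expressions consistent with equation (1); the function name, the accent-type labels, and the numeric values are hypothetical.

```python
import math


def solve_commands(accent_type, t_max, f0_max, t_end, f0_end, t0, f_min, alpha):
    """Return (A_p, A_a) from the two representative pitches of one phrase.

    (t_max, f0_max): time and value of the pitch maximum.
    (t_end, f0_end): time and value of the phrase-final mora pitch.
    t0: time of the phrase command; f_min: baseline frequency.
    """
    decay_max = math.exp(-alpha * (t_max - t0))
    decay_end = math.exp(-alpha * (t_end - t0))
    if accent_type == "flat":
        # The accent component is the same at both points, so it cancels.
        a_p = math.log(f0_max / f0_end) / (decay_max - decay_end)
    else:
        # Head-high or undulating: the accent component has ended by t_end.
        a_p = math.log(f0_end / f_min) / decay_end
    a_a = math.log(f0_max / f_min) - a_p * decay_max
    return a_p, a_a


# Round trip with hypothetical values: build two pitches from known commands.
f_min, alpha = 80.0, 1.0
f0_end = f_min * math.exp(0.5 * math.exp(-alpha * 0.8))
f0_max = f_min * math.exp(0.5 * math.exp(-alpha * 0.2) + 0.4)
a_p, a_a = solve_commands("undulating", 0.2, f0_max, 0.8, f0_end, 0.0, f_min, alpha)
```

The round trip recovers the phrase command 0.5 and the accent command 0.4 that were used to construct the two pitch values.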

[0043] FIG. 3 shows an example of obtaining the model parameters for an undulating-type accent. The model parameter calculation processing unit 110 further imposes the physiological and physical constraints of the vocal mechanism, and linguistic constraints, on the model parameters obtained as described above. This operation mitigates the unnaturalness in the generated pitch pattern that is caused by estimation errors in the representative point pitches. Specifically, the model parameters are corrected as follows.

1. If a negative voice onset ($A_{pi} < A_{p(i-1)}\, G_{p(i-1)}(T_{0i} - T_{0(i-1)})$) occurs, the magnitude of the phrase command is corrected so that the voice onset becomes zero. (This removes negative voice onsets, which cannot actually occur.)

2. The magnitude of the phrase command is corrected so that the magnitude of the accent command does not fall below the lower limit $A_{ai\,min}$ ($\ge 0$). (An accent nucleus always exists.)

3. For an accent phrase preceded by a flat-type accent phrase, when the sum of the magnitudes of its phrase command and accent command is less than the sum of the phrase component and the accent component at the end of the preceding accent phrase, the voice onset is corrected to zero and the magnitude of the accent command is corrected to match that of the preceding accent phrase. (This prevents an accent nucleus from arising at the final mora of a flat-type accent.)

Finally, the pitch control processing unit 111 generates the pitch pattern according to equation (4) from the corrected model parameters.
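Correction rules 1 and 2 can be sketched as follows. The text states what each rule must achieve but not how the new phrase command is computed, so the formulas below are assumptions derived from the exponential decline of the phrase component and from the expression for the accent command magnitude. Rule 3 and the order in which the rules are applied are not covered, and all names are hypothetical.

```python
import math


def remove_negative_onset(a_p, a_p_prev, t0, t0_prev, alpha):
    """Rule 1: raise A_p to the level carried over from the previous phrase."""
    carried = a_p_prev * math.exp(-alpha * (t0 - t0_prev))
    return max(a_p, carried)  # equal to `carried` gives a zero voice onset


def enforce_accent_floor(a_p, log_peak_ratio, t_max, t0, alpha, a_a_min):
    """Rule 2: lower A_p until the accent command reaches its lower limit.

    log_peak_ratio is ln(F0mx / Fmin) for the accent phrase.
    Returns the corrected (A_p, A_a).
    """
    decay = math.exp(-alpha * (t_max - t0))
    a_a = log_peak_ratio - a_p * decay
    if a_a < a_a_min:
        a_p = (log_peak_ratio - a_a_min) / decay
        a_a = a_a_min
    return a_p, a_a


# Hypothetical values, for illustration only.
fixed_a_p = remove_negative_onset(0.1, 0.5, 1.0, 0.0, 1.0)
floored_a_p, floored_a_a = enforce_accent_floor(0.6, 0.5, 0.2, 0.0, 1.0, 0.05)
```

Both functions leave the parameters unchanged when the corresponding constraint is already satisfied.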

[0044] Meanwhile, the speech unit selection unit 112 in the speech synthesis unit 102 selects speech units on the basis of the reading of each accent phrase. In this embodiment, a speech unit file (not shown) storing a total of 137 speech units is prepared. Real speech sampled at 11025 Hz is analyzed by the improved cepstrum method with a window length of 20 msec and a frame period of 10 msec to obtain the low-order cepstrum coefficients of orders 0 to 25, and from these all the syllables required to synthesize Japanese speech are cut out in consonant + vowel units. The contents of this speech unit file are assumed to have been read into a speech unit area 113, reserved for example in main memory, when text-to-speech conversion under the text-to-speech conversion software starts.

[0045] In this way, the consonant + vowel speech units described above are read sequentially from the speech unit area 113 and passed to the unit connection unit 114. The unit connection unit 114 generates the phoneme parameters (feature parameters) of the speech to be synthesized by sequentially interpolating and connecting the speech units input from the unit selection unit 112.

[0046] When the pitch pattern has been generated by the pitch pattern generation processing unit 108 and the phoneme parameters have been generated by the unit connection unit 114 as described above, the synthesis filter unit 115 in the speech synthesis unit 102 is started. Using white noise in unvoiced segments and impulses in voiced segments as the driving excitation, the synthesis filter unit 115 outputs speech through an LMA filter whose filter coefficients are the cepstrum coefficients, that is, the phoneme parameters, used directly.

[0047] One embodiment of the present invention has been described above, but the present invention is not limited to it. Instead of the cepstrum, other parameters such as LPC, PARCOR, or formants may be used as the speech feature parameters. This embodiment adopts an analysis-synthesis method using feature parameters, but a waveform editing (superposition) method or a formant synthesis method may also be used. For phoneme duration control as well, a statistical method, for example, may be used instead of the isochrony-based method described above.

[0048]

EFFECTS OF THE INVENTION

According to the present invention, pitch control that can take the constraints of the vocal mechanism and linguistic constraints into account when generating the pitch pattern makes it possible to generate natural, human-like pitch patterns. This provides a speech synthesis method and apparatus whose natural intonation is easy to listen to and does not tire the listener even over long periods.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the schematic configuration of a rule-based speech synthesizer according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating the method of extracting the representative point pitches.

FIG. 3 is an explanatory diagram of obtaining the model parameters for an undulating-type accent.

[Explanation of symbols]

101 Text analysis unit
102 Speech synthesis unit
103 Text file
104 Morphological analysis unit
105 Japanese analysis dictionary
106 Dependency analysis unit
107 Phoneme duration determination processing unit
108 Pitch pattern generation processing unit
109 Pitch pattern outline estimation processing unit
110 Model parameter calculation processing unit
111 Pitch control processing unit
112 Speech unit selection unit
113 Speech unit area
114 Unit connection unit
115 Synthesis filter processing unit

Claims

1. A speech synthesis method for generating a pitch pattern based on information obtained by analyzing an input text, and synthesizing a speech based on the generated pitch pattern, comprising the steps of: Generating a pitch pattern or a characteristic amount thereof, generating a second pitch pattern corrected from the first pitch pattern or the characteristic amount based on a pitch pattern generation process model, and generating a voice based on the second pitch pattern. A speech synthesis method characterized by combining

2. A text analysis means for analyzing input text to generate text analysis information, and a pitch pattern generation means for generating a pitch pattern based on text analysis information output by the text analysis means, A first pitch pattern or a feature thereof is generated from the text analysis information in accordance with rules, and a second pitch pattern corrected based on the pitch pattern generation process model is generated from the first pitch pattern or the feature thereof. A pitch pattern generating means for outputting the second pitch pattern; and a voice synthesizing means for synthesizing a voice based on the second pitch pattern output from the pitch pattern generating means. Synthesizer.

3. A speech synthesis method for generating a pitch pattern based on information obtained by analyzing an input text, and synthesizing speech based on the generated pitch pattern, the method comprising the steps of: generating a first pitch pattern or a feature amount thereof from the analysis information of the input text in accordance with a predetermined rule; calculating a model parameter of a pitch pattern generation process model from the first pitch pattern or the feature amount thereof; generating a second pitch pattern from the model parameter based on the pitch pattern generation process model; and synthesizing speech based on the second pitch pattern.

4. A speech synthesizer comprising: text analysis means for analyzing input text to generate text analysis information; pitch pattern generation means for generating a pitch pattern based on the text analysis information output by the text analysis means, the pitch pattern generation means generating a first pitch pattern or a feature amount thereof from the text analysis information in accordance with a predetermined rule, calculating a model parameter of a pitch pattern generation process model from the first pitch pattern or the feature amount thereof, and generating and outputting a second pitch pattern from the model parameter based on the pitch pattern generation process model; and speech synthesis means for synthesizing speech based on the pitch pattern output by the pitch pattern generation means.

5. The speech synthesis method according to claim 1, wherein the predetermined rule is created from a speech corpus by a learning algorithm such as a statistical method.

6. The speech synthesizer according to claim 2, wherein the predetermined rule in the pitch pattern generation means is created from a speech corpus by a learning algorithm such as a statistical method.

7. The speech synthesis method according to claim 5, wherein the feature amount of the first pitch pattern includes at least a pitch maximum value in an accent phrase.

8. The speech synthesizer according to claim 6, wherein the feature amount of the first pitch pattern includes at least a pitch maximum value in an accent phrase.

9. The speech synthesis method according to claim 5, wherein the feature amount of the first pitch pattern includes at least a pitch maximum value in an accent phrase and a pitch of the final mora of the accent phrase.

10. The speech synthesizer according to claim 6, wherein the feature amount of the first pitch pattern includes at least a pitch maximum value in an accent phrase and a pitch of the final mora of the accent phrase.

11. The speech synthesis method according to claim 1, wherein the pitch pattern generation process model is a superposition type model.

12. The speech synthesizer according to claim 2, wherein the pitch pattern generation process model is a superposition type model.

13. The speech synthesis method according to claim 11, wherein the superposition type model approximates a global descending component by an exponential function.

14. The speech synthesizer according to claim 12, wherein the superposition type model approximates a global descending component by an exponential function.

15. The speech synthesis method according to claim 11, wherein the superposition type model has a constraint that prohibits generation of a negative rumbling when generating a phrase component.

16. The speech synthesizer according to claim 12, wherein the superposition type model has a constraint that prohibits generation of a negative rumbling when generating a phrase component.

17. The speech synthesis method according to claim 11, wherein the superposition type model has a constraint that, when generating an accent component, the amplitude of the accent component does not fall below a predetermined lower limit.

18. The speech synthesizer according to claim 12, wherein the superposition type model has a constraint that, when generating an accent component, the amplitude of the accent component does not fall below a predetermined lower limit.

19. The speech synthesis method according to claim 11, wherein, when generating a phrase component and an accent component, if the preceding accent phrase is of the flat accent type, the superposition type model corrects the phrase component and the accent component so that an accent nucleus does not occur at the final mora of the preceding accent phrase.

20. The speech synthesizer according to claim 12, wherein, when generating a phrase component and an accent component, if the preceding accent phrase is of the flat accent type, the superposition type model corrects the phrase component and the accent component so that an accent nucleus does not occur at the final mora of the preceding accent phrase.
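
[Illustrative sketch]

The superposition type model recited in claims 11 to 18 can be pictured with the following minimal Python sketch. It is not the implementation disclosed in this publication. It only assumes a log-pitch contour formed as the sum of an exponentially decaying global descending component, phrase components, and accent components. The response functions phrase_response and accent_response are borrowed from the well-known Fujisaki command-response model purely for illustration. All function names, parameter names, and numeric values (alpha, beta, accent_floor, and so on) are hypothetical. The constraint of claims 15 and 16 is read here, as an assumption, as clamping the phrase amplitude to be non-negative.

```python
import math


def global_decline(t, start, end, tau):
    """Global descending component: decays exponentially from `start` toward `end`."""
    return end + (start - end) * math.exp(-t / tau)


def phrase_response(t, alpha=3.0):
    """Assumed phrase response (Fujisaki-style impulse response); zero before onset."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=20.0):
    """Assumed accent response (Fujisaki-style step response); zero before onset."""
    return 1.0 - (1.0 + beta * t) * math.exp(-beta * t) if t >= 0 else 0.0


def second_pitch_pattern(times, baseline, phrases, accents, accent_floor=0.05):
    """Superpose the three components at each time in `times` (log-pitch domain).

    baseline: (start, end, tau) for the global descending component
    phrases:  list of (onset_time, amplitude)
    accents:  list of (onset_time, offset_time, amplitude)
    """
    contour = []
    for t in times:
        y = global_decline(t, *baseline)
        for onset, amp in phrases:
            # Assumed reading of the phrase constraint: no negative amplitude.
            y += max(0.0, amp) * phrase_response(t - onset)
        for onset, offset, amp in accents:
            # Accent amplitude is not allowed to fall below a lower limit.
            amp = max(accent_floor, amp)
            y += amp * (accent_response(t - onset) - accent_response(t - offset))
        contour.append(y)
    return contour


if __name__ == "__main__":
    times = [i * 0.1 for i in range(11)]
    contour = second_pitch_pattern(
        times,
        baseline=(5.0, 4.5, 2.0),
        phrases=[(0.0, 0.4)],
        accents=[(0.2, 0.6, 0.3)],
    )
    for t, y in zip(times, contour):
        print(f"{t:.1f} s  {math.exp(y):.1f} Hz")
```

In this sketch the model parameters (the baseline tuple and the phrase and accent amplitudes) would be calculated from the first pitch pattern or its feature amount, as in claims 3 and 4. Here they are simply supplied by hand.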