JPH01284898A - Voice synthesizing device - Google Patents

Voice synthesizing device

Info

Publication number
JPH01284898A
JPH01284898A (application JP63115721A)
Authority
JP
Japan
Prior art keywords
waveform
information
dictionary
unit
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP63115721A
Other languages
Japanese (ja)
Other versions
JP2761552B2 (en)
Inventor
Tomohisa Hirokawa
広川 智久
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP63115721A priority Critical patent/JP2761552B2/en
Publication of JPH01284898A publication Critical patent/JPH01284898A/en
Application granted granted Critical
Publication of JP2761552B2 publication Critical patent/JP2761552B2/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Links

Abstract

PURPOSE: To output synthesized speech that is both natural and clear by accumulating a large number of waveforms in advance as a dictionary and selecting and connecting the waveforms best suited to the input text. CONSTITUTION: An input text from a terminal 1 is analyzed by a text analysis part 2, and based on its output, prosody information for synthesizing speech is generated by a prosody information generating part 3. A waveform dictionary 9 is provided that stores a large quantity of waveform information for every unit appropriate for assembling the output speech, such as a phoneme. Using the information from the analysis part 2 and the generating part 3, a waveform selecting part 8 selects an appropriate waveform from the dictionary 9. When the desired waveform does not exist, the waveform closest to the selection conditions is modified by a waveform deformation processing part 10 to suit the intended use; when no suitable waveform exists at all, a new waveform is generated by a waveform generating part 11. The waveforms from the selecting part 8, the processing part 10, and the generating part 11 are connected by a waveform connecting part 12.

Description

[Detailed Description of the Invention]

"Industrial Field of Application"

This invention relates to a speech synthesis device that takes text as input and outputs arbitrary speech corresponding to that text, and in particular to a rule-based synthesis device that synthesizes speech mainly from a phoneme sequence and prosody information.

"Prior Art"

Conventional rule-based synthesis devices that output arbitrary speech have mostly used the LPC (Linear Predictive Coding) method for synthesis, with concatenation units such as CV, VCV, or CVC chosen to reflect phoneme correspondence and coarticulation, while prosodic information such as the fundamental-frequency pattern is generated independently of the phonemic information, from the accent type, the number of morae in each breath group, and so on. With these methods, however, the filter is necessarily driven at synthesis time by an excitation whose fundamental frequency differs from that at analysis time. The resulting mismatch between the vocal-tract spectrum represented by the LPC parameters and the excitation spectrum produces abnormal amplitudes and a reduction in spectral Q, degrading the quality of the synthesized speech. Although LPC is an analysis-synthesis method that assumes the vocal-tract and excitation parameters to be independent, in reality the two are subtly interrelated rather than independent; the degradation this causes can be regarded as a fundamental problem of applying LPC analysis-synthesis to rule-based synthesis.

Another approach describes the features of speech as formants and obtains rule-synthesized speech by specifying formant trajectories; however, automatic formant extraction is difficult and formant transitions cannot yet be described adequately, so at present the quality is inferior to that of LPC-based methods.

On the other hand, several methods have been proposed that avoid these problems by working directly with the original waveforms, which retain high clarity.

All of them, however, prepare at most a few waveforms per phoneme or syllable and adjust fundamental frequency and duration by truncating, repeating, or thinning out those waveforms. Fine control of the synthesized speech is therefore impossible, and the output suffers from sounding like short stretches of speech spliced together, or like a mechanical, buzzer-like tone.

The object of this invention is to provide a speech synthesis device capable of outputting synthesized speech that is both natural and clear in the rule-based synthesis required for text-to-speech conversion.

"Means for Solving the Problem"

According to this invention, the input text is analyzed by a text analysis unit, and prosody information for speech synthesis is generated by a prosody generation unit from the output of the text analysis unit. A waveform dictionary is provided that stores, for each unit appropriate for assembling the output speech (such as a phoneme), a large quantity of waveform information: the original waveform, the phonemic context in which it was uttered, the shape of its fundamental-frequency pattern, duration information, amplitude information, and so on. Using the information from the text analysis unit and the prosody generation unit, a waveform selection unit selects an appropriate waveform from the dictionary. If the desired waveform does not exist, the waveform closest to the selection conditions is modified by a waveform deformation unit to suit the intended use; if no suitable waveform exists at all, a new waveform is produced by a waveform generation unit. The waveforms from the selection, deformation, and generation units are then joined by a waveform connection unit.

Thus, according to this invention, a large number of waveforms are accumulated in advance as a dictionary and the output speech is synthesized by selecting and connecting the waveforms best suited to the input text, so speech that is both highly intelligible and natural is obtained.

"Embodiment"

FIG. 1 is a block diagram showing one embodiment of this invention.

When text to be converted into speech is input at terminal 1, the text analysis unit 2 performs morphological analysis (dependency and part-of-speech analysis), kanji-to-kana conversion, and accent processing, and sends the necessary information to the phoneme-sequence buffer 7 and the prosody generation unit 3: for buffer 7, a symbol string identifying the phonemes; for unit 3, the number of morae per breath group, the accent type, the speaking rate, and so on. From this information the prosody generation unit 3 generates, by rule, a pitch pattern, a duration pattern for each phoneme, and an amplitude pattern, and writes them into buffers 4, 5, and 6 respectively.

The waveform selection unit 8 consults the phoneme-sequence buffer 7, the pitch-pattern buffer 4, the duration buffer 5, and the amplitude buffer 6, and selects the optimal waveform from the waveform dictionary 9. As one example, the dictionary 9 is organized as shown in FIG. 2, storing each waveform together with various information about its utterance: the phoneme type; the phonemic context, about seven phonemes before and after; the average pitch within the phoneme; the slope of a first-order straight-line fit describing the pitch shape; the phoneme duration; duration-adjustment information giving the start and end points of a few pitch periods at the center of the waveform; the RMS value (amplitude) of the normalized phoneme waveform; and the waveform data itself. The dictionary 9 is built in advance by off-line processing from a large body of recorded speech. For example, roughly several hours of words and sentences uttered by one male announcer are A/D-converted at 12 kHz and phonemically labeled by inspecting digital spectrograms. Each entry can then be created by displaying the waveform 20-30 ms before and after a labeled phoneme boundary and cutting it out with a cursor. As a rule, the cut point is placed at a negative-to-positive zero crossing of the waveform, with further per-phoneme rules fixed in advance (for example, cutting at the zero crossing just before a positive peak). This avoids discontinuities at the joins and yields a smooth connected waveform. If, in addition, the speech data are analyzed by LPC or the like to extract pitch, and phonemes that are similar in pitch shape, duration, and so on are merged, the number of excised waveforms can be reduced and the dictionary built efficiently.
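The dictionary entry of FIG. 2 and the zero-crossing cut rule can be illustrated as below. The field names and the `DictEntry`/`cut_point` helpers are hypothetical; the patent specifies the stored information but not a data layout.

```python
from dataclasses import dataclass
from typing import List, Optional

# One waveform-dictionary entry holding the fields listed above (FIG. 2).
# Field names are illustrative assumptions, not the patent's layout.
@dataclass
class DictEntry:
    phoneme: str            # phoneme type
    context: str            # surrounding phonemes, about 7 on each side
    avg_pitch: float        # average pitch within the phoneme (Hz)
    pitch_slope: float      # slope of a first-order fit to the pitch contour
    duration_ms: float      # phoneme duration
    pitch_marks: List[int]  # start/end of a few pitch periods near the center
    rms: float              # RMS of the normalized phoneme waveform
    samples: List[float]    # the waveform data itself

def cut_point(samples: List[float]) -> Optional[int]:
    """Index of the first negative-to-positive zero crossing, the cut
    position the text recommends so that joins stay smooth."""
    for i in range(1, len(samples)):
        if samples[i - 1] < 0 <= samples[i]:
            return i
    return None
```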

The operation of the waveform selection unit 8 in more detail is, as one example, as shown in FIG. 3. First, a search phoneme sequence is set: the target phoneme is placed at the center, and the input phoneme sequence is windowed to the number of context phonemes held in the dictionary. If no waveform candidate is found when the dictionary 9 is searched, the search sequence is shortened from both ends and the search repeated. If no candidate is found even when the search sequence has been reduced to the target phoneme alone, the waveform generation unit 11 generates a waveform with the desired pitch. Next, a pitch condition for the phoneme to be selected is set, since the pitch pattern is considered to have the greatest influence on the naturalness of the synthesized speech; it is determined from the average pitch and pitch shape by consulting the pitch-pattern buffer 4. The tolerance should be fixed experimentally, but naturalness is thought to be preserved if the pitch is within roughly 5% of the desired value. If waveform candidates are found, they are further filtered by a duration condition, set from the duration in buffer 5 together with an experimentally determined tolerance, as for pitch. If no candidate satisfies the duration condition, the candidate closest to it is selected and the waveform deformation unit 10 applies duration adjustment. If candidates remain, selection by the amplitude condition follows.

Here too, as for duration, the condition is set from the amplitude in buffer 6 and a tolerance. If no candidate satisfies it, the closest candidate is selected, as for duration, and the waveform deformation unit 10 applies amplitude adjustment. The waveforms thus selected by unit 8 according to the phonemic context and prosodic conditions, those produced by the generation unit 11, and those adjusted by the deformation unit 10 are sent to the waveform connection unit 12, joined in sequence, and output as a speech waveform at output terminal 13.

The waveform generation unit 11 generates a waveform of arbitrary pitch using, for example, LPC techniques: LPC parameters representing the spectrum of each phoneme are stored, and a waveform is generated by driving the filter with pulses or a residual signal at the specified pitch. Although the use of LPC here runs counter to the aim of the invention, unit 11 is only a rescue measure for the case in which no waveform exists at all, and is expected to be used rarely.
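The fallback generator amounts to exciting an all-pole (LPC) filter with a pulse train at the requested pitch period. A pure-Python sketch follows; the coefficient in the example is an arbitrary illustrative value, not real phoneme LPC data.

```python
# Minimal all-pole (LPC) synthesis: y[n] = gain*x[n] - sum_k a[k]*y[n-1-k],
# where x is a pulse train at the requested pitch period. The coefficients
# passed in are assumed to come from the stored per-phoneme LPC analysis.

def lpc_synthesize(a, pitch_period, n_samples, gain=1.0):
    y = []
    for n in range(n_samples):
        x = gain if n % pitch_period == 0 else 0.0  # pitch-synchronous pulse
        acc = x
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * y[n - 1 - k]
        y.append(acc)
    return y
```

With a single coefficient of -0.5 the recursion is y[n] = x[n] + 0.5*y[n-1], a simple decaying impulse response restarted at each pitch pulse.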

The waveform deformation unit 10 performs the duration adjustment and the amplitude adjustment; these are described below.

The duration adjustment differs according to whether the phoneme is unvoiced or voiced. For an unvoiced plosive it is handled by stretching or shrinking the silent interval; for a fricative, the waveform is cut or reused repeatedly, working outward from the center, until the desired duration is reached. For a voiced sound, the positions of about three pitch periods at the center of the waveform are stored in the dictionary; if the waveform is too long these periods are thinned out, and if it is too short they are used repeatedly.
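The voiced-sound case above can be sketched as repeating or dropping whole pitch periods near the waveform center. Representing the waveform as a list of pitch-period segments, and the function name, are illustrative assumptions.

```python
# Voiced duration adjustment: repeat center pitch periods to lengthen,
# thin them out to shorten. 'periods' stands in for the few pitch periods
# whose boundaries the dictionary stores near the waveform center.

def adjust_duration(periods, target_n):
    out = list(periods)
    mid = len(out) // 2
    while len(out) < target_n:       # too short: repeat a center period
        out.insert(mid, out[mid])
    while len(out) > target_n:       # too long: thin out a center period
        out.pop(len(out) // 2)
    return out
```

Working at the center rather than the edges preserves the onset and offset transitions, which carry most of the coarticulation information.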

The amplitude adjustment looks up the amplitude value determined for each phoneme in the amplitude buffer and linearly scales the selected or generated waveform by the ratio of that value to the waveform's RMS value.
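This linear scaling by RMS ratio is a one-liner; the sketch below assumes the target RMS has already been read from the amplitude buffer.

```python
import math

# Amplitude adjustment: scale the waveform so its RMS matches the target
# value taken from the amplitude buffer.

def adjust_amplitude(samples, target_rms):
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    gain = target_rms / rms
    return [s * gain for s in samples]
```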

"Effects of the Invention"

As described above, according to this invention a large number of waveforms are accumulated as a dictionary and the output speech is synthesized by selecting and connecting the waveforms best suited to the input text, so speech that is both highly intelligible and natural can be provided.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the speech synthesis device according to this invention, FIG. 2 is a diagram showing an example of the organization of the waveform dictionary 9, and FIG. 3 is a flowchart showing the method of selecting the most appropriate waveform from the waveform dictionary.

Patent applicant: Nippon Telegraph and Telephone Corporation. Representative: Takashi Kusano

Claims (1)

[Claims]

(1) A speech synthesis device that outputs speech according to an input text, comprising: a text analysis unit that analyzes the input text; a prosody generation unit that generates prosody information for speech synthesis from the output of the text analysis unit; a waveform dictionary that stores, for each unit appropriate for assembling the output speech (such as a phoneme), a large quantity of waveform information describing the original waveform, the phonemic context in which it was uttered, the shape of its fundamental-frequency pattern, duration information, amplitude information, and so on; a waveform selection unit that selects an appropriate waveform from the waveform dictionary using the information from the text analysis unit and the prosody generation unit; a waveform deformation unit that, when the desired waveform does not exist, modifies the waveform closest to the selection conditions to suit the intended use; a waveform generation unit that generates a new waveform when no suitable waveform exists at all; and a waveform connection unit that joins the waveforms from the waveform selection unit, the waveform deformation unit, and the waveform generation unit.
JP63115721A 1988-05-11 1988-05-11 Voice synthesis method Expired - Lifetime JP2761552B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP63115721A JP2761552B2 (en) 1988-05-11 1988-05-11 Voice synthesis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP63115721A JP2761552B2 (en) 1988-05-11 1988-05-11 Voice synthesis method

Publications (2)

Publication Number Publication Date
JPH01284898A true JPH01284898A (en) 1989-11-16
JP2761552B2 JP2761552B2 (en) 1998-06-04

Family

ID=14669490

Family Applications (1)

Application Number Title Priority Date Filing Date
JP63115721A Expired - Lifetime JP2761552B2 (en) 1988-05-11 1988-05-11 Voice synthesis method

Country Status (1)

Country Link
JP (1) JP2761552B2 (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5950152A (en) * 1996-09-20 1999-09-07 Matsushita Electric Industrial Co., Ltd. Method of changing a pitch of a VCV phoneme-chain waveform and apparatus of synthesizing a sound from a series of VCV phoneme-chain waveforms
US6035272A (en) * 1996-07-25 2000-03-07 Matsushita Electric Industrial Co., Ltd. Method and apparatus for synthesizing speech
US6125346A (en) * 1996-12-10 2000-09-26 Matsushita Electric Industrial Co., Ltd Speech synthesizing system and redundancy-reduced waveform database therefor
WO2004109660A1 (en) * 2003-06-04 2004-12-16 Kabushiki Kaisha Kenwood Device, method, and program for selecting voice data
WO2004109659A1 (en) * 2003-06-05 2004-12-16 Kabushiki Kaisha Kenwood Speech synthesis device, speech synthesis method, and program
JP2006145848A (en) * 2004-11-19 2006-06-08 Kenwood Corp Speech synthesizer, speech segment storage device, apparatus for manufacturing speech segment storage device, method for speech synthesis, method for manufacturing speech segment storage device, and program
JP2006195207A (en) * 2005-01-14 2006-07-27 Kenwood Corp Device and method for synthesizing voice, and program therefor
WO2006129814A1 (en) * 2005-05-31 2006-12-07 Canon Kabushiki Kaisha Speech synthesis method and apparatus
JP2008139631A (en) * 2006-12-04 2008-06-19 Nippon Telegr & Teleph Corp <Ntt> Voice synthesis method, device and program

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS56153400A (en) * 1980-04-30 1981-11-27 Nippon Telegraph & Telephone Voice responding device
JPS6295595A (en) * 1985-10-23 1987-05-02 株式会社日立製作所 Voice response system
JPS62296198A (en) * 1986-06-16 1987-12-23 日本電気株式会社 Voice synthesization system

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6035272A (en) * 1996-07-25 2000-03-07 Matsushita Electric Industrial Co., Ltd. Method and apparatus for synthesizing speech
US5950152A (en) * 1996-09-20 1999-09-07 Matsushita Electric Industrial Co., Ltd. Method of changing a pitch of a VCV phoneme-chain waveform and apparatus of synthesizing a sound from a series of VCV phoneme-chain waveforms
US6125346A (en) * 1996-12-10 2000-09-26 Matsushita Electric Industrial Co., Ltd Speech synthesizing system and redundancy-reduced waveform database therefor
WO2004109660A1 (en) * 2003-06-04 2004-12-16 Kabushiki Kaisha Kenwood Device, method, and program for selecting voice data
WO2004109659A1 (en) * 2003-06-05 2004-12-16 Kabushiki Kaisha Kenwood Speech synthesis device, speech synthesis method, and program
US8214216B2 (en) 2003-06-05 2012-07-03 Kabushiki Kaisha Kenwood Speech synthesis for synthesizing missing parts
JP2006145848A (en) * 2004-11-19 2006-06-08 Kenwood Corp Speech synthesizer, speech segment storage device, apparatus for manufacturing speech segment storage device, method for speech synthesis, method for manufacturing speech segment storage device, and program
JP2006195207A (en) * 2005-01-14 2006-07-27 Kenwood Corp Device and method for synthesizing voice, and program therefor
WO2006129814A1 (en) * 2005-05-31 2006-12-07 Canon Kabushiki Kaisha Speech synthesis method and apparatus
JP2008139631A (en) * 2006-12-04 2008-06-19 Nippon Telegr & Teleph Corp <Ntt> Voice synthesis method, device and program

Also Published As

Publication number Publication date
JP2761552B2 (en) 1998-06-04

Similar Documents

Publication Publication Date Title
US7565291B2 (en) Synthesis-based pre-selection of suitable units for concatenative speech
US8224645B2 (en) Method and system for preselection of suitable units for concatenative speech
US6470316B1 (en) Speech synthesis apparatus having prosody generator with user-set speech-rate- or adjusted phoneme-duration-dependent selective vowel devoicing
US8195464B2 (en) Speech processing apparatus and program
US20040030555A1 (en) System and method for concatenating acoustic contours for speech synthesis
JPH031200A (en) Regulation type voice synthesizing device
Inanoglu et al. A system for transforming the emotion in speech: combining data-driven conversion techniques for prosody and voice quality.
JPH01284898A (en) Voice synthesizing device
Thomas et al. Natural sounding TTS based on syllable-like units
Kayte et al. A Corpus-Based Concatenative Speech Synthesis System for Marathi
US6829577B1 (en) Generating non-stationary additive noise for addition to synthesized speech
JPH08335096A (en) Text voice synthesizer
Furtado et al. Synthesis of unlimited speech in Indian languages using formant-based rules
JPH0580791A (en) Device and method for speech rule synthesis
JP3081300B2 (en) Residual driven speech synthesizer
Niimi et al. Synthesis of emotional speech using prosodically balanced VCV segments
Ng Survey of data-driven approaches to Speech Synthesis
EP1640968A1 (en) Method and device for speech synthesis
Vine et al. Synthesising emotional speech by concatenating multiple pitch recorded speech units
JPH09292897A (en) Voice synthesizing device
Juergen Text-to-Speech (TTS) Synthesis
JPH1097268A (en) Speech synthesizing device
JPH09146576A (en) Synthesizer for meter based on artificial neuronetwork of text to voice
JPH08160990A (en) Speech synthesizing device
JPH06138894A (en) Device and method for voice synthesis

Legal Events

Date Code Title Description
FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20090327

Year of fee payment: 11

EXPY Cancellation because of completion of term