JPH0363696A

JPH0363696A - Text voice synthesizer

Info

Publication number: JPH0363696A
Application number: JP1200181A
Authority: JP
Inventors: Osamu Kimura; 治木村; Nobuyoshi Umiki; 延佳海木; Jiyungo Kitou; 鬼頭　淳悟
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1989-08-01
Filing date: 1989-08-01
Publication date: 1991-03-19

Abstract

PURPOSE: To generate high-quality synthesized speech with good intelligibility and naturalness by determining the comprehension difficulty of each word in an input character string and controlling both the prosody and the syllable intelligibility of the synthesized speech according to that difficulty. CONSTITUTION: A sentence of mixed kanji (Chinese characters) and kana (Japanese syllabary) received from an input section 10 is divided into words by a word division processing section 21 in a character string analysis section 20 with reference to a Japanese dictionary 14a, and the comprehension difficulty of each word is calculated. When a divided word matches an entry in the dictionary 14a, a processing section 22 performs homograph selection and outputs the result to a prosody processing section 23. Otherwise, a kanji dictionary 14b is consulted, the word is converted into its reading by a processing section 24, given an accent by a processing section 25, and output to the processing section 23. The processing section 23 calculates prosodic parameters using a memory 15 according to the comprehension difficulty and performs prosody control processing. A generation section 26 receives these parameters and, referring to a dictionary 16, controls syllable intelligibility according to the comprehension difficulty; speech is then synthesized by a speech synthesis section 12 and output by an output section 13.

Description

DETAILED DESCRIPTION OF THE INVENTION [Field of Industrial Application] The present invention relates to a text-to-speech synthesizer that synthesizes and outputs speech from an input character string.

[Prior Art] In text-to-speech synthesis, a phoneme sequence and prosodic information are extracted from text in standard Japanese orthography (mixed kanji and kana), and speech is synthesized by generating speech parameters from the extracted information according to predetermined rules.

In such rule-based speech synthesis, securing the clarity and intelligibility of the synthesized speech is essential, and improving its naturalness is also important for raising quality. However, it is difficult to improve clarity, intelligibility, and naturalness at the same time. The usual approach is therefore to first secure a certain degree of clarity across the entire text to be synthesized and then try to improve naturalness. As a result, the syllable intelligibility of the synthesized speech is held constant regardless of the content of the text.

[Problems to Be Solved by the Invention] With the conventional text-to-speech synthesis method described above, even words that are easily understood from the flow of the text are synthesized so as to maintain a certain degree of clarity. Compared with natural speech, they therefore stand out to the ear and sound less natural. Conversely, words that are used infrequently, such as unfamiliar words, proper nouns, or numerals, sometimes cannot be made out at all.

In addition, because speech synthesized by the conventional method is monotonous, the listener must strain to follow it at all times and tires easily when listening for a long time.

A human speaker adapts the manner of speaking to the difficulty of the text, the listener's level of understanding, and so on, and can thereby convey information accurately while keeping the speech natural. Conventional text-to-speech synthesis methods could not generate such speech.

Accordingly, an object of the present invention is to provide a text-to-speech synthesizer capable of generating synthesized speech with high intelligibility and naturalness.

[Means for Solving the Problems] To achieve the above object, the present invention is a text-to-speech synthesizer that parses an input character string to generate speech parameters and synthesizes speech based on the generated speech parameters, characterized by comprising: a character string analysis section that determines the comprehension difficulty level of each word in the character string; a prosody processing section that controls the prosody of the synthesized speech according to the determined comprehension difficulty level of each word; and a speech parameter generation section that controls the syllable intelligibility of the synthesized speech according to the determined comprehension difficulty level of each word.

The character string analysis section preferably has a comprehension difficulty calculation function that determines the comprehension difficulty level from the familiarity of each word, obtained by referring to a Japanese dictionary that also stores the usage frequency of each word, together with the word's part of speech, its number of occurrences, and the interval between its occurrences.

The prosody processing section preferably has a prosody control function that varies the fundamental frequency, power, and duration of the synthesized speech according to the comprehension difficulty level of each word.

The speech parameter generation section preferably has a vowel steady-part control function that changes the ratio between the steady part and the transient parts of a vowel, so that for words that are hard to understand the steady part is lengthened and syllable intelligibility is improved.

The speech parameter generation section preferably has a transient-part control function that changes the allowable amount of temporal change in the speech parameters, so that for words that are hard to understand the allowable amount is increased and syllable intelligibility is improved.

The speech parameter generation section preferably has a devoicing determination function that does not devoice the vowels of words that are hard to understand.

[Embodiments] Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 2 is a block diagram schematically showing the configuration of one embodiment of the present invention.

In the figure, 10 denotes an input section that receives the mixed kanji-kana sentence to be synthesized, 11 a control section, 12 a speech synthesis section that synthesizes and outputs speech according to speech parameters, 13 an output section that outputs the synthesized speech signal from the speech synthesis section 12, 14 a memory for the Japanese dictionary and the kanji dictionary, 15 a prosody control memory, and 16 a memory for the speech data dictionary. The input section 10, the control section 11, the speech synthesis section 12, and the memories 14, 15, and 16 are connected to one another via a bus 17.

The control section 11 consists mainly of a programmed computer and, as described later, generates speech parameters from the input data supplied by the input section 10, using the memories 14, 15, and 16.

FIG. 3 is a block diagram showing the functional configuration of the control section 11 in detail.

The mixed kanji-kana sentence supplied from the input section 10 is applied to the word division processing section 21 in the character string analysis section 20.

In the word division processing section 21, the mixed kanji-kana sentence is divided into words with reference to the Japanese dictionary 14a, using a known technique such as the longest-match method or the minimum-bunsetsu method, which selects words so that the number of bunsetsu (phrases) in the sentence is minimized. The Japanese dictionary 14a stores in advance, for each word, its general usage frequency, part of speech, reading, accent, and so on.
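The longest-match method named above can be sketched as follows. This is only a minimal illustration, not the procedure of the embodiment: the function name `longest_match_split`, the `max_len` limit, and the toy dictionary entries are assumptions made for the example, and the minimum-bunsetsu method is not shown.

```python
def longest_match_split(text, dictionary, max_len=8):
    """Greedy longest-match segmentation against a word dictionary.

    Characters that match no dictionary entry are emitted as
    one-character unknown words (an assumption of this sketch).
    """
    words = []
    pos = 0
    while pos < len(text):
        match = None
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            match = text[pos]  # unknown word of length 1
        words.append(match)
        pos += len(match)
    return words


# Hypothetical toy dictionary: word -> (frequency level 1..5, part of speech)
TOY_DICT = {
    "私": (5, "pronoun"),
    "の": (5, "particle"),
    "名前": (4, "noun"),
    "は": (5, "particle"),
    "です": (5, "auxiliary"),
}

print(longest_match_split("私の名前は", TOY_DICT))
```

A real segmenter would also have to choose among competing segmentations; the greedy loop here simply takes the first longest hit at each position.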

The word division processing section 21 further calculates how easily each divided word is understood, that is, its comprehension difficulty level. As preprocessing for this calculation, the n most recently encountered words are stored.

Storage locations f(n) to f(1) are prepared as an array of n words. The contents are shifted in sequence as follows, and the current word is then stored in f(1).

$$f(n) \leftarrow f(n-1), \quad f(n-1) \leftarrow f(n-2), \quad \ldots, \quad f(2) \leftarrow f(1)$$

Here n is the value of w at which b(w) in Table 1 below becomes b(w) = 0; in the example of Table 1, n = 100.
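The shifting of f(n) to f(1) behaves like a fixed-length history buffer. A minimal sketch using Python's `collections.deque` is shown here; the class name `WordHistory` and its methods are hypothetical and chosen only for illustration.

```python
from collections import deque


class WordHistory:
    """Fixed-length history in which slot 1 holds the newest word.

    Mirrors the storage locations f(1)..f(n): pushing a word shifts
    every stored word one slot back and drops the word in slot n.
    """

    def __init__(self, n):
        # With maxlen set, appendleft discards the rightmost (oldest) item
        # once the deque is full.
        self.slots = deque(maxlen=n)

    def push(self, word):
        self.slots.appendleft(word)

    def position(self, word, start=1):
        """Return the 1-based slot of the newest match at or after `start`."""
        for idx, stored in enumerate(self.slots, start=1):
            if idx >= start and stored == word:
                return idx
        return None


history = WordHistory(3)
for w in ["a", "b", "c", "d"]:
    history.push(w)
print(list(history.slots))  # "a" has been shifted out
```

Searching with `start=2` corresponds to looking only at earlier words, skipping the current word held in slot 1.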

The comprehension difficulty level is calculated according to the program shown in FIG. 4. First, in step S1, the usage frequency of the word is looked up in the Japanese dictionary 14a, which stores the general usage frequency of each word on a five-level scale, and this value is stored as the comprehension difficulty level r. If the word is not in the Japanese dictionary 14a, r is set to the lowest value, r = 1. A larger value of r indicates a more frequent word that is easier to understand.

Next, in step S2, the Japanese dictionary 14a is consulted to check whether the part of speech of the word is a numeral or a proper noun; if so, r is set to r = 1. Alternatively, proper nouns may be handled by registering them in the Japanese dictionary 14a with a slightly lower usage frequency.

In the next step S3, the occurrence word array is searched from f(2) back to f(n) to check whether the word is among the n words that appeared in the past. As noted above, n is the value of w at which b(w) in Table 1 becomes b(w) = 0, which is n = 100 in the example of Table 1. If the word is among the past n words, the process proceeds to step S4; otherwise it proceeds to step S5.

In step S4, the position of the word in the occurrence word array is determined, and w is set from that position number. The value b(w) corresponding to this w is then obtained from Table 1 and added to r.

In step S5, it is determined whether another word follows. If so, the process returns to step S1; if not, the calculation ends.
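Steps S1 to S5 can be sketched as a single loop over the segmented words. The contents of Table 1 are not available in this text, so `b_of_w` below is a purely hypothetical stand-in that merely falls to 0 at w = 100. Taking w as the number of words back to the previous occurrence is also an assumption, as are the dictionary layout and the function names.

```python
N_HISTORY = 100  # n: the w at which b(w) reaches 0 in the assumed table


def b_of_w(w):
    """Hypothetical stand-in for Table 1 (real values are not known)."""
    if w < 10:
        return 2
    if w < N_HISTORY:
        return 1
    return 0


def difficulty_levels(words, dictionary):
    """Return the comprehension difficulty level r for each word.

    dictionary maps word -> (frequency level 1..5, part of speech).
    A larger r means a more frequent, easier word.
    """
    previous = []  # previous[0] is the word just before the current one
    levels = []
    for word in words:
        # S1: dictionary frequency, or the lowest value for unknown words.
        r, pos = dictionary.get(word, (1, None))
        # S2: numerals and proper nouns are treated as hardest.
        if pos in ("numeral", "proper noun"):
            r = 1
        # S3/S4: a recent earlier occurrence makes the word easier.
        if word in previous:
            w = previous.index(word) + 1  # words back (assumption)
            r += b_of_w(w)
        levels.append(r)
        # S5: remember the word and move on to the next one.
        previous.insert(0, word)
        del previous[N_HISTORY - 1:]
    return levels


toy = {"名前": (4, "noun"), "前出": (1, "proper noun")}
print(difficulty_levels(["名前", "前出", "名前", "未知"], toy))
```

The sketch does not cap r at 5 after the addition in step S4, because the text does not say whether such a cap exists.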

Table 1

If a word produced by the word division processing section 21 matches an entry in the Japanese dictionary 14a, the word reading/accent processing section 22 performs homograph selection. This known process distinguishes words that are written with the same characters but pronounced with different readings and accents. The processed word is output to the prosody processing section 23.

Unknown words, which were divided without matching any entry in the Japanese dictionary 14a, are processed by known methods as follows. An unknown word written in kanji is first converted into a reading by the unknown word reading processing section 24, with reference to the kanji dictionary 14b, which stores the reading of each kanji character in advance, and is then output to the unknown word accent processing section 25. An unknown word written in hiragana or katakana is output directly to the unknown word accent processing section 25, with the hiragana or katakana itself used as the reading. The unknown word accent processing section 25 generates an accent from the reading using predetermined rules, and the accented unknown word is output to the prosody processing section 23.

The prosody processing section 23 takes the accent of each word obtained from the word reading/accent processing section 22 or the unknown word accent processing section 25 and, by known methods, sets the accent of each bunsetsu formed when words are chained, sets the phrases, and sets the pauses between breath groups.

The prosody processing section 23 also calculates prosodic parameters and performs prosody control according to the comprehension difficulty level r, using the prosody control memory 15. This processing is described below with reference to the program shown in FIG. 1.

The prosodic parameters F0, Pw, and Tr mentioned here control the average fundamental frequency, average power, and average duration of the entire text, respectively, and are assumed to have already been calculated by a known method.

In step S10, each of the prosodic parameters F0, Pw, and Tr is multiplied by a function S(r); that is, F0 is changed to F0 · S(r), Pw to Pw · S(r), and Tr to Tr · S(r). Here S(r) is a function of the comprehension difficulty level r, for example as shown in Table 2. The functions in Table 2 differ for each kind of prosodic parameter, but all of them either hold the parameter constant or increase it as the word becomes harder to understand.
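Step S10 amounts to a per-parameter table lookup followed by a multiplication. The values of Table 2 are not reproduced in this text, so the numbers in `S_TABLE` below are invented placeholders that only respect the stated property (constant or increasing as r gets smaller). Clamping r to the range 1 to 5 is likewise an assumption.

```python
# Hypothetical stand-in for Table 2: S(r) for each prosodic parameter.
S_TABLE = {
    "f0":       {1: 1.10, 2: 1.05, 3: 1.00, 4: 1.00, 5: 1.00},
    "power":    {1: 1.20, 2: 1.10, 3: 1.00, 4: 1.00, 5: 1.00},
    "duration": {1: 1.30, 2: 1.15, 3: 1.00, 4: 1.00, 5: 1.00},
}


def scale_prosody(f0, pw, tr, r):
    """Apply step S10: multiply F0, Pw and Tr by their S(r) factors."""
    r = min(max(r, 1), 5)  # assumption: out-of-range r is clamped
    return (
        f0 * S_TABLE["f0"][r],
        pw * S_TABLE["power"][r],
        tr * S_TABLE["duration"][r],
    )


# An easy word (r = 5) is left unchanged; a hard word (r = 1) is emphasized.
print(scale_prosody(100.0, 1.0, 1.0, 5))
print(scale_prosody(100.0, 1.0, 1.0, 1))
```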

Table 2

Next, in step S11, the vowel length Iv and the consonant length Ic are calculated from the following equations.

$$I_v = I_{v/c/} \cdot I_c(v, I) \cdot T_r$$

$$I_c = I_{/c/} \cdot I_c(c, I) \cdot T_r$$

Here $I_{v/c/}$ is the basic vowel length for each preceding consonant, $I_{/c/}$ is the intrinsic consonant length, $I_c(v, I)$ and $I_c(c, I)$ are the mora position coefficients for each phoneme, $T_r$ is the basic duration, v is the vowel, c is the consonant, and I is the mora position. Along the time axis, each vowel interval is divided internally into a transient part Iv1, a steady part Iv2, and a transient part Iv3. The steady part Iv2 may be calculated by any known method, but it is corrected in the syllable intelligibility control processing described later.

In the next step S12, the pitch pattern F(t) is calculated as follows.

$$\ln F(t) = \ln F_{min} + A_p \cdot G_p(t - T_0) + A_a \cdot F_0 \cdot \bigl(G_a(t - T_1) - G_a(t - T_2)\bigr)$$

$$G_p(t) = \alpha \cdot t \cdot \exp(-\alpha \cdot t)$$

$$G_a(t) = 1 - (1 + \beta \cdot t) \cdot \exp(-\beta \cdot t)$$

Here $A_p \cdot G_p(t - T_0)$ is the phrase component, $A_a \cdot F_0 \cdot (G_a(t - T_1) - G_a(t - T_2))$ is the accent component, $F_{min}$ is the lower critical value, $A_p$ and $A_a$ are the amplitudes of the phrase and accent components, $T_0$ and $T_1$ are the start command times of the phrase and accent components, $T_2$ is the end command time of the accent component, α and β are the falling coefficients of the phrase and accent components, and t is time.
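The pitch pattern of step S12 can be evaluated numerically as in the sketch below. It assumes that the response functions are zero before their command times (t ≤ 0), which the text does not state explicitly; the function names and all numeric values in the example call are illustrative only.

```python
import math


def g_phrase(t, alpha):
    """Phrase response Gp(t) = alpha * t * exp(-alpha * t), zero for t <= 0."""
    return alpha * t * math.exp(-alpha * t) if t > 0 else 0.0


def g_accent(t, beta):
    """Accent response Ga(t) = 1 - (1 + beta*t) * exp(-beta*t), zero for t <= 0."""
    return 1.0 - (1.0 + beta * t) * math.exp(-beta * t) if t > 0 else 0.0


def pitch_pattern(t, f_min, a_p, a_a, f0, t0, t1, t2, alpha, beta):
    """Evaluate F(t) from ln F(t) = ln Fmin + phrase + accent components."""
    phrase = a_p * g_phrase(t - t0, alpha)
    accent = a_a * f0 * (g_accent(t - t1, beta) - g_accent(t - t2, beta))
    return math.exp(math.log(f_min) + phrase + accent)


# Illustrative values only: one phrase command at 0 s, one accent from 0.2 s to 0.5 s.
for t in (0.0, 0.3, 0.6):
    print(t, pitch_pattern(t, 80.0, 0.5, 0.4, 1.0, 0.0, 0.2, 0.5, 3.0, 20.0))
```

Because F0 multiplies the accent component, scaling F0 by S(r) in step S10 enlarges the accent excursion for hard words. The power pattern of step S13 would reuse the same function with Pw and Pmin substituted.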

In the next step S13, the power pattern P(t) is calculated in the same way as the pitch pattern F(t) above, except that F0 in step S12 is replaced with Pw and Fmin with Pmin (the lower critical value).

With prosody controlled in this way, a word that is hard to understand is given, for example, a longer duration and greater power. As a result, as described later, it becomes easier to hear.

The vowel length Iv, consonant length Ic, pitch pattern F(t), and power pattern P(t) calculated by the prosody processing section 23 are applied to the speech parameter generation section 26. The speech parameter generation section 26 refers to the speech data dictionary 16 of synthesis units to retrieve the synthesis units corresponding to the reading of each word, interpolates and combines them according to the above information from the prosody processing section 23, and finally obtains a time series of speech parameters for speech synthesis.

The speech parameter generation section 26 also controls syllable intelligibility according to the comprehension difficulty level r.

FIG. 5 shows one example of this syllable intelligibility control processing. In this example, the ratio between the steady part and the transient parts of a vowel is changed so that, for words that are hard to understand, the steady part is lengthened and syllable intelligibility is improved.

The prosody processing section 23 calculates in advance the transient part duration Iv1, the steady part duration Iv2, and the transient part duration Iv3 of each vowel, and the program of FIG. 5 then corrects each of these intervals.

First, in step S20, the transient part durations Iv1 and Iv3 are corrected as follows using a function S(r) of the comprehension difficulty level r, for example as shown in Table 2.

$$I_{v1} \leftarrow I_{v1} - I_{v2} \cdot (S(r) - 1) / 2$$

$$I_{v3} \leftarrow I_{v3} - I_{v2} \cdot (S(r) - 1) / 2$$

In the next step S21, the steady part duration Iv2 is likewise corrected as follows.

$$I_{v2} \leftarrow I_{v2} \cdot S(r)$$

Next, in step S22, the speech parameters over the transient part duration Iv1 are linearly interpolated to the nearest target parameter in the preceding sound, and the speech parameters over the transient part duration Iv3 are linearly interpolated to the nearest target parameter in the following sound.
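The corrections of steps S20 and S21 move time from the two transient parts into the steady part while leaving the total vowel length unchanged. A minimal sketch follows; the function name and the sample durations are hypothetical, and the interpolation of step S22 is not shown.

```python
def lengthen_steady_part(iv1, iv2, iv3, s):
    """Apply steps S20 and S21 to the three parts of one vowel.

    iv1, iv3: transient part durations; iv2: steady part duration;
    s: the value S(r) for the word's comprehension difficulty level.
    """
    # S20: each transient part gives up half of the steady part's growth.
    shift = iv2 * (s - 1.0) / 2.0
    new_iv1 = iv1 - shift
    new_iv3 = iv3 - shift
    # S21: the steady part is scaled by S(r).
    new_iv2 = iv2 * s
    return new_iv1, new_iv2, new_iv3


# A 100 ms vowel with S(r) = 1.5: the steady part grows from 40 ms to 60 ms.
parts = lengthen_steady_part(30.0, 40.0, 30.0, 1.5)
print(parts, sum(parts))
```

With S(r) = 1 the vowel is unchanged. The sketch does not guard against a transient part becoming negative when S(r) is large, which a practical implementation would presumably need to handle.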

The speech parameters are generated by the above processing.

FIG. 6 illustrates the effect of the above control processing. It shows the temporal change of the acoustic parameters of a vowel sequence, namely the temporal transition of the first and second formants from the vowel /a/ to the vowel /i/. Compared with the conventional control processing shown in FIG. 6(A), the control processing described above, shown in FIG. 6(B), lengthens the steady part of each vowel and makes the vowels clearer, which improves syllable intelligibility.

FIG. 7 shows another example of syllable intelligibility control processing. In this example, speech parameters are generated while changing the allowable amount of temporal change in the acoustic parameters, so that for words that are hard to understand the allowable amount is increased and syllable intelligibility is improved.

First, in step S30, target parameters are extracted for each phoneme according to the phoneme duration calculated by the prosody processing section 23, with reference to the speech data dictionary 16 of synthesis units.

For the explanation below, assume, for example, that the vowel sequence /ai/ is to be synthesized. Let the center positions of the target parameters of the two vowels be t1 and t2 (each the center of its vowel interval), and let the magnitudes of their i-th parameters be F1(i) and F2(i), where i is the index of the target parameter.

In the next step S31, the allowable amount of temporal change for normal utterance, τ(i), is corrected as follows using a function S(r) of the comprehension difficulty level r, for example as shown in Table 2.

$$\tau(i) \leftarrow \tau(i) \cdot S(r)$$

Next, in step S32, for each phoneme, the position t0 that is the temporal midpoint between the center position t2 of the following target parameter and the center position t1 of this phoneme is calculated from the following equation.

$$t_0 = t_1 + (t_2 - t_1) / 2$$

In the next step S33, the magnitude of the target parameter F0(i) at t0 is calculated from the following equation.

$$F_0(i) = (F_1(i) + F_2(i)) / 2$$

In the next step S34, the temporal change F(t, i) of the target parameter between t1 and t2 is calculated from the following equations, with separate cases according to the relative magnitudes of F1(i) and F2(i).

Here t denotes time.

When F1(i) < F2(i):

$$F(t, i) = \begin{cases}
F_1(i), & t_1 < t < \dfrac{F_1(i) - F_0(i)}{\tau(i)} + t_0 \\
F_0(i) + \tau(i) \cdot (t - t_0), & \dfrac{F_1(i) - F_0(i)}{\tau(i)} + t_0 \le t \le \dfrac{F_2(i) - F_0(i)}{\tau(i)} + t_0 \\
F_2(i), & \dfrac{F_2(i) - F_0(i)}{\tau(i)} + t_0 < t < t_2
\end{cases}$$

When F1(i) > F2(i):

$$F(t, i) = \begin{cases}
F_1(i), & t_1 < t < \dfrac{F_0(i) - F_1(i)}{\tau(i)} + t_0 \\
F_0(i) - \tau(i) \cdot (t - t_0), & \dfrac{F_0(i) - F_1(i)}{\tau(i)} + t_0 \le t \le \dfrac{F_0(i) - F_2(i)}{\tau(i)} + t_0 \\
F_2(i), & \dfrac{F_0(i) - F_2(i)}{\tau(i)} + t_0 < t < t_2
\end{cases}$$

The speech parameters are generated by the above processing.
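The case analysis of step S34 describes a ramp of slope τ(i) through the midpoint (t0, F0(i)), held at F1(i) before the ramp and at F2(i) after it. Under the assumption that τ(i) is positive and large enough for the ramp to fit between t1 and t2, this is equivalent to clamping the ramp to the range spanned by the two targets, as sketched below. The function name and sample values are illustrative.

```python
def transition(t, t1, t2, f1, f2, tau):
    """Target parameter value at time t (t1 < t < t2) per steps S32 to S34.

    f1, f2: i-th target parameter of the current and following phoneme;
    tau: allowable rate of temporal change, already scaled by S(r).
    """
    t0 = t1 + (t2 - t1) / 2.0   # S32: temporal midpoint
    f0 = (f1 + f2) / 2.0        # S33: target value at the midpoint
    if f1 == f2:
        return f1               # no transition needed (case not in the text)
    slope = tau if f1 < f2 else -tau
    ramp = f0 + slope * (t - t0)
    # Holding F1 before the ramp and F2 after it equals clamping the ramp.
    return min(max(ramp, min(f1, f2)), max(f1, f2))


# Falling transition from 700 to 300 between t = 0 and t = 100.
# Doubling tau shortens the ramp from 30..70 to 40..60.
for tau in (10.0, 20.0):
    print(tau, [transition(t, 0.0, 100.0, 700.0, 300.0, tau) for t in (35.0, 50.0, 65.0)])
```

The example shows the effect described for hard words: a larger allowable change yields a shorter transient part and correspondingly longer steady parts on both sides.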

FIG. 8 illustrates the effect of the above control processing and shows the temporal transition of the first and second formants from the vowel /a/ to the vowel /i/. In the conventional control processing shown in FIG. 8(A), the temporal change in the vowel transient part is tightly restricted, so the change is gradual. In the control processing described above, shown in FIG. 8(B), a word that is hard to understand, for example, is allowed a larger temporal change; the transient part becomes correspondingly shorter and the steady part longer, and as a result syllable intelligibility improves.

As a further method of controlling syllable intelligibility according to the comprehension difficulty level r in the speech parameter generation section 26, the following describes devoicing determination processing, in which the vowels of words that are hard to understand are not devoiced.

In general, the high vowels /i/ and /u/, when sandwiched between voiceless consonants and not carrying the accent nucleus, sound more natural when devoiced, so they are often devoiced by rule. In a word that is hard to understand, however, devoicing lowers the intelligibility of that part. Syllable intelligibility can therefore be improved by giving the speech parameter generation section 26 a devoicing determination function that, when the comprehension difficulty level r is, for example, 2 or less, does not devoice the vowel but instead applies quasi-devoicing, such as shortening its duration or reducing its power.

The procedure of this devoicing determination processing is described below with reference to FIG. 9.

First, in step S40, it is determined whether the target vowel is a high vowel (/i/ or /u/). If it is not a high vowel, it is judged to be voiced and the process proceeds to step S44, where no devoicing is performed. If it is a high vowel, the process proceeds to the next step S41.

In step S41, it is determined whether the target vowel is sandwiched between voiceless consonants. If it is not, it is judged to be voiced and the process proceeds to step S44, where no devoicing is performed. If it is, the process proceeds to the next step S42.

In step S42, it is determined whether the target vowel carries the accent nucleus. If it does, the process proceeds to step S46, where quasi-devoicing such as shortening the duration or reducing the power is performed. If it does not, the process proceeds to the next step S43.

In step S43, it is determined whether the comprehension difficulty level r of the target vowel is 2 or less. If it is 2 or less, the process proceeds to step S46, where the quasi-devoicing described above is performed. If it is 3 or more, the vowel is judged to be voiceless and the process proceeds to step S45, where devoicing is performed.
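The decision flow of steps S40 to S43 can be written as a short chain of checks. In this sketch the function name, argument names, and the string labels returned for the three outcomes (steps S44, S45, and S46) are chosen for illustration; the threshold of 2 is the example value given in the text.

```python
HIGH_VOWELS = {"i", "u"}


def devoicing_decision(vowel, between_voiceless, has_accent_nucleus, r,
                       threshold=2):
    """Return 'voiced', 'quasi-devoiced' or 'devoiced' for one vowel.

    r is the comprehension difficulty level of the word containing the
    vowel (small r = hard to understand).
    """
    # S40: only high vowels are candidates for devoicing.
    if vowel not in HIGH_VOWELS:
        return "voiced"              # S44
    # S41: the vowel must sit between voiceless consonants.
    if not between_voiceless:
        return "voiced"              # S44
    # S42: a vowel carrying the accent nucleus is only weakened.
    if has_accent_nucleus:
        return "quasi-devoiced"      # S46
    # S43: hard words are only weakened; easy words are devoiced.
    if r <= threshold:
        return "quasi-devoiced"      # S46
    return "devoiced"                # S45


# The same /u/ between voiceless consonants in a hard word and an easy word.
print(devoicing_decision("u", True, False, 1))
print(devoicing_decision("u", True, False, 4))
```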

Through the above processing in the speech parameter generation section 26, the speech parameters for synthesis are generated and applied to the speech synthesis section 12.

The speech synthesis section 12 synthesizes, by a known method, a signal corresponding to the actual synthesized speech waveform based on the speech parameters and outputs it to the output section 13.

Next, the operation of this embodiment is described for the case of actually synthesizing the sentence 「私の名前は、前出伊津子です。」 ("My name is Maede Itsuko.").

When this sentence is input from the input section 10, the word division processing section 21 divides it as shown in (A) of Table 3, and the Japanese dictionary 14a provides the part of speech and usage frequency of each word as shown in (B) of Table 3. The word division processing section 21 also calculates the comprehension difficulty level r of each divided word as shown in (C) of Table 3.

In the word reading and accent processing unit 22, the readings and accents shown in (D) of Table 3 are assigned.

In the prosody processing unit 23, prosody control raises the fundamental frequency, increases the power, and lengthens the duration of the "Maede Itsuko" portion, which has a low comprehension difficulty level, so that this portion becomes much easier to hear.
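
The effect described here can be illustrated with a small sketch. The scale factors, the range of levels (1 to 4), and the choice to scale all three quantities by the same factor are assumptions made for the example; the passage only states that a low level raises the fundamental frequency, power, and duration, and that easily heard words such as the sentence-final "desu" are reduced. `Prosody`, `SCALE_BY_LEVEL`, and `apply_difficulty` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Prosody:
    f0_hz: float        # fundamental frequency
    power: float        # relative power
    duration_ms: float  # duration


# Hypothetical scale factors per comprehension difficulty level (not from the patent).
# A lower level marks a word that needs more emphasis.
SCALE_BY_LEVEL = {1: 1.2, 2: 1.1, 3: 1.0, 4: 0.9}


def apply_difficulty(base: Prosody, level: int) -> Prosody:
    """Scale a word's baseline prosody according to its difficulty level."""
    # Clamp out-of-range levels to the nearest level in the table.
    level = min(max(level, min(SCALE_BY_LEVEL)), max(SCALE_BY_LEVEL))
    k = SCALE_BY_LEVEL[level]
    return Prosody(base.f0_hz * k, base.power * k, base.duration_ms * k)
```

With these assumed factors, a level-1 word such as a proper noun is made higher, louder, and longer than its baseline, while a level-4 word is reduced.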

In the speech parameter generation unit 26, the syllable intelligibility control processing described above gives the "e" in "Maede" (前出) a higher proportion of vowel steady-state portion than the "e" in "namae" (名前). This is because the "e" in "Maede" has a lower comprehension difficulty level than the "e" in "namae". Furthermore, conventional processing would devoice the "tsu" in "Itsuko" on the basis of the phoneme sequence, whereas in the present invention the control processing described above leaves the "tsu" in "Itsuko", which has a low comprehension difficulty level, without devoicing.

Through the above processing, the "Maede Itsuko" portion is uttered more clearly than the other portions, so a proper noun that would normally be hard to understand becomes easy to hear. Conversely, for portions that are easy to hear anyway, such as the sentence-final "desu", the power and fundamental frequency are reduced. This lessens the burden on the listener's ears and makes it possible to produce synthesized speech that is less fatiguing to listen to than conventional synthesized speech.

The text-to-speech synthesizer of the present invention is not limited to the embodiment described above.

For example, as long as the character string analysis unit determines the comprehension difficulty level of each word in the character string, the rest of its configuration may take any form. Likewise, as long as the prosody processing unit controls the prosody of the synthesized speech according to the comprehension difficulty level obtained for each word, the rest of its configuration may take any form. Similarly, as long as the speech parameter generation unit controls the syllable intelligibility of the synthesized speech according to the comprehension difficulty level obtained for each word, the rest of its configuration may take any form.

Furthermore, the speech parameter generation unit need not include every configuration of the embodiment described above; it is sufficient that it has at least one configuration for controlling the syllable intelligibility of the synthesized speech according to the comprehension difficulty level of each word.

The speech synthesis unit may likewise be of any type, as long as it performs speech synthesis using speech parameters.

[Effects of the Invention] As described in detail above, the text-to-speech synthesizer of the present invention comprises a character string analysis unit that determines the comprehension difficulty level of each word in the input character string, a prosody processing unit that controls the prosody of the synthesized speech according to the comprehension difficulty level obtained for each word, and a speech parameter generation unit that controls the syllable intelligibility of the synthesized speech according to the comprehension difficulty level obtained for each word. It can therefore generate high-quality synthesized speech with high intelligibility and naturalness.

Table 3

[Brief Description of the Drawings]

FIG. 1 is a flowchart of the prosody control processing program in one embodiment of the present invention. FIG. 2 is a block diagram schematically showing the configuration of the embodiment. FIG. 3 is a block diagram showing the functional configuration of the control unit in detail. FIG. 4 is a flowchart of the program for calculating the comprehension difficulty level. FIG. 5 is a flowchart of a program for one example of the syllable intelligibility control processing. FIG. 6 is a diagram explaining the effect of the control processing of FIG. 5. FIG. 7 is a flowchart of a program for another example of the syllable intelligibility control processing. FIG. 8 is a diagram explaining the effect of the control processing of FIG. 7. FIG. 9 is a flowchart of the program for devoicing determination processing.

10: input unit; 11: control unit; 12: speech synthesis unit; 13: output unit; 14: memory for the Japanese dictionary and kanji dictionary; 14a: Japanese dictionary; 14b: kanji dictionary; 15: memory for prosody control; 16: memory for the speech data dictionary; 17: bus; 21: word division processing unit; 22: word reading and accent processing unit; 23: prosody processing unit; 24: unknown-word reading processing unit; 25: unknown-word accent processing unit; 26: speech parameter generation unit.

Claims

[Claims]

A text-to-speech synthesizer that analyzes an input character string to generate speech parameters and synthesizes speech based on the generated speech parameters, the text-to-speech synthesizer comprising: a character string analysis unit that determines the comprehension difficulty level of each word in the character string; a prosody processing unit that controls the prosody of the synthesized speech according to the comprehension difficulty level obtained for each word; and a speech parameter generation unit that controls the syllable intelligibility of the synthesized speech according to the comprehension difficulty level obtained for each word.