JPH01120600A

JPH01120600A - Voice rule synthesization

Info

Publication number: JPH01120600A
Application number: JP62277407A
Authority: JP
Inventors: Shoichi Takeda; 武田　昌一; Hiroshi Ichikawa; 市川　熹; Shunichi Yajima; 矢島　俊一
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1987-11-04
Filing date: 1987-11-04
Publication date: 1989-05-12

Abstract

PURPOSE: To prevent degradation of the sound quality of nasal sounds and obtain highly intelligible synthesized speech by connecting CV (consonant-vowel chain) type synthesis basic units with a specific method whenever a specific nasal sound is detected. CONSTITUTION: A syllable code is input, and it is determined whether the consonant C is a specific nasal sound such as m, n, my, or ny. If it is, and the duration of the nasal sound is longer than a predetermined set duration, the values of the CV type synthesis basic unit are truncated to the set duration; if it is shorter, a value representative of the unit is filled into the gap in front so that the duration of the nasal sound equals the set duration. Likewise, if the duration of the vowel part immediately preceding the nasal sound is longer than its predetermined set duration, the values of that vowel part are truncated to the set duration; if it is shorter, a value representative of the vowel part of the unit is filled into the gap behind so that the duration equals the set duration. Degradation of the sound quality of nasal sounds is thereby prevented.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Industrial Field of Application]

The present invention relates to a rule-based method for synthesizing speech from text, and in particular to improving the sound quality of rule-synthesized speech.

[Conventional technology]

Synthesizing the speech that corresponds to the text of an arbitrary sentence or word is called "speech synthesis by rule" or simply "rule synthesis."

In rule-synthesized speech, features such as the connections between phonemes, durations, and changes in pitch (voice height) are generally imposed from outside by rules, so they differ from those of natural speech. Rule-synthesized speech therefore has poorer sound quality than so-called "analysis-synthesis" speech, which preserves these features of natural speech as they are. One cause of this degradation is reduced phoneme clarity. Possible causes of reduced phoneme clarity include the method used to create (analyze) the synthesis units and the unit connection processing.

Of these, the present invention addresses the sound quality degradation that accompanies unit connection processing.

Conventionally, CV units have been connected mainly by linear interpolation of parameters, as in "Speech synthesis using CV syllables as units" (a paper co-authored by Sagisaka, Proceedings of the Acoustical Society of Japan Meeting, 3-4-3, 1980). This linear interpolation smooths the parameter changes between one syllable and the next, and it was considered an important step for obtaining synthesized speech with smooth connections.
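The conventional connection described above can be pictured with a minimal sketch. It assumes a single scalar parameter track per unit (for example one PARCOR coefficient per frame) and a gap of a known number of frames between two units; the function names and the scalar representation are illustrative assumptions, not taken from the patent.

```python
def interpolate_gap(prev_last, next_first, gap_frames):
    """Return gap_frames values ramping linearly from prev_last toward next_first."""
    if gap_frames <= 0:
        return []
    step = (next_first - prev_last) / (gap_frames + 1)
    return [prev_last + step * (i + 1) for i in range(gap_frames)]


def connect_linear(prev_unit, next_unit, gap_frames):
    """Join two parameter tracks, bridging the gap between them by linear interpolation."""
    bridge = interpolate_gap(prev_unit[-1], next_unit[0], gap_frames)
    return list(prev_unit) + bridge + list(next_unit)
```

The bridge values rise evenly between the last frame of the preceding unit and the first frame of the following one, which is exactly the smoothing the next section identifies as harmful for nasal consonants.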

[Problem that the invention seeks to solve]

This is indeed true for many consonants. For nasal consonants such as /m/, /n/, /my/, and /ny/, however, smoothing destroys the "crispness" that characterizes these sounds, and intelligibility drops.

An object of the present invention is to provide a unit connection method that prevents this degradation of nasal sound quality, and thereby to obtain highly intelligible synthesized speech.

[Means for solving problems]

A unit connection method that is effective in solving the above problem is described below. Which parameters make up a unit depends on the synthesis method.

For example, with the PARCOR synthesis method the parameters are PARCOR coefficients of a given order and the source amplitude, whereas with a waveform synthesis method the waveform amplitude values themselves are the parameters. The present invention provides a general technique applicable to any synthesis method, but to make the explanation concrete, a synthesis method based on residual compression (Japanese Patent Application No. Sho 58-42169) is used here as the example. That is, the method uses PARCOR coefficients as the spectral envelope parameters and representative residuals as the source parameters (the residual for one pitch period for voiced sounds, the residual for one frame for unvoiced sounds).

FIG. 1 shows the principle of the unit connection method of the present invention. Part (a) of the figure shows the flow of the unit connection processing.

That is, from the input syllable information it is determined whether the target syllable belongs to the "ma", "na", "mya", or "nya" series. If it belongs to any of these, the nasal connection processing explained in part (b) of the figure is applied; conversely, if it belongs to none of them, conventional linear interpolation is applied.
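The branch in FIG. 1(a) can be sketched as follows. This is only an illustration: it assumes syllables are given in romanized form (for example "ma", "nya", "ka"), whereas the patent works on numeric syllable codes whose encoding is not given here. The names `onset` and `choose_connection` are hypothetical.

```python
VOWELS = "aiueo"
SPECIFIC_NASALS = {"m", "n", "my", "ny"}  # the consonants named in the text


def onset(syllable):
    """Consonant part of a romanized CV syllable, e.g. 'mya' -> 'my', 'a' -> ''."""
    return syllable.rstrip(VOWELS)


def choose_connection(syllable):
    """Pick 'nasal' connection for the specific nasals, 'linear' interpolation otherwise."""
    return "nasal" if onset(syllable) in SPECIFIC_NASALS else "linear"
```

For instance, `choose_connection("mi")` selects the nasal processing, while `choose_connection("ka")` falls back to linear interpolation.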

The nasal connection processing is described next. FIG. 1(b) shows the connection principle for one parameter. This parameter is either the i-th order PARCOR coefficient (i = 1, 2, ..., p, where p is the order of linear prediction) or the representative residual. In the former case the processing is performed per connection frame; in the latter case it is performed per corresponding sample point within the frame. The figure takes as its example the case in which the vowel V(k-1) of the preceding CV unit is shorter than its set duration and the consonant (nasal) C(k) of the current CV unit is longer than its set duration. In this case, the part of C(k) that exceeds the set duration is discarded, and the gap by which V(k-1) falls short of its set duration is filled with the parameter value of the last frame of V(k-1). Once these truncation and filling operations have been combined so that no gap remains between C(k-1)V(k-1) and C(k)V(k), the nasal connection processing is complete. Even when the relation between the CV unit lengths and the set durations differs from this example, the desired nasal connection can be obtained in the same way: truncate the unit where it is longer than the set duration, and where it is shorter, fill the gap with the unit's end value (the first value for the following unit, the last value for the preceding unit).
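A minimal sketch of the truncate-or-fill rule, assuming each segment is a non-empty list of per-frame values for one parameter and durations are counted in frames. The text does not say which end is discarded when a segment is too long; the sketch assumes the tail of the preceding vowel and the head of the nasal are cut, so that the frames next to each unit's own CV boundary survive. All function names are illustrative.

```python
def fit_preceding_vowel(values, set_len):
    """Fit the preceding unit's vowel part to set_len frames.

    Too long: cut the tail (assumption). Too short: fill the rear gap
    with the last frame's value, as the text describes.
    """
    if len(values) >= set_len:
        return list(values[:set_len])
    return list(values) + [values[-1]] * (set_len - len(values))


def fit_nasal(values, set_len):
    """Fit the nasal consonant to set_len frames.

    Too long: cut the head (assumption). Too short: fill the front gap
    with the first frame's value, as the text describes.
    """
    if len(values) >= set_len:
        return list(values[len(values) - set_len:])
    return [values[0]] * (set_len - len(values)) + list(values)


def connect_nasal(prev_vowel, nasal, vowel_len, nasal_len):
    """Join vowel and nasal with no interpolation, leaving no gap between them."""
    return fit_preceding_vowel(prev_vowel, vowel_len) + fit_nasal(nasal, nasal_len)
```

With a two-frame vowel stretched to four frames and a four-frame nasal cut to three, `connect_nasal([0.5, 0.6], [0.1, 0.2, 0.3, 0.4], 4, 3)` holds the vowel's last value and then jumps straight to the nasal, preserving the abrupt transition instead of smoothing it.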

In the above example the amplitude also becomes discontinuous. If the resulting discontinuity in the perceived sound is to be avoided, linear interpolation can be applied to the source information alone (the representative residual in this example). If linear interpolation is too smooth, the amplitude can instead be weighted, after the nasal processing above, with a curve of suitable envelope shape. Such weighting can be realized by multiplying the source waveform amplitude values by a weight that is a predetermined envelope-shaped function of time.
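The amplitude weighting amounts to a sample-by-sample product of the source amplitudes and a weight curve. The sketch below assumes plain Python lists; the patent does not specify the envelope shape, so `rising_cosine` is a purely hypothetical example of one.

```python
import math


def rising_cosine(n):
    """Hypothetical envelope: n weights rising smoothly from 0.0 to 1.0."""
    if n == 1:
        return [1.0]
    return [0.5 * (1.0 - math.cos(math.pi * i / (n - 1))) for i in range(n)]


def apply_envelope(amplitudes, weights):
    """Multiply each source amplitude by the envelope weight at the same time index."""
    if len(amplitudes) != len(weights):
        raise ValueError("amplitudes and weights must have the same length")
    return [a * w for a, w in zip(amplitudes, weights)]
```

Any other predetermined curve can be passed as `weights`; only the multiplication itself comes from the text.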

[Effect]

With the above unit connection method, nasal sounds such as /m/, /n/, /my/, and /ny/ keep their characteristics, while conventional linear interpolation can still be used for other consonants. Natural synthesized speech that reflects the character of each phoneme can therefore be obtained without losing the smoothness of the connections.

[Example]

An embodiment of the present invention is described below with reference to FIG. 2.

FIG. 2 shows the overall configuration of the arbitrary-text synthesis system.

In this system, when text written in mixed kanji and kana is given as input data, the corresponding synthesized speech is obtained as output. The processing procedure is as follows.

First, the morphological analysis means of the Japanese analysis unit 1 breaks the input text down into words, determines the part of speech of each word, and then determines its reading. Next, based on this result, the spoken-language processing unit 2 (see Japanese Examined Patent Publication No. Sho 59-13040, Japanese Patent Application No. Sho 57-190861, and Japanese Unexamined Patent Publication No. Sho 59-126841) determines the accent type of each word or phrase. Syllable information, accent information, and so on are obtained as the result of this syntax-level processing. Phrase and sentence boundaries are determined from delimiters such as punctuation marks in the input text. The length of pauses within and between sentences can be specified by the number of spaces after a comma or period.

The type of a sentence, such as interrogative, imperative, or desiderative, can in some cases be determined from the inflection of the sentence ending. It can also be specified by ending the sentence with the terminator "?", "!!", or "!", respectively, instead of a period. For example, the same phoneme sequence "kawa o wataru" (cross the river) is a declarative sentence when written with a period and an interrogative sentence when written with "?". The above (1) syllable information, (2) accent information, (3) pause information, (4) phrase and sentence boundary information, and (5) grammatical information (for example part of speech, if needed) are expressed as a sequence of numbers called "syllable codes." The syllable codes are the input to the control parameter generation unit 3.
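The terminator convention can be sketched as a small lookup. The mapping of "?", "!!", and "!" to interrogative, imperative, and desiderative follows the text; accepting both ASCII and full-width marks, returning "unknown" for unmarked input, and the function names are assumptions made for the illustration. How a space count maps to an actual pause length is not stated, so the sketch only counts the spaces.

```python
# Checked in order, so that "!!" is recognized before "!".
SENTENCE_TYPES = [
    ("!!", "imperative"),
    ("!", "desiderative"),
    ("?", "interrogative"),
    (".", "declarative"),
]
FULLWIDTH = str.maketrans({"！": "!", "？": "?", "。": "."})


def sentence_type(sentence):
    """Classify a sentence by its final terminator mark."""
    text = sentence.rstrip().translate(FULLWIDTH)
    for mark, kind in SENTENCE_TYPES:
        if text.endswith(mark):
            return kind
    return "unknown"


def trailing_spaces(sentence):
    """Number of spaces after the final mark, which the text uses to specify pause length."""
    return len(sentence) - len(sentence.rstrip(" "))
```

Here `sentence_type("川を渡る？")` yields the interrogative type, matching the example in the description.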

In the control parameter generation unit 3, the accent, intonation, and phoneme durations are determined by rule, and the pitch pattern and the time series of phoneme parameters are generated accordingly. The accent type is known from the accent information. Specifically, the accent information is given by inserting a syllable code number that indicates the accent immediately after the phoneme carrying the accent nucleus (the syllable immediately before the accent falls). If this syllable code is absent, the accent is of the flat type. Intonation is basically determined by the sentence type information, but it is also modified according to the phoneme sequence at the end of the sentence. For example, the desiderative sentences "Kawa o wataritai!" ("I want to cross the river!") and "Kawa o wataritai naa!" ("I'd really like to cross the river!") have different intonation patterns.
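The accent-marker convention can be illustrated as follows. The actual code numbers are not given in the text, so `ACCENT_MARK = 0` and the sample syllable codes are hypothetical placeholders; only the rule "the marker follows the nucleus, and no marker means a flat accent" comes from the description.

```python
ACCENT_MARK = 0  # hypothetical code number standing for the accent marker


def split_accent(codes):
    """Separate syllable codes from the accent marker.

    Returns (syllables, nucleus_index), where nucleus_index is the position of
    the syllable right before the marker, or None for a flat accent.
    """
    syllables = []
    nucleus = None
    for code in codes:
        if code == ACCENT_MARK:
            if syllables:
                nucleus = len(syllables) - 1
        else:
            syllables.append(code)
    return syllables, nucleus
```

A sequence such as `[11, 12, 0, 13]` is read as three syllables with the accent nucleus on the second one, while `[11, 12]` has no marker and is therefore flat.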

The final pitch pattern is generated from both the accent type and the intonation. Consonant durations are little affected by the surrounding context, so each is set to a length inherent to the type of consonant.

Vowels, by contrast, are modified in various ways by the surrounding context. Their durations are therefore determined from the accent type, the number of syllables, the position within the word, the type of the immediately preceding consonant, the type of the vowel itself, and so on (see Japanese Patent Application No. Sho 57-190861). Once the phoneme durations have been determined in this way, the phoneme parameters (spectral envelope parameters and source parameters) registered in a file in CV units are extracted according to the syllable codes and arranged in sequence. At this point, any unit that is too long is cut so that it fits within its duration. The CV units are then connected so as to fill the cut portions or gaps.

The connection method is the one described under [Means for solving problems] in this specification. Finally, the fundamental frequency and phoneme parameters generated by the above processing are sent in sequence to the speech synthesis unit 4, which outputs the speech waveform. The speech synthesis method may be, for example, the residual compression method (see Japanese Patent Application No. Sho 59-5583118 and Japanese Patent Application No. Sho 60-137721).

In this case, the source pulses are basically generated by extracting, for each frame, the residual pulse for one pitch period (the representative residual) and arranging these representative residuals at intervals of the externally supplied pitch period. If the supplied pitch period is shorter than the representative residual, the end of the representative residual is cut off by the difference in length; if it is longer, the interval that the representative residual does not cover is filled with zeros. The amplitude values of the residual pulse train arranged in this way are given linear weighting or the like as appropriate so that the frames connect smoothly.
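The placement of representative residuals at pitch-period intervals can be sketched as below. For brevity the sketch reuses a single representative residual for every pitch period and measures periods in samples; the description instead extracts one representative residual per frame. The function names are illustrative, and the inter-frame weighting step is omitted.

```python
def place_residual(residual, pitch_period):
    """Fit one representative residual to a pitch period given in samples.

    Shorter period: cut off the end of the residual.
    Longer period: fill the uncovered interval with zeros.
    """
    if pitch_period <= len(residual):
        return list(residual[:pitch_period])
    return list(residual) + [0.0] * (pitch_period - len(residual))


def build_excitation(residual, pitch_periods):
    """Concatenate the fitted residuals for a sequence of pitch periods."""
    excitation = []
    for period in pitch_periods:
        excitation.extend(place_residual(residual, period))
    return excitation
```

For a three-sample residual and pitch periods of 2 and 4 samples, the first copy loses its last sample and the second gains one trailing zero.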

All of the above processing can be implemented with known means, except for the CV unit connection method, which is the core of the present invention.

According to this embodiment, highly intelligible synthesized speech can be obtained from text written in mixed kanji and kana.

In particular, the intelligibility of nasal sounds is improved.

[Effect of the invention]

As shown above, the present invention is effective in improving the clarity of nasal sounds and thus in improving the sound quality of rule-synthesized speech.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram showing the basic part of the present invention, and FIG. 2 is a diagram showing an embodiment of the present invention.

3: control parameter generation unit.

Claims

[Claims]

1. A rule-based speech synthesis method characterized in that, as means for connecting CV (consonant-vowel chain) type synthesis basic units (hereinafter "units"), it has means for determining whether C is a specific nasal sound such as /m/, /n/, /my/, or /ny/, and, when the determining means determines that C is a specific nasal sound, performs the following processing: if the duration of the nasal sound is longer than a predetermined set duration, the values of the unit are cut to the set duration, and conversely, if the duration of the nasal sound is shorter than the set duration, a value representative of the unit is filled into the gap in front so that the duration matches the set duration; and if the duration of the vowel part (including the moraic nasal /N/) of the unit immediately preceding the nasal sound is longer than a predetermined set duration, the values of the vowel part of that immediately preceding unit are cut to the set duration, and conversely, if the duration of the vowel part is shorter than the set duration, a value representative of the vowel part of that unit is filled into the gap behind so that the duration matches the set duration.

2. The rule-based speech synthesis method according to claim 1, characterized in that, in the processing performed when the nasal sound is determined, the values of the parameter representing the amplitude of the nasal sound, of the vowel immediately preceding the nasal sound, or of both phonemes are corrected so that they connect continuously at the boundary between the two phonemes.