JP2904279B2 - Voice synthesis method and apparatus

Info

Publication number: JP2904279B2
Application number: JP63197851A
Authority: JP
Inventors: 哲夫梅田; 徹都木
Original assignee: Nippon Hoso Kyokai NHK
Current assignee: Japan Broadcasting Corp
Priority date: 1988-08-10
Filing date: 1988-08-10
Publication date: 1999-06-14
Anticipated expiration: 2014-06-14
Also published as: JPH0247700A

Description

[Detailed Description of the Invention] [Field of Industrial Application] The present invention relates to a speech synthesis method and apparatus that produce output by synthesizing pre-stored speech uttered by a human on the basis of output speech information.

[Summary of the Invention] The present invention is a speech synthesis method and apparatus that, when concatenating pre-stored human-uttered speech to output meaningful speech, take into account the effects on the synthesized speech of the concatenation and of the added prosodic information, so that the clarity and naturalness of the synthesized speech are not impaired.

[Prior Art] Known speech synthesis schemes of this kind include the recording-and-editing scheme, which concatenates and plays back pre-recorded speech of words and phrases; the parameter-editing scheme, which records parameters obtained by first analyzing the waveform and controls a synthesizer with those parameters at playback; and the synthesis-by-rule scheme, which synthesizes speech from parameters stored for unit speech such as phonemes and syllables together with prosodic information, such as accent and intonation, generated at synthesis time by predetermined rules.

[Problems to Be Solved by the Invention] However, in the recording-and-editing and parameter-editing schemes, the output speech is selected from utterances registered in advance by a person, so sound quality is preserved within each registered utterance. Because the utterances are concatenated and output as they are, however, discontinuities arise in the vocal-cord vibration (pitch) frequency and in the formants, and the resulting speech sounds unnatural.

The synthesis-by-rule scheme, in turn, outputs the result of passing a sound source (an impulse train corresponding to the pitch frequency, white noise, or an estimated glottal waveform) through a vocal-tract filter at synthesis time. Even in these cases, changing the accent or intonation, that is, changing the pitch period of the speech, changes the formants as well and produces phonemically unclear speech. Unnaturalness also arises at the junctions between the unit speech segments being synthesized.

Accordingly, an object of the present invention is to solve the above problems and to provide a speech synthesis method that preserves the phonemic quality of the synthesized speech and retains the naturalness of human speech even when pitch and formants are controlled finely to improve the quality of the synthesized speech, for example by using smaller synthesis units or by adding prosodic information such as intonation.

Another object of the present invention is to provide a method for measuring human perceptual characteristics for spoken language by making it possible to control prosody, such as intonation, and phonemic quality, based on formant frequencies and the like.

[Means for Solving the Problems] To this end, the method of the present invention is characterized by: determining unit speech information and prosodic information on the basis of output speech information; selecting, from pre-stored unit speech data uttered by a human, the unit speech data corresponding to the determined unit speech information; calculating or extracting a pitch period, a spectral envelope, a formant trajectory, and a unit speech waveform from the selected unit speech data; changing the calculated or extracted pitch period so that the unit speech waveforms are connected smoothly and so that the prosodic information is added; calculating, at the changed pitch period, a spectral envelope resulting from the pitch change; calculating a first spectral change from the spectral envelope resulting from the pitch change and the calculated or extracted spectral envelope; changing the calculated or extracted formant trajectory so that the unit speech waveforms are connected smoothly; calculating a spectral envelope resulting from the formant change on the basis of the changed formant trajectory; calculating a second spectral change from the spectral envelope resulting from the formant change and the calculated or extracted spectral envelope; changing, on the basis of the first and second spectral changes, the spectral envelope of the unit speech waveform whose pitch period has been changed; and connecting the unit speech waveforms whose spectral envelopes have been changed and then outputting the connected speech.

The apparatus of the present invention is characterized by comprising: means for determining unit speech information and prosodic information on the basis of output speech information; means for selecting, from pre-stored unit speech data uttered by a human, the unit speech data corresponding to the determined unit speech information; means for calculating or extracting a pitch period, a spectral envelope, a formant trajectory, and a unit speech waveform from the selected unit speech data; means for changing the calculated or extracted pitch period so that the unit speech waveforms are connected smoothly and so that the prosodic information is added; means for calculating, at the changed pitch period, a spectral envelope resulting from the pitch change; means for calculating a first spectral change from the spectral envelope resulting from the pitch change and the calculated or extracted spectral envelope; means for changing the calculated or extracted formant trajectory so that the unit speech waveforms are connected smoothly; means for calculating a spectral envelope resulting from the formant change on the basis of the changed formant trajectory; means for calculating a second spectral change from the spectral envelope resulting from the formant change and the calculated or extracted spectral envelope; means for changing, on the basis of the first and second spectral changes, the spectral envelope of the unit speech waveform whose pitch period has been changed; and means for connecting the unit speech waveforms whose spectral envelopes have been changed and then outputting the connected speech.

[Operation] With the above configuration, high-quality speech synthesis becomes possible in which the intonation required for meaningful speech is added while the phonemic quality and naturalness of each unit speech segment are preserved.

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings.

FIG. 1 is a block diagram of a speech synthesis system according to one embodiment of the present invention. In the figure, 2 denotes a unit speech selection unit, 4 a vocal-cord frequency correction unit, 6 a formant correction unit, and 8 a frequency characteristic changing unit. Each unit is implemented in a computer, and with this configuration the speech synthesis processing is executed using memory such as ROM, RAM, or disk memory.

When output speech information such as text is input to the unit speech selection unit 2, it is first converted into unit speech information, namely a phoneme symbol string serving as discrete linguistic information, and prosodic information such as duration, accent, intonation, and pauses is determined. The sequence of unit speech segments making up the determined phoneme symbol string is then selected from the pre-stored set of unit speech segments.

Instead of determining a phoneme symbol string from the output speech information, words, characters, or the like may be determined as the unit speech information and the unit speech sequence selected on that basis. Also, although only waveforms are stored as unit speech in this example, the linear prediction coefficients or the pitch and formant trajectories of each unit speech segment may be stored with it as unit speech data, so that the pitch period, spectral envelope, and formant trajectory described below need not be calculated each time.

Next, the vocal-cord frequency correction unit 4 identifies the voiced intervals in the unit speech obtained by the unit speech selection unit 2 and obtains the pitch frequency of each voiced interval and its trajectory over time. Referring to the pitch frequency trajectories of the preceding and following unit speech segments, it then modifies the pitch frequency so that the trajectories join smoothly, adds accent and intonation components as further modification components, and stretches or shortens the waveform duration of each pitch period according to the new pitch frequency. As a result, the pitch frequency joins smoothly and the accent, intonation, and other prosodic information are controlled. The unit also obtains the change in the spectral envelope caused by this processing.

The formant correction unit 6 determines the formants in the voiced intervals from the shape of the spectral envelope (the resonance frequencies) and obtains their trajectories. It then refers to the formant frequency trajectories of the preceding and following unit speech segments and modifies the formant frequencies so that they join smoothly. It also obtains the change, relative to the spectral envelope of the original speech, that results from the formant modification.

The frequency characteristic changing unit 8 changes the spectral envelope of the waveform that has been stretched or shortened according to the pitch frequency change, in accordance with the spectral change obtained by the formant correction unit and the spectral change obtained by the vocal-cord frequency correction unit.

The processing in each unit is described in detail with reference to the block diagram and flowchart of FIGS. 2(A) and 2(B). FIG. 2(A) shows the configuration of FIG. 1 in detail: the unit speech selection unit 2 consists of a unit speech storage unit 22 and a speech selection unit 24; the vocal-cord frequency correction unit 4 consists of a pitch extraction control unit 42, a waveform stretching unit 44, a spectral envelope extraction unit 46, and a correction extraction unit 48; the formant correction unit 6 consists of a spectral envelope extraction unit 62, a spectral envelope control unit 64, and a correction extraction unit 66; and the frequency characteristic changing unit 8 consists of an FFT unit 82, a spectral envelope changing unit 84, and an IFFT unit 86. The number to the left of each step in the flowchart of FIG. 2(B) is the number of the unit in FIG. 2(A) that executes that step.

In the above configuration, the unit speech storage unit 22 stores, as unit speech, human utterances of phoneme sequences such as vowel-consonant-vowel that have been A/D converted in advance with 12-bit quantization at a sampling frequency of 15 kHz. The speech selection unit 24 determines a phoneme symbol string from the input output speech information (text or the like) according to predetermined phonological rules, and determines duration, accent, intonation, pauses, and so on according to prosodic rules. The corresponding unit speech segments are then selected from the unit speech storage unit 22 on the basis of the determined phoneme symbol string.

For each retrieved unit speech segment, the pitch extraction control unit 42 first distinguishes sound intervals from silent intervals on the basis of the presence or absence of speech power. For the speech in the sound intervals it then computes the first-order correlation coefficient and the zero-crossing count, and distinguishes unvoiced-consonant intervals from voiced intervals. Examining the high-frequency content of the speech through both the first-order correlation coefficient and the zero-crossing count makes this discrimination reliable.
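The three-way decision described above can be sketched as follows. This is a minimal illustration, not the patent's procedure: the function names and all three thresholds (`power_floor`, `corr_min`, `zc_max_ratio`) are assumptions, since the text gives no numeric criteria.

```python
import numpy as np

def first_order_correlation(x):
    """Normalized lag-1 autocorrelation coefficient r(1) / r(0)."""
    x = np.asarray(x, dtype=float)
    r0 = np.dot(x, x)
    if r0 == 0.0:
        return 0.0
    return float(np.dot(x[:-1], x[1:]) / r0)

def zero_crossing_count(x):
    """Number of sign changes between consecutive samples."""
    s = np.signbit(np.asarray(x, dtype=float))
    return int(np.count_nonzero(s[1:] != s[:-1]))

def classify_frame(x, power_floor=1e-4, corr_min=0.5, zc_max_ratio=0.25):
    """Return 'silent', 'voiced', or 'unvoiced' for one frame.

    The thresholds are illustrative assumptions only.
    """
    x = np.asarray(x, dtype=float)
    if np.mean(x ** 2) < power_floor:          # no speech power: silent interval
        return "silent"
    zc_ratio = zero_crossing_count(x) / (len(x) - 1)
    # Voiced speech is dominated by low frequencies: high lag-1 correlation
    # and few zero crossings. Both conditions must hold.
    if first_order_correlation(x) >= corr_min and zc_ratio <= zc_max_ratio:
        return "voiced"
    return "unvoiced"
```

Requiring both features to agree mirrors the text's point that checking the two measures together is more reliable than either alone.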

The durations of the silent intervals and the waveforms of the unvoiced-consonant intervals identified here are stored in memory unchanged.

The pitch extraction control unit 42 further performs linear prediction analysis on the speech waveform in the voiced interval using a so-called vocal-tract inverse filter to obtain a residual waveform, and performs correlation analysis on this residual to obtain the pitch period from the spacing of the correlation peaks. This is done over all voiced intervals of the unit speech segment.
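A rough sketch of this residual-based pitch estimate is given below. It assumes the autocorrelation method of linear prediction and takes the lag of the largest residual autocorrelation within a search range as the pitch period; the function names, the search range, and that peak-picking rule are assumptions, since the text only says the period is read from the spacing of the correlation peaks.

```python
import numpy as np

def lpc_coefficients(x, order):
    """Autocorrelation-method LPC: a[0..p-1] such that x(m) ~ sum_i a[i-1] * x(m - i)."""
    x = np.asarray(x, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def lpc_residual(x, a):
    """Inverse-filter x: e(m) = x(m) - sum_i a_i * x(m - i)."""
    x = np.asarray(x, dtype=float)
    e = x.copy()
    for i, ai in enumerate(a, start=1):
        e[i:] -= ai * x[:-i]
    return e

def pitch_period_from_residual(x, order, min_lag, max_lag):
    """Pitch period in samples: lag of the largest residual autocorrelation."""
    e = lpc_residual(x, lpc_coefficients(x, order))
    corr = np.array([np.dot(e[:len(e) - k], e[k:])
                     for k in range(min_lag, max_lag + 1)])
    return min_lag + int(np.argmax(corr))
```

At the 15 kHz sampling rate mentioned earlier, a search range of 50 to 150 samples would correspond to pitch frequencies of 100 to 300 Hz.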

Next, each of the obtained pitch periods is modified so that the junction with the immediately preceding (earlier in time) unit speech segment sounds smooth and so that the accent and intonation of the phrase or sentence as a whole are in order, and a new pitch period sequence is calculated.

That is, the average pitch period over all pitch periods in the unit speech segment is first obtained as a geometric mean, in consideration of human hearing.

Here, letting Pn be the n-th pitch period and L the total number of pitch periods in the unit speech segment, the average pitch period Pave is

Pave = (P₁ × P₂ × … × P_L)^(1/L)

Letting Pavel be the average pitch period of the immediately preceding unit speech segment, the average pitch period adjustment R is

R = Pavel/Pave

Further, let Qn be the period modification coefficient calculated from the accent rules and intonation rules. In addition, as shown in FIG. 3, the L pitch periods of the pitch period sequence after the average pitch period adjustment are adjusted by the coefficient Sn given by the following equation, so that the pitch period sequences connect smoothly.

[Equation for Sn not reproduced in this text]

Here, P′_L is the last pitch period of the corrected pitch period sequence of the immediately preceding (earlier) unit speech segment.

Combining the above adjustments, the overall adjustment Rn is Rn = R·Qn·Sn, and multiplying Pn by Rn for each of the L pitch periods gives the new pitch period information.
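The averaging and scaling steps can be sketched as below. The function names are hypothetical. Because the expression for Sn is not available in this text, Qn and Sn are treated as caller-supplied arrays that default to 1.

```python
import numpy as np

def geometric_mean_period(periods):
    """Pave = (P1 * P2 * ... * PL) ** (1 / L), computed in the log domain."""
    p = np.asarray(periods, dtype=float)
    return float(np.exp(np.mean(np.log(p))))

def adjust_pitch_periods(periods, prev_average, q=None, s=None):
    """Return Pn * Rn, where Rn = R * Qn * Sn and R = Pavel / Pave.

    q (accent/intonation coefficients Qn) and s (smoothing coefficients Sn)
    default to 1, i.e. only the average-period adjustment R is applied.
    """
    p = np.asarray(periods, dtype=float)
    q = np.ones_like(p) if q is None else np.asarray(q, dtype=float)
    s = np.ones_like(p) if s is None else np.asarray(s, dtype=float)
    r = prev_average / geometric_mean_period(p)
    return p * r * q * s
```

With Qn = Sn = 1, the geometric mean of the adjusted periods equals the previous segment's average, which is the purpose of the factor R.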

At this time, the spectral envelope extraction unit 62 obtains the spectral envelope of the original speech. As shown in FIG. 4, the point in the original unit speech waveform just before the waveform level rises sharply is taken as the start of a pitch period, and, based on the pitch period first obtained by the pitch extraction control unit 42, the sample one before the start of the next pitch period is taken as the end point, which defines one pitch interval. A window of about 20 ms is applied with the center of the pitch interval as the center of the analysis window. Windowing enables short-time spectral analysis with a finite number of samples. Linear prediction analysis is performed again on the windowed data to calculate the linear prediction coefficients α₁ to α_p. Here p is the order of the linear prediction analysis; in general, p of about 10 is used for female voices and p of about 14 for male voices.

The spectral envelope H(k) of the original speech is then obtained from these linear prediction coefficients α₁ to α_p by the following equation.

[Equation for H(k) not reproduced in this text]

Here N is a power of two larger than the number of samples, and is set to 512.
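Since the expression for H(k) is missing from this text, the following is only a plausible reconstruction. It assumes the common all-pole form, the magnitude of 1/A evaluated at N points on the unit circle, with A(Z) = 1 + α₁Z^-1 + … + α_pZ^-p, which is the sign convention of the root equation given later in the description. Any gain factor the original equation may have contained is omitted, and the function name is hypothetical.

```python
import numpy as np

def lpc_spectral_envelope(alpha, n_fft=512):
    """|1 / A(exp(j*2*pi*k/N))| for A(Z) = 1 + sum_i alpha_i * Z^-i, k = 0..N-1.

    Assumed form: the source equation is not available.
    """
    a = np.concatenate(([1.0], np.asarray(alpha, dtype=float)))
    # Zero-padding the coefficient vector to N points samples A on the unit circle.
    return 1.0 / np.abs(np.fft.fft(a, n_fft))
```

For a single real pole at 0.9 (`alpha = [-0.9]`), the envelope peaks at k = 0 and is smallest at k = N/2.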

This processing is repeated, shifting by one pitch interval each time, until the voiced interval ends.

The waveform stretching unit 44 stretches or shortens the waveform of each pitch period according to the new pitch period information obtained by the pitch extraction control unit 42. Let k be the number of samples in one pitch period of the original unit speech waveform and k′ the number of samples corresponding to the modified pitch period. To shorten the pitch period, the waveform is truncated at the k′-th sample from the start of the pitch interval. To lengthen it, the sample values from m = k+1 to m = k′ are computed with the linear prediction coefficients α₁ to α_p obtained by the spectral envelope extraction unit 62, as in the following equation, to obtain the continuation of the waveform.

x(m) = α₁x(m−1) + α₂x(m−2) + … + α_px(m−p)

This processing is repeated, shifting by one pitch interval each time, until the voiced interval ends. Because the speaking rate changes by the amount the pitch periods are stretched or shortened, waveforms are dropped or repeated in units of one pitch period so that the utterance duration of the original unit speech is preserved. The duration correction based on the prosodic information obtained from the speech selection unit 24 is also performed here by the same means.
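The truncate-or-extrapolate rule for one pitch period can be sketched as follows. The function name is hypothetical, and the sketch assumes the segment is at least as long as the prediction order.

```python
import numpy as np

def resize_pitch_period(segment, new_length, alpha):
    """Truncate to new_length samples, or extend with x(m) = sum_i alpha_i * x(m - i)."""
    x = list(np.asarray(segment, dtype=float))
    if new_length <= len(x):
        return np.array(x[:new_length])        # shorten: cut at the k'-th sample
    p = len(alpha)
    if len(x) < p:
        raise ValueError("segment shorter than the prediction order")
    while len(x) < new_length:                 # lengthen: predict samples k+1 .. k'
        x.append(sum(alpha[i] * x[-1 - i] for i in range(p)))
    return np.array(x)
```

For example, with `alpha = [2, -1]` the predictor continues a straight line, so `[0, 1, 2]` extended to five samples becomes `[0, 1, 2, 3, 4]`.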

Because changing the pitch leaves a large discontinuity between the last sample of one pitch interval's waveform and the first sample of the next, a cubic curve is fitted by least squares to several samples on either side of these two points, and the waveforms are joined continuously.
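One way to realize this least-squares cubic join is sketched below. The window size (`half_width`) and the choice to overwrite every sample in the window with the fitted value are assumptions; the text only says that several samples around the junction are used.

```python
import numpy as np

def smooth_junction(x, boundary, half_width=4):
    """Replace the samples around `boundary` with a least-squares cubic fit.

    `boundary` is the index of the first sample of the following pitch interval.
    """
    x = np.asarray(x, dtype=float).copy()
    lo = max(boundary - half_width, 0)
    hi = min(boundary + half_width, len(x))
    idx = np.arange(lo, hi)
    # Fit against offsets from the boundary to keep the problem well conditioned.
    coeffs = np.polyfit(idx - boundary, x[lo:hi], 3)
    x[lo:hi] = np.polyval(coeffs, idx - boundary)
    return x
```

Samples outside the window are left untouched, and data that already follow a cubic are returned unchanged up to rounding.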

When the processing by the pitch extraction control unit 42, the waveform stretching unit 44, and the spectral envelope extraction unit 62 is finished, the spectral envelope extraction unit 46 first applies a window of about 20 ms, as described above, centered on one pitch interval of the pitch-modified waveform obtained from the waveform stretching unit 44. It then performs linear prediction analysis on these samples to calculate the linear prediction coefficients α₁′ to α_p′, and obtains the spectral envelope after the pitch change by the following equation.

[Equation for the spectral envelope after the pitch change not reproduced in this text]

Here, N is 512 as before.

Next, the correction extraction unit 48 calculates the change U(k) in the spectral envelope distorted by the pitch period modification, relative to the spectral envelope H(k) of the original speech. This change is calculated for each pitch period and stored in memory.

The spectral envelope control unit 64 takes the coefficients α₁ to α_p obtained by the spectral envelope extraction unit 62 and finds the p complex roots Z₁ to Z_p that satisfy the following equation.

1 + α₁Z^-1 + α₂Z^-2 + … + α_pZ^-p = 0

These p roots include complex-conjugate pairs, and each conjugate pair can correspond to one formant. From each root Z_i, the resonance frequency F_i and its bandwidth B_i are obtained by the following equations and stored in memory, and the above processing is repeated, shifting by one pitch interval each time, until the voiced intervals in the unit speech segment end.

F_i = Fs/(2π)·arg(Z_i)
B_i = Fs/π·|log(|Z_i|)|

Here Fs is the sampling frequency.
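The root-finding step and the two formulas for F_i and B_i translate directly into NumPy. In this sketch the function name is hypothetical, and only the root with positive imaginary part is kept from each conjugate pair; real roots are discarded, which is an assumption about how non-resonant roots would be handled.

```python
import numpy as np

def resonances_from_lpc(alpha, fs):
    """Roots of 1 + sum_i alpha_i * Z^-i = 0, converted to (F_i, B_i) in Hz."""
    # Multiplying the equation by Z^p gives Z^p + alpha_1 Z^(p-1) + ... + alpha_p = 0.
    roots = np.roots(np.concatenate(([1.0], np.asarray(alpha, dtype=float))))
    roots = roots[np.imag(roots) > 0]           # one root per conjugate pair
    freqs = fs / (2 * np.pi) * np.angle(roots)  # F_i = Fs/(2*pi) * arg(Z_i)
    bands = fs / np.pi * np.abs(np.log(np.abs(roots)))  # B_i = Fs/pi * |log|Z_i||
    order = np.argsort(freqs)
    return freqs[order], bands[order]
```

Sorting by frequency gives candidates in the low-to-high order in which the first, second, and third formants are then chosen.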

From the resulting series of resonance frequencies, taking bandwidth and continuity into account, the narrow-bandwidth resonances are selected in order of increasing frequency as the first formant, second formant, third formant, and so on, and the formant frequency trajectories are obtained.

To improve the connection with the immediately preceding unit speech segment, the spectral envelope control unit 64 also joins the trajectories of the formants and their bandwidths continuously, as shown in FIG. 5. As before, it interpolates with a least-squares cubic fit over the final samples of the preceding unit speech segment and several samples around the first sample of the segment being processed.

Once the new formant frequency trajectories and bandwidths have been determined in this way, new linear prediction coefficients are obtained as follows.

Let F_i′ be the new resonance frequencies and B_i′ their bandwidths, covering the modified formants, the resonance frequencies of the unmodified formants, and the resonance frequencies that were not recognized as formants. The new roots Z_i′ are obtained from the following equation. The Z_i′ include the complex-conjugate root pairs corresponding to the formants.

Z_i′ = exp(−πB_i′/Fs + j2πF_i′/Fs)

The p-th order equation having these p values Z_i′ as roots is

(1 − Z₁′Z^-1)(1 − Z₂′Z^-1)…(1 − Z_p′Z^-1) = 0

If β_k denotes the coefficient of Z^-k when this is expanded, the equation becomes

1 + β₁Z^-1 + β₂Z^-2 + … + β_pZ^-p = 0

and the coefficients β₁ to β_p give the new linear prediction coefficients. The spectral envelope after the formant change is obtained from these new linear prediction coefficients by the following equation.

[Equation for the spectral envelope after the formant change not reproduced in this text]

Here N is 512.
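The conversion from modified resonances back to prediction coefficients can be sketched with `numpy.poly`, which expands a product of root factors. The function name is hypothetical, and the sketch assumes every resonance is supplied once and contributes a complex-conjugate pair; real roots would need to be appended separately.

```python
import numpy as np

def lpc_from_resonances(freqs, bands, fs):
    """beta_1..beta_p from resonance frequencies F_i' and bandwidths B_i' (Hz).

    Builds Z_i' = exp(-pi*B_i'/Fs + j*2*pi*F_i'/Fs), adds the conjugates,
    and expands prod(1 - Z_i' * Z^-1).
    """
    f = np.asarray(freqs, dtype=float)
    b = np.asarray(bands, dtype=float)
    z = np.exp(-np.pi * b / fs + 2j * np.pi * f / fs)
    roots = np.concatenate((z, np.conj(z)))
    # np.poly returns [1, beta_1, ..., beta_p]; the imaginary parts cancel
    # because the roots come in conjugate pairs.
    return np.real(np.poly(roots))[1:]
```

For a single resonance this reduces to the familiar second-order section with β₁ = −2r·cos θ and β₂ = r², where r = exp(−πB/Fs) and θ = 2πF/Fs.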

The correction extraction unit 66 calculates the change V(k) in the spectrum caused by the formant modification, relative to the spectral envelope H(k) of the original speech obtained by the spectral envelope extraction unit 62. This change is calculated for each pitch period and stored in memory.

In the frequency characteristic changing unit 8, the FFT unit 82 first multiplies the N samples x(1) to x(N) at the pitch period modified by the pitch extraction control unit 42 by the time window w(i) below to obtain y(1) to y(N):

y(i) = w(i)·x(i),  1 ≤ i ≤ N

where

w(i) = 0.5·{1 − cos(πi/L)},  1 ≤ i < L
w(i) = 1,  L ≤ i < N − L
w(i) = 0.5·[1 + cos{π(i − N + L)/L}],  N − L ≤ i ≤ N

An N-point fast Fourier transform is then applied to y(i) to convert it to the frequency domain, giving Y(k).
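The piecewise window and the transform can be written as below. The taper length (the L of the window definition) is not given a value in the text, so it is left as a parameter; the function names are hypothetical, and the sketch assumes N is larger than twice the taper length.

```python
import numpy as np

def tapered_window(n, taper):
    """w(i) for i = 1..n: raised-cosine rise for i < taper, flat middle,
    raised-cosine fall for i >= n - taper."""
    i = np.arange(1, n + 1, dtype=float)
    w = np.ones(n)
    rise = i < taper
    fall = i >= n - taper
    w[rise] = 0.5 * (1.0 - np.cos(np.pi * i[rise] / taper))
    w[fall] = 0.5 * (1.0 + np.cos(np.pi * (i[fall] - n + taper) / taper))
    return w

def windowed_spectrum(x, taper):
    """Y(k): N-point FFT of y(i) = w(i) * x(i)."""
    x = np.asarray(x, dtype=float)
    return np.fft.fft(tapered_window(len(x), taper) * x)
```

The window is flat over most of the frame and falls to zero only at the last sample, so the bulk of the waveform passes through unweighted.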

Next, the spectral envelope changing unit 84 modifies Y(k) using the change component V(k) due to the formant modification, calculated by the correction extraction unit 66, and the change U(k) due to the period modification, calculated by the correction extraction unit 48. That is, the corrected frequency-domain representation is obtained as the product U(k)·V(k)·Y(k) for 1 ≤ k ≤ N.

Next, the IFFT unit 86 converts this corrected spectrum into a time-domain speech waveform by inverse fast Fourier transform. To reduce the effect of distortion at the ends when waveforms are joined, a 20 ms Hamming window is applied at the center of the N resulting samples and that portion is cut out.
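The correction, inverse transform, and cut-out can be sketched in a few lines. The function name and the centering arithmetic are assumptions, and taking the real part of the inverse transform assumes U(k) and V(k) are real and symmetric in k, as ratios of magnitude envelopes would be.

```python
import numpy as np

def corrected_frame(y_spec, u, v, fs=15000, window_ms=20.0):
    """Multiply Y(k) by U(k) and V(k), inverse-transform, and cut out the
    central window_ms of the result with a Hamming window."""
    y_hat = np.asarray(u) * np.asarray(v) * np.asarray(y_spec)
    y_time = np.real(np.fft.ifft(y_hat))
    win_len = int(round(fs * window_ms / 1000.0))   # 300 samples at 15 kHz
    start = (len(y_time) - win_len) // 2
    return y_time[start:start + win_len] * np.hamming(win_len)
```

With N = 512 and a 15 kHz sampling rate, the 300-sample cut-out discards 106 samples at each end of the inverse transform, which is where the end distortion would be concentrated.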

Next, as shown in FIG. 6, y(k) is shifted by 10 ms, the above series of operations (spectral envelope correction and cut-out) is repeated, and the result is overlapped and added to the previously cut-out waveform to form continuous speech.
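The overlap-and-add step can be sketched as follows, assuming all frames have equal length. The function name is hypothetical. At 15 kHz, the 20 ms frames are 300 samples long and the 10 ms shift is a hop of 150 samples.

```python
import numpy as np

def overlap_add(frames, hop):
    """Sum equally long frames, each shifted by `hop` samples from the previous one."""
    frames = [np.asarray(f, dtype=float) for f in frames]
    frame_len = len(frames[0])
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for n, frame in enumerate(frames):
        out[n * hop:n * hop + frame_len] += frame
    return out
```

Hamming windows overlapped by half their length sum to a nearly constant value (about 1.08), so the amplitude of the joined speech stays almost uniform across frame boundaries.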

When processing of one voiced interval is complete, it is joined with the unvoiced-consonant or silent interval stored in memory, and processing moves on to the next voiced interval. All unit speech segments are processed in this way, and the final synthesized speech is D/A converted to produce the output speech.

Much about the human perception of spoken language remains unknown. Consider testing how a physical change to the speech signal waveform affects how the sound is heard as a phoneme of the language, that is, how it affects human perception: for example, how speech sounds when the pitch or prosody is changed subtly. Using an actual person as the speaker is problematic for such tests. Changes in physical condition and the like make it difficult to produce exactly the same utterance twice, and it is generally difficult to have a speaker make subtle articulatory adjustments.

For these reasons, a method of adjusting artificially generated sound has been needed. With conventional methods, however, discontinuities in the vocal-cord vibration frequency and the formant frequencies cause unnaturalness, which makes delicate psychoacoustic experiments difficult.

In contrast, the embodiment of the present invention described above allows fine-grained control while preserving the naturalness of the speech, so this difficulty is resolved.

[Effects of the Invention] As is clear from the above description, the present invention prevents the degradation in sound quality caused by discontinuities when pre-stored unit voices are joined, and makes it possible to synthesize speech with added intonation, accent, and the like without affecting the naturalness and individuality of the original speech.

In addition, because the present invention can synthesize speech while arbitrarily varying prosodic properties such as intonation and accent, and phonemic properties based on formant frequencies and the like, it can provide a method for measuring the perceptual characteristics of human speech.

[Brief description of the drawings]

FIG. 1 is a block diagram of a speech synthesis system showing one embodiment of the present invention.
FIG. 2(A) is a block diagram showing details of the system shown in FIG. 1.
FIG. 2(B) is a flowchart showing the processing of the system shown in FIG. 2(A).
FIG. 3 is a diagram explaining a conversion method that preserves the continuity of the voice vibration frequency trajectory.
FIG. 4 is a diagram showing the waveform cut-out method for a pitch section.
FIG. 5 is a diagram showing a conversion method that preserves the continuity of the formant frequency bandwidth.
FIG. 6 is a diagram showing a method of connecting the processed waveforms.

22: unit voice storage unit, 24: voice selection unit, 42: pitch extraction control unit, 44: waveform expansion/contraction unit, 46: spectral envelope extraction unit, 48: correction extraction unit, 62: spectral envelope extraction unit, 64: spectral envelope control unit, 66: correction extraction unit, 82: FFT unit, 84: spectral envelope changing unit, 86: IFFT unit.

Continued from the front page
(58) Fields searched (Int. Cl.⁶, DB name): G10L 3/00 - 9/18, JOIS file (JICST)

Claims

(57) [Claims]

A speech synthesis method comprising: determining unit voice information and prosody information based on output voice information; selecting, from pre-stored unit voice data uttered by a human, the unit voice data corresponding to the determined unit voice information; calculating or extracting each of a pitch period, a spectral envelope, a formant trajectory, and a unit voice waveform from the selected unit voice data; changing the calculated or extracted pitch period so that the calculated or extracted unit voice waveforms are connected smoothly and the prosody information is added; calculating a spectral envelope resulting from the pitch change at the changed pitch period; calculating a first spectral change based on the spectral envelope resulting from the pitch change and the calculated or extracted spectral envelope; changing the calculated or extracted formant trajectory so that the calculated or extracted unit voice waveforms are connected smoothly; calculating a spectral envelope resulting from the formant change based on the changed formant trajectory; calculating a second spectral change based on the spectral envelope resulting from the formant change and the calculated or extracted spectral envelope; changing, based on the first and second spectral changes, the spectral envelope of the unit voice waveform associated with the change of the pitch period; and connecting the unit voice waveforms whose spectral envelopes have been changed and outputting the connected speech.

A speech synthesis apparatus comprising: means for determining unit voice information and prosody information based on output voice information; means for selecting, from pre-stored unit voice data uttered by a human, the unit voice data corresponding to the determined unit voice information; means for calculating or extracting each of a pitch period, a spectral envelope, a formant trajectory, and a unit voice waveform from the selected unit voice data; means for changing the calculated or extracted pitch period so that the calculated or extracted unit voice waveforms are connected smoothly and the prosody information is added; means for calculating a spectral envelope resulting from the pitch change at the changed pitch period; means for calculating a first spectral change based on the spectral envelope resulting from the pitch change and the calculated or extracted spectral envelope; means for changing the calculated or extracted formant trajectory so that the calculated or extracted unit voice waveforms are connected smoothly; means for calculating a spectral envelope resulting from the formant change based on the changed formant trajectory; means for calculating a second spectral change based on the spectral envelope resulting from the formant change and the calculated or extracted spectral envelope; means for changing, based on the first and second spectral changes, the spectral envelope of the unit voice waveform associated with the change of the pitch period; and means for connecting the unit voice waveforms whose spectral envelopes have been changed and outputting the connected speech.