JP3235747B2

JP3235747B2 - Voice synthesis device and voice synthesis method

Info

Publication number: JP3235747B2
Application number: JP31135692A
Authority: JP
Inventors: 敬一山田; 芳明及川; 直人岩橋
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1992-10-27
Filing date: 1992-10-27
Publication date: 2001-12-04
Anticipated expiration: 2016-12-04
Also published as: JPH06138894A

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Table of Contents] The present invention will be described in the following order: Field of Industrial Application; Prior Art; Problems to Be Solved by the Invention; Means for Solving the Problems (FIG. 1); Operation (FIG. 1); Embodiment (FIGS. 1 to 4); Effects of the Invention.

【０００２】[0002]

[Field of Industrial Application] The present invention relates to a speech synthesis apparatus and a speech synthesis method, and is particularly suitable for application to an apparatus that generates synthesized speech by a rule-based speech synthesis method.

【０００３】[0003]

[Prior Art] Conventionally, a speech synthesis apparatus using the rule-based speech synthesis method analyzes a sequence of input characters and then synthesizes parameters according to predetermined rules, so that any words can be synthesized as speech. That is, such an apparatus analyzes the input character sequence, detects an accent for each phrase according to predetermined rules, and, from the arrangement of the phrases, synthesizes pitch parameters that express the intonation, pauses, and the like of the character sequence as a whole.

[0004] Similarly, according to predetermined rules, the speech synthesis apparatus divides each phrase into speech units such as CV units and then generates synthesis parameters representing their spectrum. A synthesized sound is then produced on the basis of the pitch parameters and the synthesis parameters.

[0005] To synthesize higher-quality speech, for the voiced portions of a speech unit, which are periodic, real speech is analyzed and the speech waveform data corresponding to one period is held for each period; for the unvoiced portions, which are not periodic, the real speech is held as speech waveform data as it is. At synthesis time, these speech waveform data are superimposed on the basis of the pitch parameters to generate the synthesized sound.

【０００６】[0006]

[Problems to Be Solved by the Invention] In the conventional waveform superposition technique, speech is synthesized by adjusting the number of frames, repeating or thinning out the speech waveform data within a speech unit on the basis of the pitch parameters. Each speech unit used here is influenced by the preceding and following phonemic environment of the real speech from which it was extracted, and that influence appears in the synthesized speech.
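The frame-count adjustment described above can be sketched as follows. This is only an illustration under assumptions: the function name `adjust_frame_count` and the proportional index mapping are hypothetical, since the text says only that frames are repeated or thinned out.

```python
def adjust_frame_count(frames, target_count):
    """Repeat or thin out one-pitch frames so that target_count frames remain.

    Hypothetical helper: the proportional index mapping below is an assumed
    selection rule, not one stated in the patent.
    """
    if target_count <= 0 or not frames:
        return []
    n = len(frames)
    # Each output slot i takes source frame floor(i * n / target_count).
    # More slots than sources repeats frames; fewer slots skips frames.
    return [frames[(i * n) // target_count] for i in range(target_count)]


# Lengthening a unit repeats frames; shortening it drops frames.
longer = adjust_frame_count(["f0", "f1", "f2"], 6)
shorter = adjust_frame_count(["f0", "f1", "f2", "f3"], 2)
```

Because every output frame is a verbatim copy of a stored frame, any coloring from the original phonemic environment is carried into the synthesized speech unchanged, which is the weakness the paragraph points out.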

[0007] That is, for some speech units the phonemic environment at synthesis time differs from the phonemic environment in the real speech from which the unit was extracted. As a result, speech waveforms that are unnatural compared with real speech are generated at the junctions between speech units in the synthesized speech, discontinuities in the frequency domain cause mishearing and the like, and the quality of the synthesized speech tends to deteriorate.

[0008] The present invention has been made in view of the above points, and proposes a speech synthesis apparatus and a speech synthesis method capable of producing synthesized speech that shows little quality degradation relative to actual human speech and does not sound unnatural.

【０００９】[0009]

[Means for Solving the Problems] To solve this problem, the present invention provides: a speech unit storage section 2 which, for each speech unit, stores in a memory, as the speech unit and for the required number of frames, speech waveform data corresponding to each one-pitch period obtained by analysis of real speech for the periodic voiced portion within the speech unit, which also stores in the memory, for the front-end frame and the rear-end frame of the speech unit, envelope information and fine structure information obtained by the analysis together with the speech waveform data, and which stores in the memory the real speech as it is as speech waveform data for the aperiodic unvoiced portion within the speech unit; a speech synthesis rule section 4 which generates a pitch pattern according to predetermined phonological rules and prosodic rules based on input phonological symbols and prosodic symbols; and a speech synthesis section 5 which, for an interpolation frame at synthesis of the voiced portion, interpolates the envelope information of the rear-end frame of the preceding speech unit and of the front-end frame of the following speech unit, combines the result with the fine structure information of the preceding speech unit to obtain a time waveform, uses that time waveform as the speech waveform data of the interpolation frame, and generates synthesized speech on the basis of the speech units and the pitch pattern.

[0010] The present invention also provides: a sentence analysis section 3 which analyzes a sequence of input characters and detects the boundaries of words and phrases and the basic accents; a speech unit storage section 2 which, for each speech unit, stores in a memory, as the speech unit and for the required number of frames, speech waveform data corresponding to each one-pitch period obtained by analysis of real speech for the periodic voiced portion within the speech unit, which also stores in the memory, for the front-end frame and the rear-end frame of the speech unit, envelope information and fine structure information obtained by the analysis together with the speech waveform data, and which stores in the memory the real speech as it is as speech waveform data for the aperiodic unvoiced portion within the speech unit; a speech synthesis rule section 4 which generates a pitch pattern according to predetermined phonological rules and prosodic rules based on the analysis results of the sentence analysis section 3; and a speech synthesis section 5 which, for an interpolation frame at synthesis of the voiced portion, interpolates the envelope information of the rear-end frame of the preceding speech unit and of the front-end frame of the following speech unit, combines the result with the fine structure information of the preceding speech unit to obtain a time waveform, uses that time waveform as the speech waveform data of the interpolation frame, and generates synthesized speech on the basis of the speech units and the pitch pattern.

[0011] Further, in the present invention, for each speech unit, speech waveform data corresponding to each one-pitch period obtained by analysis of real speech is stored in a memory, as the speech unit and for the required number of frames, for the periodic voiced portion within the speech unit; for the front-end frame and the rear-end frame of the speech unit, envelope information and fine structure information obtained by the analysis are stored in the memory together with the speech waveform data; for the aperiodic unvoiced portion within the speech unit, the real speech is stored in the memory as it is as speech waveform data; a pitch pattern is generated according to predetermined phonological rules and prosodic rules based on input phonological symbols and prosodic symbols; and, for an interpolation frame at synthesis of the voiced portion, the envelope information of the rear-end frame of the preceding speech unit and of the front-end frame of the following speech unit is interpolated and combined with the fine structure information of the preceding speech unit to obtain a time waveform, that time waveform is used as the speech waveform data of the interpolation frame, and synthesized speech is generated on the basis of the speech units and the pitch pattern.

[0012] Further, in the present invention, a sequence of input characters is analyzed to detect the boundaries of words and phrases and the basic accents; for each speech unit, speech waveform data corresponding to each one-pitch period obtained by analysis of real speech is stored in a memory, as the speech unit and for the required number of frames, for the periodic voiced portion within the speech unit; for the front-end frame and the rear-end frame of the speech unit, envelope information and fine structure information obtained by the analysis are stored in the memory together with the speech waveform data; for the aperiodic unvoiced portion within the speech unit, the real speech is stored in the memory as it is as speech waveform data; a pitch pattern is generated according to predetermined phonological rules and prosodic rules based on the results of analyzing the character sequence; and, for an interpolation frame at synthesis of the voiced portion, the envelope information of the rear-end frame of the preceding speech unit and of the front-end frame of the following speech unit is interpolated and combined with the fine structure information of the preceding speech unit to obtain a time waveform, that time waveform is used as the speech waveform data of the interpolation frame, and synthesized speech is generated on the basis of the speech units and the pitch pattern.

[0013] Further, in the present invention, speech based on Japanese is used as the speech.

【００１４】[0014]

[Operation] Speech units are stored in the memory for the required number of frames: for the periodic voiced portion, as speech waveform data corresponding to each one-pitch period obtained by analysis of real speech, and for the aperiodic unvoiced portion, as the real speech itself. Within each speech unit, the front-end frame and the rear-end frame hold, together with the speech waveform data, the envelope information and fine structure information of that waveform data obtained by the analysis. A pitch pattern for the synthesized speech is generated according to predetermined prosodic rules, the speech waveform data needed for the synthesized speech is read from the memory according to the phonological rules, and the synthesized sound is generated on the basis of the speech waveform data and the pitch pattern. Synthesized speech can thereby be produced with little quality degradation relative to actual human speech and without sounding unnatural.

【００１５】[0015]

[Embodiment] An embodiment of the present invention will be described in detail below with reference to the drawings.

[0016] In FIG. 1, reference numeral 1 denotes, as a whole, the schematic configuration of a speech synthesis apparatus that includes an arithmetic processing unit; it is divided into a speech unit storage section 2, a sentence analysis section 3, a speech synthesis rule section 4, and a speech synthesis section 5.

[0017] The sentence analysis section 3 analyzes a text input (a sentence or the like expressed as a sequence of characters) entered from a predetermined input device with reference to a predetermined dictionary, converts it into a kana character string, and then breaks it down into words and phrases. Because Japanese, unlike English, is not written with spaces between words, an expression such as 「米国産業界」 can be segmented in two ways: 「米国／産業・界」 (US / industry / world) and 「米／国産／業界」 (rice / domestically produced / industry).
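The segmentation ambiguity can be reproduced with a small dictionary lookup. The sketch below is illustrative only: the function `segmentations` and the six-entry `LEXICON` are hypothetical, and it merely enumerates candidate splits; it does not implement the statistical disambiguation that the sentence analysis section 3 performs.

```python
def segmentations(text, dictionary):
    """Return every way to split text into words found in dictionary."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], dictionary):
                results.append([head] + rest)
    return results


# Hypothetical toy lexicon, just large enough to show both readings.
LEXICON = {"米国", "産業", "界", "米", "国産", "業界"}

for words in segmentations("米国産業界", LEXICON):
    print("／".join(words))
```

With this toy lexicon exactly two candidate segmentations come back, matching the two readings given in the text; choosing between them is where word-sequence and frequency information would come in.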

[0018] The sentence analysis section 3 therefore breaks the text input down into words and phrases by referring to the dictionary while using the sequential relationships between words and the statistical properties of words, and thereby detects the boundaries of words and phrases. The sentence analysis section 3 further detects the basic accent of each word and outputs the results to the speech synthesis rule section 4.

[0019] The speech synthesis rule section 4 processes the detection results of the sentence analysis section 3 and the text input according to predetermined phonological rules set on the basis of the characteristics of Japanese. Natural Japanese speech, when classified by its linguistic characteristics, can be divided into roughly 100 units of utterance. For example, the word 「さくら」 (sakura) can be divided into five CV/VC units: "sa" + "ak" + "ku" + "ur" + "ra".
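The "sakura" decomposition can be sketched in a few lines. This is a simplified illustration under a strong assumption: the input is romanized text that strictly alternates one consonant letter and one vowel letter. The function name `cv_vc_units` is hypothetical, and real Japanese input (long vowels, the moraic nasal, geminates, bare vowels) would need more rules than shown here.

```python
VOWELS = set("aiueo")


def cv_vc_units(romaji):
    """Split a strictly alternating C-V-C-V... string into CV/VC units."""
    if len(romaji) % 2 != 0:
        raise ValueError("expected an even number of letters (whole CV morae)")
    for i, ch in enumerate(romaji):
        # Even positions must be consonants, odd positions vowels.
        if (ch in VOWELS) != (i % 2 == 1):
            raise ValueError(f"not a strict CV alternation at position {i}")
    # Adjacent letter pairs alternate between CV and VC units.
    return [romaji[i:i + 2] for i in range(len(romaji) - 1)]


units = cv_vc_units("sakura")  # ["sa", "ak", "ku", "ur", "ra"]
```

Each VC unit straddles a mora boundary, so consecutive units always share a phoneme, and that shared phoneme is where units are later joined.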

[0020] In addition, when words are joined in Japanese, the pronunciation can differ from that of the words in isolation: the initial syllable of the later word may become voiced (sequential voicing, or rendaku), and ga-row sounds in non-initial position may become nasalized. The phonological rules of the speech synthesis rule section 4 are therefore set in accordance with these characteristics of Japanese, and the section converts the text input into a phonological symbol string according to those rules (that is, a sequence such as "sa" + "ak" + "ku" + "ur" + "ra" above). On the basis of this phonological symbol string, the speech synthesis rule section 4 then loads the data of each speech unit from the speech unit storage section 2.

[0021] The speech synthesis apparatus 1 produces synthesized speech by a waveform editing technique, and the data loaded from the speech unit storage section 2 is the waveform data used to generate the synthesized sound represented by each CV/VC unit. The speech unit data used for this waveform synthesis is structured as follows. For the voiced part of the speech unit data, speech waveform data corresponding to one pitch in the voiced portion of real speech is stored in the memory for the required number of frames; for the unvoiced part, the waveform of the unvoiced portion of real speech is cut out and stored in the memory as it is.

[0022] As shown in FIG. 2, for the front-end frame and the rear-end frame of the voiced part of the speech unit data, the envelope information (FIG. 2(B)) and fine structure information (FIG. 2(C)) of the speech waveform data, obtained by an analysis such as cepstrum analysis, are stored in the memory together with the speech waveform data (FIG. 2(A)). Accordingly, when the speech unit data is in CV/VC units and the consonant part C of a speech unit CV is an unvoiced consonant, there are the cut-out waveform of the unvoiced portion and a plurality of frames each consisting of a one-pitch speech waveform, and the rear-end frame among them also contains the envelope information and fine structure information of the corresponding speech waveform.

[0023] One item of speech unit data is constituted in this way. When the consonant part C of a speech unit CV is a voiced consonant, the leading frame and the rear-end frame among the plurality of frames, each consisting of a one-pitch speech waveform, contain the envelope information and fine structure information of their respective speech waveforms, and one item of speech unit data is constituted in this way.
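One common way to obtain an envelope and a fine structure from a frame by cepstrum analysis is low-quefrency liftering of the log magnitude spectrum. The sketch below assumes that reading; the function name `cepstral_split`, the FFT size, and the lifter cutoff are hypothetical choices, and the patent does not specify how its analysis is parameterized.

```python
import numpy as np


def cepstral_split(frame, n_lifter=20, n_fft=512):
    """Split a frame's log spectrum into envelope and fine structure.

    Assumed procedure: low-quefrency cepstral coefficients give the
    envelope, and the remainder of the log spectrum is the fine structure.
    """
    log_mag = np.log(np.abs(np.fft.rfft(frame, n_fft)) + 1e-10)
    cepstrum = np.fft.irfft(log_mag, n_fft)  # real cepstrum, symmetric
    lifter = np.zeros(n_fft)
    lifter[:n_lifter] = 1.0
    if n_lifter > 1:
        lifter[-(n_lifter - 1):] = 1.0  # keep the mirrored coefficients too
    envelope = np.fft.rfft(cepstrum * lifter).real
    fine_structure = log_mag - envelope
    return envelope, fine_structure


# Example on a synthetic windowed tone standing in for a one-pitch frame.
frame = np.sin(2 * np.pi * np.arange(200) / 50.0) * np.hanning(200)
envelope, fine_structure = cepstral_split(frame)
```

By construction the two parts sum back to the log magnitude spectrum, so storing both for an end frame loses no magnitude information while allowing the envelope to be manipulated separately.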

[0024] The speech synthesis rule section 4 combines the speech unit data loaded from the speech unit storage section 2 in the order corresponding to the text input (this data is hereinafter called synthesized waveform data), thereby obtaining a synthesized speech waveform that reads out the text input without intonation. When speech units are joined within a voiced part of the synthesized waveform data, the following processing is performed.

[0025] For a phoneme chain C′VC″ to be synthesized in the speech synthesis section 5, the groups of waveform data in the data C′V and VC″ loaded from the speech unit storage section 2 are arranged in order, and at the junction within the shared phoneme V of the chain, speech waveform data obtained by the following interpolation is used. As shown in FIG. 3, interpolation such as linear interpolation is performed using the envelope information in the terminal frame of the preceding speech unit C′V and the envelope information in the leading frame of the following speech unit VC″; the envelope information obtained by this interpolation, with the fine structure information in the terminal frame of the preceding speech unit C′V added to it, is taken as the frequency information of the interpolation frame at the junction.

[0026] This frequency information is converted from the frequency domain to the time domain to give the speech waveform data of the interpolation frame at the junction, which is used in the synthesized waveform data. Speech units are joined by this interpolation in the same way within the shared phoneme C of the speech units V′C and CV″ in another phoneme chain V′CV″ in which C is a voiced consonant.
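The junction interpolation can be sketched as follows, assuming the envelope and fine structure are log magnitude spectra of equal length. The function name `interpolation_frame`, the `weight` parameter, and in particular the zero-phase reconstruction are assumptions made for illustration; the text does not say how phase is handled when returning to the time domain.

```python
import numpy as np


def interpolation_frame(env_prev, env_next, fine_prev, weight=0.5):
    """Build one interpolated frame at a speech unit junction.

    env_prev:  envelope of the terminal frame of the preceding unit
    env_next:  envelope of the leading frame of the following unit
    fine_prev: fine structure of the terminal frame of the preceding unit
    weight:    0.0 gives the preceding envelope, 1.0 the following one
    """
    # Linear interpolation of the two envelopes.
    envelope = (1.0 - weight) * env_prev + weight * env_next
    # Frequency information = interpolated envelope + fine structure.
    magnitude = np.exp(envelope + fine_prev)
    # Assumption: zero phase. The inverse FFT then yields a symmetric
    # pulse, which fftshift centers in the frame.
    return np.fft.fftshift(np.fft.irfft(magnitude))


# Flat toy spectra: the result is a single centered impulse.
n_bins = 257
wave = interpolation_frame(np.zeros(n_bins), np.full(n_bins, 2.0), np.zeros(n_bins))
```

Several interpolation frames can be produced by stepping `weight` from 0 toward 1. Passing the fine structure of the following unit's leading frame instead of `fine_prev` gives the alternative arrangement that the description also allows.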

[0027] The speech synthesis rule section 4 further divides the text input into segments of suitable length on the basis of predetermined prosodic rules and detects the breaks (that is, pauses). For example, when the sentence 「きれいな花を山田さんからもらいました」 ("I received beautiful flowers from Mr. Yamada") is entered as the text input as shown in FIG. 4(A), the text input is broken down into 「きれいな」 (kireina), 「はなを」 (hanao), 「やまださんから」 (yamadasankara), and 「もらいました」 (moraimashita) as shown in FIG. 4(B), and a pause is then detected between 「はなを」 and 「やまださんから」.

[0028] The speech synthesis rule section 4 further detects the accent of each phrase on the basis of the prosodic rules and the basic accent of each word. The accent of a Japanese phrase in isolation can be expressed perceptually with two levels, high and low, taking a kana character as the unit (hereinafter called a mora). The accent position of a phrase can then be classified according to the content of the phrase and the like.

[0029] For example, 端 (hashi, edge), 箸 (hashi, chopsticks), and 橋 (hashi, bridge) are two-mora words that can be classified, respectively, as type 0 with no accent, type 1 with the accent on the first mora, and type 2 with the accent on the second mora. In this embodiment, the speech synthesis rule section 4 classifies the phrases of the text input as type 1, type 2, type 0, and type 4 in order, as shown in FIG. 4(C), and thereby detects accents and pauses phrase by phrase.
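The two-level representation can be illustrated by mapping an accent type to a high/low string. The mapping below follows the usual textbook description of Tokyo-dialect pitch accent and is an assumption here, since the patent does not spell out its own rule; the function name `accent_pattern` is hypothetical.

```python
def accent_pattern(n_morae, accent_type):
    """Return a string of 'H'/'L' levels, one per mora, for a phrase.

    Assumed convention: type 0 has no fall, type 1 falls after the first
    mora, and type n falls after the n-th mora; the first mora is low
    unless it carries the accent.
    """
    if n_morae < 1 or not 0 <= accent_type <= n_morae:
        raise ValueError("accent_type must lie between 0 and n_morae")
    if accent_type == 1:
        return "H" + "L" * (n_morae - 1)
    if accent_type == 0:
        return "L" + "H" * (n_morae - 1)
    return "L" + "H" * (accent_type - 1) + "L" * (n_morae - accent_type)


# The four phrases of the example sentence: (morae, accent type).
for morae, accent in [(4, 1), (3, 2), (8, 0), (6, 4)]:
    print(accent_pattern(morae, accent))
```

Under this convention a two-mora word of type 0 and one of type 2 both come out as low-high in isolation; they differ only in whether a following particle stays high or drops, which is why the classification works at the level of the whole phrase.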

[0030] The speech synthesis rule section 4 further generates a basic pitch pattern representing the intonation of the whole text input on the basis of the accent and pause detection results. Whereas the accent of a phrase in Japanese can be expressed perceptually with two levels, the actual intonation falls gradually from the accent position (FIG. 4(D)). Moreover, when phrases are joined into a sentence in Japanese, the intonation falls gradually from one pause toward the next (FIG. 4(E)).

[0031] On the basis of these characteristics of Japanese, the speech synthesis rule section 4 therefore generates, for each mora, a parameter representing the intonation of the whole text input, and then sets parameters between moras by interpolation so that the intonation changes smoothly as it does when a person speaks. The speech synthesis rule section 4 thus combines the parameters of each mora and the interpolated parameters in the order corresponding to the text input (the result is hereinafter called a pitch pattern), and thereby obtains a pitch pattern representing the intonation of speech reading out the text input, as shown in FIG. 4(F).
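The interpolation between per-mora parameters can be sketched as below. Treating the parameter as a pitch value in hertz, using linear interpolation, and the names `pitch_pattern` and `points_per_mora` are all assumptions for illustration; the text states only that parameters are set between moras by interpolation.

```python
def pitch_pattern(mora_targets, points_per_mora=4):
    """Linearly interpolate between consecutive per-mora pitch targets."""
    if points_per_mora < 1:
        raise ValueError("points_per_mora must be at least 1")
    if not mora_targets:
        return []
    contour = []
    for left, right in zip(mora_targets, mora_targets[1:]):
        for k in range(points_per_mora):
            # k = 0 reproduces the mora target; later k values fill the gap.
            contour.append(left + (right - left) * k / points_per_mora)
    contour.append(mora_targets[-1])
    return contour


# Hypothetical targets for a phrase that rises and then declines.
contour = pitch_pattern([110.0, 140.0, 130.0, 120.0], points_per_mora=4)
```

The gradual fall after the accent and the decline between pauses would be expressed in the per-mora targets themselves; the interpolation only smooths the transitions between them.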

[0032] The speech synthesis section 5 performs waveform synthesis on the basis of the synthesized waveform data and the pitch pattern to generate the synthesized sound. The waveform synthesis proceeds as follows. In the voiced parts of the synthesized speech, the one-pitch waveform data in the synthesized waveform data is placed and superimposed according to the pitch pattern. In the unvoiced parts, the cut-out waveforms in the synthesized waveform data are used as they are as the waveform of the desired synthesized speech.
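The superposition for voiced parts can be sketched as a plain overlap-add in which the spacing between successive frames comes from the pitch pattern. The function name `overlap_add` and the representation of the pitch pattern as a list of periods in samples are assumptions for illustration.

```python
import numpy as np


def overlap_add(frames, periods):
    """Superimpose one-pitch waveforms at onsets spaced by pitch periods.

    frames:  list of 1-D arrays, each a one-pitch waveform
    periods: periods[i] is the spacing in samples from frame i to frame
             i + 1 (the last entry is unused)
    """
    if not frames or len(frames) != len(periods):
        raise ValueError("need one period per frame and at least one frame")
    # Frame i starts at the sum of all earlier periods.
    starts = np.concatenate(([0], np.cumsum(periods[:-1]))).astype(int)
    length = max(int(s) + len(f) for s, f in zip(starts, frames))
    out = np.zeros(length)
    for s, f in zip(starts, frames):
        out[s:s + len(f)] += f  # overlapping regions are summed
    return out


# Two 4-sample frames placed 2 samples apart overlap in the middle.
voiced = overlap_add([np.ones(4), np.ones(4)], [2, 2])
```

Shorter periods place the frames closer together and raise the perceived pitch, longer periods lower it, so the same stored waveforms can follow any pitch pattern.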

[0033] Synthesized speech whose intonation follows changes in the pitch pattern can thus be obtained. Consequently, in a waveform-superposition speech synthesis system, which yields high-quality synthesized speech, junction distortion at speech unit junctions in voiced portions can be reduced; junctions are joined more smoothly than with a synthesis method that performs no interpolation or one that uses simple waveform interpolation on the time axis, and arbitrary high-quality synthesized speech close to the human voice can be obtained.

[0034] In the above configuration, a text input entered from a predetermined input device is analyzed by the sentence analysis section 3 with reference to a predetermined dictionary, and the boundaries of words and phrases and the basic accents are detected. These detection results are processed by the speech synthesis rule section 4 according to predetermined phonological rules, and synthesized waveform data representing speech that reads out the text input without intonation is generated.

[0035] The detection results for word and phrase boundaries and basic accents are also processed by the speech synthesis rule section 4 according to predetermined prosodic rules, and a pitch pattern representing the intonation of the whole text input is generated. The pitch pattern is output to the speech synthesis section 5 together with the synthesized waveform data, and there the synthesized sound is generated on the basis of the pitch pattern and the synthesized waveform data.

[0036] With the above configuration, in a waveform-superposition speech synthesis system, which yields high-quality synthesized speech, junction distortion at speech unit junctions in voiced portions can be reduced, and by joining speech units more smoothly, a speech synthesis apparatus 1 capable of generating arbitrary high-quality synthesized speech close to the human voice can be realized.

[0037] In the embodiment described above, for a phoneme chain C′VC″ to be synthesized in the speech synthesis section 5, the frequency information of the interpolation frame at the junction may alternatively be obtained by adding, to the envelope information interpolated from the envelope information in the terminal frame of the preceding speech unit C′V and that in the leading frame of the following speech unit VC″, the fine structure information in the leading frame of the following speech unit VC″ instead of the fine structure information in the terminal frame of the preceding speech unit C′V.

【００３８】[0038]

[Effects of the Invention] As described above, according to the present invention, in a waveform-superposition speech synthesis system, which yields high-quality synthesized speech, junction distortion at speech unit junctions in voiced portions can be reduced, and a speech synthesis apparatus capable of synthesizing arbitrary high-quality synthesized speech close to the human voice can be obtained.

[Brief description of the drawings]

FIG. 1 is a block diagram showing a speech synthesis apparatus according to an embodiment of the present invention.

FIG. 2 is a signal waveform diagram used to explain the data of the leading frame and the terminal frame in the speech unit data.

FIG. 3 is a schematic diagram used to explain the interpolation processing.

FIG. 4 is a schematic diagram used to explain the operation of the speech synthesis apparatus.

[Explanation of symbols]

1: speech synthesis apparatus; 2: speech unit storage section; 3: sentence analysis section; 4: speech synthesis rule section; 5: speech synthesis section.

Continuation of front page. (56) References: JP-A-3-48300 (JP, A); JP-A-4-98298 (JP, A); JP-A-63-208099 (JP, A); JP-A-63-136099 (JP, A); JP-A-63-118800 (JP, A); JP-A-62-133490 (JP, A); JP-A-4-199196 (JP, A); JP-A-4-281499 (JP, A); JP-A-5-94196 (JP, A); JP-A-5-108085 (JP, A). (58) Fields searched (Int. Cl.⁷, DB name): G10L 13/06

Claims

(57) [Claims]

1. A speech synthesizer comprising: a voice unit storage unit that, for each voice unit, stores in a memory, as the voice unit, the required number of frames of voice waveform data, each corresponding to one pitch period and obtained by analysis processing of real speech, for a voiced portion having periodicity within the voice unit, that stores in the memory, for the front end frame and the rear end frame of the voice unit, envelope information and fine structure information obtained by the analysis processing together with the voice waveform data, and that stores the real speech as it is in the memory as the voice waveform data for an unvoiced portion having no periodicity within the voice unit; a speech synthesis rule unit that generates a pitch pattern in accordance with predetermined phonological rules and prosodic rules based on input phonological symbols and prosodic symbols; and a speech synthesis unit that, for an interpolation frame used when synthesizing the voiced portion, interpolates the envelope information of the rear end frame of the preceding voice unit and of the front end frame of the rear voice unit, combines the result with the fine structure information of the preceding voice unit to obtain a time waveform, and generates synthesized speech based on the voice unit and the pitch pattern using the time waveform as the voice waveform data of the interpolation frame.

2. A speech synthesizer comprising: a sentence analysis unit that analyzes a sequence of input characters and detects words, phrase boundaries, and basic accents; a voice unit storage unit that, for each voice unit, stores in a memory, as the voice unit, the required number of frames of voice waveform data, each corresponding to one pitch period and obtained by analysis processing of real speech, for a voiced portion having periodicity within the voice unit, that stores in the memory, for the front end frame and the rear end frame of the voice unit, envelope information and fine structure information obtained by the analysis processing together with the voice waveform data, and that stores the real speech as it is in the memory as the voice waveform data for an unvoiced portion having no periodicity within the voice unit; a speech synthesis rule unit that generates a pitch pattern in accordance with predetermined phonological rules and prosodic rules based on the analysis result of the sentence analysis unit; and a speech synthesis unit that, for an interpolation frame used when synthesizing the voiced portion, interpolates the envelope information of the rear end frame of the preceding voice unit and of the front end frame of the rear voice unit, combines the result with the fine structure information of the preceding voice unit to obtain a time waveform, and generates synthesized speech based on the voice unit and the pitch pattern using the time waveform as the voice waveform data of the interpolation frame.

3. A speech synthesis method comprising: for each voice unit, storing in a memory, as the voice unit, the required number of frames of voice waveform data, each corresponding to one pitch period and obtained by analysis processing of real speech, for a voiced portion having periodicity within the voice unit, storing in the memory, for the front end frame and the rear end frame of the voice unit, envelope information and fine structure information obtained by the analysis processing together with the voice waveform data, and storing the real speech as it is in the memory as the voice waveform data for an unvoiced portion having no periodicity within the voice unit; generating a pitch pattern in accordance with predetermined phonological rules and prosodic rules based on input phonological symbols and prosodic symbols; for an interpolation frame used when synthesizing the voiced portion, interpolating the envelope information of the rear end frame of the preceding voice unit and of the front end frame of the rear voice unit and combining the result with the fine structure information of the preceding voice unit to obtain a time waveform; and generating synthesized speech based on the voice unit and the pitch pattern using the time waveform as the voice waveform data of the interpolation frame.

4. A speech synthesis method comprising: analyzing a sequence of input characters and detecting words, phrase boundaries, and basic accents; for each voice unit, storing in a memory, as the voice unit, the required number of frames of voice waveform data, each corresponding to one pitch period and obtained by analysis processing of real speech, for a voiced portion having periodicity within the voice unit, storing in the memory, for the front end frame and the rear end frame of the voice unit, envelope information and fine structure information obtained by the analysis processing together with the voice waveform data, and storing the real speech as it is in the memory as the voice waveform data for an unvoiced portion having no periodicity within the voice unit; generating a pitch pattern in accordance with predetermined phonological rules and prosodic rules based on the analysis result; for an interpolation frame used when synthesizing the voiced portion, interpolating the envelope information of the rear end frame of the preceding voice unit and of the front end frame of the rear voice unit and combining the result with the fine structure information of the preceding voice unit to obtain a time waveform; and generating synthesized speech based on the voice unit and the pitch pattern using the time waveform as the voice waveform data of the interpolation frame.

5. The speech synthesizer and the speech synthesis method according to claim 1, wherein speech in Japanese is used as the speech.