JPH0863187A

JPH0863187A - Speech synthesizer

Info

Publication number: JPH0863187A
Application number: JP6195178A
Authority: JP
Inventors: Nobuyuki Katae; 伸之片江
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1994-08-19
Filing date: 1994-08-19
Publication date: 1996-03-08
Anticipated expiration: 2015-07-10
Also published as: JP3060276B2

Abstract

PURPOSE: To make synthesized speech easy to understand and give it natural prosody by generating the time-variation pattern of the fundamental frequency separately for the fixed information and the variable information, connecting these patterns in sequence to form the time-variation pattern of the fundamental frequency of the whole sentence, and synthesizing speech from it. CONSTITUTION: A text analysis means generates a phonetic character string from an input sentence and outputs it to an acoustic parameter generation means 11. An F0 pattern and durations are generated for the fixed part by a fixed-part F0 pattern/duration generation means 8 and for the non-fixed part by a non-fixed-part F0 pattern/duration generation means 9. These F0 patterns and durations are connected in sequence by an F0 pattern/duration connection editing means 10 to produce the F0 pattern and durations of the whole sentence. The acoustic parameter generation means 11 generates acoustic parameters from the phonetic character string. A speech signal generation means 12 then generates and outputs a speech signal from the F0 pattern, the durations, and the acoustic parameters.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a speech synthesizer, and in particular to a speech synthesizer, used in voice services for traffic information, weather summaries, and the like, that synthesizes speech composed of fixed information common to every message in the group of messages to be synthesized (hereinafter, the fixed part) and variable information not shared across the message group (hereinafter, the non-fixed part).

[0002] In recent years, the demand throughout society for labor saving and mechanization has grown ever stronger, and voice services are no exception: speech synthesizers are currently used in voice services for traffic information and weather summaries, bank transfer inquiry services, and the like. A speech synthesizer must therefore provide synthesized speech that is easy to understand and has natural prosody.

[0003]

[Prior Art] In conventional speech synthesizers, the fixed part is produced either by a recording-and-editing method, which plays back previously recorded speech, or by an analysis-synthesis method, which stores that speech converted into some form of speech parameters and synthesizes speech from those parameters. Non-fixed parts such as proper nouns and numbers are produced by a rule-based synthesis method, which generates speech from a character string by rule. The speech synthesized by each method was generally output by connecting the pieces or by switching between them.

[0004] FIG. 9 shows the configuration of a speech synthesizer according to the prior art. In the figure, 1 denotes text input means, 2 text analysis means, 3 fixed-part synthesis means, 4 non-fixed-part synthesis means, 5 output speech connection means, and 6 speech output means. The text input to text input means 1 is analyzed by text analysis means 2 with reference to a word dictionary. As a result, the fixed part is passed to fixed-part synthesis means 3, which synthesizes speech from stored fixed-part speech data. The part consisting of variable information is passed to non-fixed-part synthesis means 4, which performs rule-based synthesis from the character string. The speech synthesized by each synthesis means is connected by output speech connection means 5 so that it runs together as a sentence, and is output through speech output means 6.

[0005]

[Problems to Be Solved by the Invention] In terms of speech quality, however, rule-based synthesis is at present inferior to the recording-and-editing and analysis-synthesis methods.

[0006] Consequently, in speech that connects a fixed part produced by recording-and-editing or analysis-synthesis with a non-fixed part produced by rule-based synthesis, there is a quality gap between the two, and the non-fixed part, which carries the important information in the sentence, is hard to understand. By contrast, a sentence generated at uniform quality throughout is easier to understand, and since technical improvements have raised the quality of rule-based synthesis in recent years, synthesizing everything by rule has become good enough for practical use. Naturally, if rule-based synthesis is used throughout, the speech need not be re-recorded when the fixed part is to be changed.

[0007] When speech is synthesized from the mixed kanji-kana text used in everyday life, rule-based synthesis, unlike recording-and-editing or analysis-synthesis, must generate natural prosody (intonation, accent, pauses, and so on) by referring to dictionaries and rules. Two problems arise in this process.

[0008] The first problem lies in the process of analyzing mixed kanji-kana text to generate a phonetic character string. Here, a phonetic character string is a string of phonemes (in Japanese, roughly equivalent to romanized spelling) or of syllables (in Japanese, roughly equivalent to kana spelling) that also contains notation marking pause positions and accent positions. Because Japanese is not written with spaces between words and a kanji can have many readings, generating the phonetic character string from dictionaries and rules frequently produces misreadings, accent errors, and unnaturally placed pauses. The first problem has been addressed by rule-based synthesis of a character string extracted from a speech-conversion input string file, which serves as storage means holding previously prepared input character strings that contain prosodic information (see Japanese Laid-Open Patent Publication No. H4-107598), but a reduction in implementation cost is required.

[0009] The second problem lies in the process of generating acoustic (physical) parameters from the phonetic character string. Intonation, for example, is variation in voice pitch, and is generally controlled using the time-variation pattern of the fundamental frequency, the lowest frequency contained in voiced speech (hereinafter, the F0 pattern). The F0 pattern is expressed as a time series of fundamental-frequency values taken every few milliseconds (msec). Well-known rules for generating the F0 pattern from the phonetic character string include the Fujisaki model and the point-pitch model, but it is difficult to obtain by simple rules an F0 pattern that reflects the complex human speech production mechanism and that varies subtly with content and meaning. In addition, the duration of each phoneme or syllable is set to an appropriate value so that the utterance sounds natural, neither halting nor drawn out. This duration, however, is not uniquely determined by the type of phoneme or syllable; it is influenced in complex ways by the position of the phoneme or syllable in the sentence and by the surrounding phonological environment, so it too cannot be obtained by simple rules.

[0010]

[Means for Solving the Problems] FIG. 2 is a conceptual diagram of the present invention. The invention is explained below with reference to that figure and the example sentence "Kon'ya no [Tōkyō] chihō no tenki wa [hare] deshō." ("The weather in the [Tokyo] region tonight will be [sunny].").

[0011] This sentence consists of the fixed part "Kon'ya no ... chihō no tenki wa ... deshō." ("The weather in the ... region tonight will be ....") and the non-fixed parts "Tōkyō" (Tokyo) and "hare" (sunny); each non-fixed part can be replaced with a word such as "Kanagawa-ken" (Kanagawa Prefecture) or "ame" (rain). To synthesize such a sentence, the F0 pattern and durations of the fixed part are extracted from a human utterance of the sentence and stored: the F0 pattern as, for example, a time series of fundamental-frequency values every few msec, and the durations as a sequence of per-phoneme lengths. For the non-fixed parts, F0 patterns are stored for every combination of syllable count and accent type of the words or phrases expected as input, and the F0 pattern whose syllable count and accent type match is read out according to the input sentence or the phonetic character string obtained by analyzing it. Because this F0 pattern depends not only on the syllable count and accent type but also on its place within the F0 pattern of the whole sentence, a separate set of F0 patterns is held for each insertion position in the fixed part, and the selection is made from the set for that position. For example, the word "Tōkyō" has 4 morae and accent type 0, so the 4-mora, type-0 F0 pattern is selected from the patterns to be inserted at the position "Kon'ya no ... chihō". The durations of the non-fixed part are generated by rule. The F0 patterns and durations retrieved (or generated) separately for the fixed and non-fixed parts are then connected in order to produce the F0 pattern of the whole sentence. The F0 pattern is connected so as to be continuous across the whole sentence.
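The lookup and connection described in this paragraph can be sketched as follows. This is only an illustration under stated assumptions: the table layout, the slot names `region` and `weather`, and every F0 value are hypothetical and do not come from the patent.

```python
from typing import Dict, List, Tuple

# Hypothetical store keyed by (insertion slot, mora count, accent type).
# Each value is an F0 pattern in Hz, one value per frame; the numbers are made up.
SlotKey = Tuple[str, int, int]

NON_FIXED_F0: Dict[SlotKey, List[float]] = {
    ("region", 4, 0): [120.0, 135.0, 140.0, 138.0],
    ("region", 4, 1): [150.0, 130.0, 120.0, 115.0],
    ("weather", 4, 0): [110.0, 122.0, 125.0, 121.0],
}


def select_non_fixed_f0(slot: str, mora_count: int, accent_type: int) -> List[float]:
    """Return the stored F0 pattern for a word inserted at the given slot."""
    return NON_FIXED_F0[(slot, mora_count, accent_type)]


def connect(segments: List[List[float]]) -> List[float]:
    """Connect per-segment F0 patterns in sentence order."""
    sentence: List[float] = []
    for segment in segments:
        sentence.extend(segment)
    return sentence


# Fixed-part patterns would be extracted from a recorded utterance; made up here.
fixed_head = [130.0, 128.0]
fixed_tail = [125.0, 122.0, 118.0]

# "Tokyo" has 4 morae and accent type 0, so that pattern is chosen for the region slot.
tokyo = select_non_fixed_f0("region", 4, 0)
sentence_f0 = connect([fixed_head, tokyo, fixed_tail])
```

Keying the table on the slot as well as on the mora count and accent type reflects the statement that the same word shape needs a different pattern at each insertion position.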

[0012] Even if the F0 patterns for the non-fixed part are not stored but are generated by rule, the resulting speech is of higher quality than when the F0 pattern of the entire sentence is generated by rule.

[0013]

[Operation] FIG. 1 shows the principle of the present invention. In the figure, 1 denotes text input means, 7 text analysis means, 8 fixed-part F0 pattern/duration generation means, 9 non-fixed-part F0 pattern/duration generation means, 10 F0 pattern/duration connection editing means, 11 acoustic parameter generation means, 12 speech signal generation means, and 6 speech output means. The text to be synthesized is input to text input means 1. Text analysis means 7 separates the input text into non-fixed and fixed parts. If the input is ordinary mixed kanji-kana text, separating it into fixed and non-fixed parts requires the kind of text analysis used for rule-based synthesis of arbitrary sentences; if the user interface allows the fixed and non-fixed parts to be entered separately, however, it is enough simply to pass each part to its own F0 pattern/duration generation means. Text analysis means 7 also generates a phonetic character string (a phoneme string or syllable string) from the input sentence and outputs it to acoustic parameter generation means 11. The F0 pattern and durations are generated for the fixed part by fixed-part F0 pattern/duration generation means 8 and for the non-fixed part by non-fixed-part F0 pattern/duration generation means 9. These F0 patterns and durations are connected in sequence by F0 pattern/duration connection editing means 10 to produce the F0 pattern and durations of the whole sentence. Acoustic parameter generation means 11 generates acoustic parameters such as formants from the phonetic character string (phoneme string, syllable string, or the like). Which acoustic parameters are used depends on the synthesis method employed in speech signal generation means 12. One available synthesis method is waveform editing, which edits waveforms directly; with that method, waveform connection information is generated in place of acoustic parameters, but it is treated here as a kind of acoustic parameter. Speech signal generation means 12 generates a speech signal from the F0 pattern, the durations, and the acoustic parameters, and outputs it through speech output means 6.

[0014]

[Embodiments] Three levels of F0 pattern generation can be considered. At the first level, the F0 pattern extracted from natural speech is stored as is, as a time series of fundamental-frequency values, and read out at synthesis time; this level can be expected to yield the most natural synthesized speech. At the second level, the F0 pattern of natural speech is approximated by a model, the model parameters are stored, and at synthesis time the parameters are converted into a time series of fundamental-frequency values. At the third level, the model parameters are generated by rule from the text analysis result, and the time series of fundamental-frequency values is generated from those parameters.
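As a rough illustration of the second level, the sketch below stores an F0 contour as a handful of (time, frequency) target points and expands them into a fixed-rate time series by linear interpolation. The piecewise-linear model, the function name `f0_from_targets`, and the 5 msec frame step are assumptions chosen for simplicity; the patent does not prescribe a particular model here.

```python
from typing import List, Tuple


def f0_from_targets(targets: List[Tuple[float, float]], frame_ms: float = 5.0) -> List[float]:
    """Expand (time_ms, f0_hz) target points into a fixed-rate F0 time series.

    Values between neighbouring targets are linearly interpolated.
    Requires at least two targets with strictly increasing times.
    """
    start, end = targets[0][0], targets[-1][0]
    frame_count = int((end - start) / frame_ms) + 1
    series: List[float] = []
    i = 0  # index of the segment [targets[i], targets[i + 1]] containing t
    for k in range(frame_count):
        t = start + k * frame_ms
        while i < len(targets) - 2 and t > targets[i + 1][0]:
            i += 1
        (t0, f0), (t1, f1) = targets[i], targets[i + 1]
        weight = (t - t0) / (t1 - t0)
        series.append(f0 + weight * (f1 - f0))
    return series


# Three stored parameters stand in for nine frames of F0 values.
contour = f0_from_targets([(0.0, 120.0), (20.0, 140.0), (40.0, 100.0)])
```

The point of the second level is visible in the sizes: a few stored parameters replace the full time series, at the cost of whatever detail the model cannot represent.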

[0015] Two levels of duration generation can be considered. At the first level, the durations extracted from natural speech are stored as is, as a sequence of durations, and read out at synthesis time. At the second level, the durations are generated by rule from the text analysis result. Combinations of these levels can be used to generate the F0 patterns and durations of the non-fixed and fixed parts. These combinations are described below as embodiments.

[0016] FIG. 3 shows the configuration of the first embodiment of the present invention. This embodiment corresponds to claims 2, 4, 8, and 9. In the figure, 011 denotes a text input unit, 71 a text analysis unit, 72 a fixed/non-fixed determination unit, 73 an output switching unit, 74 a word dictionary, 75 a fixed-part example sentence storage unit, 81 a fixed-part duration reading unit, 82 a fixed-part F0 pattern reading unit, 83 a fixed-part duration storage unit, 84 a fixed-part F0 pattern storage unit, 91 a non-fixed-part duration generation unit, 92 a non-fixed-part F0 pattern reading unit, 93 an accent dictionary, 94 a non-fixed-part F0 pattern storage unit, 101 a duration connection editing unit, 102 an F0 pattern connection editing unit, 111 an acoustic parameter generation unit, 112 an acoustic parameter storage unit, 121 a speech signal generation unit, and 61 a speech output unit.

[0017] In advance, the fixed-part F0 pattern extracted from natural speech is stored in the fixed-part F0 pattern storage unit 84; the non-fixed-part F0 patterns for every combination of syllable count and accent type are stored in the non-fixed-part F0 pattern storage unit 94; and the fixed-part durations extracted from natural speech are stored in the fixed-part duration storage unit 83. The text to be synthesized is input to the text input unit 011. If the input is in mixed kanji-kana notation, the text analysis unit 71 analyzes the text with reference to the word dictionary 74. The fixed/non-fixed determination unit 72 refers to the fixed example sentences stored in the fixed-part example sentence storage unit 75 and separates the analysis result into fixed and non-fixed parts. The output switching unit 73 outputs the fixed part and the non-fixed part to their respective duration and F0 pattern generation units. At the same time, the phonetic character string of the input text (a phoneme string, syllable string, or the like) obtained by analyzing the text is output to the acoustic parameter generation unit 111.
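The separation into fixed and non-fixed parts by comparison with a stored example sentence can be approximated with a pattern match. The sketch below is a deliberately simplified stand-in: it uses a regular expression on the raw text instead of the dictionary-based analysis described above, and the template, the group names, and the function `split_parts` are all hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical stored example sentence. The named groups mark the non-fixed slots;
# everything outside them is the fixed part.
TEMPLATE = re.compile(r"^今夜の(?P<region>.+?)地方の天気は(?P<weather>.+?)でしょう。$")


def split_parts(text: str) -> Optional[Dict[str, str]]:
    """Return the non-fixed words keyed by slot, or None if the text does not fit."""
    match = TEMPLATE.match(text)
    return match.groupdict() if match else None


parts = split_parts("今夜の東京地方の天気は晴れでしょう。")
other = split_parts("今夜の神奈川県地方の天気は雨でしょう。")
unmatched = split_parts("明日は休みです。")
```

A real system would still need the word dictionary to obtain readings and accents for the extracted words; the match only decides which span belongs to which slot.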

[0018] For the fixed part, the fixed-part duration reading unit 81 reads the durations from the fixed-part duration storage unit 83, and the fixed-part F0 pattern reading unit 82 reads the F0 pattern from the fixed-part F0 pattern storage unit 84; each is output by way of the duration connection editing unit 101 to the F0 pattern connection editing unit 102. For the non-fixed part, the non-fixed-part duration generation unit 91 generates the durations by rule. Rule-based duration generation generally looks up a duration table for each phoneme or syllable of the non-fixed part and corrects the result according to the phonemic environment and similar factors. Next, the non-fixed-part F0 pattern reading unit 92 obtains the accent of the non-fixed-part word from the accent dictionary 93, consults the non-fixed-part F0 pattern storage unit 94 using the syllable count and accent type, and outputs the F0 pattern it reads to the duration connection editing unit 101 and the F0 pattern connection editing unit 102. The duration connection editing unit 101 connects the phoneme durations of the fixed and non-fixed parts in sequence, creating the duration sequence for the whole sentence. The F0 pattern connection editing unit 102 connects the F0 patterns of the fixed and non-fixed parts in sequence, creating the F0 pattern of the whole sentence. Because F0 is continuous during an utterance, if the F0 patterns read for the fixed part and the non-fixed part are discontinuous where they meet, editing such as appropriate smoothing must be applied.

[0019] Meanwhile, the acoustic parameter generation unit 111 generates acoustic parameters from the input phonetic character string. The acoustic parameters are stored in the acoustic parameter storage unit 112. Acoustic parameters here are numerical representations of speech data obtained with a speech production model in order to reduce data volume; types include formants, PARCOR, and LSP. The synthesis methods that use these parameters are called formant synthesis, PARCOR synthesis, and LSP synthesis respectively, and are implemented by the speech signal generation unit 121. One available synthesis method is waveform editing, which edits waveforms directly; with that method, waveform connection information is generated in place of acoustic parameters, but it is treated here as a kind of acoustic parameter. The acoustic parameters are stored per phonetic character, or in units subdivided further according to the preceding and following phonemic environment. Reading them out according to the phonetic character string and concatenating them produces the acoustic parameter sequence of the sentence to be synthesized. The speech signal generation unit 121 generates a speech signal from the durations, F0 pattern, and acoustic parameter sequence generated as above. The speech output unit 61 outputs the speech signal as synthesized speech by D/A conversion.

[0020] FIG. 4 shows the configuration of the second embodiment of the present invention. This embodiment corresponds to claims 3 and 5. In this embodiment, the fixed-part F0 pattern reading unit 82 and fixed-part F0 pattern storage unit 84 of the first embodiment are replaced by a fixed-part F0 parameter reading unit 85, a fixed-part F0 pattern generation unit 86, and a fixed-part F0 parameter storage unit 87, and the non-fixed-part F0 pattern reading unit 92 and non-fixed-part F0 pattern storage unit 94 are replaced by a non-fixed-part F0 parameter reading unit 95, a non-fixed-part F0 pattern generation unit 96, and a non-fixed-part F0 parameter storage unit 97.

[0021] In this embodiment, the F0 patterns extracted from natural speech are approximated by a model in advance, and the resulting parameters are stored in the fixed-part F0 parameter storage unit 87 and the non-fixed-part F0 parameter storage unit 97. At synthesis time, for the fixed part, the fixed-part F0 parameter reading unit 85 reads the fixed-part F0 parameters from the fixed-part F0 parameter storage unit 87, and the fixed-part F0 pattern generation unit 86 generates the time series of fundamental-frequency values (the F0 pattern) from those parameters. Similarly, for the non-fixed part, the non-fixed-part F0 parameter reading unit 95 obtains the accent of the non-fixed-part word from the accent dictionary 93 and reads the appropriate F0 parameters from the non-fixed-part F0 parameter storage unit 97 according to the syllable count and accent type, and the non-fixed-part F0 pattern generation unit 96 generates the time series of fundamental-frequency values (the F0 pattern) from those parameters.

[0022] FIG. 5 shows the configuration of the third embodiment of the present invention. This embodiment corresponds to claim 6. In this embodiment, the non-fixed-part F0 pattern reading unit 92 and non-fixed-part F0 pattern storage unit 94 of the first embodiment are replaced by a non-fixed-part F0 pattern generation unit 98. Processing in the other parts is the same as in the first embodiment, so only the non-fixed-part F0 pattern generation unit 98 is described.

[0023] The non-fixed-part F0 pattern generation unit 98 obtains the accent of the non-fixed-part word from the accent dictionary 93 and generates the F0 pattern by rule, taking into account factors such as the word's position in the sentence. F0 patterns are commonly generated by rule using a model such as the Fujisaki model or the point-pitch model, and these models can be applied here as well.
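For orientation, the Fujisaki model in its usual textbook form expresses log F0 as a baseline plus the responses to phrase commands and accent commands. The sketch below follows that form; the constants (alpha of 3 per second, beta of 20 per second, ceiling gamma of 0.9) are commonly quoted typical values, and the command timings and amplitudes are made up. How the patent's rules would derive the commands from the accent dictionary and sentence position is not specified and is not modeled here.

```python
import math
from typing import List, Tuple


def phrase_response(t: float, alpha: float = 3.0) -> float:
    """Impulse response of the phrase control mechanism (zero before onset)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0


def accent_response(t: float, beta: float = 20.0, gamma: float = 0.9) -> float:
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(
    times_s: List[float],
    base_hz: float,
    phrases: List[Tuple[float, float]],         # (onset_s, amplitude)
    accents: List[Tuple[float, float, float]],  # (onset_s, offset_s, amplitude)
) -> List[float]:
    """Evaluate F0 in Hz at each time: ln F0 = ln Fb + phrase terms + accent terms."""
    contour = []
    for t in times_s:
        log_f0 = math.log(base_hz)
        log_f0 += sum(amp * phrase_response(t - onset) for onset, amp in phrases)
        log_f0 += sum(
            amp * (accent_response(t - on) - accent_response(t - off))
            for on, off, amp in accents
        )
        contour.append(math.exp(log_f0))
    return contour


# One phrase command at 0.5 s and one accent command from 0.6 s to 0.9 s.
f0 = fujisaki_f0([0.0, 0.8, 3.0], 100.0, [(0.5, 0.4)], [(0.6, 0.9, 0.3)])
```

Before any command starts the contour sits at the baseline, rises while the accent command is active, and decays back toward the baseline afterwards.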

[0024] FIG. 6 shows the configuration of the fourth embodiment of the present invention. This embodiment corresponds to claims 10 and 11. In this embodiment, the user interface of the text input unit of the first embodiment is replaced so that the text is analyzed more accurately. The input interface unit 012 reads the fixed part from the fixed-part example sentence storage unit 013 and displays it as a user interface as shown in FIG. 7 or FIG. 8. In FIG. 7, the fixed part is shown in display-only columns, and the non-fixed part in columns with an edit function in which words can be freely entered and edited, prompting the user to enter the non-fixed part. With input through such an interface, no determination of fixed versus non-fixed parts is needed, and the text can be analyzed by looking up only the fixed part in the word dictionary 74.

[0025] In FIG. 8, input candidates for the non-fixed part are stored in the fixed-part example sentence storage unit 013; when a non-fixed-part column is selected, the candidates that may be entered at that position are displayed, and the user specifies which one to input using candidate selection means. Here too, no determination of fixed versus non-fixed parts is needed, and the text can be analyzed by looking up only the fixed part in the word dictionary 74. Subsequent processing is the same as in the other embodiments.
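The interfaces of FIG. 7 and FIG. 8 amount to a template whose fixed segments are read-only and whose slots accept either free text or one of a stored list of candidates. A minimal data-structure sketch follows; the `Slot` class, the `fill` function, and the candidate lists are hypothetical and only illustrate why no fixed/non-fixed determination is needed afterwards.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class Slot:
    """An editable column. An empty candidate list means free text entry."""
    name: str
    candidates: List[str] = field(default_factory=list)


# Plain strings are display-only fixed segments; Slot objects are non-fixed columns.
TEMPLATE: List[Union[str, Slot]] = [
    "今夜の",
    Slot("region", ["東京", "神奈川県"]),
    "地方の天気は",
    Slot("weather", ["晴れ", "雨"]),
    "でしょう。",
]


def fill(template: List[Union[str, Slot]], choices: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the sentence as (kind, text) pairs, already labeled fixed or non-fixed."""
    parts: List[Tuple[str, str]] = []
    for item in template:
        if isinstance(item, Slot):
            value = choices[item.name]
            if item.candidates and value not in item.candidates:
                raise ValueError(f"{value!r} is not a candidate for slot {item.name!r}")
            parts.append(("non-fixed", value))
        else:
            parts.append(("fixed", item))
    return parts


labeled = fill(TEMPLATE, {"region": "東京", "weather": "雨"})
```

Because each segment arrives already labeled, the downstream generation units can be fed directly, which is the simplification the fourth embodiment describes.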

[0026]

[Effects of the Invention] As described above, according to the present invention, a speech synthesizer for synthesizing fixed-form sentence speech, used in voice services for traffic information, weather summaries, and the like, can synthesize speech that is easy to understand and has natural prosody.

[Brief description of drawings]

FIG. 1 is a diagram showing the principle of the present invention.

FIG. 2 is a conceptual diagram showing the basic idea of the present invention.

FIG. 3 shows the first embodiment of the present invention.

FIG. 4 shows the second embodiment of the present invention.

FIG. 5 shows the third embodiment of the present invention.

FIG. 6 shows the fourth embodiment of the present invention.

FIG. 7 shows a first example of the user interface of the present invention.

FIG. 8 shows a second example of the user interface of the present invention.

FIG. 9 shows a conventional example.

[Explanation of symbols]

1 Text input means
2, 7 Text analysis means
3 Fixed part synthesis means
4 Atypical part synthesis means
5 Output speech connection means
6 Speech output means
8 Fixed part F0 pattern / duration generation means
9 Atypical part F0 pattern / duration generation means
10 F0 pattern / duration connection editing means (abbreviated as editing means)
11 Acoustic parameter generation means
12 Speech signal generation means
61 Speech output unit
71, 71' Text analysis unit
72 Fixed / atypical determination unit
73 Output switching unit
74 Word dictionary
75, 013 Fixed part sentence example storage unit
81 Fixed part duration reading unit
82 Fixed part F0 pattern reading unit
83 Fixed part duration storage unit
84 Fixed part F0 pattern storage unit
85 Fixed part F0 parameter reading unit
86 Fixed part F0 pattern generation unit
87 Fixed part F0 parameter storage unit
91 Atypical part duration generation unit
92 Atypical part F0 pattern reading unit
93 Accent dictionary
94 Atypical part F0 pattern storage unit
95 Atypical part F0 parameter reading unit
96, 98 Atypical part F0 pattern generation unit
97 Atypical part F0 parameter storage unit
011 Text input unit
012 Input interface unit
101 Duration connection editing unit
102 F0 pattern connection editing unit
111 Acoustic parameter generation unit
112 Acoustic parameter storage unit
121 Speech signal generation unit

Claims

[Claims]

1. A speech synthesizer that synthesizes speech by connecting fixed information common to all of a group of messages to be synthesized with variable information that is not common across the group of messages, characterized in that, for generating a time-varying pattern of the fundamental frequency, the speech synthesizer comprises: first generating means for generating a time-varying pattern of the fundamental frequency of the fixed information; second generating means for generating a time-varying pattern of the fundamental frequency of the variable information; editing means for sequentially connecting the time-varying patterns of the fundamental frequency generated by the respective generating means to generate a time-varying pattern of the fundamental frequency of the sentence; and synthesizing means for synthesizing a speech signal using the time-varying pattern of the fundamental frequency generated by the editing means.

2. The speech synthesizer according to claim 1, wherein the first generating means comprises means for storing, in the form of a time series of the fundamental frequency, the time-varying pattern of the fundamental frequency of the fixed information extracted from natural speech, and means for reading from the storing means the time series of the fundamental frequency suited to the input sentence, thereby generating the time-varying pattern of the fundamental frequency.

3. The speech synthesizer according to claim 1, wherein the first generating means comprises means for storing the time-varying pattern of the fundamental frequency of the fixed information extracted from natural speech in the form of model parameters that approximate the time-varying pattern of the fundamental frequency, means for reading from the storing means the parameters suited to the input sentence, and means for generating a time series of the fundamental frequency from the parameters, thereby generating the time-varying pattern of the fundamental frequency.

4. The speech synthesizer according to claim 1, wherein the second generating means comprises means for storing, in the form of a time series of the fundamental frequency, the time-varying patterns of the fundamental frequency extracted from natural speech for combinations of the number of syllables and the accent type of the variable information, and means for selecting and reading from the storing means the time series of the fundamental frequency suited to the input sentence, thereby generating the time-varying pattern of the fundamental frequency.

5. The speech synthesizer according to claim 1, wherein the second generating means comprises means for storing, in the form of model parameters that approximate the pattern, the time-varying patterns of the fundamental frequency extracted from natural speech for all combinations of the number of syllables and the accent type of the variable information, means for selecting and reading appropriate parameters from the storing means, and means for generating a time series of the fundamental frequency from the parameters, thereby generating the time-varying pattern of the fundamental frequency.

6. The speech synthesizer according to claim 1, wherein the second generating means has means for generating the time-varying pattern of the fundamental frequency of the variable information according to rules.

7. A speech synthesizer characterized in that, for generating a duration, which is a sequence of the time lengths of synthesis units, the speech synthesizer comprises: first generating means for generating the duration of fixed information; second generating means for generating the duration of variable information; editing means for sequentially connecting the durations generated by the respective generating means to generate the duration of the sentence; and means for synthesizing a speech signal using the duration.

8. The speech synthesizer according to claim 7, wherein the first generating means comprises means for storing the duration of the fixed information extracted from natural speech and means for reading from the storing means the duration suited to the input sentence, thereby generating the duration.

9. The speech synthesizer according to claim 7, further comprising generating means for generating the duration of the variable information.

10. The speech synthesizer according to claim 1, further comprising text input means that presents the fixed information and has the user enter the sentence to be synthesized through a user interface for inputting and editing the variable information, thereby enabling the fixed information and the variable information to be separated.

11. The speech synthesizer according to claim 1, further comprising text input means that presents the fixed information together with input candidates for the variable information and has the user select the variable information from the candidates, thereby enabling the fixed information and the variable information to be separated.
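The connection step recited in claims 1 and 7 can be illustrated with a small sketch. It is not the patented implementation. The names (`Prosody`, `FIXED_STORE`, `ATYPICAL_STORE`, `connect`), the romanized phrases, and all numeric values are hypothetical. The constant per-unit duration for the atypical part is a placeholder assumption. The editing step here is plain concatenation, with no adjustment at the joins.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Prosody:
    f0_hz: List[float]       # F0 time series, one value per frame
    durations_ms: List[int]  # one duration per synthesis unit


# Hypothetical store for fixed parts (claims 2 and 8): F0 time series and
# durations that would be extracted from natural speech. Values are invented.
FIXED_STORE: Dict[str, Prosody] = {
    "kyou no tenki wa": Prosody([120.0, 135.0, 128.0, 110.0], [90, 80, 85, 100]),
    "desu": Prosody([105.0, 95.0], [110, 140]),
}

# Hypothetical store for atypical parts (claim 4), keyed by
# (number of syllables, accent type). Values are invented.
ATYPICAL_STORE: Dict[Tuple[int, int], List[float]] = {
    (2, 0): [115.0, 130.0],
    (2, 1): [140.0, 115.0],
}


def fixed_prosody(text: str) -> Prosody:
    """First generating means: read the stored pattern for a fixed part."""
    return FIXED_STORE[text]


def atypical_prosody(syllables: int, accent_type: int,
                     unit_ms: int = 100) -> Prosody:
    """Second generating means: select F0 by syllable count and accent type.

    The constant duration per unit is an assumption made for this sketch.
    """
    f0 = ATYPICAL_STORE[(syllables, accent_type)]
    return Prosody(list(f0), [unit_ms] * syllables)


def connect(parts: Sequence[Prosody]) -> Prosody:
    """Editing means: connect the per-part patterns in sentence order."""
    f0: List[float] = []
    durations: List[int] = []
    for part in parts:
        f0.extend(part.f0_hz)
        durations.extend(part.durations_ms)
    return Prosody(f0, durations)


sentence = connect([
    fixed_prosody("kyou no tenki wa"),
    atypical_prosody(2, 1),
    fixed_prosody("desu"),
])
```

The resulting `sentence` holds one F0 contour and one duration sequence for the whole utterance, which is the input the synthesizing means of claims 1 and 7 would consume.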