JP2001282274A

JP2001282274A - Voice synthesizer and its control method, and storage medium

Info

Publication number: JP2001282274A
Application number: JP2000099421A
Authority: JP
Inventors: Kenichiro Nakagawa; 賢一郎中川; Takashi Aso; 隆麻生
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2000-03-31
Filing date: 2000-03-31
Publication date: 2001-10-12

Abstract

PROBLEM TO BE SOLVED: To synthesize natural-sounding speech from phonetic text. SOLUTION: When phonetic text data is input from an input device 301, non-acoustic information is estimated from it (303). Prosodic parameters are then generated from the phonetic text data and the non-acoustic information (304). Phoneme segments are then selected (305), concatenated (306), and output from an output device 307.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech synthesizer that synthesizes speech from phonetic text, a control method therefor, and a storage medium.

[0002]

[Description of the Related Art] In general, a speech synthesizer that synthesizes speech from Japanese text (FIG. 1) takes a sentence written in mixed kanji and kana as input and outputs a synthesized waveform. Internally, the speech synthesizer performs two processes: linguistic analysis and acoustic processing. The linguistic analysis unit converts the kanji-kana sentence into text data that represents acoustic information such as the reading, the accent positions, and the accent strengths (referred to here as phonetic text). FIG. 2 shows an example of an input kanji-kana sentence and the corresponding phonetic text.

[0003] The acoustic processing unit generates the synthesized waveform from this phonetic text.

[0004]

[Problem to Be Solved by the Invention] However, when speech is synthesized from phonetic text as in the speech synthesizer described above, it is difficult to obtain natural, human-like speech, even though the acoustic information contained in the phonetic text is taken into account.

[0005] Accordingly, the present invention aims to provide a speech synthesizer that synthesizes natural speech from phonetic text, a control method therefor, and a storage medium.

[0006]

[Means for Solving the Problem] To solve this problem, a speech synthesizer according to the present invention has, for example, the following configuration. Namely, a speech synthesizer that receives phonetic text data containing acoustic information and performs speech synthesis comprises: estimation means for estimating non-acoustic information from the input phonetic text; and parameter setting means for generating parameters that set the output form of the synthesized speech, based on the non-acoustic information obtained by the estimation means and the description in the phonetic text data.

[0007]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

[0008] FIG. 3 is a functional block diagram of the speech synthesis system of this embodiment.

[0009] In FIG. 3, the input device 301 includes a keyboard, a mouse, and the like, and passes user operations to the speech synthesizer 302, which is the core of this embodiment.

[0010] The speech synthesizer 302 selects the phonetic text corresponding to a predetermined Japanese text in response to operation of the input device 301, and supplies the selected phonetic text to the non-acoustic information estimation module 303. Using a word dictionary 308, which describes the acoustic and non-acoustic information of each word, and a connection dictionary 309, which describes grammatical connection rules, the module 303 estimates the non-acoustic information of the input phonetic text. This processing is described in detail later. The acoustic information in the word dictionary 308 includes reading information indicating the reading and accent information indicating the accent position and strength; the non-acoustic information includes part-of-speech information indicating the part-of-speech type and the inflected form.

[0011] The module 304 generates various prosodic parameters using the non-acoustic information estimated from the input phonetic text and the acoustic information described in the phonetic text. This processing is also described in detail later. The prosodic parameters obtained by the module 304 are sent to the phoneme segment selection module 305.

[0012] The phoneme segment selection module 305 selects the most suitable phoneme segment sequence from the phoneme segment dictionary 310 based on the prosodic parameters obtained by the module 304. The phoneme segment concatenation module 306 edits and concatenates the phoneme segments using, for example, the PSOLA method (pitch-synchronous overlap-add) to generate a single set of speech waveform data. The resulting speech waveform data is output from an output device 307 comprising a speaker or the like.

[0013] FIG. 10 is a concrete block diagram of the speech synthesis system of this embodiment.

[0014] In the figure, 1 is a control unit (CPU) that controls the entire apparatus; 2 is a ROM storing a boot program, BIOS, and the like; and 3 is a RAM that provides the work area needed to execute the programs run by the CPU 1 (the operating system 10 and the speech synthesis application program 11 of this embodiment). 4 is a storage device such as a hard disk drive, which stores in advance the operating system (OS) 10, the speech synthesis application 11 of this embodiment, a phonetic text database 12 holding phonetic texts corresponding to various Japanese texts, the word dictionary 308, the connection dictionary 309, and the phoneme segment dictionary 310. Components 1 to 4 constitute the speech synthesizer 302.

[0015] 5 is a keyboard and 6 is a pointing device such as a mouse. Components 5 and 6 constitute the input device 301. 7 is a display device. 8 is a sound source unit (with a built-in amplifier) that converts speech waveform data into a speech (or acoustic) waveform, and 9 is a speaker that outputs the speech waveform produced by the sound source unit 8. Components 7 to 9 constitute the output device 307.

[0016] In this configuration, the modules 303 to 306 in FIG. 3 are included in the speech synthesis application, and most of them are executed as processing procedures of the CPU 1. That is, as shown in FIG. 10, although the speech synthesis system of the embodiment requires hardware such as a speaker, most of it is realized by processing on the CPU 1, so it can be implemented on a general-purpose information processing apparatus such as a personal computer.

[0017] FIG. 4 is a flowchart showing the overall processing procedure of the speech synthesis application of this embodiment.

[0018] First, when the OS 10 has started and the speech synthesis application program 11 has been loaded into the RAM 3, the application waits until phonetic text is input (step S401). During this time, the user operates the graphical user interface of the speech synthesis application with the input device 301 (the keyboard 5, the mouse 6, and so on) while viewing the display device 7. In response to the operation input from the input device 301, the speech synthesis application program 11 reads the phonetic text corresponding to a predetermined Japanese text from the database 12 and inputs it to the speech synthesizer 302.

[0019] When input is confirmed, the module 303 estimates non-acoustic information from the phonetic text (step S402). The module 304 generates various prosodic parameters using the estimated non-acoustic information and the acoustic information contained in the phonetic text (step S403). The module 305 uses these parameters and the content of the phonetic text to select suitable phoneme segments from the phoneme segment dictionary 310 (step S404). The module 306 edits and concatenates them (step S405), and the result becomes the output of the system (step S406).
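
The flow of steps S402 to S406 can be sketched as a simple chain of four stages. The following Python is a minimal sketch only: the function name `synthesize` and the four callables passed to it are hypothetical stand-ins for modules 303 to 306 and do not appear in the specification. Only the ordering of the stages is taken from the description.

```python
from typing import Any, Callable


def synthesize(
    phonetic_text: str,
    estimate_non_acoustic: Callable[[str], Any],   # stand-in for module 303
    generate_prosody: Callable[[str, Any], Any],   # stand-in for module 304
    select_segments: Callable[[str, Any], Any],    # stand-in for module 305
    concatenate_segments: Callable[[Any], Any],    # stand-in for module 306
) -> Any:
    """Chain the stages in the order of steps S402 to S405.

    The return value is what step S406 would hand to the output device.
    All four callables are hypothetical; only their order comes from the
    description.
    """
    non_acoustic = estimate_non_acoustic(phonetic_text)       # S402
    prosody = generate_prosody(phonetic_text, non_acoustic)   # S403
    segments = select_segments(phonetic_text, prosody)        # S404
    return concatenate_segments(segments)                     # S405
```

Passing the stages in as callables keeps the sketch independent of how each module is actually implemented.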

[0020] The above procedure is now described in detail.

[0021] First, the processing of step S402 is described in detail with reference to FIG. 5. In step S402, the word sequence and the part-of-speech information of each word, which is non-acoustic information (for example, "合成 (noun), 音 (noun), を (case particle), 出力 (sa-hen), する (verb)"), are estimated from the input phonetic text (for example, 「ゴーセ’ーオンオシュツリョクスル．」). The variables used in the following description are assumed to have been allocated in the RAM 3 in advance.

[0022] When phonetic text consisting of one or more accent phrases is input, the module 303 determines whether any unprocessed accent phrase remains (step S501). If none remains, all accent phrases are regarded as processed and this processing ends.

[0023] If an unprocessed accent phrase remains, one is selected and stored in the variable acc (step S502). Accent phrases can easily be extracted by using, for example, the space symbols in the phonetic text. In the phonetic text above, for example, the first accent phrase 「ゴーセ’ーオンオ」 is stored in the variable acc.
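
Extracting accent phrases at space symbols can be sketched as follows. This is a minimal sketch, not the specification's implementation: the function name `split_accent_phrases` is hypothetical, and the handling of the comma and period follows the delimiter convention the specification gives later for FIG. 7 (space between accent phrases, "，" for a pause phrase, "．" at the end of the sentence).

```python
import re
from typing import List


def split_accent_phrases(phonetic_text: str) -> List[str]:
    """Split one sentence of phonetic text into accent phrases.

    Assumes accent phrases are separated by whitespace (ASCII or
    full-width), that a comma marks a pause, and that the sentence
    ends with a period.
    """
    body = phonetic_text.strip().rstrip("．.")
    phrases: List[str] = []
    for chunk in re.split("[，,]", body):
        phrases.extend(chunk.split())  # str.split() also splits on U+3000
    return phrases
```

For instance, `split_accent_phrases("ゴーセ’ーオンオ シュツリョクスル．")` yields the two accent phrases of the example sentence; the space in that string is assumed here, since the sample as reproduced above shows none.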

[0024] Next, morphological analysis of the kind used in kana-kanji conversion is applied to the reading information of the phonetic text of the selected accent phrase, and the word sequence constituting this phonetic text and the part-of-speech information of each word are estimated (step S503).

[0025] When morphological analysis is applied to the reading information of phonetic text, the candidates cannot always be narrowed down to one (step S504). In that case, the module 303 performs the processing of step S505. If, on the other hand, the morphological analysis yields no candidate at all, the module 303 sets NULL for the accent phrase stored in the variable acc and proceeds to the next accent phrase (step S509). This NULL indicates that the part-of-speech information of the words constituting the accent phrase could not be estimated.

[0026] The module 303 selects one of the candidates and stores it in the variable kouho (step S505). Next, using the word dictionary 308 and the connection dictionary 309, the module 303 converts the candidate stored in kouho into phonetic text (including acoustic information such as reading information and accent information) and stores the candidate's phonetic text in the variable acc2 (step S506).

[0027] Next, the phonetic text stored in the variable acc is compared with the phonetic text stored in the variable acc2 to determine whether the candidate matches the accent phrase selected in step S502 (step S507). If they match, the word sequence of the candidate stored in kouho and the part-of-speech information of each word are set as the word sequence and part-of-speech sequence of the accent phrase selected in step S502 (step S508).

[0028] For example, suppose that morphological analysis of the accent phrase 「ゴーセ’ーオンオ」 yields two candidates: "業 (noun), 清音 (noun), を (case particle)" and "合成 (noun), 音 (noun), を (case particle)". In this case, the module 303 stores the first candidate, "業清音を", in the variable kouho, generates its phonetic text 「ゴ’ーセーオンヲ」 using the word dictionary 308 and the connection dictionary 309, and stores it in the variable acc2. Because acc and acc2 are not equal, the module 303 goes on to process the second candidate.

[0029] Next, the module 303 stores the second candidate, "合成音を", in kouho, generates its phonetic text 「ゴーセ’ーオンヲ」 using the word dictionary 308 and the connection dictionary 309, and stores it in acc2. Because acc and acc2 are now equal, the module 303 sets the word sequence of the second candidate and the part-of-speech information of each word as the word sequence and part-of-speech sequence of the accent phrase 「ゴーセ’ーオンヲ」.
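
The candidate check of steps S505 to S508 amounts to regenerating phonetic text from each candidate and comparing it with the input accent phrase. The Python below is a minimal sketch under stated assumptions: `pick_candidate`, `toy_to_phonetic`, and the lookup table `TOY_PHONETIC` are hypothetical, and the table merely stands in for the word dictionary 308 and connection dictionary 309. Returning `None` when no candidate matches is also an assumption that mirrors the NULL case of step S509.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Word = Tuple[str, str]  # (surface form, part of speech)


def pick_candidate(
    acc: str,
    candidates: Sequence[Sequence[Word]],
    to_phonetic: Callable[[Sequence[Word]], str],
) -> Optional[List[Word]]:
    """Return the first candidate whose regenerated phonetic text equals acc."""
    for kouho in candidates:           # S505: take one candidate
        acc2 = to_phonetic(kouho)      # S506: candidate -> phonetic text
        if acc2 == acc:                # S507: compare with the input phrase
            return list(kouho)         # S508: adopt words and parts of speech
    return None                        # nothing usable: treated like NULL (S509)


# Toy lookup table standing in for dictionaries 308 and 309 (hypothetical).
TOY_PHONETIC = {
    "業清音を": "ゴ’ーセーオンヲ",
    "合成音を": "ゴーセ’ーオンヲ",
}


def toy_to_phonetic(words: Sequence[Word]) -> str:
    """Join the surface forms and look up the phonetic text in the toy table."""
    return TOY_PHONETIC["".join(surface for surface, _ in words)]
```

With the two candidates from the example, the first is rejected because its accent position differs, and the second is adopted.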

[0030] These steps are repeated for every accent phrase constituting the input phonetic text. After morphological analysis has been applied to the phonetic text of every accent phrase and the part-of-speech information (information indicating the part-of-speech type) of each word has been determined, the processing ends until the next phonetic text is input.

[0031] Next, the processing of step S403 is described in detail with reference to FIG. 6. In step S403, the prosodic parameters of each accent phrase are generated based on the word sequence estimated from the phonetic text of the accent phrase and the part-of-speech information of each word.

[0032] When phonetic text consisting of one or more accent phrases is input, the module 304 determines whether any unprocessed accent phrase remains (step S601). If one remains, it is selected (step S602). Next, one of the unprocessed words constituting the selected accent phrase is selected (step S604). Based on the acoustic information of the selected word (reading, accent position, accent strength, etc.), the module 304 sets the prosodic parameters (pitch frequency, phoneme duration, sound power, etc.) of the phoneme sequence corresponding to that word to default values (step S605).

[0033] Next, based on the part-of-speech information of the selected word, which is non-acoustic information, the module 304 changes the pitch frequency, phoneme duration, and sound power of the phoneme sequence corresponding to that word (steps S606 to S613). A word whose part-of-speech information is NULL is regarded as a word for which part-of-speech estimation failed, and all of its prosodic parameters remain at the default values.

[0034] Steps S606 to S613 are now described concretely. If the part of speech of the selected word is determined to be a noun (step S606), the module 304 increases the pitch frequency, phoneme duration, and sound power (step S610). If it is determined to be an interjection (step S607), the module 304 increases the pitch frequency, phoneme duration, and sound power (step S611). If it is determined to be a numeral (step S608), the module 304 increases the pitch frequency, phoneme duration, and sound power (step S612). If it is determined to be a particle (step S609), the module 304 decreases the pitch frequency and increases the phoneme duration and sound power (step S613). For parts of speech other than nouns, interjections, numerals, and particles, this embodiment uses the default values. Although the example above changes the prosodic parameters according to the part-of-speech type, the invention is not limited to this; for example, the prosodic parameters may be changed according to the inflected form of the word.
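
The part-of-speech rules of steps S606 to S613 can be written as a small table of directions. The following Python is a minimal sketch only: the names `Prosody`, `DIRECTIONS`, and `adjust_prosody` are hypothetical, and the 10 % step size is an assumed magnitude, since the description states only whether each parameter is increased or decreased.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Prosody:
    pitch_hz: float      # pitch frequency
    duration_ms: float   # phoneme duration
    power: float         # sound power


# Direction of each change (+1 increase, -1 decrease), in the order
# (pitch, duration, power), following steps S610 to S613.
DIRECTIONS: Dict[str, Tuple[int, int, int]] = {
    "noun": (+1, +1, +1),          # S610
    "interjection": (+1, +1, +1),  # S611
    "numeral": (+1, +1, +1),       # S612
    "particle": (-1, +1, +1),      # S613
}


def adjust_prosody(default: Prosody, pos: Optional[str], step: float = 0.1) -> Prosody:
    """Scale the default values up or down according to the part of speech.

    `step` is an assumed magnitude. Other parts of speech and None (the
    NULL case) leave the defaults untouched.
    """
    direction = DIRECTIONS.get(pos) if pos is not None else None
    if direction is None:
        return default
    d_pitch, d_duration, d_power = direction
    return Prosody(
        pitch_hz=default.pitch_hz * (1 + step * d_pitch),
        duration_ms=default.duration_ms * (1 + step * d_duration),
        power=default.power * (1 + step * d_power),
    )
```

Keeping the rules in a table makes it straightforward to add further keys, such as inflected forms, which the description mentions as a possible variation.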

[0035] These steps are performed for every word in every accent phrase.

[0036] As described above, according to this embodiment, the prosodic parameters of phonetic text can be generated taking into account not only acoustic information but also non-acoustic information (such as the part-of-speech type of each word), so more natural speech can be synthesized and output.

[0037] <Other Embodiments> This embodiment further describes an example in which the prosodic parameters of the entire sentence are changed based on the non-acoustic information of the phonetic text. This embodiment is realized by modifying part of the speech synthesis system of the first embodiment.

[0038] When an ordinary sentence is uttered, the pitch (height of the voice) of the accent phrases, the pause phrases, and the sentence as a whole tends to fall toward the end, independently of the accent pattern of each accent phrase. In this embodiment, this tendency is defined as the "natural drop component of pitch". FIG. 7 shows an example of the relationship among the phonetic text, the accent pattern, the natural drop component of pitch, and the pitch pattern for such a sentence (hereinafter called a general sentence). In phonetic text, " " (space) indicates an accent phrase boundary, "，" indicates a pause phrase, and "．" indicates the end of the sentence. The pitch pattern is expressed as the sum of the accent pattern and the natural drop component of pitch.

[0039] However, when several nouns are uttered in succession, as in a place name, the utterance sounds more natural and less odd if the natural drop component of pitch is flat. FIG. 8 shows an example of the relationship among the phonetic text, the accent pattern, the natural drop component of pitch, and the pitch pattern for such a sentence (hereinafter called a noun-reading sentence).
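
The relationship stated above, in which the pitch pattern is the sum of the accent pattern and the natural drop component of pitch, can be illustrated numerically. The Python below is a minimal sketch: the function names are hypothetical, the values are arbitrary, and the linear shape of the drop is an assumption, since the specification shows the component only in FIG. 7 and FIG. 8.

```python
from typing import List


def drop_component(n: int, start: float, end: float) -> List[float]:
    """Return n offsets moving linearly from start to end.

    The linear shape is an assumption; start == end gives the flat
    component described for noun-reading sentences.
    """
    if n <= 0:
        return []
    if n == 1:
        return [start]
    step = (end - start) / (n - 1)
    return [start + i * step for i in range(n)]


def pitch_pattern(accent_pattern: List[float], drop: List[float]) -> List[float]:
    """Pitch pattern = accent pattern + natural drop component, point by point."""
    if len(accent_pattern) != len(drop):
        raise ValueError("accent pattern and drop component must have equal length")
    return [a + d for a, d in zip(accent_pattern, drop)]
```

With a falling component the later points of the pattern are lowered, as for a general sentence; with a flat component (`start == end == 0.0`) the accent pattern passes through unchanged, as for a noun-reading sentence.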

[0040] Focusing on these characteristics, this embodiment describes a speech synthesis system that generates synthesized speech with little unnaturalness by determining, from the non-acoustic information obtained from the phonetic text of each accent phrase, whether the sentence formed by the phonetic text is a general sentence or a noun-reading sentence.

[0041] FIG. 9 is a flowchart explaining the procedure for generating prosodic parameters in this embodiment. This processing corresponds to step S403 of the first embodiment.

[0042] When phonetic text consisting of one or more accent phrases is input (here, the phonetic text forms one sentence), the module 304 examines the part-of-speech information of each accent phrase and determines whether any part of speech other than noun is present (step S901). If the sentence is determined to consist only of nouns, the mode flag is set to 1 (step S903). If a part of speech other than noun is present, the sentence is regarded as a general sentence and the mode flag is set to 0 (step S902).
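
The decision of steps S901 to S903 reduces to checking whether every word is a noun. The Python below is a minimal sketch: `decide_mode` and the two constants are hypothetical names, and treating a word whose part of speech could not be estimated (`None`) as "not a noun" is an assumption, since the specification does not address that case.

```python
from typing import Optional, Sequence

GENERAL_MODE = 0       # general sentence (step S902)
NOUN_READING_MODE = 1  # noun-reading sentence (step S903)


def decide_mode(pos_by_phrase: Sequence[Sequence[Optional[str]]]) -> int:
    """Return 1 only if every word in every accent phrase is a noun (S901).

    A None entry (part of speech could not be estimated) is treated as
    "not a noun" here, which is an assumption.
    """
    for phrase in pos_by_phrase:
        for pos in phrase:
            if pos != "noun":
                return GENERAL_MODE
    return NOUN_READING_MODE
```

The flag is computed once per sentence, before the per-word loop, so that every word in the sentence uses the same kind of default values.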

[0043] Next, the module 304 determines whether any unprocessed accent phrase remains (step S904). If one remains, it is selected (step S905). Next, one of the unprocessed words constituting the selected accent phrase is selected (steps S906, S907).

[0044] In general sentence mode (mode = 0) (step S908), the module 304 sets the prosodic parameters (pitch frequency, phoneme duration, sound power, etc.) of the phoneme sequence corresponding to the selected word to default values, based on the acoustic information of that word (reading, accent position, accent strength, etc.) (step S910). These default values take the natural drop component of pitch into account.

[0045] Next, based on the part-of-speech information of the selected word, which is non-acoustic information, the module 304 changes the pitch frequency, phoneme duration, and sound power of that word (steps S911 to S918). A word whose part-of-speech information is NULL is regarded as a word for which part-of-speech estimation failed, and all of its prosodic parameters remain at the default values.

[0046] Steps S911 to S918 are now described concretely. If the part of speech of the selected word is determined to be a noun (step S911), the module 304 increases the pitch frequency, phoneme duration, and sound power (step S915). If it is determined to be an interjection (step S912), the module 304 increases the pitch frequency, phoneme duration, and sound power (step S916). If it is determined to be a numeral (step S913), the module 304 increases the pitch frequency, phoneme duration, and sound power (step S917). If it is determined to be a particle (step S914), the module 304 decreases the pitch frequency and increases the phoneme duration and sound power (step S918). For parts of speech other than nouns, interjections, numerals, and particles, this embodiment uses the default values. Although the example above changes the prosodic parameters according to the part-of-speech type, the invention is not limited to this; for example, the prosodic parameters may be changed according to the inflected form of the word.

[0047] In noun-reading mode (mode = 1) (step S908), on the other hand, the module 304 sets the prosodic parameters (pitch frequency, phoneme duration, sound power, etc.) of the phoneme sequence corresponding to the selected word to default values, based on the acoustic information of that word (reading, accent position, accent strength, etc.) (step S909). The default values set here do not take the natural drop component of pitch into account.

[0048] With this configuration, for a sentence containing parts of speech other than nouns, such as 「キョ’ーハガッコーニイッタ．」, prosodic parameters can be set that take both the part-of-speech type and the natural drop component of pitch into account, while for a sentence consisting only of nouns, such as 「カナガワ’ケンカワサキ’シナカハラ’ク」, prosodic parameters can be set that take neither the part-of-speech type nor the natural drop component of pitch into account.

[0049] In particular, when mode = 0, speech with more natural intonation can be generated, as in the first embodiment described above.

[0050] Also, as described above, controlling the pitch frequency, one of the prosodic parameters, based on non-acoustic information (such as the part-of-speech type) makes it possible to reproduce speech with more intonation. In this case, if the part of speech is an interjection, it is desirable to make the range between the maximum and minimum pitch frequency wider than for nouns, numerals, and particles.

[0051] Controlling the sound power, another prosodic parameter, based on non-acoustic information in addition to the pitch frequency makes more natural speech reproduction possible. In this case, if the part of speech is a numeral, it is desirable to make the sound power greater than for nouns, interjections, and particles.

[0052] Controlling the phoneme duration, another prosodic parameter, based on non-acoustic information in addition to the pitch frequency and sound power makes even more natural speech reproduction possible. In this case, if the part of speech is a numeral, it is desirable to make the phoneme duration longer than for nouns, interjections, and particles.

[0053] In this embodiment, the pitch frequency, sound power, and phoneme duration have been described as the prosodic parameters controlled on the basis of non-acoustic information, but the invention is not limited to these. For example, the pause duration can also be changed based on non-acoustic information. In that case, if the part of speech is a noun or a particle, lengthening the pause immediately after it makes it possible to reproduce speech with more natural intonation.
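
The pause rule just described can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function name `pause_after_word`, the 100 ms base pause, and the factor of 1.5 are all hypothetical, since the specification says only that the pause immediately after a noun or particle is lengthened.

```python
from typing import Optional


def pause_after_word(pos: Optional[str], base_ms: float = 100.0, factor: float = 1.5) -> float:
    """Return the pause length following a word with the given part of speech.

    `base_ms` and `factor` are assumed values chosen for illustration.
    """
    if pos in ("noun", "particle"):
        return base_ms * factor  # lengthen the pause after nouns and particles
    return base_ms               # other parts of speech keep the base pause
```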

[0054] Although the embodiment has been described as operating on a single device, part of the system may instead be built in a client unit on a wired or wireless network, and the system may be configured in any form.

[0055] The present invention may be applied to a system composed of a plurality of devices or to an apparatus consisting of a single device. Needless to say, the object of the present invention is also achieved by supplying a storage medium (or recording medium) on which program code of software that implements the functions of the above-described embodiment is recorded to a system or apparatus, and having a computer (or CPU or MPU) of that system or apparatus read and execute the program code stored in the storage medium. In this case, the program code itself read from the storage medium implements the functions of the above-described embodiment, and the storage medium storing the program code constitutes the present invention. Needless to say, this covers not only the case where the functions of the above-described embodiment are implemented by the computer executing the program code it has read, but also the case where an operating system (OS) or the like running on the computer performs part or all of the actual processing in accordance with the instructions of the program code, and the functions of the above-described embodiment are implemented by that processing.

[0056] Needless to say, this further covers the case where the program code read from the storage medium is written into a memory provided on a function expansion card inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided on the function expansion card or function expansion unit performs part or all of the actual processing in accordance with the instructions of the program code, and the functions of the above-described embodiment are implemented by that processing.

[0057]

[Effects of the Invention] As described above, according to the present invention, speech can be synthesized taking into account both the acoustic information and the non-acoustic information of the phonetic text, so that more natural speech can be output.

[Brief description of the drawings]

FIG. 1 is a diagram showing the general processing flow of a speech synthesizer.

FIG. 2 is a diagram showing the relationship between a kanji-kana mixed sentence and phonetic text.

FIG. 3 is a functional configuration diagram of the apparatus according to the embodiment.

FIG. 4 is a flowchart showing the overall processing in the embodiment.

FIG. 5 is a flowchart of the process of estimating the part-of-speech sequence of an accent phrase in the embodiment.

FIG. 6 is a flowchart of the prosodic parameter generation process in the embodiment.

FIG. 7 is a diagram showing an outline of pitch frequency control for a general sentence.

FIG. 8 is a diagram showing an outline of pitch frequency control in the noun reading mode in the embodiment.

FIG. 9 is a flowchart of the prosodic parameter generation process in the embodiment.

FIG. 10 is a block diagram of the apparatus according to the embodiment.

Claims

1. A speech synthesizer comprising: estimating means for estimating non-acoustic information from phonetic text including acoustic information; and parameter setting means for generating, on the basis of the non-acoustic information obtained by the estimating means and the acoustic information, a prosodic parameter for setting an output form of synthesized speech.

2. The speech synthesizer according to claim 1, wherein said estimating means performs an estimating process for each accent phrase constituting said phonetic text.

3. The speech synthesizer according to claim 1, wherein the phonetic text is composed of a symbol indicating a pronunciation and a symbol indicating a position of an accent.

4. The speech synthesizer according to any one of the preceding claims, wherein said estimating means estimates a word sequence constituting said phonetic text and part-of-speech information of each word constituting said word sequence.

5. The speech synthesizer according to any one of claims 1 to 4, wherein the prosodic parameter includes at least one of a pitch frequency, a phoneme time length, and power.

6. The speech synthesizer according to any one of claims 1 to 5, wherein the parameter setting means controls at least one of a pitch frequency, a phoneme time length, and power of a phoneme sequence corresponding to the phonetic text according to a part of speech of a word constituting the phonetic text.

7. The speech synthesizer according to any one of claims 1 to 6, wherein the parameter setting means controls a natural descent component of a phoneme sequence corresponding to the phonetic text according to a part of speech of a word constituting the phonetic text.

8. The speech synthesizer according to claim 7, wherein the parameter setting means, when the parts of speech of the words constituting the phonetic text form a list of nouns, does not take into account the natural descent component of the phoneme sequence corresponding to the phonetic text.

9. A method of controlling a speech synthesizer, comprising: an estimation step of estimating non-acoustic information from phonetic text including acoustic information; and a parameter setting step of generating, on the basis of the non-acoustic information obtained in the estimation step and the acoustic information, a parameter for setting an output form of synthesized speech.

10. A storage medium storing: program code for an estimation step of estimating non-acoustic information from phonetic text including acoustic information; and program code for a parameter setting step of generating, on the basis of the non-acoustic information obtained in the estimation step and the acoustic information, a parameter for setting an output form of synthesized speech.