JP2007249022A

JP2007249022A - Speech synthesizer and speech synthesizing method

Info

Publication number: JP2007249022A
Application number: JP2006075058A
Authority: JP
Inventors: Yasuo Okuya; 泰夫奥谷
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2006-03-17
Filing date: 2006-03-17
Publication date: 2007-09-27

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer capable of processing a text wherein a rule applied to a variable portion is described. SOLUTION: The speech synthesizer includes an input means for inputting a rule set for speech synthesis and a character string to be substituted in a variable portion of the text, an extraction processing means of extracting a rule identifier from the text, a means of extracting information related to the rule from the character string input from the input means, and a variable portion processing means of selecting the rule to be applied according to the rule set and rule identifier and processing at least the variable portion based upon information related to the selected rule according to the rule. COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a speech synthesizer and a speech synthesis method for processing text that includes control rules.

Devices such as car navigation systems use speech synthesis for response messages to users. When a response message is generated by speech synthesis, it is common to prepare text that includes control tags for reading, accent, and the like in order to avoid errors in language analysis. However, when part of the message is variable, it is difficult to specify the reading and accent of the variable part in advance. Patent Document 1 describes a technique that dynamically generates an accent by applying an accent rule, held in advance by the system, to the variable part.
Patent Document 1: Japanese Patent No. 3171775

However, in Patent Document 1, the user who writes the text cannot specify the rule to be applied to the variable part, so the desired synthesized speech cannot always be generated.

The present invention has been made in view of the above problem, and its object is to provide a speech synthesizer capable of processing text in which a rule to be applied to a variable part is described.

To achieve the above object, a speech synthesizer according to the present invention comprises: acquisition means for acquiring a character string to be substituted into a variable part of a text to be synthesized; extraction processing means for extracting a rule identifier from the text; retrieval means for retrieving information related to a rule from the character string acquired by the acquisition means; and processing means for selecting a rule to be applied based on the rule set and the rule identifier and, in accordance with the selected rule, processing at least the variable part based on the information related to the rule.

In another aspect, a speech synthesizer according to the present invention for achieving the above object comprises: acquisition means for acquiring a character string to be substituted into a variable part of a text to be synthesized; extraction processing means for extracting a rule from the text; retrieval means for retrieving information related to the rule from the character string acquired by the acquisition means; and processing means for processing at least the variable part in accordance with the rule and the information related to the rule.

According to the present invention, the user who writes the text can specify the rule to be applied to the variable part, which makes it easy to generate the synthesized speech the user desires.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

(First Embodiment)
The first embodiment is described using, as an example, a speech synthesizer that provides the answering message of an answering machine. With this speech synthesizer, the user can set a desired character string (usually the user's own name) in the answering message. The character string data entered by the user is assumed here to be a kana-level reading symbol string with accent symbols.

The developer of the answering machine creates a template for the answering message (see FIG. 6(a)). The template, written in speech synthesis symbols, is composed of a combination of three parts: a fixed part, a variable part, and a rule. The fixed part is the fixed portion of the answering message, the variable part is an arbitrary character string entered by the user, and the rule is the identification number of the rule to be applied to that character string.

This embodiment is described using, as an example, pause generation rules for the pauses inserted before and after the variable part.

FIG. 4 shows examples of pause generation rules. Rule 1 is suited to cases where ease of listening is the priority. Rule 2 is suited to cases where the naturalness of the answering message is the priority. Rule 3 is suited to cases where conveying information is the priority. The pause rule set 201 in FIG. 2 stores these three pause generation rules, and the speech synthesizer can switch among them as needed.

The answering machine developer can therefore select the pause generation rule best suited to the target customer segment by writing it in the template. For example, Rule 1 is a good choice when developing an answering machine for elderly users, while Rule 2 is suited to an answering machine for the general public.

FIG. 1 is a block diagram showing the hardware configuration of the speech synthesizer in the first embodiment.

In FIG. 1, reference numeral 101 denotes a control memory, which stores the speech synthesis procedure of this embodiment and the fixed data it requires. Reference numeral 102 denotes a central processing unit, which performs numerical computation, control, and similar processing. Reference numeral 103 denotes a memory, which stores temporary data. Reference numeral 104 denotes an external storage device, which stores texts and rules. Reference numeral 105 denotes an input device, which the user uses to enter data into the apparatus and to instruct operations. Reference numeral 106 denotes an output device, which presents various information to the user under the control of the central processing unit 102. Reference numeral 107 denotes an audio output device, which outputs the synthesized speech. Reference numeral 108 denotes a bus, through which data is exchanged between the devices.

FIG. 2 is a block diagram showing the module configuration of the speech synthesizer in the first embodiment.

In FIG. 2, the pause rule set 201 stores pause generation rules such as those shown in FIG. 4. The text holding unit 202 stores an answering message template such as that shown in FIG. 6(a). The text generated from this template is the target of speech synthesis. The input processing unit 203 accepts the desired character string entered by the user. The character string holding unit 204 holds the character string entered by the user. The extraction processing unit 205 extracts a rule number, which is an identifier that identifies a rule, from the template held by the text holding unit 202. The rule number holding unit 206 holds the extracted rule number. The mora count analysis unit 207 counts the number of morae in the character string held by the character string holding unit 204. The mora count holding unit 208 holds the mora count obtained by the mora count analysis unit 207. The pause generation unit 209 determines the rule to apply from the pause generation rules and the rule number, and generates the pauses before and after the variable part according to the mora count. The prosody processing unit 210 generates prosody from the pause generation result, the text, and the character string. The waveform generation unit 211 generates synthesized speech based on the prosody information. The audio output unit 212 outputs the synthesized speech.

FIG. 3 is a flowchart showing the processing flow of the speech synthesizer in the first embodiment. The control program for executing this flowchart is stored in the control memory 101, the memory 103, the external storage device 104, or the like.

In step S301, the input processing unit 203 detects user input. Processing remains at step S301 until the user enters something. When user input is detected, the input is acquired and held in the character string holding unit 204, and processing proceeds to step S302. Because the character string entered by the user corresponds to the variable used in the variable part of the template, it is referred to as the variable in the following description.

In step S302, the extraction processing unit 205 extracts the rule number from the template held in the text holding unit 202 and holds it in the rule number holding unit 206. Processing then proceeds to step S303.
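The extraction in step S302 can be sketched in Python as follows. This is only an illustration: the actual template notation of FIG. 6(a) is not reproduced in this text, so the `{VAR}` placeholder, the `<RULE n>` tag, and the function names are hypothetical.

```python
import re

# Hypothetical notation: "{VAR}" marks the variable part and "<RULE n>"
# carries the rule number. The real syntax of FIG. 6(a) may differ.
RULE_TAG = re.compile(r"<RULE (\d+)>")


def extract_rule_number(template: str) -> int:
    """Return the rule number written in the template (step S302)."""
    match = RULE_TAG.search(template)
    if match is None:
        raise ValueError("template contains no rule tag")
    return int(match.group(1))


def fill_template(template: str, variable: str) -> str:
    """Drop the rule tag and substitute the user's string for the variable part."""
    return RULE_TAG.sub("", template).replace("{VAR}", variable)


template = "ハ'イ、{VAR}<RULE 1>デ'ス。"
rule_number = extract_rule_number(template)
text = fill_template(template, "タナカ")
```

Keeping the rule tag next to the placeholder mirrors the arrangement described for this embodiment, in which the rule follows the variable part.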

In step S303, the mora count analysis unit 207 obtains the mora count of the variable and holds it in the mora count holding unit 208. Processing then proceeds to step S304.
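A mora count such as the one obtained in step S303 could be computed along the following lines. This Python sketch assumes a katakana reading string in which an apostrophe-like mark denotes the accent and small kana (for example ャ) merge with the preceding kana. The function name and the character sets are illustrative assumptions, not the method described in this document.

```python
# Small kana that attach to the preceding kana and do not form their own mora.
SMALL_KANA = set("ャュョァィゥェォヮ")
# Accent marks that may appear in the reading string (assumed notation).
ACCENT_MARKS = set("'’")


def count_morae(reading: str) -> int:
    """Count the morae in a katakana reading string, ignoring accent marks."""
    count = 0
    for ch in reading:
        if ch in ACCENT_MARKS or ch in SMALL_KANA:
            continue  # not a mora on its own
        count += 1  # ordinary kana, ッ, ン, and ー each count as one mora
    return count


counts = [count_morae(s) for s in ("タナカ", "タナカサ’トシ", "タナカカブシキガ’イシャ")]
```

For the three example strings used in the description, this yields 3, 6, and 10 morae, which fall into the "4 or fewer", "5 or more", and "8 or more" bands respectively.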

In step S304, the pause generation unit 209 selects the pause generation rule to apply, based on the pause rule set 201 and the rule number held by the rule number holding unit 206. It then generates the pauses to be inserted on both sides of the variable (that is, the variable part) based on the mora count held by the mora count holding unit 208, and processing proceeds to step S305.

In step S305, the prosody generation unit 210 generates prosody information from the variable held by the character string holding unit 204, the template held by the text holding unit 202, and the pause information generated by the pause generation unit 209. Processing then proceeds to step S306.

In step S306, the waveform generation unit 211 uses the prosody information to generate synthesized speech corresponding to the text obtained by substituting the character string held by the character string holding unit 204 into the template. Processing then proceeds to step S307. In step S307, the audio output unit 212 outputs the synthesized speech, and processing ends.

FIG. 5 shows, in speech synthesis symbols, the pause generation results when Rule 1 of the pause generation rules in FIG. 4 is applied. FIG. 5(a) is the answering message template. FIG. 5(b) shows the pause generation result when the variable (that is, the character string entered by the user) is "タナカ" (Tanaka). Because the variable has four or fewer morae, a 100 ms pause is inserted at each end of the variable part. FIGS. 5(c) and 5(d) show the pause generation results when the variable is "タナカサ’トシ" (Tanaka Satoshi) and "タナカカブシキガ’イシャ" (Tanaka Kabushiki-gaisha), respectively. Because both variables have five or more morae, a 200 ms pause is inserted before the variable part and a 300 ms pause after it.

FIG. 6, in contrast, shows in speech synthesis symbols the pause generation results when Rule 2 of the pause generation rules in FIG. 4 is applied. FIG. 6(a) is the answering message template. FIG. 6(b) shows the pause generation result when the variable is "タナカ". Because the variable has four or fewer morae, no pause is inserted at either end of the variable part. FIG. 6(c) shows the pause generation result when the variable is "タナカサ’トシ". Because the variable has five or more morae, a 100 ms pause is inserted before the variable part. FIG. 6(d) shows the pause generation result when the variable is "タナカカブシキガ’イシャ". Because the variable has eight or more morae, a 100 ms pause is inserted both before and after the variable part.
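The thresholds described for FIGS. 5 and 6 can be collected into a small lookup table. The following Python sketch is illustrative only. FIG. 4 itself is not reproduced in this text, so the bands for Rules 1 and 2 are inferred from the worked examples above, and Rule 3 is omitted because its contents are not stated. The names `Pause`, `PAUSE_RULES`, and `generate_pause` are hypothetical.

```python
from typing import NamedTuple


class Pause(NamedTuple):
    before_ms: int  # pause inserted before the variable part
    after_ms: int   # pause inserted after the variable part


# rule number -> (minimum mora count, pause) bands in ascending order.
# Values are inferred from the examples described for FIGS. 5 and 6.
PAUSE_RULES = {
    1: [(0, Pause(100, 100)), (5, Pause(200, 300))],
    2: [(0, Pause(0, 0)), (5, Pause(100, 0)), (8, Pause(100, 100))],
}


def generate_pause(rule_number: int, mora_count: int) -> Pause:
    """Pick the pause for the highest band whose threshold is reached."""
    bands = PAUSE_RULES[rule_number]
    selected = bands[0][1]
    for min_morae, pause in bands:
        if mora_count >= min_morae:
            selected = pause
    return selected


rule1_short = generate_pause(1, 3)   # 3-mora variable under Rule 1
rule2_long = generate_pause(2, 10)   # 10-mora variable under Rule 2
```

Storing the rules as data rather than as branching code reflects the point of the embodiment: changing behavior only requires selecting a different rule number, not changing the synthesis algorithm.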

As described above, by providing a speech synthesizer that can process text in which a control rule for the variable part is described, the answering machine developer can write the identification number of a pause generation rule in the answering message template. An answering machine tailored to the target customer segment can therefore be developed simply by rewriting the template, without changing the speech synthesis algorithm itself. In other words, the invention provides a framework in which the developer of a device can select the speech synthesis control suited to that device.

(Second Embodiment)
The first embodiment described the case where the control rule acts only on the variable part. However, the present invention is not limited to this, and the rule may be configured to act on a portion that includes the fixed part.

As in the first embodiment, the second embodiment is described using a speech synthesizer that provides the answering message of an answering machine.

FIG. 9 shows examples of accent combination rules for attached words. Rule 1 states that when a preceding word and a following word are joined, the accent type of the combination follows the accent type of the preceding word. Rules 2 and 3 are rules in which the accent type of the combination differs depending on whether the preceding word is of the flat type or the nucleated type. In general, accent combination for an attached word is realized by applying the rule appropriate to the attached word that follows. When so-called mixed kanji-kana text without control symbols is analyzed using a word dictionary, the number of the accent combination rule is stored as an attribute of each attached word, so appropriate accent combination is possible. However, when a speech synthesis symbol string consisting of accent symbols and phonetic notation is input, no word dictionary is available, so the synthesizer can neither tell whether a word is an attached word nor know which accent combination rule to apply. This embodiment provides a framework for realizing accent combination appropriately for such input.

FIG. 7 is a block diagram showing the module configuration of the speech synthesizer in the second embodiment. In FIG. 7, the text holding unit 202, the input processing unit 203, the character string holding unit 204, the rule number holding unit 206, the waveform generation unit 211, and the audio output unit 212 perform the same processing as in the first embodiment. They are therefore given the same reference numerals as in FIG. 2, and their description is omitted.

The accent combination rule set 701 stores accent combination rules such as those shown in FIG. 9. The extraction processing unit 205 extracts the rule number from the template held by the text holding unit 202, and also extracts the fixed part to which the rule is applied together with the variable part. The fixed part holding unit 705 holds the fixed part involved in the rule. The accent analysis unit 702 analyzes the accent of the character string held by the character string holding unit 204. The accent holding unit 703 holds the accent information obtained by the accent analysis unit 702. The accent combination unit 704 determines the rule to apply from the accent combination rules and the rule number, and generates the combined accent of the variable part and the fixed part held by the fixed part holding unit 705. The prosody generation unit 210 generates prosody information from the accent combination result, the text, and the character string.

FIG. 8 is a flowchart showing the processing flow of the speech synthesizer in the second embodiment. In step S801, the input processing unit 203 detects user input. Processing remains at step S801 until the user enters something. When user input is detected, the input is held in the character string holding unit 204, and processing proceeds to step S802. Because the character string entered by the user corresponds to the variable used in the variable part of the template, it is referred to as the variable in the following description.

In step S802, the extraction processing unit 205 extracts the rule number from the template held in the text holding unit 202 and holds it in the rule number holding unit 206. It also extracts the fixed part involved in the rule and holds it in the fixed part holding unit 705. Processing then proceeds to step S803. In this embodiment, the fixed part involved in the rule is the portion of the text located between the variable part and the rule.
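As a rough Python illustration of step S802, the fixed part can be taken as whatever lies between the variable part and the rule. The `{VAR}` and `<RULE n>` notation is hypothetical (the template of FIG. 10(a) is not reproduced in this text), as is the function name.

```python
import re

# Hypothetical notation: the fixed part involved in the rule sits between
# the "{VAR}" placeholder and the "<RULE n>" tag.
VAR_FIXED_RULE = re.compile(r"\{VAR\}(?P<fixed>[^<{]*)<RULE (?P<rule>\d+)>")


def extract_rule_and_fixed_part(template: str) -> tuple[int, str]:
    """Return (rule number, fixed part involved in the rule)."""
    match = VAR_FIXED_RULE.search(template)
    if match is None:
        raise ValueError("no variable part followed by a rule tag")
    return int(match.group("rule")), match.group("fixed")


rule_number, fixed_part = extract_rule_and_fixed_part("{VAR}デ'ス<RULE 2>。")
```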

In step S803, the accent analysis unit 702 obtains the accent information of the variable and holds it in the accent holding unit 703. Processing then proceeds to step S804. The accent information obtained by the accent analysis unit 702 includes the accent type and the mora count. It may also include information on whether the accent type is nucleated or flat.
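The accent information named in step S803 could be derived from the reading string roughly as follows. This Python sketch assumes that the accent mark immediately follows the mora carrying the accent nucleus (as in "コ’モリ") and that a string without a mark is flat. The `AccentInfo` type and `analyze_accent` function are hypothetical names.

```python
from dataclasses import dataclass

SMALL_KANA = set("ャュョァィゥェォヮ")  # merge with the preceding kana
ACCENT_MARKS = set("'’")               # assumed accent notation


@dataclass(frozen=True)
class AccentInfo:
    mora_count: int
    accent_type: int  # position of the accent nucleus in morae; 0 means flat

    @property
    def is_flat(self) -> bool:
        return self.accent_type == 0


def analyze_accent(reading: str) -> AccentInfo:
    """Derive the mora count and accent type from an accent-marked reading."""
    morae = 0
    accent_type = 0
    for ch in reading:
        if ch in ACCENT_MARKS:
            accent_type = morae  # nucleus is the mora just before the mark
        elif ch not in SMALL_KANA:
            morae += 1
    return AccentInfo(mora_count=morae, accent_type=accent_type)


yamada = analyze_accent("ヤマダ")    # flat
komori = analyze_accent("コ’モリ")   # nucleus on the first mora
```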

In step S804, the accent combination unit 704 determines the accent combination rule to apply, based on the accent combination rule set 701 and the rule number held by the rule number holding unit 206. It then combines the accents of the variable part (that is, the variable) and the fixed part, using the accent information held by the accent holding unit 703 and the fixed part held by the fixed part holding unit 705. Processing then proceeds to step S805.

In step S805, the prosody generation unit 210 generates prosody information from the variable held by the character string holding unit 204, the template held by the text holding unit 202, and the combined accent information generated by the accent combination unit 704. Processing then proceeds to step S806.

In step S806, the waveform generation unit 211 generates synthesized speech based on the prosody information, and processing proceeds to step S807. In step S807, the audio output unit 212 outputs the synthesized speech, and processing ends.

FIG. 10 shows, in speech synthesis symbols, the accent combination results when Rule 2 of the accent combination rules in FIG. 9 is applied. FIG. 10(a) is the answering message template. FIG. 10(b) shows the accent combination result when the variable (that is, the character string entered by the user) is "ヤマダ" (Yamada). Because the accent of the variable is of the flat type, the position of the accent nucleus is determined by the accent type of the following word "デ’ス" (desu). FIG. 10(c), in contrast, shows the accent combination result when the variable is "コ’モリ" (Komori). Because the accent of the variable is of the nucleated type, the combined phrase inherits the accent nucleus position of the variable unchanged.
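The behavior described for Rules 1 and 2 can be sketched in Python as below. The accent is represented as the position of the accent nucleus counted in morae from the start of the phrase, with 0 meaning flat. This representation and the function name are assumptions. Rule 3 is left out because its details are not given in this text.

```python
def combine_accent(rule_number: int, front_morae: int,
                   front_accent: int, attached_accent: int) -> int:
    """Return the nucleus position of the combined phrase (0 means flat).

    front_morae     -- mora count of the preceding word (the variable)
    front_accent    -- nucleus position of the preceding word, 0 if flat
    attached_accent -- nucleus position within the following word, 0 if flat
    """
    if rule_number == 1:
        # Rule 1: the combination follows the preceding word's accent.
        return front_accent
    if rule_number == 2:
        if front_accent == 0:
            # Flat preceding word: follow the following word's accent,
            # shifted by the length of the preceding word.
            return front_morae + attached_accent if attached_accent else 0
        # Nucleated preceding word: keep its nucleus position.
        return front_accent
    raise ValueError(f"rule {rule_number} is not covered by this sketch")


yamada_desu = combine_accent(2, 3, 0, 1)  # "ヤマダ" + "デ’ス"
komori_desu = combine_accent(2, 3, 1, 1)  # "コ’モリ" + "デ’ス"
```

Under these assumptions, "ヤマダ" plus "デ’ス" places the nucleus on the fourth mora (デ), while "コ’モリ" plus "デ’ス" keeps it on the first mora, matching the two cases described for FIG. 10.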

As described above, by providing a speech synthesizer that processes text in which an accent combination rule is described, the second embodiment makes it possible to appropriately control accent combination that involves not only the variable part but also the fixed part.

(Third Embodiment)
The first and second embodiments described cases where the rules are stored in advance, as in the pause rule set 201 or the accent combination rule set 701. However, the present invention is not limited to this, and the rule itself may be written in the text.

As in the second embodiment, the third embodiment is described using accent combination rules as an example.

FIG. 11 is a block diagram showing the module configuration of the speech synthesizer in the third embodiment.

FIG. 11 differs from FIG. 7 in two respects: there is no accent combination rule set 701, and the rule number holding unit 206 is replaced by a rule holding unit 1101.

The extraction processing unit 205 extracts the accent combination rule from the template held by the text holding unit 202, and also extracts the fixed part to which the accent combination rule is applied together with the variable part. The rule holding unit 1101 holds the accent combination rule that the extraction processing unit 205 extracted from the text. The accent combination unit 704 generates the combined accent of the variable part and the fixed part held by the fixed part holding unit 705, in accordance with the accent combination rule. The prosody generation unit 210 generates prosody information from the accent combination result, the text, and the character string.

FIG. 12 is a flowchart showing the processing flow of the speech synthesizer in the third embodiment. In step S1201, the input processing unit 203 detects user input. Processing remains at step S1201 until the user enters something. When user input is detected, the input is held in the character string holding unit 204, and processing proceeds to step S1202. Because the character string entered by the user corresponds to the variable used in the variable part of the template, it is referred to as the variable in the following description.

In step S1202, the extraction processing unit 205 extracts the accent combination rule from the template held in the text holding unit 202 and holds it in the rule holding unit 1101, which takes the place of the rule number holding unit 206 in this embodiment. It also extracts the fixed part involved in the accent combination rule and holds it in the fixed part holding unit 705. Processing then proceeds to step S1203. In this embodiment, the fixed part involved in the accent combination rule is the portion of the text located between the variable part and the accent combination rule.

In step S1203, the accent analysis unit 702 obtains the accent information of the variable and holds it in the accent holding unit 703. Processing then proceeds to step S1204. The accent information obtained by the accent analysis unit 702 includes the accent type and the mora count. It may also include information on whether the accent type is nucleated or flat.

In step S1204, the accent combination unit 704 combines the accents of the variable part (that is, the variable) and the fixed part in accordance with the accent combination rule held by the rule holding unit 1101, using the accent information held by the accent holding unit 703 and the fixed part held by the fixed part holding unit 705. Processing then proceeds to step S1205.

In step S1205, the prosody generation unit 210 generates prosody information from the variable held by the character string holding unit 204, the template held by the text holding unit 202, and the combined accent information generated by the accent combination unit 704. Processing then proceeds to step S1206.

In step S1206, the waveform generation unit 211 generates synthesized speech based on the prosody information, and processing proceeds to step S1207. In step S1207, the audio output unit 212 outputs the synthesized speech, and processing ends.

FIG. 13 shows an example in which Rule 2 of the accent combination rules in FIG. 9 is written directly in the text.

As described above, by providing a speech synthesizer that processes text in which an accent combination rule is written directly, the third embodiment makes it possible to control accent combination appropriately without being limited to accent combination rules prepared in advance.

(Other Embodiments)
The first to third embodiments described the processing of the speech synthesizer when rules for pause generation and for accent combination of attached words are written in the text. However, the present invention is not limited to these. Rules concerning reading, other kinds of accent generation, prosody generation, and waveform generation may also be written.

The first embodiment described an example in which the rule is written immediately after the variable part. However, the present invention is not limited to this, and the rule may be placed at any position as long as its relationship to the variable part is defined.

The second and third embodiments described the case where the fixed part involved in the rule is placed between the variable part and the rule, but the present invention is not limited to this. These elements may be placed at any position as long as the relationship among the fixed part involved in the rule, the variable part, and the rule is defined.

Furthermore, the description format of the text is not limited to the formats described in the embodiments, and may be any format as long as it can be parsed.
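One way to make rule placement independent of position is to have each rule tag name the variable part it applies to. The following is a minimal sketch under that assumption; the `{var:NAME}` and `{rule:ID for=NAME}` notation, the function names, and the sample sentence are hypothetical and are not the notation used in the embodiments.

```python
import re

# Hypothetical markup (an assumption, not the patent's notation):
#   {var:NAME}           a variable part
#   {rule:ID for=NAME}   a rule identifier tied to a variable part by name
RULE = re.compile(r"\{rule:(\d+) for=(\w+)\}")
VAR = re.compile(r"\{var:(\w+)\}")


def extract_rules(template):
    """Return (template without rule tags, {variable name: rule id})."""
    rules = {name: int(rule_id) for rule_id, name in RULE.findall(template)}
    stripped = RULE.sub("", template)
    return stripped, rules


def fill(template, values):
    """Substitute each variable part with the supplied character string."""
    return VAR.sub(lambda m: values[m.group(1)], template)


# The rule tag can follow the variable part or sit elsewhere in the text.
after = "Next stop is {var:station}{rule:2 for=station}."
before = "{rule:2 for=station}Next stop is {var:station}."
```

Because the tag carries the variable name, `extract_rules(after)` and `extract_rules(before)` yield the same stripped template and the same rule mapping, so downstream processing does not depend on where the rule was written.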

The object of the present invention can also be achieved as follows. A storage medium on which the program code of software that realizes the functions of the above-described embodiments is recorded is supplied to a system or apparatus. The computer (or CPU or MPU) of that system or apparatus then reads and executes the program code stored in the storage medium. It goes without saying that the object is also achieved in this way.

In this case, the program code itself read from the storage medium realizes the functions of the above-described embodiments, and the storage medium storing the program code constitutes the present invention.

As a storage medium for supplying the program code, for example, a flexible disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a magnetic tape, a nonvolatile memory card, or a ROM can be used.

Further, the embodiments of the present invention are not limited to the case where the functions of the above-described embodiments are realized by the computer executing the program code it has read. For example, it goes without saying that the case is also included where an OS (operating system) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are realized by that processing.

Furthermore, the functions of the embodiments of the present invention can also be realized as follows. The program code read from the storage medium is written into a memory provided in a function expansion board inserted into the computer or in a function expansion unit connected to the computer. Then, based on the instructions of the program code, a CPU or the like provided in the function expansion board or function expansion unit performs part or all of the actual processing. It goes without saying that the functions of the above-described embodiments are realized by this processing.

FIG. 1 is a block diagram showing the hardware configuration of the speech synthesizer in the first embodiment.
FIG. 2 is a block diagram showing the module configuration of the speech synthesizer in the first embodiment.
FIG. 3 is a flowchart showing the flow of processing of the speech synthesizer in the first embodiment.
FIG. 4 is a diagram showing an example of pause generation rules.
FIG. 5 is a diagram showing the pause generation result when rule 1 of the pause generation rules is applied.
FIG. 6 is a diagram showing the pause generation result when rule 2 of the pause generation rules is applied.
FIG. 7 is a block diagram showing the module configuration of the speech synthesizer in the second embodiment.
FIG. 8 is a flowchart showing the flow of processing of the speech synthesizer in the second embodiment.
FIG. 9 is a diagram showing an example of accent combination rules.
FIG. 10 is a diagram showing the accent combination result when rule 2 of the accent combination rules is applied.
FIG. 11 is a block diagram showing the module configuration of the speech synthesizer in the third embodiment.
FIG. 12 is a flowchart showing the flow of processing of the speech synthesizer in the third embodiment.
FIG. 13 is a diagram showing an example of text in which an accent combination rule is described directly.

Explanation of symbols

101 Control memory (ROM)
102 Central processing unit
103 Memory (RAM)
104 External storage device
105 Input device
106 Output device
107 Voice output device
108 Bus
201 Pause rule set
202 Text holding unit
203 Input processing unit
204 Character string holding unit
205 Extraction processing unit
206 Rule number holding unit
207 Mora count analysis unit
208 Mora count holding unit
209 Pause generation unit
210 Prosody generation unit
211 Waveform generation unit
212 Voice output unit
701 Accent combination rule set
702 Accent analysis unit
703 Accent holding unit
704 Accent combining unit
705 Fixed part holding unit
1101 Rule holding unit

Claims

Acquisition means for acquiring a character string to be substituted into the variable part of the text to be synthesized;
Extraction processing means for extracting a rule identifier from the text;
Extraction means for extracting information related to the rule from the character string acquired by the acquisition means;
Processing means for selecting a rule to be applied from the rule set and the rule identifier, and processing at least a variable part based on information related to the rule according to the selected rule;
A speech synthesizer comprising:

The extraction processing means extracts the rule identifier and a fixed part involved in the rule,
The processing means processes the variable part and the fixed part involved in the rule according to information related to the rule and the selected rule.
The speech synthesizer according to claim 1.

Acquisition means for acquiring a character string to be substituted into the variable part of the text to be synthesized;
Extraction processing means for extracting rules from the text;
Extraction means for extracting information related to the rule from the character string acquired by the acquisition means;
Processing means for processing at least the variable part according to the information related to the rule and the rule;
A speech synthesizer comprising:

An acquisition step of acquiring a character string to be substituted into the variable part of the text to be synthesized;
An extraction processing step for extracting a rule identifier from the text;
An extraction step of extracting information related to the rule from the character string acquired in the acquisition step;
A processing step of selecting a rule to be applied from a rule set for speech synthesis and the rule identifier, and processing at least a variable part based on information related to the rule according to the selected rule;
A speech synthesis method comprising:

The extraction processing step extracts the rule identifier and a fixed part involved in the rule,
The processing step processes the variable part and the fixed part involved in the rule according to the information related to the rule and the selected rule.
The speech synthesis method according to claim 4.

An acquisition step of acquiring a character string to be substituted into the variable part of the text to be synthesized;
An extraction processing step for extracting rules from the text;
An extraction step of extracting information related to the rule from the character string acquired in the acquisition step;
A processing step of processing at least the variable part according to the information related to the rule and the rule;
A speech synthesis method comprising:

A control program for causing a computer to execute the speech synthesis method according to claim 4.