JP2012073338A

JP2012073338A - Speech synthesizer and speech synthesis method

Info

Publication number: JP2012073338A
Application number: JP2010217039A
Authority: JP
Inventors: Takuya Noda; 拓也野田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2010-09-28
Filing date: 2010-09-28
Publication date: 2012-04-12

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer capable of reducing the time required to generate synthesized speech even when the synthesized speech is generated under any one of a plurality of mutually different synthesis conditions for a given original text. SOLUTION: A speech synthesizer 1 includes a match determination unit 10 and an automatic update unit 17. The match determination unit 10 determines whether the pair of input text information (or input phonetic information) and an input synthesis condition matches any of at least one registered pair of registered text information (or registered phonetic information) and a registered synthesis condition. When the input pair matches none of the registered pairs, the automatic update unit 17 generates a pair of derivative text information and a derivative synthesis condition by correcting the input text information or the input synthesis condition. It then makes a storage unit 3 store that derivative pair, together with the derivative intermediate information that a speech synthesis unit 11 generates partway through creating a synthesized speech signal from the derivative pair, as a new registered pair with its registered intermediate information.

Description

The present invention relates to, for example, a speech synthesizer and a speech synthesis method for synthesizing a speech signal.

In recent years, speech synthesis technology for automatically synthesizing speech has been developed. Because it can produce the desired speech in a short time, some applications that previously used speech pre-recorded by professional narrators have adopted it. It is particularly useful in applications where the information provided is updated at short intervals, such as guidance announcements in commercial facilities, highway radio, highway telephone services, or weather forecast broadcasts. In such applications, however, a device that creates synthesized speech must produce high-quality speech, so that listeners can easily understand the message, and must be able to create a large amount of synthesized speech in a short time.

To address this, in one known technique, a speech synthesizer selectively accumulates the original text information, the phoneme string information, or the speech waveform information for speech it has already created, depending on how frequently output of that original text is requested. When information about the speech to be output has already been accumulated, the synthesizer uses that information to synthesize the speech waveform, thereby shortening the time required for speech synthesis (see, for example, Patent Document 1).

Patent Document 1: Japanese Patent Laid-Open No. 5-19790

In the conventional technique described above, the speech synthesizer stores, for each original text, only one kind of intermediate information, such as phoneme information, created under one specific synthesis condition such as a particular speech rate or pitch. However, a speech synthesizer may need to create output speech for the same original text at different speech rates or different pitches (that is, different voice heights). For example, for announcements in a commercial facility, the speech rate in an emergency (such as a fire) is preferably faster than the normal speech rate. If the speech synthesizer reuses intermediate information, such as phoneme information, created for an original text under one synthesis condition to create output speech for that text under another synthesis condition, the prosody of the newly generated synthesized speech becomes unnatural, and its quality deteriorates. Likewise, applying speech-rate conversion directly to already-created synthesized speech yields speech of lower quality than the original.
In addition, when synthesized speech must be created under a synthesis condition different from the one used to create the stored intermediate information, a speech synthesizer that ignores the intermediate information and recreates the synthesized speech from the original text cannot shorten the time required for speech synthesis.

Accordingly, an object of this specification is to provide a speech synthesizer and a speech synthesis method that can shorten the time required to create synthesized speech even when the synthesized speech is created under any one of a plurality of mutually different synthesis conditions for a given original text.

According to one embodiment, a speech synthesizer is provided. The speech synthesizer includes: an input unit that acquires input text information, which includes the original text from which a synthesized speech signal is created, and an input synthesis condition for creating the synthesized speech signal; a storage unit that stores at least one set consisting of a pair of registered text information and a registered synthesis condition together with registered intermediate information, the registered intermediate information including registered phonetic information and being generated at an intermediate stage of creating the synthesized speech signal corresponding to that pair; a match determination unit that determines whether the pair of the input text information (or the input phonetic information represented by the input text information) and the input synthesis condition matches any pair of registered text information (or registered phonetic information) and a registered synthesis condition; a speech synthesis unit that, when the input pair matches a registered pair, creates the synthesized speech signal using the registered intermediate information corresponding to that registered pair, and otherwise creates the synthesized speech signal from the input text information and the input synthesis condition; and an automatic update unit that, when the input pair matches no registered pair, creates a pair of derivative text information and a derivative synthesis condition by modifying the input text information or the input synthesis condition, and causes the storage unit to store that derivative pair, together with the derivative intermediate information produced when the speech synthesis unit carries out synthesis from the derivative pair up to an intermediate stage, as one of the sets of a registered pair and registered intermediate information.
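The overall flow of match determination, reuse, and automatic update can be pictured with the minimal Python sketch below. It is an illustration under stated assumptions, not the claimed implementation: the `Conditions` fields, the stand-in `make_intermediate` function, and the rule in `derive_conditions` (derivative speech rates of 0.8x and 1.2x the input) are all hypothetical, and the sketch also registers the input pair itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conditions:
    """Hypothetical synthesis conditions; frozen so they can be part of a dict key."""
    speech_rate: float  # 1.0 = normal speed
    pitch: float        # 1.0 = normal pitch


def make_intermediate(text: str, cond: Conditions) -> str:
    """Stand-in for running the synthesis pipeline up to an intermediate stage."""
    return f"phonetic({text})@rate={cond.speech_rate},pitch={cond.pitch}"


def derive_conditions(cond: Conditions) -> list[Conditions]:
    """Assumed derivation rule: prepare slower and faster variants of the input."""
    return [Conditions(round(cond.speech_rate * f, 2), cond.pitch) for f in (0.8, 1.2)]


class Synthesizer:
    def __init__(self) -> None:
        # Registered (text, conditions) pairs -> registered intermediate information.
        self.table: dict[tuple[str, Conditions], str] = {}

    def synthesize(self, text: str, cond: Conditions) -> tuple[str, bool]:
        """Return (intermediate information used, True if it came from the table)."""
        key = (text, cond)
        if key in self.table:                  # match: reuse registered information
            return self.table[key], True
        info = make_intermediate(text, cond)   # no match: synthesize from the input
        self.table[key] = info                 # assumption: input pair registered too
        for derived in derive_conditions(cond):    # automatic update: derivative pairs
            self.table.setdefault((text, derived), make_intermediate(text, derived))
        return info, False
```

In use, a first request at normal speed misses the table, but a later emergency request for the same text at 1.2x speed hits one of the derivative entries and skips the earlier pipeline stages.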

According to another embodiment, a speech synthesis method is provided. The method includes: acquiring input text information, which includes the original text from which a synthesized speech signal is created, and an input synthesis condition for creating the synthesized speech signal; determining whether the pair of the input text information (or the input phonetic information represented by it) and the input synthesis condition matches any of at least one pair of registered text information (or registered phonetic information) and a registered synthesis condition stored in a storage unit; when the input pair matches a registered pair, creating the synthesized speech signal using registered intermediate information that is stored in the storage unit and was generated at an intermediate stage of creating the synthesized speech signal corresponding to that registered pair; when the input pair matches no registered pair, creating the synthesized speech signal from the input text information and the input synthesis condition; and, when the input pair matches no registered pair, creating a pair of derivative text information and a derivative synthesis condition by modifying the input text information or the input synthesis condition, and storing in the storage unit that derivative pair, together with derivative intermediate information produced partway through creating a synthesized speech signal from it, as one of the sets of a registered pair and registered intermediate information.

The objects and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It should be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention as claimed.

The speech synthesizer and speech synthesis method disclosed in this specification can shorten the time required to create synthesized speech even when the synthesized speech is created under any one of a plurality of mutually different synthesis conditions for a given original text.

FIG. 1 is a schematic configuration diagram of a speech synthesizer according to a first embodiment.
FIG. 2 is a schematic configuration diagram of a processing unit included in the speech synthesizer according to the first embodiment.
FIG. 3 is a diagram showing an example of an intermediate information table.
FIG. 4 is a diagram showing another example of the intermediate information table.
FIG. 5 is an operation flowchart of speech synthesis processing according to the first embodiment.
FIG. 6 is a schematic configuration diagram of a processing unit of a speech synthesizer according to a second embodiment.
FIG. 7 is a diagram showing an example of an intermediate information table according to the second embodiment.
FIG. 8 is an operation flowchart of processing for determining whether registered intermediate information is retained.
FIG. 9 is a schematic configuration diagram of a processing unit of a speech synthesizer according to a third embodiment.
FIG. 10 is a diagram showing another example of the intermediate information table according to a modification.

Speech synthesizers according to various embodiments are described below with reference to the drawings.
When such a speech synthesizer newly creates synthesized speech for a given original text under a specific synthesis condition, it also creates and stores intermediate information for other synthesis conditions, thereby shortening the time required to create synthesized speech for that original text under various synthesis conditions.

FIG. 1 is a schematic configuration diagram of a speech synthesizer according to one embodiment. In this embodiment, the speech synthesizer 1 includes an input unit 2, a storage unit 3, a processing unit 4, and an output unit 5.

The input unit 2 acquires text information containing the original text of the synthesized speech, and synthesis parameters that define speech synthesis conditions such as the speech rate, the pitch, and the range between high and low voice pitch. For this purpose, the input unit 2 includes, for example, a keyboard. The input unit 2 also has a pointing device such as a mouse, together with a display that shows the characters or numerical values to be entered as indicated with the pointing device. Alternatively, the input unit 2 may include a touch panel display.
Furthermore, the input unit 2 may acquire the text information and synthesis parameters from another device connected to the speech synthesizer 1 via a communication network. In this case, the input unit 2 includes an interface circuit for connecting the speech synthesizer 1 to the communication network.
The input unit 2 then passes the input text information and synthesis parameters to the processing unit 4.
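Because a request is later matched as a (text, parameters) pair, it helps if the parameters form an immutable, hashable value. The Python sketch below is hypothetical: the `SynthesisParams` field names, the accepted range [0.5, 2.0], and the pipe-separated network message format are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisParams:
    """Hypothetical container for the synthesis parameters named above."""
    speech_rate: float = 1.0  # relative speaking speed (1.0 = normal)
    pitch: float = 1.0        # relative voice pitch (1.0 = normal)
    pitch_range: float = 1.0  # relative width between high and low pitch

    def __post_init__(self) -> None:
        # Assumed validity limits; an actual device would define its own.
        for name in ("speech_rate", "pitch", "pitch_range"):
            value = getattr(self, name)
            if not 0.5 <= value <= 2.0:
                raise ValueError(f"{name}={value} is outside [0.5, 2.0]")


def parse_request(message: str) -> tuple[str, SynthesisParams]:
    """Parse a hypothetical 'text|rate|pitch|range' message received over a network."""
    text, rate, pitch, pitch_range = message.rsplit("|", 3)
    return text, SynthesisParams(float(rate), float(pitch), float(pitch_range))
```

Splitting from the right keeps any `|` characters inside the text itself, and freezing the dataclass makes equal parameter sets compare and hash equally.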

The storage unit 3 includes, for example, at least one of a semiconductor memory circuit, a magnetic storage device, and an optical storage device. The storage unit 3 stores various computer programs used by the processing unit 4 and various data used in speech synthesis processing.
As data used in speech synthesis processing, the storage unit 3 stores, for example, a language dictionary, prosody models, and a speech waveform dictionary. The storage unit 3 further stores an intermediate information table in which intermediate information, such as phonetic information or waveform generation information generated at an intermediate stage of creating a synthesized speech signal, is registered. The language dictionary, prosody models, speech waveform dictionary, intermediate information, and intermediate information table are described in detail later.
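One possible in-memory layout for the intermediate information table is sketched below in Python. The row fields (`entry_id`, `text`, `params`, `phonetic`, `waveform_info`) and the parameter tuple layout are assumptions for illustration; the actual table contents are only described later in the specification. The key point shown is that one original text can own several rows that differ only in their synthesis parameters.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TableEntry:
    """One row of a hypothetical intermediate information table."""
    entry_id: int
    text: str                                  # registered text information
    params: tuple[float, float, float]         # (speech_rate, pitch, pitch_range), assumed
    phonetic: str                              # registered phonetic information
    waveform_info: Optional[list[int]] = None  # selected waveform IDs, if stored


@dataclass
class IntermediateTable:
    entries: dict[int, TableEntry] = field(default_factory=dict)
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1), repr=False)

    def register(self, text, params, phonetic, waveform_info=None) -> int:
        """Add an entry and return its identification number."""
        entry_id = next(self._ids)
        self.entries[entry_id] = TableEntry(entry_id, text, params, phonetic, waveform_info)
        return entry_id

    def for_text(self, text: str) -> list[TableEntry]:
        """All registered entries for one original text, across conditions."""
        return [e for e in self.entries.values() if e.text == text]
```

Identification numbers come from a counter rather than the row count, so they stay unique even if rows are later deleted.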

The output unit 5 outputs the synthesized speech signal received from the processing unit 4 to a speaker 6. For this purpose, the output unit 5 includes, for example, an audio interface circuit for connecting the speaker 6 to the speech synthesizer 1.
The output unit 5 may also output the synthesized speech signal to another device connected to the speech synthesizer 1 via a communication network. In this case, the output unit 5 includes an interface circuit for connecting the speech synthesizer 1 to the communication network. When the input unit 2 also acquires the text information and synthesis parameters via the communication network, the input unit 2 and the output unit 5 may be integrated.

The processing unit 4 includes one or more processors, a memory circuit, and peripheral circuits. The processing unit 4 creates a synthesized speech signal based on the original text contained in the input text information and the synthesis parameters, or based on intermediate information corresponding to the pair of that original text and those synthesis parameters. For this purpose, the processing unit 4 includes a match determination unit 10, a speech synthesis unit 11, a control unit 16, and an automatic update unit 17.
Each of these units of the processing unit 4 is, for example, a functional module implemented by a computer program running on a processor of the processing unit 4. Alternatively, the units may each be mounted in the speech synthesizer 1 as separate circuits, or they may be mounted in the speech synthesizer 1 as a single integrated circuit that implements the functions of all the units.

The match determination unit 10 determines whether the pair of the text information input via the input unit 2 (or the phonetic information represented by that text information) and the synthesis parameters matches any pair of text information (or phonetic information) and synthesis parameters registered in the intermediate information table. If it does, intermediate information corresponding to the input pair of text information and synthesis parameters is also registered in the intermediate information table. So that this already-created intermediate information can be used to generate the synthesized speech signal, the match determination unit 10 passes to the control unit 16 the identification number of the registered pair of text information (or phonetic information) and synthesis parameters that matches the input pair.
For convenience, the text information, synthesis parameters, and intermediate information registered in the intermediate information table are hereinafter referred to as registered text information, registered synthesis parameters, and registered intermediate information, respectively.
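The match determination can be sketched as a linear scan that requires the synthesis parameters to match exactly and accepts either the same text or the same phonetic information. In the Python sketch below, the row layout and the example sentences are hypothetical. Matching on phonetic information lets differently written texts with the same reading, such as "10時" and "十時", share one registered entry.

```python
from typing import Optional

# Hypothetical registered rows: (entry_id, text, phonetic, params).
Row = tuple[int, str, str, tuple[float, ...]]


def find_match(table: list[Row], text: str, phonetic: Optional[str],
               params: tuple[float, ...]) -> Optional[int]:
    """Return the ID of the registered row whose text (or phonetic information)
    and synthesis parameters both match the input, or None if none matches."""
    for entry_id, reg_text, reg_phonetic, reg_params in table:
        if reg_params != params:
            continue  # synthesis parameters must match exactly
        if reg_text == text or (phonetic is not None and reg_phonetic == phonetic):
            return entry_id
    return None
```

A returned ID would be handed to the control unit 16; `None` corresponds to the no-match path described next.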

On the other hand, if the input pair of text information and synthesis parameters matches none of the registered pairs, the speech synthesizer 1 creates the synthesized speech signal from the input pair. The match determination unit 10 therefore passes the input text information and synthesis parameters to the speech synthesis unit 11.
Furthermore, text information whose original text partly differs from the input original text, or synthesis parameters of which at least one differs from the input synthesis parameters, may be input later. To create intermediate information usable for creating the synthesized speech signal in such cases, the match determination unit 10 also passes the input text information and synthesis parameters to the automatic update unit 17.

The speech synthesis unit 11 creates a synthesized speech signal based on the input pair of text information and synthesis parameters, or based on intermediate information stored in the storage unit 3. The speech synthesis unit 11 also creates intermediate information in response to instructions from the automatic update unit 17. For this purpose, the speech synthesis unit 11 includes a language processing unit 12, a prosody generation unit 13, a segment selection unit 14, and a waveform generation unit 15.

The language processing unit 12 converts the input text information into phonetic information. Phonetic information represents, for example, the reading of the original text contained in the text information; for instance, it expresses the reading of the original text in katakana and adds accent positions and break positions.
To convert the input text information into phonetic information, the language processing unit 12 reads the language dictionary stored in the storage unit 3. The language dictionary registers, for example, various words expected to appear in text information, together with their readings, parts of speech, and conjugated forms. Using the language dictionary, the language processing unit 12 performs morphological analysis on the original text contained in the text information and determines the order and reading of each word appearing in the original text, the accent positions, and the break positions. In doing so, the language processing unit 12 treats, for example, the positions of punctuation marks in the original text as break positions. In this specification, a unit of text obtained by dividing the original text at punctuation marks is called a breath group.
For the morphological analysis, the language processing unit 12 can use, for example, a method based on dynamic programming or a hidden Markov model. The language processing unit 12 then creates the phonetic information from the order, readings, accent positions, and break positions of the words.
The language processing unit 12 outputs the phonetic information to the prosody generation unit 13. The language processing unit 12 may also output the phonetic information to a division processing unit 18 of the automatic update unit 17 in order to generate intermediate information.
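The dynamic-programming variant of morphological analysis can be illustrated by a minimum-cost segmentation over a dictionary, followed by splitting at punctuation into breath groups. This Python sketch is a simplification under stated assumptions: the toy `DICTIONARY`, its word costs, and the 4-character maximum word length are hypothetical, accent positions are omitted, and "/" is used as an assumed break marker.

```python
import math

# Toy dictionary (assumption): surface form -> (katakana reading, word cost).
DICTIONARY = {
    "本日": ("ホンジツ", 1.0),
    "本": ("ホン", 2.0),
    "日": ("ヒ", 2.0),
    "は": ("ワ", 1.0),
    "晴れ": ("ハレ", 1.0),
    "です": ("デス", 1.0),
}


def segment(phrase: str) -> list[str]:
    """Minimum-cost segmentation of `phrase` into dictionary words (DP)."""
    n = len(phrase)
    best = [math.inf] * (n + 1)  # best[i]: minimum cost of segmenting phrase[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last word ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - 4), end):  # words up to 4 chars (assumed)
            word = phrase[start:end]
            if word in DICTIONARY and best[start] + DICTIONARY[word][1] < best[end]:
                best[end] = best[start] + DICTIONARY[word][1]
                back[end] = start
    if math.isinf(best[n]):
        raise ValueError(f"cannot segment {phrase!r}")
    words, end = [], n
    while end > 0:               # follow back-pointers from the end of the phrase
        words.append(phrase[back[end]:end])
        end = back[end]
    return words[::-1]


def to_phonetic(text: str) -> str:
    """Split at punctuation into breath groups and join readings with '/' breaks."""
    groups = [g for g in text.replace("。", "、").split("、") if g]
    return "/".join("".join(DICTIONARY[w][0] for w in segment(g)) for g in groups)
```

Here "本日" (cost 1) beats "本" + "日" (cost 4), so `to_phonetic("本日は、晴れです。")` yields `"ホンジツワ/ハレデス"`.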

The prosody generation unit 13 generates the target prosody for the synthesized speech based on the synthesis parameters received from the match determination unit 10 and the phonetic information received from the language processing unit 12. For this purpose, the prosody generation unit 13 reads a plurality of prosody models from the storage unit 3. Each prosody model represents, in time order, the positions where the voice rises and the positions where it falls. From the plurality of prosody models, the prosody generation unit 13 selects the one that best matches the accent positions and other features indicated by the phonetic information. Then, in accordance with the selected prosody model and the synthesis parameters, the prosody generation unit 13 creates the target prosody by setting, for the phonetic information, the positions where the voice rises or falls, the intonation, the pitch, and so on. For each phoneme, which is the unit that determines the speech waveform, the target prosody includes the phoneme length and the pitch frequency. A phoneme may be, for example, a single vowel or a single consonant. In this embodiment, a syllable combining at least one vowel and at least one consonant is also treated as a phoneme.
The prosody generation unit 13 outputs the target prosody to the segment selection unit 14.
When the speech synthesis unit 11 generates intermediate information in response to an instruction from the automatic update unit 17, the prosody generation unit 13 acquires a pair of text information and synthesis parameters from the automatic update unit 17. The prosody generation unit 13 then creates the target prosody based on the acquired text information and synthesis parameters and outputs it to the segment selection unit 14.
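A much-simplified picture of target prosody generation follows: choose the prosody model whose accent position is closest to that of the phonetic information, then let the synthesis parameters scale durations and pitch. In this Python sketch, the `ProsodyModel` encoding (one accent index plus a relative F0 contour), the 120 ms and 150 Hz baselines, and the scaling formulas are assumptions, not the patent's models.

```python
from dataclasses import dataclass


@dataclass
class ProsodyModel:
    """Hypothetical prosody model: accent position plus a relative F0 contour."""
    name: str
    accent_index: int        # index of the unit where the pitch falls (assumed)
    f0_contour: list[float]  # relative pitch per unit, 1.0 = speaker baseline


def select_model(models: list[ProsodyModel], accent_index: int) -> ProsodyModel:
    # Choose the model whose accent position is closest to the phonetic information's.
    return min(models, key=lambda m: abs(m.accent_index - accent_index))


def target_prosody(units: list[str], accent_index: int, models: list[ProsodyModel],
                   speech_rate: float = 1.0, pitch: float = 1.0, pitch_range: float = 1.0,
                   base_ms: float = 120.0, base_hz: float = 150.0):
    """Return [(unit, length_ms, f0_hz)]: phoneme length and pitch per unit."""
    model = select_model(models, accent_index)
    prosody = []
    for i, unit in enumerate(units):
        rel = model.f0_contour[min(i, len(model.f0_contour) - 1)]  # hold last value
        f0 = base_hz * pitch * (1.0 + (rel - 1.0) * pitch_range)   # scale excursion
        prosody.append((unit, base_ms / speech_rate, f0))          # faster = shorter
    return prosody
```

Raising `speech_rate` shortens every unit, while `pitch_range` widens or narrows the excursion around the shifted baseline, which is why a target prosody computed for one set of synthesis parameters does not carry over unchanged to another.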

素片選択部１４及び波形生成部１５は、例えば、音素接続方式、コーパスベース方式または大規模コーパスベース方式によって合成音声信号を作成する。
素片選択部１４は、音素ごとに、目標韻律の音素長及びピッチ周波数に最も近い音声波形を、例えばパターンマッチングにより音声波形辞書に登録されている複数の音声波形の中から選択する。そのために、素片選択部１４は、記憶部１３から音声波形辞書を読み込む。音声波形辞書は、複数の音声波形及び各音声波形の識別番号を記録する。また音声波形は、例えば、一人以上のナレータが様々なテキストを読み上げた様々な音声を録音した音声信号から、音素単位で取り出された波形信号である。
さらに、素片選択部１４は、音素ごとに選択された音声波形を目標韻律に沿って接続できるようにするため、それら選択された音声波形と目標韻律に示された対応する音素の波形パターンとのずれ量を、波形変換情報として算出してもよい。
The segment selection unit 14 and the waveform generation unit 15 create a synthesized speech signal by, for example, a phoneme concatenation method, a corpus-based method, or a large-scale corpus-based method.
The segment selection unit 14 selects, for each phoneme, the speech waveform closest to the phoneme duration and pitch frequency of the target prosody from among the plurality of speech waveforms registered in a speech waveform dictionary, for example by pattern matching. For this purpose, the segment selection unit 14 reads the speech waveform dictionary from the storage unit 3. The speech waveform dictionary records a plurality of speech waveforms together with an identification number for each. Each speech waveform is, for example, a waveform signal extracted phoneme by phoneme from recordings of one or more narrators reading various texts.
Furthermore, so that the speech waveforms selected for the individual phonemes can be connected along the target prosody, the segment selection unit 14 may calculate, as waveform conversion information, the difference between each selected speech waveform and the waveform pattern of the corresponding phoneme indicated in the target prosody.
The segment selection unit 14 creates waveform generation information containing the identification number of the speech waveform selected for each phoneme. The waveform generation information may also contain the waveform conversion information. The segment selection unit 14 then outputs the waveform generation information to the waveform generation unit 15. It also outputs the waveform generation information to the automatic update unit 17 so that it can be stored as intermediate information.
Further, when the speech synthesis unit 11 creates intermediate information for a pair of text information and synthesis parameters received from the automatic update unit 17, the segment selection unit 14 outputs the waveform generation information it created to the automatic update unit 17.
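The unit selection described above can be sketched in Python as follows. This is a minimal sketch under assumptions that are not in the source: each dictionary entry is reduced to a phoneme label, a duration and a pitch; the pattern matching is replaced by a weighted relative-difference cost; and the waveform conversion information is modeled as a pair of duration and pitch ratios. The names `Unit`, `Target` and `select_units` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One entry of a hypothetical speech waveform dictionary."""
    unit_id: int
    phoneme: str
    duration_ms: float
    pitch_hz: float


@dataclass(frozen=True)
class Target:
    """Target prosody for one phoneme (duration and pitch)."""
    phoneme: str
    duration_ms: float
    pitch_hz: float


def select_units(targets, dictionary, w_dur=1.0, w_pitch=1.0):
    """Return waveform generation info: one (unit_id, conversion) pair per phoneme.

    `conversion` is the (duration ratio, pitch ratio) that maps the chosen unit
    onto the target; this is an assumed form of the waveform conversion info.
    """
    info = []
    for t in targets:
        candidates = [u for u in dictionary if u.phoneme == t.phoneme]
        if not candidates:
            raise KeyError(f"no waveform for phoneme {t.phoneme!r}")

        def cost(u):
            # Relative differences, so milliseconds and hertz are comparable.
            return (w_dur * abs(u.duration_ms - t.duration_ms) / t.duration_ms
                    + w_pitch * abs(u.pitch_hz - t.pitch_hz) / t.pitch_hz)

        best = min(candidates, key=cost)
        conversion = (t.duration_ms / best.duration_ms, t.pitch_hz / best.pitch_hz)
        info.append((best.unit_id, conversion))
    return info
```

Practical unit-selection synthesizers usually also score how smoothly neighbouring units join; this sketch scores each phoneme independently. The returned list is the kind of data that would be cached as intermediate information.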

The waveform generation unit 15 creates a synthesized speech signal based on the waveform generation information. For this purpose, the waveform generation unit 15 reads from the storage unit 3 the speech waveform signal corresponding to each phoneme's speech waveform identification number contained in the waveform generation information received from the segment selection unit 14 or the control unit 16. The waveform generation unit 15 then creates the synthesized speech signal by connecting the speech waveform signals in sequence. When the waveform generation information includes waveform conversion information, the waveform generation unit 15 first corrects each speech waveform signal according to the waveform conversion information obtained for the corresponding phoneme and then connects the signals in sequence.
The waveform generation unit 15 outputs the synthesized speech signal to the output unit 5.
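A minimal sketch of the concatenation step, assuming waveforms are stored as Python lists of samples keyed by identification number and that the conversion information is reduced to a single duration ratio (or `None` when absent). The naive resampler `stretch` is a hypothetical placeholder, not the patent's method; a real system would use a proper prosody-modification technique.

```python
def stretch(signal, ratio):
    """Naive nearest-neighbour resampling to about len(signal) * ratio samples.

    A crude stand-in for real prosody modification (e.g. PSOLA); it changes
    duration and pitch together.
    """
    n = max(1, round(len(signal) * ratio))
    return [signal[min(len(signal) - 1, int(i / ratio))] for i in range(n)]


def generate_waveform(gen_info, waveform_store):
    """Concatenate stored waveforms in order, applying any conversion ratio."""
    out = []
    for unit_id, ratio in gen_info:
        signal = waveform_store[unit_id]  # look up the waveform by identification number
        if ratio is not None:
            signal = stretch(signal, ratio)
        out.extend(signal)
    return out
```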

When a pair of registered text information and registered synthesis parameters matching the input pair of text information and synthesis parameters is registered in the intermediate information table, the control unit 16 reads the registered intermediate information corresponding to that registered pair from the storage unit 3. If the registered intermediate information includes waveform generation information, the control unit 16 outputs that waveform generation information to the waveform generation unit 15. If the registered intermediate information includes phonetic information but no waveform generation information, the control unit 16 outputs the phonetic information to the prosody generation unit 13.
The speech synthesis unit 11 can thereby skip at least part of the processing for generating the phonetic information, the target prosody, and the waveform generation information. As a result, the speech synthesis unit 11 can shorten the processing time required to create the synthesized speech signal.
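The control unit's choice of where to re-enter the pipeline can be sketched as below. The four stage callables stand in for the language processing unit 12, prosody generation unit 13, segment selection unit 14 and waveform generation unit 15; their signatures and the dictionary keys `phonetic` and `waveform_info` are assumptions made for illustration.

```python
def synthesize(text, params, cached, language, prosody, select, waveform):
    """Run only the pipeline stages not covered by cached intermediate info."""
    cached = cached or {}
    if cached.get("waveform_info") is not None:
        # Waveform generation info is cached: go straight to waveform generation.
        return waveform(cached["waveform_info"])
    phonetic = cached.get("phonetic")
    if phonetic is None:
        phonetic = language(text)          # language processing unit 12
    target = prosody(phonetic, params)     # prosody generation unit 13
    info = select(target)                  # segment selection unit 14
    return waveform(info)                  # waveform generation unit 15
```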

The automatic update unit 17 causes the speech synthesis unit 11 to create intermediate information based on derived text information, obtained by modifying the input text information, or on derived synthesis parameters, obtained by changing at least one value of the input synthesis parameters. For this purpose, the automatic update unit 17 includes a division processing unit 18 and a parameter adjustment unit 19.

When the match determination unit 10 determines that no pair of registered text information and registered synthesis parameters matching the input pair of text information and synthesis parameters is registered in the intermediate information table, the division processing unit 18 receives the input text information from the match determination unit 10. The division processing unit 18 then divides the original text contained in the input text information into predetermined units so that the results are easy to reuse as intermediate information.
For example, the division processing unit 18 detects each punctuation mark in the original text and divides the text so that the span from one punctuation mark to the next forms one sentence unit. Alternatively, regardless of the punctuation actually present, the division processing unit 18 may detect breakable positions, i.e. positions in the original text where a punctuation mark could be inserted, and divide the text so that the span between adjacent breakable positions forms one sentence unit. To detect breakable positions, the division processing unit 18 may, for example, perform morphological analysis to find phrase (bunsetsu) boundaries and treat those boundaries as breakable positions.
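Splitting at punctuation marks, the first option above, might look like this minimal sketch. The punctuation set is an assumption; runs such as "!?" stay attached to the preceding unit, and no characters are dropped. Detecting breakable positions by morphological analysis is not shown, since it would need a Japanese morphological analyzer.

```python
def split_at_punctuation(text, punct="、。，．,.!?！？"):
    """Split text into units that each end at a punctuation mark (lossless)."""
    units, current = [], ""
    for ch in text:
        if ch in punct and not current and units:
            units[-1] += ch  # attach runs like "!?" to the preceding unit
            continue
        current += ch
        if ch in punct:
            units.append(current)
            current = ""
    if current:  # trailing text without a final punctuation mark
        units.append(current)
    return units
```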

The division processing unit 18 may also receive the phonetic information created by the language processing unit 12 from the input text information and divide the original text based on that phonetic information. In this case, the division processing unit 18 divides the original text, for example, at the break positions set in the phonetic information. Alternatively, the division processing unit 18 may divide the original text at positions where it could be broken other than the break positions set in the phonetic information. Here too, the division processing unit 18 finds such positions by, for example, performing morphological analysis to detect phrase boundaries.

The division processing unit 18 creates, as derived text information, text information containing any one of the sentence units obtained by dividing the original text.
The division processing unit 18 may also divide one original text at different positions to create several kinds of sentence units from it. Suppose, for example, that the original text in the text information is “名古屋方面を走行中のドライバーに、渋滞のお知らせです。” (“To drivers heading toward Nagoya: a traffic congestion notice.”). The division processing unit 18 may divide it into two units, “名古屋方面を走行中のドライバーに、” (“to drivers heading toward Nagoya,”) and “渋滞のお知らせです。” (“a traffic congestion notice.”). Alternatively, it may divide it into three units: “名古屋方面を” (“toward Nagoya”), “走行中のドライバーに、” (“to drivers on the road,”) and “渋滞のお知らせです。”. It may also divide it into the two units “名古屋方面を” and “走行中のドライバーに、渋滞のお知らせです。” (“to drivers on the road, a traffic congestion notice.”). In this case, the division processing unit 18 may create derived text information for each of these sentence units.
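Given the chunks between breakable positions (for example, phrase boundaries from a morphological analysis that is assumed to have been done already), every way of cutting at a subset of those positions yields units that are contiguous runs of chunks. The hypothetical helper below enumerates them; with the three chunks of the example it reproduces all the units listed above.

```python
def candidate_units(chunks, include_whole=False):
    """Every contiguous run of chunks, i.e. every unit obtainable by cutting
    at some subset of the breakable positions between chunks."""
    n = len(chunks)
    units = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            if (i, j) == (0, n) and not include_whole:
                continue  # the undivided original is handled separately
            units.append("".join(chunks[i:j]))
    return units
```

For n chunks this gives n(n+1)/2 - 1 units, so the count grows quadratically with sentence length; an implementation might cap the unit length.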

The parameter adjustment unit 19 creates derived synthesis parameters by changing at least one of the input synthesis parameters. Suppose, for example, that the speech rate and the pitch in the synthesis parameters are each expressed on a scale from '1' to '5'. If the input speech rate and pitch are both '3', the parameter adjustment unit 19 creates derived synthesis parameters either by changing the speech rate to '1', '2', '4' or '5', or by changing the pitch to '1', '2', '4' or '5'.
The automatic update unit 17 then passes each pair of derived text information and derived synthesis parameters to the speech synthesis unit 11 and has the speech synthesis unit 11 create phonetic information and waveform generation information for that pair.
Note that a pair of registered text information and registered synthesis parameters matching a pair of derived text information and derived synthesis parameters may already be registered in the intermediate information table. In that case, the automatic update unit 17 need not have the speech synthesis unit 11 create phonetic information and waveform generation information for that derived pair.
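The one-parameter-at-a-time variation in the example above, plus the check for already-registered pairs, can be sketched as follows. Representing parameters as a `dict` of integers on a 1 to 5 scale, and keying registered pairs by `(text, sorted parameter items)`, are assumptions; `derived_parameters` and `pending_variants` are hypothetical names.

```python
def derived_parameters(params, lo=1, hi=5):
    """Vary one parameter at a time over lo..hi, keeping the others fixed."""
    variants = []
    for name, value in params.items():
        for v in range(lo, hi + 1):
            if v != value:
                variants.append({**params, name: v})
    return variants


def pending_variants(text, params, registered):
    """Drop variants whose (text, params) pair is already registered."""
    return [p for p in derived_parameters(params)
            if (text, tuple(sorted(p.items()))) not in registered]
```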

When the automatic update unit 17 receives from the speech synthesis unit 11 the phonetic information and waveform generation information for a pair of derived text information and derived synthesis parameters, it adds that pair, together with the phonetic information and waveform generation information, to the intermediate information table. By updating the intermediate information table stored in the storage unit 3 in this way, the automatic update unit 17 increases the amount of intermediate information available for creating synthesized speech signals for pairs of text information and synthesis parameters input later. That is, the pair of derived text information and derived synthesis parameters becomes a new pair of registered text information and registered synthesis parameters, and the phonetic information and waveform generation information created for it become the registered intermediate information corresponding to that new pair.
The automatic update unit 17 may also create intermediate information by changing only one of the input text information and the input synthesis parameters; that is, either the derived text information or the derived synthesis parameters may be identical to the input. Further, the automatic update unit 17 may register only one of the phonetic information and the waveform generation information created for the derived pair as registered intermediate information in the intermediate information table.
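A minimal in-memory model of the intermediate information table and the registration step, assuming a dictionary keyed by the (text, parameters) pair. The `Entry` fields mirror the table columns (identification number, text, synthesis parameters, phonetic information, waveform generation information); the class and method names are hypothetical, and keeping an existing row on re-registration is a design choice of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    entry_id: int
    text: str
    params: tuple                         # e.g. (("pitch", 3), ("speed", 3))
    phonetic: Optional[str] = None
    waveform_info: Optional[list] = None


class IntermediateTable:
    def __init__(self):
        self._rows = {}                   # (text, params) -> Entry
        self._next_id = 1

    def lookup(self, text, params):
        """Return the matching row, or None if the pair is not registered."""
        return self._rows.get((text, params))

    def register(self, text, params, phonetic=None, waveform_info=None):
        """Add a row; either intermediate field may be omitted, but not both."""
        if phonetic is None and waveform_info is None:
            raise ValueError("nothing to register")
        key = (text, params)
        if key in self._rows:
            return self._rows[key]        # already registered: keep the existing row
        entry = Entry(self._next_id, text, params, phonetic, waveform_info)
        self._rows[key] = entry
        self._next_id += 1
        return entry
```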

FIG. 3 shows an example of the intermediate information table. Each row of the intermediate information table 300 corresponds to one piece of registered intermediate information. From left to right, the columns store the identification number of the registered intermediate information, the synthesis parameters, the text information, the phonetic information, and the waveform generation information. In this example, row 301 holds the intermediate information corresponding to the input text information and synthesis parameters, while rows 302 and 303 hold intermediate information created from the input text information and derived synthesis parameters.
In this example, speech rate and pitch are defined as the synthesis parameters, so the derived synthesis parameters differ from the original values shown in row 301 in at least one of speech rate and pitch. The synthesis parameters may also include other parameters such as intonation, volume, and voice pitch.

For Japanese speech, for example, the text information is written like ordinary prose, as a combination of kana, kanji, and punctuation marks. In this example, the text information contains the sentence “お客様に・・・申し上げます。” (roughly, “To our customers, ... we would like to inform you.”).
The phonetic information is written, for example, in accordance with JEITA (Japan Electronics and Information Technology Industries Association) standard TT-6004 (speech synthesis symbol standard for in-vehicle use). The phonetic information may instead be written in any other notation that indicates the accent positions and break positions within the sentence in the text information and the reading of each phoneme. In this example, the sentence and phonetic information in the derived text information are identical to those in the input text information.
The waveform generation information column describes, in time order, the identification number of the speech waveform used for each phoneme and, where applicable, the waveform conversion information. Any suitable notation may be chosen for the identification numbers and the waveform conversion information.

FIG. 4 shows another example of the intermediate information table. Each row of the intermediate information table 400 corresponds to one piece of registered intermediate information, and the columns, from left to right, store the identification number, synthesis parameters, text information, phonetic information, and waveform generation information. In this example, the input text information contains the original text “名古屋方面を走行中のドライバーに、渋滞のお知らせです。” (“To drivers heading toward Nagoya: a traffic congestion notice.”). Rows 401 and 402 hold intermediate information for the text information obtained by dividing the input text at the punctuation marks it contains, paired with the input synthesis parameters. Rows 403 and 404, on the other hand, hold intermediate information created from derived text information, obtained by dividing the original text at a position other than its punctuation marks, paired with the input synthesis parameters.
Here, row 403 registers the sentence “名古屋方面を、” (“toward Nagoya,”) as derived text information, and row 404 registers the sentence “走行中のドライバーに渋滞のお知らせです。” (“a traffic congestion notice for drivers on the road.”).
Registering intermediate information for derived text information generated by dividing one sentence in various ways raises the probability that text information input later matches text information registered in the intermediate information table. The speech synthesizer 1 is therefore more likely to be able to shorten the time needed to create synthesized speech.

FIG. 5 is an operation flowchart of the speech synthesis process executed by the processing unit 4 of the speech synthesizer 1.
The processing unit 4 acquires text information and synthesis parameters via the input unit 2 (step S101).
The match determination unit 10 determines whether the pair consisting of the input text information (or the phonetic information it represents) and the synthesis parameters matches any pair of registered text information (or phonetic information) and registered synthesis parameters in the intermediate information table (step S102).
If the input pair matches a registered pair (step S102: Yes), the match determination unit 10 passes the identification number of the matching registered pair to the control unit 16. Referring to the intermediate information table, the control unit 16 reads from the storage unit 3 the registered intermediate information corresponding to that identification number (step S103) and passes it to the speech synthesis unit 11.
The speech synthesis unit 11 creates a synthesized speech signal using the registered intermediate information (step S104) and outputs it to the speaker 6 via the output unit 5.

If, on the other hand, the input pair does not match any registered pair (step S102: No), the match determination unit 10 passes the input text information and synthesis parameters to the speech synthesis unit 11, which creates a synthesized speech signal from them (step S105) and outputs it to the speaker 6 via the output unit 5. The speech synthesis unit 11 also passes the phonetic information and waveform generation information created along the way, as intermediate information, to the automatic update unit 17 together with the input text information and synthesis parameters.
The automatic update unit 17 registers the input text information, the synthesis parameters, and the created phonetic information and waveform generation information in the intermediate information table as registered text information, registered synthesis parameters, and registered intermediate information, respectively (step S106).

The automatic update unit 17 also receives the input text information and synthesis parameters from the match determination unit 10. The division processing unit 18 of the automatic update unit 17 divides the original text in the text information into several predetermined sentence units and creates derived text information for each unit (step S107).
The parameter adjustment unit 19 of the automatic update unit 17 creates derived synthesis parameters by modifying at least one parameter in the input synthesis parameters (step S108).

The automatic update unit 17 outputs the derived text information and derived synthesis parameters to the speech synthesis unit 11. To reduce the load on the speech synthesis unit 11, they are preferably passed at a time when the speech synthesis unit 11 is not creating a synthesized speech signal. For example, the automatic update unit 17 passes them to the speech synthesis unit 11 after receiving a notification from the waveform generation unit 15 that creation of the synthesized speech signal for the pair of input text information and input synthesis parameters has finished. The speech synthesis unit 11 then creates phonetic information and waveform generation information from the derived text information and derived synthesis parameters (step S109).
The speech synthesis unit 11 passes the created phonetic information and waveform generation information to the automatic update unit 17 as intermediate information. The automatic update unit 17 registers the derived text information, the derived synthesis parameters, and the created phonetic information and waveform generation information in the intermediate information table as registered text information, registered synthesis parameters, and registered intermediate information, respectively (step S110).
After step S104 or step S110, the processing unit 4 ends the speech synthesis process.
A pair of registered text information and registered synthesis parameters that matches only part of the original text in the input text information, together with the input synthesis parameters, may also be registered in the intermediate information table. In that case, the processing unit 4 executes steps S103 and S104 for the part of the original text that matches, and steps S105 to S110 for the remaining parts.
The processing unit 4 may also swap the order of steps S107 and S108, or omit either one of them.
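The FIG. 5 flow (steps S101 to S110) can be condensed into the following sketch. It is illustrative only: a plain `dict` stands in for the intermediate information table; `full`, `render`, `derive_texts` and `derive_params` are hypothetical callables; pairing every derived text with every parameter set is an assumption; and the background work of steps S109 and S110 runs synchronously instead of after the requested signal has been output. `full` also returns a signal in the S109 step, which the sketch discards, and partial matches (only part of the original text registered) are not handled.

```python
def process(text, params, table, full, render, derive_texts, derive_params):
    """One pass of the FIG. 5 flow, with a plain dict standing in for the table.

    full(text, params) -> (signal, intermediate)   hypothetical full synthesis
    render(intermediate) -> signal                 hypothetical reuse path
    derive_texts / derive_params return lists of derived values.
    """
    key = (text, params)                          # S101: the input pair
    if key in table:                              # S102: Yes
        return render(table[key])                 # S103-S104: reuse cached info
    signal, intermediate = full(text, params)     # S102: No -> S105
    table[key] = intermediate                     # S106: register the input pair
    texts = [text] + derive_texts(text)           # S107: derived text units
    plist = [params] + derive_params(params)      # S108: derived parameters
    # S109-S110: synthesize and register every pair not yet in the table.
    for t in texts:
        for p in plist:
            if (t, p) not in table:
                _unused_signal, inter = full(t, p)
                table[(t, p)] = inter
    return signal
```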

As described above, when no intermediate information corresponding to the input pair of text information and synthesis parameters is stored, this speech synthesizer creates a synthesized speech signal from the input text information and synthesis parameters. It stores the waveform generation information and other data created along the way as intermediate information, together with the input pair. It also creates intermediate information, such as waveform generation information, for derived text information obtained by modifying the input text information and for derived synthesis parameters obtained by modifying some of the synthesis parameters, and stores that intermediate information with the derived text information and derived synthesis parameters so that it can be used in later synthesis. Consequently, when the speech synthesizer is asked to synthesize an original text that shares at least part of a previously synthesized original text, under synthesis conditions different from those used before, it can shorten the time required for speech synthesis.

Next, a speech synthesizer according to the second embodiment will be described.
The speech synthesizer according to the second embodiment checks how many times each piece of intermediate information registered in the intermediate information table has been used, and deletes rarely used intermediate information together with the corresponding pair of registered text information and registered synthesis parameters. This shortens the time needed to search for a registered pair matching the input pair of text information and synthesis parameters.

FIG. 6 is a schematic configuration diagram of the processing unit 41 of the speech synthesizer according to the second embodiment. The processing unit 41 includes the match determination unit 10, the speech synthesis unit 11, the control unit 16, and the automatic update unit 17. The automatic update unit 17 includes the division processing unit 18, the parameter adjustment unit 19, and a survival determination unit 20.
In FIG. 6, each component of the processing unit 41 bears the same reference number as the corresponding component of the processing unit 4 of the speech synthesizer 1 according to the first embodiment shown in FIG. 2. The processing unit 41 according to the second embodiment differs from the processing unit 4 according to the first embodiment in that the automatic update unit 17 includes the survival determination unit 20.
The following therefore describes the differences between the processing unit 41 and the processing unit 4 of the first embodiment. For the other components of the speech synthesizer according to the second embodiment, refer to FIGS. 1 and 2 and the related description of the first embodiment.

In this embodiment, the intermediate information table additionally records, for each pair of registered text information and registered synthesis parameters, the most recent date and time at which its usage count was checked, and the number of times it has been used to create a synthesized speech signal since that check.

FIG. 7 shows an example of the intermediate information table according to the second embodiment. Each row of the intermediate information table 700 corresponds to one piece of registered intermediate information. From left to right, the columns store the identification number, synthesis parameters, text information, phonetic information, waveform generation information, usage count, and check date and time. For example, the usage count of the intermediate information in row 701 was last checked at 0:00 on August 25, 2010, and that intermediate information has been used 132 times since then. Likewise, the usage count of the intermediate information in row 702 was last checked at 0:00 on August 20, 2010, and it has been used 16 times since then.

In this embodiment, whenever a piece of registered intermediate information is used to create a synthesized speech signal, the control unit 16 increments the usage count of that registered intermediate information stored in the intermediate information table by one. That is, the usage count is incremented for the registered intermediate information corresponding to the set of registered text information and registered synthesis parameters that the match determination unit 10 has determined to match the input set of text information and synthesis parameters.
Further, when the automatic update unit 17 newly registers intermediate information in the intermediate information table, it sets the usage count of that intermediate information to '0' and sets its check date and time to the registration date and time.
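The bookkeeping described above (a usage count incremented on each reuse, and a count of '0' plus a check date set at registration) can be sketched as follows. This is a minimal Python sketch under assumptions: the class and field names (`TableRow`, `IntermediateTable`, `use_count`, `checked_at`) are hypothetical, synthesis parameters are modeled as a plain dict, and matching is a simple exact comparison rather than whatever lookup the match determination unit 10 actually performs.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TableRow:
    """One row of a hypothetical intermediate information table (cf. FIG. 7)."""
    row_id: int
    synthesis_params: dict            # illustrative, e.g. {"rate": 1.0}
    text: str
    phonetic: str
    waveform_info: object = None      # placeholder for waveform generation information
    use_count: int = 0                # uses since the last check
    checked_at: datetime = field(default_factory=datetime.now)


class IntermediateTable:
    def __init__(self):
        self.rows = {}                # row_id -> TableRow
        self._next_id = 1

    def register(self, synthesis_params, text, phonetic, waveform_info=None, now=None):
        """New entries start with a usage count of 0 and check time = registration time."""
        now = now or datetime.now()
        row = TableRow(self._next_id, dict(synthesis_params), text, phonetic,
                       waveform_info, use_count=0, checked_at=now)
        self.rows[row.row_id] = row
        self._next_id += 1
        return row

    def find(self, synthesis_params, text):
        """Return the row whose (text, synthesis parameters) set matches exactly, else None."""
        for row in self.rows.values():
            if row.text == text and row.synthesis_params == synthesis_params:
                return row
        return None

    def record_use(self, row_id):
        """Called each time a row is used to create a synthesized speech signal."""
        self.rows[row_id].use_count += 1
```

A caller would typically `find` first and, on a hit, call `record_use` before reusing the row's intermediate information.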

The survival determination unit 20 of the automatic update unit 17 deletes, from the registered intermediate information in the intermediate information table, any entry that has rarely been used during the most recent predetermined period.
FIG. 8 is an operation flowchart of the survival determination process for registered intermediate information, executed by the survival determination unit 20.
The survival determination unit 20 periodically (for example, every day at midnight) refers to the intermediate information table stored in the storage unit 3 and identifies the registered intermediate information for which the predetermined period has elapsed since the previous check date and time (step S201). Alternatively, the survival determination unit 20 may identify such registered intermediate information at a specific timing (for example, when the speech synthesizer is started or shut down). The predetermined period is determined according to how frequently the speech synthesizer itself is used, the storage capacity of the storage unit 3, or the processing speed of the processor included in the processing unit 41. For example, the more frequently the speech synthesizer is used, the shorter the predetermined period is set; conversely, the larger the storage capacity of the storage unit 3 or the faster the processor in the processing unit 41, the longer the predetermined period is set. The predetermined period is set to, for example, one week, one month, six months, or one year.

The survival determination unit 20 determines whether the usage count of the identified registered intermediate information is less than a predetermined threshold (step S202). The predetermined threshold is set to, for example, between one and a few uses. If the usage count is less than the threshold (step S202-Yes), the corresponding intermediate information has hardly been used to create synthesized speech signals. The survival determination unit 20 therefore deletes that registered intermediate information, together with the corresponding set of registered text information and registered synthesis parameters, from the intermediate information table (step S203).
On the other hand, if the usage count is equal to or greater than the threshold (step S202-No), the registered intermediate information is being used to create synthesized speech signals, so the speech synthesizer should preferably keep it. The survival determination unit 20 therefore updates the check date and time recorded in the intermediate information table for that registered intermediate information to the date and time of the current check, and resets its usage count to '0' (step S204).
When multiple pieces of registered intermediate information have passed the predetermined period since the previous check, the survival determination unit 20 executes steps S202 to S204 for each of them. Once steps S202 to S204 have been completed for all of them, the survival determination unit 20 ends the survival determination process.
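The survival determination process of FIG. 8 (steps S201 to S204) might look like the following Python sketch. The one-week period and the threshold of one use are illustrative picks from the ranges mentioned above, and `Entry` and `survival_check` are hypothetical names; only the usage count and check date needed for the decision are modeled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Entry:
    use_count: int        # uses since the last check
    checked_at: datetime  # date and time of the last check


def survival_check(table, now, period=timedelta(days=7), threshold=1):
    """Apply steps S201-S204 to a dict {entry_id: Entry}; return the deleted ids."""
    deleted = []
    # S201: entries whose previous check lies at least `period` in the past
    due = [eid for eid, e in table.items() if now - e.checked_at >= period]
    for eid in due:
        entry = table[eid]
        if entry.use_count < threshold:   # S202-Yes: rarely used
            del table[eid]                # S203: delete the entry
            deleted.append(eid)
        else:                             # S202-No: still in use
            entry.checked_at = now        # S204: move the check date forward
            entry.use_count = 0           #       and reset the usage count
    return deleted
```

Because surviving entries get a fresh check date and a zeroed count, each period is judged on its own usage rather than on the lifetime total.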

As a result, this speech synthesizer can delete infrequently used intermediate information from the storage unit. This keeps the number of entries in the intermediate information table from growing unnecessarily, which in turn shortens the time required to search for the registered intermediate information whose set of registered text information and registered synthesis parameters matches the input set of text information and synthesis parameters. Furthermore, because the speech synthesizer decides whether to delete each piece of registered intermediate information based on its usage count over the most recent fixed period, it can appropriately select which entries to delete.

Next, a speech synthesizer according to a third embodiment will be described.
The speech synthesizer according to the third embodiment has a means for editing registered intermediate information. This allows a user, when intermediate information registered in the intermediate information table is inappropriate, to manually delete or correct that registered intermediate information.

FIG. 9 is a schematic configuration diagram of the processing unit 42 of the speech synthesizer according to the third embodiment. The processing unit 42 includes a match determination unit 10, a speech synthesis unit 11, a control unit 16, an automatic update unit 17, and an editing unit 21.
In FIG. 9, each component of the processing unit 42 is assigned the same reference number as the corresponding component of the processing unit 4 of the speech synthesizer 1 according to the first embodiment shown in FIG. 2. The processing unit 42 according to the third embodiment differs from the processing unit 4 according to the first embodiment in that it includes the editing unit 21.
Therefore, only the differences between the processing unit 42 and the processing unit 4 of the first embodiment are described below. For the other components of the speech synthesizer according to the third embodiment, refer to FIGS. 1 and 2 and the related description of the first embodiment.

In this embodiment, the input unit 2 includes, for example, a keyboard or a pointing device such as a mouse, and a display. Alternatively, the input unit 2 may include a touch panel display.
When the processing unit 42 receives from the input unit 2 an operation signal indicating that registered intermediate information is to be edited, it activates the editing unit 21.

The editing unit 21 causes the display of the input unit 2 to show, for example, a menu or operation buttons with which the user selects the registered intermediate information to be edited. The editing unit 21 also causes the display to show an operation button for deleting the registered intermediate information to be edited, or a text box for correcting the phonetic information and other data included in that registered intermediate information.
When the editing unit 21 acquires the identification number of the registered intermediate information to be edited from the keyboard or other device of the input unit 2, it reads from the storage unit 3 the registered intermediate information corresponding to that identification number, together with the corresponding registered text information and registered synthesis parameters. The editing unit 21 then causes the display of the input unit 2 to show the registered text information and registered synthesis parameters, along with the phonetic information, waveform generation information, and other data included in the registered intermediate information.
When the editing unit 21 receives, from the keyboard or other device of the input unit 2, an operation signal for deleting the registered intermediate information to be edited together with its identification number, it deletes that registered intermediate information from the intermediate information table.
Alternatively, when the editing unit 21 receives, from the keyboard or other device of the input unit 2, an operation signal for correcting part of the registered intermediate information to be edited (for example, part of its phonetic information) together with its identification number, it corrects the phonetic information of the registered intermediate information identified by that number in accordance with the operation signal. The editing unit 21 then inputs the corrected phonetic information to the speech synthesis unit 11 to obtain waveform generation information corresponding to the corrected phonetic information.
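The two editing operations described above, deletion and correction of phonetic information followed by regeneration of the waveform generation information, can be sketched as plain functions over a dict-based table. This Python sketch is hypothetical: entries are dicts with assumed keys (`params`, `phonetic`, `waveform_info`), `make_waveform_info` is a placeholder callable standing in for the speech synthesis unit 11, and the display and input handling of the editing unit 21 are omitted.

```python
from typing import Callable, Dict


def delete_entry(table: Dict[int, dict], entry_id: int) -> bool:
    """Delete the registered intermediate information with this id; True if it existed."""
    return table.pop(entry_id, None) is not None


def correct_phonetic(table: Dict[int, dict], entry_id: int, new_phonetic: str,
                     make_waveform_info: Callable[[str, dict], object]) -> dict:
    """Replace an entry's phonetic information, then regenerate its waveform
    generation information from the corrected phonetic information."""
    entry = table[entry_id]
    entry["phonetic"] = new_phonetic
    # Stand-in for passing the corrected phonetic information to the speech synthesis unit
    entry["waveform_info"] = make_waveform_info(new_phonetic, entry["params"])
    return entry
```

Regenerating the waveform generation information immediately keeps the entry internally consistent, so later reuse never pairs corrected phonetic information with stale waveform data.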

Because this speech synthesizer has a means for editing intermediate information that has already been registered, it can edit inappropriate intermediate information, such as intermediate information based on phonetic information whose reading differs from the correct reading of the sentence given in the text information. The speech synthesizer can therefore retain only appropriate intermediate information.

According to a modification of each of the embodiments described above, the parameter adjustment unit of the automatic update unit may create additional derived text information by replacing a proper noun included in the input text information or the derived text information with another proper noun of the same type. For example, if the text information contains the sentence "To drivers traveling toward Nagoya," the parameter adjustment unit may replace "Nagoya", a proper noun denoting a place name, with "Tokyo" or "Kyoto", which are proper nouns of the same kind. To enable such replacement, the word dictionary stored in the storage unit also includes, for example, identification information indicating the type of each proper noun. In addition, the division processing unit notifies the parameter adjustment unit of the part of speech of each word included in the text information, as obtained from the morphological analysis.
As a result, the automatic update unit can have the speech synthesis unit create intermediate information for more derived text information, which further increases the likelihood that intermediate information can be reused to create synthesized speech signals for subsequent sets of input text information and synthesis parameters.
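The proper-noun substitution described in this modification could be sketched as below. The noun-type dictionary and the `(surface, part_of_speech)` token format are assumptions standing in for the word dictionary and the morphological-analysis output; tokens are joined with spaces for readability, whereas Japanese text would normally be joined without them.

```python
from typing import Dict, List, Tuple

# Hypothetical dictionary entries: proper noun -> type.
PROPER_NOUN_TYPES: Dict[str, str] = {
    "Nagoya": "place",
    "Tokyo": "place",
    "Kyoto": "place",
}


def derive_by_substitution(tokens: List[Tuple[str, str]]) -> List[str]:
    """For each proper noun in `tokens`, emit one derived sentence per other
    proper noun of the same type."""
    derived = []
    for i, (surface, pos) in enumerate(tokens):
        if pos != "proper_noun" or surface not in PROPER_NOUN_TYPES:
            continue
        kind = PROPER_NOUN_TYPES[surface]
        for other, other_kind in PROPER_NOUN_TYPES.items():
            if other != surface and other_kind == kind:
                words = [w for w, _ in tokens]
                words[i] = other          # swap in a same-type proper noun
                derived.append(" ".join(words))
    return derived
```

For the tokens of "To drivers traveling toward Nagoya", this yields one derived sentence ending in "Tokyo" and one ending in "Kyoto".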

According to another modification, the automatic update unit may have the speech synthesis unit create the synthesized speech signal corresponding to the derived text information and derived synthesis parameters. In this case, the automatic update unit may receive the synthesized speech signal from the speech synthesis unit and store it in the storage unit as registered intermediate information. Similarly, the automatic update unit may store the synthesized speech signal created from the input text information and input synthesis parameters in the storage unit as registered intermediate information. When the storage unit stores synthesized speech signals in this way and a set of registered text information and registered synthesis parameters matching a subsequently input set of text information and synthesis parameters is registered in the intermediate information table, the speech synthesizer can also skip the processing of the waveform generation unit. The speech synthesizer can therefore further reduce the time required for speech synthesis.
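Storing the finished synthesized speech signal, as this modification allows, turns the table into a cache of complete outputs: an exact match then skips waveform generation as well. A minimal Python sketch, assuming a hypothetical `synthesize(text, params)` callable for the whole pipeline and a key built from the text and the sorted synthesis parameters:

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, Tuple[Tuple[str, object], ...]]


class WaveformCache:
    """Hypothetical store of complete synthesized speech signals keyed by
    (text, synthesis parameters)."""

    def __init__(self, synthesize: Callable[[str, dict], bytes]):
        self._synthesize = synthesize     # stand-in for the full synthesis pipeline
        self._store: Dict[Key, bytes] = {}

    @staticmethod
    def _key(text: str, params: dict) -> Key:
        # Sorting makes the key independent of parameter insertion order
        return (text, tuple(sorted(params.items())))

    def get(self, text: str, params: dict) -> bytes:
        """Return the stored signal on a match; otherwise synthesize and store it."""
        key = self._key(text, params)
        if key not in self._store:
            self._store[key] = self._synthesize(text, params)
        return self._store[key]
```

Calling `get` ahead of time for derived sets of text and parameters would pre-populate the store in the same spirit as registering derived entries, at the cost of keeping full signals rather than smaller intermediate information.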

Furthermore, in each of the embodiments described above, the speech synthesizer may acquire, via the input unit, an importance level for the input text information together with the input set of text information and synthesis parameters. The importance level is set to, for example, one of two levels, or one of three or more levels; for example, the larger the numerical value, the more important the input text information.
The automatic update unit may have the speech synthesis unit create intermediate information for a set of derived text information and derived synthesis parameters only when the importance level is equal to or greater than a predetermined importance threshold. For example, when the importance level is either '0' or '1', the importance threshold is set to '1'. When the importance level is one of three or more levels from '0' to 'n' (where n is an integer of 2 or more), the importance threshold is set to any value from '1' to 'n'.
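The importance gate above reduces to a threshold comparison. The Python sketch below also checks the constraint stated in the text that the threshold lies between '1' and the top level 'n'; the function and parameter names are hypothetical.

```python
def derived_generation_allowed(importance: int, threshold: int, max_level: int) -> bool:
    """Return True if intermediate information should also be created for derived
    (text, synthesis parameter) sets.

    Importance levels run from 0 to max_level (max_level >= 1), and the
    threshold must be chosen from 1..max_level.
    """
    if max_level < 1:
        raise ValueError("at least two importance levels are required")
    if not 1 <= threshold <= max_level:
        raise ValueError("threshold must lie in 1..max_level")
    if not 0 <= importance <= max_level:
        raise ValueError("importance out of range")
    return importance >= threshold
```

With two levels ('0' and '1') and a threshold of '1', derived sets are processed only for text marked '1'.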

Furthermore, when the importance level is one of three levels, the automatic update unit may determine which intermediate information to register in the intermediate information table according to the importance level. For example, at the lowest level (for example, '0'), the automatic update unit registers no intermediate information at all, not even the intermediate information created for the input set of text information and synthesis parameters. At the intermediate level (for example, '1'), the automatic update unit registers the intermediate information created for the input set of text information and synthesis parameters, but does not register intermediate information for sets of derived text information and derived synthesis parameters. At the highest level (for example, '2'), the automatic update unit registers both the intermediate information created for the input set of text information and synthesis parameters and the intermediate information for sets of derived text information and derived synthesis parameters.
Furthermore, when the importance level is one of four or more levels, the automatic update unit may register intermediate information for more sets of derived text information and derived synthesis parameters the higher the importance level is above '2'.
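The registration policy above could be expressed as a function that returns what to register for a given importance level. In this hypothetical Python sketch, `base_derived` is the number of derived sets registered at level '2', and the linear growth for higher levels (`base_derived * (importance - 1)`) is an assumption; the text only says that higher levels register more derived sets.

```python
from typing import Tuple


def registration_plan(importance: int, base_derived: int = 1) -> Tuple[bool, int]:
    """Return (register_input_set, number_of_derived_sets_to_register).

    Level 0 registers nothing, level 1 registers only the input set, and
    level 2 or above also registers derived sets, more of them at higher levels.
    """
    if importance <= 0:
        return (False, 0)
    if importance == 1:
        return (True, 0)
    # Assumed linear scaling above level 2
    return (True, base_derived * (importance - 1))
```

Any monotonically increasing function of the level would fit the description equally well; the linear form simply keeps the sketch short.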

In this modification, the automatic update unit records the importance level together with each piece of intermediate information in the intermediate information table. For example, for intermediate information created for a set of input text information and synthesis parameters, the automatic update unit records in the table the importance level that was input together with that set. For intermediate information obtained for a set of derived text information and derived synthesis parameters created from that input set, the automatic update unit likewise records the importance level set for the original input set. Alternatively, for such derived intermediate information, the automatic update unit may record an importance level lower than the one set for the original input set.

Furthermore, when the automatic update unit has a survival determination unit, the survival determination unit may decide whether to delete registered intermediate information according to the importance level recorded in the intermediate information table. For example, the survival determination unit does not delete registered intermediate information whose importance level is higher than a second importance threshold, even if its usage count is less than the predetermined threshold. The second importance threshold is set to, for example, a value higher than the importance threshold described above. Alternatively, when the importance level recorded for a piece of registered intermediate information is the lowest of the levels at which intermediate information is registered, the survival determination unit may set the threshold compared with that entry's usage count higher than the normally used threshold.

FIG. 10 is a diagram showing another example of the intermediate information table according to this modification. Each row of the intermediate information table 1000 corresponds to one piece of registered intermediate information. From left to right, the columns store the identification number, synthesis parameters, text information, phonetic information, waveform generation information, importance level, usage count, and check date and time of each piece of registered intermediate information. For example, the intermediate information registered in row 1001 has an importance level of '5' and has not been used even once since the previous check. Similarly, the intermediate information registered in row 1002 has an importance level of '1' and has also not been used since the previous check. In this case, if the second importance threshold is, for example, '2' and the threshold compared with the usage count is '1' or more, the intermediate information in row 1002 falls below the thresholds for both importance level and usage count, and is therefore deleted by the survival determination unit. On the other hand, the intermediate information in row 1001 is not deleted, because its importance level is higher than the second importance threshold.
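The deletion rule illustrated by FIG. 10 can be written as a predicate. This Python sketch is hypothetical: `Row`, `should_delete`, and the optional raised threshold for the lowest registered level (the alternative rule described above) are illustrative names, and only the fields needed for the decision are modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    importance: int  # importance level recorded with the entry
    use_count: int   # uses since the previous check


def should_delete(row: Row, use_threshold: int, second_importance_threshold: int,
                  lowest_level: Optional[int] = None,
                  raised_use_threshold: Optional[int] = None) -> bool:
    """Decide whether a registered entry that is due for checking should be deleted."""
    # Entries whose importance exceeds the second importance threshold are never deleted.
    if row.importance > second_importance_threshold:
        return False
    threshold = use_threshold
    # Optional alternative: a stricter usage threshold for the lowest registered level.
    if (lowest_level is not None and raised_use_threshold is not None
            and row.importance == lowest_level):
        threshold = raised_use_threshold
    return row.use_count < threshold
```

With the FIG. 10 values (second importance threshold '2', usage threshold '1'), `should_delete(Row(5, 0), 1, 2)` is `False` for row 1001 and `should_delete(Row(1, 0), 1, 2)` is `True` for row 1002, matching the text.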

Furthermore, a computer program that causes a computer to implement the functions of the processing unit of the speech synthesizer according to each of the above embodiments may be provided recorded on a computer-readable medium, such as a magnetic recording medium, an optical recording medium, or a semiconductor memory.

All examples and specific terms recited herein are intended for pedagogical purposes, to help the reader understand the present invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions; nor does the organization of any example in this specification relate to a showing of the superiority or inferiority of the present invention. Although the embodiments of the present invention have been described in detail, it should be understood that various changes, substitutions, and modifications could be made hereto without departing from the spirit and scope of the present invention.

The following supplementary notes are further disclosed regarding the embodiments described above and their modifications.
(Appendix 1)
An input unit that obtains input text information including an original sentence that is a source of the synthesized speech signal and an input synthesis condition for creating the synthesized speech signal;
A set of registered text information and registered synthesis condition, and a set of registered intermediate information including registered phonetic information, which is generated in the middle of creating a synthesized speech signal corresponding to the set of registered text information and the registered synthesis condition A storage unit for storing at least one of
Whether the set of the input text information or the input phonetic information represented by the input text information and the input synthesis condition matches the registered text information or the set of the registered phonetic information and the registered synthesis condition A match determination unit for determining whether or not,
When the set of the input text information or the input phonetic information and the input synthesis condition matches either the registered text information or the set of the registered phonetic information and the registered synthesis condition, the registered text information or the A synthesized speech signal is created using the registered intermediate information corresponding to a set of registered phonetic information and the registered synthesis condition, while a set of the input text information or the input phonetic information and the input synthesis condition is A speech synthesizer that creates a synthesized speech signal based on the input text information and the input synthesis condition if it does not match any of the set of the registered text information or the registered phonetic information and the registered synthesis condition;
When the set of the input text information or the input phonetic information and the input synthesis condition does not match any of the registered text information or the set of the registered phonetic information and the registered synthesis condition, the input text information or the A set of derived text information and a derived synthesis condition is created by modifying the input synthesis condition, and the speech synthesis unit based on the derived text information and the derived synthesis condition set, the derived text information and the derived synthesis condition Is a set of derived intermediate information created by executing up to the middle stage of generating a synthesized speech signal as one of a set of the registered text information, the set of registered synthesis conditions, and the registered intermediate information. An automatic update unit to be stored in the storage unit;
A speech synthesizer.
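The match-then-fallback flow of Appendix 1 can be pictured as a cache of intermediate information keyed by the text (or phonetic string) and the synthesis condition. The following is a minimal Python sketch under stated assumptions, not the disclosed implementation: the class name `IntermediateCacheSynthesizer`, the split of the pipeline into a hypothetical `front_end` (producing the intermediate information) and `back_end` (producing the waveform), the `derive` callback that produces derived sets, and the representation of a synthesis condition as a dict of numeric parameters are all illustrative choices.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Condition = Dict[str, float]  # assumed: synthesis condition as named numeric parameters
Key = Tuple[str, Tuple[Tuple[str, float], ...]]


def make_key(text: str, condition: Condition) -> Key:
    """Build an order-independent, hashable key from text and synthesis condition."""
    return (text, tuple(sorted(condition.items())))


@dataclass
class IntermediateCacheSynthesizer:
    # Hypothetical first half of the pipeline; its output plays the role of
    # the "intermediate information" (e.g. phonetic and prosody data).
    front_end: Callable[[str, Condition], List[str]]
    # Hypothetical second half of the pipeline that turns intermediate
    # information into a waveform (here, a list of floats).
    back_end: Callable[[List[str]], List[float]]
    # Hypothetical modifier producing derived (text, condition) sets on a miss.
    derive: Callable[[str, Condition], List[Tuple[str, Condition]]]
    store: Dict[Key, List[str]] = field(default_factory=dict)

    def synthesize(self, text: str, condition: Condition) -> List[float]:
        key = make_key(text, condition)
        if key in self.store:
            # Match: reuse the registered intermediate information.
            return self.back_end(self.store[key])
        # No match: synthesize the input from scratch.
        signal = self.back_end(self.front_end(text, condition))
        # Register derived sets, running them only up to the intermediate stage.
        for d_text, d_cond in self.derive(text, condition):
            d_key = make_key(d_text, d_cond)
            if d_key not in self.store:
                self.store[d_key] = self.front_end(d_text, d_cond)
        return signal


if __name__ == "__main__":
    synth = IntermediateCacheSynthesizer(
        front_end=lambda t, c: [f"{ch}@{c['rate']}" for ch in t],
        back_end=lambda units: [float(len(u)) for u in units],
        derive=lambda t, c: [(t, {**c, "rate": 2.0})],  # derived: faster speaking rate
    )
    synth.synthesize("ab", {"rate": 1.0})  # miss: derived set ("ab", rate 2.0) is registered
    print(make_key("ab", {"rate": 2.0}) in synth.store)  # True
```

In this sketch only the derived sets are registered on a miss, following the wording of Appendix 1; whether the input set itself is also registered is not stated in this passage. Because derived entries are computed only up to the intermediate stage, a later request matching one of them skips the front-end stages entirely.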
(Appendix 2)
The speech synthesizer according to Appendix 1, wherein the automatic update unit includes a division processing unit that creates a plurality of second original sentences by dividing the original sentence included in the input text information into predetermined units, and creates the derived text information so as to include any one of the plurality of second original sentences.
(Appendix 3)
The speech synthesizer according to Appendix 2, wherein the predetermined unit is a breath group set in the original sentence.
(Appendix 4)
The speech synthesizer according to Appendix 2, wherein the predetermined unit is a segment of the original sentence delimited by at least one position at which a punctuation mark can be placed.
(Appendix 5)
The speech synthesizer according to Appendix 2, further comprising a language processing unit that obtains a first position delimiting the original sentence by performing morphological analysis on the original sentence,
wherein the division processing unit creates one of the plurality of second original sentences by dividing the original sentence at the first position, and creates another of the plurality of second original sentences by dividing the original sentence at a second position, different from the first position, at which a punctuation mark can be placed.
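Appendices 2 to 5 create derived text by dividing the original sentence. A minimal sketch, assuming the candidate division positions are already known as character offsets (in the disclosure the first position comes from morphological analysis, which is not reproduced here) and using an English sentence purely for readability; the function names `split_at` and `second_original_sentences` are hypothetical.

```python
from typing import Iterable, List


def split_at(text: str, positions: Iterable[int]) -> List[str]:
    """Divide text at the given character offsets; strip and drop empty pieces."""
    cuts = sorted({p for p in positions if 0 < p < len(text)})
    pieces: List[str] = []
    start = 0
    for p in cuts:
        pieces.append(text[start:p])
        start = p
    pieces.append(text[start:])
    return [piece.strip() for piece in pieces if piece.strip()]


def second_original_sentences(text: str,
                              first_positions: Iterable[int],
                              second_positions: Iterable[int]) -> List[str]:
    """Collect the distinct pieces from dividing at the first positions and,
    separately, at the second (punctuation-capable) positions."""
    result: List[str] = []
    for positions in (first_positions, second_positions):
        for piece in split_at(text, positions):
            if piece not in result:
                result.append(piece)
    return result


if __name__ == "__main__":
    text = "due to heavy rain the line is delayed by ten minutes"
    first = [text.index(" the line")]   # assumed breath-group boundary
    second = [text.index(" by ten")]    # assumed punctuation-capable position
    for piece in second_original_sentences(text, first, second):
        print(piece)
```

Each resulting piece would then be paired with a synthesis condition to form a derived set; dividing at two different kinds of position yields overlapping fragments, which raises the chance that a future input matches one of them.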
(Appendix 6)
The speech synthesizer according to Appendix 1, wherein the automatic update unit creates the derived text information by replacing a proper noun in the original sentence with another proper noun of the same type.
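A minimal sketch of the substitution in Appendix 6, assuming the proper noun and its type have already been identified in the original sentence and that a dictionary groups proper nouns by type. The dictionary `PROPER_NOUNS`, its station names, and the function `derive_by_substitution` are hypothetical.

```python
from typing import Dict, List

# Hypothetical dictionary grouping proper nouns by type.
PROPER_NOUNS: Dict[str, List[str]] = {
    "station": ["Shinagawa", "Yokohama", "Odawara"],
}


def derive_by_substitution(text: str, noun: str, noun_type: str,
                           dictionary: Dict[str, List[str]] = PROPER_NOUNS) -> List[str]:
    """Return one derived text per other proper noun of the same type."""
    if noun not in text:
        return []
    # str.replace substitutes every occurrence of the noun in the sentence.
    return [text.replace(noun, other)
            for other in dictionary.get(noun_type, [])
            if other != noun]


if __name__ == "__main__":
    for derived in derive_by_substitution("The next stop is Shinagawa.", "Shinagawa", "station"):
        print(derived)
```

This suits template-like announcements (station names, place names in weather reports), where the same sentence recurs with only the proper noun changing.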
(Appendix 7)
The speech synthesizer according to Appendix 1, wherein the automatic update unit deletes, from the registered intermediate information, any registered intermediate information that has been used to create a synthesized speech signal fewer times than a predetermined threshold within the most recent predetermined period.
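The deletion rule of Appendix 7 resembles a sliding-window usage count. A minimal sketch, assuming timestamps are plain floats supplied by the caller in nondecreasing order and that the store is a dict keyed by the registered set; the class `UsageTracker` and its parameters are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Hashable, List


class UsageTracker:
    """Tracks when each registered entry was used and evicts rarely used ones."""

    def __init__(self, window: float, min_uses: int) -> None:
        self.window = window        # length of the "most recent predetermined period"
        self.min_uses = min_uses    # the "predetermined threshold"
        self.uses: Dict[Hashable, Deque[float]] = defaultdict(deque)

    def record(self, key: Hashable, now: float) -> None:
        """Call whenever an entry's intermediate information is used (times nondecreasing)."""
        self.uses[key].append(now)

    def recent_count(self, key: Hashable, now: float) -> int:
        """Number of uses within [now - window, now]; older timestamps are discarded."""
        q = self.uses[key]
        while q and q[0] < now - self.window:
            q.popleft()
        return len(q)

    def evict(self, store: Dict[Hashable, object], now: float) -> List[Hashable]:
        """Delete entries used fewer than min_uses times in the window."""
        stale = [k for k in store if self.recent_count(k, now) < self.min_uses]
        for k in stale:
            del store[k]
            self.uses.pop(k, None)
        return stale
```

As written, an entry registered but never used would be deleted at the next eviction; a practical system might exempt recently registered entries, though the passage does not say so.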
(Appendix 8)
The speech synthesizer according to any one of Appendices 1 to 6, wherein the input unit acquires an importance of the input text information together with the input text information,
the automatic update unit determines a second importance of the derived intermediate information according to the importance of the input text information, and stores the second importance in the storage unit in association with the set of the derived text information and the derived synthesis condition and the derived intermediate information, and
the automatic update unit deletes, from the registered intermediate information, any registered intermediate information that has been used to create a synthesized speech signal fewer times than a predetermined threshold within the most recent predetermined period and whose second importance is less than a first importance threshold.
(Appendix 9)
The speech synthesizer according to any one of Appendices 1 to 6, wherein the input unit acquires an importance of the input text information together with the input text information, and the automatic update unit stores the set of the derived text information and the derived synthesis condition and the derived intermediate information in the storage unit when the importance of the input text information is greater than or equal to a second importance threshold, and does not store them in the storage unit when the importance of the input text information is less than the second importance threshold.
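Appendices 8 and 9 add importance to the registration and deletion decisions. A minimal sketch, assuming importance is a float, that the second importance simply equals the input importance (the passage only says it is determined "according to" it), and that recent usage counts are maintained elsewhere; `Entry`, `register_derived` and `evict` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, List


@dataclass
class Entry:
    intermediate: object
    importance: float      # the "second importance" attached at registration
    recent_uses: int = 0   # uses within the recent period, counted elsewhere


def register_derived(store: Dict[Hashable, Entry], key: Hashable, intermediate: object,
                     input_importance: float, second_importance_threshold: float) -> bool:
    """Appendix 9 gate: register only if the input's importance is high enough.
    Assumption: the second importance simply equals the input importance."""
    if input_importance < second_importance_threshold:
        return False
    store[key] = Entry(intermediate, importance=input_importance)
    return True


def evict(store: Dict[Hashable, Entry], min_uses: int,
          first_importance_threshold: float) -> List[Hashable]:
    """Appendix 8 rule: delete only entries that are both rarely used and unimportant."""
    stale = [k for k, e in store.items()
             if e.recent_uses < min_uses and e.importance < first_importance_threshold]
    for k in stale:
        del store[k]
    return stale
```

Under this rule an important message's derived entries survive periods of low use, while unimportant ones are either never registered or are reclaimed once they fall out of use.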
(Appendix 10)
Obtaining input text information, including an original sentence from which a synthesized speech signal is to be created, and an input synthesis condition for creating the synthesized speech signal;
Determining whether the set of the input text information, or of the input phonetic information represented by the input text information, and the input synthesis condition matches any of at least one set of registered text information or registered phonetic information and a registered synthesis condition stored in a storage unit;
When the set of the input text information or the input phonetic information and the input synthesis condition matches any set of the registered text information or the registered phonetic information and the registered synthesis condition, creating a synthesized speech signal using registered intermediate information that is stored in the storage unit and was generated partway through creating a synthesized speech signal corresponding to the matching set, and, when the set matches none of the sets of the registered text information or the registered phonetic information and the registered synthesis condition, creating a synthesized speech signal based on the input text information and the input synthesis condition; and
When the set of the input text information or the input phonetic information and the input synthesis condition matches none of the sets of the registered text information or the registered phonetic information and the registered synthesis condition, creating a set of derived text information and a derived synthesis condition by modifying the input text information or the input synthesis condition, and storing in the storage unit, as one set of a set of the registered text information and the registered synthesis condition and the registered intermediate information, the set of the derived text information and the derived synthesis condition together with derived intermediate information created partway through creating a synthesized speech signal based on the derived text information and the derived synthesis condition.
A speech synthesis method.
(Appendix 11)
A computer program for speech signal synthesis that causes a computer, which creates a synthesized speech signal, to execute:
Determining whether the set of input text information, including an original sentence from which a synthesized speech signal is to be created, or of input phonetic information represented by the input text information, and an input synthesis condition for creating the synthesized speech signal matches any of at least one set of registered text information or registered phonetic information and a registered synthesis condition stored in a storage unit;
When the set of the input text information or the input phonetic information and the input synthesis condition matches any set of the registered text information or the registered phonetic information and the registered synthesis condition, creating a synthesized speech signal using registered intermediate information that is stored in the storage unit and was generated partway through creating a synthesized speech signal corresponding to the matching set, and, when the set matches none of the sets of the registered text information or the registered phonetic information and the registered synthesis condition, creating a synthesized speech signal based on the input text information and the input synthesis condition; and
When the set of the input text information or the input phonetic information and the input synthesis condition matches none of the sets of the registered text information or the registered phonetic information and the registered synthesis condition, creating a set of derived text information and a derived synthesis condition by modifying the input text information or the input synthesis condition, and storing in the storage unit, as one set of a set of the registered text information and the registered synthesis condition and the registered intermediate information, the set of the derived text information and the derived synthesis condition together with derived intermediate information created partway through creating a synthesized speech signal based on the derived text information and the derived synthesis condition.

DESCRIPTION OF SYMBOLS
1 Speech synthesizer
2 Input unit
3 Storage unit
4, 41, 42 Processing unit
5 Output unit
6 Speaker
10 Match determination unit
11 Speech synthesis unit
12 Language processing unit
13 Prosody generation unit
14 Segment selection unit
15 Waveform generation unit
16 Control unit
17 Automatic update unit
18 Division processing unit
19 Parameter adjustment unit
20 Retention determination unit
21 Editing unit

Claims

An input unit that acquires input text information, including an original sentence from which a synthesized speech signal is to be created, and an input synthesis condition for creating the synthesized speech signal;
A storage unit that stores at least one set of a set of registered text information and a registered synthesis condition and registered intermediate information, the registered intermediate information including registered phonetic information and being generated partway through creating a synthesized speech signal corresponding to the set of the registered text information and the registered synthesis condition;
A match determination unit that determines whether the set of the input text information, or of the input phonetic information represented by the input text information, and the input synthesis condition matches any set of the registered text information or the registered phonetic information and the registered synthesis condition;
A speech synthesis unit that, when the set of the input text information or the input phonetic information and the input synthesis condition matches any set of the registered text information or the registered phonetic information and the registered synthesis condition, creates a synthesized speech signal using the registered intermediate information corresponding to the matching set, and that, when the set matches none of the sets of the registered text information or the registered phonetic information and the registered synthesis condition, creates a synthesized speech signal based on the input text information and the input synthesis condition;
An automatic update unit that, when the set of the input text information or the input phonetic information and the input synthesis condition matches none of the sets of the registered text information or the registered phonetic information and the registered synthesis condition, creates a set of derived text information and a derived synthesis condition by modifying the input text information or the input synthesis condition, and stores in the storage unit, as one set of a set of the registered text information and the registered synthesis condition and the registered intermediate information, the set of the derived text information and the derived synthesis condition together with derived intermediate information that the speech synthesis unit creates by executing the generation of a synthesized speech signal, based on the derived text information and the derived synthesis condition, only up to an intermediate stage;
A speech synthesizer.

The speech synthesizer according to claim 1, wherein the automatic update unit includes a division processing unit that creates a plurality of second original sentences by dividing the original sentence included in the input text information into predetermined units, and creates the derived text information so as to include any one of the plurality of second original sentences.

The speech synthesizer according to claim 2, wherein the predetermined unit is a breath group set in the original sentence.

The speech synthesizer according to claim 2, further comprising a language processing unit that obtains a first position delimiting the original sentence by performing morphological analysis on the original sentence,
wherein the division processing unit creates one of the plurality of second original sentences by dividing the original sentence at the first position, and creates another of the plurality of second original sentences by dividing the original sentence at a second position, different from the first position, at which a punctuation mark can be placed.

The speech synthesizer according to any one of claims 1 to 4, wherein the automatic update unit deletes, from the registered intermediate information, any registered intermediate information that has been used to create a synthesized speech signal fewer times than a predetermined threshold within the most recent predetermined period.

The speech synthesizer according to any one of claims 1 to 4, wherein the input unit acquires an importance of the input text information together with the input text information,
the automatic update unit determines a second importance of the derived intermediate information according to the importance of the input text information, and stores the second importance in the storage unit in association with the set of the derived text information and the derived synthesis condition and the derived intermediate information, and
the automatic update unit deletes, from the registered intermediate information, any registered intermediate information that has been used to create a synthesized speech signal fewer times than a predetermined threshold within the most recent predetermined period and whose second importance is less than a first importance threshold.

Obtaining input text information, including an original sentence from which a synthesized speech signal is to be created, and an input synthesis condition for creating the synthesized speech signal;
Determining whether the set of the input text information, or of the input phonetic information represented by the input text information, and the input synthesis condition matches any of at least one set of registered text information or registered phonetic information and a registered synthesis condition stored in a storage unit;
When the set of the input text information or the input phonetic information and the input synthesis condition matches any set of the registered text information or the registered phonetic information and the registered synthesis condition, creating a synthesized speech signal using registered intermediate information that is stored in the storage unit and was generated partway through creating a synthesized speech signal corresponding to the matching set, and, when the set matches none of the sets of the registered text information or the registered phonetic information and the registered synthesis condition, creating a synthesized speech signal based on the input text information and the input synthesis condition; and
When the set of the input text information or the input phonetic information and the input synthesis condition matches none of the sets of the registered text information or the registered phonetic information and the registered synthesis condition, creating a set of derived text information and a derived synthesis condition by modifying the input text information or the input synthesis condition, and storing in the storage unit, as one set of a set of the registered text information and the registered synthesis condition and the registered intermediate information, the set of the derived text information and the derived synthesis condition together with derived intermediate information created partway through creating a synthesized speech signal based on the derived text information and the derived synthesis condition.
A speech synthesis method.