JP5062178B2 - Audio recording system, audio recording method, and recording processing program

Info

Publication number: JP5062178B2
Application number: JP2008543053A
Authority: JP
Inventors: 康行三井; 玲史近藤; 正徳加藤; 伸一土井
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2006-11-06
Filing date: 2007-11-02
Publication date: 2012-10-31
Anticipated expiration: 2027-11-02
Also published as: WO2008056604A1; JPWO2008056604A1

Description

The present invention relates to an audio recording system and the like, and in particular to an audio recording system, an audio recording method, and a recording processing program suited to building natural-sounding speech synthesis systems adapted to individual specialized fields.

Speech synthesis techniques fall broadly into two approaches: waveform editing and waveform concatenation. In either approach, the quality of the generated synthesized speech depends heavily on the quality of the speech data stored in the speech synthesis database.
In particular, for text-to-speech (TTS) technology, which converts text into speech, the ability to turn arbitrary text into speech is taken for granted, and synthesized speech that sounds human and is easy to listen to is required as well. In recent years, larger databases have made highly natural speech synthesis systems possible.

To achieve high naturalness, however, segment waveforms created from speech signals of the same kind as the synthesized speech to be generated, together with parameters such as pitch patterns and durations, must already exist (be stored) in the speech synthesis database.
For example, if speech signals of newspaper articles read aloud are turned into a speech synthesis database, text written in a style similar to newspaper articles can be synthesized with very high naturalness. When informal text such as colloquial writing is synthesized, however, naturalness degrades markedly because no similar data exists in the database. Enlarging the database to cover colloquial style would improve naturalness, but would require an enormous amount of data.

Because computer resources are finite, however, it is practically impossible to create a speech synthesis database that handles every possible input text and yields uniformly high-quality synthesized speech (see, for example, Non-Patent Document 1).

Many methods for creating speech synthesis databases efficiently have been the subject of patent applications. For example, a method is known for creating a speech segment database capable of generating high-quality synthesized speech for textual expressions of linguistically and acoustically important phrasings (see, for example, Patent Document 1).

According to the technique disclosed in Patent Document 1, phoneme sequences and prosodic features such as pitch pattern, tempo, and pauses are estimated for each text in the text data by morphological analysis and prosody estimation. The acoustic importance of each text is then computed from its phoneme sequence and prosodic features, and a recording list is compiled from the sentences judged to be linguistically and acoustically important. The sentences in the recording list are recorded as speech and given phoneme labels to create the speech synthesis database.
Non-Patent Document 1: Sadaoki Furui, "Digital Speech Processing", Tokai University Press, 1985, pp. 134-148
Patent Document 1: JP 2004-138661 A

However, in the method of creating a speech synthesis database disclosed in Patent Document 1, prosodic features are estimated by morphologically analyzing the text in a text database, so the estimation result depends on the morphological analysis technique used.
For example, a user who uses a speech synthesis system for emergency broadcasts would prefer a system that speaks fairly quickly and articulates clearly. The technique disclosed in Patent Document 1, however, cannot create a speech synthesis database that takes into account the conditions the user's speech synthesis system must actually satisfy (hereinafter, "tasks"). Moreover, it does not envisage being used to improve an existing speech synthesis system. It can therefore be used only to record every sentence in the recording list and build a speech synthesis database from scratch, which is a drawback.

[Object of the invention]
The present invention has been made in view of the above circumstances. Its object is to provide an audio recording system, a method, and a program with which a speech synthesis system can be built that generates highly natural synthesized speech meeting the many requirements of users.

To achieve the above object, an audio recording system according to the present invention records the speech signals needed to create speech data, such as parameters and segment waveforms, in a speech synthesis database. The system comprises a speech synthesis unit that receives text data describing utterance content and generates the desired synthesized speech based on speech data stored in advance in the speech synthesis database, and a quality evaluation unit that evaluates the quality of the synthesized speech based on its feature quantities. The system further comprises an additional data determination unit that determines, based on the result of the quality evaluation by the quality evaluation unit, the speech data to be additionally stored in the speech synthesis database.

With this configuration, the quality evaluation unit evaluates quality based on the feature quantities of the synthesized speech generated by the speech synthesis unit, and the additional data determination unit determines, based on the evaluation result, the speech data to be additionally stored in the speech synthesis database. The quality of synthesized speech produced by an existing speech synthesis system can thus also be evaluated, and the speech data to add when updating that system can be determined easily from the evaluation result.

To achieve the above object, another audio recording system according to the present invention comprises: a speech synthesis database that stores speech data; a quality evaluation unit that analyzes externally input text data together with speech synthesis system information, which defines the processing operations for synthesizing the speech data stored in the speech synthesis database, and determines on that basis whether the speech data stored in the speech synthesis database satisfies a desired quality; and an additional data determination unit that determines, based on the result of the quality evaluation, the speech data to be additionally stored in the speech synthesis database.

With this configuration, the quality evaluation unit analyzes and evaluates the speech synthesis system information that is stored in the speech synthesis database and defines the operation of speech synthesis processing, such as a morphological analysis model and prosodic information. The speech data to be additionally stored when updating an existing speech synthesis system can therefore be determined easily and quickly from the acquired input text data and the existing system's speech synthesis system information, without actually synthesizing speech.

To achieve the above object, an audio recording method according to the present invention records speech signals in order to create speech data to be stored in a speech synthesis database, and comprises: a first step of generating the desired synthesized speech from externally input text data by referring to the speech synthesis database; a second step of evaluating the quality of the synthesized speech based on its feature quantities; and a third step of determining, based on the result of the quality evaluation, the speech data to be additionally stored in the speech synthesis database.

Accordingly, the audio recording system executes the second step, which evaluates quality based on the feature quantities of the synthesized speech generated in the first step, and the third step, which determines from the evaluation result the speech data to be additionally stored in the speech synthesis database. The quality of synthesized speech created with an existing speech synthesis system can thus be evaluated, and the speech data to add when updating that system can be determined easily from the evaluation result.

Also to achieve the above object, another audio recording method according to the present invention records speech signals in order to create speech data to be stored in a speech synthesis database, and comprises: a first step of acquiring externally input text data; a second step of analyzing and evaluating the acquired input text data together with speech synthesis system information, which defines the processing operations for synthesizing the speech data stored in the speech synthesis database, and determining on that basis whether the stored speech data satisfies a desired quality; and a third step of determining, based on the result of this quality evaluation, the speech data to be additionally stored in the speech synthesis database.

Accordingly, the audio recording system executes the second step, which analyzes and evaluates the speech synthesis system information that is stored in the speech synthesis database and defines the operation of speech synthesis processing, such as a morphological analysis model and prosodic information. The speech data to be additionally stored when updating an existing speech synthesis system can therefore be determined easily and quickly from the acquired input text data and the existing system's speech synthesis system information, without actually synthesizing speech.

To achieve the above object, an audio recording processing program according to the present invention is used in an audio recording system that records speech signals in order to create speech data to be stored in a speech synthesis database. The program causes a computer to execute: a speech synthesis function that generates the desired synthesized speech from input text data by referring to the speech synthesis database; a quality evaluation function that evaluates the quality of the synthesized speech based on its feature quantities; and an additional data determination function that determines, based on the result of the quality evaluation, the speech data to be additionally stored in the speech synthesis database.

Accordingly, the computer evaluates quality based on the feature quantities of the synthesized speech and determines from the evaluation result the speech data to be additionally stored in the speech synthesis database. The quality of synthesized speech created with an existing speech synthesis system can thus be evaluated, and the speech data to add when updating that system can be determined easily from the evaluation result.

Also to achieve the above object, another audio recording processing program according to the present invention is used in an audio recording system that records speech signals in order to create speech data to be stored in a speech synthesis database. The program causes a computer to execute: a text data acquisition function that acquires externally input text data; a quality evaluation function that analyzes and evaluates the acquired input text data together with speech synthesis system information, which is stored in the speech synthesis database and defines the speech synthesis processing operations, and determines on that basis whether the speech data stored in the speech synthesis database satisfies a desired quality; and an additional data determination function that determines, based on the result of the quality evaluation, the speech data to be additionally stored in the speech synthesis database.

Accordingly, the computer executes the quality evaluation function, which analyzes and evaluates the speech synthesis system information that is stored in the speech synthesis database and defines the operation of speech synthesis processing, such as a morphological analysis model and prosodic information. The speech data to be additionally stored when updating an existing speech synthesis system can therefore be determined easily and quickly from the acquired input text data and the existing system's speech synthesis system information, without actually synthesizing speech.

According to the present invention, a speech synthesis system can be adapted to the tasks the user wants while its quality is improved. The invention can be applied to existing speech synthesis systems, and it also allows unneeded data to be deleted from the speech synthesis database, which prevents the database from growing excessively. The invention thus provides an audio recording system, an audio recording method, and a recording processing program superior to conventional ones.

[First Embodiment]
A first embodiment of the present invention is described below with reference to the accompanying drawings.
As shown in FIG. 1, the audio recording system according to this embodiment records the speech signals needed to create speech data, such as parameters and segment waveforms, in the speech synthesis database 10A. It comprises a speech synthesis unit 11, which receives externally supplied text data describing utterance content and generates the desired synthesized speech based on the speech data stored in advance in the speech synthesis database 10A, and a quality evaluation unit 12, which evaluates the quality of the synthesized speech based on its feature quantities.

The audio recording system further comprises an additional data determination unit 13, which determines the speech data to be additionally stored in the speech synthesis database 10A based on the result of the quality evaluation by the quality evaluation unit 12, and a recording instruction data generation unit 14, which has a text generation storage unit 14A and, as described later, generates recording instruction data by referring to the generation text data stored in that storage unit. In this embodiment, the speech synthesis database 10A and the speech synthesis unit 11 together constitute a speech synthesis system 10.

The speech synthesis database 10A stores speech data created from recorded speech signals. This speech data includes, for example, various speech parameters and speech segments.

The speech synthesis unit 11 refers to the speech synthesis database 10A, generates the desired synthesized speech (intermediate data) from the externally input text data (input text data), and outputs it to the quality evaluation unit 12. The quality evaluation unit 12 evaluates the quality of the synthesized speech based on the feature quantities of the synthesized speech (intermediate data) and outputs the result to the additional data determination unit 13. The additional data determination unit 13 determines, based on that result, the speech data to be additionally stored in the speech synthesis database 10A and outputs it to the recording instruction data generation unit 14. The recording instruction data generation unit 14 generates recording instruction data that includes text data describing the utterance content of the speech signals to be recorded for the speech data determined by the additional data determination unit 13.

The quality evaluation unit 12 analyzes and evaluates the intermediate data (synthesized speech) that the speech synthesis unit 11 generates when synthesizing speech from the externally input text data, and determines on that basis whether the speech data stored in the speech synthesis database 10A satisfies a desired quality.
Because the quality evaluation unit 12 analyzes and evaluates the intermediate data generated by the speech synthesis unit 11 during synthesis (for example, detailed intermediate data such as phonetic symbols and the selected segment waveforms), the speech data to be additionally stored can be determined from it. Synthesized speech created with an existing speech synthesis system can therefore be evaluated with high accuracy, and the speech data to be additionally stored when updating that system can be determined easily.

The input text data includes style designation data describing the speech style, and the style designation data includes at least one of data indicating a pitch pattern and data indicating a speech rate. The speech synthesis database 10A can store, as speech data, at least one or all of a pitch pattern model, a duration model, and speech segment waveforms. The speech data determined by the additional data determination unit 13 covers pitch pattern parameters, duration parameters, speech segment waveforms, and the like, and may include at least one of these.
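The constraint on the input text data can be illustrated with a small data model. This is only a sketch: the class and field names (`StyleSpec`, `InputText`, `pitch_pattern`, `speech_rate`) and the units in the comments are hypothetical and do not come from the patent, which states only that the style designation data includes at least one of a pitch pattern and a speech rate.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class StyleSpec:
    """Style designation data (hypothetical field names and units)."""
    pitch_pattern: Optional[Tuple[float, ...]] = None  # e.g. F0 contour in Hz (assumed)
    speech_rate: Optional[float] = None                # e.g. morae per second (assumed)

    def __post_init__(self) -> None:
        # The description requires at least one of the two items.
        if self.pitch_pattern is None and self.speech_rate is None:
            raise ValueError("style needs a pitch pattern or a speech rate")


@dataclass(frozen=True)
class InputText:
    """Input text data: utterance content plus style designation data."""
    text: str
    style: StyleSpec


# Example: an emergency-broadcast sentence with only a speech rate specified.
urgent = InputText("Evacuate immediately.", StyleSpec(speech_rate=9.0))
```

Constructing `StyleSpec()` with neither field raises `ValueError`, which mirrors the "at least one" requirement.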

FIG. 2 is a flowchart illustrating the operation of the first embodiment.
The operation of the first embodiment is described below with reference to this flowchart.

First, text data is input from outside and supplied to the speech synthesis unit 11 (step S101). This text data includes at least text describing the utterance content. The speech synthesis unit 11 converts the input text data into a speech waveform using a well-known text-to-speech technique (hereinafter, "conversion to synthesized speech") and outputs it to the quality evaluation unit 12 (step S102). Text-to-speech (TTS) here is a general term for techniques, such as those described in Non-Patent Document 1, that analyze input text, estimate prosody and durations, and output synthesized speech.

The quality evaluation unit 12, having received the synthesized speech output by the speech synthesis unit 11, evaluates the sound quality of the synthesized speech waveform according to a predetermined procedure disclosed in Example 1 (described later) and outputs the result to the additional data determination unit 13 (step S103). The quality evaluation unit 12 compares the evaluation value obtained as the sound quality evaluation result with a preset threshold (step S104), and processing ends if the evaluation value exceeds the threshold. If the evaluation value falls below the threshold, the additional data determination unit 13 operates and determines the speech data to be additionally stored in the speech synthesis database 10A according to a procedure described later (see Example 1 for details) (step S105). Here, "additionally stored speech data" means the data that should be added to the speech synthesis database 10A used by the speech synthesis unit 11 in order to convert the input text into synthesized speech.

Next, the recording instruction data generation unit 14 generates recording instruction data according to a predetermined procedure (see Example 1) (step S106). "Recording instruction data" is data that tells the speaker, at recording time, which sentences or words to utter, or specifies the speaking style, emotion, speaking rate, intonation, and so on. It includes at least text indicating the utterance content.
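The control flow of steps S102 to S106 can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Example 1: the function `run_flow` and its callable parameters are hypothetical names, each unit is passed in as a stand-in function, and an evaluation value exactly equal to the threshold is treated as passing, a case the description leaves open.

```python
from typing import Callable, List, Sequence


def run_flow(
    text: str,
    synthesize: Callable[[str], Sequence[float]],
    evaluate: Callable[[Sequence[float]], float],
    decide_additional: Callable[[str, float], List[str]],
    make_instruction: Callable[[str], dict],
    threshold: float,
) -> List[dict]:
    """Return recording instructions, or [] when the quality is acceptable."""
    waveform = synthesize(text)               # S102: speech synthesis unit 11
    score = evaluate(waveform)                # S103: quality evaluation unit 12
    if score >= threshold:                    # S104: tie treated as pass (assumption)
        return []
    needed = decide_additional(text, score)   # S105: additional data determination unit 13
    return [make_instruction(item) for item in needed]  # S106: unit 14


# Toy stand-ins, for illustration only.
instructions = run_flow(
    "hello",
    synthesize=lambda t: [0.0] * len(t),
    evaluate=lambda w: 0.4,
    decide_additional=lambda t, s: [t],
    make_instruction=lambda item: {"text": item},
    threshold=0.7,
)
```

Passing the units in as callables keeps the sketch independent of any particular synthesizer or evaluation measure. Each instruction here carries only the utterance text, which is the minimum content the description requires.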

According to the first embodiment described above, the quality evaluation unit 12 evaluates quality based on the feature quantities of the synthesized speech generated by the speech synthesis unit 11, and the additional data determination unit 13 determines, based on the evaluation result, the speech data to be additionally stored in the speech synthesis database 10A. Because the waveform of synthesized speech created with the existing speech synthesis system 10 (the speech synthesis database 10A and the speech synthesis unit 11) is evaluated, the speech data to add when updating the existing system can be estimated easily. In addition, the recording instruction data generation unit 14 generates recording instruction data that includes text data describing the utterance content of the speech signals to be recorded for the speech data determined by the additional data determination unit 13, which makes it easier to record speech data when updating the existing speech synthesis system 10.

[Second Embodiment]
Next, a second embodiment is described with reference to FIGS. 3 and 4. Components identical to those of the first embodiment are given the same reference numerals.
As shown in FIG. 3, the audio recording system of the second embodiment is configured in much the same way as in the first embodiment. It comprises the speech synthesis unit 11 and the speech synthesis database 10A constituting the speech synthesis system 10, a quality evaluation unit 22, the additional data determination unit 13, and the recording instruction data generation unit 14 with its text generation storage unit 14A.

In the second embodiment, the quality evaluation unit 22 is configured to receive a predetermined speech signal (speech data) from outside. The quality evaluation unit 22 compares the synthesized speech (intermediate data) produced by the speech synthesis unit 11 with the externally input speech signal, evaluates the degree of agreement of their waveforms and feature quantities, and determines whether a preset desired quality is satisfied. The degree of agreement of the feature quantities is evaluated by calculating an evaluation value and comparing on that basis. This clarifies characteristics of the synthesized speech that remained unclear in the conventional example described above.
The rest of the configuration is the same as in FIG. 1 (first embodiment).

Next, the operation of the second embodiment is described with reference to FIG. 4.
As noted above, the second embodiment differs from the first mainly in the quality evaluation unit 22.
Specifically, the quality evaluation of step S103 in the first embodiment becomes a comparative quality evaluation (step S103a) in the second embodiment, and a speech input (external input of a speech signal to the quality evaluation unit 22) is added as step S107 in parallel with the text input (step S101).
That is, in step S107 of FIG. 4, a speech signal is input separately from the input text.

In this embodiment, the speech signal has the same utterance content as that described in the text data input in step S101. The quality evaluation unit 22 compares the synthesized speech output by the speech synthesis unit 11 with the speech signal acquired in step S107 and calculates an evaluation value from the degree of agreement of their waveforms and feature quantities (step S103a). The feature quantities include the average pitch frequency, pitch pattern, duration, and cepstrum. As one evaluation method, this embodiment arranges multiple feature quantities into vectors and uses a scalar derived from the difference between the vectors as the evaluation value.
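One way to turn the difference between two feature vectors into a scalar is sketched below. The patent does not specify the reduction, so the choices here are assumptions: a weighted Euclidean norm as the distance, optional per-feature `weights` to balance features of different scale (Hz versus seconds, for instance), and the mapping 1/(1+d) so that a higher evaluation value means closer agreement, which matches the "exceeds the threshold" test of step S104. The function names are hypothetical.

```python
import math
from typing import Optional, Sequence


def feature_distance(
    synth: Sequence[float],
    ref: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted Euclidean norm of the difference between two feature vectors."""
    if len(synth) != len(ref):
        raise ValueError("feature vectors must have the same length")
    w = [1.0] * len(synth) if weights is None else list(weights)
    if len(w) != len(synth):
        raise ValueError("weights must match the feature vectors in length")
    return math.sqrt(sum(wi * (s - r) ** 2 for wi, s, r in zip(w, synth, ref)))


def evaluation_value(
    synth: Sequence[float],
    ref: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Map the distance into (0, 1]; 1.0 means the vectors are identical."""
    return 1.0 / (1.0 + feature_distance(synth, ref, weights))
```

With this mapping, identical vectors give an evaluation value of 1.0, and the value falls toward 0 as the synthesized speech departs from the reference recording.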

The other operation steps (S101, S102, and S104 to S106) are the same as the corresponding steps shown in FIG. 2 (first embodiment).

Therefore, the second embodiment shown in FIGS. 3 and 4 has the same operational effects as the first embodiment described above. In addition, because the quality evaluation unit 22 compares the synthesized speech signal generated from the text data with the input speech signal and thereby clarifies the features that the existing speech synthesis system cannot fully express, recording auxiliary data better adapted to the task can be created, an effect not available in the prior art.

[Third Embodiment]
Next, a third embodiment will be described with reference to FIGS. 5 and 6. Components identical to those of the second embodiment described above are denoted by the same reference numerals.

The audio recording system according to the third embodiment shown in FIG. 5 includes, in substantially the same manner as the second embodiment described above, a speech synthesis unit 11 and a speech synthesis database 10A that constitute a speech synthesis system 10, a quality evaluation unit 32, an additional data determination unit 13, and a recording instruction data generation unit 14 having a text generation storage unit 14A.

In the third embodiment, the quality evaluation unit 32 is configured to receive a predetermined speech signal (speech data) from the outside, as shown in FIG. 5. A segmentation extraction unit 15 is provided alongside the quality evaluation unit 32. It extracts segmentation information, which is the information needed to judge the degree of coincidence between the externally input speech signal and the synthesized speech, and sends it to the quality evaluation unit 32. Here, the segmentation information is data that includes at least duration information.

The quality evaluation unit 32 has a function of aligning time information using the segmentation information (segmentation data) described above, comparing the synthesized speech with the input speech signal to evaluate the degree of coincidence of their feature amounts, and determining whether the desired quality is satisfied. That is, using the extracted segmentation information, the quality evaluation unit 32 compares the synthesized speech (intermediate data) produced by the speech synthesis unit 11 with the externally input speech signal, evaluates the degree of coincidence of their waveforms and feature amounts, and determines whether a preset desired quality is satisfied. As a result, characteristics of the synthesized speech that were unclear in the conventional example described above are clarified.
The other components are the same as those of the second embodiment shown in FIG. 3.

Next, the operation of the third embodiment will be described with reference to FIG. 6.
In the flowchart of FIG. 6, the segmentation data extraction unit 15 extracts segmentation information from the speech signal input in step S107 and outputs it to the quality evaluation unit 32 (step S108). Using the extracted segmentation data, the quality evaluation unit 32 compares the synthesized speech signal with the input speech signal and calculates the degree of coincidence of their waveforms and feature amounts (step S103b).
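
One way to picture the segment-wise comparison of step S103b is shown below. It is a sketch under stated assumptions, not the method of the description: segment boundaries are given as hypothetical frame ranges, the compared feature is a per-frame pitch track, and segments of the two signals are paired by index.

```python
def segment_means(values, boundaries):
    """Average a per-frame feature track over each [start, end) frame range."""
    return [sum(values[start:end]) / (end - start) for start, end in boundaries]


def per_segment_difference(synth_track, synth_bounds, ref_track, ref_bounds):
    """Absolute difference of segment means; segments are paired by index."""
    if len(synth_bounds) != len(ref_bounds):
        raise ValueError("both signals must have the same number of segments")
    synth_means = segment_means(synth_track, synth_bounds)
    ref_means = segment_means(ref_track, ref_bounds)
    return [abs(a - b) for a, b in zip(synth_means, ref_means)]


# Hypothetical pitch tracks (Hz per frame) and frame boundaries for two segments.
ref_pitch = [100.0, 100.0, 100.0, 120.0, 120.0]
ref_bounds = [(0, 3), (3, 5)]      # taken from the segmentation information
synth_pitch = [100.0, 100.0, 110.0, 110.0]
synth_bounds = [(0, 2), (2, 4)]    # known to the synthesizer itself
diffs = per_segment_difference(synth_pitch, synth_bounds, ref_pitch, ref_bounds)
```

Because the segments are matched through their boundaries, the two signals may differ in total length and still be compared segment by segment.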

The other operation steps are the same as in the second embodiment shown in FIG. 4. That is, the steps in FIG. 6 that bear the same step numbers as in FIG. 4 (S101, S102, and S104 to S107) perform the same operations as the corresponding steps of FIG. 4 (second embodiment).

Therefore, the third embodiment has the same operational effects as the second embodiment described above. In addition, because the quality evaluation unit 32 uses the segmentation data of the speech signal extracted by the segmentation data extraction unit 15 when comparing the speech signals, a more precise and detailed comparison becomes possible.

[Fourth Embodiment]
Next, a fourth embodiment will be described with reference to FIGS. 7 and 8. Components identical to those of the second embodiment described above are denoted by the same reference numerals.

In FIG. 7, the audio recording system according to the fourth embodiment includes, as in the second embodiment described above, a speech synthesis unit 41 and a speech synthesis database 40A that constitute a speech synthesis system 40, a quality evaluation unit 42, an additional data determination unit 13, a recording instruction data generation unit 14 having a text generation storage unit 14A, and a segmentation extraction unit 45. In the fourth embodiment, the segmentation extraction unit 45 has a function of extracting, from the input speech signal corresponding to the input text data, segmentation information that includes at least a phoneme string associated with time information.

In the fourth embodiment, the quality evaluation unit 42 is configured to receive a predetermined speech signal (speech data) from the outside together with the synthesized speech output from the speech synthesis unit 41, and has functions equivalent to those of the quality evaluation unit 22 in the second embodiment. The segmentation extraction unit 45 is provided alongside the speech synthesis unit 41, so that the segmentation information (also detailed in Example 4), which is information needed for speech synthesis, is sent directly to the speech synthesis unit 41.

Furthermore, the speech synthesis unit 41 has a function of using the segmentation information sent directly from the segmentation extraction unit 45 to generate synthesized speech whose time information corresponds to that of the input speech signal. The other components are the same as in FIG. 3 (second embodiment).

Next, the operation of the fourth embodiment will be described with reference to FIG. 8.
Unlike the configuration of FIG. 3 (second embodiment), the fourth embodiment is equipped with the segmentation data extraction unit 45, so the operation of the speech synthesis unit 41 differs in part from that of the speech synthesis unit 11 described above.

That is, in the flowchart shown in FIG. 8, the segmentation data extraction unit 45 extracts the segmentation information of the input speech signal and outputs it to the speech synthesis unit 41 (step S108a). The speech synthesis unit 41 then synthesizes speech from the input text in accordance with the duration information indicated by the segmentation data (step S102a).
The other operation steps (S101, S103a, and S104 to S107) are the same as those shown in FIG. 4 (second embodiment).

Since the fourth embodiment is configured and functions as described above, it has operational effects equivalent to those of the second embodiment. In addition, because the speech synthesis unit 41 generates synthesized speech that follows the duration information indicated by the segmentation data of the speech signal extracted by the segmentation data extraction unit 45, no duration matching process such as DP (Dynamic Programming) matching is required, and a precise and detailed comparison therefore becomes possible.
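
The benefit of synthesizing with the durations of the input speech can be illustrated with a small sketch. It assumes, purely for illustration, that each segment is represented by one pitch value held constant over its frames; the function names and numbers are hypothetical. Because both tracks end up with the same number of frames, they can be compared frame by frame with no alignment step.

```python
def expand_to_frames(segment_values, durations):
    """Repeat each segment's value for its duration in frames."""
    frames = []
    for value, n_frames in zip(segment_values, durations):
        frames.extend([value] * n_frames)
    return frames


def mean_abs_frame_difference(track_a, track_b):
    """Frame-by-frame comparison; only valid when both tracks are time-aligned."""
    if len(track_a) != len(track_b):
        raise ValueError("tracks differ in length; time alignment would be needed")
    return sum(abs(x - y) for x, y in zip(track_a, track_b)) / len(track_a)


# Frames per segment, as extracted from the input speech (hypothetical values).
durations = [3, 2]
ref_track = [100.0, 100.0, 100.0, 120.0, 120.0]
# The synthesizer follows the same durations, so the lengths match by construction.
synth_track = expand_to_frames([100.0, 110.0], durations)
diff = mean_abs_frame_difference(synth_track, ref_track)
```

If the synthesizer chose its own durations instead, the length check would fail and a time-warping step such as DP matching would have to be run before the comparison.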

[Fifth Embodiment]
Next, a fifth embodiment will be described with reference to FIGS. 9 and 10. Components identical to those of the second embodiment described above are denoted by the same reference numerals.
In FIG. 9, the audio recording system according to the fifth embodiment includes, in substantially the same manner as the second embodiment described above, a speech synthesis unit 11 and a speech synthesis database 10A that constitute a speech synthesis system 10, a quality evaluation unit 22, an additional data determination unit 13, and a recording instruction data generation unit 14 having a text generation storage unit 14A.

Furthermore, a recorded speech signal evaluation unit 16 is provided alongside the recording instruction data generation unit 14. The recorded speech signal evaluation unit 16 has a function of comparing the input speech signal corresponding to the input text data with a newly recorded speech signal and, when the evaluation value obtained from the comparison falls short of a preset value, outputting an instruction to the outside to re-record.
The other components are the same as those of the second embodiment described above.

Next, the operation of the fifth embodiment will be described with reference to FIG. 10.
In the fifth embodiment, as shown in the flowchart of FIG. 10, a speaker performs the recording work using the recording instruction data generated by the recording instruction data generation unit 14, and recorded speech signals are collected (step S109). The speaker is preferably the same speaker whose recorded speech signals were used to create the speech synthesis database 10A of the speech synthesis unit 11.

Subsequently, the recorded speech signal evaluation unit 16 evaluates the recorded speech signal, using the speech signal input as the request as the reference (step S110). The evaluation value is then compared with a threshold (step S111). If the evaluation value exceeds the threshold, the process ends. If the evaluation value falls below the threshold, the recorded speech signal evaluation unit 16 instructs the speaker to re-record, and the process returns to step S109 for re-recording.
The other operations are the same as the operation steps (S101, S103a, and S104 to S107) in FIG. 4 (second embodiment).
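
The loop formed by steps S109 to S111 can be sketched as follows. The callables `record` and `evaluate` are hypothetical stand-ins for the speaker's recording session and the recorded speech signal evaluation unit 16, and `max_attempts` is an assumption added so that the sketch always terminates; the description itself does not limit the number of retakes.

```python
def record_until_acceptable(record, evaluate, threshold, max_attempts=5):
    """Repeat recording (S109) until the evaluation value (S110) exceeds the threshold (S111).

    Returns (recording, attempts), or (None, max_attempts) if no take was accepted.
    The description does not say what happens when the value equals the threshold;
    this sketch treats equality as a failed take.
    """
    for attempt in range(1, max_attempts + 1):
        recording = record()
        if evaluate(recording) > threshold:
            return recording, attempt
    return None, max_attempts


# Hypothetical session: each "recording" is represented directly by its score.
takes = iter([0.4, 0.6, 0.9])
recording, attempts = record_until_acceptable(
    record=lambda: next(takes),
    evaluate=lambda take: take,
    threshold=0.8,
)
```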

Since the fifth embodiment is configured and functions as described above, it has the same operational effects as the second embodiment. In addition, because the recorded speech signal evaluation unit 16 allows recording to be repeated until a recorded speech signal whose evaluation value exceeds a fixed value is obtained, recorded speech signals of high sound quality that are better suited to the task can be collected.

[Sixth Embodiment]
Next, a sixth embodiment will be described with reference to FIGS. 11 and 12. Components identical to those of the first embodiment described above are denoted by the same reference numerals.
The sixth embodiment is for creating the speech recording instruction data in the audio recording system.

In FIG. 11, the sixth embodiment includes a speech synthesis database 60A, a quality evaluation unit 62, an additional data determination unit 13, and a recording instruction data generation unit 14 having a text generation storage unit 14A, which function in substantially the same way as the corresponding components disclosed in the first embodiment (FIG. 1).

In the audio recording system according to the sixth embodiment, the content of the data stored in the speech synthesis database 60A and the operation of the quality evaluation unit 62 differ from those of the first embodiment. Here, the speech synthesis database 60A stores speech synthesis system information. The speech synthesis system information is information that defines the operation of the speech synthesis process, and includes at least one of a morphological analysis model, a language dictionary, an accent dictionary, prosodic information, synthesis rule information, and segment waveform information.
The other components are the same as in the first embodiment (FIG. 1).

Next, the operation of the sixth embodiment will be described with reference to FIG. 12.
First, in the flowchart of FIG. 12, the quality evaluation unit 62 analyzes the speech synthesis system information stored in the speech synthesis database 60A and calculates an evaluation value (step S103c). The operation steps after the evaluation value is calculated (S104 to S106) are the same as in FIG. 2 (first embodiment). The other operation steps in FIG. 12, starting from step S101 (S101 and S104 to S106), are likewise the same as the corresponding steps shown in FIG. 2 (first embodiment).

As described above, according to the sixth embodiment, the quality evaluation unit 62 analyzes the speech synthesis system information stored in the speech synthesis database 60A and calculates the evaluation value. The recording instruction data is therefore created from the existing speech synthesis system information and the text data input by the user, without going through the speech synthesis process performed by the speech synthesis unit 11 as in FIG. 2. This makes it possible to create recording instruction data at high speed, an effect not available in the prior art.

[Seventh Embodiment]
Next, a seventh embodiment will be described with reference to FIGS. 13 and 14. Components identical to those of the first embodiment described above are denoted by the same reference numerals.

In FIG. 13, the seventh embodiment includes components that function in substantially the same way as those disclosed in the first embodiment (FIG. 1), namely a speech synthesis unit 11 and a speech synthesis database 10A that constitute a speech synthesis system 10, a quality evaluation unit 72, an additional data determination unit 13, and a recording instruction data generation unit 14 having a text generation storage unit 14A.
In the audio recording system according to the seventh embodiment, the content of part of the data stored in the speech synthesis database 10A and the operation of the quality evaluation unit 72 differ in part from those of the first embodiment.

Here, the quality evaluation unit 72 has a function of taking in and analyzing the intermediate data used when the speech synthesis unit 11 converts the text into synthesized speech, and calculating an evaluation value. The "intermediate data" is information generated in producing the synthesized speech signal, and includes at least one of a phonetic symbol string, duration information for each segment, pitch pattern information, and the selected segment waveforms. One way to calculate the evaluation value is to use the score obtained in the score calculation for unit selection.
The other components are the same as in the first embodiment (FIG. 1).
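
As a rough illustration of using a unit selection score as the evaluation value, the sketch below assumes the common formulation in which the score of a selected unit sequence is the sum of per-unit target costs and per-boundary join costs. The description does not define the score, so this formulation, the function name, and the cost values are all assumptions.

```python
def unit_selection_score(target_costs, join_costs):
    """Total cost of a selected unit sequence.

    target_costs: one cost per selected unit (mismatch with the desired target).
    join_costs:   one cost per adjacent pair of units (discontinuity at the join).
    A lower total suggests the database covers the text well.
    """
    if len(join_costs) != max(len(target_costs) - 1, 0):
        raise ValueError("expected one join cost per adjacent unit pair")
    return sum(target_costs) + sum(join_costs)


# Hypothetical costs for three units; the second join is poor.
score = unit_selection_score([1.0, 2.0, 0.5], [0.25, 4.0])
```

Inspecting the individual terms, rather than only the total, would indicate where the database is weak, for example at a join with an unusually high cost.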

Next, the operation of the seventh embodiment will be described with reference to the flowchart of FIG. 14.
In the audio recording system according to the seventh embodiment, as shown in FIG. 14, the quality evaluation unit 72 analyzes the intermediate data used when the speech synthesis unit 11 converts the text into synthesized speech, and calculates an evaluation value (step S103d). The operations after the evaluation value is calculated (S104 to S106) are the same as in the first embodiment shown in FIG. 2.
The other operation steps in FIG. 14, starting from step S101, are the same as the corresponding operation steps (S101 and S104 to S106) of the first embodiment (FIG. 2).

As described above, according to the seventh embodiment, the quality evaluation unit 72 evaluates the detailed intermediate data produced during speech synthesis, and the additional data determination unit 13 determines the additional data using this evaluation value, so more detailed recording instruction data can be created.

[Eighth Embodiment]
Next, an eighth embodiment will be described with reference to FIGS. 15 and 16. Components identical to those of the fifth embodiment described above are denoted by the same reference numerals.
The eighth embodiment shown in FIG. 15 includes, in substantially the same manner as the fifth embodiment described above, a speech synthesis unit 11 and a speech synthesis database 10A that constitute a speech synthesis system 10, a quality evaluation unit 82, an additional data determination unit 13, and a recording instruction data generation unit 14 having a text generation storage unit 14A.

A recorded speech signal evaluation unit 16 is also provided alongside the recording instruction data generation unit 14. The recorded speech signal evaluation unit 16 has a function of comparing the input speech signal corresponding to the input text data with a newly recorded speech signal and, when the evaluation value obtained from the comparison falls short of a preset value, outputting an instruction to the outside to re-record.

Furthermore, a speech synthesis system update unit 88 is provided alongside the speech synthesis unit 11. The speech synthesis system update unit 88 takes in, via the recorded speech signal evaluation unit 16, the speech signal recorded using the recording instruction data from the recording instruction data generation unit 14, and has a function of using this speech signal to update the speech data for the speech synthesis system stored in the speech synthesis database 10A.
The other components are the same as those of the fifth embodiment (FIG. 9).

Next, the operation of the eighth embodiment will be described with reference to the flowchart of FIG. 16.
In the audio recording system according to the eighth embodiment, as shown in FIG. 16, the speech synthesis system update unit 88 examines the recorded data evaluation value using the speech signal recorded in step S109 and, based on the result, updates the speech synthesis system 10 including the speech synthesis database 10A (step S112).
The other operation steps in FIG. 16, starting from step S101, are the same as the corresponding operation steps (S101, S102, S103a, and S104 to S111) of the fifth embodiment (FIG. 10).

As described above, the eighth embodiment has operational effects equivalent to those of the fifth embodiment. In addition, the speech synthesis system update unit 88 updates the speech synthesis system including the speech synthesis database 10A using the recorded speech signals, and iterative processing is then performed using the updated speech synthesis system 10, so the quality of the synthesized speech and its fitness for the task can be raised even more effectively.

[Ninth Embodiment]
Next, a ninth embodiment will be described with reference to FIGS. 17 and 18. Components identical to those of the eighth embodiment described above are denoted by the same reference numerals.
The ninth embodiment shown in FIG. 17 includes, in substantially the same manner as the eighth embodiment described above, a speech synthesis unit 11 and a speech synthesis database 10A that constitute a speech synthesis system 10, a quality evaluation unit 82, an additional data determination unit 13, and a recording instruction data generation unit 14 having a text generation storage unit 14A.

A recorded speech signal evaluation unit 96, which functions in substantially the same way as the recorded speech signal evaluation unit 16 of the eighth embodiment, is provided alongside the recording instruction data generation unit 14. Furthermore, an unnecessary data determination unit 19 is provided alongside the quality evaluation unit 82. Based on the evaluation result of the quality evaluation unit 82, it estimates the data that the speech synthesis database does not need and determines the unnecessary speech data to be deleted from the speech synthesis database.
Here, pitch patterns, durations, segment waveforms, and the like are examined as candidates for unnecessary data.

Furthermore, a speech synthesis system update unit 98 is provided alongside the speech synthesis unit 11. It updates the speech synthesis data stored in the speech synthesis database based on the unnecessary data determined by the unnecessary data determination unit 19.
The other components are the same as those of the eighth embodiment (FIG. 15).

Next, the operation of the ninth embodiment will be described with reference to the flowchart of FIG. 18.
In the audio recording system according to the ninth embodiment, as shown in FIG. 18, the unnecessary data determination unit 19 first determines, based on the evaluation result of the quality evaluation unit 82, the unnecessary data to be deleted from the speech synthesis database 10A, and outputs it to the speech synthesis system update unit 98 (step S113). The speech synthesis system update unit 98 updates the speech synthesis system including the speech synthesis database 10A by adding the speech signal recorded in step S109 and deleting the data indicated by the unnecessary data determined by the unnecessary data determination unit 19 (step S112a).
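
The add-and-delete update of step S112a might look like the following sketch, which models the database as a plain dictionary from a label to stored data. That representation, the function name `update_database`, and the sample labels are assumptions made only for illustration.

```python
def update_database(database, recorded_units, unnecessary_labels):
    """Return a new database with unnecessary entries removed and new recordings added.

    Deletion is applied first, so a newly recorded unit is kept even if an older
    entry with the same label was marked unnecessary. The input is not modified.
    """
    updated = {
        label: unit
        for label, unit in database.items()
        if label not in unnecessary_labels
    }
    updated.update(recorded_units)
    return updated


# Hypothetical contents: label -> segment waveform samples.
database = {"a": [0.1], "b": [0.2], "c": [0.3]}
recorded = {"d": [0.4]}          # collected in step S109
unnecessary = {"b"}              # decided in step S113
updated = update_database(database, recorded, unnecessary)
```

Since one entry is removed for the one that is added, the database in this example stays the same size, which is the effect the embodiment aims for.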

The other operation steps (S101, S102, S103a, S104 to S107, and S109 to S111) are the same as the corresponding operation steps of the eighth embodiment (FIG. 16).

As described above, the ninth embodiment has operational effects equivalent to those of the eighth embodiment. In addition, because the speech synthesis system update unit 98 both adds data to the speech synthesis database 10A of the speech synthesis system 10 and deletes unnecessary data from it, the speech synthesis database 10A can be prevented from becoming bloated.

Next, examples corresponding to the embodiments described above will be described.
[Example 1]
Example 1 corresponds to the first embodiment (FIGS. 1 and 2) of the present invention and is shown in FIG. 19. In FIG. 19, components identical to those in FIG. 1 are denoted by the same reference numerals.

Example 1 shown in FIG. 19 includes the components of the first embodiment shown in FIG. 1, namely a speech synthesis unit 11 and a speech synthesis database 10A that constitute a speech synthesis system 10, a quality evaluation unit 12, an additional data determination unit 13, and a recording instruction data generation unit 14 having a text generation storage unit 14A. Furthermore, a sentence generation text database 17 is provided alongside the recording instruction data generation unit 14.

The speech synthesis unit 11 acquires the text data that serves as the audio recording request and converts it into synthesized speech. Example 1 adopts waveform connection type speech synthesis as the speech synthesis method. Here, waveform connection type speech synthesis is a general term for speech synthesis methods in which a large number of segment waveforms, together with phonological information called labels corresponding to each segment waveform, are stored in advance in the speech synthesis database 10A, and speech is generated by connecting the segment waveforms.
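
The core idea of waveform connection type synthesis, looking up stored segment waveforms by label and joining them, can be sketched as below. The database layout (a dictionary from label to a list of samples), the labels, and the sample values are hypothetical, and real systems also select among multiple candidates per label and smooth the joins, which this sketch omits.

```python
def concatenate_units(labels, unit_db):
    """Build a waveform by joining the stored segment waveform for each label in order."""
    waveform = []
    for label in labels:
        if label not in unit_db:
            # A missing label is exactly the kind of gap that additional recording fills.
            raise KeyError(f"no segment waveform stored for label {label!r}")
        waveform.extend(unit_db[label])
    return waveform


# Hypothetical database: label -> waveform samples.
unit_db = {"u": [0.0, 0.1], "n": [0.2], "a": [0.3, 0.4]}
waveform = concatenate_units(["u", "n", "n", "a"], unit_db)
```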

The quality evaluation unit 12 evaluates the synthesized speech output from the speech synthesis unit 11 and calculates numerical evaluation data. The evaluation criteria include the smoothness at the connection points of segment waveforms, the naturalness of the prosody (for example, whether the prosody changes abruptly), and the divergence of the spectrum or waveform from standard phonemes.

Based on the evaluation data output from the quality evaluation unit 12, the additional data determination unit 13 estimates which speech data is lacking for synthesizing the externally input text data and determines the speech data to be additionally stored in the speech synthesis database 10A. Specifically, the speech data to be additionally stored includes pitch patterns, durations, and segment waveforms.

Next, a specific method for determining the additional data will be described with reference to FIG. 20.
As shown in FIG. 20(A), suppose, for example, that the text 「中国の雲南省に住んでいます」 ("I live in Yunnan Province, China") has been converted into synthesized speech and the synthesized sound has been output.

In this case, the synthesized sound is evaluated by the quality evaluation unit 12, and a duration evaluation value, a pitch pattern evaluation value, and a waveform connection evaluation value are calculated for each accent phrase. Suppose that for the first accent phrase 「中国の」 ("China's") and the third accent phrase 「住んでいます」 ("live in") all three evaluation values are "10", whereas for the second accent phrase 「雲南省に」 ("in Yunnan Province") the duration evaluation value is "80", the pitch pattern evaluation value is "65", and the waveform connection evaluation value is "20".

This indicates that, for the accent phrase 「雲南省に」, the duration and pitch pattern can be expressed with data already present in the speech synthesis database 10A, but no segment waveform that allows the segments to be connected smoothly exists in the speech synthesis database 10A.
If the threshold for the evaluation values is set to "50", the additional data determination unit 13 determines that the speech data to be additionally stored is a segment waveform, and starts the recording instruction data generation unit 14 so that waveform data of the utterance 「うんなんしょーに」 (unnanshō ni) is recorded as the additional stored speech data.
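The threshold comparison for the second accent phrase can be sketched as follows. This is a minimal illustration, not the patented implementation: the class and function names are hypothetical, and it assumes that an evaluation value below the threshold marks the corresponding data type as missing from the database.

```python
from dataclasses import dataclass

THRESHOLD = 50  # threshold value used in the example above


@dataclass
class PhraseScores:
    """Evaluation values for one accent phrase (hypothetical container)."""
    text: str
    duration: int
    pitch_pattern: int
    waveform_connection: int


def data_to_add(scores, threshold=THRESHOLD):
    """Return the data types whose evaluation value is below the threshold.

    Assumption: a low value means the database lacks suitable data.
    """
    candidates = {
        "duration": scores.duration,
        "pitch pattern": scores.pitch_pattern,
        "segment waveform": scores.waveform_connection,
    }
    return [name for name, value in candidates.items() if value < threshold]


phrase = PhraseScores("雲南省に", duration=80, pitch_pattern=65,
                      waveform_connection=20)
missing = data_to_add(phrase)
print(missing)
```

Only the waveform connection value (20) falls below 50, so the sketch selects the segment waveform as the data to record.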

The recording instruction data generation unit 14 looks up the sentence generation text database 17 using the additional data determined by the additional data determination unit 13, and generates recording instruction data from the sentence generation text that is output. The recording instruction data includes at least an utterance list describing the utterance content to be recorded; text indicating the utterance style and charts indicating the correct prosody may also be added. An example of the utterance list format is shown in FIG. 20(B).

The externally input text data may include, in addition to utterance content text data describing what is to be spoken, auxiliary text data specifying speech rate, emotion, intonation, and the like. In this case, the auxiliary text data may be used by the speech synthesis unit 11 when synthesizing the utterance content text data, or may be used as recording instruction data.
In the example above, the recording instruction data generation unit 14 generates meaningful sentences using the sentence generation text database 17, but meaningless sentences or meaningless words may be generated as the recording sentence list instead. The text of the speech synthesis request may also be used as the recording sentence list as is, or the text of the speech synthesis request may be mixed with text from the sentence generation text database 17.

Furthermore, the recording instruction data generation unit 14 can estimate the total duration of the speech signals to be recorded from the utterance list in the recording instruction data, and change the sufficiency rate of the additional stored speech data accordingly. For example, with a sufficiency rate of 100 [%] the total duration of the speech signals may become very long, so that the speech data grows substantially when added to the speech synthesis database 10A in the speech synthesis system.
In such a case, accepting a sufficiency rate of, for example, 60 [%] prevents a large increase in speech data. The remaining 40 [%] of the additional data must then be dropped. Methods for deciding which data to drop include selecting sentences from the sentence generation text database 17 so that as much additional data as possible falls within the same sentence, and estimating the importance of the data from, for example, the frequency of requests.
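One way to picture the request-frequency approach is the sketch below. It is only an illustration under stated assumptions: the field names `requests` and `seconds`, the candidate values, and the rule of keeping the most frequently requested items are all hypothetical, chosen to show how a 60 [%] sufficiency rate shortens the total recording time.

```python
def select_additional_data(items, sufficiency_percent):
    """Keep the most frequently requested items up to the sufficiency rate."""
    keep_count = len(items) * sufficiency_percent // 100
    ranked = sorted(items, key=lambda item: item["requests"], reverse=True)
    return ranked[:keep_count]


def total_seconds(items):
    """Estimated total duration of the speech to be recorded."""
    return sum(item["seconds"] for item in items)


# Hypothetical candidates: request frequency and estimated recording time.
candidates = [
    {"unit": "A", "requests": 12, "seconds": 4.0},
    {"unit": "B", "requests": 3, "seconds": 6.0},
    {"unit": "C", "requests": 8, "seconds": 5.0},
    {"unit": "D", "requests": 1, "seconds": 7.0},
    {"unit": "E", "requests": 5, "seconds": 3.0},
]

kept = select_additional_data(candidates, 60)
kept_units = [item["unit"] for item in kept]
print(kept_units, total_seconds(kept), total_seconds(candidates))
```

With these made-up numbers, three of the five candidates are kept and the estimated recording time drops from 25 to 12 seconds.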

[Example 2]
Example 2 corresponds to the second embodiment of the present invention (FIGS. 3 and 4) and is shown in FIG. 21. In FIG. 21, components identical to those in FIG. 3 are denoted by the same reference numerals.

Example 2, shown in FIG. 21, includes the components of the second embodiment shown in FIG. 3: the speech synthesis unit 11 and the speech synthesis database 10A, which constitute the speech synthesis system 10; the quality evaluation unit 22; the additional data determination unit 13; and the recording instruction data generation unit 14, which has the text generation storage unit 14A. A sentence generation text database 17 is also attached to the recording instruction data generation unit 14.

Text data and a speech signal are assumed to be input as the speech synthesis request. The text data and the speech signal should correspond to each other and preferably have the same content. Example 2 is therefore described on the assumption that they have the same content.

The quality evaluation unit 22 compares the synthesized speech with the corresponding speech signal and calculates the quality evaluation value 12a from the comparison of their features. Features that may be compared include the fundamental frequency (F0) pattern, spectrum, waveform, and duration, and the value calculated may be a similarity measure such as the squared distance. Here, the similarities of F0, spectrum, and duration are assumed to have been calculated.
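The squared distance mentioned above can be computed per feature as in the following sketch. The feature values are invented for illustration, and the sketch assumes the two feature sequences have already been aligned to the same length; a smaller distance means the synthesized speech is closer to the reference signal.

```python
def squared_distance(a, b):
    """Sum of squared differences between two aligned feature sequences."""
    if len(a) != len(b):
        raise ValueError("feature sequences must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical, already aligned features (F0 in Hz, duration in msec).
synthesized = {"f0": [120.0, 130.0, 125.0], "duration": [80.0, 60.0]}
reference = {"f0": [118.0, 135.0, 125.0], "duration": [90.0, 60.0]}

distances = {
    name: squared_distance(synthesized[name], reference[name])
    for name in synthesized
}
print(distances)
```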

The additional data determination unit 13 determines the additional data by comparing the numerical comparison results output from the quality evaluation unit 22 with a preset threshold. The recording instruction data generation unit 14 then generates recording instruction data based on the additional stored speech data determined by the additional data determination unit 13.
As in Example 1, the recording instruction data may take the form of an utterance list made of text describing the utterance content, or the input speech signal may be presented as a model for the speaker to imitate. Both the text and the speech signal may of course be used, and a chart indicating the prosody may be added.

[Example 3]
Example 3 corresponds to the third embodiment (FIGS. 5 and 6) and the fourth embodiment (FIGS. 7 and 8) of the present invention and is shown in FIGS. 22(A) and 22(B). In FIG. 22(A), components identical to those in FIG. 5 or FIG. 7 are denoted by the same reference numerals.

Example 3, shown in FIGS. 22(A) and 22(B), includes the components of the third and fourth embodiments: the speech synthesis unit 11 and the speech synthesis database 10A, which constitute the speech synthesis system 10; the quality evaluation unit 32 (42); the additional data determination unit 13; the recording instruction data generation unit 14, which has the text generation storage unit 14A; and the segmentation data extraction unit 15. A sentence generation text database 17 is also attached to the recording instruction data generation unit 14. As in Example 2, text data and a speech signal are assumed to be input.

The segmentation data extraction unit 15 (45) extracts, as segmentation data, the duration of each phoneme in the speech signal. Extraction methods include segmentation by an HMM (hidden Markov model) in DP matching. One possible format for the segmentation data, shown in FIG. 22(B), lists the duration of each phoneme (ch, u1, ...) in milliseconds.
As described for the third embodiment (see FIG. 5), when the segmentation data is output to the quality evaluation unit 32, it can be used to align phonemes for the comparison.

As described for the fourth embodiment (see FIG. 7), when the segmentation data is output to the speech synthesis unit 41, synthesized speech is generated that faithfully reproduces the duration of each phoneme given in the segmentation data. Because the resulting synthesized speech matches the speech signal in the duration of every phoneme, the quality evaluation unit 32 no longer needs to perform phoneme matching.
The segmentation data extracted by the segmentation data extraction unit 15 (45) may be output to both the speech synthesis unit 11 and the quality evaluation unit 32.
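Segmentation data of the kind described for FIG. 22(B), a list of phonemes with durations in msec, can be turned into start and end times for phoneme alignment. The sketch below is only an illustration: the duration values and the third phoneme label are hypothetical, and the exact layout of FIG. 22(B) is not reproduced here.

```python
from itertools import accumulate

# Hypothetical segmentation data: (phoneme, duration in msec).
segmentation = [("ch", 80), ("u1", 60), ("g", 45)]


def to_intervals(seg):
    """Convert per-phoneme durations into (phoneme, start, end) in msec."""
    ends = list(accumulate(duration for _, duration in seg))
    starts = [0] + ends[:-1]
    return [(phoneme, start, end)
            for (phoneme, _), start, end in zip(seg, starts, ends)]


intervals = to_intervals(segmentation)
print(intervals)
```

Intervals like these give the time information needed either to align phonemes during the comparison or to constrain phoneme durations during synthesis.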

[Example 4]
Example 4 corresponds to the sixth embodiment shown in FIG. 11 and is shown in FIG. 23. In FIG. 23, components identical to those in FIG. 11 are denoted by the same reference numerals.

Example 4, shown in FIG. 23, includes the components of the sixth embodiment: the speech synthesis database 60A, the quality evaluation unit 62, the additional data determination unit 13, and the recording instruction data generation unit 14, which has the text generation storage unit 14A. A sentence generation text database 17 is also attached to the recording instruction data generation unit 14. The speech synthesis database 60A stores speech synthesis system information such as synthesis rules, pitch pattern models, duration models, and segment waveforms.

The quality evaluation unit 62 analyzes the speech synthesis system information stored in the speech synthesis database 60A on the basis of the externally input text data and calculates an evaluation value. Here, syllable strings that entail quality degradation are extracted from the segment waveforms as the speech synthesis system information.

For example, suppose the text data of the speech synthesis request contains the word 「イルクーツク」 (Irkutsk). Suppose also that the speech synthesis database 60A contains every syllable of 「いるくーつく」 but does not contain the syllable string 「るくーつ」. To render 「いるくーつく」 with an existing speech synthesis system that includes this database, the syllables 「る」, 「くー」, and 「つ」 must be taken from different surrounding contexts and combined. The syllable string is therefore identified as one that entails quality degradation.

Accordingly, the additional data determination unit 13 selects segment waveforms containing the syllable string 「るくーつ」 as additional stored speech data, and the recording instruction data generation unit 14 generates recording instruction data for recording a speech signal that contains the syllable string 「るくーつ」.
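The coverage check in this example can be sketched as a search for syllable strings that never occur contiguously in the stored utterances. The database contents, the function names, and the window length of three syllables are assumptions made for illustration only.

```python
def contains_sequence(utterance, target):
    """True if `target` occurs as a contiguous run inside `utterance`."""
    n = len(target)
    return any(utterance[i:i + n] == target
               for i in range(len(utterance) - n + 1))


def missing_sequences(word, database, n):
    """Return the n-syllable strings of `word` absent from every utterance."""
    missing = []
    for i in range(len(word) - n + 1):
        sequence = word[i:i + n]
        if not any(contains_sequence(u, sequence) for u in database):
            missing.append(sequence)
    return missing


word = ["い", "る", "くー", "つ", "く"]  # 「いるくーつく」
# Hypothetical stored utterances: every syllable exists, but never
# 「る」「くー」「つ」 in a row.
database = [
    ["い", "る", "くー", "ん"],
    ["あ", "くー", "つ", "く"],
]

missing = missing_sequences(word, database, 3)
print(missing)
```

With this made-up database, only 「るくーつ」 is reported, which matches the syllable string singled out for recording in the text.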

[Example 5]
Next, Example 5 will be described with reference to FIG. 24. Example 5 corresponds to the seventh embodiment shown in FIG. 13. In FIG. 24, components identical to those in FIG. 13 are denoted by the same reference numerals.

Example 5, shown in FIG. 24, includes the components of the seventh embodiment: the speech synthesis unit 11 and the speech synthesis database 10A, which constitute the speech synthesis system 10; the quality evaluation unit 72; the additional data determination unit 13; and the recording instruction data generation unit 14, which has the text generation storage unit 14A. A sentence generation text database 17 is attached to the recording instruction data generation unit 14.
The quality evaluation unit 72 extracts the intermediate data used when the speech synthesis unit 11 converted the input text data into synthesized speech. Here, the amount of F0 modification, the pitch pattern, and the number of candidate segment waveforms are assumed to be extracted as intermediate data.

The intermediate data is now described in detail, taking the amount of F0 modification as an example.
It is generally known that the sound quality of synthesized speech degrades when F0 is modified by a large amount. Suppose, for example, that the original average F0 of speech segment U1 was 150 [Hz] and that it has been changed to 300 [Hz]. Suppose also that the additional data determination unit 13 has the rule "speech segments whose F0 has been modified by a factor of 1.5 or more are treated as additional stored speech data".

In this case the modification ratio of speech segment U1 is 2, so U1 becomes a target for additional stored speech data, and the recording instruction data generation unit 14 generates recording instruction data for recording a speech signal with the same utterance content as U1 and an average F0 of 300 [Hz]. One way of giving the instruction, when a single syllable is to be uttered, is to present a 300 [Hz] tone as the model speech signal.
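The rule for speech segment U1 amounts to a simple ratio test, sketched below. The function name is hypothetical, and treating a downward shift symmetrically (dividing the larger F0 by the smaller) is an assumption of this sketch; the text only gives an upward example.

```python
RATIO_LIMIT = 1.5  # rule from the example: factor of 1.5 or more


def needs_recording(original_f0, modified_f0, limit=RATIO_LIMIT):
    """True if F0 was modified by a factor of `limit` or more.

    Assumption: raising and lowering F0 are treated symmetrically.
    """
    ratio = max(original_f0, modified_f0) / min(original_f0, modified_f0)
    return ratio >= limit


u1_ratio = 300 / 150
u1_is_target = needs_recording(150, 300)
print(u1_ratio, u1_is_target)
```

For U1 the ratio is 2.0, which exceeds 1.5, so the segment is marked for recording at the target F0 of 300 [Hz].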

[Example 6]
Next, Example 6 will be described. Example 6 corresponds to the fifth embodiment (FIG. 9) and the eighth embodiment (FIG. 15), and its details are shown in FIG. 25.
In FIG. 25, components identical to those in FIG. 9 or FIG. 15 are denoted by the same reference numerals.

Example 6, shown in FIG. 25, includes the speech synthesis unit 11 and the speech synthesis database 10A, which constitute the speech synthesis system 10; the quality evaluation unit 22 (82); the additional data determination unit 13; and the recording instruction data generation unit 14, which has the text generation storage unit 14A. A recorded speech signal evaluation unit 16 is attached to the recording instruction data generation unit 14, and a speech synthesis system update unit 88 is attached to the speech synthesis unit 11.

The recorded speech signal evaluation unit 16 compares the recorded speech signal with the input speech signal to evaluate how well the recorded speech signal fits the speech synthesis request represented by the input speech signal. As in Example 2 (FIG. 21), the evaluation may be based on distances between features such as the F0 pattern, spectrum, waveform, and duration.

A threshold for the evaluation value is assumed to be set in advance in the recorded speech signal evaluation unit 16. If the evaluation value is lower than the threshold, an instruction to re-record is output to the speaker, for example by displaying the utterance list to be re-recorded on a monitor. The re-recorded speech signal is evaluated again by the recorded speech signal evaluation unit 16 and is recorded once more if the evaluation value is still below the threshold. If the evaluation value is higher than the threshold, the recorded speech signal is output to the speech synthesis system update unit 88.
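The evaluate-and-re-record cycle can be sketched as a loop. This is a simplified illustration: `record` and `evaluate` are hypothetical callables standing in for the recording equipment and the recorded speech signal evaluation unit 16, the attempt limit is an added safeguard not mentioned in the text, and accepting a value equal to the threshold is an assumption.

```python
def record_until_accepted(record, evaluate, threshold, max_attempts=3):
    """Record repeatedly until the evaluation value reaches the threshold.

    Returns (signal, attempts) on success or (None, max_attempts) if no
    take was accepted. `max_attempts` is an assumption of this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        signal = record()
        if evaluate(signal) >= threshold:
            return signal, attempt
    return None, max_attempts


# Simulated takes and evaluation values (made up for illustration).
takes = iter(["take1", "take2", "take3"])
scores = iter([40, 55, 90])

signal, attempts = record_until_accepted(
    record=lambda: next(takes),
    evaluate=lambda _signal: next(scores),
    threshold=50,
)
print(signal, attempts)
```

In the simulated run the first take scores 40 and is rejected, and the second take scores 55 and is passed on for the database update.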

The speech synthesis system update unit 88 updates the speech data in the speech synthesis database 10A using the recorded speech signal it receives. The speech synthesis database 10A is assumed to store synthesis rules, pitch pattern models, duration models, segment waveforms, and the like, and the system is assumed to be configured to add segment waveforms and pitch patterns as the additional stored speech data. In this case, a pitch pattern and segment waveforms are created from the recorded speech signal.

They may be created manually or automatically. Pitch patterns can be created automatically by, for example, automatic F0 extraction using the autocorrelation method, and segment waveforms by, for example, HMM-based segmentation followed by cutting out the segments. The speech synthesis system update unit 88 can then update the speech synthesis database 10A with the segment waveforms and pitch patterns created in this way.
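As a rough illustration of F0 extraction by the autocorrelation method, the sketch below estimates the F0 of a single frame by picking the lag with the largest autocorrelation. It is a bare-bones version under stated assumptions: the search range of 80 to 500 Hz, the 12 kHz sample rate, and the synthetic 300 Hz test tone are chosen for the example, and practical extractors add windowing, normalization, and voicing decisions.

```python
import math


def estimate_f0(samples, sample_rate, f0_min=80.0, f0_max=500.0):
    """Estimate F0 of one frame from the peak of its autocorrelation."""
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag, best_value = None, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag.
        value = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


# Synthetic test frame: a 300 Hz sine sampled at 12 kHz (0.1 s).
sample_rate = 12000
tone = [math.sin(2 * math.pi * 300 * n / sample_rate) for n in range(1200)]

f0 = estimate_f0(tone, sample_rate)
print(round(f0, 1))
```

Repeating this frame by frame over a recorded utterance would give a sequence of F0 values from which a pitch pattern could be built.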

A specific example of the system configuration of the speech synthesis system is shown in FIG. 28.
As shown in FIG. 28, users #1 to #3 store speech synthesis requests consisting of text and speech signals in the audio recording system 20 through, for example, a Web (World Wide Web) site. The audio recording system 20 creates recording instruction data by the method described above and outputs recording instructions, through a monitor 31 or the like, to a speaker such as a voice actor or announcer. The speaker records speech using audio recording equipment 30, which consists of a microphone, a personal computer, and the like, and stores it in the audio recording system 20.

The audio recording system 20 then evaluates the recorded speech signal. If the evaluation value is at or below a certain value, it issues a re-recording instruction through the monitor 31 or the like. If the evaluation value is at or above that value, the speech synthesis database 100 is updated and a new, updated speech synthesis system 21 is created.

[Example 7]
Next, Example 7 will be described.
Example 7 corresponds to the ninth embodiment and is described in detail with reference to FIG. 17, which was used for the ninth embodiment.
In FIG. 17, the unnecessary data determination unit 19 determines, from the evaluation result of the quality evaluation unit 82, unnecessary data to be deleted from the speech synthesis database 10A of the speech synthesis unit 11. Pitch patterns, durations, segment waveforms, and the like may be treated as unnecessary data.

Pitch patterns and segment waveforms are used here as an example. Suppose that the speech synthesis requests consist of multiple items of text data and speech signals and, as illustrated in FIG. 27, include many sentences ending in 「いかがですか？」 ("How about it?"). Suppose further that none of them ends with the rising pitch typical of an ordinary question (FIG. 27(A)), and that all of them end with a falling pitch (FIG. 27(B)).

In this case, a speech synthesis engine (speech synthesis unit 11) adapted to these speech synthesis requests should render 「いかがですか？」 with a pitch pattern that falls at the end of the sentence.
Accordingly, the pitch patterns and segment waveforms in the speech synthesis database 10A for 「いかがですか？」 with a rising sentence-final pitch, as in FIG. 27(A), are identified as data to be deleted.
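The reasoning above can be sketched as a comparison between what the requests contain and what the database stores. Everything in this sketch is hypothetical: the `phrase` and `contour` fields, the "rising" and "falling" labels, and the rule that an entry is unnecessary when its phrase is requested but never with that contour.

```python
from collections import Counter


def unnecessary_entries(database, requests):
    """Entries whose phrase is requested, but never with that contour."""
    requested = Counter((r["phrase"], r["contour"]) for r in requests)
    requested_phrases = {r["phrase"] for r in requests}
    return [
        entry for entry in database
        if entry["phrase"] in requested_phrases
        and requested[(entry["phrase"], entry["contour"])] == 0
    ]


# Hypothetical database entries and requests.
database = [
    {"phrase": "いかがですか", "contour": "rising"},
    {"phrase": "いかがですか", "contour": "falling"},
    {"phrase": "そうですか", "contour": "rising"},
]
requests = [{"phrase": "いかがですか", "contour": "falling"}] * 3

to_delete = unnecessary_entries(database, requests)
print(to_delete)
```

Only the rising-pitch entry for 「いかがですか」 is flagged; the entry for a phrase that never appears in the requests is left alone because the requests say nothing about it.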

The speech synthesis system update unit 98 adds data such as pitch patterns and segment waveforms, extracted from the recorded speech signal supplied by the recorded speech signal evaluation unit 96, to the speech synthesis database 10A, and deletes from the speech synthesis database 10A the unnecessary data determined by the unnecessary data determination unit 19.

Although unnecessary data is determined here from the evaluation result of the quality evaluation unit 82, in the other embodiments and examples it may be determined from the evaluation results of the quality evaluation units 12, 22, 32, 42, 62, 72, and 82, respectively. Also, in Example 7 the speech synthesis system update unit 88 performs both the process of adding the recorded speech signal and the parameters created from it and the process of deleting speech data from the speech synthesis database 10A based on the unnecessary data; alternatively, the addition and deletion may be performed by separately provided speech synthesis system update units. This can speed up the speech data update process.

As described in the embodiments and examples above, the present invention makes it possible to adapt an existing speech synthesis system to the task desired by the user and improve its quality, while preventing the speech synthesis database 10 from becoming bloated.

The flowcharts in FIGS. 2, 4, 6, 8, 10, 12, 14, 16, and 18 show not only the operation of the audio recording systems according to the first to ninth embodiments of the present invention but also the steps of the audio recording method according to the present embodiments.

That is, in operation the audio recording system according to the present embodiment sequentially executes at least the following in its sequence control for audio recording: a first step (for example, S101 and S102 in FIG. 2) of generating the desired synthesized speech from input text data with reference to the speech synthesis database 10; a second step (S103) of evaluating the quality of the synthesized speech based on its features; and a third step (S104, S105) of determining, based on the result of the quality evaluation, the speech data to be additionally stored in the speech synthesis database 10.

The second step (S103) may include a sub-step (for example, S103a in FIG. 4) of comparing the synthesized speech with an input speech signal corresponding to the input text data, evaluating the similarity of their features, and determining whether the desired quality is satisfied.

Alternatively, in operation the audio recording system according to the present embodiment sequentially executes at least the following in its sequence control for audio recording: a first step (for example, S101 in FIG. 12) of acquiring input text data; a second step (S103c, S104) of analyzing and evaluating the acquired input text data together with the speech synthesis system information that is stored in the speech synthesis database and defines the operation of the speech synthesis process, and determining whether the speech data stored in the speech synthesis database satisfies the desired quality; and a third step (S105) of determining, based on the result of the quality evaluation, the speech data to be additionally stored in the speech synthesis database 10.

The system also executes a fourth step (for example, S106 in FIG. 2) of generating recording instruction data that includes text data describing the utterance content of the speech signal to be recorded for the speech data determined in the third step (S105).

The second step (S103) may include a sub-step (for example, S108 in FIG. 6) of acquiring an input speech signal corresponding to the input text data and extracting from it segmentation data that includes at least a phoneme string associated with time information, and a sub-step (S102, S104) of aligning the time information using the segmentation data, comparing the synthesized speech with the input speech signal, evaluating the similarity of their features, and determining whether the desired quality is satisfied.

Alternatively, the second step (S103) may include a sub-step (for example, S108a in FIG. 8) of acquiring an input speech signal corresponding to the input text data and extracting from it segmentation data that includes at least a phoneme string associated with time information, and the first step (S102) may include a sub-step (S102a) of using the extracted segmentation data to generate synthesized speech whose time information corresponds to that of the input speech signal.

The fourth step (S106) described above may include a sub-step of evaluating the recorded speech signal by comparing it with the input speech signal corresponding to the input text data (for example, S109 and S110 in FIG. 10), and a sub-step of outputting recording instruction data that calls for re-recording when the evaluation value obtained from the evaluation is below a threshold (S111, S109).
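The re-recording sub-steps amount to a loop that repeats until the evaluation value reaches the threshold. The following is a minimal sketch under stated assumptions: `record_take` and `similarity` are hypothetical callables supplied by the caller, and the `max_takes` limit is an addition for safety that the source does not mention.

```python
def record_until_acceptable(reference, record_take, similarity, threshold, max_takes=5):
    """Repeat recording until the evaluation value reaches the threshold.

    reference   -- input speech signal corresponding to the input text data
    record_take -- hypothetical callable: attempt number -> recorded signal
    similarity  -- hypothetical callable: (reference, take) -> evaluation value
    Returns (take, score, attempts); take is None if no attempt was accepted.
    """
    score = None
    for attempt in range(1, max_takes + 1):
        take = record_take(attempt)          # record one take
        score = similarity(reference, take)  # evaluate it against the reference
        if score >= threshold:
            return take, score, attempt      # good enough: stop recording
        # otherwise a re-recording instruction would be issued here
    return None, score, max_takes
```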

Further, a fifth step (for example, S112a in FIG. 16) is executed to update the speech data stored in the speech synthesis database 10, using the speech signal recorded in accordance with the recording instruction data output in the fourth step (S106).
The fifth step (S112a) described above may further include a sub-step of determining, based on the evaluation result in the fourth step (S106), unnecessary speech data to be deleted from the speech synthesis database 10 (for example, S113 in FIG. 18).

The fifth step (S112a) described above may also include a sub-step of updating the speech data stored in the speech synthesis database 10 based on the unnecessary data determined in the fourth step (S113) (for example, S112b in FIG. 18).

According to the audio recording method of the present embodiment described above, the audio recording system executes the second step, which evaluates quality based on the feature quantities of the synthesized speech generated in the first step, and then the third step, which determines the speech data to be additionally stored in the speech synthesis database 10 based on that evaluation result. The quality of synthesized speech produced with an existing speech synthesis system can thus be evaluated, and the speech data to be added when updating the existing system can be determined easily from the evaluation result.

In executing the second step, the audio recording system also executes a sub-step of comparing the synthesized speech obtained from the input text data with the speech signal corresponding to the input text data. This makes clear which features the existing speech synthesis system cannot fully express, so that speech data better adapted to the task can be created.

Further, according to the audio recording method of the present embodiment, the audio recording system executes the second step of analyzing and evaluating the speech synthesis system information, such as a morphological analysis model and prosodic information, that is stored in the speech synthesis database 10A and defines the operation of the speech synthesis processing. The speech data to be additionally stored when updating the existing speech synthesis system can thus be determined easily and quickly from the acquired input text data and the speech synthesis system information of the existing system, without performing speech synthesis.
The audio recording system also executes the fourth step of generating recording instruction data that includes text data describing the utterance content of the speech signal to be recorded for the speech data determined in the third step, which makes it easier to record speech data when updating an existing speech synthesis system.

In executing the second step, the audio recording system extracts segmentation data such as duration information and uses it in comparing the synthesized speech with the input speech signal, which enables a more precise and detailed quality evaluation of the synthesized speech. Moreover, because the audio recording system generates synthesized speech that follows the duration information indicated by the segmentation data of the speech signal, the comparison operation between the synthesized speech and the input speech signal during quality evaluation can be made unnecessary.
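One way to picture the use of segmentation data is a per-phoneme comparison in which each phoneme's time span selects the frames to compare. The sketch below is only an assumption about how such a comparison might look: it takes pitch (F0) as the single feature quantity, represents a segment as a hypothetical tuple `(phoneme, start_ms, end_ms)`, and assumes one feature frame every `frame_shift_ms` milliseconds.

```python
def segment_mean(values, start_ms, end_ms, frame_shift_ms=10):
    """Mean of the frame values that fall inside [start_ms, end_ms)."""
    frames = values[start_ms // frame_shift_ms:end_ms // frame_shift_ms]
    return sum(frames) / len(frames) if frames else 0.0

def per_phoneme_difference(input_f0, input_segs, synth_f0, synth_segs, frame_shift_ms=10):
    """Compare input and synthesized speech phoneme by phoneme.

    Returns a list of (phoneme, absolute mean-F0 difference, duration difference in ms).
    """
    result = []
    for (ph_in, s_in, e_in), (ph_sy, s_sy, e_sy) in zip(input_segs, synth_segs):
        if ph_in != ph_sy:
            raise ValueError("phoneme strings do not match")
        f0_diff = abs(segment_mean(input_f0, s_in, e_in, frame_shift_ms)
                      - segment_mean(synth_f0, s_sy, e_sy, frame_shift_ms))
        duration_diff = (e_in - s_in) - (e_sy - s_sy)
        result.append((ph_in, f0_diff, duration_diff))
    return result
```

If the synthesized speech is generated with the same durations as the input speech, the duration differences are zero by construction and the frames already line up one to one.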

Furthermore, in executing the fourth step, the audio recording system executes a sub-step of comparing the input speech signal corresponding to the input text data with the recorded speech signal and outputting an external instruction to re-record when the resulting evaluation value is below a predetermined value. Recording can therefore be repeated until the evaluation value of the speaker's recorded speech exceeds the predetermined value, yielding recorded speech signals that are of high quality and adapted to the task.

By executing the fifth step of updating the speech synthesis system with the recorded speech signal, the audio recording system can iterate the process with the updated speech synthesis system and thereby raise both the quality of the synthesized speech and its fitness for the task.

Furthermore, the audio recording system executes the sub-step in the fifth step that determines, based on the evaluation result output by the fourth step, unnecessary speech data to be deleted from the speech synthesis database. Speech data is thus added to the speech synthesis database 10 while unnecessary data is deleted, so that bloat of the speech synthesis database 10 can be avoided. Likewise, the audio recording system executes the sub-step in the fifth step that updates the speech data stored in the speech synthesis database 10 based on the unnecessary data determined by the fourth step; because the unnecessary data is deleted, bloat of the speech synthesis database can be avoided.
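The source does not state the criterion by which data is judged unnecessary. Purely as an illustration, the sketch below assumes that the evaluation yields a usage count per database unit and treats units used fewer than `min_uses` times as deletion candidates; `find_unnecessary`, `update_database`, and `min_uses` are hypothetical names.

```python
def find_unnecessary(db_units, usage_counts, min_uses=1):
    """Return units whose usage count during evaluation is below min_uses (assumed criterion)."""
    return sorted(u for u in db_units if usage_counts.get(u, 0) < min_uses)

def update_database(db_units, new_units, unnecessary):
    """Add newly recorded units and drop the unnecessary ones in a single update."""
    return (set(db_units) | set(new_units)) - set(unnecessary)
```

Adding and deleting in the same update keeps the database size roughly bounded across repeated update cycles.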

In the audio recording system described above, the first to fifth steps executed by its components, and the sub-steps within them, may be implemented as programs, namely a speech synthesis processing function, a quality evaluation processing function, an additional data determination processing function, a recording instruction data generation function, a speech synthesis system update processing function, and their respective sub-processing functions, and executed by a computer.
In this way too, a computer can carry out processing equivalent to that of the first to fifth steps and their sub-steps, so that the same object as the audio recording method described above can be achieved efficiently.

Because the present invention is configured and functions as disclosed in the embodiments and examples described above, the speech synthesis system can be adapted to the task the user desires while its quality is improved. The invention is applicable to existing speech synthesis systems, and it also allows unnecessary data in the speech synthesis database to be deleted, which prevents the database from becoming bloated. In addition, a speech synthesis system adapted to the most demanded task can be constructed from the requests of many users. More specifically, the content of the speech synthesis system can be updated while the requests of many users are consistently met at a given level, and a recording system for the speech best suited to a speech synthesis application can be built. The invention thus provides an audio recording system, an audio recording method, and a recording processing program with excellent effects not previously available.

Here, the quality evaluation unit described above may have a function of comparing the synthesized speech with the input speech signal corresponding to the input text data to evaluate the degree of agreement of their feature quantities, and of determining whether a preset desired quality is satisfied.
With this configuration, the quality evaluation unit compares the synthesized speech obtained from the input text data with the speech signal corresponding to the input text data. This makes clear which features the existing speech synthesis system cannot fully express, so that speech data better adapted to the task can be created.

The quality evaluation unit described above may also have a function of analyzing and evaluating the intermediate data generated by the speech synthesis unit when synthesizing speech from the input text data, and of determining on that basis whether the speech data stored in the speech synthesis database satisfies the desired quality.

With this configuration, the quality evaluation unit can determine the speech data to be additionally stored by analyzing and evaluating detailed intermediate data generated by the speech synthesis unit during synthesis, such as phonetic symbols and the selected segment waveforms. Synthesized speech produced with an existing speech synthesis system can therefore be evaluated with high accuracy, and the speech data to be additionally stored when updating the existing system can be determined easily.

Here, the input text data described above may further include style designation data describing a speech style, and the style designation data may include at least one of data indicating a pitch pattern and data indicating a speech rate.

The speech synthesis database described above may store, as the stored speech data, at least one of a pitch pattern model, a duration model, and speech segment waveforms.
Furthermore, the speech data determined by the additional data determination unit described above may include at least one of a pitch pattern parameter, a duration parameter, and a speech segment waveform.

The additional data determination unit described above may also be accompanied by a recording instruction data generation unit that generates recording instruction data including text data describing the utterance content of the speech signal to be recorded for the speech data determined by the additional data determination unit.
With this configuration, the recording instruction data generation unit generates recording instruction data including text data that describes the utterance content of the speech signal to be recorded for the speech data determined by the additional data determination unit, which makes it easier to record speech data when updating an existing speech synthesis system.

The recording instruction data generation unit described above may also have a function of calculating, from the text data that is included in the recording instruction data and describes the utterance content of the speech signal to be recorded, the total duration of the speech signal that would be recorded, and of limiting the amount of speech signal to be recorded according to the result of the calculation.
Furthermore, the recording instruction data generation unit described above may have a text generation storage unit and a function of generating the recording instruction data by referring to the generation text data stored in the text generation storage unit.
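The total-duration calculation can be sketched as follows. The source does not say how duration is estimated from text, so this example simply assumes a constant speaking rate in characters per second; `estimate_seconds`, `limit_by_total_time`, and the default rate of 8.0 are hypothetical.

```python
def estimate_seconds(text, chars_per_second=8.0):
    """Estimate the recording time of a text from its length (assumed constant rate)."""
    return len(text) / chars_per_second

def limit_by_total_time(sentences, max_seconds, chars_per_second=8.0):
    """Keep sentences in order until the estimated total time would exceed max_seconds."""
    selected, total = [], 0.0
    for sentence in sentences:
        duration = estimate_seconds(sentence, chars_per_second)
        if total + duration > max_seconds:
            break  # stop before the recording budget is exceeded
        selected.append(sentence)
        total += duration
    return selected, total
```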

The recording instruction data generation unit described above may also have a function of calculating the amount of speech signal to be recorded from the recording instruction data and varying the coverage rate of the additional data according to the calculated amount.
Furthermore, the recording instruction data generation unit described above may have a function of creating the recording instruction data by selecting from the additional data.
The recording instruction data generation unit described above may also have a function of generating recording instruction data that covers all or part of the additional data, using at least one of the input text data input to the speech synthesis unit, the generation text data stored in the text generation storage unit, and the additional data.
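Generating instruction text that covers all or part of the additional data resembles a set-cover problem. The sketch below uses a greedy selection, which is a swapped-in standard technique and not a method stated in the source; `candidates` is a hypothetical mapping from each candidate sentence to the set of units it contains, and `target_rate` plays the role of the coverage rate.

```python
def select_sentences(candidates, needed, target_rate=1.0):
    """Greedily pick sentences until the target fraction of needed units is covered.

    candidates -- dict mapping sentence -> set of units contained in it
    needed     -- units that the additional data requires
    Returns (chosen sentences, achieved coverage rate).
    """
    needed = set(needed)
    if not needed:
        return [], 1.0
    remaining = set(needed)
    chosen = []
    while (len(needed) - len(remaining)) / len(needed) < target_rate:
        # pick the sentence that covers the most still-missing units
        best = max(candidates, key=lambda s: len(candidates[s] & remaining), default=None)
        if best is None or not candidates[best] & remaining:
            break  # no candidate adds coverage
        chosen.append(best)
        remaining -= candidates[best]
    return chosen, (len(needed) - len(remaining)) / len(needed)
```

Lowering `target_rate` shortens the recording session at the cost of leaving some of the additional data uncovered.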

Furthermore, the recording instruction data generated by the recording instruction data generation unit described above may include utterance content instruction data that specifies the utterance content, and the utterance content instruction data may include at least one of text data describing the utterance content and synthesized speech having the same utterance content.
The recording instruction data generated by the recording instruction data generation unit described above may further include utterance style instruction data that specifies a speech style, and the utterance style instruction data may include at least one of data indicating a pitch pattern, a duration, intonation, and emotion. The recording instruction data generated by the recording instruction data generation unit described above may also be included in the input speech signal corresponding to the input text data.
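The structure of the recording instruction data might be modeled as follows. This is only an illustrative data layout: the class and field names are hypothetical, and the "at least one of" conditions from the text are enforced as simple validation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class UtteranceContent:
    """Utterance content instruction: text and/or synthesized speech of the same content."""
    text: Optional[str] = None
    synthesized_speech: Optional[Sequence[float]] = None

    def __post_init__(self):
        if self.text is None and self.synthesized_speech is None:
            raise ValueError("need text or synthesized speech")

@dataclass
class UtteranceStyle:
    """Utterance style instruction: at least one style attribute must be given."""
    pitch_pattern: Optional[Sequence[float]] = None
    duration_ms: Optional[Sequence[int]] = None
    intonation: Optional[str] = None
    emotion: Optional[str] = None

    def __post_init__(self):
        fields = (self.pitch_pattern, self.duration_ms, self.intonation, self.emotion)
        if all(value is None for value in fields):
            raise ValueError("need at least one style attribute")

@dataclass
class RecordingInstruction:
    """Recording instruction data: content is required, style is optional."""
    content: UtteranceContent
    style: Optional[UtteranceStyle] = None
```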

Furthermore, the quality evaluation unit described above may be accompanied by a segmentation data extraction unit that extracts, from the input speech signal corresponding to the input text data, segmentation data including at least a phoneme string associated with time information, and the quality evaluation unit may have a function of aligning time information by means of the segmentation data, comparing the synthesized speech with the input speech signal to evaluate the similarity of their feature quantities, and thereby determining whether the desired quality is satisfied.

With this configuration, the segmentation data extraction unit extracts the segmentation data of the input speech signal corresponding to the input text data, and the quality evaluation unit uses the extracted segmentation data to align time information, compares the synthesized speech with the input speech signal to evaluate the similarity of their feature quantities, and determines whether the desired quality is satisfied. The quality evaluation unit can thus use segmentation data such as duration information in comparing the synthesized speech with the input speech signal, which enables a more precise and detailed quality evaluation of the synthesized speech.

Furthermore, the speech synthesis unit may be accompanied by a segmentation data extraction unit that extracts, from the input speech signal corresponding to the input text data described above, segmentation data including at least a phoneme string associated with time information, and the speech synthesis unit may have a function of generating, from the segmentation data, synthesized speech whose time information corresponds to that of the input speech signal.
With this configuration, the speech synthesis unit generates synthesized speech that follows the duration information indicated by the segmentation data of the speech signal, so the comparison operation between the synthesized speech and the input speech signal in the quality evaluation unit can be made unnecessary.

In the audio recording system described above, the recording instruction data generation unit may also be accompanied by a recorded speech signal evaluation unit that compares the input speech signal corresponding to the input text data with the recorded speech signal and outputs an external instruction to re-record when the evaluation value obtained from the comparison is below a preset value.
With this configuration, the recorded speech signal comparison unit compares the input speech signal corresponding to the input text data with the recorded speech signal, ends the processing when the evaluation value obtained from the comparison exceeds the predetermined value, and outputs an external instruction to re-record when it is below the predetermined value. Recording can therefore be repeated until the evaluation value of the speaker's recorded speech exceeds the predetermined value, yielding recorded speech signals that are of high quality and adapted to the task.

The speech synthesis unit may also be accompanied by a speech synthesis system update unit that updates the speech data for the speech synthesis system stored in the speech synthesis database, using the speech signal recorded in accordance with the instruction data described above.
With this configuration, the speech synthesis system update unit updates the speech synthesis system with the recorded speech signal, and the process is then iterated with the updated speech synthesis system, so that both the quality of the synthesized speech and its fitness for the task can be raised.

The quality evaluation unit described above may also be accompanied by an unnecessary data determination unit that, based on the evaluation result of the quality evaluation unit, estimates data that is unnecessary for the speech synthesis database and determines the unnecessary speech data to be deleted from the speech synthesis database.
With this configuration, the unnecessary data determination unit determines, based on the evaluation result of the quality evaluation unit, the unnecessary speech data to be deleted from the speech synthesis database. Unnecessary data is thus deleted as speech data is added to the speech synthesis database, so that bloat of the speech synthesis database can be avoided.

Here, the unnecessary data determined by the unnecessary data determination unit described above may include at least one of a pitch pattern parameter, a duration parameter, and a speech segment waveform.

The speech synthesis unit may also be accompanied by a speech synthesis system update unit that updates the speech data for speech synthesis stored in the speech synthesis database based on the unnecessary data described above.
With this configuration, the speech synthesis system update unit updates the speech data stored in the speech synthesis database based on the unnecessary data determined by the unnecessary data determination unit, so that bloat of the speech synthesis database can be avoided.

The second step described above may also include a sub-step of comparing the synthesized speech with the input speech signal corresponding to the input text data to evaluate the degree of agreement of their feature quantities, and of determining on that basis whether the synthesized speech satisfies the desired quality.
With this configuration, the audio recording system compares the synthesized speech obtained from the input text data with the speech signal corresponding to the input text data. This makes clear which features the existing speech synthesis system cannot fully express, so that speech data better adapted to the task can be created.

The audio recording method described above may also be supplemented with a fourth step of generating recording instruction data that includes text data describing the utterance content of the speech signal to be recorded for the speech data determined in the third step.
With this configuration, the audio recording system executes the fourth step of generating recording instruction data that includes text data describing the utterance content of the speech signal to be recorded for the speech data determined in the third step, which makes it easier to record speech data when updating an existing speech synthesis system.

Here, the second step described above may include a sub-step of acquiring an input speech signal corresponding to the input text data and extracting from it segmentation data that includes at least a phoneme string associated with time information, and a sub-step of aligning time information by means of the segmentation data, comparing the synthesized speech with the input speech signal to evaluate the degree of agreement of their feature quantities, and determining on that basis whether the desired quality is satisfied.

With this configuration, in executing the second step the audio recording system executes a sub-step of extracting the segmentation data of the input speech signal corresponding to the input text data, and a sub-step of aligning time information by means of the extracted segmentation data, comparing the synthesized speech with the input speech signal to evaluate the similarity of their feature quantities, and determining whether the desired quality is satisfied. Segmentation data such as duration information can thus be used in comparing the synthesized speech with the input speech signal, which enables a more precise and detailed quality evaluation of the synthesized speech.

The second step described above may also include a sub-step of acquiring an input speech signal corresponding to the input text data and extracting from it segmentation data that includes at least a phoneme string associated with time information, and the first step may include a sub-step of generating, from the extracted segmentation data, synthesized speech whose time information corresponds to that of the input speech signal.

The first step described above may also include a sub-step of acquiring an input speech signal corresponding to the input text data and extracting from it segmentation data that includes at least a phoneme string associated with time information, and a sub-step of generating, from the extracted segmentation data, synthesized speech whose time information corresponds to that of the input speech signal.
With this configuration, the audio recording system generates synthesized speech that follows the duration information indicated by the segmentation data of the speech signal, so the comparison operation between the synthesized speech and the input speech signal during quality evaluation can be made unnecessary.

The fourth step described above may also include a sub-step of evaluating the recorded speech signal by comparing it with the input speech signal corresponding to the input text data, and a sub-step of outputting recording instruction data that calls for re-recording when the evaluation value obtained from the evaluation is below a threshold.
With this configuration, in executing the fourth step the audio recording system executes a sub-step of comparing the input speech signal corresponding to the input text data with the recorded speech signal, and a sub-step of ending the processing when the evaluation value obtained from the comparison exceeds a predetermined value and outputting an external instruction to re-record when it is below the predetermined value. Recording can therefore be repeated until the evaluation value of the speaker's recorded speech exceeds the predetermined value, yielding recorded speech signals that are of high quality and adapted to the task.

A fifth step may also be provided that updates the speech data stored in the speech synthesis database, using the speech signal recorded in accordance with the recording instruction data output in the fourth step described above.
With this configuration, the audio recording system executes the fifth step of updating the speech synthesis system with the recorded speech signal, and can then iterate the process with the updated speech synthesis system, so that both the quality of the synthesized speech and its fitness for the task can be raised.

Further, the fifth step described above may include a sub-step of determining unnecessary speech data to be deleted from the speech synthesis database based on the evaluation result of the fourth step.
With this configuration, the audio recording system executes the sub-step of the fifth step that determines, based on the evaluation result output by the fourth step, the unnecessary speech data to be deleted from the speech synthesis database. Speech data is additionally stored in the speech synthesis database while unnecessary data is deleted, so that bloating of the database can be avoided.

The fifth step described above may also include a sub-step of updating the speech data stored in the speech synthesis database based on the unnecessary data determined in the fourth step.
With this configuration, the speech synthesis system executes the sub-step of the fifth step that updates the speech data stored in the speech synthesis database based on the unnecessary data determined by the fourth step. The unnecessary data is deleted, so that bloating of the speech synthesis database can be avoided.
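A database update that adds newly recorded data and removes unnecessary data might look like the sketch below. The source does not say how unnecessary data is identified from the evaluation result, so the rule used here (an existing unit whose evaluation score is below `min_score`) is purely an assumption, as are the dictionary layout and all names.

```python
from typing import Dict, Set, Tuple


def update_database(
    database: Dict[str, bytes],
    new_units: Dict[str, bytes],
    scores: Dict[str, float],
    min_score: float,
) -> Tuple[Dict[str, bytes], Set[str]]:
    """Return the updated database and the ids judged unnecessary."""
    # Assumed rule: an existing unit with a known score below min_score is
    # unnecessary, unless the new recordings replace it under the same id.
    unnecessary = {
        uid
        for uid in database
        if uid not in new_units and scores.get(uid, min_score) < min_score
    }
    # Drop unnecessary units, then add (or overwrite with) the new ones.
    updated = {uid: unit for uid, unit in database.items() if uid not in unnecessary}
    updated.update(new_units)
    return updated, unnecessary
```

Units without a score are kept, which errs on the side of not deleting data that was never evaluated.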

The quality evaluation processing function described above may include a sub-processing function that compares the synthesized speech with the input speech signal corresponding to the input text data, evaluates the degree of coincidence of their feature amounts, and determines on that basis whether the desired quality is satisfied; the computer may be caused to execute this function.
With this configuration, the computer compares the synthesized speech produced from the input text data with the speech signal corresponding to that text data, which makes clear the characteristics that the existing speech synthesis system cannot fully express, so that speech data better adapted to the task can be created.
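As a rough illustration of evaluating the degree of coincidence of feature amounts, the sketch below compares two frame-aligned pitch (F0) contours. The choice of pitch as the feature, the relative tolerance, the treatment of unvoiced frames, and the function names are all assumptions made for this example; the source does not fix a particular measure.

```python
from typing import Sequence


def coincidence_degree(
    synth_f0: Sequence[float],
    natural_f0: Sequence[float],
    tolerance: float = 0.1,
) -> float:
    """Fraction of voiced frames whose synthesized F0 is within `tolerance`
    (relative) of the natural F0. Frames with natural F0 <= 0 are skipped."""
    voiced = [(s, r) for s, r in zip(synth_f0, natural_f0) if r > 0]
    if not voiced:
        return 0.0
    hits = sum(1 for s, r in voiced if abs(s - r) / r <= tolerance)
    return hits / len(voiced)


def meets_quality(
    synth_f0: Sequence[float],
    natural_f0: Sequence[float],
    required: float,
) -> bool:
    """Judge the desired quality as met when the coincidence degree reaches `required`."""
    return coincidence_degree(synth_f0, natural_f0) >= required
```

Frames that fail the tolerance test point at the pitch patterns the current database cannot reproduce, which is the kind of information the additional data determination would draw on.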

The program is also characterized in that a recording instruction data generation function is provided alongside the additional storage data determination function described above, the recording instruction data generation function generating recording instruction data that includes text data describing the utterance content of the speech signal to be recorded for the speech data determined by the additional storage data determination function, and in that the computer is caused to execute these functions.
With this configuration, the computer generates recording instruction data including text data describing the utterance content of the speech signal to be recorded for the speech data determined to be additionally stored, which makes it easier to record speech data when updating an existing speech synthesis system.
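One way to turn the determined additional data into text for the speaker is a greedy selection over candidate sentences, sketched below. The greedy strategy, the representation of additional data as a set of unit labels, and every name are assumptions for illustration; the source only states that text describing the utterance content is generated.

```python
from typing import Dict, List, Set, Tuple


def build_recording_script(
    missing_units: Set[str],
    candidates: Dict[str, Set[str]],
) -> Tuple[List[str], Set[str]]:
    """Pick sentences that cover the missing units.

    `candidates` maps each candidate sentence to the units it contains.
    Returns the chosen sentences and any units no candidate could cover.
    """
    remaining = set(missing_units)
    script: List[str] = []
    while remaining:
        # Choose the sentence covering the most still-missing units.
        best = max(candidates, key=lambda t: len(remaining & candidates[t]), default=None)
        if best is None or not (remaining & candidates[best]):
            break  # nothing left that helps
        script.append(best)
        remaining -= candidates[best]
    return script, remaining
```

Returning the uncovered units lets the caller decide whether satisfying only part of the additional data is acceptable or whether more candidate text is needed.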

The quality evaluation processing function described above may also include a sub-processing function that acquires the input speech signal corresponding to the input text data and extracts from it segmentation data including at least a phoneme string associated with time information, and a sub-processing function that uses the segmentation data to align time information, compares the synthesized speech with the input speech signal to evaluate the similarity of their feature amounts, and determines whether the desired quality is satisfied; the computer may be caused to execute these functions.

With this configuration, the computer extracts, from the input speech signal corresponding to the input text data, segmentation data including a phoneme string associated with time information, and uses that segmentation data to align time information when comparing the synthesized speech with the input speech signal and evaluating the similarity of their feature amounts. Segmentation data such as duration information can therefore be used in the comparison, which allows a more precise and detailed quality evaluation of the synthesized speech.
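Segmentation data of this kind can be pictured as a list of phonemes with start and end times. The sketch below uses it to compare per-phoneme durations of the synthesized speech and the input speech signal. The `Segment` layout and the min/max duration ratio are assumptions chosen for illustration, not the measure used in the patent.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Segment:
    """One phoneme with its time information in seconds (hypothetical layout)."""
    phoneme: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def duration_similarity(synth: Sequence[Segment], natural: Sequence[Segment]) -> float:
    """Mean of per-phoneme min/max duration ratios, in [0, 1]."""
    if [s.phoneme for s in synth] != [n.phoneme for n in natural]:
        raise ValueError("phoneme strings differ; segments cannot be aligned")
    if not synth:
        return 0.0
    ratios = []
    for s, n in zip(synth, natural):
        longer = max(s.duration, n.duration)
        # Two zero-length segments are treated as a perfect match.
        ratios.append(min(s.duration, n.duration) / longer if longer > 0 else 1.0)
    return sum(ratios) / len(ratios)
```

Because the phoneme strings are checked first, each ratio compares the same phoneme in both signals, which is the alignment role the segmentation data plays in the text.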

Further, the quality evaluation processing function described above may include a sub-step of acquiring the input speech signal corresponding to the input text data and extracting from it segmentation data including at least a phoneme string associated with time information, and the speech synthesis processing function described above may include a sub-step of using the extracted segmentation data to generate synthesized speech having time information corresponding to the input speech signal; the computer may be caused to execute these.
With this configuration, the computer generates synthesized speech that follows the duration information indicated by the segmentation data of the speech signal, so that the operation of comparing the synthesized speech with the input speech signal at the time of quality evaluation can be made unnecessary.

Further, the recording instruction data generation function described above may include a sub-processing function that evaluates the recorded speech signal by comparing it with the input speech signal corresponding to the input text data, and a sub-processing function that outputs recording instruction data for re-recording when the evaluation value obtained from that evaluation is below a threshold; the computer may be caused to execute these functions.

With this configuration, the computer compares the input speech signal corresponding to the input text data with the recorded speech signal, ends the process when the resulting evaluation value exceeds a predetermined value, and outputs a re-recording instruction to the outside when the value falls short of it. The recording operation can thus be repeated until the evaluation value of the speaker's recorded speech exceeds the predetermined value, so that a recorded speech signal of high quality and adapted to the task can be obtained.

A speech synthesis system update processing function, which updates the speech data stored in the speech synthesis database using the speech signal recorded according to the recording instruction data generated by the recording instruction data generation function, may also be provided alongside the recording instruction data generation function, and the computer may be caused to execute these functions.
With this configuration, the computer updates the speech synthesis system with the recorded speech signal, so that the process can be iterated with the updated speech synthesis system, raising the quality of the synthesized speech and its fitness for the task.

Further, the speech synthesis system update processing function described above may include a sub-processing function that determines unnecessary speech data to be deleted from the speech synthesis database based on the evaluation result of the recording instruction data generation function; the computer may be caused to execute these functions.

With this configuration, the computer determines, based on the evaluation result, the unnecessary speech data to be deleted from the speech synthesis database. Speech data is additionally stored in the speech synthesis database while unnecessary data is deleted, so that bloating of the database can be avoided.

The speech synthesis system update processing function described above may also include a sub-processing function that updates the speech data stored in the speech synthesis database based on the unnecessary data determined by the recording instruction data generation function; the computer may be caused to execute these functions.
With this configuration, the computer updates the speech data stored in the speech synthesis database based on the determined unnecessary data, whereby the unnecessary data is deleted and bloating of the speech synthesis database can be avoided.

The present invention has been described above with reference to the embodiments (and examples), but it is not limited to them. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.

This application claims priority based on Japanese Patent Application No. 2006-300824, filed on November 6, 2006, the entire disclosure of which is incorporated herein.

The present invention is effectively applied to building speech synthesis systems that focus on a specific task, such as emergency broadcasts or story reading. Because speech can be synthesized with an atmosphere suited to the place of use, phrasing adapted to tourist guidance, emergency notices and event announcements from government offices, museum guidance, and the like becomes possible. Because the speech synthesis system can also be upgraded periodically in response to user requests, the invention is highly versatile.

FIG. 1 is a block diagram showing the configuration of an audio recording system according to a first embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the first embodiment shown in FIG. 1.
FIG. 3 is a block diagram showing the configuration of an audio recording system according to a second embodiment of the present invention.
FIG. 4 is a flowchart showing the operation of the second embodiment shown in FIG. 3.
FIG. 5 is a block diagram showing the configuration of an audio recording system according to a third embodiment of the present invention.
FIG. 6 is a flowchart showing the operation of the third embodiment shown in FIG. 5.
FIG. 7 is a block diagram showing the configuration of an audio recording system according to a fourth embodiment of the present invention.
FIG. 8 is a flowchart showing the operation of the fourth embodiment shown in FIG. 7.
FIG. 9 is a block diagram showing the configuration of an audio recording system according to a fifth embodiment of the present invention.
FIG. 10 is a flowchart showing the operation of the fifth embodiment shown in FIG. 9.
FIG. 11 is a block diagram showing the configuration of an audio recording system according to a sixth embodiment of the present invention.
FIG. 12 is a flowchart showing the operation of the sixth embodiment shown in FIG. 11.
FIG. 13 is a block diagram showing the configuration of an audio recording system according to a seventh embodiment of the present invention.
FIG. 14 is a flowchart showing the operation of the seventh embodiment shown in FIG. 13.
FIG. 15 is a block diagram showing the configuration of an audio recording system according to an eighth embodiment of the present invention.
FIG. 16 is a flowchart showing the operation of the eighth embodiment shown in FIG. 15.
FIG. 17 is a block diagram showing the configuration of an audio recording system according to a ninth embodiment of the present invention.
FIG. 18 is a flowchart showing the operation of the ninth embodiment shown in FIG. 17.
FIG. 19 is a block diagram showing the specific configuration of Example 1 of the present invention.
FIG. 20 shows the operation of Example 1 of FIG. 19; FIG. 20(A) is an explanatory diagram showing an example of evaluation values, and FIG. 20(B) is an explanatory diagram showing an example of sentence assembly.
FIG. 21 is a block diagram showing the specific configuration of Example 2 of the present invention.
FIG. 22 shows Example 3 of the present invention; FIG. 22(A) is a block diagram showing its specific configuration, and FIG. 22(B) is an explanatory diagram showing an example of how the duration of each phoneme is described in the segmentation description format.
FIG. 23 is a block diagram showing the specific configuration of Example 4 of the present invention.
FIG. 24 is a block diagram showing the specific configuration of Example 5 of the present invention.
FIG. 25 is a block diagram showing the specific configuration of Example 6 of the present invention.
FIG. 26 is an explanatory diagram showing a further specific example of Example 6 of the present invention.
FIG. 27 is an explanatory diagram referred to in describing the operation of Example 7, which corresponds to the ninth embodiment of the present invention.

Explanation of symbols

10, 40 Speech synthesis system
10A, 40A, 60A Speech synthesis database
11 Speech synthesis unit
12, 22, 32, 42, 62, 72, 82 Quality evaluation unit
13 Additional data determination unit
14 Recording instruction data generation unit
14A Text generation storage unit
15, 45 Segmentation data extraction unit
16 Recorded speech signal evaluation unit
17 Text database for sentence generation
18, 88, 98 Speech synthesis system update unit
19 Unnecessary data determination unit
20 Audio recording system
21 Updated speech synthesis system
30 Audio recording device

Claims

An audio recording system for recording speech signals needed to create speech data, such as parameters and segment waveforms, to be stored in a speech synthesis database, the system comprising:
a speech synthesis unit that receives text data describing the content to be generated and generates a desired synthesized speech based on speech data stored in advance in the speech synthesis database;
a quality evaluation unit that evaluates the quality of the synthesized speech based on feature amounts of the synthesized speech; and
an additional data determination unit that determines, based on the result of the quality evaluation, speech data to be additionally stored in the speech synthesis database.

The audio recording system according to claim 1, wherein
the quality evaluation unit has a function of comparing the synthesized speech with an input speech signal corresponding to the input text data, evaluating the degree of coincidence of their feature amounts, and determining whether a predetermined desired quality is satisfied.

The audio recording system according to claim 1, wherein
the quality evaluation unit has a function of analyzing and evaluating intermediate data generated by the speech synthesis unit when it synthesizes speech from the input text data, and determining on that basis whether the speech data stored in the speech synthesis database satisfies the desired quality.

An audio recording system comprising:
a speech synthesis database that stores speech data;
a quality evaluation unit that analyzes input text data and speech synthesis system information defining the processing operations for synthesizing the speech data stored in the speech synthesis database, and determines on that basis whether the speech data stored in the speech synthesis database satisfies a desired quality; and
an additional data determination unit that determines, based on the result of the quality evaluation, speech data to be additionally stored in the speech synthesis database.

The audio recording system according to any one of claims 1 to 4, wherein
the input text data includes format specification data describing a speech format, and the format specification data includes at least one of data indicating a pitch pattern and data indicating an utterance speed.

The audio recording system according to any one of claims 1 to 5, wherein
the speech synthesis database stores, as the speech data, at least one of a pitch pattern model, a duration model, and a speech segment waveform.

The audio recording system according to any one of claims 1 to 6, wherein
the speech data determined by the additional data determination unit includes at least one of a pitch pattern parameter, a duration parameter, and a speech segment waveform.

The audio recording system according to any one of claims 1 to 7, wherein
a recording instruction data generation unit is provided alongside the additional data determination unit, the recording instruction data generation unit generating recording instruction data that includes text data describing the utterance content of the speech signal to be recorded for the speech data determined by the additional data determination unit.

The audio recording system according to claim 8, wherein
the recording instruction data generation unit has a function of calculating the total duration of the speech signals to be recorded from the text data, included in the recording instruction data, that describes the utterance content of those speech signals, and a function of limiting the amount of speech signals to be recorded according to the calculation result.

The audio recording system according to claim 8 or 9, wherein
the recording instruction data generation unit includes a text generation storage unit and has a function of generating the recording instruction data with reference to the generation text data stored in the text generation storage unit.

The audio recording system according to any one of claims 8 to 10, wherein
the recording instruction data generation unit has a function of calculating the amount of speech signals to be recorded from the recording instruction data and changing the satisfaction rate of the additional data according to the calculated amount.

The audio recording system according to any one of claims 8 to 11, wherein
the recording instruction data generation unit has a function of selecting the additional data and creating the recording instruction data.

The audio recording system according to any one of claims 10 to 12, wherein
the recording instruction data generation unit has a function of generating, using at least one of the input text data input to the speech synthesis unit, the generation text data stored in the text generation storage unit, and the additional data, recording instruction data that satisfies all or part of the additional data.

The audio recording system according to any one of claims 8 to 13, wherein
the recording instruction data generated by the recording instruction data generation unit includes utterance content instruction data for instructing utterance content, and
the utterance content instruction data includes at least one of text data describing the utterance content and synthesized speech having the same content.

The audio recording system according to any one of claims 8 to 14, wherein
the recording instruction data generated by the recording instruction data generation unit includes utterance style instruction data for instructing a speech style, and
the utterance style instruction data includes at least one of data indicating a pitch pattern, a duration, intonation, and emotion.

The audio recording system according to any one of claims 8 to 15, wherein
the recording instruction data generated by the recording instruction data generation unit includes an input speech signal corresponding to the input text data.

The audio recording system according to any one of claims 2 and 5 to 16, wherein
a segmentation data extraction unit is provided alongside the quality evaluation unit, the segmentation data extraction unit extracting segmentation data that includes at least a phoneme string associated with time information of an input speech signal corresponding to the input text data, and
the quality evaluation unit has a function of using the segmentation data to align time information, comparing the synthesized speech with the input speech signal, evaluating the degree of coincidence of their feature amounts, and determining whether a desired quality is satisfied.

The audio recording system according to any one of claims 2 and 5 to 16, wherein
a segmentation data extraction unit is provided alongside the speech synthesis unit, the segmentation data extraction unit extracting segmentation data that includes at least a phoneme string associated with time information of an input speech signal corresponding to the input text data, and
the speech synthesis unit has a function of using the segmentation data to generate synthesized speech having time information corresponding to the input speech signal.

The audio recording system according to any one of claims 8 to 18, wherein
a recorded speech signal evaluation unit is provided alongside the recording instruction data generation unit, the recorded speech signal evaluation unit comparing the input speech signal corresponding to the input text data with a newly recorded speech signal and outputting to the outside an instruction to re-record when the evaluation value obtained from the comparison is below a preset value.

The audio recording system according to any one of claims 8 to 19, wherein
a speech synthesis system update unit is provided alongside the speech synthesis unit, the speech synthesis system update unit updating the speech data for the speech synthesis system stored in the speech synthesis database with the speech signal recorded using the recording instruction data.

The audio recording system according to any one of claims 1 to 19, wherein
an unnecessary data determination unit is additionally provided that estimates unnecessary data in the speech synthesis database and determines, based on the evaluation result of the quality evaluation unit, unnecessary speech data to be deleted from the speech synthesis database.

The audio recording system according to claim 21, wherein
the unnecessary data determined by the unnecessary data determination unit includes at least one of a pitch pattern parameter, a duration parameter, and a speech segment waveform.

The audio recording system according to claim 21 or 22, wherein
a speech synthesis system update unit is provided alongside the speech synthesis unit, the speech synthesis system update unit updating the speech data for speech synthesis stored in the speech synthesis database based on the unnecessary data.

An audio recording method for recording a speech signal needed to create speech data to be stored in a speech synthesis database, the method comprising:
a first step of referring to the speech synthesis database and generating a desired synthesized speech from externally input text data;
a second step of evaluating the quality of the synthesized speech based on feature amounts of the generated synthesized speech; and
a third step of determining, based on the result of the quality evaluation, speech data to be additionally stored in the speech synthesis database.

The audio recording method according to claim 24, wherein
the second step includes a sub-step of comparing the synthesized speech with an input speech signal corresponding to the input text data, evaluating the degree of coincidence of their feature amounts, and determining on that basis whether the synthesized speech satisfies a desired quality.

An audio recording method for recording a speech signal to create speech data to be stored in a speech synthesis database, the method comprising:
a first step of acquiring externally input text data;
a second step of analyzing and evaluating the acquired input text data and speech synthesis system information defining the processing operations for synthesizing the speech data stored in the speech synthesis database, and determining on that basis whether the speech data stored in the speech synthesis database satisfies a desired quality; and
a third step of determining, based on the result of the quality evaluation, speech data to be additionally stored in the speech synthesis database.

The audio recording method according to any one of claims 24 to 26, further comprising
a fourth step of generating recording instruction data that includes text data describing the utterance content of the speech signal to be recorded for the speech data determined in the third step.

The audio recording method according to any one of claims 25 to 27, wherein
the second step includes:
a sub-step of acquiring an input speech signal corresponding to the input text data and extracting from it segmentation data including at least a phoneme string associated with time information; and
a sub-step of using the segmentation data to align time information, comparing the synthesized speech with the input speech signal, evaluating the degree of coincidence of their feature amounts, and determining whether the desired quality is satisfied.

The audio recording method according to claim 25 or 27, wherein
the first step includes:
a sub-step of acquiring an input speech signal corresponding to the input text data and extracting from it segmentation data including at least a phoneme string associated with time information; and
a sub-step of using the segmentation data to generate synthesized speech having time information corresponding to the input speech signal.

The audio recording method according to any one of claims 27 to 29, wherein
the fourth step includes:
a sub-step of evaluating the recorded speech signal by comparing it with the input speech signal corresponding to the input text data; and
a sub-step of outputting recording instruction data for re-recording when the evaluation value obtained from the evaluation is below a preset threshold.

The audio recording method according to any one of claims 27 to 30, further comprising
a fifth step of updating the speech data stored in the speech synthesis database with the speech signal recorded using the recording instruction data output in the fourth step.

The audio recording method according to claim 31, wherein
the fifth step includes a sub-step of determining unnecessary speech data to be deleted from the speech synthesis database based on the evaluation result of the fourth step.

The audio recording method according to claim 32, wherein
the fifth step includes a sub-step of updating the speech data stored in the speech synthesis database based on the unnecessary data determined in the fourth step.

An audio recording processing program used in an audio recording system for recording speech signals to create speech data to be stored in a speech synthesis database, the program causing a computer to execute:
a speech synthesis processing function of referring to the speech synthesis database and generating a desired synthesized speech from externally input text data;
a quality evaluation processing function of evaluating the quality of the synthesized speech based on feature amounts of the generated synthesized speech; and
an additional data determination processing function of determining, based on the result of the quality evaluation, speech data to be additionally stored in the speech synthesis database.

The audio recording processing program according to claim 34, wherein
the quality evaluation processing function includes a sub-processing function of comparing the synthesized speech with an input speech signal corresponding to the input text data, evaluating the degree of coincidence of their feature amounts, and determining on that basis whether a desired quality is satisfied.

A voice recording processing program for recording a voice signal to create voice data stored in a voice synthesis database,
Text data acquisition processing function to acquire externally input text data,
A quality evaluation processing function for analyzing and evaluating the acquired input text data together with the speech synthesis system information, stored in the speech synthesis database, that defines the speech synthesis processing operation, and for determining, based on the analysis, whether the speech data stored in the speech synthesis database satisfies a desired quality;
An additional data determination processing function for determining speech data to be additionally stored in the speech synthesis database based on the quality evaluation result;
An audio recording processing program characterized by causing a computer to execute these.

The audio recording processing program according to any one of claims 34 to 36,
A recording instruction data generation function, provided alongside the additional data determination processing function, for generating recording instruction data including text data describing the utterance content of the audio signal to be recorded corresponding to the audio data determined by the additional data determination processing function,
An audio recording processing program characterized by causing a computer to execute these.

In the sound recording processing program according to any one of claims 35 to 37,
The quality evaluation processing function
includes a sub-processing function for obtaining an input speech signal corresponding to the input text data and extracting, from the input speech signal, segmentation data including at least a phoneme string associated with time information, and
a sub-processing function that, with reference to the time information in the segmentation data, compares the synthesized speech and the input speech signal, evaluates the similarity of the feature amounts, and determines whether or not a desired quality is satisfied,
An audio recording processing program characterized by causing a computer to execute these.
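The segmentation-based comparison can be sketched as a per-phoneme check: each phoneme carries a start and end time, and the two signals are compared only within that interval. The sketch below is a minimal illustration under several assumptions: times are in integer milliseconds, features are per-frame values at a fixed hypothetical frame period, the synthesized speech already shares the timing of the input speech, and "similarity" is the difference of segment means against an arbitrary tolerance.

```python
from typing import List, Sequence, Tuple

# Hypothetical segmentation entry: (phoneme, start_ms, end_ms).
Segment = Tuple[str, int, int]


def segment_mean(
    frames: Sequence[float], start_ms: int, end_ms: int, frame_ms: int
) -> float:
    """Mean of the per-frame feature over the frames covering [start_ms, end_ms)."""
    lo = start_ms // frame_ms
    hi = max(lo + 1, end_ms // frame_ms)
    chunk = frames[lo:hi]
    if not chunk:
        raise ValueError("segment lies outside the feature sequence")
    return sum(chunk) / len(chunk)


def poorly_matched_phonemes(
    segments: List[Segment],
    synth_frames: Sequence[float],
    input_frames: Sequence[float],
    frame_ms: int = 10,
    tolerance: float = 10.0,
) -> List[str]:
    """Return phonemes whose segment means differ by more than the tolerance."""
    bad: List[str] = []
    for phoneme, start_ms, end_ms in segments:
        diff = abs(
            segment_mean(synth_frames, start_ms, end_ms, frame_ms)
            - segment_mean(input_frames, start_ms, end_ms, frame_ms)
        )
        if diff > tolerance:
            bad.append(phoneme)
    return bad


segments = [("a", 0, 20), ("k", 20, 40)]
synth_frames = [100.0, 100.0, 100.0, 100.0]
input_frames = [102.0, 104.0, 150.0, 150.0]
bad = poorly_matched_phonemes(segments, synth_frames, input_frames)
```

Localizing the mismatch to specific phonemes is what makes it possible to say which speech data is worth recording, rather than only that the utterance as a whole falls short.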

In the sound recording processing program according to claim 35 or 37,
The quality evaluation processing function includes a sub-step of acquiring an input speech signal corresponding to the input text data and extracting segmentation data including at least a phoneme string associated with time information from the input speech signal;
The speech synthesis processing function includes a sub-step of generating synthesized speech having time information corresponding to the input speech signal using the extracted segmentation data,
An audio recording processing program characterized by causing a computer to execute these.

In the sound recording processing program according to any one of claims 37 to 39,
The recording instruction data generation function includes a sub-processing function that evaluates the recorded voice signal by comparing an input voice signal corresponding to the input text data with the recorded voice signal, and a sub-processing function that outputs recording instruction data for re-recording when the evaluation value obtained as a result of the evaluation does not meet a threshold value,
An audio recording processing program characterized by causing a computer to execute these.
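The re-recording check can be illustrated as: score the recorded signal against the input signal, and emit a new recording instruction only if the score falls short. In the sketch below, the evaluation value is assumed to be the fraction of frames whose features lie within a tolerance of each other. The `RecordingInstruction` type, the tolerance, and the threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RecordingInstruction:
    text: str    # utterance content the speaker is asked to read
    reason: str  # why a (re-)recording is requested


def evaluation_value(
    input_feat: Sequence[float],
    recorded_feat: Sequence[float],
    tolerance: float = 5.0,
) -> float:
    """Assumed score: fraction of frames whose features differ by <= tolerance."""
    if len(input_feat) != len(recorded_feat) or not input_feat:
        raise ValueError("feature sequences must be non-empty and of equal length")
    close = sum(abs(a - b) <= tolerance for a, b in zip(input_feat, recorded_feat))
    return close / len(input_feat)


def check_recording(
    text: str,
    input_feat: Sequence[float],
    recorded_feat: Sequence[float],
    threshold: float = 0.8,
) -> Optional[RecordingInstruction]:
    """Return a re-recording instruction if the score is below the threshold."""
    score = evaluation_value(input_feat, recorded_feat)
    if score < threshold:
        return RecordingInstruction(
            text=text,
            reason=f"re-record: score {score:.2f} below {threshold:.2f}",
        )
    return None
```

Returning `None` for an acceptable take keeps the caller's logic simple: any non-`None` result is sent back to the speaker as a new instruction.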

In the sound recording processing program according to any one of claims 37 to 40,
A speech synthesis system update processing function, provided alongside the recording instruction data generation function, for updating speech data stored in the speech synthesis database using speech signals recorded according to the recording instruction data generated by the recording instruction data generation function,
An audio recording processing program characterized by causing a computer to execute these.

In the sound recording processing program according to claim 41,
The speech synthesis system update processing function includes a sub-processing function for determining unnecessary speech data to be deleted from the speech synthesis database based on an evaluation result in the recording instruction data generation function,
An audio recording processing program characterized by causing a computer to execute these.

In the sound recording processing program according to claim 42,
The speech synthesis system update processing function includes a sub-processing function for updating speech data stored in the speech synthesis database based on unnecessary data determined in the recording instruction data generation function,
An audio recording processing program characterized by causing a computer to execute these.
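The update described in these last claims amounts to two operations on the database: add the newly recorded data and drop the entries judged unnecessary. A minimal sketch follows, assuming a hypothetical database keyed by entry ID and a set of unnecessary IDs that has already been determined elsewhere. How that set is chosen is not shown here and is not specified by this sketch.

```python
from typing import Dict, Set


def update_database(
    db: Dict[str, str],
    new_recordings: Dict[str, str],
    unnecessary: Set[str],
) -> Dict[str, str]:
    """Return a new database without the unnecessary entries, plus new recordings.

    Keys are hypothetical entry IDs; values stand in for stored speech data.
    The input dict is left unmodified.
    """
    updated = {key: value for key, value in db.items() if key not in unnecessary}
    updated.update(new_recordings)
    return updated


old_db = {"u1": "ko (old take)", "u2": "n"}
new_db = update_database(old_db, {"u3": "ko (new take)"}, {"u1"})
```

Building a new dict instead of mutating in place makes it easy to compare the database before and after an update, for example to re-run the quality evaluation on both.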