JP2010145794A

JP2010145794A - Voice synthesis dictionary construction device, voice synthesis dictionary construction method, and program

Info

Publication number: JP2010145794A
Application number: JP2008323609A
Authority: JP
Inventors: Katsuhiko Sato; 勝彦佐藤
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2008-12-19
Filing date: 2008-12-19
Publication date: 2010-07-01
Anticipated expiration: 2028-12-19
Also published as: JP5326545B2

Abstract

PROBLEM TO BE SOLVED: To generate a voice synthesis dictionary from which pitch portion series data having significant pitch fluctuation has been deleted, thereby preventing synthesized voice from sounding unnatural. SOLUTION: A pitch series data extracting part 220 extracts pitch series data from voice data taken from a voice database 201. A voice cut-out part 225 cuts out the pitch portion series data for each phoneme from the extracted pitch series data. A pitch data evaluating part 230 determines whether the pitch fluctuation of each piece of pitch portion series data is significant. A pitch data editing part 240 deletes the pitch portion series data whose pitch fluctuation is significant. A phoneme HMM learning part 250 then learns the correspondence between phoneme labels and phoneme pitch data. A data writing-out part 260 writes the learned correspondence between phoneme labels and phoneme pitch data into the voice synthesis dictionary 202. COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a speech synthesis dictionary construction device, a speech synthesis dictionary construction method, and a program for constructing a database used for speech synthesis.

Speech recognition and speech synthesis technologies based on hidden Markov models (hereinafter, HMMs) are widely used.

HMM-based speech recognition and speech synthesis technologies are disclosed, for example, in Patent Documents 1 and 2.

HMM-based speech synthesis requires a speech synthesis dictionary that records the correspondence between phoneme labels and data such as spectral parameter data sequences.

A speech synthesis dictionary is usually constructed by performing spectral analysis and pitch extraction on the data recorded in a database made up of pairs of phoneme label sequences and the corresponding speech data (hereinafter, a speech database), and then running an HMM-based learning process.

Conventionally, when a speech synthesis dictionary was constructed, pitches extracted from the speech data were used as they were for HMM-based learning, even when some of them differed markedly from the other pitches.

Patent Document 1: Japanese Patent Laid-Open No. 2002-244689
Patent Document 2: Japanese Patent Laid-Open No. 2002-268660

However, synthesized speech generated with a speech synthesis dictionary constructed in this way contains portions where the pitch changes steeply.

For this reason, when a speech synthesis dictionary is constructed with a conventional speech synthesis dictionary construction device, the synthesized speech may sound unnatural compared with natural human speech.

The present invention has been made in view of the above circumstances, and its object is to provide a speech synthesis dictionary construction device, a speech synthesis dictionary construction method, and a program that make it possible to construct a speech synthesis dictionary that does not synthesize unnatural speech with abrupt pitch fluctuations.

In order to achieve the above object, a speech synthesis dictionary construction device according to a first aspect of the present invention is
a device for constructing a speech synthesis dictionary, comprising:
a receiving unit that receives a phoneme label sequence and the corresponding speech data;
a pitch sequence data extraction unit that extracts pitch sequence data from the received speech data;
a phoneme cutout unit that cuts out pitch partial sequence data for each phoneme from the extracted pitch sequence data;
a pitch data evaluation unit that evaluates whether the cut-out pitch partial sequence data is suitable for constructing a speech synthesis dictionary;
a pitch data editing unit that edits the pitch partial sequence data evaluated as suitable for constructing a speech synthesis dictionary;
a phoneme HMM learning unit that, from the phoneme label sequence and the edited pitch partial sequence data, associates a pitch-related phoneme HMM with each phoneme label by learning based on a hidden Markov model; and
a data writing unit that records the learning result in the speech synthesis dictionary.

In the speech synthesis dictionary construction device according to the first aspect,
the mel-cepstrum partial sequence data corresponding to pitch partial sequence data that the pitch data evaluation unit has evaluated as unsuitable for constructing a speech synthesis dictionary may likewise be evaluated as unsuitable for constructing a speech synthesis dictionary.

It is desirable that the pitch data evaluation unit evaluates pitch partial sequence data as suitable for constructing a speech synthesis dictionary when, in every frame, the difference between the pitch partial sequence data and its average value does not exceed a predetermined threshold.

The pitch data evaluation unit may instead evaluate pitch partial sequence data as suitable for constructing a speech synthesis dictionary when the difference between its average value and the average value of the pitch partial sequence data belonging to the same monophone label does not exceed a predetermined threshold.

In the speech synthesis dictionary construction device according to the first aspect,
when pitch partial sequence data is deleted, the pitch sequence data from which that pitch partial sequence data was cut out may also be deleted.

If, in addition, the mel-cepstrum partial sequence data corresponding to the deleted pitch partial sequence data is deleted,
the mel-cepstrum sequence data from which that mel-cepstrum partial sequence data was cut out may also be deleted.

In the speech synthesis dictionary construction device according to the first aspect,
the number of speech data items from which pitch partial sequence data has been deleted may be counted, and if that number exceeds a predetermined threshold, all of the speech data may be collected again.
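As a rough illustration of this count-and-recollect check, the following Python sketch counts the speech data items that had at least one pitch partial sequence deleted and compares the count with a threshold. The function name `needs_recollection`, the boolean-flag representation, and the parameter `max_affected` are assumptions made for illustration; the source does not say how the count is kept.

```python
def needs_recollection(had_deletion, max_affected):
    """Return True when too many speech data items lost pitch segments.

    had_deletion: one boolean per speech data item (hypothetical layout),
                  True if any pitch partial sequence was deleted from it.
    max_affected: the predetermined threshold on the number of such items.
    """
    affected = sum(1 for flag in had_deletion if flag)
    return affected > max_affected


# Two of three items were affected; a threshold of 1 is exceeded.
recollect = needs_recollection([True, False, True], max_affected=1)
```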

In order to achieve the above object, a speech synthesis dictionary construction method according to a second aspect of the present invention is
a method for constructing a speech synthesis dictionary, comprising:
a receiving step of receiving, from a database, a phoneme label sequence and the corresponding speech data;
a pitch sequence data extraction step of extracting pitch sequence data from the received speech data;
a phoneme cutout step of cutting out pitch partial sequence data for each phoneme from the extracted pitch sequence data;
a pitch data evaluation step of evaluating whether the cut-out pitch partial sequence data is suitable for constructing a speech synthesis dictionary;
a pitch data editing step of editing the pitch partial sequence data evaluated as suitable for constructing a speech synthesis dictionary;
a phoneme HMM learning step of associating, from the phoneme label sequence and the edited pitch partial sequence data, a pitch-related phoneme HMM with each phoneme label by learning based on a hidden Markov model; and
an output step of outputting the learning result.

In order to achieve the above object, a program according to a third aspect of the present invention
causes a computer to execute:
a receiving step of receiving, from a database, a phoneme label sequence and the corresponding speech data;
a pitch sequence data extraction step of extracting pitch sequence data from the received speech data;
a phoneme cutout step of cutting out pitch partial sequence data for each phoneme from the extracted pitch sequence data;
a pitch data evaluation step of evaluating whether the cut-out pitch partial sequence data is suitable for constructing a speech synthesis dictionary;
a pitch data editing step of editing the pitch partial sequence data evaluated as suitable for constructing a speech synthesis dictionary;
a phoneme HMM learning step of associating, from the phoneme label sequence and the edited pitch partial sequence data, a pitch-related phoneme HMM with each phoneme label by learning based on a hidden Markov model; and
an output step of outputting the learning result.

According to the present invention, pitch partial sequence data whose pitch fluctuation is conspicuous compared with the other pitch partial sequence data is deleted from the pitch partial sequence data extracted from the speech data, and the remainder is used to train the speech synthesis dictionary. As a result, the synthesized speech obtained with this dictionary is of high quality, free of abrupt pitch fluctuations and of unnatural-sounding portions.

Hereinafter, a speech synthesis dictionary construction device according to an embodiment of the present invention will be described in detail.
(Embodiment 1)
First, the configuration of the speech synthesis dictionary construction device according to this embodiment will be described.
FIG. 1 is a diagram showing the configuration of the speech synthesis dictionary construction device according to Embodiment 1.

The speech synthesis dictionary construction device 100 shown in FIG. 1 includes a ROM 101, a storage unit 102, a data input/output interface (I/F) 103, a user interface (I/F) 104, and a CPU 105, which are interconnected by a bus 106.
The ROM 101 stores an operation program for HMM-based learning; in this embodiment in particular, the operation program includes operations for evaluating and editing pitch partial sequence data.

The storage unit 102 consists of a RAM 121 and a hard disk 122 and stores constants for learning, phoneme label sequences, speech data, pitch sequence data, pitch partial sequence data, and data associating phoneme labels with phoneme pitch information.

The data input/output I/F 103 is an interface for connecting to a hard disk 111 holding the original data, a hard disk 112 for recording processed data, and the like. The hard disk 111 corresponds to the speech database 201 shown in FIG. 2, and the hard disk 112 corresponds to the speech synthesis dictionary 202 shown in FIG. 2.

The data input/output I/F 103 is connected to the speech database 201 shown in FIG. 2 and, under the control of the CPU 105, reads out pairs of phoneme label sequences and speech data to be learned and stores them in the storage unit 102.

The data input/output I/F 103 is also connected to the speech synthesis dictionary 202 shown in FIG. 2 and outputs to it the correspondence between phoneme labels and phoneme pitch information that results from processing by the CPU 105.

The user I/F 104 consists of a keyboard (KB) 141 and a display (DSP) 142 and is provided for entering arbitrary instructions, data, and programs. In particular, for the pitch evaluation and editing process, the user must supply predetermined constants through this I/F.

The CPU 105 executes the operation program stored in the ROM 101, thereby implementing each unit shown in the functional configuration of FIG. 2 and carrying out the synthesis dictionary generation operation.

FIG. 2 is a functional configuration diagram of the speech synthesis dictionary construction device 100 shown in FIG. 1.
As shown in FIG. 2, the speech synthesis dictionary construction device 100 shown in FIG. 1 is connected to a speech database 201 and a speech synthesis dictionary 202.

The speech database 201 is a database made up of pairs of phoneme label sequences and the corresponding speech data, and is stored on a hard disk or the like.

The speech synthesis dictionary 202 is the database constructed by the speech synthesis dictionary construction device 100 shown in FIG. 1. It stores phoneme labels in association with phoneme learning results and is stored on a hard disk or the like.

The phoneme learning result includes phoneme pitch information. The other spectral information needed for speech synthesis varies with the specifications of the speech synthesizer, and the phoneme learning result is assumed to include such information as well.

Functionally, as shown in FIG. 2, the speech synthesis dictionary construction device 100 shown in FIG. 1 includes a data extraction unit 210, a pitch sequence data extraction unit 220, a phoneme cutout unit 225, a pitch data evaluation unit 230, a pitch data editing unit 240, a phoneme HMM learning unit 250, and a data writing unit 260.

The data extraction unit 210 reads data from the speech database 201 and separates it into a phoneme label sequence and speech data. The phoneme label sequence is passed to the phoneme HMM learning unit 250, and the speech data is passed to the pitch sequence data extraction unit 220.

The pitch sequence data extraction unit 220 extracts predetermined pitch sequence data from the speech data passed from the data extraction unit 210 and passes it to the phoneme cutout unit 225.

The phoneme cutout unit 225 cuts out pitch partial sequence data for each phoneme from the pitch sequence data passed from the pitch sequence data extraction unit 220 and passes it to the pitch data evaluation unit 230.

The pitch data evaluation unit 230 evaluates whether each piece of pitch partial sequence data passed from the phoneme cutout unit 225 is suitable for constructing a speech synthesis dictionary. Details of this evaluation process are described later with reference to FIGS. 3 and 4.

The pitch data editing unit 240 applies a predetermined editing process to the pitch partial sequence data passed from the pitch data evaluation unit 230 and generates edited pitch partial sequence data. Details of this editing process are described later with reference to FIGS. 3 and 4.

The edited pitch partial sequence data is passed to the phoneme HMM learning unit 250.

The phoneme HMM learning unit 250 converts, by HMM-based learning, the correspondence between the phoneme label sequence and the edited pitch partial sequence data into a correspondence between phoneme labels and phoneme pitch information (a phoneme HMM), and passes that correspondence to the data writing unit 260.

The data writing unit 260 records the correspondence between phoneme labels and phoneme pitch information in the speech synthesis dictionary 202.

The predetermined evaluation process executed by the pitch data evaluation unit 230 may be any process that evaluates whether pitch data is suitable for constructing a speech synthesis dictionary. Preferred specific examples of the evaluation process are described below.

In the following description, a frame means a time segment used for pitch extraction and is denoted by the symbol fm.

(Specific example 1 of the evaluation and editing process)
Specific example 1 of the evaluation and editing process is described with reference to the flowchart shown in FIG. 3.

In this specific example, the user sets the pitch difference threshold thpit in the storage unit 102 beforehand via the user I/F 104 of FIG. 1.

When the speech synthesis dictionary construction device 100 shown in FIG. 1 constructs the speech synthesis dictionary 202, the device is connected to the speech database 201 and to, for example, an empty speech synthesis dictionary 202.

When the user I/F 104 issues an instruction to start generating the speech synthesis dictionary 202, the data extraction unit 210 sequentially reads pairs of a phoneme label sequence $MnLab_m$ and speech data $Sp_m$ from the speech database 201 (where $1 \le m \le M_{SP}$ and $M_{SP}$ is the number of data items in the speech database) and stores them in the storage unit 102.

The pitch sequence data extraction unit 220 extracts pitch sequence data $Pit_m[fm]$ from the speech data $Sp_m$ (where $0 \le fm \le FLab_m$ and $FLab_m$ is the number of frames of the speech data $Sp_m$) and stores it in the storage unit 102 (step S1).

The phoneme cutout unit 225 then uses the phoneme label sequence $MnLab_m$ to perform phoneme cutout on the pitch sequence data $Pit_m$, generating pitch partial sequence data $pit_{L,n}$ (step S2). Here $0 \le L \le M_L$, where $M_L$ is the total number of phoneme labels, and $0 \le n \le N_L$, where $N_L$ is the total number of pitch partial sequence data items for the phoneme label.
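The cutout in step S2 can be pictured as slicing one utterance's pitch sequence at phoneme boundaries and grouping the slices by label. The Python sketch below is only an illustration: the source says the phoneme label sequence is used for the cutout but does not describe its format, so the `(label, start_frame, end_frame)` tuples and the function name `cut_out_phonemes` are assumptions.

```python
from collections import defaultdict


def cut_out_phonemes(pitch_seq, labels):
    """Split one utterance's pitch sequence into per-phoneme segments.

    pitch_seq: list of per-frame pitch values, Pit_m[fm].
    labels:    list of (phoneme_label, start_frame, end_frame) with an
               inclusive end frame (an assumed, hypothetical format).
    Returns a dict mapping each label L to its list of segments pit_{L,n}.
    """
    segments = defaultdict(list)
    for label, start, end in labels:
        segments[label].append(pitch_seq[start:end + 1])
    return dict(segments)


pitch = [120.0, 122.0, 121.0, 180.0, 182.0, 119.0, 118.0]
labels = [("a", 0, 2), ("i", 3, 4), ("a", 5, 6)]
segments = cut_out_phonemes(pitch, labels)
```

Here the label "a" occurs twice, so it ends up with two segments (n = 0 and n = 1), while "i" has one.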

For the pitch partial sequence data group $pit_{L,n}[fm]$ (where $0 \le fm \le F_{L,n}$ and $F_{L,n}$ is the total number of frames of the pitch partial sequence data $pit_{L,n}$), the pitch data evaluation unit 230 calculates an average value $ave_{L,n}$ for each pitch partial sequence data item $pit_{L,n}$ according to the following equation (step S3).
$$ave_{L,n} = \frac{1}{F_{L,n}+1} \sum_{fm=0}^{F_{L,n}} pit_{L,n}[fm]$$
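Because the frame index runs from 0 to $F_{L,n}$ inclusive, each segment holds $F_{L,n}+1$ values, which is why the divisor in the average is $F_{L,n}+1$. A minimal Python sketch of step S3, assuming a segment is a plain list of per-frame pitch values (the name `segment_mean` is hypothetical):

```python
def segment_mean(segment):
    """Average pitch of one segment pit_{L,n}.

    The segment holds frames fm = 0 .. F, i.e. F + 1 values, so dividing
    by len(segment) matches the 1 / (F + 1) factor in the equation.
    """
    return sum(segment) / len(segment)


ave = segment_mean([100.0, 110.0, 120.0])  # F = 2, three frames
```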

Then, in the following steps, the difference between the pitch partial sequence data $pit_{L,n}$ and its average value $ave_{L,n}$ is calculated for each phoneme label and each pitch partial sequence data item $pit_{L,n}$, and each $pit_{L,n}$ is evaluated on the basis of this difference.

First, the pointer L, which specifies the number identifying a phoneme label, is initialized to "1" (step S4).

Next, the pointer n, which specifies the number identifying each pitch partial sequence data item, is initialized to "1" (step S5).

These pointers L and n select the pitch partial sequence data group $pit_{L,n}[fm]$ (where $0 \le fm \le F_{L,n}$) in the storage unit 102.

Then the pointer fm, which indicates the frame number within the (L, n)-th pitch partial sequence data, is initialized to "0" (step S6).

Using this pointer fm to select $pit_{L,n}[fm]$, the device determines, according to the following expression (1), whether the difference between the frame fm being processed and the average value is at or below the predetermined threshold thpit (step S7).
$$\left| pit_{L,n}[fm] - ave_{L,n} \right| \le thpit \qquad (1)$$

If step S7 determines that the difference is not at or below the predetermined threshold (step S7; No), the pitch partial sequence data $pit_{L,n}$ is deleted (step S8).

If step S7 determines that the difference is at or below the predetermined threshold (step S7; Yes), the device determines whether processing has been completed for all fm (step S9).

If that processing has not been completed (step S9; No), fm is incremented by "1" in step S12 and the process returns to step S7.

If step S9 determines that processing has been completed for all fm (step S9; Yes), the device determines whether processing has been completed for all n (step S10).

If that processing has not been completed (step S10; No), n is incremented by "1" in step S13 and the process returns to step S6.

If step S10 determines that processing has been completed for all n (step S10; Yes), the device determines whether processing has been completed for all L (step S11).

If that processing has not been completed (step S11; No), L is incremented by "1" in step S14 and the process returns to step S5.

If step S11 determines that processing has been completed for all L (step S11; Yes), the process ends.
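The nested loops of steps S4 to S14 amount to keeping a segment only if every one of its frames stays within thpit of the segment's own average. The following Python sketch shows that logic under stated assumptions: segments are stored in a dict keyed by phoneme label (a hypothetical layout), and the function names are illustrative rather than taken from the source.

```python
def is_suitable(segment, thpit):
    """Expression (1): every frame must lie within thpit of the mean.

    all() stops at the first failing frame, which mirrors leaving the
    frame loop as soon as step S7 answers No.
    """
    ave = sum(segment) / len(segment)
    return all(abs(p - ave) <= thpit for p in segment)


def filter_segments(segments_by_label, thpit):
    """Drop every segment that fails expression (1) (step S8)."""
    return {
        label: [seg for seg in segs if is_suitable(seg, thpit)]
        for label, segs in segments_by_label.items()
    }


segments = {
    "a": [[120.0, 122.0, 121.0], [118.0, 240.0, 119.0]],
    "i": [[180.0, 182.0]],
}
kept = filter_segments(segments, thpit=20.0)
```

In this made-up data the second "a" segment contains a 240 Hz frame far from its mean of 159 Hz, so it is removed, while the other segments are kept.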

Through the above processing, all $pit_{L,n}[fm]$ taken from the speech database 201 of FIG. 2 that can be used for learning are stored in the storage unit 102 of FIG. 1. These $pit_{L,n}[fm]$ are used by the phoneme HMM learning unit 250 of FIG. 2.

In this specific example, the pitch partial sequence data used by the phoneme HMM learning unit 250 shown in FIG. 2 is data from which the pitch data editing unit 240 has already removed items that differ markedly from the rest of the pitch partial sequence data group. This makes it possible to construct a speech synthesis dictionary that can synthesize more natural speech.

(Specific example 2 of the evaluation and editing process)
Next, specific example 2 of the evaluation and editing process is described with reference to the flowchart shown in FIG. 4.

In specific example 1, evaluation and editing are performed by repeating step S7 of FIG. 3 once per frame fm, so processing takes time when the number of frames is large.

In this specific example, therefore, the determination is made according to expression (2), described later, instead of frame by frame, which shortens the processing time.

The flow of operation in this specific example is shown in FIG. 4 and is basically the same as in specific example 1, described with reference to FIG. 3.

Accordingly, steps in FIG. 4 that perform the same processing as in FIG. 3 are given the same reference numerals.

The main differences between this specific example and specific example 1 are that step S24 is added, and that step S27 replaces step S7, with the loop of steps S9 and S12 removed.

In step S24, the average value $AVE_L$ of the pitch partial sequence data belonging to the same monophone label is calculated according to the following equation.
$$AVE_L = \frac{1}{N_L} \sum_{n=0}^{N_L} ave_{L,n}$$

Then, in step S27, the pitch partial sequence data for each monophone label is evaluated by determining, according to the following expression (2), whether the difference between the frame average $ave_{L,n}$ of each pitch partial sequence data item and $AVE_L$ does not exceed a predetermined threshold.
$$\left| ave_{L,n} - AVE_L \right| \le THpit \qquad (2)$$
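A minimal Python sketch of steps S24 and S27 follows. It compares each segment's mean with the mean of all segment means under the same label, so only one comparison per segment is needed instead of one per frame. The dict layout and the name `filter_by_label_mean` are assumptions, and the sketch divides by the actual number of segments under the label, which is an interpretation of the normalisation in the equation for $AVE_L$.

```python
def filter_by_label_mean(segments_by_label, th_pit):
    """Keep segments whose mean is within th_pit of the label-wide mean.

    segments_by_label: dict mapping a monophone label to a list of
                       segments (hypothetical layout).
    th_pit:            the threshold THpit of expression (2).
    """
    kept = {}
    for label, segs in segments_by_label.items():
        if not segs:
            kept[label] = []
            continue
        means = [sum(seg) / len(seg) for seg in segs]   # ave_{L,n}
        label_mean = sum(means) / len(means)            # AVE_L (assumed divisor)
        kept[label] = [
            seg for seg, m in zip(segs, means)
            if abs(m - label_mean) <= th_pit
        ]
    return kept


segments = {"a": [[120.0, 122.0], [118.0, 120.0], [200.0, 204.0]]}
kept = filter_by_label_mean(segments, th_pit=30.0)
```

With these made-up values the segment means are 121, 119, and 202 Hz and the label mean is about 147.3 Hz, so only the third segment exceeds the 30 Hz threshold and is dropped.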

In this specific example, the user sets the pitch difference threshold THpit in the storage unit 102 beforehand via the user I/F 104 of FIG. 1.

Thus, this specific example can provide a speech synthesis dictionary construction device that contributes to the output of natural synthesized speech without abrupt pitch fluctuations and that also builds the dictionary efficiently.

Two specific examples of the evaluation and editing process have been given above, but the process is not limited to these. Any process may be used as long as it removes abrupt fluctuations in the pitch data.

For example, as for editing, instead of deleting pitch partial sequence data that does not satisfy the condition of expression (1) or (2), the data $pit_{L,n}$ that fails the condition may be replaced with the preceding data $pit_{L,n-1}$ or the following data $pit_{L,n+1}$.
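The replacement variant could look like the Python sketch below. The source only says that the failing data may be replaced with the preceding or the following data; preferring the preceding neighbour, requiring the neighbour itself to have passed the evaluation, and dropping the segment when neither neighbour qualifies are all assumptions made for this illustration, as is the function name.

```python
def replace_unsuitable(segs, suitable):
    """Replace failing segments of one label with a neighbouring segment.

    segs:     list of segments pit_{L,n} for one phoneme label.
    suitable: parallel list of booleans from the evaluation step.
    """
    result = []
    for n, seg in enumerate(segs):
        if suitable[n]:
            result.append(seg)
        elif n > 0 and suitable[n - 1]:
            result.append(segs[n - 1])       # use pit_{L,n-1}
        elif n + 1 < len(segs) and suitable[n + 1]:
            result.append(segs[n + 1])       # use pit_{L,n+1}
        # assumed fallback: drop the segment if no neighbour qualifies
    return result


segs = [[1.0], [9.0], [2.0]]
edited = replace_unsuitable(segs, [True, False, True])
```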

To make the explanation easier to follow, the above describes an example in which the data extraction unit 210 first reads all of the data from the speech database 201 of FIG. 2 into the storage unit 102 shown in FIG. 1, but such batch processing is not an essential requirement of this embodiment. For example, depending on the specifications of the phoneme HMM learning unit 250 shown in FIG. 2, the speech synthesis dictionary could also be constructed more dynamically.

(Embodiment 2)
In Embodiment 1, the speech synthesis dictionary construction device 100 shown in FIG. 1 associates phoneme labels with phoneme pitch information. The present invention is not limited to this and is also applicable when phoneme labels are associated with both phoneme pitch information and phoneme spectral parameter information.

The following describes a speech synthesis dictionary construction device 500 according to Embodiment 2, which associates phoneme labels with phoneme pitch information and phoneme spectrum parameter information and writes the result to the speech synthesis dictionary.

As shown in FIG. 5, the speech synthesis dictionary construction device 500 according to this embodiment includes a data extraction unit 210 and a pitch data evaluation unit 230. These units have the same configuration and functions as the corresponding units of the speech synthesis dictionary construction device 100 according to Embodiment 1 shown in FIG. 1.

The speech synthesis dictionary construction device 500 further includes a sequence data extraction unit 520, a phoneme segmentation unit 525, a data editing unit 540, a phoneme HMM learning unit 550, and a data writing unit 560.

In addition to the functions of the pitch sequence data extraction unit 220, the sequence data extraction unit 520 performs D-th order mel-cepstrum analysis on the speech data retrieved by the data extraction unit 210 and generates mel-cepstrum sequence data Mcep^d_m (0 ≤ d ≤ D).

In addition to the functions of the phoneme segmentation unit 225, the phoneme segmentation unit 525 performs phoneme segmentation on the mel-cepstrum sequence data Mcep^d_m and generates mel-cepstrum partial sequence data mcep^d_{L,n} for each identical phoneme label.

In addition to the functions of the pitch data editing unit 240, the data editing unit 540 deletes the mel-cepstrum partial sequence data mcep^d_{L,n} corresponding to any pitch partial sequence data pit_{L,n} that was deleted because the pitch data evaluation unit 230 evaluated it as unusable as learning data.
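The paired deletion performed by the data editing unit 540 can be illustrated with a minimal sketch. The dictionary layout keyed by (L, n), the function name `drop_paired_segments`, and the sample values are assumptions made for illustration; the source does not specify how the segments are stored.

```python
from typing import Dict, List, Set, Tuple

Key = Tuple[str, int]  # (phoneme label L, index n within that label)


def drop_paired_segments(
    pit: Dict[Key, List[float]],
    mcep: Dict[Key, List[List[float]]],
    rejected: Set[Key],
) -> Tuple[Dict[Key, List[float]], Dict[Key, List[List[float]]]]:
    """Remove every rejected pit_{L,n} together with its mcep_{L,n}."""
    kept_pit = {k: v for k, v in pit.items() if k not in rejected}
    kept_mcep = {k: v for k, v in mcep.items() if k not in rejected}
    return kept_pit, kept_mcep


pit = {("a", 0): [100.0, 101.0], ("a", 1): [100.0, 180.0]}
mcep = {("a", 0): [[0.1, 0.2], [0.1, 0.3]], ("a", 1): [[0.4, 0.5], [0.9, 0.1]]}
rejected = {("a", 1)}  # keys the pitch data evaluation judged unusable

kept_pit, kept_mcep = drop_paired_segments(pit, mcep, rejected)
```

Keying both streams by the same (L, n) pair keeps pitch and spectrum segments aligned, so the later HMM learning step sees the same set of segments in both.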

The phoneme HMM learning unit 550 learns, based on the HMM, the correspondence between the phoneme label string and the corresponding edited pitch sequence data, converts it into information (a phoneme HMM) indicating the correspondence between phoneme labels and phoneme pitch information, and passes that information to the data writing unit 560. The phoneme HMM learning unit 550 likewise learns, based on the HMM, the correspondence between the phoneme label string and the mel-cepstrum sequence data, converts it into information indicating the correspondence between phoneme labels and phoneme spectrum parameter information, and passes that information to the data writing unit 560.

The data writing unit 560 writes the correspondence between phoneme labels and phoneme pitch information, and the correspondence between phoneme labels and phoneme spectrum parameter information, to the speech synthesis dictionary 202.

With the speech synthesis dictionary 202 constructed in this way, high-quality speech can be synthesized using the phoneme pitch information and phoneme spectrum parameter information for each phoneme label.

(Embodiment 3)
Embodiment 3 modifies Embodiment 1 so that, when pitch partial sequence data pit_{L,n} is deleted, the pitch sequence data Pit_m from which that partial sequence data was cut out is also deleted.

This is realized by modifying specific example 1 of the evaluation/editing process shown in FIG. 3 as shown in FIG. 6. The added portion (steps S61 to S68) is described below.

After the speech data is retrieved (step S1), a counter m for counting the speech data is provided and initialized to 1 (step S61), and a flag DEL indicating whether pitch partial sequence data has been deleted is provided and initialized to 0 (step S63).

Then, after the pitch partial sequence data pit_{L,n} is deleted (step S8), the deletion flag DEL is set to 1 (step S64).

After processing of the speech data indicated by the counter m is finished (step S11; Yes), the deletion flag DEL is checked (step S65). If it is set to 1, the pitch sequence data Pit_m is deleted (step S66).

It is then determined whether all speech data has been processed (step S67). If not, m is incremented (step S68) and processing proceeds to the next speech data (step S63 onward).
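The added steps S61 to S68 can be sketched as a loop over utterances. This is a simplified sketch under stated assumptions: each Pit_m is represented as a list of its already cut-out segments, the suitability test is passed in as a hypothetical `is_suitable` callable, and the counter is 0-based here although the flowchart starts m at 1.

```python
from typing import Callable, List, Tuple

Segment = List[float]      # one pit_{L,n}
Utterance = List[Segment]  # segments cut out of one Pit_m


def prune_utterances(
    utterances: List[Utterance],
    is_suitable: Callable[[Segment], bool],
) -> Tuple[List[Utterance], List[int]]:
    """Drop every utterance in which at least one segment was deleted."""
    kept: List[Utterance] = []
    deleted_indices: List[int] = []
    m = 0                                # S61 (0-based in this sketch)
    while m < len(utterances):           # S67: all speech data processed?
        deleted = False                  # S63: DEL = 0
        for seg in utterances[m]:
            if not is_suitable(seg):     # segment rejected and deleted (S8)
                deleted = True           # S64: DEL = 1
        if deleted:                      # S65: check DEL
            deleted_indices.append(m)    # S66: delete Pit_m
        else:
            kept.append(utterances[m])
        m += 1                           # S68
    return kept, deleted_indices


utterances = [
    [[100.0, 101.0], [102.0, 103.0]],
    [[100.0, 180.0], [102.0, 103.0]],  # first segment jumps by 80
]
# Hypothetical test: a segment is suitable if its range is at most 20.
kept, deleted_indices = prune_utterances(
    utterances, lambda seg: max(seg) - min(seg) <= 20.0
)
```

The flag is reset per utterance, so one bad segment removes only the utterance that contains it.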

Similarly, specific example 2 of the evaluation/editing process shown in FIG. 4 may be modified as shown in FIG. 7 so that, when pitch partial sequence data pit_{L,n} is deleted, the pitch sequence data Pit_m from which it was cut out is also deleted. The added portion of FIG. 7 is the same as that of FIG. 6.

As a modification of Embodiment 3, when pitch partial sequence data and mel-cepstrum partial sequence data are deleted, the pitch sequence data and mel-cepstrum sequence data from which they were cut out may also be deleted. This is realized by executing the process of FIG. 6 or FIG. 7 on the device of FIG. 5.

(Embodiment 4)
Embodiment 4 extends Embodiment 3: in addition to deleting the pitch partial sequence data and the pitch sequence data, it also deletes the speech data from which the pitch sequence data was generated. It further counts the number of deleted speech data items and, when the count exceeds a predetermined threshold, re-collects all speech data.

This is realized by modifying specific example 1 of the evaluation/editing process shown in FIG. 6 as shown in FIG. 8. The added portion (steps S62 and S80 to S83) is described below.

After the speech data is retrieved (step S1), a counter Cnt for counting the number of deleted speech data items is provided and initialized to 0 (step S62). Then, when the pitch partial sequence data pit_{L,n} and the pitch sequence data Pit_m are deleted, the speech data Sp_m from which they were generated is also deleted (step S80), and the counter Cnt is incremented (step S81).

After all speech data has been processed (step S67; Yes), it is determined whether the count value of the counter Cnt exceeds a predetermined threshold Thdel (step S82). If it does (step S82; Yes), the CPU 105 of FIG. 1 notifies the user of this, for example via the user I/F 104, instructs the user to re-record, and has the speech data collected again (step S83).
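The counting and threshold check of steps S62 and S80 to S83 can be sketched as below. The function name, the `is_suitable` callable, and the representation of each utterance as a list of segments are illustrative assumptions; the notification and re-recording of step S83 are reduced to a boolean that the caller would act on.

```python
from typing import Callable, List, Tuple

Segment = List[float]
Utterance = List[Segment]  # segments derived from one speech datum Sp_m


def prune_and_check(
    utterances: List[Utterance],
    is_suitable: Callable[[Segment], bool],
    th_del: int,
) -> Tuple[List[Utterance], int, bool]:
    """Delete offending utterances, count them, and flag re-recording."""
    cnt = 0                                        # S62: Cnt = 0
    kept: List[Utterance] = []
    for utt in utterances:
        if any(not is_suitable(seg) for seg in utt):
            cnt += 1                               # S80 delete Sp_m, S81 Cnt++
        else:
            kept.append(utt)
    needs_rerecording = cnt > th_del               # S82: Cnt > Thdel?
    return kept, cnt, needs_rerecording            # caller performs S83


utterances = [
    [[100.0, 101.0]],
    [[100.0, 180.0]],
    [[90.0, 150.0], [100.0, 101.0]],
]
kept, cnt, needs_rerecording = prune_and_check(
    utterances, lambda seg: max(seg) - min(seg) <= 20.0, th_del=1
)
```

Note that the comparison is strict, matching the wording "exceeds" in step S82: a count equal to the threshold does not trigger re-recording.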

This makes it possible to construct a higher-quality speech synthesis dictionary.
Similarly, specific example 2 of the evaluation/editing process shown in FIG. 7 may be modified as shown in FIG. 9 so that the pitch partial sequence data and the pitch sequence data are deleted and the speech data from which the pitch sequence data was generated is deleted as well. The speech data is deleted because it is likely to yield pitch sequence data with significant pitch fluctuation again. The added portion of FIG. 9 is the same as that of FIG. 8.

As a modification of Embodiment 4, when pitch partial sequence data and mel-cepstrum partial sequence data are deleted, the pitch sequence data and mel-cepstrum sequence data from which they were cut out may be deleted, and the speech data from which these were generated may be deleted as well. This is realized by executing the process of FIG. 8 or FIG. 9 on the device of FIG. 5.
The present invention is not limited to the above embodiments, and various modifications and applications are possible.

For example, the hardware configuration, block configurations, and flowcharts described above are examples and are not limiting.

The present invention is also not limited to a dedicated speech synthesis dictionary construction device and can be implemented using any computer. For example, a program for causing a general-purpose computer to execute the above processing may be distributed on a recording medium or via communication, then installed and executed on that computer so that it functions as the speech synthesis dictionary construction device of the present invention.

FIG. 1 is a diagram showing the physical configuration of the speech synthesis dictionary construction device according to Embodiment 1.
FIG. 2 is a functional configuration diagram of the speech synthesis dictionary construction device with pitch data evaluation and editing units according to Embodiment 1.
FIG. 3 is a diagram showing the flow of operation in specific example 1 of the pitch data evaluation/editing process.
FIG. 4 is a diagram showing the flow of operation in specific example 2 of the pitch data evaluation/editing process.
FIG. 5 is a functional configuration diagram of the speech synthesis dictionary construction device with spectrum analysis according to Embodiment 2.
FIG. 6 is a diagram showing the flow of operation in specific example 1 of the pitch data evaluation/editing process according to Embodiment 3.
FIG. 7 is a diagram showing the flow of operation in specific example 2 of the pitch data evaluation/editing process according to Embodiment 3.
FIG. 8 is a diagram showing the flow of operation in specific example 1 of the pitch data evaluation/editing process according to Embodiment 4.
FIG. 9 is a diagram showing the flow of operation in specific example 2 of the pitch data evaluation/editing process according to Embodiment 4.

Explanation of symbols

100 ... speech synthesis dictionary construction device, 101 ... ROM, 102 ... storage unit, 103 ... data input/output I/F, 104 ... user I/F, 105 ... CPU, 106 ... bus, 111 ... hard disk containing original data, 112 ... hard disk for recording processed data, 121 ... RAM, 122 ... hard disk, 141 ... keyboard, 142 ... display, 201 ... speech database, 202 ... speech synthesis dictionary, 210 ... data extraction unit, 220 ... pitch sequence data extraction unit, 225 ... phoneme segmentation unit, 230 ... pitch data evaluation unit, 240 ... pitch data editing unit, 250 ... phoneme HMM learning unit, 260 ... data writing unit, 500 ... speech synthesis dictionary construction device, 520 ... sequence data extraction unit, 525 ... phoneme segmentation unit, 540 ... data editing unit, 550 ... phoneme HMM learning unit, 560 ... data writing unit

Claims

An apparatus for constructing a speech synthesis dictionary,
A receiver for receiving a phoneme label string and corresponding voice data;
A pitch sequence data extraction unit for extracting pitch sequence data from the received audio data;
A phoneme cutout unit that cuts out pitch partial sequence data for each phoneme from the extracted pitch sequence data;
A pitch data evaluation unit that evaluates whether the extracted pitch partial series data is suitable for constructing a speech synthesis dictionary;
A pitch data editing unit for editing pitch partial series data evaluated to be suitable for building a speech synthesis dictionary;
A phoneme HMM learning unit that associates a pitch-related phoneme HMM with each phoneme label by learning, based on a hidden Markov model (HMM), from the phoneme label string and the edited pitch partial sequence data;
A data writer for recording the learning results in the speech synthesis dictionary;
A speech synthesis dictionary construction device comprising:

Mel-cepstrum partial sequence data corresponding to pitch partial sequence data that the pitch data evaluation unit has evaluated as not suitable for constructing a speech synthesis dictionary is likewise evaluated as not suitable for constructing a speech synthesis dictionary,
The speech synthesis dictionary construction apparatus according to claim 1.

The pitch data evaluation unit evaluates that it is suitable for building a speech synthesis dictionary when the difference between the pitch partial series data and the average value does not exceed a predetermined threshold in all frames.
The speech synthesis dictionary construction device according to claim 1 or 2.

The pitch data evaluation unit evaluates pitch partial sequence data as suitable for constructing a speech synthesis dictionary when the difference between the average value of that pitch partial sequence data and the average value of the pitch partial sequence data belonging to the same monophone label does not exceed a predetermined threshold,
The speech synthesis dictionary construction device according to claim 1 or 2.

Deleting the pitch partial sequence data and also deleting the pitch sequence data from which the pitch partial sequence data was cut out;
The speech synthesis dictionary construction apparatus according to claim 1.

Deleting the mel-cepstrum partial sequence data and also deleting the mel-cepstrum sequence data from which the mel-cepstrum partial sequence data was cut out;
The speech synthesis dictionary construction device according to claim 2.

Count the number of audio data from which the pitch partial series data has been deleted, and re-collect all audio data when the count value exceeds a predetermined threshold.
The speech synthesis dictionary construction device according to claim 1 or 2.

A method for building a speech synthesis dictionary,
A receiving step of receiving a phoneme label string and corresponding voice data from the database;
A pitch sequence data extraction step for extracting pitch sequence data from the received audio data;
A phoneme extraction step of extracting pitch partial sequence data for each phoneme from the extracted pitch sequence data;
A pitch data evaluation step for evaluating whether the extracted pitch subsequence data is suitable for constructing a speech synthesis dictionary;
A pitch data editing step for editing pitch subsequence data evaluated to be suitable for building a speech synthesis dictionary;
A phoneme HMM learning step in which a phoneme HMM associated with a pitch is associated with each phoneme label by learning based on a hidden Markov model from the phoneme label sequence and the edited pitch subsequence data;
An output step for outputting the learning result;
A speech synthesis dictionary construction method comprising:

A program that causes a computer to execute:
A receiving step of receiving a phoneme label string and corresponding voice data from the database;
A pitch sequence data extraction step for extracting pitch sequence data from the received audio data;
A phoneme extraction step of extracting pitch partial sequence data for each phoneme from the extracted pitch sequence data;
A pitch data evaluation step for evaluating whether the extracted pitch subsequence data is suitable for constructing a speech synthesis dictionary;
A pitch data editing step for editing pitch subsequence data evaluated to be suitable for building a speech synthesis dictionary;
A phoneme HMM learning step in which a phoneme HMM associated with a pitch is associated with each phoneme label by learning based on a hidden Markov model from the phoneme label sequence and the edited pitch subsequence data; and
An output step for outputting the learning result.