JP5326546B2

JP5326546B2 - Speech synthesis dictionary construction device, speech synthesis dictionary construction method, and program

Info

Publication number: JP5326546B2
Application number: JP2008324527A
Authority: JP
Inventors: 勝彦佐藤
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2008-12-19
Filing date: 2008-12-19
Publication date: 2013-10-30
Anticipated expiration: 2028-12-19
Also published as: JP2010145855A

Abstract

PROBLEM TO BE SOLVED: To prevent synthesized speech from sounding unnatural by deleting mel-cepstrum partial sequence data that could lead to a speech synthesis dictionary producing significant fluctuations in tone quality.

SOLUTION: A mel-cepstrum sequence data generating part 220 generates mel-cepstrum sequence data from speech data. A phoneme cut-out part 225 cuts out mel-cepstrum partial sequence data for each phoneme from the mel-cepstrum sequence data. A mel-cepstrum data evaluating part 230 determines whether each piece of mel-cepstrum partial sequence data has a speech feature value that could lead to a speech synthesis dictionary producing significant fluctuations in tone quality. A mel-cepstrum data editing part 240 deletes such mel-cepstrum partial sequence data. A phoneme HMM learning part 250 then finds the correspondence between phoneme labels and phoneme mel-cepstrum data. A data writing part 260 writes the correspondence into the speech synthesis dictionary 202.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a speech synthesis dictionary construction device, a speech synthesis dictionary construction method, and a program for constructing a database used for speech synthesis.

Speech recognition and speech synthesis technologies based on the hidden Markov model (hereinafter, HMM) are widely used.

HMM-based speech recognition and speech synthesis technologies are disclosed in, for example, Patent Documents 1 and 2.

HMM-based speech synthesis requires a speech synthesis dictionary that records the correspondence between phoneme labels and data such as spectrum parameter data sequences.

A speech synthesis dictionary is usually constructed by performing spectrum analysis and pitch extraction on the data recorded in a database made up of pairs of a phoneme label sequence and its corresponding speech data (hereinafter, the speech database), followed by an HMM-based learning process.

Conventionally, when a speech synthesis dictionary was constructed, the mel-cepstrum coefficients generated from the speech data were used as-is for HMM-based learning, even when some of them differed markedly from the other mel-cepstrum coefficients.

Patent Document 1: Japanese Patent Laid-Open No. 2002-244689
Patent Document 2: Japanese Patent Laid-Open No. 2002-268660

However, when speech is synthesized using a speech synthesis dictionary constructed in this way, the generated speech may contain unnatural portions.

For this reason, speech synthesized using a dictionary built by a conventional speech synthesis dictionary construction device has sometimes sounded unnatural compared with natural human speech.

The present invention has been made in view of the above circumstances. Its object is to provide a speech synthesis dictionary construction device, a speech synthesis dictionary construction method, and a program that make it possible to construct a speech synthesis dictionary that does not synthesize unnatural speech.

To achieve the above object, a speech synthesis dictionary construction device according to a first aspect of the present invention is
a device for constructing a speech synthesis dictionary, comprising:
a receiving unit that receives a phoneme label sequence and the speech data corresponding to it;
a mel-cepstrum sequence data generation unit that generates mel-cepstrum sequence data from the whole of the speech data by generating mel-cepstrum coefficients frame by frame from the received speech data;
a phoneme segmentation unit that cuts out, for each phoneme label, the mel-cepstrum partial sequence data groups corresponding to that phoneme label from the generated mel-cepstrum sequence data;
a mel-cepstrum data evaluation unit that calculates an average value for each cut-out mel-cepstrum partial sequence data group, determines whether the group contains frame-level mel-cepstrum partial sequence data whose difference from that average value exceeds a threshold, and, if it does, evaluates the group as unsuitable for constructing a speech synthesis dictionary;
a mel-cepstrum data editing unit that edits the mel-cepstrum partial sequence data by deleting the mel-cepstrum partial sequence data groups evaluated as unsuitable for constructing a speech synthesis dictionary;
a phoneme HMM learning unit that, from the phoneme label sequence and the edited mel-cepstrum partial sequence data, associates each phoneme label with a phoneme HMM for the mel-cepstrum through learning based on the hidden Markov model (HMM); and
a data writing unit that records the result of the learning in the speech synthesis dictionary.

A speech synthesis dictionary construction device according to a second aspect is
a device for constructing a speech synthesis dictionary, comprising:
a receiving unit that receives a phoneme label sequence and the speech data corresponding to it;
a mel-cepstrum sequence data generation unit that generates mel-cepstrum sequence data from the whole of the speech data by generating mel-cepstrum coefficients frame by frame from the received speech data;
a phoneme segmentation unit that cuts out, for each phoneme label, the mel-cepstrum partial sequence data groups corresponding to that phoneme label from the generated mel-cepstrum sequence data;
a mel-cepstrum data evaluation unit that calculates a first average value for each cut-out mel-cepstrum partial sequence data group, calculates a second average value over all mel-cepstrum partial sequence data groups belonging to the same phoneme label, and evaluates any group whose first average value differs from the second average value by more than a threshold as unsuitable for constructing a speech synthesis dictionary;
a mel-cepstrum data editing unit that edits the mel-cepstrum partial sequence data by deleting the mel-cepstrum partial sequence data groups evaluated as unsuitable for constructing a speech synthesis dictionary;
a phoneme HMM learning unit that, from the phoneme label sequence and the edited mel-cepstrum partial sequence data, associates each phoneme label with a phoneme HMM for the mel-cepstrum through learning based on the hidden Markov model (HMM); and
a data writing unit that records the result of the learning in the speech synthesis dictionary.

According to the present invention, mel-cepstrum partial sequence data whose feature values, obtained by speech spectrum analysis, differ markedly from those of the other mel-cepstrum partial sequence data are deleted from the data generated from the speech data, and the remaining data are used to train the speech synthesis dictionary. As a result, speech synthesized with that dictionary can be of high quality, without unnatural portions.

The speech synthesis dictionary construction device according to embodiments of the present invention is described in detail below.
(Embodiment 1)
First, the configuration of the speech synthesis dictionary construction device according to this embodiment is described.
FIG. 1 is a diagram showing the configuration of the speech synthesis dictionary construction device according to Embodiment 1.

The speech synthesis dictionary construction device 100 shown in FIG. 1 includes a ROM 101, a storage unit 102, a data input/output interface (I/F) 103, a user interface (I/F) 104, and a CPU 105, which are interconnected by a bus 106.
The ROM 101 stores an operation program for HMM-based learning. In this embodiment in particular, the program includes operations for evaluating and editing mel-cepstrum partial sequence data.

The storage unit 102 consists of a RAM 121 and a hard disk 122. It stores constants for learning, phoneme label sequences, speech data, mel-cepstrum sequence data, mel-cepstrum partial sequence data, and data associating phoneme labels with phoneme mel-cepstrum information.

The data input/output I/F 103 is an interface for connecting to devices such as a hard disk 111 containing the original data and a hard disk 112 for recording processed data.

The data input/output I/F 103 is connected to the speech database 201 shown in FIG. 2. Under the control of the CPU 105, it reads the pairs of a phoneme label sequence and speech data to be learned and stores them in the storage unit 102.

The data input/output I/F 103 is also connected to the speech synthesis dictionary 202 shown in FIG. 2, and outputs to it the correspondence between phoneme labels and phoneme mel-cepstrum information that results from processing by the CPU 105.

The user I/F 104 consists of a keyboard (KB) 141 and a display (DSP) 142, and is provided for entering arbitrary instructions, data, and programs. In particular, in the mel-cepstrum data evaluation and editing process, the user must supply predetermined constants through this I/F.

By executing the operation program stored in the ROM 101, the CPU 105 implements the units shown in the functional configuration of FIG. 2 and performs the synthesis dictionary generation operation.

FIG. 2 is a functional configuration diagram of the speech synthesis dictionary construction device 100 shown in FIG. 1.
As shown in FIG. 2, the speech synthesis dictionary construction device 100 shown in FIG. 1 is connected to a speech database 201 and a speech synthesis dictionary 202.

The speech database 201 is a database made up of pairs of a phoneme label sequence and its corresponding speech data, and is stored on a hard disk or the like.

The speech synthesis dictionary 202 is a database constructed by the speech synthesis dictionary construction device 100 shown in FIG. 1, and stores phoneme labels in association with phoneme learning results. The speech synthesis dictionary 202 is stored on a hard disk or the like.

The phoneme learning result includes phoneme mel-cepstrum information. The other information needed for speech synthesis, such as pitch information, varies with the specifications of the speech synthesizer, and the phoneme learning result is assumed to include such information as well.

Functionally, as shown in FIG. 2, the speech synthesis dictionary construction device 100 shown in FIG. 1 includes a data extraction unit 210, a mel-cepstrum sequence data generation unit 220, a phoneme segmentation unit 225, a mel-cepstrum data evaluation unit 230, a mel-cepstrum data editing unit 240, a phoneme HMM learning unit 250, and a data writing unit 260.

The data extraction unit 210 reads data from the speech database 201 and separates it into a phoneme label sequence and speech data. The phoneme label sequence is passed to the phoneme HMM learning unit 250, and the speech data is passed to the mel-cepstrum sequence data generation unit 220.

The mel-cepstrum sequence data generation unit 220 generates predetermined mel-cepstrum sequence data from the speech data received from the data extraction unit 210 and passes it to the phoneme segmentation unit 225.

The phoneme segmentation unit 225 cuts out mel-cepstrum partial sequence data for each phoneme from the mel-cepstrum sequence data received from the mel-cepstrum sequence data generation unit 220 and passes it to the mel-cepstrum data evaluation unit 230.

The mel-cepstrum data evaluation unit 230 evaluates whether each piece of mel-cepstrum partial sequence data received from the phoneme segmentation unit 225 is suitable for constructing a speech synthesis dictionary. The details of this evaluation process are described later with reference to FIGS. 3 and 4.

The mel-cepstrum data editing unit 240 applies a predetermined editing process to the mel-cepstrum partial sequence data received from the mel-cepstrum data evaluation unit 230 and generates edited mel-cepstrum partial sequence data. The details of this editing process are described later with reference to FIGS. 3 and 4.

The edited mel-cepstrum partial sequence data is passed to the phoneme HMM learning unit 250.

Through HMM-based learning, the phoneme HMM learning unit 250 converts the correspondence between the phoneme label sequence and the edited mel-cepstrum partial sequence data into a correspondence between phoneme labels and phoneme mel-cepstrum information (phoneme HMMs), and passes that correspondence to the data writing unit 260.

The data writing unit 260 records the correspondence between phoneme labels and phoneme mel-cepstrum information in the speech synthesis dictionary 202.

The predetermined evaluation process executed by the mel-cepstrum data evaluation unit 230 may be any process that evaluates whether mel-cepstrum data is suitable for constructing a speech synthesis dictionary. Preferred specific examples of the evaluation process are described below.

In the following description, a frame means a time segment used for generating mel-cepstrum coefficients and is denoted by the symbol $fm$.

(Specific example 1 of the evaluation and editing process)
Specific example 1 of the evaluation and editing process is described with reference to the flowchart shown in FIG. 3.

In this specific example, the user is assumed to have set the threshold $th^d_{mcep}$ for the mel-cepstrum coefficient difference in the storage unit 102 beforehand, through the user I/F 104 of FIG. 1.

When the speech synthesis dictionary 202 is constructed by the speech synthesis dictionary construction device 100 shown in FIG. 1, the speech database 201 and, for example, an empty speech synthesis dictionary 202 are connected to the device 100.

When an instruction to start generating the speech synthesis dictionary 202 is given through the user I/F 104, the data extraction unit 210 sequentially reads pairs of a phoneme label sequence $MnLab_m$ and speech data $Sp_m$ from the speech database 201 and stores them in the storage unit 102. Here $1 \le m \le M_{SP}$, where $M_{SP}$ is the number of data items in the speech database.

The mel-cepstrum sequence data generation unit 220 generates $D$-th order mel-cepstrum sequence data $Mcep^d_m[fm]$ from the speech data $Sp_m$ and stores it in the storage unit 102 (step S1). Here $0 \le d \le D$ and $0 \le fm \le FLab_m$, where $FLab_m$ is the number of frames of the speech data $Sp_m$.

The phoneme segmentation unit 225 then uses the phoneme label sequence $MnLab_m$ to cut phonemes out of the mel-cepstrum sequence data $Mcep^d_m$, generating the mel-cepstrum partial sequence data groups $mcep^d_{L,n}$ (step S2). Here $0 \le L \le M_L$, where $M_L$ is the total number of phoneme labels, and $0 \le n \le N_L$, where $N_L$ is the total number of mel-cepstrum partial sequence data groups for the phoneme label.
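
As an illustration of step S2, the following minimal Python sketch groups the frames of a mel-cepstrum sequence by phoneme label. The source does not specify how label boundaries are represented, so the `(phoneme, start_frame, end_frame)` tuples with an exclusive end frame, and the function name `cut_phonemes`, are assumptions made only for this example.

```python
from collections import defaultdict


def cut_phonemes(mcep_seq, labels):
    """Group frames by phoneme label.

    mcep_seq: list of frames, each a list of D+1 mel-cepstrum coefficients.
    labels:   list of (phoneme, start_frame, end_frame) tuples, end exclusive
              (an assumed format; the source does not define one).
    Returns a dict mapping each phoneme label to its list of groups,
    where one group is the list of frames for one occurrence of the phoneme.
    """
    groups = defaultdict(list)
    for phoneme, start, end in labels:
        groups[phoneme].append(mcep_seq[start:end])
    return dict(groups)


# Toy data: five frames with two coefficients each.
mcep_seq = [[0.1, 0.2], [0.1, 0.3], [0.9, 0.8], [1.0, 0.7], [0.2, 0.2]]
labels = [("a", 0, 2), ("k", 2, 4), ("a", 4, 5)]
groups = cut_phonemes(mcep_seq, labels)
```

In this sketch, `groups["a"]` holds two groups, which play the role of the groups indexed by $n$ for one label $L$.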

For the mel-cepstrum partial sequence data groups $mcep^d_{L,n}[fm]$ (where $0 \le fm \le F_{L,n}$, and $F_{L,n}$ is the total number of frames of the group $mcep^d_{L,n}$), the mel-cepstrum data evaluation unit 230 calculates an average value $ave^d_{L,n}$ for each group $mcep^d_{L,n}$ according to the following expression (step S3).
$$ave^d_{L,n} = \frac{1}{F_{L,n}+1} \sum_{fm=0}^{F_{L,n}} mcep^d_{L,n}[fm]$$
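
The averaging of step S3 can be sketched in Python as follows. This is a minimal illustration, not the device's implementation: a group is assumed to be a list of frames, each frame a list of coefficients indexed by dimension $d$, and the name `group_mean` is hypothetical. With frames indexed from 0 to $F_{L,n}$, the divisor $F_{L,n}+1$ equals the length of the list.

```python
def group_mean(group):
    """Return the per-dimension average over all frames of one group.

    group: list of frames, each a list of mel-cepstrum coefficients.
    The result has one average value per dimension d.
    """
    n_frames = len(group)      # corresponds to F_{L,n} + 1
    n_dims = len(group[0])     # corresponds to D + 1
    return [sum(frame[d] for frame in group) / n_frames for d in range(n_dims)]


# Two frames, two dimensions.
ave = group_mean([[1.0, 2.0], [3.0, 4.0]])
```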

Then, through the following steps, for each phoneme label and each mel-cepstrum partial sequence data group $mcep^d_{L,n}$, the difference between the frame-level mel-cepstrum partial sequence data $mcep^d_{L,n}[fm]$ and the average value $ave^d_{L,n}$ is calculated, and each group $mcep^d_{L,n}$ is evaluated on the basis of this difference.

First, a pointer $L$ that designates the number identifying a phoneme label is initialized to 1 (step S4).

Next, a pointer $n$ that designates the number identifying each mel-cepstrum partial sequence data group is initialized to 1 (step S5).

Then a pointer $d$ that designates the dimension of each piece of mel-cepstrum partial sequence data is initialized to 0 (step S6).
These pointers $L$, $n$, and $d$ select the mel-cepstrum partial sequence data group $mcep^d_{L,n}[fm]$ (where $0 \le fm \le F_{L,n}$) in the storage unit 102.

Then a pointer $fm$ indicating the frame number within the $d$-th order mel-cepstrum partial sequence data of the $(L, n)$-th group is initialized to 0 (step S7).

This pointer $fm$ selects one frame of mel-cepstrum partial sequence data, $mcep^d_{L,n}[fm]$. Whether the difference between the frame $fm$ being processed and the average value is at or below the predetermined threshold $th^d_{mcep}$ is determined according to expression (1) (step S8).
$$\left| mcep^d_{L,n}[fm] - ave^d_{L,n} \right| \le th^d_{mcep} \qquad (1)$$

If step S8 determines that the difference is not at or below the predetermined threshold (step S8; No), the mel-cepstrum partial sequence data group $mcep^d_{L,n}$ is deleted (step S9).

If step S8 determines that the difference is at or below the predetermined threshold (step S8; Yes), it is determined whether processing has been completed for all $fm$ (step S10).

If it has not been completed (step S10; No), $fm$ is incremented by 1 in step S14 and processing returns to step S8.

If step S10 determines that processing has been completed for all $fm$ (step S10; Yes), it is determined whether processing has been completed for all $d$ (step S11).

If it has not been completed (step S11; No), $d$ is incremented by 1 in step S15 and processing returns to step S7.

If step S11 determines that processing has been completed for all $d$ (step S11; Yes), it is determined whether processing has been completed for all $n$ (step S12).

If it has not been completed (step S12; No), $n$ is incremented by 1 in step S16 and processing returns to step S6.

If step S12 determines that processing has been completed for all $n$ (step S12; Yes), it is determined whether processing has been completed for all $L$ (step S13).

If it has not been completed (step S13; No), $L$ is incremented by 1 in step S17 and processing returns to step S5.

If step S13 determines that processing has been completed for all $L$ (step S13; Yes), the process ends.
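
The nested loops of steps S4 to S17 can be condensed into the following Python sketch. It is an illustration under stated assumptions rather than the flowchart itself: the data layout (a dict from phoneme label to a list of groups, each group a list of frames) and the names `group_is_suitable` and `filter_groups` are hypothetical, and moving on to the next group as soon as one frame fails is an assumption, since the text does not say where the flow continues after step S9.

```python
def group_is_suitable(group, thresholds):
    """Check expression (1) for every dimension d and every frame fm.

    group:      list of frames, each a list of coefficients.
    thresholds: one threshold per dimension (th^d_mcep).
    Returns False as soon as one frame deviates from the group average
    by more than the threshold in any dimension.
    """
    n_frames = len(group)
    for d, th in enumerate(thresholds):
        ave = sum(frame[d] for frame in group) / n_frames
        for frame in group:
            if abs(frame[d] - ave) > th:
                return False
    return True


def filter_groups(groups_by_label, thresholds):
    """Delete every unsuitable group, label by label (step S9)."""
    return {
        label: [g for g in groups if group_is_suitable(g, thresholds)]
        for label, groups in groups_by_label.items()
    }


# One label, two groups; the second group contains an outlier frame.
groups_by_label = {"a": [[[1.0], [1.1], [0.9]], [[1.0], [5.0], [1.0]]]}
kept = filter_groups(groups_by_label, thresholds=[0.5])
```

Here the second group is removed because its middle frame lies far from that group's own average.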

Through the above processing, all $mcep^d_{L,n}[fm]$ that were taken from the speech database 201 of FIG. 2 and can be used for learning are stored in the storage unit 102 of FIG. 1. These $mcep^d_{L,n}[fm]$ are used by the phoneme HMM learning unit 250 of FIG. 2.

In this specific example, the mel-cepstrum partial sequence data used by the phoneme HMM learning unit 250 shown in FIG. 2 are data from which markedly different data have been removed in advance by the mel-cepstrum data editing unit 240. This makes it possible to construct a speech synthesis dictionary that can synthesize more natural speech.

(Specific example 2 of the evaluation and editing process)
Specific example 2 of the evaluation and editing process is described with reference to the flowchart shown in FIG. 4.

In specific example 1, evaluation and editing were performed by repeating step S8 of FIG. 3 once per frame, which takes processing time when the number of frames is large.

In this specific example, therefore, processing time is reduced by making the determination according to expression (2), described below, instead of frame by frame.

The flow of operation in this specific example is as shown in FIG. 4 and is basically the same as that of specific example 1 described with reference to FIG. 3.

Accordingly, steps in FIG. 4 that perform the same processing as in FIG. 3 are given the same reference numerals.

The main differences from specific example 1 are that step S24 is added, and that step S28 replaces step S8, removing the loop of steps S10 and S14.

In step S24, the average value $AVE^d_L$ of the mel-cepstrum partial sequence data belonging to the same monophone label is calculated according to the following expression.
$$AVE^d_L = \frac{1}{N_L} \sum_{n=0}^{N_L} ave^d_{L,n}$$

Then, in step S28, the mel-cepstrum partial sequence data for each monophone label are evaluated by determining, according to expression (2), whether the difference between the average value $ave^d_{L,n}$ of each mel-cepstrum partial sequence data group and $AVE^d_L$ stays within a predetermined threshold.
$$\left| ave^d_{L,n} - AVE^d_L \right| \le TH^d_{mcep} \qquad (2)$$
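
The group-level check of steps S24 and S28 can be sketched in Python as follows. This is a minimal illustration with an assumed data layout (all groups of one phoneme label as a list, each group a list of frames) and a hypothetical function name `filter_by_label_mean`. The label-level average is taken here as the plain mean of the group averages, which is an assumption about how the sum and its normalizing factor are meant to line up.

```python
def filter_by_label_mean(groups, thresholds):
    """Keep the groups whose average stays close to the label-wide average.

    groups:     all groups of one phoneme label; each group is a list of
                frames, each frame a list of coefficients.
    thresholds: one threshold per dimension (TH^d_mcep).
    """
    if not groups:
        return []
    n_dims = len(thresholds)
    # First averages: one per group and dimension (ave^d_{L,n}).
    means = [
        [sum(frame[d] for frame in g) / len(g) for d in range(n_dims)]
        for g in groups
    ]
    # Second average: mean of the group averages per dimension (AVE^d_L).
    label_mean = [sum(m[d] for m in means) / len(means) for d in range(n_dims)]
    # Expression (2) must hold in every dimension for a group to be kept.
    return [
        g
        for g, m in zip(groups, means)
        if all(abs(m[d] - label_mean[d]) <= thresholds[d] for d in range(n_dims))
    ]


groups = [[[1.0], [1.0]], [[1.2], [1.2]], [[4.0], [4.0]]]
kept = filter_by_label_mean(groups, thresholds=[1.5])
```

Each group is compared once per dimension rather than once per frame, which is where the saving over specific example 1 comes from.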

In this specific example, the user is assumed to have set the threshold $TH^d_{mcep}$ for the mel-cepstrum coefficient difference in the storage unit 102 beforehand, through the user I/F 104 of FIG. 1.

Thus, this specific example provides a speech synthesis dictionary construction device that can efficiently construct a speech synthesis dictionary from which natural synthesized speech can be output without abrupt fluctuations in sound quality.

Two specific examples of the evaluation and editing process have been given above, but the process is not limited to them. Any process that removes markedly different mel-cepstrum data may be used.

For example, as for editing, instead of deleting mel-cepstrum partial sequence data that do not satisfy the condition of expression (1) or (2), the data $mcep^d_{L,n}$ that fail the condition may be replaced with the preceding data $mcep^d_{L,n-1}$ or the following data $mcep^d_{L,n+1}$.
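
The replacement variant can be sketched in Python as follows. This is a hedged illustration: the function name `replace_unsuitable`, the per-group list of evaluation results `suitable`, the preference for the preceding group over the following one, and the fallback to deletion when neither neighbour is suitable are all assumptions that the source does not spell out.

```python
def replace_unsuitable(groups, suitable):
    """Replace each unsuitable group with a suitable neighbour.

    groups:   groups of one phoneme label, in order of n.
    suitable: list of booleans, True where the group passed evaluation.
    """
    edited = []
    for n, group in enumerate(groups):
        if suitable[n]:
            edited.append(group)
        elif n > 0 and suitable[n - 1]:
            edited.append(groups[n - 1])      # preceding group (n - 1)
        elif n + 1 < len(groups) and suitable[n + 1]:
            edited.append(groups[n + 1])      # following group (n + 1)
        # Otherwise the group is dropped (assumed fallback to deletion).
    return edited


groups = [[[1.0]], [[9.0]], [[8.0]], [[1.2]]]
edited = replace_unsuitable(groups, suitable=[True, False, False, True])
```

Compared with deletion, this keeps the number of groups per label unchanged whenever a suitable neighbour exists.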

For ease of understanding, the description above uses an example in which the data extraction unit 210 first reads all the data from the speech database 201 into the storage unit 102, but such batch processing is not an essential requirement of this embodiment. For example, depending on the specifications of the phoneme HMM learning unit 250 shown in FIG. 2, the speech synthesis dictionary could also be constructed more dynamically.

（実施形態２）
実施形態１においては、図１に示す音声合成辞書構築装置１００により音素ラベルと音素メルケプストラム情報とを対応付けた。この発明はこれに限定されず、音素ラベルと音素メルケプストラム情報及び音素ピッチ情報とを対応付ける場合にも適用可能である。 (Embodiment 2)
In the first embodiment, the phoneme label is associated with the phoneme mel cepstrum information by the speech synthesis dictionary construction device 100 shown in FIG. The present invention is not limited to this, and can also be applied to a case where a phoneme label is associated with phoneme mel cepstrum information and phoneme pitch information.

以下、音素ラベルと音素メルケプストラム情報及び音素ピッチ情報とを対応付けて音声合成辞書に書き出す音声合成辞書構築装置５００について説明する。 The following describes the speech synthesis dictionary construction apparatus 500 that writes phoneme labels, phoneme mel cepstrum information, and phoneme pitch information in a speech synthesis dictionary in association with each other.

本実施形態に係る音声合成辞書構築装置５００は、図５に示すように、データ取り出し部２１０と、メルケプストラムデータ評価部２３０と、を備える。これらの各部は、実施形態１に係る図１に示す音声合成辞書構築装置１００の対応する各部と同一の構成と機能を有する。 As shown in FIG. 5, the speech synthesis dictionary construction apparatus 500 according to the present embodiment includes a data extraction unit 210 and a mel cepstrum data evaluation unit 230. These units have the same configurations and functions as the corresponding units of the speech synthesis dictionary construction apparatus 100 shown in FIG. 1 according to the first embodiment.

The speech synthesis dictionary construction device 500 further includes a sequence data generation unit 520, a phoneme segmentation unit 525, a data editing unit 540, a phoneme HMM learning unit 550, and a data writing unit 560.

In addition to the same functions as the mel cepstrum sequence data generation unit 220, the sequence data generation unit 520 performs pitch extraction on the audio data extracted by the data extraction unit 210 and generates pitch sequence data Pit_m.

In addition to the same functions as the phoneme segmentation unit 225, the phoneme segmentation unit 525 performs phoneme segmentation on the pitch sequence data Pit_m and generates pitch partial series data pit_{L,n} for each identical phoneme label.

In addition to the same functions as the mel cepstrum data editing unit 240, the data editing unit 540 deletes the pitch partial series data pit_{L,n} corresponding to any mel cepstrum partial series data mcep^d_{L,n} that was deleted because the mel cepstrum data evaluation unit 230 evaluated it as unusable as learning data.
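The paired deletion performed by the data editing unit 540 can be sketched as follows. This is an illustrative sketch only: the function name `edit_paired`, the index-aligned list representation, and the predicate `is_suitable` (standing in for the judgment of the mel cepstrum data evaluation unit 230) are assumptions not given in the source.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

M = TypeVar("M")  # one mel cepstrum partial series mcep^d_{L,n}
P = TypeVar("P")  # the corresponding pitch partial series pit_{L,n}


def edit_paired(
    mcep_segments: Sequence[M],
    pitch_segments: Sequence[P],
    is_suitable: Callable[[M], bool],
) -> Tuple[List[M], List[P]]:
    """Delete each unsuitable mel cepstrum series together with its pitch series.

    Assumption: both sequences are index-aligned, so element n of each
    refers to the same phoneme occurrence.
    """
    if len(mcep_segments) != len(pitch_segments):
        raise ValueError("mel cepstrum and pitch segments must be aligned")
    kept_mcep: List[M] = []
    kept_pitch: List[P] = []
    for mcep, pit in zip(mcep_segments, pitch_segments):
        # Only the mel cepstrum data is evaluated; the pitch data follows it.
        if is_suitable(mcep):
            kept_mcep.append(mcep)
            kept_pitch.append(pit)
    return kept_mcep, kept_pitch
```

Keeping the two lists aligned after editing means the learning stage still sees matching mel cepstrum and pitch data for every remaining phoneme occurrence.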

The phoneme HMM learning unit 550 learns, based on the HMM, the correspondence between the phoneme label string and the corresponding edited mel cepstrum sequence data, converts it into information indicating the correspondence between phoneme labels and phoneme mel cepstrum information, and passes that information to the data writing unit 560. The phoneme HMM learning unit 550 likewise learns, based on the HMM, the correspondence between the phoneme label string and the pitch sequence data, converts it into information indicating the correspondence between phoneme labels and phoneme pitch information, and passes that information to the data writing unit 560.

The data writing unit 560 writes the correspondence between phoneme labels and phoneme mel cepstrum information, and the correspondence between phoneme labels and phoneme pitch information, to the speech synthesis dictionary 202.

By using the speech synthesis dictionary 202 constructed in this way, high-quality speech can be synthesized using the phoneme mel cepstrum information and phoneme pitch information for each phoneme label.

(Embodiment 3)
The third embodiment modifies the first embodiment so that, when mel cepstrum partial series data mcep^d_{L,n} is deleted, the mel cepstrum sequence data Mcep^d_m from which that partial series data was cut out is also deleted.

This is realized by modifying specific example 1 of the evaluation and editing process shown in FIG. 3, described above, as shown in FIG. 6. The added portions (steps S61 to S68) are described below.

After the audio data is extracted (step S1), a counter m for counting the audio data is provided and set to the initial value "1" (step S61), and a flag DEL indicating whether mel cepstrum partial series data has been deleted is provided and set to the initial value "0" (step S63).

Then, after the mel cepstrum partial series data mcep^d_{L,n} is deleted (step S9), the deletion flag DEL is set to "1" (step S64).

After the evaluation process of step S8 and the editing process of step S9 have been completed for all the mel cepstrum partial series data indicated by the counter L (step S13; Yes), the deletion flag DEL is checked (step S65). If it is set to "1", the mel cepstrum sequence data Mcep^d_m is deleted (step S66).

It is then determined whether processing has been completed for all the audio data (step S67). If not, m is incremented (step S68) and processing proceeds to the next audio data (step S63 onward).
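The added steps S61 to S68 can be sketched as a loop with a per-utterance deletion flag. This is a simplified illustration, not the flowchart of FIG. 6 itself: the function name `edit_utterances`, the nested-list data layout, and the predicate `is_suitable` (standing in for the evaluation of step S8) are assumptions, and the inner loops over the counters L and n are collapsed into a single loop over the partial series of each utterance.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

S = TypeVar("S")  # one mel cepstrum partial series mcep^d_{L,n}


def edit_utterances(
    utterances: Sequence[Sequence[S]],
    is_suitable: Callable[[S], bool],
) -> List[Tuple[int, List[S]]]:
    """Return (m, partial series) pairs for utterances that survive editing.

    Each element of `utterances` holds the partial series cut out of one
    mel cepstrum sequence Mcep^d_m (hypothetical layout).
    """
    kept: List[Tuple[int, List[S]]] = []
    # Counter m starts at 1 (step S61) and advances per utterance (step S68).
    for m, segments in enumerate(utterances, start=1):
        del_flag = 0  # flag DEL reset for each utterance (step S63)
        remaining: List[S] = []
        for seg in segments:
            if is_suitable(seg):  # evaluation (step S8)
                remaining.append(seg)
            else:
                # The partial series is deleted (step S9) and DEL is set (step S64).
                del_flag = 1
        if del_flag == 0:  # check of DEL (step S65)
            kept.append((m, remaining))
        # Otherwise the whole sequence Mcep^d_m is discarded (step S66).
    return kept
```

With this layout, a single unsuitable partial series is enough to remove the entire utterance from the learning data, which is the behavior the third embodiment adds.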

Similarly, in specific example 2 of the evaluation and editing process shown in FIG. 4, described above, the mel cepstrum sequence data Mcep^d_m from which the mel cepstrum partial series data mcep^d_{L,n} was cut out may be deleted together with that partial series data. To do so, the process is modified as shown in FIG. 7. The added portions in FIG. 7 are the same as those in FIG. 6.
As a modification of the third embodiment, the mel cepstrum partial series data and the pitch partial series data may be deleted together with the mel cepstrum sequence data and pitch sequence data from which they were cut out. This is because the speech synthesis dictionary construction device might otherwise again generate, from such mel cepstrum sequence data and pitch sequence data, mel cepstrum partial series data and pitch partial series data that are unsuitable for constructing a speech synthesis dictionary. This modification is realized by having the device of FIG. 5 execute the process of FIG. 6 or FIG. 7.

(Embodiment 4)
The fourth embodiment modifies the third embodiment so that, in addition to deleting the mel cepstrum partial series data and the mel cepstrum sequence data, the audio data from which the mel cepstrum sequence data was generated is also deleted. This is because such audio data might again yield mel cepstrum sequence data that is unsuitable for constructing a speech synthesis dictionary. In addition, the number of deleted audio data items is counted, and when the count exceeds a predetermined threshold, all the audio data is re-collected.

This is realized by modifying specific example 1 of the evaluation and editing process shown in FIG. 6, described above, as shown in FIG. 8. The added portions (steps S62 and S80 to S83) are described below.

After the audio data is extracted (step S1), a counter Cnt for counting the number of deleted audio data items is provided and set to the initial value "0" (step S62). Then, when the mel cepstrum partial series data mcep^d_{L,n} and the mel cepstrum sequence data Mcep^d_m are deleted, the audio data Sp_m from which they were generated is also deleted (step S80), and the counter Cnt is incremented (step S81).

After processing has been completed for all the audio data (step S67; Yes), it is determined whether the value of the counter Cnt exceeds a predetermined threshold Thdel (step S82). If it does (step S82; Yes), the CPU 105 in FIG. 1, for example, notifies the user to that effect via the user I/F 104, instructs the user to re-record, and has the audio data re-collected (step S83).
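The deletion count and threshold check of steps S62 and S80 to S83 can be sketched as follows. This is a hedged illustration rather than the flowchart of FIG. 8: the function name `edit_with_recollection`, the dictionary keyed by utterance index m, and the predicate `is_suitable` are assumptions, and the user notification of step S83 is reduced to a returned boolean.

```python
from typing import Callable, Dict, List, Mapping, Sequence, Tuple, TypeVar

S = TypeVar("S")  # one mel cepstrum partial series mcep^d_{L,n}


def edit_with_recollection(
    utterances: Mapping[int, Sequence[S]],
    is_suitable: Callable[[S], bool],
    thdel: int,
) -> Tuple[Dict[int, List[S]], int, bool]:
    """Drop unsuitable utterances, count the deletions, and test the threshold.

    Returns the surviving utterances, the deletion count Cnt, and whether
    re-recording should be requested (Cnt exceeds Thdel).
    """
    kept: Dict[int, List[S]] = {}
    cnt = 0  # counter Cnt initialised to 0 (step S62)
    for m, segments in utterances.items():
        if all(is_suitable(seg) for seg in segments):
            kept[m] = list(segments)
        else:
            # The audio data Sp_m is deleted along with its derived data
            # (step S80), and the deletion is counted (step S81).
            cnt += 1
    # "Exceeds" is read as a strict comparison (step S82).
    needs_rerecording = cnt > thdel
    return kept, cnt, needs_rerecording
```

A caller would then prompt the user to re-record when the third return value is true, which corresponds to step S83 in the description.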

This makes it possible to construct a higher-quality speech synthesis dictionary.
Similarly, specific example 2 of the evaluation and editing process shown in FIG. 7, described above, may be modified as shown in FIG. 9 so that, in addition to the mel cepstrum partial series data and the mel cepstrum sequence data, the audio data from which the mel cepstrum sequence data was generated is also deleted. The reason is the same as that given at the end of the description of the third embodiment: unsuitable partial series data might otherwise be cut out from such sequence data. The added portions in FIG. 9 are the same as those in FIG. 8.

As a modification of the fourth embodiment, the mel cepstrum partial series data and the pitch partial series data may be deleted together with the mel cepstrum sequence data and pitch sequence data from which they were cut out, and the audio data from which these data were generated may also be deleted. The reason is the same as that given at the beginning of the description of the fourth embodiment: unsuitable sequence data, and in turn unsuitable partial series data, might otherwise be derived from such audio data. This modification is realized by having the device of FIG. 5 execute the process of FIG. 8 or FIG. 9.
The present invention is not limited to the above embodiments, and various modifications and applications are possible.

For example, the hardware configurations, block configurations, and flowcharts described above are examples and are not limiting.

Furthermore, the present invention is not limited to the speech synthesis dictionary construction device and can be implemented using any computer. For example, a program for causing a general-purpose computer to execute the above-described processing may be distributed on a recording medium or via communication, and installed and executed on that computer so that the computer functions as the speech synthesis dictionary construction device of the present invention.

FIG. 1 is a diagram showing the physical configuration of the speech synthesis dictionary construction device according to Embodiment 1.
FIG. 2 is a functional configuration diagram of the speech synthesis dictionary construction device according to Embodiment 1, which includes a mel cepstrum data evaluation and editing unit.
FIG. 3 is a diagram showing the flow of operations in specific example 1 of the mel cepstrum data evaluation and editing process.
FIG. 4 is a diagram showing the flow of operations in specific example 2 of the mel cepstrum data evaluation and editing process.
FIG. 5 is a functional configuration diagram of the speech synthesis dictionary construction device with pitch extraction according to Embodiment 2.
FIG. 6 is a diagram showing the flow of operations in specific example 1 of the mel cepstrum data evaluation and editing process according to Embodiment 3.
FIG. 7 is a diagram showing the flow of operations in specific example 2 of the mel cepstrum data evaluation and editing process according to Embodiment 3.
FIG. 8 is a diagram showing the flow of operations in specific example 1 of the mel cepstrum data evaluation and editing process according to Embodiment 4.
FIG. 9 is a diagram showing the flow of operations in specific example 2 of the mel cepstrum data evaluation and editing process according to Embodiment 4.

Explanation of symbols

100 ... speech synthesis dictionary construction device
101 ... ROM
102 ... storage unit
103 ... data input/output I/F
104 ... user I/F
105 ... CPU
106 ... bus
111 ... hard disk containing original data
112 ... hard disk for recording processed data
121 ... RAM
122 ... hard disk
141 ... keyboard
142 ... display
201 ... speech database
202 ... speech synthesis dictionary
210 ... data extraction unit
220 ... mel cepstrum sequence data generation unit
225 ... phoneme segmentation unit
230 ... mel cepstrum data evaluation unit
240 ... mel cepstrum data editing unit
250 ... phoneme HMM learning unit
260 ... data writing unit
500 ... speech synthesis dictionary construction device
520 ... sequence data generation unit
525 ... phoneme segmentation unit
540 ... data editing unit
550 ... phoneme HMM learning unit
560 ... data writing unit

Claims

1. A speech synthesis dictionary construction device for constructing a speech synthesis dictionary, comprising:
a receiving unit that receives a phoneme label string and corresponding audio data;
a mel cepstrum sequence data generation unit that generates mel cepstrum sequence data for the entire audio data by generating mel cepstrum coefficients in frame units from the received audio data;
a phoneme segmentation unit that cuts out, from the generated mel cepstrum sequence data, a mel cepstrum partial series data group corresponding to each phoneme label;
a mel cepstrum data evaluation unit that calculates an average value for each extracted mel cepstrum partial series data group, determines whether the group includes frame-unit mel cepstrum partial series data whose difference from the average value exceeds a threshold value, and, if so, evaluates the extracted mel cepstrum partial series data group as unsuitable for constructing a speech synthesis dictionary;
a mel cepstrum data editing unit that edits the mel cepstrum partial series data by deleting any mel cepstrum partial series data group evaluated as unsuitable for constructing a speech synthesis dictionary;
a phoneme HMM learning unit that associates each phoneme label with a phoneme HMM relating to the mel cepstrum by learning, based on a hidden Markov model (HMM), from the phoneme label string and the edited mel cepstrum partial series data; and
a data writing unit that records the learning result in the speech synthesis dictionary.

2. A speech synthesis dictionary construction device for constructing a speech synthesis dictionary, comprising:
a receiving unit that receives a phoneme label string and corresponding audio data;
a mel cepstrum sequence data generation unit that generates mel cepstrum sequence data for the entire audio data by generating mel cepstrum coefficients in frame units from the received audio data;
a phoneme segmentation unit that cuts out, from the generated mel cepstrum sequence data, a mel cepstrum partial series data group corresponding to each phoneme label;
a mel cepstrum data evaluation unit that calculates a first average value for each extracted mel cepstrum partial series data group, calculates a second average value over all mel cepstrum partial series data groups belonging to the same phoneme label, and evaluates a mel cepstrum partial series data group whose first average value differs from the second average value by more than a threshold value as unsuitable for constructing a speech synthesis dictionary;
a mel cepstrum data editing unit that edits the mel cepstrum partial series data by deleting any mel cepstrum partial series data group evaluated as unsuitable for constructing a speech synthesis dictionary;
a phoneme HMM learning unit that associates each phoneme label with a phoneme HMM relating to the mel cepstrum by learning, based on a hidden Markov model (HMM), from the phoneme label string and the edited mel cepstrum partial series data; and
a data writing unit that records the learning result in the speech synthesis dictionary.

3. The speech synthesis dictionary construction device according to claim 1, wherein the mel cepstrum data editing unit, instead of deleting the mel cepstrum partial series data group evaluated as unsuitable for constructing a speech synthesis dictionary, replaces that group with another mel cepstrum partial series data group belonging to the same phoneme label.

4. A speech synthesis dictionary construction method for constructing a speech synthesis dictionary, comprising:
a receiving step of receiving a phoneme label string and corresponding audio data from a database;
a mel cepstrum sequence data generation step of generating mel cepstrum sequence data for the entire audio data by generating mel cepstrum coefficients in frame units from the received audio data;
a phoneme segmentation step of cutting out, from the generated mel cepstrum sequence data, a mel cepstrum partial series data group corresponding to each phoneme label;
a mel cepstrum data evaluation step of calculating an average value for each extracted mel cepstrum partial series data group, determining whether the group includes frame-unit mel cepstrum partial series data whose difference from the average value exceeds a threshold value, and, if so, evaluating the extracted mel cepstrum partial series data group as unsuitable for constructing a speech synthesis dictionary;
a mel cepstrum data editing step of editing the mel cepstrum partial series data by deleting any mel cepstrum partial series data group evaluated as unsuitable for constructing a speech synthesis dictionary;
a phoneme HMM learning step of associating each phoneme label with a phoneme HMM relating to the mel cepstrum by learning, based on a hidden Markov model (HMM), from the phoneme label string and the edited mel cepstrum partial series data; and
an output step of outputting the learning result.

5. A speech synthesis dictionary construction method for constructing a speech synthesis dictionary, comprising:
a receiving step of receiving a phoneme label string and corresponding audio data from a database;
a mel cepstrum sequence data generation step of generating mel cepstrum sequence data for the entire audio data by generating mel cepstrum coefficients in frame units from the received audio data;
a phoneme segmentation step of cutting out, from the generated mel cepstrum sequence data, a mel cepstrum partial series data group corresponding to each phoneme label;
a mel cepstrum data evaluation step of calculating a first average value for each extracted mel cepstrum partial series data group, calculating a second average value over all mel cepstrum partial series data groups belonging to the same phoneme label, and evaluating a mel cepstrum partial series data group whose first average value differs from the second average value by more than a threshold value as unsuitable for constructing a speech synthesis dictionary;
a mel cepstrum data editing step of editing the mel cepstrum partial series data by deleting any mel cepstrum partial series data group evaluated as unsuitable for constructing a speech synthesis dictionary;
a phoneme HMM learning step of associating each phoneme label with a phoneme HMM relating to the mel cepstrum by learning, based on a hidden Markov model (HMM), from the phoneme label string and the edited mel cepstrum partial series data; and
an output step of outputting the learning result.

6. The speech synthesis dictionary construction method according to claim 4 or 5, wherein, in the mel cepstrum data editing step, instead of deleting the mel cepstrum partial series data group evaluated as unsuitable for constructing a speech synthesis dictionary, that group is replaced with another mel cepstrum partial series data group belonging to the same phoneme label.

7. A program that causes a computer executing a method of constructing a speech synthesis dictionary to execute:
a receiving step of receiving a phoneme label string and corresponding audio data from a database;
a mel cepstrum sequence data generation step of generating mel cepstrum sequence data for the entire audio data by generating mel cepstrum coefficients in frame units from the received audio data;
a phoneme segmentation step of cutting out, from the generated mel cepstrum sequence data, a mel cepstrum partial series data group corresponding to each phoneme label;
a mel cepstrum data evaluation step of calculating an average value for each extracted mel cepstrum partial series data group, determining whether the group includes frame-unit mel cepstrum partial series data whose difference from the average value exceeds a threshold value, and, if so, evaluating the extracted mel cepstrum partial series data group as unsuitable for constructing a speech synthesis dictionary;
a mel cepstrum data editing step of editing the mel cepstrum partial series data by deleting any mel cepstrum partial series data group evaluated as unsuitable for constructing a speech synthesis dictionary;
a phoneme HMM learning step of associating each phoneme label with a phoneme HMM relating to the mel cepstrum by learning, based on a hidden Markov model (HMM), from the phoneme label string and the edited mel cepstrum partial series data; and
an output step of outputting the learning result.

8. A program that causes a computer executing a method of constructing a speech synthesis dictionary to execute:
a receiving step of receiving a phoneme label string and corresponding audio data from a database;
a mel cepstrum sequence data generation step of generating mel cepstrum sequence data for the entire audio data by generating mel cepstrum coefficients in frame units from the received audio data;
a phoneme segmentation step of cutting out, from the generated mel cepstrum sequence data, a mel cepstrum partial series data group corresponding to each phoneme label;
a mel cepstrum data evaluation step of calculating a first average value for each extracted mel cepstrum partial series data group, calculating a second average value over all mel cepstrum partial series data groups belonging to the same phoneme label, and evaluating a mel cepstrum partial series data group whose first average value differs from the second average value by more than a threshold value as unsuitable for constructing a speech synthesis dictionary;
a mel cepstrum data editing step of editing the mel cepstrum partial series data by deleting any mel cepstrum partial series data group evaluated as unsuitable for constructing a speech synthesis dictionary;
a phoneme HMM learning step of associating each phoneme label with a phoneme HMM relating to the mel cepstrum by learning, based on a hidden Markov model (HMM), from the phoneme label string and the edited mel cepstrum partial series data; and
an output step of outputting the learning result.

9. The program according to claim 7, wherein, in the mel cepstrum data editing step, instead of deleting the mel cepstrum partial series data group evaluated as unsuitable for constructing a speech synthesis dictionary, that group is replaced with another mel cepstrum partial series data group belonging to the same phoneme label.