JP4826493B2

JP4826493B2 - Speech synthesis dictionary construction device, speech synthesis dictionary construction method, and program

Info

Publication number: JP4826493B2
Application number: JP2007025212A
Authority: JP
Inventors: 勝彦佐藤
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2007-02-05
Filing date: 2007-02-05
Publication date: 2011-11-30
Anticipated expiration: 2027-02-05
Also published as: JP2008191368A

Abstract

PROBLEM TO BE SOLVED: To construct a speech synthesis dictionary capable of generating synthesized speech that gives a natural impression because its pitch variation is moderately large.

SOLUTION: A speech synthesis dictionary construction device first builds a temporary speech synthesis dictionary 223 by phoneme HMM (Hidden Markov Model) learning, a known process, using a speech database 221. Next, based on an analysis and comparison of the synthesized speech generated via the temporary speech synthesis dictionary 223 with the speech stored in the speech database 221, the device determines a method of editing the pitch sequence data so as to emphasize the pitch variation of the speech. Finally, the device performs phoneme HMM learning again, this time applying the determined editing process beforehand, and thereby constructs the speech synthesis dictionary 227.

COPYRIGHT: (C)2008,JPO&INPIT

Description

本発明は、音声合成等に用いる音声合成辞書を構築する、音声合成辞書構築装置、音声合成辞書構築方法、及び、プログラムに関する。 The present invention relates to a speech synthesis dictionary construction device, a speech synthesis dictionary construction method, and a program for constructing a speech synthesis dictionary used for speech synthesis and the like.

音声認識及び音声合成技術として隠れマルコフモデル（Hidden Markov Model。以下、ＨＭＭと呼ぶ。）に基づいた音声認識技術及び音声合成技術が、広く利用されている。 Speech recognition technology and speech synthesis technology based on a Hidden Markov Model (hereinafter referred to as HMM) are widely used as speech recognition and speech synthesis technology.

ＨＭＭに基づいた音声認識技術及び音声合成技術は、例えば、特許文献１に開示されている。 A speech recognition technique and a speech synthesis technique based on the HMM are disclosed in, for example, Patent Document 1.

JP 2002-268660 A

ＨＭＭに基づいた音声合成においては、音素ラベルとスペクトルパラメータデータ列等の対応関係を記録した音声合成辞書が必要になる。 In speech synthesis based on the HMM, a speech synthesis dictionary in which a correspondence relationship between phoneme labels and spectrum parameter data strings is recorded is required.

The speech synthesis dictionary is constructed by a speech synthesis dictionary construction device. Such a device typically performs mel-cepstrum analysis and pitch extraction on the data recorded in a database (hereinafter referred to as a speech database) made up of sets of speech data, phoneme monophone label data, and phoneme triphone label data, and then constructs the speech synthesis dictionary through a learning process based on the HMM.

When constructing a speech synthesis dictionary, a conventional speech synthesis dictionary construction device used the pitch sequence data generated by pitch extraction as it is for learning based on the HMM, without applying any particular processing to it.

しかしながら、そのように構築された音声合成辞書を用いて生成された合成音声のピッチ変動は、元の音声のピッチ変動に比べて小さい。 However, the pitch variation of the synthesized speech generated using the speech synthesis dictionary constructed as described above is smaller than the pitch variation of the original speech.

For this reason, synthesized speech produced with a speech synthesis dictionary constructed by a conventional speech synthesis dictionary construction device sounds unnatural, giving a flat impression compared with natural human speech.

The present invention has been made in view of the above circumstances, and its object is to provide a speech synthesis dictionary construction device, a speech synthesis dictionary construction method, and a program that make it possible to construct a speech synthesis dictionary capable of synthesizing speech that gives a natural impression.

In order to achieve the above object, a speech synthesis dictionary construction device according to a first aspect of the present invention comprises:
a temporary construction unit that acquires, from a speech database, a phoneme label string and recorded voice data corresponding to the phoneme label string, extracts recorded voice pitch series data from the acquired recorded voice data, and constructs a temporary speech synthesis dictionary by HMM (Hidden Markov Model) learning based on the extracted recorded voice pitch series data and the acquired phoneme label string;
a synthesized data generation unit that generates synthesized voice data by relying on the temporary speech synthesis dictionary and extracts synthesized voice pitch series data from the generated synthesized voice data;
an editing unit that edits the recorded voice pitch series data to generate edited pitch series data, based on a result of comparing the recorded voice pitch series data extracted by the temporary construction unit from the recorded voice data corresponding to the phoneme label string with the synthesized voice pitch series data extracted by the synthesized data generation unit from the synthesized voice data corresponding to that phoneme label string; and
a reconstruction unit that constructs a speech synthesis dictionary by HMM learning based on the phoneme label string and the edited pitch series data generated by the editing unit.

元の自然な音声から抽出されたピッチと、いったん仮音声合成辞書を経て合成された平坦で不自然な音声である合成音声から抽出されたピッチと、が比較される。かかる比較によれば、合成音声がかかる不自然な音声にならないようにするためには、そもそも元の音声データにいかなる処理をあらかじめ施しておくべきであったのか、が、自ずと明らかになる。より具体的には、元の音声データのピッチ変動をどの程度大きくするのが適切であるかについての方針を効率的かつ容易に決定することができる。かかる調整を施した音声データを元に構築し直した音声合成辞書は、自然な印象を与える合成音声の生成に資する。 The pitch extracted from the original natural speech is compared with the pitch extracted from the synthesized speech, which is a flat and unnatural speech once synthesized through the temporary speech synthesis dictionary. According to such a comparison, it is naturally clarified what processing should have been performed on the original voice data in order to prevent the synthesized voice from becoming such unnatural voice. More specifically, it is possible to efficiently and easily determine a policy on how much it is appropriate to increase the pitch variation of the original audio data. The speech synthesis dictionary reconstructed based on the speech data subjected to such adjustment contributes to the generation of synthesized speech that gives a natural impression.
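The four units described above can be sketched as a single two-pass procedure. This is only an illustrative outline: the function name `build_dictionary` and the callables `extract_pitch`, `train`, `synthesize`, and `edit` are hypothetical placeholders for pitch extraction, HMM learning, synthesis, and the editing policy, none of which are specified at this level of the text.

```python
def build_dictionary(labels, recorded, extract_pitch, train, synthesize, edit):
    """Two-pass dictionary construction (hypothetical sketch).

    labels     : one phoneme label string per utterance
    recorded   : recorded voice data, one item per utterance
    The four callables are assumptions supplied by the caller.
    """
    # 1. Temporary construction: extract pitch and train a temporary dictionary.
    rec_pitch = [extract_pitch(x) for x in recorded]
    temp_dict = train(labels, rec_pitch)

    # 2. Synthesized data generation: synthesize each utterance from the
    #    temporary dictionary and extract its pitch.
    syn_pitch = [extract_pitch(synthesize(temp_dict, lab)) for lab in labels]

    # 3. Editing: compare recorded and synthesized pitch, edit the recorded pitch.
    edited_pitch = edit(rec_pitch, syn_pitch)

    # 4. Reconstruction: train again on the edited pitch series data.
    return train(labels, edited_pitch)
```

Passing the stages in as callables keeps the outline independent of any particular HMM toolkit; the point is only the order of the steps and the fact that learning runs twice.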

The speech synthesis dictionary construction device may comprise:
a first learning unit that receives a plurality of voice data, a monophone label generated for each of the voice data, a start point pointer and an end point pointer indicating the times corresponding to the start point and end point of the monophone label, and a triphone label generated for each of the voice data, extracts pitch series data from the voice data, generates mel-cepstrum coefficient series data up to a predetermined order from the voice data, and constructs a temporary speech synthesis dictionary by HMM (Hidden Markov Model) learning from the monophone label, the start point pointer, the end point pointer, the triphone label, the pitch series data, and the mel-cepstrum coefficient series data;
a synthesis unit that generates a plurality of synthesized voice data based on the temporary speech synthesis dictionary and the triphone label;
an editing unit that edits the pitch series data to generate edited pitch series data in accordance with an editing policy determined based on a result of comparing synthesized voice pitch series data extracted from the synthesized voice data with the pitch series data extracted by the first learning unit; and
a second learning unit that constructs a speech synthesis dictionary by HMM (Hidden Markov Model) learning from the monophone label, the start point pointer, the end point pointer, the triphone label, the edited pitch series data, and the mel-cepstrum coefficient series data.

For each synthesized voice data, the editing unit may generate a synthesized monophone label together with a synthesized start point pointer and a synthesized end point pointer indicating the times corresponding to the start point and end point of the synthesized monophone label, and may generate the edited pitch series data by editing the pitch series data in accordance with an editing policy determined based on a result of comparing the synthesized voice pitch series data with the pitch series data while referring to the synthesized monophone label, the synthesized start point pointer, the synthesized end point pointer, the monophone label, the start point pointer, and the end point pointer.

By taking into account the monophone label data of both the recorded voice and the synthesized voice, the pitches of the two can be compared phoneme label by phoneme label, so that a fine-grained and appropriate editing policy can be determined.

前記編集部は、前記ピッチ系列データのピッチ変動を大きくすることにより前記編集ピッチ系列データを生成する、ことが望ましい。 It is preferable that the editing unit generates the edited pitch series data by increasing a pitch variation of the pitch series data.

従来の合成音声が平坦で不自然な印象を与えるのは、ピッチ変動が小さいためであると考えられる。そこで、録音音声について、あらかじめそのピッチ変動を大きくするように加工しておけば、合成音声のピッチ変動も大きくなり、より自然な印象を与える合成音声が生成される。 The reason why the conventional synthesized speech gives a flat and unnatural impression is thought to be because the pitch fluctuation is small. Therefore, if the recorded voice is processed in advance so that the pitch fluctuation is increased, the pitch fluctuation of the synthesized voice is also increased, and a synthesized voice giving a more natural impression is generated.

前記編集部は、所定のピッチレベルを基準ピッチレベルとして該基準ピッチレベルを中心に前記ピッチ系列データを拡大することにより前記編集ピッチ系列データを生成する、ことが望ましい。 It is desirable that the editing unit generates the edited pitch series data by enlarging the pitch series data around the reference pitch level with a predetermined pitch level as a reference pitch level.

前記編集部は、例えば、前記ピッチ系列データの平均値を基準ピッチレベルとして該基準ピッチレベルを中心に前記ピッチ系列データを拡大することにより前記編集ピッチ系列データを生成する。 For example, the editing unit generates the edited pitch series data by enlarging the pitch series data around the reference pitch level with an average value of the pitch series data as a reference pitch level.

このようにすれば、ピッチ平均を一定に保ちつつ、ピッチ変動を大きくすることができる。 In this way, it is possible to increase the pitch variation while keeping the pitch average constant.

前記編集部は、あるいは例えば、前記ピッチ系列データのゼロレベルを基準ピッチレベルとして該基準ピッチレベルを中心に前記ピッチ系列データを拡大することにより前記編集ピッチ系列データを生成してもよい。 The editing unit may generate the edited pitch series data by, for example, enlarging the pitch series data around the reference pitch level with a zero level of the pitch series data as a reference pitch level.

このようにしても、ピッチ変動を大きくすることが可能である。 Even in this case, the pitch fluctuation can be increased.
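The two choices of reference pitch level can be illustrated with a small sketch. The function name `expand_about` and the sample values are hypothetical, and every frame is assumed to carry a valid pitch value (unvoiced frames are not handled here).

```python
def expand_about(pitch, factor, reference=None):
    """Scale each value's deviation from `reference` by `factor`.

    If `reference` is None, the mean of `pitch` is used as the
    reference pitch level.
    """
    if reference is None:
        reference = sum(pitch) / len(pitch)
    return [reference + factor * (p - reference) for p in pitch]


pitch = [100.0, 120.0, 140.0]

# Mean as reference: variation grows, the average stays at 120.
about_mean = expand_about(pitch, 1.5)

# Zero level as reference: every value is simply multiplied by the
# factor, so the average rises along with the variation.
about_zero = expand_about(pitch, 1.5, reference=0.0)
```

With the mean as reference, the example yields `[90.0, 120.0, 150.0]`; with the zero level it yields `[150.0, 180.0, 210.0]`, which shows why only the former keeps the pitch average constant.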

For example, the editing unit: obtains, for each voice data, a voice-specific average pitch that is the average value of the pitch series data, and, for each synthesized voice data, a synthesized-voice-specific average pitch that is the average value of the synthesized voice pitch series data; obtains, for each voice data, a voice-specific pitch difference maximum absolute value that is the maximum of the absolute value of the difference between the pitch series data and the voice-specific average pitch, and, for each synthesized voice data, a synthesized-voice-specific pitch difference maximum absolute value that is the maximum of the absolute value of the difference between the synthesized voice pitch series data and the synthesized-voice-specific average pitch; obtains a voice total pitch difference maximum absolute value that is the maximum of the voice-specific pitch difference maximum absolute values over all the voice data, and a synthesized voice total pitch difference maximum absolute value that is the maximum of the synthesized-voice-specific pitch difference maximum absolute values over all the synthesized voice data; obtains an editing total magnification that is the voice total pitch difference maximum absolute value divided by the synthesized voice total pitch difference maximum absolute value; and generates the edited pitch series data by enlarging the pitch series data around the reference pitch level by the editing total magnification.

このような音声総合ピッチ差最大絶対値と合成音声総合ピッチ差最大絶対値とは、それぞれ、録音音声のピッチ変動の程度と合成音声のピッチ変動の程度とを表す指標であるといえる。よって、ピッチ変動の拡大率として、前者の絶対値を後者の絶対値で除した値を採用すれば、ピッチ変動が適度に拡大される。 Such a voice total pitch difference maximum absolute value and a synthesized voice total pitch difference maximum absolute value can be said to be indices representing the degree of pitch fluctuation of the recorded voice and the degree of pitch fluctuation of the synthesized voice, respectively. Therefore, if a value obtained by dividing the former absolute value by the latter absolute value is adopted as the pitch fluctuation expansion rate, the pitch fluctuation is appropriately expanded.
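As a rough illustration of this ratio, the following sketch computes a single magnification over a whole database. The function names and the toy pitch values are hypothetical; the sketch assumes every frame is voiced and that the synthesized pitch is not perfectly flat (otherwise the divisor would be zero).

```python
def max_abs_deviation(pitch):
    """Largest absolute deviation of one pitch series from its own mean."""
    mean = sum(pitch) / len(pitch)
    return max(abs(p - mean) for p in pitch)


def total_magnification(recorded, synthesized):
    """Ratio of the largest deviation over all recorded utterances to the
    largest deviation over all synthesized utterances."""
    rec_total = max(max_abs_deviation(p) for p in recorded)
    syn_total = max(max_abs_deviation(p) for p in synthesized)
    return rec_total / syn_total  # assumes syn_total > 0


recorded = [[100.0, 140.0, 120.0], [110.0, 130.0]]     # deviations: 20, 10
synthesized = [[110.0, 130.0, 120.0], [115.0, 125.0]]  # deviations: 10, 5
k = total_magnification(recorded, synthesized)          # 20 / 10
```

Here `k` is 2.0: the recorded pitch swings twice as far from its mean as the synthesized pitch, so the recorded pitch series would be enlarged by that factor before the second round of learning.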

For example, the editing unit: obtains, for each voice data, a voice-specific average pitch that is the average value of the pitch series data, and, for each synthesized voice data, a synthesized-voice-specific average pitch that is the average value of the synthesized voice pitch series data; obtains, for each voice data, a voice-specific pitch difference maximum value and a voice-specific pitch difference minimum value that are, respectively, the maximum and the minimum of the values obtained by subtracting the voice-specific average pitch from the pitch series data; obtains a voice total pitch difference maximum value that is the maximum of the voice-specific pitch difference maximum values over all the voice data, and a voice total pitch difference minimum value that is the minimum of the voice-specific pitch difference minimum values over all the voice data; obtains, for each synthesized voice data, a synthesized-voice-specific pitch difference maximum value and a synthesized-voice-specific pitch difference minimum value that are, respectively, the maximum and the minimum of the values obtained by subtracting the synthesized-voice-specific average pitch from the synthesized voice pitch series data; obtains a synthesized voice total pitch difference maximum value that is the maximum of the synthesized-voice-specific pitch difference maximum values over all the synthesized voice data, and a synthesized voice total pitch difference minimum value that is the minimum of the synthesized-voice-specific pitch difference minimum values over all the synthesized voice data; obtains an editing upper total magnification that is the voice total pitch difference maximum value divided by the synthesized voice total pitch difference maximum value, and an editing lower total magnification that is the voice total pitch difference minimum value divided by the synthesized voice total pitch difference minimum value; and generates the edited pitch series data by enlarging, around the reference pitch level, those values of the pitch series data that are above the reference pitch level by the editing upper total magnification and those values that are below the reference pitch level by the editing lower total magnification.

ピッチ系列データが基準ピッチレベルを上回っている場合と下回っている場合とで、異なるピッチ拡大率を採用することにより、ピッチ変動がさらに適切に拡大される。 By adopting different pitch enlargement rates depending on whether the pitch series data is above or below the reference pitch level, the pitch variation is further appropriately enlarged.
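A sketch of the two-sided variant follows. The helper names and the sample values are hypothetical, and the sketch assumes voiced frames only and a synthesized pitch that deviates from its mean in both directions (so neither divisor is zero).

```python
def deviations(pitch):
    """Signed deviation of each value from the mean of its own series."""
    mean = sum(pitch) / len(pitch)
    return [p - mean for p in pitch]


def upper_lower_magnification(recorded, synthesized):
    """Return (upper, lower) magnifications over the whole database."""
    rec_max = max(max(deviations(p)) for p in recorded)
    rec_min = min(min(deviations(p)) for p in recorded)
    syn_max = max(max(deviations(p)) for p in synthesized)
    syn_min = min(min(deviations(p)) for p in synthesized)
    # Both minima are negative, so the lower ratio is positive.
    return rec_max / syn_max, rec_min / syn_min


def expand_two_sided(pitch, reference, upper, lower):
    """Enlarge values above `reference` by `upper`, below it by `lower`."""
    return [
        reference + (upper if p > reference else lower) * (p - reference)
        for p in pitch
    ]


recorded = [[110.0, 120.0, 160.0]]      # mean 130, deviations -20, -10, +30
synthesized = [[120.0, 120.0, 150.0]]   # mean 130, deviations -10, -10, +20
upper, lower = upper_lower_magnification(recorded, synthesized)
edited = expand_two_sided(recorded[0], 130.0, upper, lower)
```

In this toy case `upper` is 1.5 and `lower` is 2.0, so the peaks and the dips of the recorded contour are stretched by different amounts. Note that, unlike the single-factor case, the mean of the edited series is generally no longer equal to the reference pitch level.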

For example, the editing unit: obtains, for each voice data, a voice-specific average pitch that is the average value of the pitch series data, and, for each synthesized voice data, a synthesized-voice-specific average pitch that is the average value of the synthesized voice pitch series data; obtains, for each voice data, a voice-specific pitch difference maximum absolute value that is the maximum of the absolute value of the difference between the pitch series data and the voice-specific average pitch, and, for each synthesized voice data, a synthesized-voice-specific pitch difference maximum absolute value that is the maximum of the absolute value of the difference between the synthesized voice pitch series data and the synthesized-voice-specific average pitch; obtains, for each voice data, an editing voice-specific magnification that is the voice-specific pitch difference maximum absolute value divided by the synthesized-voice-specific pitch difference maximum absolute value of the synthesized voice data corresponding to that voice data; and generates the edited pitch series data by enlarging, for each voice data, the pitch series data around the reference pitch level by the editing voice-specific magnification.

音声データ毎にピッチ編集処理が完結しているため、各音声データの特性に応じてピッチ変動が拡大され、きめ細かく適切な編集がなされる。 Since the pitch editing process is completed for each audio data, the pitch variation is expanded according to the characteristics of each audio data, and fine and appropriate editing is performed.
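The per-utterance variant differs from the database-wide one only in where the ratio is taken. The sketch below is a hypothetical illustration that pairs each recorded series with its synthesized counterpart by list position and uses each utterance's own mean as the reference pitch level; both choices are assumptions made for the example.

```python
def max_abs_deviation(pitch):
    """Largest absolute deviation of one pitch series from its own mean."""
    mean = sum(pitch) / len(pitch)
    return max(abs(p - mean) for p in pitch)


def edit_per_voice(recorded, synthesized):
    """Enlarge each recorded series by its own recorded/synthesized ratio."""
    edited = []
    for rec, syn in zip(recorded, synthesized):
        k = max_abs_deviation(rec) / max_abs_deviation(syn)  # assumes > 0
        mean = sum(rec) / len(rec)
        edited.append([mean + k * (p - mean) for p in rec])
    return edited


recorded = [[100.0, 140.0, 120.0], [110.0, 130.0]]
synthesized = [[110.0, 130.0, 120.0], [118.0, 122.0]]
edited = edit_per_voice(recorded, synthesized)
```

The first utterance gets a factor of 2.0 and the second a factor of 5.0, giving `[[80.0, 160.0, 120.0], [70.0, 170.0]]`. An utterance whose synthesized counterpart came out especially flat is therefore stretched more than the others.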

For example, the editing unit: obtains, for each voice data, a voice-specific average pitch that is the average value of the pitch series data, and, for each synthesized voice data, a synthesized-voice-specific average pitch that is the average value of the synthesized voice pitch series data; obtains, for each voice data, a voice-specific pitch difference maximum value and a voice-specific pitch difference minimum value that are, respectively, the maximum and the minimum of the values obtained by subtracting the voice-specific average pitch from the pitch series data; obtains, for each synthesized voice data, a synthesized-voice-specific pitch difference maximum value and a synthesized-voice-specific pitch difference minimum value that are, respectively, the maximum and the minimum of the values obtained by subtracting the synthesized-voice-specific average pitch from the synthesized voice pitch series data; obtains, for each voice data, an editing upper voice-specific magnification that is the voice-specific pitch difference maximum value divided by the synthesized-voice-specific pitch difference maximum value of the synthesized voice data corresponding to that voice data, and an editing lower voice-specific magnification that is the voice-specific pitch difference minimum value divided by the synthesized-voice-specific pitch difference minimum value of the synthesized voice data corresponding to that voice data; and generates the edited pitch series data by enlarging, for each voice data and around the reference pitch level, those values of the pitch series data that are above the reference pitch level by the editing upper voice-specific magnification and those values that are below the reference pitch level by the editing lower voice-specific magnification.

基準ピッチレベルの上側と下側とで、かつ、音声データ単位で、ピッチ変動拡大率を変化させるので、きめ細かく適切な編集がなされる。 Since the pitch fluctuation expansion rate is changed between the upper side and the lower side of the reference pitch level and in units of audio data, fine and appropriate editing is performed.

For example, for each voice data from which the pitch series data to be edited was extracted and for each monophone label, the editing unit obtains a value by dividing the average, from the start time to the end time of the monophone label, of the pitch series data specified by that voice data and that monophone label by the average, from the start time to the end time of the synthesized monophone label equal to that monophone label, of the synthesized pitch series data; takes this value as the editing monophone-specific magnification for that voice data and that monophone label; and generates the edited pitch series data by multiplying the pitch series data by the editing monophone-specific magnification for its source voice data and its monophone label.

音声データよりもさらに小さい単位である音素ラベル単位でピッチ系列データが編集されるので、さらにきめ細かく適切な編集がなされる。 Since the pitch series data is edited in units of phoneme labels, which is a smaller unit than the voice data, further fine and appropriate editing is performed.
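The phoneme-level editing can be sketched as follows for a single utterance. The segment representation `(label, start_frame, end_frame)` with an exclusive end index is an assumption made for illustration (the text speaks of pointers to start and end times), as is matching recorded and synthesized monophones by their order in the utterance.

```python
def mean(values):
    return sum(values) / len(values)


def monophone_magnifications(rec_pitch, rec_segs, syn_pitch, syn_segs):
    """One magnification per monophone: recorded mean / synthesized mean.

    Each segment is (label, start_frame, end_frame), end exclusive.
    """
    factors = []
    for (rec_lab, rs, re_), (syn_lab, ss, se) in zip(rec_segs, syn_segs):
        if rec_lab != syn_lab:
            raise ValueError("monophone label sequences do not match")
        factors.append(mean(rec_pitch[rs:re_]) / mean(syn_pitch[ss:se]))
    return factors


def edit_by_monophone(rec_pitch, rec_segs, factors):
    """Multiply the recorded pitch inside each monophone by its factor."""
    edited = list(rec_pitch)
    for (_, start, end), k in zip(rec_segs, factors):
        for i in range(start, end):
            edited[i] = rec_pitch[i] * k
    return edited


rec_pitch = [100.0, 110.0, 150.0, 170.0]
rec_segs = [("a", 0, 2), ("i", 2, 4)]
syn_pitch = [100.0, 105.0, 110.0, 128.0, 128.0]   # durations may differ
syn_segs = [("a", 0, 3), ("i", 3, 5)]

factors = monophone_magnifications(rec_pitch, rec_segs, syn_pitch, syn_segs)
edited = edit_by_monophone(rec_pitch, rec_segs, factors)
```

In this example the monophone "a" already matches (factor 1.0) and is left unchanged, while "i" was synthesized too low (mean 128 against a recorded mean of 160) and is multiplied by 1.25, giving `[100.0, 110.0, 187.5, 212.5]`.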

In order to achieve the above object, a speech synthesis dictionary construction method according to a second aspect of the present invention comprises:
a temporary construction step of acquiring, from a speech database, a phoneme label string and recorded voice data corresponding to the phoneme label string, extracting recorded voice pitch series data from the acquired recorded voice data, and constructing a temporary speech synthesis dictionary by HMM (Hidden Markov Model) learning based on the extracted recorded voice pitch series data and the acquired phoneme label string;
a synthesized data generation step of generating synthesized voice data by relying on the temporary speech synthesis dictionary and extracting synthesized voice pitch series data from the generated synthesized voice data;
an editing step of editing the recorded voice pitch series data to generate edited pitch series data, based on a result of comparing the recorded voice pitch series data extracted in the temporary construction step from the recorded voice data corresponding to the phoneme label string with the synthesized voice pitch series data extracted in the synthesized data generation step from the synthesized voice data corresponding to that phoneme label string; and
a reconstruction step of constructing a speech synthesis dictionary by HMM learning based on the phoneme label string and the edited pitch series data generated in the editing step.

In order to achieve the above object, a computer program according to a third aspect of the present invention causes a computer to execute:
a temporary construction step of acquiring, from a speech database, a phoneme label string and recorded voice data corresponding to the phoneme label string, extracting recorded voice pitch series data from the acquired recorded voice data, and constructing a temporary speech synthesis dictionary by HMM (Hidden Markov Model) learning based on the extracted recorded voice pitch series data and the acquired phoneme label string;
a synthesized data generation step of generating synthesized voice data by relying on the temporary speech synthesis dictionary and extracting synthesized voice pitch series data from the generated synthesized voice data;
an editing step of editing the recorded voice pitch series data to generate edited pitch series data, based on a result of comparing the recorded voice pitch series data extracted in the temporary construction step from the recorded voice data corresponding to the phoneme label string with the synthesized voice pitch series data extracted in the synthesized data generation step from the synthesized voice data corresponding to that phoneme label string; and
a reconstruction step of constructing a speech synthesis dictionary by HMM learning based on the phoneme label string and the edited pitch series data generated in the editing step.

本発明によれば、いったん仮音声合成辞書を構築し、該辞書に基づいて音声を合成し、該音声を元の音声と比較する。よって、ピッチ変動という観点からみた両音声の差を埋めるための、元の音声に係るピッチ系列データに施すべき編集処理が、容易かつ的確に定まる。そして、そのように処理された音声を元に音声合成辞書を再構築するので、最終的には、平坦でなく自然な印象を与える合成音声の生成に資する音声合成辞書を構築することができる。 According to the present invention, a temporary speech synthesis dictionary is once constructed, speech is synthesized based on the dictionary, and the speech is compared with the original speech. Therefore, the editing process to be performed on the pitch sequence data related to the original voice in order to fill the difference between the two voices from the viewpoint of pitch fluctuation is easily and accurately determined. Then, since the speech synthesis dictionary is reconstructed based on the speech thus processed, it is finally possible to construct a speech synthesis dictionary that contributes to the generation of synthesized speech that gives a natural impression that is not flat.

以下、本発明の実施の形態に係る音声合成辞書構築装置について詳細に説明する。実施形態としては、実施形態１と実施形態２とに分けて説明する。このうち実施形態１については、さらに細かく、編集の具体例１と、編集の具体例１の変形例と、編集の具体例２と、編集の具体例２の変形例と、に言及する。 Hereinafter, the speech synthesis dictionary construction device according to the embodiment of the present invention will be described in detail. The embodiment will be described separately in the first embodiment and the second embodiment. Of these, the first embodiment will be described in more detail by way of editing specific example 1, a modification of editing specific example 1, a specific example 2 of editing, and a modification of specific example 2 of editing.

(Embodiment 1)
FIGS. 2 to 4 show the functional configuration of the speech synthesis dictionary construction device according to Embodiment 1 of the present invention.

The speech synthesis dictionary construction device according to Embodiment 1 of the present invention is composed of a first learning unit 111 (FIG. 2), a first speech synthesis dictionary 223 (FIG. 2), a synthesis unit 113 (FIG. 3), and a second learning unit 117 (FIG. 4).

該音声合成辞書構築装置は、第１音声データベース２２１（図２）に基づいて第２音声合成辞書２２７（図４）を構築するための装置である。 The speech synthesis dictionary construction device is a device for constructing the second speech synthesis dictionary 227 (FIG. 4) based on the first speech database 221 (FIG. 2).

第１音声データベース２２１（図２）は、よく知られた音声データベースである。ここには、所定の文章を読み上げた人の声を録音した音声データとモノフォンラベルデータとトライフォンラベルデータとが組になったものが、多数組、格納されている。カウンタmにより識別される個々の音声データ毎に、該音声データに対応したモノフォンラベルデータとトライフォンラベルデータとが存在する。この様子の理解を容易にするために、音声データベースに音声データのみが格納されている状態から、ラベルデータが作成され音声データベースの完成へと至る手順を、図１を参照しつつ説明する。 The first voice database 221 (FIG. 2) is a well-known voice database. Here, a large number of sets of voice data, monophone label data, and triphone label data obtained by recording a voice of a person who reads out a predetermined sentence are stored. For each piece of audio data identified by the counter m, there is monophone label data and triphone label data corresponding to the audio data. In order to facilitate understanding of this situation, the procedure from the state in which only the voice data is stored in the voice database to the completion of the voice database after the label data is created will be described with reference to FIG.

To create the label data and complete the speech database, a general computer device is used, for example one such as described later with reference to FIG. 5. That is, the device has an interface for accessing a speech database that exists, for example, as a removable hard disk. It also has functions such as loading data from the removable hard disk and performing predetermined processing on it, and temporarily holding the results of the processing or storing them on the removable hard disk.

The incomplete speech database is assumed to store N_Sp pieces of speech data Sp_m (1≦m≦N_Sp).

In the pitch extraction and mel cepstrum analysis of speech data described below, a time window of fixed length is set on the speech data, and the window is shifted at a predetermined period (the frame period) such that successive windows overlap. Processing each window position in turn yields the pitch sequence data and the mel cepstrum coefficient sequence data at each point in time. The symbol fm (0≦fm≦N_fm[m]) denotes the index of the frame period, that is, which frame is being referred to.
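The framing described above can be sketched as follows. This is only a minimal illustration: the function names, the sample-based units, and the choice to drop a trailing partial window are assumptions, not details taken from the specification.

```python
def frame_starts(num_samples, frame_length, frame_period):
    """Return the start sample of each analysis window.

    Successive windows overlap whenever frame_period < frame_length.
    Hypothetical helper; a trailing partial window is simply dropped.
    """
    if frame_length <= 0 or frame_period <= 0:
        raise ValueError("frame_length and frame_period must be positive")
    if num_samples < frame_length:
        return []
    return list(range(0, num_samples - frame_length + 1, frame_period))


def split_into_frames(samples, frame_length, frame_period):
    """Cut samples into windows; the list index plays the role of fm."""
    starts = frame_starts(len(samples), frame_length, frame_period)
    return [samples[s:s + frame_length] for s in starts]


frames = split_into_frames(list(range(10)), frame_length=4, frame_period=2)
last_fm = len(frames) - 1  # corresponds to N_fm[m] for this toy signal
```

With a window of 4 samples shifted by 2, neighbouring frames share half of their samples, which is the overlap the text refers to.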

First, the computer device described above provides an internal counter m for identifying speech data and initializes it to m=1 (step S11 in FIG. 1).

The computer device loads the speech data Sp_m from the incomplete speech database and generates monophone label data MLabData_m[ml] (1≦ml≦ML_Sp[m]) from it by any known method (step S13). Here, ML_Sp[m] is the number of monophone labels contained in the speech data Sp_m.

The monophone label data MLabData_m[ml] consists of a monophone label MLab_m[ml] together with a start frame MFrameS_m[ml] and an end frame MFrameE_m[ml]. The start and end frames are pointers that indicate, as frame numbers, the times within the duration of the speech data Sp_m that correspond to the start point and end point of the monophone label.

The monophone label data MLabData_m[ml] is stored in the speech database (step S15).

Next, the computer device generates triphone label data TLabData_m[tl] (1≦tl≦TL_Sp[m]) by any known method from the speech data Sp_m, which remains loaded (step S17). Here, the triphone label data is simply the triphone label itself, and TL_Sp[m] is the number of triphone labels contained in the speech data Sp_m.

The triphone label data TLabData_m[tl] is stored in the speech database (step S19).

Next, it is determined whether m has reached N_Sp (step S21). If it has not (step S21; No), m is incremented by 1 (step S23) and the process returns to step S13. If it has (step S21; Yes), the process ends.

When the process ends, the speech database holds the monophone label data MLabData_m[ml] and the triphone label data TLabData_m[tl] for every piece of speech data Sp_m. The speech database is thereby complete.
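The loop of FIG. 1 (steps S11 to S23) can be sketched as follows. The class names, field names, and the two labeler callables are hypothetical stand-ins for the "any known method" mentioned in the text; only the control flow and the label layout follow the description.

```python
from dataclasses import dataclass, field


@dataclass
class MonophoneLabel:
    label: str        # MLab_m[ml]
    start_frame: int  # MFrameS_m[ml], given as a frame number
    end_frame: int    # MFrameE_m[ml], given as a frame number


@dataclass
class Utterance:
    speech: list                                    # Sp_m
    monophones: list = field(default_factory=list)  # MLabData_m[ml]
    triphones: list = field(default_factory=list)   # TLabData_m[tl]


def complete_database(utterances, make_monophones, make_triphones):
    """Fill in label data for every utterance, following FIG. 1."""
    n_sp = len(utterances)
    m = 1                                              # step S11
    while m <= n_sp:
        utt = utterances[m - 1]                        # load Sp_m
        utt.monophones = make_monophones(utt.speech)   # steps S13, S15
        utt.triphones = make_triphones(utt.speech)     # steps S17, S19
        m += 1                                         # steps S21, S23
    return utterances
```

The labelers are injected as arguments because the specification leaves the labeling method open; any aligner that returns labels in this shape could be plugged in.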

The first learning unit 111 (FIG. 2) of the speech synthesis dictionary construction device according to Embodiment 1 acquires, from the first speech database 221 completed as described above, the speech data Sp_m (1≦m≦N_Sp), the monophone label data MLabData_m[ml] (1≦ml≦ML_Sp[m]), and the triphone label data TLabData_m[tl] (1≦tl≦TL_Sp[m]). The first learning unit 111 then constructs the first speech synthesis dictionary 223, a speech synthesis dictionary used to generate synthesized speech, by phoneme HMM learning, which is a known method. The contents stored in the first speech synthesis dictionary 223 are referred to as the first learning result.

The first learning unit 111 includes a first pitch extraction unit 311, a mel cepstrum analysis unit 313, and a first phoneme HMM learning unit 315.

The first pitch extraction unit 311 receives the speech data Sp_m (1≦m≦N_Sp) from the first speech database 221, generates pitch sequence data Pit_m[fm] from the m-th speech data by any known method, and passes it to the first phoneme HMM learning unit 315 and to the second learning unit 117 (FIG. 4) described later.

The mel cepstrum analysis unit 313 (FIG. 2) receives the speech data Sp_m (1≦m≦N_Sp) from the first speech database 221 and applies D-th order mel cepstrum analysis, a known method, to it. As a result, for every frame fm (0≦fm≦N_fm[m]) of the m-th speech data, the mel cepstrum analysis unit 313 generates mel cepstrum coefficient sequence data MC_m^d[fm] (0≦d≦D) of orders 0 through D, and passes it to the first phoneme HMM learning unit 315 and to the second learning unit 117 (FIG. 4) described later.

The first phoneme HMM learning unit 315 (FIG. 2) receives the monophone label data MLabData_m[ml] (1≦m≦N_Sp, 1≦ml≦ML_Sp[m]) and the triphone label data TLabData_m[tl] (1≦m≦N_Sp, 1≦tl≦TL_Sp[m]) from the first speech database 221. It also receives the pitch sequence data Pit_m[fm] (1≦m≦N_Sp, 0≦fm≦N_fm[m]) from the first pitch extraction unit 311 and the mel cepstrum coefficient sequence data MC_m^d[fm] (1≦m≦N_Sp, 0≦d≦D, 0≦fm≦N_fm[m]) from the mel cepstrum analysis unit 313. From these data, the first phoneme HMM learning unit 315 generates the first learning result by phoneme HMM learning, a known method, and stores it in the first speech synthesis dictionary 223. More precisely, storing the first learning result in an empty database completes that database as the first speech synthesis dictionary 223.

The synthesis unit 113 shown in FIG. 3 includes a phoneme HMM sequence generation unit 321, a time series data generation unit 323, an excitation sound source generation unit 325, and an MLSA synthesis filter unit 327.

The synthesis unit 113 acquires the triphone label data TLabData_m[tl] from the first speech database 221 (FIG. 2) and the first learning result from the first speech synthesis dictionary 223, and outputs synthesized speech data SynSp_m (1≦m≦N_Sp). The output synthesized speech data SynSp_m is passed to the second learning unit 117 (FIG. 4) described later.

Because the triphone label data TLabData_m[tl] is taken from the first speech database 221, the synthesis unit 113 in effect utters, as synthesized speech, the same lines as the speech data stored in the first speech database 221. Naturally, therefore, each piece of synthesized speech data is identified by the same index m as the original speech data, and the number of pieces of synthesized speech data is N_Sp, the same as the number of pieces of original speech data.

As shown in FIG. 2, the synthesized speech here is generated from the result of conventional, well-known phoneme HMM learning. Such synthesized speech is known to sound unnatural, giving a flat impression compared with the natural human speech from which it derives. This is because the pitch variation of the synthesized speech is generally smaller than that of the original speech.

The phoneme HMM sequence generation unit 321 in FIG. 3 receives the triphone label data TLabData_m[tl] from the first speech database 221 in FIG. 2 and the first learning result from the first speech synthesis dictionary 223 in FIG. 2. Based on the received first learning result, it generates, by a known method, phoneme HMM sequence data for pitch and phoneme HMM sequence data for mel cepstrum from the received triphone label data TLabData_m[tl], and passes them to the time series data generation unit 323.

The time series data generation unit 323 generates, by a known method, pitch time series data and mel cepstrum time series data from the phoneme HMM sequence data for pitch and the phoneme HMM sequence data for mel cepstrum that it has been passed. It passes the pitch time series data to the excitation sound source generation unit 325 and the mel cepstrum time series data to the MLSA synthesis filter unit 327.

The excitation sound source generation unit 325 generates excitation sound source data by a known method from the pitch time series data it has been passed, and passes the result to the MLSA synthesis filter unit 327.
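The specification does not say which known method the excitation sound source generation unit 325 uses. One conventional choice in HMM-based synthesis is a pulse train for voiced frames and white noise for unvoiced frames, and the sketch below illustrates that choice only. The function name, the use of Hz for pitch, `None` as the unvoiced marker, and unit pulse amplitude are all assumptions.

```python
import random


def make_excitation(pitch_hz, samples_per_frame, sample_rate, seed=0):
    """Build an excitation signal from a per-frame pitch track.

    pitch_hz holds one value per frame: a positive frequency in Hz for a
    voiced frame, or None for an unvoiced frame (assumed convention).
    """
    rng = random.Random(seed)
    excitation = []
    next_pulse = 0.0  # sample position at which the next pulse is due
    for f0 in pitch_hz:
        if f0 is not None and f0 <= 0:
            raise ValueError("voiced frames need a positive pitch")
        for _ in range(samples_per_frame):
            n = len(excitation)
            if f0 is None:
                excitation.append(rng.gauss(0.0, 1.0))  # white noise
                next_pulse = float(n + 1)  # restart pulses after noise
            elif n >= next_pulse:
                excitation.append(1.0)                  # pitch pulse
                next_pulse += sample_rate / f0          # one period later
            else:
                excitation.append(0.0)
    return excitation


exc = make_excitation([100.0], samples_per_frame=20, sample_rate=1000)
```

In this toy call the pitch period is 10 samples, so the 20-sample frame contains pulses at samples 0 and 10.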

The MLSA synthesis filter unit 327 configures itself as an MLSA (Mel Log Spectrum Approximation) filter by a known method, based on the mel cepstrum time series data passed from the time series data generation unit 323. When the excitation sound source data generated by the excitation sound source generation unit 325 is input to the MLSA synthesis filter unit 327 so configured, the synthesized speech data SynSp_m is output. The output synthesized speech data SynSp_m is sent to the second learning unit 117 in FIG. 4.

The second learning unit 117 shown in FIG. 4 includes a second pitch extraction unit 341, a policy determination unit 343, an editing unit 345, and a second phoneme HMM learning unit 347.

The second learning unit 117 acquires the triphone label data TLabData_m[tl] and the monophone label data MLabData_m[ml] from the first speech database 221 (FIG. 2), receives the pitch sequence data Pit_m[fm] and the mel cepstrum coefficient sequence data MC_m^d[fm] from the first learning unit 111 (FIG. 2), and receives the synthesized speech data SynSp_m from the synthesis unit 113 (FIG. 3). As described below, it performs phoneme HMM learning based on these data and outputs the result as the second learning result.

The second pitch extraction unit 341 in FIG. 4 has the same function as the first pitch extraction unit 311 in FIG. 2 and operates in almost the same way. The differences are that its input is the synthesized speech data SynSp_m rather than the speech data Sp_m, and that the upper limit of fm is M_fm[m], which does not necessarily equal N_fm[m]. Because of these differences, the data generated by the second pitch extraction unit 341 is referred to as synthesized speech pitch sequence data SynPit_m[fm] (0≦fm≦M_fm[m]). This data is passed to the policy determination unit 343.

The policy determination unit 343 gathers the pitch sequence data Pit_m[fm] and the synthesized speech pitch sequence data SynPit_m[fm]. The former is generated from speech data collected from natural human utterances, whereas the latter is generated from synthesized speech data produced by way of a speech synthesis dictionary. Having both kinds of data, the policy determination unit 343 can compare them. Through this comparison, it examines what processing should have been applied to the original speech in advance so that the synthesized speech would not become flat and unnatural relative to the original. Specifically, the policy determination unit 343 determines an editing policy, that is, how the pitch sequence data Pit_m[fm] should be edited before phoneme HMM learning. At least qualitatively, if the pitch sequence data Pit_m[fm] is edited beforehand so that the pitch variation of the original speech becomes larger, the synthesized speech becomes more natural.

The details of the editing policy are described later by way of examples.

The policy determination unit 343 conveys to the editing unit 345 the editing policy for the pitch sequence data Pit_m[fm] that it has determined through this comparison.

The editing unit 345 edits the pitch sequence data Pit_m[fm] in accordance with the conveyed editing policy, generates edited pitch sequence data EdPit_m[fm], and passes it to the second phoneme HMM learning unit 347.

The second phoneme HMM learning unit 347 has the same function as the first phoneme HMM learning unit 315 in FIG. 2 and performs almost the same processing. The difference is that it uses the edited pitch sequence data EdPit_m[fm] in place of the pitch sequence data Pit_m[fm]. That is, the second phoneme HMM learning unit 347 (FIG. 4) receives the monophone label data MLabData_m[ml], the triphone label data TLabData_m[tl], the edited pitch sequence data EdPit_m[fm], and the mel cepstrum coefficient sequence data MC_m^d[fm], generates the second learning result from them by phoneme HMM learning, and stores it in the second speech synthesis dictionary 227. More precisely, storing the second learning result in an empty database completes that database as the second speech synthesis dictionary 227.

This second speech synthesis dictionary 227 is the speech synthesis dictionary that the device according to the present embodiment aims to construct. Compared with synthesized speech generated from the first speech synthesis dictionary 223 (FIG. 2), which is constructed by conventional techniques, synthesized speech generated from the second speech synthesis dictionary 227 has sufficiently large pitch variation and sounds natural. This is because, as described above, the policy determination unit 343 (FIG. 4) determines the processing to be applied to the original speech data so that the synthesized speech does not become flat and unnatural, namely an editing policy for the pitch sequence data Pit_m[fm] that enlarges the pitch variation of the original speech data, and phoneme HMM learning is then performed using the edited pitch sequence data EdPit_m[fm] generated by the editing unit 345 according to that policy.
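The overall data flow of FIGS. 2 to 4 can be summarized as a two-pass procedure. The sketch below shows only the order of operations. Every callable passed in is a hypothetical placeholder for a unit of the device or for a "known method", and the dictionary-based utterance layout is an assumption.

```python
def build_dictionary(database, extract_pitch, analyze_mcep, train_hmm,
                     synthesize, decide_policy, edit_pitch):
    """Two-pass construction: train, synthesize, compare, edit, retrain."""
    # First learning unit 111 (FIG. 2)
    pit = [extract_pitch(u["speech"]) for u in database]        # unit 311
    mcep = [analyze_mcep(u["speech"]) for u in database]        # unit 313
    labels = [(u["monophones"], u["triphones"]) for u in database]
    first_dict = train_hmm(labels, pit, mcep)                   # unit 315

    # Synthesis unit 113 (FIG. 3): re-utter the same lines
    syn = [synthesize(first_dict, u["triphones"]) for u in database]

    # Second learning unit 117 (FIG. 4)
    syn_pit = [extract_pitch(s) for s in syn]                   # unit 341
    policy = decide_policy(pit, syn_pit)                        # unit 343
    ed_pit = [edit_pitch(p, policy) for p in pit]               # unit 345
    return train_hmm(labels, ed_pit, mcep)                      # unit 347
```

Note that the mel cepstrum data and the labels are reused unchanged in the second pass; only the pitch sequence data is replaced by its edited version.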

Physically, the speech synthesis dictionary construction device described so far with reference to FIGS. 2 to 4 is implemented by a general computer device 511 such as shown in FIG. 5.

A CPU (Central Processing Unit) 521, a ROM (Read Only Memory) 523, a storage unit 525, an operation key input processing unit 533, and a data input/output interface (hereinafter, I/F) 555 are interconnected by a system bus 541. The system bus 541 is a transmission path for transferring instructions and data.

The CPU 521 contains various registers (not shown), such as counter registers and general-purpose registers. In accordance with an operation program read from the ROM 523, it loads numeric sequences and other data to be processed from the storage unit 525 into those registers as needed, performs predetermined operations on the loaded data, and stores the results in the storage unit 525 or elsewhere.

In addition to known operation programs for phoneme HMM learning, the ROM 523 stores, in the present embodiment in particular, an operation program for determining the editing policy for the pitch sequence data Pit_m[fm] and generating the edited pitch sequence data EdPit_m[fm].

The storage unit 525 consists of a RAM (Random Access Memory) 527 and an internal hard disk 529, and temporarily stores speech data, label data, pitch sequence data, mel cepstrum coefficient sequence data, phoneme HMMs, and the like. These data are transferred from the internal registers of the CPU 521 or from the removable hard disks described later.

In the present embodiment in particular, the internal hard disk 529 is assumed to serve as the first speech synthesis dictionary 223 (FIG. 2). For the device according to the present embodiment, the first speech synthesis dictionary 223 is merely an intermediate product. It is neither supplied from outside nor ultimately removed from the device for use elsewhere, so temporary storage suffices.

The operation key input processing unit 533 accepts an operation signal from the operation keys 531, which serve as the user I/F, and inputs the key code signal corresponding to the operation signal to the CPU 521. The CPU 521 determines the operation to perform based on the input key code signal.

It is preferable that the user be able to change the operation settings of the speech synthesis dictionary construction device according to the present embodiment via the operation keys 531. For example, in the procedure described later for generating the edited pitch sequence data EdPit_m[fm] from the pitch sequence data Pit_m[fm], one of the specific editing examples described later may be preselected in the ROM 523 as the editing policy, and the user may be allowed to change that setting via the operation keys 531 if desired.

The data input/output I/F 555 is an interface for connecting to, among others, a first removable hard disk 551 containing the original data and a second removable hard disk 553 for recording the processed data. For efficiency, the I/F is assumed to allow both removable hard disks to be connected at the same time. The I/F is of a general specification that permits bidirectional data communication with either of the first and second removable hard disks 551 and 553, which is why bidirectional outlined arrows are drawn in the figure. In communication with the first removable hard disk 551, however, original data is mainly read from the disk, whereas in communication with the second removable hard disk 553, processed data is mainly written to the disk, so information flows mainly in the directions indicated by the solid arrows.

The original data is assumed to be the data stored in the first speech database 221 of FIG. 2, and the processed data is assumed to be the second learning result stored in the second speech synthesis dictionary 227 of FIG. 4. That is, the first removable hard disk 551 corresponds to the first speech database 221 in FIG. 2, and the second removable hard disk 553 corresponds to the second speech synthesis dictionary 227 in FIG. 4.

When the user wishes to construct a speech synthesis dictionary using the device according to the present embodiment, the user connects the given first speech database 221, that is, the first removable hard disk 551, and an empty second removable hard disk 553 to their respective positions on the data input/output I/F 555. The user then starts the device, for example by operating the operation keys 531. Various processes are then carried out under the control of the CPU 521. For example, data is exchanged between the computer device 511 and the first and second removable hard disks 551 and 553 via the data input/output I/F 555. When the operation finishes, the second learning result shown in FIG. 4 has been written to the second removable hard disk 553. In other words, all of the data that the disk needs in order to function as the second speech synthesis dictionary 227 of FIG. 4 has been written to it. Thereafter, if the user wishes to generate synthesized speech, the user can remove the disk from the data input/output I/F 555, attach it to a speech synthesizer that accepts such a disk as its speech synthesis dictionary, and operate that speech synthesizer to generate synthesized speech.

As shown in FIG. 4, the speech synthesis dictionary construction device according to the present embodiment is characterized in that the policy determination unit 343 determines an editing policy for the pitch sequence data Pit_m[fm], and the editing unit 345 edits the pitch sequence data Pit_m[fm] in accordance with that policy to generate the edited pitch sequence data EdPit_m[fm].

The editing processing performed by the editing unit 345 may be any processing that moderately increases the pitch variation of the speech data Sp_m. In the present embodiment in particular, however, it is important that the guideline for this processing be determined efficiently, accurately, and simply, based on the pitch sequence data Pit_m[fm] and the synthesized speech pitch sequence data SynPit_m[fm] gathered in the policy determination unit 343.

(Specific examples of editing)
Typical procedures for this editing processing are described below.

At least qualitatively, if the edited pitch sequence data EdPit_m[fm] is obtained by enlarging the difference between each value of the pitch sequence data Pit_m[fm] and its mean, the pitch variation of the speech data Sp_m becomes larger. The following description of the specific editing examples therefore concentrates on how, concretely, the difference between the mean and each individual value is enlarged. The basic idea throughout is to enlarge the difference by multiplying it by some editing value greater than 1, and thereby to obtain the edited pitch sequence data EdPit_m[fm].
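The basic idea of enlarging deviations from the mean can be written, under assumptions, as EdPit[fm] = mean + gain × (Pit[fm] − mean) with gain > 1. In the sketch below the mean is taken over the voiced frames of a single utterance and unvoiced frames are marked with `None`. Both conventions, like the function name, are assumptions made for illustration; the specific examples in the specification define the mean and the multiplier in their own ways.

```python
def emphasize_pitch(pit, gain):
    """Scale each voiced pitch value's deviation from the mean by gain.

    pit: per-frame pitch values, None for unvoiced frames (assumed).
    gain: editing value; values above 1 enlarge the pitch variation.
    """
    voiced = [p for p in pit if p is not None]
    if not voiced:
        return list(pit)  # nothing to edit in an all-unvoiced utterance
    mean = sum(voiced) / len(voiced)
    return [None if p is None else mean + gain * (p - mean) for p in pit]


ed_pit = emphasize_pitch([100.0, None, 120.0, 140.0], gain=1.5)
```

Here the voiced mean is 120, so deviations of −20, 0 and +20 become −30, 0 and +30, while the unvoiced frame is left untouched.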

However, synthesized speech generated from speech data in which this difference has been enlarged too far is expected to sound unnatural in its own right, because it is not flat enough. The editing value greater than 1 must therefore not be too large; a moderate value is desirable. It is reasonable to determine this moderate value by comparing the original speech data with the conventional synthesized speech data generated from it. The difference between the former and the latter corresponds to the degree of flattening that arises in the course of phoneme HMM learning. If an editing policy is set on the basis of that difference and the former data is edited in advance into less flat data, the synthesized speech data generated from the edited data can be expected to be moderately less flat and to give the listener a natural impression. As will be clear from the following description, the editing policy determined by the policy determination unit 343 in FIG. 4 follows this line of reasoning.
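One way to picture how a moderate editing value might be derived from the comparison is to measure how much flatter the synthesized pitch is than the original and to use that ratio, clamped to a safe range, as the multiplier. The sketch below is purely illustrative: the use of the standard deviation as the measure of variation, the clamp limits, the `None` convention for unvoiced frames, and the function name are all assumptions and are not the procedure of the specification. Comparing summary statistics has the side benefit that the two sequences need not have the same number of frames.

```python
from statistics import pstdev


def estimate_gain(pit, syn_pit, max_gain=2.0):
    """Estimate an editing value from original and synthesized pitch.

    Returns the ratio of the original pitch spread to the synthesized
    pitch spread over voiced frames, clamped to [1.0, max_gain].
    """
    orig = [p for p in pit if p is not None]
    syn = [p for p in syn_pit if p is not None]
    if len(orig) < 2 or len(syn) < 2:
        return 1.0  # too little data to justify any editing
    syn_spread = pstdev(syn)
    if syn_spread == 0.0:
        return max_gain  # completely flat synthesis: use the cap
    return min(max(pstdev(orig) / syn_spread, 1.0), max_gain)


gain = estimate_gain([100.0, 120.0, 140.0], [110.0, 120.0, 130.0])
```

The upper clamp reflects the caution in the text that the editing value should not be too large.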

Which of the procedures described below is best to adopt depends on various factors, such as the nature of the sample data recorded in the first speech database 221 (FIG. 2), the processing capability of the CPU 521 of the computer device 511 (FIG. 5) used as the speech synthesis dictionary construction device according to the present embodiment, the content to be uttered as synthesized speech, and how listeners perceive the synthesized speech, so no general conclusion can be drawn. It is reasonable to try several procedures and determine which is best under the given conditions.

Although various procedures are conceivable, they are consistent, as described above, in that the policy determination unit 343 in FIG. 4 determines an editing policy and the pitch sequence data is edited accordingly. In other words, the procedures shown below are variations within the scope of this technical idea.

図５に示したとおり、本実施形態に係る音声合成辞書構築装置として機能するコンピュータ装置５１１は、記憶装置として、ＣＰＵ５２１の内蔵レジスタと、記憶部５２５の中のＲＡＭ５２７及び内蔵ハードディスク５２９と、を有する他にも、音声合成辞書構築中にはデータ入出力Ｉ／Ｆ５５５に接続され続けているため事実上前記コンピュータ装置５１１の一部ともいえる第１リムーバブルハードディスク５５１及び第２リムーバブルハードディスク５５３と、を有する。以下では、理解を容易にするために、各種演算が行われる場である前記レジスタ以外の記憶装置を総称して、単に記憶部５２５と呼ぶことにする。すると、記憶部５２５には、音声データSp_mと、モノフォンラベルデータMLabData_m[ml]と、トライフォンラベルデータTLabData_m[tl]と、が初めから格納されていることになる。以下ではさらに、ピッチ系列データPit_m[fm]、メルケプストラム係数系列データMC_m ^d[fm]、及び、合成音声ピッチ系列データSynPit_m[fm]が既に求められ記憶部５２５に格納されているものとする。 As illustrated in FIG. 5, the computer device 511 that functions as the speech synthesis dictionary construction device according to the present embodiment has, as storage devices, the built-in registers of the CPU 521 and the RAM 527 and built-in hard disk 529 in the storage unit 525. It also has the first removable hard disk 551 and the second removable hard disk 553, which remain connected to the data input/output I/F 555 while the speech synthesis dictionary is being constructed and can therefore be regarded as effectively part of the computer device 511. In the following, to facilitate understanding, the storage devices other than the registers, which are where the various operations are performed, are collectively referred to simply as the storage unit 525. The storage unit 525 therefore holds the speech data Sp_m, the monophone label data MLabData_m[ml], and the triphone label data TLabData_m[tl] from the beginning. In the following, it is further assumed that the pitch sequence data Pit_m[fm], the mel-cepstrum coefficient sequence data MC_m^d[fm], and the synthesized speech pitch sequence data SynPit_m[fm] have already been obtained and stored in the storage unit 525.

なお、フレームは、有声音のフレームと無声音のフレームとに分類される。どのフレームが有声音のフレームでどのフレームが無声音のフレームであるかは、既に任意の既知の手法によって求められ、その結果も記憶部５２５に格納済みであるものとする。そして、以下では、「Pit_m[fm]」又は「SynPit_m[fm]」と記した場合、それはピッチの具体的な値を意味するとともに、特に明示しなくても、それに対応するフレームfmが有声音のフレームであるかそれとも無声音のフレームであるかの区別が既についていることも意味するものとする。 Frames are classified into voiced frames and unvoiced frames. It is assumed that which frames are voiced and which are unvoiced has already been determined by any known method, and that the result is also stored in the storage unit 525. In the following, the notation “Pit_m[fm]” or “SynPit_m[fm]” denotes a concrete pitch value and, even where not stated explicitly, also implies that the corresponding frame fm has already been identified as either voiced or unvoiced.

（編集の具体例１） (Specific example 1 of editing)
図６〜図１２に示すフローチャートを参照しつつ、編集の具体例１について説明する。 Specific example 1 of editing will be described with reference to the flowcharts shown in FIGS. 6 to 12.

まず、図６のように、編集用録音平均値AvePit_mを算出する。そのためには、図５のＣＰＵ５２１の内部のカウンタレジスタ（図示せず。）にカウンタmの初期値として1が格納される（ステップＳ１１１）。このmは、着目しているピッチ系列データがどの音声データに属しているかを識別するための変数である。 First, as shown in FIG. 6, the recording average value AvePit _m for editing is calculated. For this purpose, 1 is stored in the counter register (not shown) of the CPU 521 in FIG. 5 as the initial value of the counter m (step S111). This m is a variable for identifying to which audio data the pitch sequence data of interest belongs.

次に、ＣＰＵ５２１は、内部の汎用レジスタ（図示せず。）に編集用録音平均値AvePit_mを格納する領域を設けるとともに、編集用録音平均値AvePit_mを0に設定する。そして、ＣＰＵ５２１は、前記mを格納するカウンタレジスタとは別のカウンタレジスタにフレーム識別用カウンタfmを格納することとして、その初期値を0に設定する。そして、ＣＰＵ５２１は、さらに別のカウンタレジスタにカウンタN_Vfm[m]を格納することとして、その初期値を0に設定する（ステップＳ１１３）。 Next, the CPU 521 provides an area for storing the recording average value AvePit _m for editing in an internal general-purpose register (not shown), and sets the recording average value AvePit _m for editing to 0. Then, the CPU 521 stores the frame identification counter fm in a counter register different from the counter register that stores the m, and sets its initial value to 0. Then, CPU 521 stores counter N _Vfm [m] in another counter register, and sets its initial value to 0 (step S113).

前記N_Vfm[m]はm番目の音声データのうちにいくつの有声音のフレームが存在するかを数えるための変数である。 The N _Vfm [m] is a variable for counting how many voiced frames exist in the m-th audio data.

続いて、ＣＰＵ５２１は、fmで特定されるフレームが有声音のフレームであるか否かを判別する（ステップＳ１１５）。 Subsequently, the CPU 521 determines whether or not the frame specified by fm is a voiced sound frame (step S115).

fmで特定されるフレームが有声音のフレームであると判別された場合（ステップＳ１１５；Ｙｅｓ）は、ＣＰＵ５２１は、記憶部５２５からピッチ系列データPit_m[fm]をロードする（ステップＳ１１７）。そして、ＣＰＵ５２１は、かかるPit_m[fm]をAvePit_mに加えて新たなAvePit_mとするとともに、N_Vfm[m]に1を加えて新たなN_Vfm[m]とする（ステップＳ１１９）。すなわち、AvePit_mはAvePit_m+ Pit_m[fm]に、N_Vfm[m]はN_Vfm[m]+1に、それぞれ更新される。その後、ステップＳ１２１に進む。 When it is determined that the frame specified by fm is a voiced frame (step S115; Yes), the CPU 521 loads the pitch sequence data Pit_m[fm] from the storage unit 525 (step S117). The CPU 521 then adds this Pit_m[fm] to AvePit_m to obtain the new AvePit_m, and adds 1 to N_Vfm[m] to obtain the new N_Vfm[m] (step S119). That is, AvePit_m is updated to AvePit_m + Pit_m[fm], and N_Vfm[m] is updated to N_Vfm[m] + 1. Thereafter, the process proceeds to step S121.

fmで特定されるフレームが有声音のフレームではないと判別された場合（ステップＳ１１５；Ｎｏ）は、すぐにステップＳ１２１に進む。 If it is determined that the frame specified by fm is not a voiced sound frame (step S115; No), the process immediately proceeds to step S121.

ステップＳ１２１では、ＣＰＵ５２１は、m番目の音声データの全てのフレームについての処理が完了したか否かを判別する。つまり、ＣＰＵ５２１は、fm≧N_fm[m]であるか否かを判別する。 In step S121, the CPU 521 determines whether or not the processing for all the frames of the mth audio data has been completed. That is, the CPU 521 determines whether or not fm ≧ N _fm [m].

fm≧N_fm[m]ではないと判別された場合（ステップＳ１２１；Ｎｏ）は、ＣＰＵ５２１は、次のフレームについての処理を行うために、fmを1増加する（ステップＳ１２３）。そして、ステップＳ１１５に戻る。一方、fm≧N_fm[m]であると判別された場合（ステップＳ１２１；Ｙｅｓ）は、ステップＳ１２５に進む。 If it is determined that fm ≧ N _fm [m] is not satisfied (step S121; No), the CPU 521 increases fm by 1 in order to perform processing for the next frame (step S123). Then, the process returns to step S115. On the other hand, if it is determined that fm ≧ N _fm [m] (step S121; Yes), the process proceeds to step S125.

ステップＳ１２５に進んだ時点では、AvePit_mの値は、m番目の音声データのうちの有声音のフレームにおけるピッチ系列データPit_m[fm]の合計値となっている。そこで、ステップＳ１２５では、ＣＰＵ５２１は、AvePit_mを有声音のフレームの数であるN_Vfmで除して、m番目の音声データのうちの有声音のフレームにおけるピッチ系列データの平均値としてのAvePit_mを求める。すなわち、AvePit_mは、AvePit_m /N_Vfmに更新される。ＣＰＵ５２１は、更新されたAvePit_mを、記憶部５２５に格納する（ステップＳ１２７）。 At the point where the process reaches step S125, the value of AvePit_m is the sum of the pitch sequence data Pit_m[fm] over the voiced frames of the m-th speech data. In step S125, the CPU 521 therefore divides AvePit_m by N_Vfm[m], the number of voiced frames, to obtain AvePit_m as the average of the pitch sequence data over the voiced frames of the m-th speech data. That is, AvePit_m is updated to AvePit_m / N_Vfm[m]. The CPU 521 stores the updated AvePit_m in the storage unit 525 (step S127).

ＣＰＵ５２１は、全ての音声データについての処理が完了したか否かを判別する（ステップＳ１２９）。つまり、ＣＰＵ５２１は、m≧N_Spであるか否かを判別する。 The CPU 521 determines whether or not processing for all audio data has been completed (step S129). That is, the CPU 521 determines whether m ≧ N _Sp is satisfied.

m≧N_Spではないと判別された場合（ステップＳ１２９；Ｎｏ）は、ＣＰＵ５２１は、次の音声データについての処理を行うために、mを1増加する（ステップＳ１３１）。そして、ステップＳ１１３に戻る。一方、m≧N_Spであると判別された場合（ステップＳ１２９；Ｙｅｓ）は、ＣＰＵ５２１は、処理を終了する。 When it is determined that m ≧ N _Sp is not satisfied (step S129; No), the CPU 521 increments m by 1 to perform processing for the next audio data (step S131). Then, the process returns to step S113. On the other hand, when it is determined that m ≧ N _Sp (step S129; Yes), the CPU 521 ends the process.
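The averaging procedure of FIG. 6 can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function name `voiced_mean_pitch`, the use of `None` to mark unvoiced frames, and the sample pitch values are all assumptions introduced here, and the guard against an utterance with no voiced frames is an addition that the flowchart does not describe.

```python
from typing import List, Optional, Sequence


def voiced_mean_pitch(pitch: Sequence[Optional[float]]) -> float:
    """Return the mean pitch over voiced frames (the final AvePit_m)."""
    total = 0.0    # AvePit_m while it still holds the running sum (S119)
    n_voiced = 0   # N_Vfm[m], the number of voiced frames seen so far
    for value in pitch:            # frame loop, S115 to S123
        if value is not None:      # voiced frame (assumed encoding)
            total += value
            n_voiced += 1
    if n_voiced == 0:
        # Not covered by the flowchart; added here to avoid dividing by zero.
        raise ValueError("utterance contains no voiced frames")
    return total / n_voiced        # S125


# Hypothetical pitch values in Hz for two utterances (m = 1, 2).
utterances: List[List[Optional[float]]] = [
    [120.0, None, 130.0, 110.0],
    [None, 200.0, 220.0],
]
ave_pit = [voiced_mean_pitch(u) for u in utterances]  # outer loop, S129 to S131
print(ave_pit)
```

Unvoiced frames contribute neither to the sum nor to the frame count, so the result is an average over voiced frames only, as in steps S115 to S125.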

続いて、図７のように、編集用合成平均値AveSynPit_mを算出する（ステップＳ１６１〜Ｓ１８１）。かかる算出の手順は、図６に示した編集用録音平均値AvePit_mを算出する手順とほぼ同じである。主な相違点は、ピッチ系列データPit_m[fm]についての平均値ではなく合成音声ピッチ系列データSynPit_m[fm]についての平均値を求めること（ステップＳ１６７、ステップＳ１６９、ステップＳ１７５）や、m番目の音声データに対応するフレームの番号の上限は必ずしもN_fm[m]と一致しないため別の変数M_fm[m]が用いられていること（ステップＳ１７１）である。 Subsequently, as shown in FIG. 7, the synthesis average value for editing AveSynPit_m is calculated (steps S161 to S181). The calculation procedure is almost the same as the procedure for calculating the recording average value for editing AvePit_m shown in FIG. 6. The main differences are that the average is taken over the synthesized speech pitch sequence data SynPit_m[fm] rather than over the pitch sequence data Pit_m[fm] (steps S167, S169, and S175), and that, because the upper limit of the frame number for the synthesized speech corresponding to the m-th speech data does not necessarily match N_fm[m], a separate variable M_fm[m] is used (step S171).

続いて、図８のように、編集用録音音声別最大絶対値mxAbsDiffPit_mを算出する。 Subsequently, as shown in FIG. 8, the maximum absolute value mxAbsDiffPit _m for each recording sound for editing is calculated.

ＣＰＵ５２１は、音声データ識別用カウンタmをm=1に設定する（ステップＳ２１１）。ＣＰＵ５２１はさらに、mxAbsDiffPit_mを例えば0といった十分小さな値に設定するとともに、フレーム識別用カウンタfmをfm=0に設定する（ステップＳ２１３）。 The CPU 521 sets the audio data identification counter m to m = 1 (step S211). The CPU 521 further sets mxAbsDiffPit _m to a sufficiently small value such as 0, and sets the frame identification counter fm to fm = 0 (step S213).

ＣＰＵ５２１は、fmで特定されるフレームが有声音のフレームであるか否かを判別する（ステップＳ２１５）。 The CPU 521 determines whether or not the frame specified by fm is a voiced sound frame (step S215).

fmで特定されるフレームが有声音のフレームではないと判別された場合（ステップＳ２１５；Ｎｏ）は、ステップＳ２２５に進む。 When it is determined that the frame specified by fm is not a voiced sound frame (step S215; No), the process proceeds to step S225.

fmで特定されるフレームが有声音のフレームであると判別された場合（ステップＳ２１５；Ｙｅｓ）は、ＣＰＵ５２１は、記憶部５２５から、ピッチ系列データPit_m[fm]と、上述のとおり図６に示された手順により求められたAvePit_mと、をロードする（ステップＳ２１７）。そして、ＣＰＵ５２１は、TmpmxAbsDiffPit_m=| Pit_m[fm]-AvePit_m|を算出し（ステップＳ２１９）、TmpmxAbsDiffPit_m≧mxAbsDiffPit_mであるか否かを判別する（ステップＳ２２１）。TmpmxAbsDiffPit_m≧mxAbsDiffPit_mではないと判別された場合（ステップＳ２２１；Ｎｏ）は、すぐにステップＳ２２５に進み、TmpmxAbsDiffPit_m≧mxAbsDiffPit_mであると判別された場合（ステップＳ２２１；Ｙｅｓ）は、mxAbsDiffPit_mをmxAbsDiffPit_m=TmpmxAbsDiffPit_mのように更新してから（ステップＳ２２３）、ステップＳ２２５に進む。 When it is determined that the frame specified by fm is a voiced frame (step S215; Yes), the CPU 521 loads from the storage unit 525 the pitch sequence data Pit_m[fm] and the value AvePit_m obtained by the procedure shown in FIG. 6 as described above (step S217). The CPU 521 then calculates TmpmxAbsDiffPit_m = |Pit_m[fm] - AvePit_m| (step S219) and determines whether TmpmxAbsDiffPit_m ≧ mxAbsDiffPit_m (step S221). If it is determined that TmpmxAbsDiffPit_m ≧ mxAbsDiffPit_m is not satisfied (step S221; No), the process immediately proceeds to step S225. If it is determined that TmpmxAbsDiffPit_m ≧ mxAbsDiffPit_m is satisfied (step S221; Yes), mxAbsDiffPit_m is updated as mxAbsDiffPit_m = TmpmxAbsDiffPit_m (step S223), and the process then proceeds to step S225.

ステップＳ２２５では、ＣＰＵ５２１は、fm≧N_fm[m]であるか否かを判別する。 In step S225, the CPU 521 determines whether or not fm ≧ N _fm [m].

fm≧N_fm[m]ではないと判別された場合（ステップＳ２２５；Ｎｏ）は、fmを1増加してから（ステップＳ２２７）、ステップＳ２１５に戻る。一方、fm≧N_fm[m]であると判別された場合（ステップＳ２２５；Ｙｅｓ）は、ステップＳ２２９に進む。 If it is determined that fm ≧ N _fm [m] is not satisfied (step S225; No), fm is incremented by 1 (step S227), and the process returns to step S215. On the other hand, if it is determined that fm ≧ N _fm [m] (step S225; Yes), the process proceeds to step S229.

ステップＳ２２９では、ＣＰＵ５２１は、mxAbsDiffPit_mを記憶部５２５に格納する。 In step S229, the CPU 521 stores mxAbsDiffPit _m in the storage unit 525.

ステップＳ２３１では、ＣＰＵ５２１は、m≧N_Spであるか否かを判別する。 In step S231, the CPU 521 determines whether m ≧ N _Sp is satisfied.

m≧N_Spではないと判別された場合（ステップＳ２３１；Ｎｏ）は、mを1増加してから（ステップＳ２３３）、ステップＳ２１３に戻る。一方、m≧N_Spであると判別された場合（ステップＳ２３１；Ｙｅｓ）は、処理を終了する。 If it is determined that m ≧ N _Sp is not satisfied (step S231; No), m is incremented by 1 (step S233), and the process returns to step S213. On the other hand, when it is determined that m ≧ N _Sp (step S231; Yes), the process ends.
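The per-utterance search of FIG. 8 amounts to a running maximum of absolute deviations from the mean, restricted to voiced frames. The following Python sketch illustrates this under assumptions made here for illustration: the function name `max_abs_deviation`, the `None` marker for unvoiced frames, and the sample values are hypothetical and do not come from the patent.

```python
from typing import Optional, Sequence


def max_abs_deviation(pitch: Sequence[Optional[float]], mean: float) -> float:
    """Largest |Pit_m[fm] - AvePit_m| over voiced frames (mxAbsDiffPit_m)."""
    best = 0.0  # initial value 0 is "sufficiently small" for absolute values (S213)
    for value in pitch:                  # frame loop, S215 to S227
        if value is None:                # unvoiced frame: skip (S215; No)
            continue
        deviation = abs(value - mean)    # TmpmxAbsDiffPit_m (S219)
        if deviation >= best:            # comparison of S221
            best = deviation             # update of S223
    return best


# Hypothetical pitch values in Hz; the mean over voiced frames is 115.0.
pitch = [100.0, None, 130.0, 115.0]
mean = 115.0
mx_abs_diff_pit = max_abs_deviation(pitch, mean)
print(mx_abs_diff_pit)
```

Starting from 0 is safe because an absolute value can never be negative, which is why the text can call 0 a sufficiently small initial value.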

続いて、図９のように、編集用合成音声別最大絶対値mxAbsDiffSynPit_mを算出する（ステップＳ２６１〜Ｓ２８３）。かかる算出の手順は、図８に示した編集用録音音声別最大絶対値mxAbsDiffPit_mを算出する手順とほぼ同じである。主な相違点は、ピッチ系列データPit_m[fm]ではなく合成音声ピッチ系列データSynPit_m[fm]を扱う点（ステップＳ２６７、ステップＳ２６９）や、図６に示した手順により求められた編集用録音平均値AvePit_mではなく図７に示した手順により求められた編集用合成平均値AveSynPit_mを用いる点（ステップＳ２６７、ステップＳ２６９）や、m番目の音声データに対応するフレームの番号の上限としてN_fm[m]ではなくM_fm[m]を用いる点（ステップＳ２７５）である。 Subsequently, as shown in FIG. 9, the maximum absolute value for each synthesized speech for editing, mxAbsDiffSynPit_m, is calculated (steps S261 to S283). The calculation procedure is almost the same as the procedure for calculating the maximum absolute value for each recorded speech for editing, mxAbsDiffPit_m, shown in FIG. 8. The main differences are that the synthesized speech pitch sequence data SynPit_m[fm] is handled instead of the pitch sequence data Pit_m[fm] (steps S267 and S269), that the synthesis average value for editing AveSynPit_m obtained by the procedure shown in FIG. 7 is used instead of the recording average value for editing AvePit_m obtained by the procedure shown in FIG. 6 (steps S267 and S269), and that M_fm[m] is used instead of N_fm[m] as the upper limit of the frame number corresponding to the m-th speech data (step S275).

続いて、図１０のように、編集用録音総合最大絶対値MaxAbsDiffPitを算出する。 Subsequently, as shown in FIG. 10, an editing recording maximum absolute value MaxAbsDiffPit is calculated.

ＣＰＵ５２１は、MaxAbsDiffPitを例えば0といった十分小さな値に設定するとともに、音声データ識別用カウンタmをm=1に設定する（ステップＳ３１１）。 The CPU 521 sets MaxAbsDiffPit to a sufficiently small value such as 0, and sets the audio data identification counter m to m = 1 (step S311).

ＣＰＵ５２１は、記憶部５２５から、図８に示した手順により求められたmxAbsDiffPit_mをロードし（ステップＳ３１３）、TmpMaxAbsDiffPit=mxAbsDiffPit_mとする（ステップＳ３１５）。 The CPU 521 loads mxAbsDiffPit _m obtained by the procedure shown in FIG. 8 from the storage unit 525 (step S313) and sets TmpMaxAbsDiffPit = mxAbsDiffPit _m (step S315).

ＣＰＵ５２１は、TmpMaxAbsDiffPit≧MaxAbsDiffPitであるか否かを判別する（ステップＳ３１７）。TmpMaxAbsDiffPit≧MaxAbsDiffPitではないと判別された場合（ステップＳ３１７；Ｎｏ）は、すぐにステップＳ３２１に進む。一方、TmpMaxAbsDiffPit≧MaxAbsDiffPitであると判別された場合（ステップＳ３１７；Ｙｅｓ）は、MaxAbsDiffPitをMaxAbsDiffPit=TmpMaxAbsDiffPitのように更新してから（ステップＳ３１９）、ステップＳ３２１に進む。 The CPU 521 determines whether or not TmpMaxAbsDiffPit ≧ MaxAbsDiffPit (step S317). If it is determined that TmpMaxAbsDiffPit ≧ MaxAbsDiffPit is not satisfied (step S317; No), the process immediately proceeds to step S321. On the other hand, if it is determined that TmpMaxAbsDiffPit ≧ MaxAbsDiffPit (step S317; Yes), MaxAbsDiffPit is updated as MaxAbsDiffPit = TmpMaxAbsDiffPit (step S319), and then the process proceeds to step S321.

ステップＳ３２１では、ＣＰＵ５２１は、m≧N_Spであるか否かを判別する。 In step S321, the CPU 521 determines whether m ≧ N _Sp is satisfied.

m≧N_Spではないと判別された場合（ステップＳ３２１；Ｎｏ）は、mを1増加してから（ステップＳ３２３）、ステップＳ３１３に戻る。一方、m≧N_Spであると判別された場合（ステップＳ３２１；Ｙｅｓ）は、MaxAbsDiffPitを記憶部５２５に格納してから（ステップＳ３２５）、処理を終了する。 If it is determined that m ≧ N _Sp is not satisfied (step S321; No), m is incremented by 1 (step S323), and the process returns to step S313. On the other hand, if it is determined that m ≧ N _Sp (step S321; Yes), MaxAbsDiffPit is stored in the storage unit 525 (step S325), and the process ends.

続いて、図１１のように、編集用合成総合最大絶対値MaxAbsDiffSynPitを算出する（ステップＳ３６１〜Ｓ３７５）。かかる算出の手順は、図１０に示した編集用録音総合最大絶対値MaxAbsDiffPitを算出する手順とほぼ同じである。主な相違点は、図８に示した手順により求められた編集用録音音声別最大絶対値mxAbsDiffPit_mではなく図９に示した手順により求められた編集用合成音声別最大絶対値mxAbsDiffSynPit_mを用いること（ステップＳ３６３、ステップＳ３６５）である。 Subsequently, as shown in FIG. 11, the synthesis overall maximum absolute value for editing MaxAbsDiffSynPit is calculated (steps S361 to S375). The calculation procedure is almost the same as the procedure for calculating the recording overall maximum absolute value for editing MaxAbsDiffPit shown in FIG. 10. The main difference is that the maximum absolute value for each synthesized speech for editing, mxAbsDiffSynPit_m, obtained by the procedure shown in FIG. 9 is used instead of the maximum absolute value for each recorded speech for editing, mxAbsDiffPit_m, obtained by the procedure shown in FIG. 8 (steps S363 and S365).

このように、図６と図７、図８と図９、図１０と図１１、は、それぞれ、第１音声データベース２２１（図２）に格納されている元の音声のピッチ系列データについての手順と第１音声合成辞書２２３（図２）に基づく合成音声のピッチ系列データについての手順との対になっている。 As described above, FIGS. 6 and 7, FIGS. 8 and 9, and FIGS. 10 and 11 are procedures for pitch sequence data of the original speech stored in the first speech database 221 (FIG. 2), respectively. And a procedure for pitch sequence data of synthesized speech based on the first speech synthesis dictionary 223 (FIG. 2).
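Because each pair of flowcharts differs only in its input data, a single routine can in principle serve both the recorded and the synthesized pitch sequences. The Python sketch below illustrates this idea; the function name `overall_max_abs_deviation`, the `None` marker for unvoiced frames, and the sample values are assumptions made here, and the sketch uses Python built-ins instead of the register-level loops of the flowcharts.

```python
from typing import List, Optional, Sequence

PitchSeq = Sequence[Optional[float]]  # None marks an unvoiced frame (assumed)


def overall_max_abs_deviation(utterances: Sequence[PitchSeq]) -> float:
    """Largest deviation from the per-utterance voiced mean over all utterances.

    Applied to recorded pitch this corresponds to MaxAbsDiffPit (FIGS. 6, 8, 10);
    applied to synthesized pitch it corresponds to MaxAbsDiffSynPit (FIGS. 7, 9, 11).
    """
    overall = 0.0
    for pitch in utterances:
        voiced = [v for v in pitch if v is not None]
        if not voiced:               # no voiced frame: nothing to measure
            continue
        mean = sum(voiced) / len(voiced)                  # FIG. 6 / FIG. 7
        per_utterance = max(abs(v - mean) for v in voiced)  # FIG. 8 / FIG. 9
        overall = max(overall, per_utterance)             # FIG. 10 / FIG. 11
    return overall


# Hypothetical data: the synthesized contours are flatter than the recorded ones.
recorded: List[List[Optional[float]]] = [
    [100.0, None, 130.0, 115.0],
    [None, 180.0, 220.0],
]
synthesized: List[List[Optional[float]]] = [
    [108.0, None, 122.0, 115.0],
    [None, 190.0, 210.0],
]
max_abs_diff_pit = overall_max_abs_deviation(recorded)
max_abs_diff_syn_pit = overall_max_abs_deviation(synthesized)
print(max_abs_diff_pit, max_abs_diff_syn_pit)
```

The recorded and synthesized sequences need not have the same number of frames, which mirrors the use of N_fm[m] for one and M_fm[m] for the other.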

本編集の具体例において最終的に編集ピッチ系列データEdPit_m[fm]を求める手順を、図１２のフローチャートに示す。 The procedure for finally obtaining the edit pitch sequence data EdPit _m [fm] in the specific example of this editing is shown in the flowchart of FIG.

なお、図１２においては、本実施形態における編集ピッチ系列データEdPit_m[fm]を、第１編集ピッチ系列データEdPit_m[fm]と表記してあるが、記号EdPit_m[fm]自体は、後述の他の例においても、編集部３４５（図４）が生成する編集ピッチ系列データを指すものとする。 In FIG. 12, the edited pitch sequence data EdPit _m [fm] in the present embodiment is represented as first edited pitch sequence data EdPit _m [fm], but the symbol EdPit _m [fm] itself will be described later. In another example, it is assumed that the editing pitch sequence data generated by the editing unit 345 (FIG. 4) is indicated.

第１編集ピッチ系列データEdPit_m[fm]を求めるためには、まず、ＣＰＵ５２１は、図１０に示した手順により求められた編集用録音総合最大絶対値MaxAbsDiffPitと、図１１に示した手順により求められた編集用合成総合最大絶対値MaxAbsDiffSynPitと、を記憶部５２５からロードする（ステップＳ４１１）。 In order to obtain the first editing pitch series data EdPit _m [fm], first, the CPU 521 obtains the recording total maximum absolute value MaxAbsDiffPit for editing obtained by the procedure shown in FIG. 10 and the procedure shown in FIG. The edited composition total maximum absolute value MaxAbsDiffSynPit is loaded from the storage unit 525 (step S411).

続いて、ＣＰＵ５２１は、音声データ識別用カウンタmを、m=1に設定し（ステップＳ４１３）、フレーム識別用カウンタfmをfm=0に設定する（ステップＳ４１５）。 Subsequently, the CPU 521 sets the audio data identification counter m to m = 1 (step S413), and sets the frame identification counter fm to fm = 0 (step S415).

ＣＰＵ５２１は、ピッチ系列データPit_m[fm]をロードし（ステップＳ４１７）、fmで特定されるフレームが有声音のフレームであるか否かを判別する（ステップＳ４１９）。 The CPU 521 loads the pitch series data Pit _m [fm] (step S417), and determines whether or not the frame specified by fm is a voiced sound frame (step S419).

fmで特定されるフレームが有声音のフレームであると判別された場合（ステップＳ４１９；Ｙｅｓ）は、ＣＰＵ５２１は、図６に示した手順により求められた編集用録音平均値AvePit_mを記憶部５２５からロードする（ステップＳ４２１）。そして、ＣＰＵ５２１は、第１編集ピッチ系列データEdpit_m[fm]を、
Edpit_m[fm]=(Pit_m[fm]- AvePit_m)×(MaxAbsDiffPit/MaxAbsDiffSynPit)+AvePit_m
により算出し（ステップＳ４２３）、記憶部５２５に格納する（ステップＳ４２７）。
When it is determined that the frame specified by fm is a voiced frame (step S419; Yes), the CPU 521 loads from the storage unit 525 the recording average value for editing AvePit_m obtained by the procedure shown in FIG. 6 (step S421). The CPU 521 then calculates the first edited pitch sequence data Edpit_m[fm] as
Edpit_m[fm] = (Pit_m[fm] - AvePit_m) × (MaxAbsDiffPit / MaxAbsDiffSynPit) + AvePit_m
(step S423) and stores it in the storage unit 525 (step S427).

一方、fmで特定されるフレームが有声音のフレームではないと判別された場合（ステップＳ４１９；Ｎｏ）は、ＣＰＵ５２１は、Pit_m[fm]をそのまま第１編集ピッチ系列データEdpit_m[fm]とし（ステップＳ４２５）、記憶部５２５に格納する（ステップＳ４２７）。 On the other hand, when it is determined that the frame specified by fm is not a voiced sound frame (step S419; No), the CPU 521 sets Pit _m [fm] as it is as the first edited pitch sequence data Edpit _m [fm]. (Step S425), the data is stored in the storage unit 525 (Step S427).

ＣＰＵ５２１は、fm≧N_fm[m]であるか否かを判別する（ステップＳ４２９）。 The CPU 521 determines whether or not fm ≧ N _fm [m] (step S429).

fm≧N_fm[m]ではないと判別された場合（ステップＳ４２９；Ｎｏ）は、fmを1増加してから（ステップＳ４３１）、ステップＳ４１７に戻る。一方、fm≧N_fm[m]であると判別された場合（ステップＳ４２９；Ｙｅｓ）は、ステップＳ４３３に進む。 If it is determined that fm ≧ N _fm [m] is not satisfied (step S429; No), fm is incremented by 1 (step S431), and the process returns to step S417. On the other hand, when it is determined that fm ≧ N _fm [m] (step S429; Yes), the process proceeds to step S433.

ステップＳ４３３では、ＣＰＵ５２１は、m≧N_Spであるか否かを判別する。 In step S433, the CPU 521 determines whether m ≧ N _Sp is satisfied.

m≧N_Spではないと判別された場合（ステップＳ４３３；Ｎｏ）は、mを1増加してから（ステップＳ４３５）、ステップＳ４１５に戻る。一方、m≧N_Spであると判別された場合（ステップＳ４３３；Ｙｅｓ）は、処理を終了する。 If it is determined that m ≧ N _Sp is not satisfied (step S433; No), m is incremented by 1 (step S435), and the process returns to step S415. On the other hand, if it is determined that m ≧ N _Sp (step S433; Yes), the process is terminated.
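The editing step of FIG. 12 can be illustrated with the short Python sketch below. It is a sketch under stated assumptions rather than the patented implementation: the function name `edit_pitch`, the `None` marker for unvoiced frames, and the numeric values are hypothetical. Only the formula of step S423 is taken from the description.

```python
from typing import List, Optional, Sequence


def edit_pitch(
    pitch: Sequence[Optional[float]],
    ave_pit: float,
    max_abs_diff_pit: float,
    max_abs_diff_syn_pit: float,
) -> List[Optional[float]]:
    """Expand pitch deviations around the utterance mean (steps S417 to S431)."""
    scale = max_abs_diff_pit / max_abs_diff_syn_pit  # same factor for every utterance
    edited: List[Optional[float]] = []
    for value in pitch:
        if value is None:
            edited.append(None)                      # unvoiced: copied as is (S425)
        else:
            edited.append((value - ave_pit) * scale + ave_pit)  # formula of S423
    return edited


# Hypothetical values: recorded deviations reach 15.0, synthesized ones only 10.0,
# so the scale factor is 1.5.
pitch = [100.0, None, 130.0, 115.0]
ed_pit = edit_pitch(pitch, ave_pit=115.0,
                    max_abs_diff_pit=15.0, max_abs_diff_syn_pit=10.0)
print(ed_pit)
```

Note that the mean AvePit_m is specific to each utterance, whereas the scale factor is computed once from the two overall maxima and shared by all utterances.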

既に述べたとおり、一般に、従来の方法によれば、合成音声は平坦な印象を与える不自然な音声となるのであり、その理由は、合成音声のピッチ変動が元の自然な音声のピッチ変動に比べて小さくなってしまうためである。 As already described, synthesized speech produced by the conventional method generally sounds unnatural and gives a flat impression, because the pitch variation of the synthesized speech becomes smaller than the pitch variation of the original natural speech.

ところで、図１０に示した手順から明らかなように、編集用録音総合最大絶対値MaxAbsDiffPitは、元の音声データ全部のうちの、ピッチ平均からのズレの最大値を示している。一方、図１１に示した手順から明らかなように、編集用合成総合最大絶対値MaxAbsDiffSynPitは、合成音声データ全部のうちの、ピッチ平均からのズレの最大値を示している。 Incidentally, as is clear from the procedure shown in FIG. 10, the total recording maximum absolute value MaxAbsDiffPit for editing indicates the maximum value of the deviation from the pitch average in all the original audio data. On the other hand, as is clear from the procedure shown in FIG. 11, the editing combined maximum maximum absolute value MaxAbsDiffSynPit indicates the maximum deviation from the pitch average in all the synthesized speech data.

よって、編集用録音総合最大絶対値MaxAbsDiffPitは元の自然な音声におけるピッチ変動の程度を表す指標であり、編集用合成総合最大絶対値MaxAbsDiffSynPitは合成音声におけるピッチ変動の程度を表す指標である、と考えることができる。 Therefore, the recording overall maximum absolute value for editing MaxAbsDiffPit can be regarded as an index of the degree of pitch variation in the original natural speech, and the synthesis overall maximum absolute value for editing MaxAbsDiffSynPit can be regarded as an index of the degree of pitch variation in the synthesized speech.

合成音声が上述のように平坦であるということから、ほぼ確実に、MaxAbsDiffSynPit＜MaxAbsDiffPitとなることが期待される。そして、合成音声のピッチ変動を元の自然な音声のピッチ変動と同程度にするためには、元の自然な音声におけるピッチ変動を、あらかじめ、1より適度に大きい値であると期待される(MaxAbsDiffPit/MaxAbsDiffSynPit)倍に拡大しておくのが適切であると考えられる。図１２のステップＳ４２３において、ピッチ系列データとその平均値との差に(MaxAbsDiffPit/MaxAbsDiffSynPit)なる値を乗じているのは、このような理由による。 Since the synthesized speech is flat as described above, MaxAbsDiffSynPit < MaxAbsDiffPit is expected to hold almost certainly. To make the pitch variation of the synthesized speech comparable to that of the original natural speech, it is therefore considered appropriate to enlarge the pitch variation of the original natural speech in advance by a factor of (MaxAbsDiffPit / MaxAbsDiffSynPit), which is expected to be moderately larger than 1. This is why the difference between the pitch sequence data and its average value is multiplied by (MaxAbsDiffPit / MaxAbsDiffSynPit) in step S423 of FIG. 12.
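The effect of the scaling in step S423 can be written out as a short derivation. The symbol r for the scale factor is introduced here only for readability and does not appear in the original description; the derivation assumes r > 0 and that the sums and maxima run over the voiced frames of the m-th speech data.

```latex
\begin{align*}
r &= \frac{\mathrm{MaxAbsDiffPit}}{\mathrm{MaxAbsDiffSynPit}}, \\
\mathrm{EdPit}_m[fm] - \mathrm{AvePit}_m
  &= r\,\bigl(\mathrm{Pit}_m[fm] - \mathrm{AvePit}_m\bigr), \\
\frac{1}{N_{Vfm}[m]} \sum_{fm\ \text{voiced}} \mathrm{EdPit}_m[fm]
  &= r\,\bigl(\mathrm{AvePit}_m - \mathrm{AvePit}_m\bigr) + \mathrm{AvePit}_m
   = \mathrm{AvePit}_m, \\
\max_{fm\ \text{voiced}} \bigl|\mathrm{EdPit}_m[fm] - \mathrm{AvePit}_m\bigr|
  &= r \cdot \mathrm{mxAbsDiffPit}_m .
\end{align*}
```

In other words, the edit leaves the voiced-frame mean of each utterance unchanged and multiplies every deviation from that mean by r. With hypothetical values MaxAbsDiffPit = 60 and MaxAbsDiffSynPit = 40, r = 1.5, so a frame lying 20 above the mean is moved to 30 above the mean.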

こうしてあらかじめピッチ変動を大きくしておいたピッチ系列データを用いて第２音素ＨＭＭ学習部３４７（図４）による音素ＨＭＭ学習がなされるので、かかる学習結果が格納された第２音声合成辞書２２７（図４）は、自然な印象を与える音声を合成することに役立つ。 In this way, the phoneme HMM learning is performed by the second phoneme HMM learning unit 347 (FIG. 4) using the pitch series data in which the pitch variation is increased in advance, so that the second speech synthesis dictionary 227 (in which such learning results are stored) FIG. 4) is useful for synthesizing speech that gives a natural impression.

（編集の具体例１の変形例） (Modification of specific example 1 of editing)
図１３〜図２１と、既に説明した図６及び図７と、に示すフローチャートを参照しつつ、編集の具体例１の変形例について説明する。 A modification of specific example 1 of editing will be described with reference to the flowcharts shown in FIGS. 13 to 21 and in FIGS. 6 and 7, which have already been described.

図１３のように、編集用録音音声別最大値mxDiffPit_mを算出する。ＣＰＵ５２１は、音声データ識別用カウンタmをm=1に設定する（ステップＳ９１１１）。ＣＰＵ５２１はさらに、mxDiffPit_mを0等の十分小さな値に設定するとともに、フレーム識別用カウンタfmをfm=0に設定する（ステップＳ９１１３）。 As shown in FIG. 13, the maximum value mxDiffPit _m for each recording sound for editing is calculated. The CPU 521 sets the audio data identification counter m to m = 1 (step S9111). Further, the CPU 521 sets mxDiffPit _m to a sufficiently small value such as 0, and sets the frame identification counter fm to fm = 0 (step S9113).

ＣＰＵ５２１は、フレームfmが有声音のフレームであるか否かを判別する（ステップＳ９１１５）。 The CPU 521 determines whether or not the frame fm is a voiced sound frame (step S9115).

フレームfmが有声音のフレームではないと判別された場合（ステップＳ９１１５；Ｎｏ）は、すぐにステップＳ９１２５に進む。 If it is determined that the frame fm is not a voiced frame (step S9115; No), the process immediately proceeds to step S9125.

フレームfmが有声音のフレームであると判別された場合（ステップＳ９１１５；Ｙｅｓ）は、ＣＰＵ５２１は、ピッチ系列データPit_m[fm]と、図６に示す手順により求められた編集用録音平均値AvePit_mと、を記憶部５２５からロードする（ステップＳ９１１７）。そして、ＣＰＵ５２１は、TmpmxDiffPit_m=Pit_m[fm]-AvePit_mを算出し（ステップＳ９１１９）、TmpmxDiffPit_m≧mxDiffPit_mであるか否かを判別する（ステップＳ９１２１）。TmpmxDiffPit_m≧mxDiffPit_mではないと判別された場合（ステップＳ９１２１；Ｎｏ）は、すぐにステップＳ９１２５に進み、TmpmxDiffPit_m≧mxDiffPit_mであると判別された場合（ステップＳ９１２１；Ｙｅｓ）は、mxDiffPit_mをmxDiffPit_m=TmpmxDiffPit_mのように更新してから（ステップＳ９１２３）、ステップＳ９１２５に進む。 When it is determined that the frame fm is a voiced frame (step S9115; Yes), the CPU 521 loads from the storage unit 525 the pitch sequence data Pit_m[fm] and the recording average value for editing AvePit_m obtained by the procedure shown in FIG. 6 (step S9117). The CPU 521 then calculates TmpmxDiffPit_m = Pit_m[fm] - AvePit_m (step S9119) and determines whether TmpmxDiffPit_m ≧ mxDiffPit_m (step S9121). If it is determined that TmpmxDiffPit_m ≧ mxDiffPit_m is not satisfied (step S9121; No), the process immediately proceeds to step S9125. If it is determined that TmpmxDiffPit_m ≧ mxDiffPit_m is satisfied (step S9121; Yes), mxDiffPit_m is updated as mxDiffPit_m = TmpmxDiffPit_m (step S9123), and the process then proceeds to step S9125.

ステップＳ９１２５では、ＣＰＵ５２１は、fm≧N_fm[m]であるか否かを判別する。fm≧N_fm[m]ではないと判別された場合（ステップＳ９１２５；Ｎｏ）は、fmを1増加してから（ステップＳ９１２７）、ステップＳ９１１５に戻る。一方、fm≧N_fm[m]であると判別された場合（ステップＳ９１２５；Ｙｅｓ）は、ＣＰＵ５２１はmxDiffPit_mを記憶部５２５に格納してから（ステップＳ９１２９）、ステップＳ９１３１に進む。 In step S9125, the CPU 521 determines whether or not fm ≧ N _fm [m]. If it is determined that fm ≧ N _fm [m] is not satisfied (step S9125; No), fm is incremented by 1 (step S9127), and the process returns to step S9115. On the other hand, if it is determined that fm ≧ N _fm [m] (step S9125; Yes), the CPU 521 stores mxDiffPit _m in the storage unit 525 (step S9129), and then proceeds to step S9131.

ステップＳ９１３１では、ＣＰＵ５２１は、m≧N_Spであるか否かを判別する。m≧N_Spではないと判別された場合（ステップＳ９１３１；Ｎｏ）は、mを1増加してから（ステップＳ９１３３）、ステップＳ９１１３に戻る。一方、m≧N_Spであると判別された場合（ステップＳ９１３１；Ｙｅｓ）は、処理を終了する。 In step S9131, the CPU 521 determines whether m ≧ N _Sp is satisfied. If it is determined that m ≧ N _Sp is not satisfied (step S9131; No), m is incremented by 1 (step S9133), and the process returns to step S9113. On the other hand, if it is determined that m ≧ N _Sp (step S9131; Yes), the process ends.

続いて、図１４のように、編集用録音音声別最小値mnDiffPit_mを算出する。ＣＰＵ５２１は、音声データ識別用カウンタmをm=1に設定する（ステップＳ９１６１）。ＣＰＵ５２１はさらに、mnDiffPit_mを0等の十分大きな値に設定するとともに、フレーム識別用カウンタfmをfm=0に設定する（ステップＳ９１６３）。 Subsequently, as shown in FIG. 14, a minimum value mnDiffPit _m for each recording sound for editing is calculated. The CPU 521 sets the audio data identification counter m to m = 1 (step S9161). The CPU 521 further sets mnDiffPit _m to a sufficiently large value such as 0 and sets the frame identification counter fm to fm = 0 (step S9163).

なお、編集用録音音声別最小値mnDiffPit_mは、この後の手順から明らかなように、最終的には0以下の値になる。よって、上述のようにステップＳ９１６３では、mnDiffPit_mの初期値を例えば0とすれば、mnDiffPit_mの初期値を十分大きな値に設定したといえる。 As is clear from the subsequent procedure, the minimum value for each recorded speech for editing, mnDiffPit_m, ultimately becomes a value of 0 or less. Therefore, setting the initial value of mnDiffPit_m to 0, for example, in step S9163 as described above is enough for that initial value to count as sufficiently large.

ＣＰＵ５２１は、フレームfmが有声音のフレームであるか否かを判別する（ステップＳ９１６５）。 The CPU 521 determines whether or not the frame fm is a voiced sound frame (step S9165).

フレームfmが有声音のフレームではないと判別された場合（ステップＳ９１６５；Ｎｏ）は、すぐにステップＳ９１７５に進む。 If it is determined that the frame fm is not a voiced frame (step S9165; No), the process immediately proceeds to step S9175.

フレームfmが有声音のフレームであると判別された場合（ステップＳ９１６５；Ｙｅｓ）は、ＣＰＵ５２１は、ピッチ系列データPit_m[fm]と、図６に示す手順により求められた編集用録音平均値AvePit_mと、を記憶部５２５からロードする（ステップＳ９１６７）。そして、ＣＰＵ５２１は、TmpmnDiffPit_m=Pit_m[fm]-AvePit_mを算出し（ステップＳ９１６９）、TmpmnDiffPit_m≦mnDiffPit_mであるか否かを判別する（ステップＳ９１７１）。TmpmnDiffPit_m≦mnDiffPit_mではないと判別された場合（ステップＳ９１７１；Ｎｏ）は、すぐにステップＳ９１７５に進み、TmpmnDiffPit_m≦mnDiffPit_mであると判別された場合（ステップＳ９１７１；Ｙｅｓ）は、mnDiffPit_mをmnDiffPit_m=TmpmnDiffPit_mのように更新してから（ステップＳ９１７３）、ステップＳ９１７５に進む。 When it is determined that the frame fm is a voiced frame (step S9165; Yes), the CPU 521 loads from the storage unit 525 the pitch sequence data Pit_m[fm] and the recording average value for editing AvePit_m obtained by the procedure shown in FIG. 6 (step S9167). The CPU 521 then calculates TmpmnDiffPit_m = Pit_m[fm] - AvePit_m (step S9169) and determines whether TmpmnDiffPit_m ≦ mnDiffPit_m (step S9171). If it is determined that TmpmnDiffPit_m ≦ mnDiffPit_m is not satisfied (step S9171; No), the process immediately proceeds to step S9175. If it is determined that TmpmnDiffPit_m ≦ mnDiffPit_m is satisfied (step S9171; Yes), mnDiffPit_m is updated as mnDiffPit_m = TmpmnDiffPit_m (step S9173), and the process then proceeds to step S9175.

ステップＳ９１７５では、ＣＰＵ５２１は、fm≧N_fm[m]であるか否かを判別する。fm≧N_fm[m]ではないと判別された場合（ステップＳ９１７５；Ｎｏ）は、fmを1増加してから（ステップＳ９１７７）、ステップＳ９１６５に戻る。一方、fm≧N_fm[m]であると判別された場合（ステップＳ９１７５；Ｙｅｓ）は、ＣＰＵ５２１はmnDiffPit_mを記憶部５２５に格納してから（ステップＳ９１７９）、ステップＳ９１８１に進む。 In step S9175, the CPU 521 determines whether or not fm ≧ N _fm [m]. If it is determined that fm ≧ N _fm [m] is not satisfied (step S9175; No), fm is incremented by 1 (step S9177), and the process returns to step S9165. On the other hand, if it is determined that fm ≧ N _fm [m] (step S9175; Yes), the CPU 521 stores mnDiffPit _m in the storage unit 525 (step S9179), and then proceeds to step S9181.

ステップＳ９１８１では、ＣＰＵ５２１は、m≧N_Spであるか否かを判別する。m≧N_Spではないと判別された場合（ステップＳ９１８１；Ｎｏ）は、mを1増加してから（ステップＳ９１８３）、ステップＳ９１６３に戻る。一方、m≧N_Spであると判別された場合（ステップＳ９１８１；Ｙｅｓ）は、処理を終了する。 In step S9181, the CPU 521 determines whether m ≧ N _Sp is satisfied. If it is determined that m ≧ N _Sp is not satisfied (step S9181; No), m is incremented by 1 (step S9183), and the process returns to step S9163. On the other hand, if it is determined that m ≧ N _Sp (step S9181; Yes), the process ends.
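The signed searches of FIGS. 13 and 14 can be sketched together in Python as follows. This is an illustrative sketch only: the function name `signed_extremes`, the `None` marker for unvoiced frames, and the sample values are assumptions introduced here, and the two flowcharts are merged into one loop for brevity.

```python
from typing import Optional, Sequence, Tuple


def signed_extremes(
    pitch: Sequence[Optional[float]], mean: float
) -> Tuple[float, float]:
    """Return (mxDiffPit_m, mnDiffPit_m): the largest and the smallest signed
    deviation Pit_m[fm] - AvePit_m over voiced frames."""
    mx = 0.0  # "sufficiently small" start for the maximum (S9113)
    mn = 0.0  # "sufficiently large" start for the minimum (S9163)
    for value in pitch:
        if value is None:          # unvoiced frame: skip
            continue
        diff = value - mean        # TmpmxDiffPit_m / TmpmnDiffPit_m
        if diff >= mx:             # comparison of S9121
            mx = diff
        if diff <= mn:             # comparison of S9171
            mn = diff
    return mx, mn


# Hypothetical pitch values in Hz; the voiced-frame mean is 120.0.
pitch = [90.0, 120.0, None, 130.0, 140.0]
mx_diff_pit, mn_diff_pit = signed_extremes(pitch, mean=120.0)
print(mx_diff_pit, mn_diff_pit)
```

Starting both searches at 0 works when `mean` is the voiced-frame average of the same sequence: the signed deviations then sum to zero, so the largest one cannot be negative and the smallest one cannot be positive.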

続いて、図１５のように、編集用録音総合最大値MaxDiffPitを算出する。ＣＰＵ５２１は、MaxDiffPitを0等の十分小さな値に設定するとともに、音声データ識別用カウンタmをm=1に設定する（ステップＳ９２１１）。 Subsequently, as shown in FIG. 15, the editing maximum recording maximum value MaxDiffPit is calculated. The CPU 521 sets MaxDiffPit to a sufficiently small value such as 0, and sets the audio data identification counter m to m = 1 (step S9211).

ＣＰＵ５２１は、図１３に示す手順により求められた編集用録音音声別最大値mxDiffPit_mを記憶部５２５からロードし（ステップＳ９２１３）、TmpMaxDiffPit= mxDiffPit_mとし（ステップＳ９２１５）、TmpMaxDiffPit≧MaxDiffPitであるか否かを判別する（ステップＳ９２１７）。 The CPU 521 loads the maximum value mxDiffPit _m for each recording sound for editing obtained by the procedure shown in FIG. 13 from the storage unit 525 (step S9213), sets TmpMaxDiffPit = mxDiffPit _m (step S9215), and whether TmpMaxDiffPit ≧ MaxDiffPit. Is determined (step S9217).

TmpMaxDiffPit≧MaxDiffPitではないと判別された場合（ステップＳ９２１７；Ｎｏ）は、すぐにステップＳ９２２１に進み、TmpMaxDiffPit≧MaxDiffPitであると判別された場合（ステップＳ９２１７；Ｙｅｓ）は、MaxDiffPitをMaxDiffPit =TmpMaxDiffPitのように更新してから（ステップＳ９２１９）、ステップＳ９２２１に進む。 When it is determined that TmpMaxDiffPit ≧ MaxDiffPit is not satisfied (step S9217; No), the process immediately proceeds to step S9221, and when it is determined that TmpMaxDiffPit ≧ MaxDiffPit (step S9217; Yes), MaxDiffPit is expressed as MaxDiffPit = TmpMaxDiffPit. (Step S9219), the process proceeds to step S9221.

ステップＳ９２２１では、ＣＰＵ５２１は、m≧N_Spであるか否かを判別する。m≧N_Spではないと判別された場合（ステップＳ９２２１；Ｎｏ）は、mを1増加してから（ステップＳ９２２３）、ステップＳ９２１３に戻る。一方、m≧N_Spであると判別された場合（ステップＳ９２２１；Ｙｅｓ）は、ＣＰＵ５２１は、MaxDiffPitを記憶部５２５に格納してから（ステップＳ９２２５）、処理を終了する。 In step S9221, the CPU 521 determines whether m ≧ N _Sp is satisfied. If it is determined that m ≧ N _Sp is not satisfied (step S9221; No), m is increased by 1 (step S9223), and the process returns to step S9213. On the other hand, if it is determined that m ≧ N _Sp (step S9221; Yes), the CPU 521 stores MaxDiffPit in the storage unit 525 (step S9225) and ends the process.

続いて、図１６のように、編集用録音総合最小値MinDiffPitを算出する。ＣＰＵ５２１は、MinDiffPitを0等の十分大きな値に設定するとともに、音声データ識別用カウンタmをm=1に設定する（ステップＳ９２６１）。 Subsequently, as shown in FIG. 16, the editing total recording minimum value MinDiffPit is calculated. The CPU 521 sets MinDiffPit to a sufficiently large value such as 0, and sets the audio data identification counter m to m = 1 (step S9261).

なお、編集用録音総合最小値MinDiffPitは、この後の手順から明らかなように、最終的には0以下の値になる。よって、上述のようにステップＳ９２６１では、MinDiffPitの初期値を例えば0とすれば、MinDiffPitの初期値を十分大きな値に設定したといえる。 It should be noted that the total recording minimum value MinDiffPit for editing finally becomes a value of 0 or less, as is apparent from the subsequent procedure. Therefore, as described above, in step S9261, if the initial value of MinDiffPit is set to 0, for example, it can be said that the initial value of MinDiffPit is set to a sufficiently large value.

The CPU 521 loads the per-speech recorded minimum for editing mnDiffPit_m, obtained by the procedure shown in FIG. 14, from the storage unit 525 (step S9263), sets TmpMinDiffPit = mnDiffPit_m (step S9265), and determines whether TmpMinDiffPit ≦ MinDiffPit (step S9267).

If it is determined that TmpMinDiffPit ≦ MinDiffPit does not hold (step S9267; No), the process proceeds directly to step S9271. If it is determined that TmpMinDiffPit ≦ MinDiffPit holds (step S9267; Yes), MinDiffPit is updated as MinDiffPit = TmpMinDiffPit (step S9269), and the process then proceeds to step S9271.

In step S9271, the CPU 521 determines whether m ≧ N_Sp. If it is determined that m ≧ N_Sp does not hold (step S9271; No), m is incremented by 1 (step S9273) and the process returns to step S9263. If it is determined that m ≧ N_Sp (step S9271; Yes), the CPU 521 stores MinDiffPit in the storage unit 525 (step S9275) and ends the process.
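The aggregation loops of FIGS. 15 and 16 can be sketched as follows. This is only an illustrative sketch: the function name and the list-based inputs are hypothetical, and the initial value 0 for MaxDiffPit is an assumption (only the initial value of MinDiffPit is stated in the text above).

```python
def overall_extremes(mx_diff_pit, mn_diff_pit):
    """Aggregate per-speech extremes into (MaxDiffPit, MinDiffPit).

    mx_diff_pit[m] and mn_diff_pit[m] stand for mxDiffPit_m and mnDiffPit_m
    (hypothetical list representation, one entry per speech data item).
    """
    max_diff_pit = 0.0  # assumed initial value; the final maximum is >= 0
    min_diff_pit = 0.0  # "sufficiently large", since the final minimum is <= 0
    for tmp_max, tmp_min in zip(mx_diff_pit, mn_diff_pit):
        if tmp_max >= max_diff_pit:   # step S9217
            max_diff_pit = tmp_max    # step S9219
        if tmp_min <= min_diff_pit:   # step S9267
            min_diff_pit = tmp_min    # step S9269
    return max_diff_pit, min_diff_pit


result = overall_extremes([12.0, 30.5, 8.0], [-20.0, -5.0, -41.5])
```

Starting MinDiffPit at 0 works only because every deviation-based minimum is expected to be non-positive; a general-purpose minimum search would start from positive infinity instead.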

Next, the various editing coefficients for the synthesized speech are obtained. FIGS. 17 to 20 are flowcharts of these procedures, which closely parallel the procedures shown in FIGS. 13 to 16 for the original speech. To keep the description from becoming unwieldy, a detailed description of each step is omitted below and only the main points are noted.

Following the procedure shown in FIG. 17, the per-speech synthesized maximum for editing mxDiffSynPit_m is calculated (steps S9311 to S9333). This procedure closely resembles the procedure of FIG. 13 for calculating the per-speech recorded maximum for editing mxDiffPit_m. The main differences are that the synthesized speech pitch series data SynPit_m[fm] is handled instead of the pitch series data Pit_m[fm] (steps S9317 and S9319), that the synthesized average for editing AveSynPit_m obtained by the procedure of FIG. 7 is used instead of the recorded average for editing AvePit_m obtained by the procedure of FIG. 6 (steps S9317 and S9319), and that M_fm[m] is used instead of N_fm[m] as the upper limit of the frame number for the m-th speech data (step S9325).

Following the procedure shown in FIG. 18, the per-speech synthesized minimum for editing mnDiffSynPit_m is calculated (steps S9361 to S9383). This procedure closely resembles the procedure of FIG. 14 for calculating the per-speech recorded minimum for editing mnDiffPit_m. The main differences are that the synthesized speech pitch series data SynPit_m[fm] is handled instead of the pitch series data Pit_m[fm] (steps S9367 and S9369), that the synthesized average for editing AveSynPit_m obtained by the procedure of FIG. 7 is used instead of the recorded average for editing AvePit_m obtained by the procedure of FIG. 6 (steps S9367 and S9369), and that M_fm[m] is used instead of N_fm[m] as the upper limit of the frame number for the m-th speech data (step S9375).

Following the procedure shown in FIG. 19, the overall synthesized maximum for editing MaxDiffSynPit is calculated (steps S9411 to S9425). This procedure closely resembles the procedure of FIG. 15 for calculating the overall recorded maximum for editing MaxDiffPit. The main difference is that the per-speech synthesized maximum for editing mxDiffSynPit_m obtained by the procedure of FIG. 17 is used instead of the per-speech recorded maximum for editing mxDiffPit_m obtained by the procedure of FIG. 13 (steps S9413 and S9415).

Following the procedure shown in FIG. 20, the overall synthesized minimum for editing MinDiffSynPit is calculated (steps S9461 to S9475). This procedure closely resembles the procedure of FIG. 16 for calculating the overall recorded minimum for editing MinDiffPit. The main difference is that the per-speech synthesized minimum for editing mnDiffSynPit_m obtained by the procedure of FIG. 18 is used instead of the per-speech recorded minimum for editing mnDiffPit_m obtained by the procedure of FIG. 14 (steps S9463 and S9465).
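Because the recorded and synthesized procedures differ only in their inputs, a single routine can serve both. The sketch below rests on assumptions that the text above does not state: the per-speech extremes are taken over voiced frames only, unvoiced frames are marked by a pitch value of 0, and both extremes start from 0. The function name is hypothetical.

```python
def per_speech_extremes(pitch, average):
    """Return (max, min) of pitch[fm] - average over voiced frames.

    Pass Pit_m and AvePit_m for recorded speech (FIGS. 13 and 14), or
    SynPit_m and AveSynPit_m for synthesized speech (FIGS. 17 and 18).
    """
    mx = 0.0  # assumed initial value
    mn = 0.0  # assumed initial value
    for p in pitch:
        if p <= 0.0:  # assumption: 0 marks an unvoiced frame
            continue
        diff = p - average
        mx = max(mx, diff)
        mn = min(mn, diff)
    return mx, mn


# Recorded and synthesized speech differ only in the arrays passed in.
recorded = per_speech_extremes([0.0, 110.0, 95.0, 0.0, 102.0], 100.0)
synthesized = per_speech_extremes([0.0, 104.0, 98.0, 101.0], 100.0)
```

The number of frames may differ between the two calls, which corresponds to the use of M_fm[m] rather than N_fm[m] as the frame limit for synthesized speech.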

FIG. 21 is a flowchart showing the procedure that finally obtains the edited pitch series data EdPit_m[fm] in this modification. The edited pitch series data in this modification is referred to as the first modified edited pitch series data.

To obtain the first modified edited pitch series data EdPit_m[fm], the CPU 521 first loads from the storage unit 525 the overall recorded maximum for editing MaxDiffPit obtained by the procedure of FIG. 15, the overall synthesized maximum for editing MaxDiffSynPit obtained by the procedure of FIG. 19, the overall recorded minimum for editing MinDiffPit obtained by the procedure of FIG. 16, and the overall synthesized minimum for editing MinDiffSynPit obtained by the procedure of FIG. 20 (step S9511).

Next, the CPU 521 sets the speech data identification counter m to m = 1 (step S9513) and sets the frame identification counter fm to fm = 0 (step S9515).

The CPU 521 loads the pitch series data Pit_m[fm] (step S9517) and determines whether the frame fm is a voiced frame (step S9519).

If it is determined that the frame fm is a voiced frame (step S9519; Yes), the CPU 521 loads the recorded average for editing AvePit_m, obtained by the procedure shown in FIG. 6, from the storage unit 525 (step S9521). The CPU 521 then determines whether Pit_m[fm] - AvePit_m ≧ 0 (step S9523).

If it is determined that Pit_m[fm] - AvePit_m ≧ 0 (step S9523; Yes), the first modified edited pitch series data EdPit_m[fm] is calculated by
EdPit_m[fm] = (Pit_m[fm] - AvePit_m) × (MaxDiffPit / MaxDiffSynPit) + AvePit_m
(step S9525) and stored in the storage unit 525 (step S9531).

If it is determined that Pit_m[fm] - AvePit_m ≧ 0 does not hold (step S9523; No), the first modified edited pitch series data EdPit_m[fm] is calculated by
EdPit_m[fm] = (Pit_m[fm] - AvePit_m) × (MinDiffPit / MinDiffSynPit) + AvePit_m
(step S9527) and stored in the storage unit 525 (step S9531).

If it is determined in step S9519 that the frame fm is not a voiced frame (step S9519; No), the CPU 521 uses Pit_m[fm] unchanged as the first modified edited pitch series data EdPit_m[fm] (step S9529) and stores it in the storage unit 525 (step S9531).

The CPU 521 determines whether fm ≧ N_fm[m] (step S9533). If it is determined that fm ≧ N_fm[m] does not hold (step S9533; No), fm is incremented by 1 (step S9535) and the process returns to step S9517. If it is determined that fm ≧ N_fm[m] (step S9533; Yes), the process proceeds to step S9537.

In step S9537, the CPU 521 determines whether m ≧ N_Sp. If it is determined that m ≧ N_Sp does not hold (step S9537; No), m is incremented by 1 (step S9539) and the process returns to step S9515. If it is determined that m ≧ N_Sp (step S9537; Yes), the process ends.
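The per-frame branch of FIG. 21 can be sketched for a single speech data item as follows. The function name is hypothetical, and treating a pitch value of 0 as an unvoiced frame is an assumption made only for this illustration.

```python
def first_modified_edit(pit, ave_pit, max_diff_pit, max_diff_syn_pit,
                        min_diff_pit, min_diff_syn_pit):
    """Expand pitch around the average with separate up/down multipliers."""
    up = max_diff_pit / max_diff_syn_pit      # used when Pit - AvePit >= 0
    down = min_diff_pit / min_diff_syn_pit    # used when Pit - AvePit < 0
    edited = []
    for p in pit:
        if p <= 0.0:                          # assumption: unvoiced frame
            edited.append(p)                  # step S9529: copy unchanged
        elif p - ave_pit >= 0:                # step S9523; Yes
            edited.append((p - ave_pit) * up + ave_pit)    # step S9525
        else:                                 # step S9523; No
            edited.append((p - ave_pit) * down + ave_pit)  # step S9527
    return edited


edited = first_modified_edit([0.0, 110.0, 90.0, 100.0], 100.0,
                             30.0, 15.0, -30.0, -20.0)
# edited == [0.0, 120.0, 85.0, 100.0]
```

In the full procedure the same four overall values are reused for every speech data item m, while AvePit_m changes per item.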

In specific example 1 of editing, the single value (MaxAbsDiffPit / MaxAbsDiffSynPit) was used uniformly as the multiplier for enlarging the pitch fluctuation. In this modification, by contrast, as shown in steps S9523 to S9527 of FIG. 21, different multipliers are used depending on whether the pitch series data is above or below its average value. Although this makes the calculation slightly more complicated, it is expected to enlarge the pitch fluctuation more appropriately.

For the multiplier used when the data is above the average, the simplest and most reliable choice is considered to be MaxDiffPit, the maximum upward deviation from the pitch average found by scanning all of the original speech data, divided by MaxDiffSynPit, the corresponding value found by scanning all of the synthesized speech data, which is almost certainly smaller. Likewise, for the multiplier used when the data is below the average, the simplest and most reliable choice is considered to be MinDiffPit, the maximum downward deviation from the pitch average found by scanning all of the original speech data, divided by MinDiffSynPit, the corresponding value found by scanning all of the synthesized speech data, which is almost certainly smaller in absolute value. The procedures shown in FIGS. 13 to 21 put these considerations into concrete processing steps.

Note that MinDiffPit and MinDiffSynPit are both almost certainly negative during the calculation, but what is used as the multiplier is the quotient (MinDiffPit ÷ MinDiffSynPit), which is almost certainly positive.

Also, as already mentioned, conventional synthesized speech has smaller pitch fluctuation than the original speech, so the two multipliers (MaxDiffPit / MaxDiffSynPit) and (MinDiffPit / MinDiffSynPit) are both almost certainly greater than 1 and are therefore suitable as multipliers for enlarging the pitch fluctuation.
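A small numeric sketch of the sign and magnitude argument follows. The values are made up for illustration, the function name is hypothetical, and the zero-divisor check is an added safeguard that the text above does not describe.

```python
def expansion_multipliers(max_diff_pit, max_diff_syn_pit,
                          min_diff_pit, min_diff_syn_pit):
    """Return the (upward, downward) multipliers used in FIG. 21."""
    if max_diff_syn_pit == 0 or min_diff_syn_pit == 0:
        # Added safeguard: the text only says these are "almost certainly" non-zero.
        raise ValueError("synthesized extremes must be non-zero")
    return max_diff_pit / max_diff_syn_pit, min_diff_pit / min_diff_syn_pit


# Illustrative values: recorded speech deviates more than synthesized speech.
up, down = expansion_multipliers(30.0, 15.0, -45.0, -30.0)
# up == 2.0; down == 1.5 (negative divided by negative is positive)
```

Both multipliers exceed 1 here because the recorded extremes are larger in magnitude than the synthesized ones, which is the situation the text describes as almost certain.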

(Specific example 2 of editing)
FIG. 22 is a flowchart showing the procedure for calculating the second edited pitch series data EdPit_m[fm], which is the edited pitch series data in this specific example of editing. The description below refers to this flowchart.

The CPU 521 sets the speech data identification counter m to m = 1 (step S511), loads the per-speech recorded maximum absolute value for editing mxAbsDiffPit_m and the per-speech synthesized maximum absolute value for editing mxAbsDiffSynPit_m from the storage unit 525 (step S513), sets the frame identification counter fm to fm = 0 (step S515), loads the pitch series data Pit_m[fm] from the storage unit 525 (step S517), and determines whether the frame fm is a voiced frame (step S519).

Here, the per-speech recorded maximum absolute value for editing mxAbsDiffPit_m and the per-speech synthesized maximum absolute value for editing mxAbsDiffSynPit_m are obtained by the procedures already described with reference to FIG. 8 and FIG. 9, respectively.

If it is determined that the frame fm is a voiced frame (step S519; Yes), the CPU 521 loads the recorded average for editing AvePit_m obtained by the procedure shown in FIG. 6 (step S521), calculates the edited pitch series data EdPit_m[fm] by
EdPit_m[fm] = (Pit_m[fm] - AvePit_m) × (mxAbsDiffPit_m / mxAbsDiffSynPit_m) + AvePit_m
(step S523), and stores it in the storage unit 525 (step S527).

If it is determined that the frame fm is not a voiced frame (step S519; No), the CPU 521 uses the pitch series data Pit_m[fm] unchanged as the edited pitch series data EdPit_m[fm] (step S525) and stores it in the storage unit 525 (step S527).

In step S529, the CPU 521 determines whether fm ≧ N_fm[m]. If it is determined that fm ≧ N_fm[m] does not hold (step S529; No), fm is incremented by 1 (step S531) and the process returns to step S517. If it is determined that fm ≧ N_fm[m] (step S529; Yes), the process proceeds to step S533.

In step S533, the CPU 521 determines whether m ≧ N_Sp. If it is determined that m ≧ N_Sp does not hold (step S533; No), m is incremented by 1 (step S535) and the process returns to step S513. If it is determined that m ≧ N_Sp (step S533; Yes), the process ends.
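The double loop of FIG. 22 can be sketched as follows. The function name and the nested-list layout are hypothetical, and treating a pitch value of 0 as an unvoiced frame is an assumption made for illustration.

```python
def second_edit(pit_by_speech, ave_pit, mx_abs_diff_pit, mx_abs_diff_syn_pit):
    """Expand each speech data item with its own coefficient."""
    edited = []
    for m, pit in enumerate(pit_by_speech):                  # loop over m
        coeff = mx_abs_diff_pit[m] / mx_abs_diff_syn_pit[m]  # per-speech value
        row = []
        for p in pit:                                        # loop over fm
            if p <= 0.0:                                     # assumption: unvoiced
                row.append(p)                                # step S525
            else:
                row.append((p - ave_pit[m]) * coeff + ave_pit[m])  # step S523
        edited.append(row)
    return edited


edited = second_edit([[0.0, 110.0, 90.0], [200.0, 220.0]],
                     [100.0, 210.0], [20.0, 30.0], [10.0, 20.0])
# edited == [[0.0, 120.0, 80.0], [195.0, 225.0]]
```

Each inner loop touches only values indexed by the same m, so the items could be processed independently of one another.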

In this specific example of editing, as in specific example 1, the pitch fluctuation is increased by stretching the pitch series data upward and downward about its average value. To do so, the difference between the pitch series data and its average value is multiplied by a coefficient. Specific example 1, however, uniformly adopts as this coefficient (MaxAbsDiffPit / MaxAbsDiffSynPit), a value that does not depend on the speech data (that is, on the speech data identification counter m), whereas this specific example adopts (mxAbsDiffPit_m / mxAbsDiffSynPit_m), a value obtained for each speech data item (that is, a value that depends on m). In other words, when a given piece of pitch series data is edited to generate edited pitch series data in this specific example, only the block of pitch series data belonging to the same speech data item as the frame in question is referenced. Put differently, the editing process is completed separately for each speech data item.

This approach involves more coefficients than specific example 1 and is therefore more cumbersome, but because the editing coefficient varies with the characteristics of each speech data item, more appropriate editing is achieved. In specific example 1, for instance, the presence of even a single anomalous speech data item affects the coefficient, so that the pitch series data of all speech data items may be edited inappropriately. In this specific example, such anomalous data does not adversely affect the editing process as a whole.
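The effect of a single anomalous item can be illustrated with made-up numbers. The sketch assumes that the overall values MaxAbsDiffPit and MaxAbsDiffSynPit of specific example 1 are the maxima of the per-speech values, by analogy with FIGS. 15 and 19; the function names are hypothetical.

```python
def global_coefficient(mx_abs_diff_pit, mx_abs_diff_syn_pit):
    """Specific example 1: one coefficient shared by all speech data items."""
    return max(mx_abs_diff_pit) / max(mx_abs_diff_syn_pit)


def per_speech_coefficients(mx_abs_diff_pit, mx_abs_diff_syn_pit):
    """Specific example 2: one coefficient per speech data item."""
    return [r / s for r, s in zip(mx_abs_diff_pit, mx_abs_diff_syn_pit)]


recorded = [20.0, 22.0, 90.0]     # the third item is an outlier
synthesized = [10.0, 11.0, 12.0]

shared = global_coefficient(recorded, synthesized)            # 7.5
separate = per_speech_coefficients(recorded, synthesized)     # [2.0, 2.0, 7.5]
```

With the shared coefficient, the two ordinary items would be stretched by 7.5 instead of 2.0; with per-speech coefficients only the outlier itself is affected.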

(Modification of specific example 2 of editing)
FIG. 23 is a flowchart showing the procedure for calculating the second modified edited pitch series data EdPit_m[fm], which is the edited pitch series data in the modification of specific example 2 of editing. The description below refers to this flowchart.

The CPU 521 sets the speech data identification counter m to m = 1 (step S9611), loads the per-speech recorded maximum for editing mxDiffPit_m, the per-speech synthesized maximum for editing mxDiffSynPit_m, the per-speech recorded minimum for editing mnDiffPit_m, and the per-speech synthesized minimum for editing mnDiffSynPit_m from the storage unit 525 (step S9613), sets the frame identification counter fm to fm = 0 (step S9615), loads the pitch series data Pit_m[fm] from the storage unit 525 (step S9617), and determines whether the frame fm is a voiced frame (step S9619).

Here, mxDiffPit_m, mxDiffSynPit_m, mnDiffPit_m, and mnDiffSynPit_m are obtained by the procedures already described with reference to FIG. 13, FIG. 17, FIG. 14, and FIG. 18, respectively.

If it is determined that the frame fm is a voiced frame (step S9619; Yes), the CPU 521 loads the recorded average for editing AvePit_m obtained by the procedure shown in FIG. 6 (step S9621) and determines whether Pit_m[fm] - AvePit_m ≧ 0 (step S9623).

If it is determined that Pit_m[fm] - AvePit_m ≧ 0 (step S9623; Yes), the CPU 521 calculates the edited pitch series data EdPit_m[fm] by
EdPit_m[fm] = (Pit_m[fm] - AvePit_m) × (mxDiffPit_m / mxDiffSynPit_m) + AvePit_m
(step S9625) and stores it in the storage unit 525 (step S9631). If it is determined that Pit_m[fm] - AvePit_m ≧ 0 does not hold (step S9623; No), the CPU 521 calculates the edited pitch series data EdPit_m[fm] by
EdPit_m[fm] = (Pit_m[fm] - AvePit_m) × (mnDiffPit_m / mnDiffSynPit_m) + AvePit_m
(step S9627) and stores it in the storage unit 525 (step S9631).

If it is determined that the frame fm is not a voiced frame (step S9619; No), the CPU 521 uses the pitch series data Pit_m[fm] unchanged as the edited pitch series data EdPit_m[fm] (step S9629) and stores it in the storage unit 525 (step S9631).

In step S9633, the CPU 521 determines whether fm ≧ N_fm[m]. If it is determined that fm ≧ N_fm[m] does not hold (step S9633; No), fm is incremented by 1 (step S9635) and the process returns to step S9617. If it is determined that fm ≧ N_fm[m] (step S9633; Yes), the process proceeds to step S9637.

In step S9637, the CPU 521 determines whether m ≧ N_Sp. If it is determined that m ≧ N_Sp does not hold (step S9637; No), m is incremented by 1 (step S9639) and the process returns to step S9613. If it is determined that m ≧ N_Sp (step S9637; Yes), the process ends.
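The procedure of FIG. 23 combines per-speech coefficients with the above/below-average branch. A minimal sketch, with a hypothetical function name and the assumption that a pitch value of 0 marks an unvoiced frame:

```python
def second_modified_edit(pit_by_speech, ave_pit, mx_diff_pit, mx_diff_syn_pit,
                         mn_diff_pit, mn_diff_syn_pit):
    """Per-speech expansion with separate upward and downward multipliers."""
    edited = []
    for m, pit in enumerate(pit_by_speech):
        up = mx_diff_pit[m] / mx_diff_syn_pit[m]      # multiplier of step S9625
        down = mn_diff_pit[m] / mn_diff_syn_pit[m]    # multiplier of step S9627
        row = []
        for p in pit:
            if p <= 0.0:                              # assumption: unvoiced
                row.append(p)                         # step S9629
                continue
            diff = p - ave_pit[m]
            mult = up if diff >= 0 else down          # step S9623
            row.append(diff * mult + ave_pit[m])
        edited.append(row)
    return edited


edited = second_modified_edit([[0.0, 110.0, 90.0]], [100.0],
                              [20.0], [10.0], [-30.0], [-10.0])
# edited == [[0.0, 120.0, 70.0]]
```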

This modification combines two advantages: that of the modification of specific example 1 of editing, in which the coefficient for enlarging the pitch fluctuation changes according to whether the pitch series data exceeds its average value, and that of specific example 2 of editing, in which the processing is completed separately for each speech data item.

(Embodiment 2)
The speech synthesis dictionary construction device according to Embodiment 2 of the present invention is obtained by replacing the second learning unit 117 shown in FIG. 4, among the functional blocks of the speech synthesis dictionary construction device according to Embodiment 1 shown in FIGS. 2 to 4, with the second learning unit 119 with phoneme label data comparison function shown in FIG. 24.

The second learning unit 119 with phoneme label data comparison function (FIG. 24) is the second learning unit 117 (FIG. 4) with a monophone phoneme label data generation unit 331 added to it.

The monophone phoneme label data generation unit 331 receives the synthesized speech data SynSp_m output from the synthesis unit 113 (FIG. 3), and generates and outputs synthesized speech monophone label data mLabData_m[ml], which is the monophone label data of the synthesized speech (1 ≦ ml ≦ ML_SynSp[m], where ML_SynSp[m] is the number of monophone labels in the synthesized speech SynSp_m).

The synthesized speech monophone label data mLabData_m[ml] consists of a synthesized speech monophone label mLab_m[ml], a synthesized speech start frame mFrameS_m[ml], which is a pointer to the time within the duration of the synthesized speech data SynSp_m that corresponds to the start of that monophone label, and a synthesized speech end frame mFrameE_m[ml], which is a pointer to the time that corresponds to its end.
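One possible in-memory representation of a single label data entry is sketched below. The class and field names are hypothetical, and the inclusive frame count is an assumption suggested by the inclusive range test used in step S615.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonophoneLabelData:
    """One entry mLabData_m[ml]: a label plus its start and end frames."""
    label: str         # mLab_m[ml], e.g. a phoneme symbol
    start_frame: int   # mFrameS_m[ml]
    end_frame: int     # mFrameE_m[ml]

    @property
    def n_frames(self) -> int:
        # Assumption: both boundary frames belong to the label.
        return self.end_frame - self.start_frame + 1


entry = MonophoneLabelData(label="a", start_frame=0, end_frame=11)
```

The same structure would fit the recorded-speech monophone label data MLabData_m[ml] with its MFrameS_m[ml] and MFrameE_m[ml] fields.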

In addition, whereas the policy determination unit 343 in the second learning unit 117 collects the pitch series data Pit_m[fm] and the synthesized speech pitch series data SynPit_m[fm], the policy determination unit 343 in the second learning unit 119 with phoneme label data comparison function additionally collects the monophone label data MLabData_m[ml] stored in the first speech database 221 (FIG. 2) and the synthesized speech monophone label data mLabData_m[ml] output from the monophone phoneme label data generation unit 331.

Unlike the policy determination unit 343 in the second learning unit 117, the policy determination unit 343 in the second learning unit 119 with phoneme label data comparison function does not determine the editing policy for the pitch series data Pit_m[fm] from Pit_m[fm] and the synthesized speech pitch series data SynPit_m[fm] alone. It determines the editing policy by comprehensively comparing these together with the monophone label data MLabData_m[ml] and the synthesized speech monophone label data mLabData_m[ml].

That is, the policy determination unit 343 in the second learning unit 119 with phoneme label data comparison function receives the monophone label data MLabData_m[ml] and the pitch series data Pit_m[fm], which are generated from speech data collected from natural human utterances, together with the synthesized speech monophone label data mLabData_m[ml] and the synthesized speech pitch series data SynPit_m[fm], which are generated from synthesized speech data produced once via a speech synthesis dictionary. Because the policy determination unit 343 collects these four kinds of data, it can compare and examine them.

The editing process executed by the editing unit 345 in the second learning unit 119 with phoneme label data comparison function, in accordance with the editing policy determined by the policy determination unit 343 in the same learning unit, is described below with reference to the flowchart shown in FIG. 25. In FIG. 25, the edited pitch series data of this embodiment is denoted as the third edited pitch series data EdPit_m[fm].

As explained in the discussion of the specific examples of editing in Embodiment 1, at least qualitatively, the pitch fluctuation of the speech data Sp_m increases if the edited pitch series data EdPit_m[fm] is obtained by enlarging the difference between each piece of pitch series data Pit_m[fm] and its average value. However, the pitch fluctuation of the speech data Sp_m also increases, without any particular reference to that average value, if the edited pitch series data EdPit_m[fm] is obtained by scaling up the pitch series data Pit_m[fm] as a whole. In other words, the pitch fluctuation can also be increased by scaling the whole series relative to the zero level instead of relative to the average value. This embodiment adopts, as its editing policy, the policy of scaling up the pitch series data Pit_m[fm] as a whole.

That said, this embodiment may also adopt an editing policy similar to that of the specific examples of editing in Embodiment 1. For example, in step S619 described later, when the frame fm is voiced,
EdPit_m[fm] = (Pit_m[fm] - AvePit_m) × (AvePitLab_m[ml] / AveSynPitLab_m[ml]) + AvePit_m
may be used, and when the frame fm is unvoiced,
EdPit_m[fm] = Pit_m[fm]
may be used.

The speech data Sp_m, the monophone label data MLabData_m[ml], the triphone label data TLabData_m[tl], the pitch series data Pit_m[fm], the mel-cepstrum coefficient series data MC_m^d[fm], the synthesized speech pitch series data SynPit_m[fm], and the synthesized speech monophone label data mLabData_m[ml] are assumed to have already been obtained and stored in the storage unit 525.

First, the CPU 521 sets the speech data identification counter m to m = 1 (step S611) and sets the frame identification counter fm to fm = 0 (step S613).

Next, the CPU 521 determines which monophone label data MLabData_m[ml] frame fm corresponds to. Specifically, the CPU 521 searches the storage unit 525 for the monophone number ml' that satisfies MFrameS_m[ml'] ≤ fm ≤ MFrameE_m[ml'], and sets the monophone label data identification counter ml to ml = ml' (step S615).
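The lookup in step S615 amounts to finding the label whose inclusive frame range contains fm. A minimal sketch follows; the function name, the 0-based label index, and the example boundaries are assumptions made for illustration, and a linear scan is used only for clarity.

```python
def find_monophone(fm, frame_start, frame_end):
    """Return the index ml with frame_start[ml] <= fm <= frame_end[ml]."""
    for ml, (start, end) in enumerate(zip(frame_start, frame_end)):
        if start <= fm <= end:
            return ml
    raise ValueError(f"frame {fm} is not covered by any monophone label")


# Hypothetical boundaries for three monophone labels of one utterance.
frame_start = [0, 5, 12]   # plays the role of MFrameS_m[ml]
frame_end = [4, 11, 20]    # plays the role of MFrameE_m[ml]

print(find_monophone(7, frame_start, frame_end))
```

Because the ranges are sorted and non-overlapping in this example, a binary search would also work; the text itself does not say how the search is performed.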

Subsequently, the CPU 521 calculates AvePitLab_m[ml] and AveSynPitLab_m[ml], the averages of the pitch sequence data over that monophone for the original speech and the synthesized speech, respectively (step S617). The specific procedure for this calculation is described later with reference to FIG. 26.

Subsequently, the CPU 521 calculates the third edited pitch sequence data EdPit_m[fm], which is the edited pitch sequence data of this embodiment, as
EdPit_m[fm] = Pit_m[fm] × (AvePitLab_m[ml] / AveSynPitLab_m[ml])
(step S619), and stores it in the storage unit 525 (step S621).

Subsequently, the CPU 521 determines whether fm ≥ N_fm[m] (step S623). If fm ≥ N_fm[m] does not hold (step S623; No), fm is incremented by 1 (step S625) and processing returns to step S615. If fm ≥ N_fm[m] holds (step S623; Yes), processing proceeds to step S627.

In step S627, the CPU 521 determines whether m ≥ N_Sp. If m ≥ N_Sp does not hold (step S627; No), m is incremented by 1 (step S629) and processing returns to step S613. If m ≥ N_Sp holds (step S627; Yes), processing ends.
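The control flow of steps S611 to S629 can be sketched as two nested loops whose exit tests run after the body. This is a literal reading of the flowchart as described, not a definitive implementation: `run_edit_loop`, the `edit_frame` callback (standing in for steps S615 to S621), and the dictionary `n_fm` keyed by m are hypothetical. Under this reading, frames 0 through N_fm[m] inclusive are visited, so N_fm[m] behaves as the last frame index.

```python
def run_edit_loop(n_sp, n_fm, edit_frame):
    """Visit every (m, fm) pair in the order given by steps S611 to S629."""
    m = 1                          # S611
    while True:
        fm = 0                     # S613
        while True:
            edit_frame(m, fm)      # S615 to S621: look up ml, average, edit, store
            if fm >= n_fm[m]:      # S623
                break
            fm += 1                # S625
        if m >= n_sp:              # S627
            break
        m += 1                     # S629


visited = []
run_edit_loop(2, {1: 2, 2: 1}, lambda m, fm: visited.append((m, fm)))
print(visited)
```

Note that in this flow the per-monophone averages of step S617 are recomputed for every frame; caching them per monophone would give the same result with less work.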

The procedure for calculating AvePitLab_m[ml] and AveSynPitLab_m[ml] in step S617 above is as shown in FIG. 26.

The CPU 521 loads the start frame MFrameS_m[ml], the end frame MFrameE_m[ml], the synthesized speech start frame mFrameS_m[ml], and the synthesized speech end frame mFrameE_m[ml] from the storage unit 525 (step S661). It then loads the pitch sequence data Pit_m[MFrameS_m[ml]], Pit_m[MFrameS_m[ml]+1], ..., Pit_m[MFrameE_m[ml]-1], Pit_m[MFrameE_m[ml]] and the synthesized speech pitch sequence data SynPit_m[mFrameS_m[ml]], SynPit_m[mFrameS_m[ml]+1], ..., SynPit_m[mFrameE_m[ml]-1], SynPit_m[mFrameE_m[ml]] (step S663), and calculates AvePitLab_m[ml] and AveSynPitLab_m[ml] according to the following equations (step S665).
AvePitLab_m[ml]
= (Pit_m[MFrameS_m[ml]] + Pit_m[MFrameS_m[ml]+1] + ...
+ Pit_m[MFrameE_m[ml]-1] + Pit_m[MFrameE_m[ml]])
÷ (MFrameE_m[ml] - MFrameS_m[ml] + 1),
AveSynPitLab_m[ml]
= (SynPit_m[mFrameS_m[ml]] + SynPit_m[mFrameS_m[ml]+1] + ...
+ SynPit_m[mFrameE_m[ml]-1] + SynPit_m[mFrameE_m[ml]])
÷ (mFrameE_m[ml] - mFrameS_m[ml] + 1)
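The two averages of step S665 are plain means over inclusive frame ranges, one range in the recorded speech and one in the synthesized speech. A minimal sketch follows, with a hypothetical function name and made-up pitch values; the last line applies the multiplication of step S619 to the frames of that monophone.

```python
def monophone_averages(pit, syn_pit, start, end, syn_start, syn_end):
    """Return (AvePitLab, AveSynPitLab) for one monophone.

    start/end         -- inclusive frame range in the recorded speech
    syn_start/syn_end -- inclusive frame range in the synthesized speech
    """
    ave_pit_lab = sum(pit[start:end + 1]) / (end - start + 1)
    ave_syn_pit_lab = sum(syn_pit[syn_start:syn_end + 1]) / (syn_end - syn_start + 1)
    return ave_pit_lab, ave_syn_pit_lab


# Hypothetical data: the same monophone covers frames 0..2 of the recording
# and frames 1..3 of the synthesized speech.
pit = [210.0, 220.0, 230.0, 150.0]
syn_pit = [0.0, 198.0, 202.0, 200.0, 160.0]

ave, ave_syn = monophone_averages(pit, syn_pit, 0, 2, 1, 3)
edited = [p * ave / ave_syn for p in pit[0:3]]   # step S619 for these frames
print(ave, ave_syn, edited)
```

The sketch does not guard against a zero synthesized average (for example, an entirely unvoiced monophone); the text does not say how that case is handled.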

As stated above, AvePitLab_m[ml] and AveSynPitLab_m[ml] are the averages of the pitch sequence data taken per monophone for the original speech and the synthesized speech, respectively. This is clear from the equations above.

A monophone is a smaller unit than a speech data item, and the time span corresponding to each monophone is short. As a result, each such span generally contains only a pitch peak or only a pitch valley.

As already noted, synthesized speech is generally known to sound flat and unnatural compared with the original speech. This corresponds to pitch peaks becoming lower and pitch valleys becoming higher. In this embodiment, the magnification AvePitLab_m[ml] / AveSynPitLab_m[ml], the value by which the original pitch is multiplied, is determined per monophone label. For a monophone label corresponding to a pitch peak, AvePitLab_m[ml] > AveSynPitLab_m[ml], because the peak is lowered as described above, so the magnification is greater than 1. This means that the original speech is edited to raise the pitch peak in advance, in anticipation of its being lowered by speech synthesis, so that the peak after synthesis is not lower than in the original. Conversely, for a monophone label corresponding to a pitch valley, AvePitLab_m[ml] < AveSynPitLab_m[ml], because the valley is raised as described above, so the magnification is less than 1. This means that the original speech is edited to deepen the pitch valley in advance, in anticipation of its being raised by speech synthesis, so that the valley after synthesis is not higher than in the original. Thus, although the magnification in this embodiment may be greater or less than 1, the pitch editing is expected to enlarge the pitch variation, just as in Embodiment 1.
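As a worked illustration with hypothetical numbers (not taken from the patent), suppose one monophone lies on a pitch peak, with a recorded average of 240 and a synthesized average of 200, and another lies in a valley, with a recorded average of 120 and a synthesized average of 150. A frame at 250 in the peak and a frame at 110 in the valley would then be edited as follows:

```latex
\begin{align*}
\text{peak:}\quad
  & \frac{\mathrm{AvePitLab}_m[ml]}{\mathrm{AveSynPitLab}_m[ml]} = \frac{240}{200} = 1.2 > 1,
  & \mathrm{EdPit}_m[fm] &= 250 \times 1.2 = 300 \\
\text{valley:}\quad
  & \frac{\mathrm{AvePitLab}_m[ml]}{\mathrm{AveSynPitLab}_m[ml]} = \frac{120}{150} = 0.8 < 1,
  & \mathrm{EdPit}_m[fm] &= 110 \times 0.8 = 88
\end{align*}
```

The peak frame moves up and the valley frame moves down, so the distance between them grows from 140 to 212 even though both are scaled about the zero level.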

The present invention is not limited to the embodiments, specific examples, and modifications described above; various further modifications and applications are possible. The hardware configurations, block configurations, and flowcharts described above are illustrative examples and do not limit the scope of the invention.

For example, in the speech synthesis dictionary construction device according to Embodiment 1 shown in FIGS. 2 to 4, a modification is conceivable in which the pitch time-series data generated by the time-series data generation unit 323 (FIG. 3), which is the same as the synthesized speech pitch sequence data SynPit_m[fm], is input directly to the policy determination unit 343 (FIG. 4). In this modification, the excitation source generation unit 325 (FIG. 3), the MLSA synthesis filter unit 327 (FIG. 3), and the second pitch extraction unit 341 (FIG. 4) can be omitted.

A similar modification is conceivable for the speech synthesis dictionary construction device according to Embodiment 2 shown in FIGS. 2, 3, and 24. In that case, however, information indicating the range of the pitch time-series data that corresponds to each monophone label must also be sent to the policy determination unit 343 (FIG. 24).

Alternatively, for example, when editing voiced frames, pitch variation may be increased by adopting a policy in which the larger the deviation of a pitch sequence value from its average, the larger the magnification applied to that deviation.
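The text does not specify how the magnification should grow with the deviation. One possible reading, given purely as a hypothetical sketch, is a factor that rises linearly from `base` for values at the average to `base + gain` for the value farthest from it. The function name, both parameters, the linear form, and the use of the mean over voiced frames are all assumptions.

```python
def expand_progressively(pit, voiced, base=1.0, gain=0.5):
    """Enlarge deviations from the voiced mean, more strongly for larger deviations.

    Hypothetical rule: factor = base + gain * |d| / max|d|, where d is the
    deviation of a voiced frame from the mean of all voiced frames.
    Unvoiced frames are returned unchanged.
    """
    voiced_values = [p for p, v in zip(pit, voiced) if v]
    if not voiced_values:
        return list(pit)
    mean = sum(voiced_values) / len(voiced_values)
    max_dev = max(abs(p - mean) for p in voiced_values)
    if max_dev == 0:
        return list(pit)   # flat contour: nothing to enlarge
    edited = []
    for p, v in zip(pit, voiced):
        if not v:
            edited.append(p)
            continue
        d = p - mean
        factor = base + gain * abs(d) / max_dev
        edited.append(mean + d * factor)
    return edited


result = expand_progressively(
    [0.0, 180.0, 190.0, 200.0, 210.0, 220.0],
    [False, True, True, True, True, True],
)
print(result)
```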

FIG. 1 is a diagram showing the flow of creating label data for constructing a general speech database.
FIG. 2 is a functional configuration diagram of the first learning unit and related units forming part of the speech synthesis dictionary construction device according to Embodiment 1 of the present invention.
FIG. 3 is a functional configuration diagram of the synthesis unit forming part of the speech synthesis dictionary construction device according to Embodiment 1 of the present invention.
FIG. 4 is a functional configuration diagram of the second learning unit and related units forming part of the speech synthesis dictionary construction device according to Embodiment 1 of the present invention.
FIG. 5 is a diagram showing the physical configuration of the speech synthesis dictionary construction device according to an embodiment of the present invention.
FIG. 6 is a diagram showing the flow of processing for calculating the recorded average for editing AvePit_m, which is needed to edit the pitch sequence data.
FIG. 7 is a diagram showing the flow of processing for calculating the synthesized average for editing AveSynPit_m, which is needed to edit the pitch sequence data.
FIG. 8 is a diagram showing the flow of processing for calculating the per-recorded-speech maximum absolute value for editing mxAbsDiffPit_m, which is needed to edit the pitch sequence data.
FIG. 9 is a diagram showing the flow of processing for calculating the per-synthesized-speech maximum absolute value for editing mxAbsDiffSynPit_m, which is needed to edit the pitch sequence data.
FIG. 10 is a diagram showing the flow of processing for calculating the overall recorded maximum absolute value for editing MaxAbsDiffPit, which is needed to edit the pitch sequence data.
FIG. 11 is a diagram showing the flow of processing for calculating the overall synthesized maximum absolute value for editing MaxAbsDiffSynPit, which is needed to edit the pitch sequence data.
FIG. 12 is a diagram showing the flow of processing for calculating the first edited pitch sequence data, which is the edited pitch sequence data EdPit_m[fm] in specific editing example 1.
FIG. 13 is a diagram showing the flow of processing for calculating the per-recorded-speech maximum value for editing mxDiffPit_m, which is needed to edit the pitch sequence data.
FIG. 14 is a diagram showing the flow of processing for calculating the per-recorded-speech minimum value for editing mnDiffPit_m, which is needed to edit the pitch sequence data.
FIG. 15 is a diagram showing the flow of processing for calculating the overall recorded maximum value for editing MaxDiffPit, which is needed to edit the pitch sequence data.
FIG. 16 is a diagram showing the flow of processing for calculating the overall recorded minimum value for editing MinDiffPit, which is needed to edit the pitch sequence data.
FIG. 17 is a diagram showing the flow of processing for calculating the per-synthesized-speech maximum value for editing mxDiffSynPit_m, which is needed to edit the pitch sequence data.
FIG. 18 is a diagram showing the flow of processing for calculating the per-synthesized-speech minimum value for editing mnDiffSynPit_m, which is needed to edit the pitch sequence data.
FIG. 19 is a diagram showing the flow of processing for calculating the overall synthesized maximum value for editing MaxDiffSynPit, which is needed to edit the pitch sequence data.
FIG. 20 is a diagram showing the flow of processing for calculating the overall synthesized minimum value for editing MinDiffSynPit, which is needed to edit the pitch sequence data.
FIG. 21 is a diagram showing the flow of processing for calculating the first modified edited pitch sequence data, which is the edited pitch sequence data EdPit_m[fm] in the modification of specific editing example 1.
FIG. 22 is a diagram showing the flow of processing for calculating the second edited pitch sequence data, which is the edited pitch sequence data EdPit_m[fm] in specific editing example 2.
FIG. 23 is a diagram showing the flow of processing for calculating the second modified edited pitch sequence data, which is the edited pitch sequence data EdPit_m[fm] in the modification of specific editing example 2.
FIG. 24 is a functional configuration diagram of the second learning unit with a speech label data comparison function and related units forming part of the speech synthesis dictionary construction device according to Embodiment 2 of the present invention.
FIG. 25 is a diagram showing the flow of processing for calculating the third edited pitch sequence data, which is the edited pitch sequence data EdPit_m[fm] in Embodiment 2 of the present invention.
FIG. 26 is a diagram showing the flow of processing for calculating AvePitLab_m[ml] and AveSynPitLab_m[ml], the per-phoneme pitch averages needed to edit the pitch sequence data.

Explanation of symbols

111: first learning unit, 113: synthesis unit, 117: second learning unit, 119: second learning unit with speech label data comparison function, 221: first speech database, 223: first speech synthesis dictionary, 227: second speech synthesis dictionary, 311: first pitch extraction unit, 313: mel-cepstrum analysis unit, 315: first phoneme HMM learning unit, 321: phoneme HMM sequence generation unit, 323: time-series data generation unit, 325: excitation source generation unit, 327: MLSA synthesis filter unit, 331: monophone phoneme label data generation unit, 341: second pitch extraction unit, 343: policy determination unit, 345: editing unit, 347: second phoneme HMM learning unit, 511: computer device, 521: CPU, 523: ROM, 525: storage unit, 527: RAM, 529: built-in hard disk, 531: operation keys, 533: operation key input processing unit, 541: system bus, 551: first removable hard disk, 553: second removable hard disk, 555: data input/output I/F

Claims

A speech synthesis dictionary construction device comprising:
a temporary construction unit that acquires a phoneme label string and recorded speech data corresponding to the phoneme label string from a speech database, extracts recorded speech pitch sequence data from the acquired recorded speech data, and constructs a temporary speech synthesis dictionary by HMM (Hidden Markov Model) learning based on the extracted recorded speech pitch sequence data and the acquired phoneme label string;
a synthesized data generation unit that generates synthesized speech data based on the temporary speech synthesis dictionary and extracts synthesized speech pitch sequence data from the generated synthesized speech data;
an editing unit that edits the recorded speech pitch sequence data to generate edited pitch sequence data, based on a result of comparing the recorded speech pitch sequence data extracted by the temporary construction unit from the recorded speech data corresponding to the phoneme label string with the synthesized speech pitch sequence data extracted by the synthesized data generation unit from the synthesized speech data corresponding to the phoneme label string; and
a reconstruction unit that constructs a speech synthesis dictionary by HMM learning based on the phoneme label string and the edited pitch sequence data generated by the editing unit.

A speech synthesis dictionary construction device comprising:
a first learning unit that receives a plurality of speech data, a monophone label generated for each of the speech data, a start point pointer and an end point pointer indicating the times corresponding to the start point and the end point of the monophone label, and a triphone label generated for each of the speech data, extracts pitch sequence data from the speech data, generates mel-cepstrum coefficient sequence data up to a predetermined order from the speech data, and constructs a temporary speech synthesis dictionary by HMM (Hidden Markov Model) learning from the monophone label, the start point pointer, the end point pointer, the triphone label, the pitch sequence data, and the mel-cepstrum coefficient sequence data;
a synthesis unit that generates a plurality of synthesized speech data based on the temporary speech synthesis dictionary and the triphone label;
an editing unit that edits the pitch sequence data to generate edited pitch sequence data, in accordance with an editing policy determined based on a result of comparing synthesized speech pitch sequence data extracted from the synthesized speech data with the pitch sequence data extracted by the first learning unit; and
a second learning unit that constructs a speech synthesis dictionary by HMM (Hidden Markov Model) learning from the monophone label, the start point pointer, the end point pointer, the triphone label, the edited pitch sequence data, and the mel-cepstrum coefficient sequence data.

The speech synthesis dictionary construction device according to claim 2, wherein the editing unit generates, for each of the synthesized speech data, a synthesized monophone label and a synthesis start point pointer and a synthesis end point pointer indicating the times corresponding to the start point and the end point of the synthesized monophone label, and edits the pitch sequence data to generate the edited pitch sequence data in accordance with an editing policy determined based on a result of comparing the synthesized speech pitch sequence data with the pitch sequence data with reference to the synthesized monophone label, the synthesis start point pointer, the synthesis end point pointer, the monophone label, the start point pointer, and the end point pointer.

The speech synthesis dictionary construction device according to claim 2 or 3, wherein the editing unit generates the edited pitch sequence data by increasing the pitch variation of the pitch sequence data.

The speech synthesis dictionary construction device according to any one of claims 2 to 4, wherein the editing unit generates the edited pitch sequence data by enlarging the pitch sequence data about a reference pitch level, the reference pitch level being a predetermined pitch level.

The speech synthesis dictionary construction device according to any one of claims 2 to 5, wherein the editing unit generates the edited pitch sequence data by enlarging the pitch sequence data about a reference pitch level, the reference pitch level being the average value of the pitch sequence data.

The speech synthesis dictionary construction device according to any one of claims 2 to 5, wherein the editing unit generates the edited pitch sequence data by enlarging the pitch sequence data about a reference pitch level, the reference pitch level being the zero level of the pitch sequence data.

The speech synthesis dictionary construction device according to any one of claims 5 to 7, wherein the editing unit:
obtains, for each of the speech data, a per-speech average pitch that is the average value of the pitch sequence data, and obtains, for each of the synthesized speech data, a per-synthesized-speech average pitch that is the average value of the synthesized speech pitch sequence data;
obtains, for each of the speech data, a per-speech maximum absolute pitch difference that is the maximum absolute value of the difference between the pitch sequence data and the per-speech average pitch, and obtains, for each of the synthesized speech data, a per-synthesized-speech maximum absolute pitch difference that is the maximum absolute value of the difference between the synthesized speech pitch sequence data and the per-synthesized-speech average pitch;
obtains an overall speech maximum absolute pitch difference that is the maximum of the per-speech maximum absolute pitch differences over all the speech data, and an overall synthesized-speech maximum absolute pitch difference that is the maximum of the per-synthesized-speech maximum absolute pitch differences over all the synthesized speech data;
obtains an overall editing magnification that is the value obtained by dividing the overall speech maximum absolute pitch difference by the overall synthesized-speech maximum absolute pitch difference; and
generates the edited pitch sequence data by enlarging the pitch sequence data about the reference pitch level by the overall editing magnification.

The speech synthesis dictionary construction device according to any one of claims 5 to 7, wherein the editing unit:
obtains, for each of the speech data, a per-speech average pitch that is the average value of the pitch sequence data, and obtains, for each of the synthesized speech data, a per-synthesized-speech average pitch that is the average value of the synthesized speech pitch sequence data;
obtains, for each of the speech data, a per-speech maximum pitch difference that is the maximum of the values obtained by subtracting the per-speech average pitch from the pitch sequence data, and a per-speech minimum pitch difference that is the minimum of those values;
obtains an overall speech maximum pitch difference that is the maximum of the per-speech maximum pitch differences over all the speech data, and an overall speech minimum pitch difference that is the minimum of the per-speech minimum pitch differences over all the speech data;
obtains, for each of the synthesized speech data, a per-synthesized-speech maximum pitch difference that is the maximum of the values obtained by subtracting the per-synthesized-speech average pitch from the synthesized speech pitch sequence data, and a per-synthesized-speech minimum pitch difference that is the minimum of those values;
obtains an overall synthesized-speech maximum pitch difference that is the maximum of the per-synthesized-speech maximum pitch differences over all the synthesized speech data, and an overall synthesized-speech minimum pitch difference that is the minimum of the per-synthesized-speech minimum pitch differences over all the synthesized speech data;
obtains an upper overall editing magnification that is the value obtained by dividing the overall speech maximum pitch difference by the overall synthesized-speech maximum pitch difference, and a lower overall editing magnification that is the value obtained by dividing the overall speech minimum pitch difference by the overall synthesized-speech minimum pitch difference; and
generates the edited pitch sequence data by enlarging the values of the pitch sequence data that exceed the reference pitch level about the reference pitch level by the upper overall editing magnification, and enlarging the values of the pitch sequence data that fall below the reference pitch level about the reference pitch level by the lower overall editing magnification.

The speech synthesis dictionary construction device according to any one of claims 5 to 7, wherein the editing unit:
obtains, for each of the speech data, a per-speech average pitch that is the average value of the pitch sequence data, and obtains, for each of the synthesized speech data, a per-synthesized-speech average pitch that is the average value of the synthesized speech pitch sequence data;
obtains, for each of the speech data, a per-speech maximum absolute pitch difference that is the maximum absolute value of the difference between the pitch sequence data and the per-speech average pitch, and obtains, for each of the synthesized speech data, a per-synthesized-speech maximum absolute pitch difference that is the maximum absolute value of the difference between the synthesized speech pitch sequence data and the per-synthesized-speech average pitch;
obtains, for each of the speech data, a per-speech editing magnification that is the value obtained by dividing the per-speech maximum absolute pitch difference of that speech data by the per-synthesized-speech maximum absolute pitch difference of the synthesized speech data corresponding to that speech data; and
generates the edited pitch sequence data by enlarging, for each of the speech data, the pitch sequence data about the reference pitch level by the per-speech editing magnification.

The speech synthesis dictionary construction device according to any one of claims 5 to 7, wherein the editing unit:
obtains, for each of the speech data, a per-speech average pitch that is the average value of the pitch sequence data, and obtains, for each of the synthesized speech data, a per-synthesized-speech average pitch that is the average value of the synthesized speech pitch sequence data;
obtains, for each of the speech data, a per-speech maximum pitch difference that is the maximum of the values obtained by subtracting the per-speech average pitch from the pitch sequence data, and a per-speech minimum pitch difference that is the minimum of those values;
obtains, for each of the synthesized speech data, a per-synthesized-speech maximum pitch difference that is the maximum of the values obtained by subtracting the per-synthesized-speech average pitch from the synthesized speech pitch sequence data, and a per-synthesized-speech minimum pitch difference that is the minimum of those values;
obtains, for each of the speech data, an upper per-speech editing magnification that is the value obtained by dividing the per-speech maximum pitch difference by the per-synthesized-speech maximum pitch difference of the synthesized speech data corresponding to that speech data, and a lower per-speech editing magnification that is the value obtained by dividing the per-speech minimum pitch difference by the per-synthesized-speech minimum pitch difference of the synthesized speech data corresponding to that speech data; and
generates the edited pitch sequence data by enlarging the values of the pitch sequence data that exceed the reference pitch level about the reference pitch level by the upper per-speech editing magnification, and enlarging the values of the pitch sequence data that fall below the reference pitch level about the reference pitch level by the lower per-speech editing magnification.

The speech synthesis dictionary construction device according to claim 3, wherein the editing unit obtains, for each of the speech data from which the pitch sequence data to be edited was extracted and for each monophone label, an editing monophone magnification that is the value obtained by dividing the average of the pitch sequence data specified by that speech data and that monophone label, taken from the start point to the end point of the monophone label, by the average of the synthesized speech pitch sequence data taken from the start point to the end point of the synthesized monophone label equal to that monophone label, and generates the edited pitch sequence data by multiplying the pitch sequence data by the editing monophone magnification for each of the source speech data and for each monophone label.

A speech synthesis dictionary construction method comprising:
a temporary construction step of acquiring a phoneme label string and recorded speech data corresponding to the phoneme label string from a speech database, extracting recorded speech pitch sequence data from the acquired recorded speech data, and constructing a temporary speech synthesis dictionary by HMM (Hidden Markov Model) learning based on the extracted recorded speech pitch sequence data and the acquired phoneme label string;
a synthesized data generation step of generating synthesized speech data based on the temporary speech synthesis dictionary and extracting synthesized speech pitch sequence data from the generated synthesized speech data;
an editing step of editing the recorded speech pitch sequence data to generate edited pitch sequence data, based on a result of comparing the recorded speech pitch sequence data extracted in the temporary construction step from the recorded speech data corresponding to the phoneme label string with the synthesized speech pitch sequence data extracted in the synthesized data generation step from the synthesized speech data corresponding to the phoneme label string; and
a reconstruction step of constructing a speech synthesis dictionary by HMM learning based on the phoneme label string and the edited pitch sequence data generated in the editing step.

A program causing a computer to execute:
a temporary construction step of acquiring a phoneme label string and recorded speech data corresponding to the phoneme label string from a speech database, extracting recorded speech pitch sequence data from the acquired recorded speech data, and constructing a temporary speech synthesis dictionary by HMM (Hidden Markov Model) learning based on the extracted recorded speech pitch sequence data and the acquired phoneme label string;
a synthesized data generation step of generating synthesized speech data based on the temporary speech synthesis dictionary and extracting synthesized speech pitch sequence data from the generated synthesized speech data;
an editing step of editing the recorded speech pitch sequence data to generate edited pitch sequence data, based on a result of comparing the recorded speech pitch sequence data extracted in the temporary construction step from the recorded speech data corresponding to the phoneme label string with the synthesized speech pitch sequence data extracted in the synthesized data generation step from the synthesized speech data corresponding to the phoneme label string; and
a reconstruction step of constructing a speech synthesis dictionary by HMM learning based on the phoneme label string and the edited pitch sequence data generated in the editing step.