JP2008026777A - Speech synthesis dictionary structuring device, speech synthesis dictionary structuring method, and program

Info

Publication number: JP2008026777A
Application number: JP2006201712A
Authority: JP
Inventors: Katsuhiko Sato; 勝彦佐藤
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2006-07-25
Filing date: 2006-07-25
Publication date: 2008-02-07
Anticipated expiration: 2026-07-25
Also published as: JP4929896B2

Abstract

PROBLEM TO BE SOLVED: To construct a speech synthesis dictionary with which clear speech can be synthesized.

SOLUTION: A formant emphasis unit 145 or 445 converts mel cepstrum coefficient series data generated by an analysis unit 141 or 441 into emphasized mel cepstrum coefficient series data in which the formants of the speech spectrum are emphasized. The formant emphasis unit is placed upstream of, or inside, a phoneme HMM learning unit 151 or 431, which writes a phoneme HMM to a speech synthesis dictionary 173 or 453 in association with each phoneme label.

COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a speech synthesis dictionary construction device, a speech synthesis dictionary construction method, and a program for constructing a speech synthesis dictionary used for speech synthesis and the like.

Speech recognition and speech synthesis technologies based on the Hidden Markov Model (hereinafter, HMM) are widely used.

HMM-based speech synthesis requires a speech synthesis dictionary that records the correspondence between phoneme labels and spectral parameter data sequences and the like.

A speech synthesis dictionary is constructed by a speech synthesis dictionary construction device. Such a device typically performs mel cepstrum analysis and pitch extraction on the data recorded in a database made up of pairs of phoneme label strings and their corresponding speech data (hereinafter, the speech database), and then constructs the dictionary through an HMM-based learning process.

When constructing a speech synthesis dictionary, a conventional construction device used the speech data recorded in the speech database for HMM-based learning as it was, without any particular processing.

Likewise, in the HMM-based learning process, a conventional device used the mel cepstrum coefficient series data produced by mel cepstrum analysis for HMM-based learning as it was, without any particular processing.

However, when speech is synthesized using a dictionary constructed in this way, the peaks and valleys of the spectral envelope of the resulting speech data are smoothed compared with those of the spectral envelope of the original speech data.

Because the peaks and valleys of the spectral envelope are smoothed, speech synthesized with a dictionary built by a conventional construction device is less clear than natural human speech.

The present invention has been made in view of the above circumstances. Its object is to provide a speech synthesis dictionary construction device, a speech synthesis dictionary construction method, and a program that make it possible to construct a speech synthesis dictionary from which clear speech can be synthesized.

To achieve the above object, a speech synthesis dictionary construction device according to a first aspect of the present invention comprises:
an input unit to which a phoneme label string and the speech data corresponding to it are input;
a speech data processing unit that emphasizes spectral parameters of the speech data and converts it into emphasized speech data;
a phoneme HMM learning unit that associates a phoneme HMM (Hidden Markov Model) with each phoneme label on the basis of the phoneme label string and the emphasized speech data; and
a data writing unit that records the learning result in a speech synthesis dictionary.

Because phoneme HMM learning uses speech data that has undergone emphasis processing in advance, rather than the original speech data as it is, the result is a speech synthesis dictionary well suited for a speech synthesizer to refer to when generating clear synthesized speech.

The speech data processing unit may include: an in-processing-unit mel cepstrum analysis unit that performs mel cepstrum analysis on the speech data to generate first mel cepstrum coefficient series data; an inverse filter that is defined by the first mel cepstrum coefficient series data and generates excitation source data from the speech data; an in-processing-unit formant emphasis unit that applies a predetermined emphasis process to the first mel cepstrum coefficient series data to generate emphasized mel cepstrum coefficient series data in which the spectral parameters are emphasized; and a synthesis filter that is defined by the emphasized mel cepstrum coefficient series data and generates formant-emphasized speech data from the excitation source data input to it.

Editing the mel cepstrum coefficient series data after mel cepstrum analysis in this way is a reliable and simple means of applying formant emphasis to speech data.

The phoneme HMM learning unit may include: an in-learning-unit mel cepstrum analysis unit that performs mel cepstrum analysis on the formant-emphasized speech data to generate second mel cepstrum coefficient series data; a pitch extraction unit that extracts pitch series data from the formant-emphasized speech data; a mel cepstrum learning unit that associates a mel cepstrum phoneme HMM with each phoneme label on the basis of the phoneme label string and the second mel cepstrum coefficient series data; and a pitch learning unit that associates a pitch phoneme HMM with each phoneme label on the basis of the phoneme label string and the pitch series data.

This makes it possible to generate both a mel cepstrum speech synthesis dictionary, which stores the results of phoneme HMM learning for the mel cepstrum, and a pitch speech synthesis dictionary, which stores the results of phoneme HMM learning for the pitch.

To achieve the above object, a speech synthesis dictionary construction device according to a second aspect of the present invention comprises:
an input unit to which a phoneme label string and the speech data corresponding to it are input;
an in-learning-unit mel cepstrum analysis unit that performs mel cepstrum analysis on the speech data to generate mel cepstrum coefficient series data;
a pitch extraction unit that extracts pitch series data from the speech data;
an in-learning-unit formant emphasis unit that applies a predetermined emphasis process to the mel cepstrum coefficient series data to generate emphasized mel cepstrum coefficient series data in which spectral parameters are emphasized;
a mel cepstrum learning unit that associates a mel cepstrum phoneme HMM (Hidden Markov Model) with each phoneme label on the basis of the phoneme label string and the emphasized mel cepstrum coefficient series data;
a pitch learning unit that associates a pitch phoneme HMM with each phoneme label on the basis of the phoneme label string and the pitch series data; and
a data writing unit that records the learning result in a speech synthesis dictionary.

In this case the original speech data is input to the phoneme learning unit as it is, but because a formant emphasis step is inserted into phoneme HMM learning, the result is again a speech synthesis dictionary well suited for a speech synthesizer to refer to when generating clear synthesized speech.

The predetermined emphasis process is, for example, a process that multiplies those mel cepstrum coefficients whose order is higher than a predetermined order by a predetermined emphasis coefficient greater than 1.

Amplifying the higher-order mel cepstrum coefficient series data amplifies the difference between the peaks and valleys of the spectral envelope relative to the spectral envelope of the original speech data.
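
The effect described above can be checked numerically. The following is a minimal sketch, not the patent's implementation: it assumes that the log-amplitude envelope on the warped (mel) frequency axis is the cosine series of the mel cepstrum coefficients, and the names `emphasize`, `log_envelope`, `peak_to_valley`, `threshold_order`, and `beta`, as well as the sample frame, are hypothetical.

```python
import math


def emphasize(frame, threshold_order, beta):
    """Multiply coefficients whose order exceeds threshold_order by beta (> 1)."""
    if beta <= 1.0:
        raise ValueError("beta is assumed to be greater than 1")
    return [c * beta if d > threshold_order else c for d, c in enumerate(frame)]


def log_envelope(frame, n_points=64):
    """Log-amplitude envelope sampled on the warped frequency axis [0, pi]."""
    return [
        sum(c * math.cos(d * math.pi * k / (n_points - 1)) for d, c in enumerate(frame))
        for k in range(n_points)
    ]


def peak_to_valley(values):
    """Difference between the highest peak and the deepest valley."""
    return max(values) - min(values)


# Hypothetical single frame of mel cepstrum coefficients, orders 0..3.
frame = [0.5, 1.0, -0.4, 0.2]
emphasized = emphasize(frame, threshold_order=0, beta=1.5)
before = peak_to_valley(log_envelope(frame))
after = peak_to_valley(log_envelope(emphasized))
```

With `threshold_order=0` every coefficient except the order-0 term is scaled, so under the stated assumption the peak-to-valley difference grows by the factor `beta`. With a higher threshold, only the faster-varying structure of the envelope is amplified.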

The predetermined emphasis coefficient may differ for each item of speech data from which the mel cepstrum coefficient series data to be multiplied was generated.

Varying the emphasis coefficient appropriately for each item of speech data makes it possible to compensate, for example, for changes in the speaker's voice over time while a large amount of speech data was being recorded, so that an appropriate speech synthesis dictionary can be constructed.

The predetermined emphasis coefficient may also vary with the order of the mel cepstrum coefficient series data that it multiplies.

Varying the coefficient with the order of the mel cepstrum coefficient series data allows more appropriate formant emphasis.
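
One way to picture emphasis coefficients that depend on both the speech data item and the order is a lookup table of per-order coefficient lists. The sketch below is illustrative only: the linear ramp, the function names, and the sample values are assumptions, since the text does not say how the coefficients are chosen.

```python
def ramp_coefficients(max_order, threshold_order, beta_max):
    """Per-order coefficients: 1.0 up to threshold_order, then a linear ramp to beta_max."""
    span = max_order - threshold_order
    betas = []
    for d in range(max_order + 1):
        if d <= threshold_order or span <= 0:
            betas.append(1.0)
        else:
            betas.append(1.0 + (beta_max - 1.0) * (d - threshold_order) / span)
    return betas


def emphasize_frames(frames, betas):
    """Multiply the coefficient of order d in every frame by betas[d]."""
    for frame in frames:
        if len(frame) != len(betas):
            raise ValueError("each frame needs one coefficient per order")
    return [[c * b for c, b in zip(frame, betas)] for frame in frames]


# Hypothetical per-utterance settings: utterance 2 gets stronger emphasis.
betas_by_utterance = {
    1: ramp_coefficients(max_order=4, threshold_order=2, beta_max=1.2),
    2: ramp_coefficients(max_order=4, threshold_order=2, beta_max=2.0),
}
frames = [[1.0, 1.0, 1.0, 1.0, 1.0]]
result = emphasize_frames(frames, betas_by_utterance[2])
```

Keeping the coefficients in a table keyed by utterance leaves the emphasis step itself unchanged when the settings for a particular recording are adjusted.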

To achieve the above object, a speech synthesis dictionary construction method according to a third aspect of the present invention comprises:
an input step in which a phoneme label string and the speech data corresponding to it are input from a database;
a speech data processing step of emphasizing spectral parameters of the speech data and converting it into emphasized speech data;
a phoneme HMM learning step of associating a phoneme HMM (Hidden Markov Model) with each phoneme label on the basis of the phoneme label string and the emphasized speech data; and
a data writing step of recording the learning result in a speech synthesis dictionary.

To achieve the above object, a speech synthesis dictionary construction method according to a fourth aspect of the present invention comprises:
an input step in which a phoneme label string and the speech data corresponding to it are input from a database;
an in-learning-unit mel cepstrum analysis step of performing mel cepstrum analysis on the speech data to generate mel cepstrum coefficient series data;
a pitch extraction step of extracting pitch series data from the speech data;
an in-learning-unit formant emphasis step of applying a predetermined emphasis process to the mel cepstrum coefficient series data to generate emphasized mel cepstrum coefficient series data in which spectral parameters are emphasized;
a mel cepstrum learning step of associating a mel cepstrum phoneme HMM (Hidden Markov Model) with each phoneme label on the basis of the phoneme label string and the emphasized mel cepstrum coefficient series data;
a pitch learning step of associating a pitch phoneme HMM with each phoneme label on the basis of the phoneme label string and the pitch series data; and
a data writing step of recording the learning result in a speech synthesis dictionary.

To achieve the above object, a computer program according to a fifth aspect of the present invention causes a computer to execute:
an input step in which a phoneme label string and the speech data corresponding to it are input from a database;
a speech data processing step of emphasizing spectral parameters of the speech data and converting it into emphasized speech data;
a phoneme HMM learning step of associating a phoneme HMM (Hidden Markov Model) with each phoneme label on the basis of the phoneme label string and the emphasized speech data; and
a data writing step of recording the learning result in a speech synthesis dictionary.

To achieve the above object, a computer program according to a sixth aspect of the present invention causes a computer to execute:
an input step in which a phoneme label string and the speech data corresponding to it are input from a database;
an in-learning-unit mel cepstrum analysis step of performing mel cepstrum analysis on the speech data to generate mel cepstrum coefficient series data;
a pitch extraction step of extracting pitch series data from the speech data;
an in-learning-unit formant emphasis step of applying a predetermined emphasis process to the mel cepstrum coefficient series data to generate emphasized mel cepstrum coefficient series data in which spectral parameters are emphasized;
a mel cepstrum learning step of associating a mel cepstrum phoneme HMM (Hidden Markov Model) with each phoneme label on the basis of the phoneme label string and the emphasized mel cepstrum coefficient series data;
a pitch learning step of associating a pitch phoneme HMM with each phoneme label on the basis of the phoneme label string and the pitch series data; and
a data writing step of recording the learning result in a speech synthesis dictionary.

According to the present invention, a speech synthesis dictionary is constructed by applying formant emphasis to the speech data and then using it for phoneme HMM learning. Alternatively, a speech synthesis dictionary is constructed by applying formant emphasis to the mel cepstrum coefficient series data generated in the course of phoneme HMM learning and using the result for that learning. The synthesized speech obtained with such a dictionary therefore has emphasized formants and is clear.

A speech synthesis dictionary construction device according to embodiments of the present invention is described in detail below.

(Embodiment 1)
FIG. 1 is a functional configuration diagram of the speech synthesis dictionary construction device 111 according to Embodiment 1 of the present invention.

As shown in the figure, the speech synthesis dictionary construction device 111 includes an input unit 121, a data writing unit 123, a speech data processing unit 131, and a phoneme HMM learning unit 151.

The speech data processing unit 131 includes an in-processing-unit mel cepstrum analysis unit 141, an inverse filter 143, an in-processing-unit formant emphasis unit 145, and a synthesis filter 147.

The phoneme HMM learning unit 151 includes an in-learning-unit mel cepstrum analysis unit 161, a pitch extraction unit 163, a mel cepstrum learning unit 165, and a pitch learning unit 167.

As shown in FIG. 1, the speech synthesis dictionary construction device 111 is connected to a speech database 171 and a speech synthesis dictionary 173.

The speech database 171 is implemented on a hard disk or the like and stores multiple pairs of a phoneme label string and its corresponding speech data. The speech database 171 is connected to the input unit 121.

Speech data is passed from the speech database 171, via the input unit 121, to the in-processing-unit mel cepstrum analysis unit 141 inside the speech data processing unit 131. The speech data processing unit 131 emphasizes the spectral parameters of the speech data.

The phoneme label string is passed from the speech database 171, via the input unit 121, to the mel cepstrum learning unit 165 and the pitch learning unit 167 inside the phoneme HMM learning unit 151.

The speech synthesis dictionary 173 is implemented on a hard disk or the like and stores multiple pairs of a phoneme label and its corresponding phoneme HMM. The speech synthesis dictionary 173 is connected to the data writing unit 123. The speech synthesis dictionary 173 is constructed by the speech synthesis dictionary construction device 111.

The in-processing-unit mel cepstrum analysis unit 141 performs mel cepstrum analysis on the speech data to generate mel cepstrum coefficient series data and passes it to the in-processing-unit formant emphasis unit 145.

The generated mel cepstrum coefficient series data also defines the inverse filter 143.

The speech data is input to the inverse filter 143. As a result, excitation source data is generated and passed to the synthesis filter 147.

The in-processing-unit formant emphasis unit 145 applies formant emphasis to the mel cepstrum coefficient series data received from the in-processing-unit mel cepstrum analysis unit 141 by enlarging the higher-order coefficients. It thereby generates emphasized mel cepstrum coefficient series data in which the spectral parameters of the speech data are emphasized, and passes that data to the synthesis filter 147.

The detailed procedure of the formant emphasis process is described later with reference to FIGS. 3 and 6.

The synthesis filter 147 is an MLSA (Mel Log Spectrum Approximation) synthesis filter whose coefficients are the emphasized mel cepstrum coefficient series data. When the excitation source data is input to it, it generates formant-emphasized speech data.

The synthesis filter 147 passes the generated formant-emphasized speech data to the in-learning-unit mel cepstrum analysis unit 161 and the pitch extraction unit 163 inside the phoneme HMM learning unit 151.
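
The data flow through the speech data processing unit 131 can be summarized as follows. This is a structural sketch only: the four callables stand in for units 141, 143, 145, and 147, and the toy stand-ins used to exercise it (a single gain value instead of real mel cepstrum analysis and MLSA filtering) are assumptions made purely for illustration.

```python
def process_speech_data(speech, analyze, inverse_filter, emphasize, synthesis_filter):
    """Return formant-emphasized speech data for one item of speech data."""
    mc = analyze(speech)                         # unit 141: mel cepstrum analysis
    excitation = inverse_filter(speech, mc)      # unit 143: filter defined by mc
    em_mc = emphasize(mc)                        # unit 145: formant emphasis
    return synthesis_filter(excitation, em_mc)   # unit 147: filter defined by em_mc


# Toy stand-ins: the "spectrum" is one gain value, so filtering is a division
# or a multiplication. Real units would use mel cepstrum analysis and MLSA filters.
def toy_analyze(speech):
    return [max(abs(s) for s in speech)]


def toy_inverse_filter(speech, mc):
    return [s / mc[0] for s in speech]


def toy_synthesis_filter(excitation, mc):
    return [e * mc[0] for e in excitation]


speech = [1.0, -2.0, 0.5]
unchanged = process_speech_data(speech, toy_analyze, toy_inverse_filter,
                                lambda mc: mc, toy_synthesis_filter)
louder = process_speech_data(speech, toy_analyze, toy_inverse_filter,
                             lambda mc: [2.0 * c for c in mc], toy_synthesis_filter)
```

In this toy setting an identity emphasis step makes the inverse filter and the synthesis filter cancel, so any change in the output comes from the edited coefficients alone.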

The phoneme HMM learning unit 151 does not refer directly to the speech data recorded in the speech database 171. Instead, it refers to the formant-emphasized speech data to which the speech data processing unit 131 has applied formant emphasis in advance. In all other respects, the same techniques as in known methods are used. So-called phoneme HMM learning is then performed, in which the phoneme HMM corresponding to each phoneme label is determined.

The in-learning-unit mel cepstrum analysis unit 161 extracts mel cepstrum coefficient series data from the formant-emphasized speech data it receives, by means of mel cepstrum analysis.

The pitch extraction unit 163 extracts pitch series data from the formant-emphasized speech data it receives.

The mel cepstrum learning unit 165 performs phoneme HMM learning for the mel cepstrum from the phoneme label string passed from the speech database 171 via the input unit 121 and the mel cepstrum coefficient series data generated by the in-learning-unit mel cepstrum analysis unit 161. The learning result is recorded in the speech synthesis dictionary 173 via the data writing unit 123.

In addition, the pitch learning unit 167 performs phoneme HMM learning for the pitch from the phoneme label string passed from the speech database 171 via the input unit 121 and the pitch series data generated by the pitch extraction unit 163. The learning result is recorded in the speech synthesis dictionary 173 via the data writing unit 123.
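
The two learning results can be thought of as two entries stored under each phoneme label. The sketch below shows one possible in-memory layout. The class and field names are hypothetical, and the model payloads are left opaque because the text relies on known HMM training methods without describing them.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class PhonemeEntry:
    """Learning results recorded for one phoneme label."""
    mel_cepstrum_hmm: Optional[Any] = None  # written by mel cepstrum learning (unit 165)
    pitch_hmm: Optional[Any] = None         # written by pitch learning (unit 167)


class SpeechSynthesisDictionary:
    """Maps each phoneme label to its mel cepstrum and pitch phoneme HMMs."""

    def __init__(self) -> None:
        self.entries: Dict[str, PhonemeEntry] = {}

    def write_mel_cepstrum(self, label: str, model: Any) -> None:
        self.entries.setdefault(label, PhonemeEntry()).mel_cepstrum_hmm = model

    def write_pitch(self, label: str, model: Any) -> None:
        self.entries.setdefault(label, PhonemeEntry()).pitch_hmm = model


# Placeholder payloads; a real dictionary would hold trained HMM parameters.
dictionary = SpeechSynthesisDictionary()
dictionary.write_mel_cepstrum("a", {"placeholder": "mel cepstrum HMM"})
dictionary.write_pitch("a", {"placeholder": "pitch HMM"})
```

Because each writer fills only its own field, the two learning units can record their results independently and in either order.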

Physically, the speech synthesis dictionary construction device 111 shown in FIG. 1 is implemented by a general-purpose computer device 211 such as that shown in FIG. 2.

A CPU 221, a ROM 223, a storage unit 225, a user interface (hereinafter, I/F) 227, and a data input/output I/F 229 are connected to one another by a bus 231.

In addition to a known operation program for HMM-based learning, the ROM 223 stores, in this embodiment in particular, an operation program for applying formant emphasis to speech data.

The storage unit 225 consists of a RAM 241 and a hard disk 243. It temporarily stores the constants for formant emphasis, phoneme label strings, speech data, mel cepstrum coefficient series data, excitation source data, pitch series data, the phoneme HMMs associated with each phoneme label, and the like.

The data input/output I/F 229 is an interface for connecting to a hard disk 261 or the like that holds the source data and to a hard disk 263 or the like for recording processed data. The source data hard disk 261 corresponds to the speech database 171 in FIG. 1, and the processed data hard disk 263 corresponds to the speech synthesis dictionary 173 in FIG. 1.

The data input/output I/F 229 is connected to the speech database 171 shown in FIG. 1. Under the control of the CPU 221 shown in FIG. 2, it reads out pairs of a phoneme label string and speech data and stores them in the storage unit 225.

Following the operation program for formant emphasis stored in the ROM 223, the CPU 221 reads the speech data from the storage unit 225 one item at a time, applies formant emphasis, and stores the processed speech data in the storage unit 225 in association with its phoneme label string.

As a result of this processing by the CPU 221, the storage unit 225 holds pairs of a phoneme label string and formant-emphasized speech data. These pairs are used for phoneme HMM learning.

The CPU 221 performs the synthesis dictionary generation operation by executing the operation program for phoneme HMM learning stored in the ROM 223.

The data input/output I/F 229 is also connected to the speech synthesis dictionary 173 shown in FIG. 1. It outputs to that dictionary the results of processing by the CPU 221 shown in FIG. 2, namely the mel cepstrum phoneme HMM for each phoneme label and the pitch phoneme HMM for each phoneme label.

The user I/F 227 shown in FIG. 2 consists of a keyboard 251 and a monitor 253 and is provided for entering arbitrary instructions, data, and programs. In particular, for the formant emphasis process the user must supply various constants through this I/F.

As shown in FIG. 1, the distinguishing feature of the speech synthesis dictionary construction device 111 according to this embodiment is that the speech data processing unit 131 performs a predetermined emphasis process that emphasizes the formants of each item of speech data.

The speech data is converted into formant-emphasized speech data by passing through the speech data processing unit 131. More specifically, the in-processing-unit formant emphasis unit 145 amplifies the higher-order coefficients of the mel cepstrum coefficient series data.

The predetermined emphasis process performed by the speech data processing unit 131 may be any process, provided that it ultimately emphasizes the formants of the speech data. A typical procedure for the emphasis process is described below.

In the following description, a frame means a time segment used for converting speech data into a spectrum and is denoted by the symbol fm.

A specific example of the emphasis process performed by the speech data processing unit 131 is described with reference to the flowchart shown in FIG. 3.

In this example, the user is assumed to have set in advance, in the storage unit 225 via the user I/F 227 of FIG. 2, the order D_a of the mel cepstrum analysis that the in-processing-unit mel cepstrum analysis unit 141 performs on the speech data (step S315).

As shown in FIG. 1, when the speech synthesis dictionary construction device 111 constructs the speech synthesis dictionary 173, the speech database 171 and, for example, an empty speech synthesis dictionary 173 are connected to the device.

音声合成辞書１７３構築の開始の指示が図２のユーザＩ／Ｆ２２７からされると、図２のＣＰＵ２２１の内部のカウンタレジスタにカウンタｎの初期値として１が格納される（ステップＳ３１１）。このｎは、図１の音声データベース１７１に記録されている音声データを識別するための変数である。 When an instruction to start construction of the speech synthesis dictionary 173 is issued from the user I / F 227 of FIG. 2, 1 is stored as an initial value of the counter n in the counter register in the CPU 221 of FIG. 2 (step S311). This n is a variable for identifying the audio data recorded in the audio database 171 of FIG.

図１の入力部１２１は、音声データベース１７１から、音声データ音声Ｓｐ_ｎ（但し、１≦ｎ≦Ｎ_ＳＰであり、Ｎ_ＳＰは音声データベースのデータ数である。）を取り出し、図２の記憶部２２５に記憶する（ステップＳ３１３）。 The input unit 121 in FIG. 1 extracts the voice data voice Sp _n (where 1 ≦ n ≦ N _SP , where N _SP is the number of data in the voice database) from the voice database 171, and the storage unit in FIG. 2. It memorize | stores in 225 (step S313).

The in-processing-unit mel cepstrum analysis unit 141 of FIG. 1 performs D_a-th order mel cepstrum analysis on the speech data Sp_n (step S317). As noted above, the order is given in step S315.

As a result of the analysis, mel cepstrum coefficient series data MC_{n,d}[fm] (1 ≤ n ≤ N_SP, 0 ≤ d ≤ D_a, 0 ≤ fm ≤ N_fm[n], where N_fm[n] is the number of frames for the speech data Sp_n) are generated and stored in the storage unit 225 of FIG. 2 (step S319).

これらのメルケプストラム係数系列データにより、逆ＭＬＳＡフィルタが定義される（図３のステップＳ３２１及び図１の音声データ加工部１３１内部の点線矢印）。 An inverse MLSA filter is defined by these mel cepstrum coefficient series data (step S321 in FIG. 3 and a dotted arrow in the audio data processing unit 131 in FIG. 1).

When the speech data Sp_n is input to the inverse MLSA filter defined in this way (step S323 of FIG. 3), excitation sound source data ExData_n is generated and stored in the storage unit 225 of FIG. 2 (step S325 of FIG. 3).

In accordance with the instructions of the speech data processing program stored in the ROM 223, the CPU 221 of FIG. 2 reads the mel cepstrum coefficient series data MC_{n,d}[fm] from the storage unit 225, applies a predetermined emphasis process described in detail later to generate emphasized mel cepstrum coefficient series data EmMC_{n,d}[fm], and stores the result in the storage unit 225 of FIG. 2 (step S327 of FIG. 3). In FIG. 1, this processing is performed by the in-processing-unit formant emphasis unit 145.

The emphasized mel cepstrum coefficient series data EmMC_{n,d}[fm] generated in this way are used as the coefficients of the MLSA synthesis filter. When the excitation sound source data ExData_n is input to that filter, formant-emphasized speech data EmSp_n is generated and stored in the storage unit 225 of FIG. 2.

That is, to convert the excitation sound source data ExData_n into formant-emphasized speech data EmSp_n by MLSA synthesis, the operation of the MLSA synthesis filter is defined by the emphasized mel cepstrum coefficient series data EmMC_{n,d}[fm] (step S329). When the excitation sound source data ExData_n read from the storage unit 225 of FIG. 2 is input to the MLSA synthesis filter defined in this way (step S331), the formant-emphasized speech data EmSp_n is calculated. The result is stored in the storage unit 225 of FIG. 2 (step S333).

図１の音声データベース１７１に記録された全ての音声データについて、ホルマント強調処理済音声データへの変換が完了したか否かを判別する（ステップＳ３３５）。 It is determined whether or not all the audio data recorded in the audio database 171 in FIG. 1 has been converted to formant-enhanced audio data (step S335).

Specifically, if the counter n in the counter register inside the CPU 221 of FIG. 2 is smaller than the number of speech data items N_Sp (step S335; Yes), n is incremented by 1 (step S337) and processing returns to the step of retrieving the speech data Sp_n (step S313).

If n is greater than or equal to N_Sp (step S335; No), all the speech data recorded in the speech database 171 of FIG. 1 have been converted into formant-emphasized speech data, so the formant emphasis processing in the speech data processing unit 131 ends.
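The loop of steps S311 to S337 can be sketched in Python as follows. This is only a control-flow sketch under stated assumptions: the four callables (`analyze`, `inverse_filter`, `emphasize`, `synthesize`) are hypothetical stand-ins for mel cepstrum analysis, the inverse MLSA filter, the emphasis process, and the MLSA synthesis filter, none of which is implemented here, and the function name `process_database` is not from the source.

```python
from typing import Callable, List, Sequence

Mcep = List[List[float]]  # assumed layout: mcep[d][fm] (order d, frame fm)


def process_database(
    database: Sequence[Sequence[float]],
    analyze: Callable[[Sequence[float]], Mcep],
    inverse_filter: Callable[[Sequence[float], Mcep], List[float]],
    emphasize: Callable[[Mcep], Mcep],
    synthesize: Callable[[List[float], Mcep], List[float]],
) -> List[List[float]]:
    """Return one formant-emphasized waveform EmSp_n per input waveform Sp_n."""
    emphasized_speech = []
    # S311, S335, S337: n runs from 1 to the number of speech data items.
    for n, sp_n in enumerate(database, start=1):   # S313: fetch Sp_n
        mc = analyze(sp_n)                         # S317, S319: MC_{n,d}[fm]
        ex_data = inverse_filter(sp_n, mc)         # S321 to S325: ExData_n
        em_mc = emphasize(mc)                      # S327: EmMC_{n,d}[fm]
        em_sp = synthesize(ex_data, em_mc)         # S329 to S333: EmSp_n
        emphasized_speech.append(em_sp)
    return emphasized_speech
```

The excitation is derived with the original coefficients and resynthesized with the emphasized ones, so any change in the output comes only from the difference between MC and EmMC.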

The formant-emphasized speech data is then passed to the phoneme HMM learning unit 151 of FIG. 1. As described above, the phoneme HMM learning unit 151 performs phoneme HMM learning in the same way as known methods, except that it refers to speech data to which the speech data processing unit 131 has already applied formant emphasis instead of referring directly to the speech data recorded in the speech database 171.

Because the speech synthesis dictionary construction device 111 according to this embodiment performs phoneme HMM learning with reference to speech data that has already undergone formant emphasis, the speech synthesis dictionary it constructs helps a speech synthesizer generate clear synthesized speech.

(Embodiment 2)
FIG. 4 is a functional configuration diagram of the speech synthesis dictionary construction device 411 according to Embodiment 2.

前記の実施形態の場合には音素ＨＭＭ学習の前に音声データ加工を行うのに対し、本実施形態では、音素ＨＭＭ学習部の内部にホルマント強調部を備えることを特徴とする。 In the case of the above-described embodiment, speech data processing is performed before phoneme HMM learning, whereas in this embodiment, a formant emphasis unit is provided inside the phoneme HMM learning unit.

The speech synthesis dictionary construction device 411 basically has the same configuration as a known speech synthesis dictionary construction device: it uses the speech database 451 to perform phoneme HMM learning for the mel cepstrum and phoneme HMM learning for the pitch, and writes the learning results to the speech synthesis dictionary 453.

That is, as illustrated, the device includes an input unit 421, a data writing unit 423, and a phoneme HMM learning unit 431. The phoneme HMM learning unit 431 includes an in-learning-unit mel cepstrum analysis unit 441, a pitch extraction unit 443, a mel cepstrum learning unit 447, and a pitch learning unit 449.

ただし、本実施形態に係る音声合成辞書構築装置４１１は、音素ＨＭＭ学習部４３１の内部に、学習部内ホルマント強調部４４５をさらに備える。 However, the speech synthesis dictionary construction device 411 according to the present embodiment further includes an in-learning formant emphasis unit 445 inside the phoneme HMM learning unit 431.

既知の音声合成辞書構築装置においては、学習部内メルケプストラム分析部４４１による分析結果がそのままメルケプストラム学習部４４７に引き渡される。 In the known speech synthesis dictionary construction device, the analysis result by the in-learning unit mel cepstrum analysis unit 441 is directly transferred to the mel cepstrum learning unit 447.

それに対し、本実施形態に係る音声合成辞書構築装置４１１においては、学習部内メルケプストラム分析部４４１は、音声データから生成したメルケプストラム係数系列データをまず学習部内ホルマント強調部４４５に引き渡す。 On the other hand, in the speech synthesis dictionary construction device 411 according to the present embodiment, the in-learning unit mel cepstrum analysis unit 441 first passes the mel cepstrum coefficient series data generated from the speech data to the in-learning unit formant emphasizing unit 445.

The in-learning-unit formant emphasis unit 445 applies a predetermined formant emphasis process to the mel cepstrum coefficient series data it receives, converts them into emphasized mel cepstrum coefficient series data, and then passes the emphasized data to the mel cepstrum learning unit 447.

The predetermined formant emphasis process amplifies the mel cepstrum coefficient series data of orders higher than a predetermined order. Details of this emphasis process are described later with reference to FIG. 6.

図４に示す音声合成辞書構築装置４１１も、前期実施形態に係る装置１１１と同様に、物理的には、図２に示すような一般的なコンピュータ装置２１１により、構成される。 Similar to the device 111 according to the previous embodiment, the speech synthesis dictionary construction device 411 shown in FIG. 4 is also physically configured by a general computer device 211 as shown in FIG.

ＲＯＭ２２３は、ＨＭＭに基づく学習のための既知の動作プログラムの他に、特に、この実施の形態においては、メルケプストラム係数系列データにホルマント強調処理を施すための動作プログラムを記憶する。 In addition to the known operation program for learning based on the HMM, the ROM 223 stores an operation program for performing formant emphasis processing on the mel cepstrum coefficient series data in this embodiment.

In the formant emphasis process, the user needs to supply various constants via the user I/F 227.

以下では、音素ＨＭＭ学習部４３１の内部で学習部内ホルマント強調部４４５により実行される所定の強調処理を、図５を参照しつつ、説明する。 Hereinafter, a predetermined enhancement process executed by the in-learning formant emphasis unit 445 inside the phoneme HMM learning unit 431 will be described with reference to FIG.

First, the user stores the order D_c of the mel cepstrum analysis executed by the in-learning-unit mel cepstrum analysis unit 441 of FIG. 4 in the storage unit 225 via the user I/F 227 (step S515).

図２のＣＰＵ２２１の内部のカウンタレジスタに音声データ識別用のカウンタｎを格納する。ｎの初期値は１である（ステップＳ５１１）。 The counter n for voice data identification is stored in the counter register inside the CPU 221 in FIG. The initial value of n is 1 (step S511).

In accordance with the instructions of the program stored in the ROM 223, the CPU 221 of FIG. 2 retrieves, via the data input/output I/F 229, the n-th speech data Sp_n of the N_Sp speech data items recorded in the speech database 451 of FIG. 4 (step S513), stores it in the storage unit 225, and performs D_c-th order mel cepstrum analysis on it (step S517).

As a result of the analysis, mel cepstrum coefficient series data MC_{n,d}[fm] (1 ≤ n ≤ N_Sp, 0 ≤ d ≤ D_c, 0 ≤ fm ≤ N_fm[n], where N_fm[n] is the number of frames for the speech data Sp_n) are generated, and the CPU 221 of FIG. 2 stores them in the storage unit 225 (step S519).

The CPU 221 of FIG. 2 sequentially reads the mel cepstrum coefficient series data MC_{n,d}[fm] from the storage unit 225, applies a predetermined emphasis process described in detail later to generate emphasized mel cepstrum coefficient series data EmMC_{n,d}[fm], and stores them in the storage unit 225 (step S521).

It is then determined whether generation of the emphasized mel cepstrum coefficient series data has been completed for all N_Sp speech data items recorded in the speech database of FIG. 4 (step S523).

If generation is not yet complete (step S523; Yes), processing moves to the next speech data (step S525) and repeats from the speech data retrieval step (step S513).

If generation is complete (step S523; No), the phoneme HMM for the mel cepstrum is learned from the speech data Sp_n and the emphasized mel cepstrum coefficient series data EmMC_{n,d}[fm] stored in the storage unit 225 of FIG. 2 (step S527).

音素ＨＭＭ学習の結果得られた学習データは、図４のデータ書き出し部４２３に送られる（ステップＳ５２９）。 The learning data obtained as a result of the phoneme HMM learning is sent to the data writing unit 423 in FIG. 4 (step S529).

As described above, unlike known methods, phoneme HMM learning for the mel cepstrum uses the emphasized mel cepstrum coefficient series data EmMC_{n,d}[fm] rather than the unprocessed mel cepstrum coefficient series data MC_{n,d}[fm]. The speech synthesis dictionary constructed by the speech synthesis dictionary construction device 411 according to this embodiment therefore helps a speech synthesizer generate clear synthesized speech.
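For comparison with Embodiment 1, the flow of steps S511 to S527 might be sketched as below. Unlike Embodiment 1, no inverse filtering or resynthesis takes place: the coefficients are emphasized directly and handed to training. The callables `analyze`, `emphasize`, and `train` and the function name `learn_mcep_hmm` are hypothetical placeholders, not part of the source; HMM training itself is not implemented.

```python
from typing import Callable, List, Sequence, Tuple

Mcep = List[List[float]]  # assumed layout: mcep[d][fm] (order d, frame fm)


def learn_mcep_hmm(
    database: Sequence[Sequence[float]],
    analyze: Callable[[Sequence[float]], Mcep],
    emphasize: Callable[[Mcep], Mcep],
    train: Callable[[List[Tuple[Sequence[float], Mcep]]], object],
) -> object:
    """Emphasize the coefficients of every utterance, then train once."""
    pairs = []
    for sp_n in database:              # S511, S523, S525: n = 1 .. N_Sp
        mc = analyze(sp_n)             # S513 to S519: MC_{n,d}[fm]
        em_mc = emphasize(mc)          # S521: EmMC_{n,d}[fm]
        pairs.append((sp_n, em_mc))
    return train(pairs)                # S527: learn from Sp_n and EmMC
```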

(Embodiment 3)
In the speech synthesis dictionary construction device 111 according to Embodiment 1 shown in FIG. 1, the speech data processing unit 131 and the phoneme HMM learning unit 151 are housed in the same casing. The formant-emphasized speech data generated by the synthesis filter 147 is stored in the storage unit 225 of FIG. 2, that is, inside the speech synthesis dictionary construction device 111, and is then retrieved and used by the phoneme HMM learning unit 151.

ここで、合成フィルタ１４７が生成するホルマント強調処理データは、必ずしも、単一の音声合成辞書構築装置１１１の内部の記憶装置、例えば図２に示すハードディスク２４３、に記憶しなくともよい。 Here, the formant emphasis processing data generated by the synthesis filter 147 does not necessarily have to be stored in a storage device inside the single speech synthesis dictionary construction device 111, for example, the hard disk 243 shown in FIG.

Accordingly, the following device set is conceivable as Embodiment 3. The speech data processing unit 131 and the phoneme HMM learning unit 151 of the speech synthesis dictionary construction device 111 according to Embodiment 1 shown in FIG. 1 are separated and provided as a pair of independent devices: a speech data processing device and a phoneme HMM learning device.

この場合、前記２つの装置は、ハードディスク等の外部記録媒体を介して接続される。すなわち、音声データ加工装置は、外部記録媒体に音声ラベル列とホルマント強調処理済音声データの対を記録し、一方、音素ＨＭＭ学習装置は、該外部記録媒体から該対の読み出しを行う。 In this case, the two devices are connected via an external recording medium such as a hard disk. That is, the speech data processing device records a pair of speech label string and formant-emphasized speech data on an external recording medium, while the phoneme HMM learning device reads the pair from the external recording medium.

In other words, the speech data processing device rebuilds the original speech database into a formant-emphasized speech database, while the phoneme HMM learning device has the same configuration as a known device but differs in that the speech database it refers to is not the known one.

(Embodiment 4)
The speech data processing device in the device set according to Embodiment 3 and the speech synthesis dictionary construction device 411 according to Embodiment 2 shown in FIG. 4 may be connected via an external recording medium such as a hard disk.

Alternatively, substantially the same function is obtained by inserting the in-learning-unit formant emphasis unit 445 shown in FIG. 4 between the in-learning-unit mel cepstrum analysis unit 161 and the mel cepstrum learning unit 165 inside the phoneme HMM learning unit 151 of the speech synthesis dictionary construction device 111 according to Embodiment 1 shown in FIG. 1.

According to this embodiment, formant emphasis is performed twice, so a speech synthesis dictionary better suited to being referred to by a speech synthesizer that generates clear synthesized speech can be constructed.

(About emphasis processing)
Qualitatively, mel cepstrum coefficient series data can be regarded as the result of treating the horizontal axis of the speech spectrum, which originally represents frequency, as if it were time, and analyzing what frequency components the spectrum would have if it were a waveform in the real time domain.

一般に、音声スペクトルに現れる明りょうなホルマントの間には、さほど明りょうでないホルマントが存在する。つまり、概して、明りょうなホルマントが比較的広い周波数間隔で分布するのに対し、他のホルマントはかかる顕著なホルマントの間に比較的狭い周波数間隔で分布する傾向がある。 In general, there are less obvious formants between clear formants appearing in the speech spectrum. That is, generally clear formants are distributed at relatively wide frequency intervals, while other formants tend to be distributed at relatively narrow frequency intervals between such prominent formants.

狭い周波数間隔の変動の成分は、ケプストラムにおける”高域”に対応する。かかる高域は、メルケプストラム係数系列データのうち、高次のものに対応する。 Narrow frequency interval variation components correspond to "high frequencies" in the cepstrum. Such a high frequency corresponds to a higher order of the mel cepstrum coefficient series data.

よって、音声データをメルケプストラム分析して得られるメルケプストラム係数系列データのうち、高次のものを増幅することは、音声スペクトルの包絡の山と谷とを強調することを意味し、かかる山と谷の強調により、音声データのホルマントが強調されることになる。このことを、ここでは、強調処理と呼ぶことにする。以下では、強調処理の具体例を示す。 Therefore, amplifying higher-order mel cepstrum coefficient series data obtained by mel cepstrum analysis of voice data means emphasizing peaks and valleys of the envelope of the voice spectrum. By emphasizing the valley, the formant of the voice data is emphasized. Here, this is called enhancement processing. Below, the specific example of an emphasis process is shown.
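As a rough numerical illustration of the paragraph above, the following Python sketch evaluates the log-magnitude envelope implied by a short cepstrum before and after its higher orders are amplified. It assumes the unwarped model log|H(ω)| = Σ_d c_d cos(dω), which simplifies the mel-warped case, and the coefficient values, d_em = 2, and β = 1.5 are arbitrary illustrative choices that do not come from the source.

```python
import math


def log_envelope(cepstrum, omega):
    """Log-magnitude envelope: sum over d of c_d * cos(d * omega)."""
    return sum(c * math.cos(d * omega) for d, c in enumerate(cepstrum))


def peak_to_valley(cepstrum, points=512):
    """Difference between the highest and lowest envelope value on [0, pi]."""
    values = [
        log_envelope(cepstrum, math.pi * k / (points - 1)) for k in range(points)
    ]
    return max(values) - min(values)


# Illustrative values only (not from the source).
original = [0.0, 0.5, 0.3, 0.2]
d_em, beta = 2, 1.5

# Amplify orders d >= d_em by (1 + beta); leave lower orders unchanged.
emphasized = [c * (1.0 + beta) if d >= d_em else c for d, c in enumerate(original)]

print(peak_to_valley(original), peak_to_valley(emphasized))
```

For these values the peak-to-valley range of the envelope grows after amplification, which is the sense in which the peaks and valleys are emphasized.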

The position that the formant emphasis unit performing this emphasis processing occupies within the speech synthesis dictionary construction device may differ between embodiments, but in every case it applies a predetermined arithmetic operation to the mel cepstrum coefficient series data. The following description therefore does not distinguish between the embodiments described above.

(Specific example 1 of emphasis processing)
Specific example 1 of the emphasis processing is described with reference to FIG. 6.

In this specific example, the user first decides the order at and above which the mel cepstrum coefficients are to be amplified, and stores it in the storage unit 225 via the user I/F 227 of FIG. 2 (step S611). In what follows, it is assumed that mel cepstrum coefficients of order d_em and above are amplified (2 ≤ d_em ≤ D_b, where D_b is the order used in the mel cepstrum analysis).

It is assumed that mel cepstrum analysis has already been completed for all N_SP speech data items and that the mel cepstrum coefficient series data MC_{n,d}[fm] (1 ≤ n ≤ N_Sp, 0 ≤ d ≤ D_a, 0 ≤ fm ≤ N_fm[n], where N_fm[n] is the number of frames for the speech data Sp_n) are already stored in the storage unit 225 of FIG. 2.

In the following, only the processing for the n-th speech data is described. To complete the formant emphasis processing, it must be repeated for every n (1 ≤ n ≤ N_Sp).

In accordance with the instructions of the operation program stored in the ROM 223, the CPU 221 of FIG. 2 loads a variable d into a register as a counter. Because d represents the order of the mel cepstrum and 0 ≤ d ≤ D_b, its initial value is 0 (step S613).

In accordance with the instructions of the operation program stored in the ROM 223, the CPU 221 of FIG. 2 determines whether d is greater than or equal to the d_em given in step S611 (step S615).

If d is smaller than d_em (step S615; No), the original mel cepstrum coefficient series data MC_{n,d}[fm] (0 ≤ fm ≤ N_fm[n]) are used unamplified as the emphasized mel cepstrum coefficient series data EmMC_{n,d}[fm]. That is, EmMC_{n,d}[fm] = MC_{n,d}[fm] (step S619).

If d is greater than or equal to d_em (step S615; Yes), the original mel cepstrum coefficient series data MC_{n,d}[fm] (0 ≤ fm ≤ N_fm[n]) are amplified to give the emphasized mel cepstrum coefficient series data EmMC_{n,d}[fm].

In this specific example, amplification is performed by multiplying by a predetermined emphasis coefficient greater than 1 that does not depend on the order. That is, EmMC_{n,d}[fm] = MC_{n,d}[fm] × (1 + β), where β > 1 (step S617).

このように高次のメルケプストラム係数系列データを増幅することは、音声データの高周波成分を強調することになるので、元の音声データに比べてホルマントが強調され、明りょうになる。 Amplifying higher-order mel cepstrum coefficient series data in this way emphasizes the high-frequency component of the audio data, so that the formant is emphasized compared to the original audio data and becomes clear.

When the processing of step S619 or step S617 for order d is finished, it is determined whether d is smaller than D_b (step S621). If d < D_b (step S621; Yes), d is incremented by 1 (step S623) and processing proceeds to step S615 and the subsequent steps for the next order. If d ≥ D_b (step S621; No), the emphasized mel cepstrum coefficient series data EmMC_{n,d}[fm] have been generated for all orders, so the emphasis processing ends.
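Steps S613 to S623 for a single speech data item can be sketched in Python as follows. The nested-list layout `mc[d][fm]` and the function name `emphasize_mcep` are assumptions made for illustration; the arithmetic follows the rule stated above, EmMC = MC × (1 + β) for d ≥ d_em and EmMC = MC otherwise.

```python
def emphasize_mcep(mc, d_em, beta):
    """Return EmMC for one utterance.

    mc[d][fm] holds the coefficient of order d in frame fm (assumed layout).
    """
    em_mc = []
    for d, frames in enumerate(mc):        # S613, S621, S623: d = 0 .. D_b
        if d >= d_em:                      # S615; Yes
            em_mc.append([v * (1.0 + beta) for v in frames])  # S617
        else:                              # S615; No
            em_mc.append(list(frames))     # S619: copy unchanged
    return em_mc


# Three orders, two frames; orders 2 and above are scaled by 1 + 1.5 = 2.5.
mc_example = [[1.0, 2.0], [0.5, 0.25], [0.25, 0.5]]
print(emphasize_mcep(mc_example, d_em=2, beta=1.5))
# prints [[1.0, 2.0], [0.5, 0.25], [0.625, 1.25]]
```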

The processing described above makes it easy to emphasize the formants of speech data. A speech synthesis dictionary construction device that incorporates this processing can construct a speech synthesis dictionary that a speech synthesizer refers to when generating synthesized speech and that is well suited to making the synthesized speech clear.

(Specific example 2 of emphasis processing)
In the specific example above, the emphasis coefficient is the constant 1 + β, but a different emphasis coefficient may be used for each speech data item. That is, in step S617 of FIG. 6, β may be made a function of n, and the emphasized mel cepstrum coefficient series data may be obtained as EmMC_{n,d}[fm] = MC_{n,d}[fm] × (1 + β_n).

As a more specific example, when the speech database 171 or 451 shown in FIG. 1 or FIG. 4 contains speech whose characteristics vary with the situation at the time of recording, the emphasis coefficient may be varied to match the situation in which each speech data item was recorded.

Varying the emphasis coefficient for each speech data item in this way allows the formants of the speech data to be emphasized appropriately even when the speech database contains speech recorded in several different situations. A speech synthesis dictionary construction device that incorporates this processing can construct a speech synthesis dictionary that a speech synthesizer refers to when generating synthesized speech and that is well suited to making the synthesized speech clear.
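One way to realize a per-utterance β_n is a lookup keyed by a recording-situation label, as in the sketch below. The labels, the β values, and the function name are all hypothetical; the source does not say how situations are labeled or which values are suitable.

```python
# Hypothetical situation labels and beta values, chosen only for illustration.
BETA_BY_SITUATION = {"quiet_reading": 1.25, "conversation": 2.0}


def emphasize_by_situation(mc, situation, d_em):
    """Scale orders d >= d_em of mc[d][fm] by (1 + beta_n) for one utterance."""
    beta_n = BETA_BY_SITUATION[situation]
    return [
        [v * (1.0 + beta_n) for v in frames] if d >= d_em else list(frames)
        for d, frames in enumerate(mc)
    ]
```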

(Specific example 3 of emphasis processing)
A different emphasis coefficient may be used for each order d of the mel cepstrum coefficient series data. That is, in step S617 of FIG. 6, β may be made a function of d, and the emphasized mel cepstrum coefficient series data may be obtained as EmMC_{n,d}[fm] = MC_{n,d}[fm] × (1 + β_d).

When emphasizing the formants in a speech spectrum, it is not always appropriate to amplify uniformly the mel cepstrum coefficient series data of every order at or above d_em, the threshold order for emphasis. Uniform amplification can emphasize the peaks and valleys of the spectral envelope more than necessary, so that formants that should not exist appear, which may actually work against the goal of contributing to clear synthesized speech.

そこで、強調係数をメルケプストラム係数系列データの次数ｄによって使い分けることにより、本発明の目的に沿った音声合成辞書構築が可能となる。 Thus, by using different emphasis coefficients depending on the order d of the mel cepstrum coefficient series data, it is possible to construct a speech synthesis dictionary in accordance with the object of the present invention.
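An order-dependent β_d could be supplied as a function of d, as sketched below. The linear taper in `tapered_beta` is purely a hypothetical rule added for illustration; the source does not specify how β_d should vary with the order.

```python
def emphasize_by_order(mc, d_em, beta_of_d):
    """Scale each order d >= d_em of mc[d][fm] by (1 + beta_of_d(d))."""
    return [
        [v * (1.0 + beta_of_d(d)) for v in frames] if d >= d_em else list(frames)
        for d, frames in enumerate(mc)
    ]


def tapered_beta(d, d_em=2, beta_start=2.0, step=0.5):
    """Hypothetical rule: beta shrinks linearly above d_em, never below 0."""
    return max(0.0, beta_start - step * (d - d_em))
```

With this taper, the amplification is strongest at the threshold order and weakens for higher orders, which is one conceivable way to limit over-emphasis of fine spectral ripple.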

(Specific example 4 of emphasis processing)
Specific examples 2 and 3 above may be combined. That is, in step S617 of FIG. 6, β may be made a function of both n and d, and the emphasized mel cepstrum coefficient series data may be obtained as EmMC_{n,d}[fm] = MC_{n,d}[fm] × (1 + β_{n,d}).

Deciding what function of n and d to use for β can become complicated, but this approach achieves formant emphasis that combines the advantages of specific examples 2 and 3.
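One conceivable way to keep the choice of β_{n,d} manageable is to assume it factors into a per-utterance strength and a per-order weight, β_{n,d} = β_n × w_d. This separable form is an assumption made here for illustration and is not proposed in the source; the sketch uses 0-based utterance indices.

```python
def emphasize_separable(mc_list, d_em, beta_n, weight_d):
    """Apply beta_{n,d} = beta_n[n] * weight_d[d] to every utterance.

    mc_list[n][d][fm] is the assumed layout; n is 0-based in this sketch.
    """
    out = []
    for n, mc in enumerate(mc_list):
        out.append([
            [v * (1.0 + beta_n[n] * weight_d[d]) for v in frames]
            if d >= d_em else list(frames)
            for d, frames in enumerate(mc)
        ])
    return out
```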

(Specific example 5 of emphasis processing)
In step S611 of FIG. 6, the user gives d_em, the threshold order for emphasis, as a constant, but it may instead be varied for each speech data item.

This is because the order at and above which it is appropriate to emphasize the mel cepstrum coefficient series data may differ from one speech data item to another.

Carried out literally according to step S611 of FIG. 6, this specific example would require the user to supply a threshold d_em for every individual speech data item. To process a large volume of speech data, step S611 may instead be automated by determining d_em as a function of n according to a predetermined rule, which reduces the burden on the user.
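The source leaves the "predetermined rule" unspecified. Purely as a hypothetical example, the sketch below picks d_em for one utterance as the order just above the point where the low orders account for a given fraction of the coefficient energy (excluding order 0), clamped to the range 2 ≤ d_em ≤ D_b stated earlier. The rule, the default fraction, and the function name `choose_d_em` are all assumptions, and an analysis order of at least 2 is assumed.

```python
def choose_d_em(mc, fraction=0.9):
    """Hypothetical rule for d_em; mc[d][fm] is the assumed layout."""
    order_max = len(mc) - 1                      # D_b, assumed to be >= 2
    energy = [sum(v * v for v in frames) for frames in mc]
    total = sum(energy[1:])                      # ignore order 0 (overall gain)
    cumulative = 0.0
    for d in range(1, order_max + 1):
        cumulative += energy[d]
        if total > 0.0 and cumulative >= fraction * total:
            # Orders above d count as "high"; clamp to 2 <= d_em <= D_b.
            return min(max(d + 1, 2), order_max)
    return order_max
```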

The present invention is not limited to the embodiments described above, and various modifications and applications are possible. For example, the hardware configuration, block configurations, and flowcharts described above are examples and are not limiting. The present invention is also not limited to a dedicated speech synthesis dictionary construction device and can be implemented using any computer. For example, a computer program that causes a computer to execute the processing described above may be distributed on a recording medium or via communication, and installing and executing it on a computer causes that computer to function as the speech synthesis dictionary construction device of the present invention.

FIG. 1 is a functional configuration diagram of the speech synthesis dictionary construction device according to Embodiment 1, in which the formant emphasis unit is provided in the speech data processing unit.
FIG. 2 is a diagram showing the physical configuration of the speech synthesis dictionary construction devices according to Embodiments 1 and 2.
FIG. 3 is a diagram showing the flow of operations in the formant emphasis processing in the speech data processing unit according to Embodiment 1.
FIG. 4 is a functional configuration diagram of the speech synthesis dictionary construction device according to Embodiment 2, in which the formant emphasis unit is provided in the phoneme HMM learning unit.
FIG. 5 is a diagram showing the flow of operations in phoneme HMM learning for the mel cepstrum according to Embodiment 2.
FIG. 6 is a diagram showing the flow of operations for generating emphasized mel cepstrum coefficient series data.

Explanation of symbols

111: speech synthesis dictionary construction device according to Embodiment 1, 121: input unit, 123: data writing unit, 131: speech data processing unit, 141: in-processing-unit mel cepstrum analysis unit, 143: inverse filter, 145: in-processing-unit formant emphasis unit, 147: synthesis filter, 151: phoneme HMM learning unit, 161: in-learning-unit mel cepstrum analysis unit, 163: pitch extraction unit, 165: mel cepstrum learning unit, 167: pitch learning unit, 171: speech database, 173: speech synthesis dictionary, 211: computer device, 221: CPU, 223: ROM, 225: storage unit, 227: user I/F, 229: data input/output I/F, 231: bus, 241: RAM, 243: hard disk, 251: keyboard, 253: monitor, 261: hard disk containing original data, 263: hard disk for recording processed data, 411: speech synthesis dictionary construction device according to Embodiment 2, 421: input unit, 423: data writing unit, 431: phoneme HMM learning unit, 441: in-learning-unit mel cepstrum analysis unit, 443: pitch extraction unit, 445: in-learning-unit formant emphasis unit, 447: mel cepstrum learning unit, 449: pitch learning unit, 451: speech database, 453: speech synthesis dictionary

Claims

A speech synthesis dictionary construction device comprising:
an input unit that inputs a phoneme label string and speech data corresponding to the phoneme label string;
a speech data processing unit that emphasizes spectral parameters of the speech data and converts the speech data into emphasized speech data;
a phoneme HMM learning unit that associates a phoneme HMM (Hidden Markov Model) with each phoneme label on the basis of the phoneme label string and the emphasized speech data; and
a data writing unit that records the learning results in a speech synthesis dictionary.

The speech synthesis dictionary construction device according to claim 1, wherein the speech data processing unit comprises:
an in-processing-unit mel cepstrum analysis unit that performs mel cepstrum analysis on the speech data to generate first mel cepstrum coefficient series data;
an inverse filter that is defined by the first mel cepstrum coefficient series data and generates excitation source data from the speech data;
an in-processing-unit formant emphasis unit that applies a predetermined emphasis process to the first mel cepstrum coefficient series data to generate emphasized mel cepstrum coefficient series data in which the spectral parameters are emphasized; and
a synthesis filter that is defined by the emphasized mel cepstrum coefficient series data and generates formant-enhanced processed speech data from the excitation source data input to it.
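
The data flow recited in this claim can be sketched as follows. This is only an illustration of how the four stages connect, not the implementation in the specification: the function name `process_speech` and the four callables passed to it are hypothetical placeholders, and a real system would supply an actual mel cepstrum analyzer and mel-cepstrum-defined filters for them.

```python
from typing import Callable, List, Sequence

Frame = List[float]    # mel cepstrum coefficients of one analysis frame
Series = List[Frame]   # one frame per analysis window


def process_speech(
    speech: Sequence[float],
    analyze: Callable[[Sequence[float]], Series],
    inverse_filter: Callable[[Sequence[float], Series], List[float]],
    emphasize: Callable[[Series], Series],
    synthesis_filter: Callable[[Sequence[float], Series], List[float]],
) -> List[float]:
    """Wire the four stages together (all four callables are hypothetical)."""
    # 1. Mel cepstrum analysis yields the first coefficient series.
    first_mcep = analyze(speech)
    # 2. The inverse filter, defined by the ORIGINAL coefficients,
    #    strips the spectral envelope and leaves the excitation.
    excitation = inverse_filter(speech, first_mcep)
    # 3. The emphasis process yields the emphasized coefficient series.
    emphasized_mcep = emphasize(first_mcep)
    # 4. The synthesis filter, defined by the EMPHASIZED coefficients,
    #    re-applies an envelope to the same excitation.
    return synthesis_filter(excitation, emphasized_mcep)
```

The point of the arrangement is that the excitation is obtained with the original coefficients and resynthesized with the emphasized ones, so only the spectral envelope of the speech data changes.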

The speech synthesis dictionary construction device according to claim 2, wherein the phoneme HMM learning unit comprises:
an in-learning-unit mel cepstrum analysis unit that performs mel cepstrum analysis on the formant-enhanced speech data to generate second mel cepstrum coefficient series data;
a pitch extraction unit that extracts pitch series data from the formant-enhanced speech data;
a mel cepstrum learning unit that associates a phoneme HMM relating to the mel cepstrum with each phoneme label on the basis of the phoneme label string and the second mel cepstrum coefficient series data; and
a pitch learning unit that associates a phoneme HMM relating to pitch with each phoneme label on the basis of the phoneme label string and the pitch series data.

A speech synthesis dictionary construction device comprising:
an input unit that inputs a phoneme label string and speech data corresponding to the phoneme label string;
an in-learning-unit mel cepstrum analysis unit that performs mel cepstrum analysis on the speech data to generate mel cepstrum coefficient series data;
a pitch extraction unit that extracts pitch series data from the speech data;
an in-learning-unit formant emphasis unit that applies a predetermined emphasis process to the mel cepstrum coefficient series data to generate emphasized mel cepstrum coefficient series data in which spectral parameters are emphasized;
a mel cepstrum learning unit that associates a phoneme HMM (Hidden Markov Model) relating to the mel cepstrum with each phoneme label on the basis of the phoneme label string and the emphasized mel cepstrum coefficient series data;
a pitch learning unit that associates a phoneme HMM relating to pitch with each phoneme label on the basis of the phoneme label string and the pitch series data; and
a data writing unit that records the learning results in a speech synthesis dictionary.

The speech synthesis dictionary construction device according to any one of claims 2 to 4, wherein the predetermined emphasis process is a process of multiplying those coefficients of the mel cepstrum coefficient series data whose order is higher than a predetermined order by a predetermined enhancement coefficient greater than 1.
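
As a minimal sketch of the emphasis process described in this claim, the following assumes that each frame is a plain list of coefficients in which index 0 holds the order-0 coefficient, and that "higher than" means strictly greater. The function names and the parameter names `threshold_order` and `factor` are hypothetical and do not appear in the source.

```python
from typing import List, Sequence


def emphasize_frame(
    coeffs: Sequence[float], threshold_order: int, factor: float
) -> List[float]:
    """Scale every coefficient whose order exceeds threshold_order by factor."""
    if factor <= 1.0:
        # The claim requires an enhancement coefficient greater than 1.
        raise ValueError("the enhancement coefficient must be greater than 1")
    return [
        value * factor if order > threshold_order else value
        for order, value in enumerate(coeffs)
    ]


def emphasize_series(
    frames: Sequence[Sequence[float]], threshold_order: int, factor: float
) -> List[List[float]]:
    """Apply the same emphasis to every frame of a coefficient series."""
    return [emphasize_frame(frame, threshold_order, factor) for frame in frames]
```

For example, `emphasize_frame([1.0, 0.5, -0.25, 0.125], threshold_order=1, factor=2.0)` leaves orders 0 and 1 untouched and doubles orders 2 and 3.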

The speech synthesis dictionary construction device according to claim 5, wherein the predetermined enhancement coefficient can differ for each item of speech data from which the mel cepstrum coefficient series data to be multiplied by the enhancement coefficient is generated.

The speech synthesis dictionary construction device according to claim 5 or 6, wherein the predetermined enhancement coefficient can differ depending on the order of the mel cepstrum coefficient series data to be multiplied by the enhancement coefficient.
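
The two variations above, an enhancement coefficient that depends on the order and one that differs per item of speech data, could be sketched as below. The linear ramp is purely an illustrative choice, not a schedule given in the source, and all names (`emphasize_by_order`, `linear_ramp`, `emphasize_utterances`) are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def emphasize_by_order(
    coeffs: Sequence[float],
    threshold_order: int,
    factor_for_order: Callable[[int], float],
) -> List[float]:
    """Scale each coefficient above threshold_order by an order-dependent factor."""
    emphasized = []
    for order, value in enumerate(coeffs):
        if order > threshold_order:
            factor = factor_for_order(order)
            if factor <= 1.0:
                raise ValueError("the enhancement coefficient must be greater than 1")
            value = value * factor
        emphasized.append(value)
    return emphasized


def linear_ramp(threshold_order: int, step: float = 0.25) -> Callable[[int], float]:
    """Illustrative schedule: the factor grows by `step` per order above the threshold."""
    def factor(order: int) -> float:
        return 1.0 + step * (order - threshold_order)
    return factor


def emphasize_utterances(
    utterances: Dict[str, List[List[float]]],
    threshold_order: int,
    factors: Dict[str, float],
) -> Dict[str, List[List[float]]]:
    """Use a separate, constant enhancement coefficient for each utterance."""
    result = {}
    for utt_id, frames in utterances.items():
        def constant(_order: int, f: float = factors[utt_id]) -> float:
            return f
        result[utt_id] = [
            emphasize_by_order(frame, threshold_order, constant) for frame in frames
        ]
    return result
```

Passing the factor as a function keeps one code path for both cases: a constant function reproduces a single enhancement coefficient, while a schedule such as `linear_ramp` varies it with the order.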

A speech synthesis dictionary construction method comprising:
an input step of inputting, from a database, a phoneme label string and speech data corresponding to the phoneme label string;
a speech data processing step of emphasizing spectral parameters of the speech data and converting the speech data into emphasized speech data;
a phoneme HMM learning step of associating a phoneme HMM (Hidden Markov Model) with each phoneme label on the basis of the phoneme label string and the emphasized speech data; and
a data writing step of recording the learning results in a speech synthesis dictionary.

A speech synthesis dictionary construction method comprising:
an input step of inputting, from a database, a phoneme label string and speech data corresponding to the phoneme label string;
an in-learning-unit mel cepstrum analysis step of performing mel cepstrum analysis on the speech data to generate mel cepstrum coefficient series data;
a pitch extraction step of extracting pitch series data from the speech data;
an in-learning-unit formant emphasis step of applying a predetermined emphasis process to the mel cepstrum coefficient series data to generate emphasized mel cepstrum coefficient series data in which spectral parameters are emphasized;
a mel cepstrum learning step of associating a phoneme HMM (Hidden Markov Model) relating to the mel cepstrum with each phoneme label on the basis of the phoneme label string and the emphasized mel cepstrum coefficient series data;
a pitch learning step of associating a phoneme HMM relating to pitch with each phoneme label on the basis of the phoneme label string and the pitch series data; and
a data writing step of recording the learning results in a speech synthesis dictionary.

A computer program that causes a computer to execute:
an input step of inputting, from a database, a phoneme label string and speech data corresponding to the phoneme label string;
a speech data processing step of emphasizing spectral parameters of the speech data and converting the speech data into emphasized speech data;
a phoneme HMM learning step of associating a phoneme HMM (Hidden Markov Model) with each phoneme label on the basis of the phoneme label string and the emphasized speech data; and
a data writing step of recording the learning results in a speech synthesis dictionary.

A computer program that causes a computer to execute:
an input step of inputting, from a database, a phoneme label string and speech data corresponding to the phoneme label string;
an in-learning-unit mel cepstrum analysis step of performing mel cepstrum analysis on the speech data to generate mel cepstrum coefficient series data;
a pitch extraction step of extracting pitch series data from the speech data;
an in-learning-unit formant emphasis step of applying a predetermined emphasis process to the mel cepstrum coefficient series data to generate emphasized mel cepstrum coefficient series data in which spectral parameters are emphasized;
a mel cepstrum learning step of associating a phoneme HMM (Hidden Markov Model) relating to the mel cepstrum with each phoneme label on the basis of the phoneme label string and the emphasized mel cepstrum coefficient series data;
a pitch learning step of associating a phoneme HMM relating to pitch with each phoneme label on the basis of the phoneme label string and the pitch series data; and
a data writing step of recording the learning results in a speech synthesis dictionary.