JP5228283B2

JP5228283B2 - Speech synthesis dictionary construction device, speech synthesis dictionary construction method, and program

Info

Publication number: JP5228283B2
Application number: JP2006115992A
Authority: JP
Inventors: 勝彦佐藤
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2006-04-19
Filing date: 2006-04-19
Publication date: 2013-07-03
Anticipated expiration: 2026-04-19
Also published as: JP2007286511A

Description

The present invention relates to a speech synthesis dictionary construction apparatus and a speech synthesis dictionary construction method that construct a database used for speech synthesis by referring to a given speech database.

Speech recognition and speech synthesis technologies based on hidden Markov models (hereinafter, HMMs) are widely used.

Examples of documents on HMM-based speech recognition and speech synthesis include Patent Documents 1 to 3.

Patent Document 1: JP 2002-62890 A
Patent Document 2: JP 2002-244689 A
Patent Document 3: JP 2002-268660 A

HMM-based speech recognition and speech synthesis require a speech synthesis dictionary that records the correspondence between phoneme labels and data such as spectral parameter data sequences.

A speech synthesis dictionary is usually constructed by performing spectrum analysis and pitch extraction on the data recorded in a database consisting of pairs of phoneme label sequences and the corresponding speech data (hereinafter, a speech database), and then carrying out an HMM-based learning process.

Conventionally, when a speech synthesis dictionary was constructed, the speech spectral parameter data sequences calculated from the speech data were used for HMM-based learning as they were, without any particular processing.

The HMM-based learning process also often includes one or more re-learning passes to improve the likelihood.

When the HMM-based learning process includes multiple learning stages in this way, a phoneme HMM is determined for each phoneme label at each stage, that correspondence is passed to the next learning stage, and learning proceeds stage by stage.

Conventionally, the phoneme HMMs generated at each stage were sent to the next stage without any particular processing.

A speech synthesis dictionary constructed in this way is used in a speech synthesizer.

Disturbances can arise in the LSP coefficient group time-series data generated by applying LSP analysis to the speech data, and in the parameters (for example, the mean values) that define the phoneme HMMs for the LSP coefficients as a result of phoneme HMM learning.

Conventional methods of constructing a speech synthesis dictionary proceed to the phoneme HMM learning process while ignoring disturbances in the LSP coefficient group time-series data, or continue learning while ignoring disturbances in the parameters that define the phoneme HMMs for the LSP coefficients. As a result, conventional methods could produce a speech synthesis dictionary of insufficient quality.

The present invention has been made in view of the above circumstances, and its object is to provide a speech synthesis dictionary construction apparatus and method that enable high-quality synthesis of speech from text.

A speech synthesis dictionary construction apparatus according to the present invention comprises:
an LSP coefficient group time-series data generation unit that applies LSP analysis to speech data and generates LSP coefficient group time-series data containing multidimensional LSP (Line Spectrum Pair) coefficients;
a pre-learning spectral parameter correction unit that applies correction processing to the LSP coefficient group time-series data generated by the LSP coefficient group time-series data generation unit so that the data satisfy a predetermined stability condition;
a phoneme HMM learning unit that receives a phoneme label sequence and the corrected LSP coefficient group time-series data output from the pre-learning spectral parameter correction unit, and performs, in multiple stages, phoneme HMM learning that associates a phoneme HMM with each phoneme label by learning based on a hidden Markov model;
an in-learning spectral parameter correction unit that applies correction processing to the LSP coefficients defining the phoneme HMM associated with each phoneme label, as obtained in the phoneme HMM learning at each stage of the phoneme HMM learning unit, so that they satisfy a predetermined stability condition, and hands the result over to the next stage of phoneme HMM learning; and
a data writing unit that records, in a speech synthesis dictionary, the phoneme labels and phoneme HMMs associated by the phoneme HMM learning unit.

According to the present invention, the LSP coefficient group time-series data and the parameters that define the phoneme HMMs for the LSP coefficients are corrected, so a speech synthesis dictionary containing appropriate data is constructed. As a result, a speech synthesizer that synthesizes speech by referring to this speech synthesis dictionary can produce high-quality synthesized speech.

A speech synthesis dictionary construction device and method according to embodiments of the present invention are described below.

(Embodiment 1)
FIG. 1 is a schematic configuration diagram of a speech synthesis dictionary construction device 11 with a spectral parameter correction function according to Embodiment 1.

The speech synthesis dictionary construction device 11 includes a data extraction unit 13, a spectrum analysis unit 15, a pre-learning spectral parameter correction unit 17, a phoneme HMM learning unit 19, and a data writing unit 29.

The phoneme HMM learning unit 19 includes a first phoneme HMM learning unit 21, an in-learning spectral parameter correction unit 23, and a second phoneme HMM learning unit 25 through an Nth phoneme HMM learning unit 27.

The speech synthesis dictionary construction device 11 is connected to a speech database 31 and a speech synthesis dictionary 33.

The speech database 31 is a database consisting of pairs of phoneme label sequences and the corresponding speech data, and is stored, for example, on a hard disk device.

The speech synthesis dictionary 33 is a database that stores the phoneme HMM for each phoneme label generated by the speech synthesis dictionary construction device 11, and is stored, for example, on a hard disk. The speech synthesis dictionary 33 is used for synthesizing speech.

The data extraction unit 13 reads pairs of phoneme label sequences and speech data from the speech database 31 and separates each pair into a phoneme label sequence and speech data. It hands the phoneme label sequence to the first phoneme HMM learning unit 21 and the speech data to the spectrum analysis unit 15.

The spectrum analysis unit 15 analyzes the speech data handed over from the data extraction unit 13, generates a speech spectral parameter data sequence representing the spectral envelope of the speech data, and hands it to the pre-learning spectral parameter correction unit 17. In this embodiment, the speech spectral parameter data sequence is a sequence of LSP coefficients.

The pre-learning spectral parameter correction unit 17 applies a correction operation to the speech spectral parameter data sequence handed over from the spectrum analysis unit 15 so that it satisfies a predetermined stability condition, and supplies the corrected speech spectral parameter data sequence to the first phoneme HMM learning unit 21.

The predetermined stability condition is that the LSP coefficients within a single frame are all greater than 0 and less than π and, when arranged in ascending order of dimension, are also in ascending order of value. In other words, as a rule, when LSP coefficients of the same dimension in the LSP coefficient groups of adjacent frames are joined by lines along the time axis, the lines never cross one another and never leave the region between 0 and π.
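As a minimal sketch of this stability condition, the check for a single frame could be written in Python as follows. The function name `is_stable` and the representation of a frame as a list of floats in radians are illustrative assumptions, not part of the patent text.

```python
import math


def is_stable(lsp_frame):
    """Return True if 0 < w_1 < w_2 < ... < w_Nd < pi holds for one frame.

    lsp_frame: LSP coefficients of a single frame, listed in order of
    dimension (assumed representation).
    """
    # Bracket the coefficients with the two bounds so that a single
    # strictly-increasing test covers both the range and the ordering.
    bounds = [0.0, *lsp_frame, math.pi]
    return all(lo < hi for lo, hi in zip(bounds, bounds[1:]))
```

A frame such as `[0.3, 0.9, 1.7, 2.6]` passes the check, whereas swapping its two middle values breaks the ordering and makes the check fail.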

Here, a frame is the interval (cycle) at which LSP analysis is applied to the speech data.

The first phoneme HMM learning unit 21 uses the set of pairs of phoneme label sequences supplied from the data extraction unit 13 and corrected speech spectral parameter data sequences supplied from the pre-learning spectral parameter correction unit 17 to construct, by HMM-based learning, a phoneme HMM for each phoneme label, and supplies these to the in-learning spectral parameter correction unit 23.

The second phoneme HMM learning unit 25 through the Nth phoneme HMM learning unit 27 each re-learn, by HMM-based learning, the corrected phoneme HMMs supplied from the in-learning spectral parameter correction unit 23, and supply the results back to the in-learning spectral parameter correction unit 23.

The in-learning spectral parameter correction unit 23 corrects the mean values, among the parameters defining the phoneme HMMs supplied from the first phoneme HMM learning unit 21 through the Nth phoneme HMM learning unit 27, so that they satisfy the predetermined stability condition. It then supplies the phoneme HMMs to the next-stage phoneme HMM learning unit. That is, the corrected mean values of the phoneme HMMs received from the first through (N-1)th phoneme HMM learning units are supplied to the second through Nth phoneme HMM learning units, respectively. The phoneme HMMs received from the Nth phoneme HMM learning unit 27 are, after correction, supplied to the data writing unit 29.

The data writing unit 29 records the phoneme HMMs in the speech synthesis dictionary 33.

Physically, the speech synthesis dictionary construction device 11 shown in FIG. 1 is implemented by a general-purpose computer device 41 such as that shown in FIG. 2. A CPU 43, a ROM 45, a storage unit 47, a data input/output I/F 51, and a user I/F 49 are interconnected by a bus 71.

The ROM 45 stores an operation program for HMM-based learning; in this embodiment in particular, the operation program includes the operations for correcting spectral parameters.

The storage unit 47 consists of a RAM 65 and a hard disk 67, and stores constants for learning, phoneme label sequences, speech data, spectral parameter data, and the correspondence between phoneme labels and spectral parameter data sequences.

The data input/output I/F 51 is an interface for connecting to, for example, a hard disk 55 holding the original data and a hard disk 57 for recording processed data.

The data input/output I/F 51 is connected to the speech database 31 shown in FIG. 1 and, under the control of the CPU 43 shown in FIG. 2, reads out the pairs of phoneme label sequences and speech data to be learned and stores them in the storage unit 47.

The data input/output I/F 51 is also connected to the speech synthesis dictionary 33 shown in FIG. 1, and outputs to it the phoneme HMMs that result from processing by the CPU 43 shown in FIG. 2.

The user I/F 49 shown in FIG. 2 consists of a keyboard 61 and a monitor 63, and is provided for entering arbitrary instructions, data, and programs.

The CPU 43 carries out the speech synthesis dictionary generation operation by executing the operation program stored in the ROM 45.

As shown in FIG. 1, the speech synthesis dictionary construction device 11 of this embodiment is characterized by having one or more spectral parameter correction units that perform predetermined correction processing so that the spectral parameter data satisfy a predetermined stability condition.

The correction processing executed by the pre-learning spectral parameter correction unit 17 and the in-learning spectral parameter correction unit 23 may be any operation that makes the data satisfy the predetermined stability condition. To aid understanding, the specific operation of the device is described below with reference to particular correction processes.

First, the data extraction unit 13 sequentially retrieves the pairs of phoneme labels and speech data stored in the speech database 31, separates each pair into phoneme labels and speech data, and supplies the phoneme label sequence to the first phoneme HMM learning unit 21 and the speech data to the spectrum analysis unit 15.

The spectrum analysis unit 15 analyzes the supplied speech data by a known method, sequentially generates LSP coefficients, and supplies them to the pre-learning spectral parameter correction unit 17.

The correction operation performed by the pre-learning spectral parameter correction unit 17 is described with reference to the flowchart shown in FIG. 3.

First, $N_{Sp}$ items of speech data are retrieved and stored (step S301), and the $m$-th item $Sp_m$ is selected from among them (step S303). Next, from the LSP coefficient groups of the $N_{fm}$ frames of $Sp_m$, the LSP coefficient group $\omega_{m,k}[fm]$ of frame $fm$ (step S305) is extracted (step S307), where $1 \le k \le N_d$ and $N_d$ is the order of the LSP coefficients. It is then determined whether the extracted LSP coefficients $\omega_{m,k}[fm]$ (where $1 \le k \le N_d$, $0 \le fm \le N_{fm}[m]$, and $N_{fm}[m]$ is the number of frames of speech data $Sp_m$) satisfy the predetermined stability condition for LSP coefficients (step S309).

Here, the predetermined stability condition for the LSP coefficients is
$$0 < \omega_{m,1}[fm] < \omega_{m,2}[fm] < \cdots < \omega_{m,N_d}[fm] < \pi.$$

Whether the LSP coefficients $\omega_{m,k}$ satisfy the stability condition is determined frame by frame, that is, separately for each $\omega_{m,k}[fm]$. A loop is therefore set up in which $fm$ is scanned over the range $0 \le fm \le N_{fm}[m]$ (step S315; if not finished, $fm \leftarrow fm + 1$).

If the predetermined stability condition is satisfied (step S309; satisfied), the LSP coefficient confirmation flag $cf[fm]$ is set to 1 (step S311); if it is not satisfied (step S309; not satisfied), $cf[fm]$ is set to 0 (step S313).

Once $fm$ has been scanned up to $N_{fm}[m]$ (step S315; finished), it is determined whether there is any LSP coefficient confirmation flag with $cf[fm_{NG}] = 0$ (step S317).

If no such flag exists (step S317; does not exist), the correction operation for the $m$-th speech data item $Sp_m$ ends.

If such a flag exists (step S317; exists), the LSP coefficients $\omega_{m,k}[fm_{NG}]$ of each frame $fm_{NG}$ for which $cf[fm_{NG}] = 0$ are corrected (steps S321, S323, S325), and the correction operation for the $m$-th speech data item $Sp_m$ ends.

The correction is carried out as follows. First, it is determined whether $fm_{NG}$ is 0, satisfies $0 < fm_{NG} < N_{fm}[m]$, or equals $N_{fm}[m]$ (step S319).

If step S319 determines that $fm_{NG} = 0$, the correction refers to the LSP coefficients of the second and subsequent frames (step S321). For example, $\omega_{m,k}[fm_{NG}] = \omega_{m,k}[fm_{OK,H}]$, where $1 \le k \le N_d$ and $fm_{OK,H}$ is the smallest frame index greater than $fm_{NG}$ that satisfies $cf[fm_{OK,H}] = 1$.

If step S319 determines that $0 < fm_{NG} < N_{fm}[m]$, the correction refers to the LSP coefficients of frames in the range $0 \le fm \le N_{fm}[m]$ that are adjacent or close to $fm_{NG}$ (step S323). For example, $\omega_{m,k}[fm_{NG}] = \alpha\,\omega_{m,k}[fm_{OK,L}] + \beta\,\omega_{m,k}[fm_{OK,H}]$, where $1 \le k \le N_d$, $fm_{OK,L}$ is the largest frame index smaller than $fm_{NG}$ that satisfies $cf[fm_{OK,L}] = 1$, $fm_{OK,H}$ is the smallest frame index greater than $fm_{NG}$ that satisfies $cf[fm_{OK,H}] = 1$, and $\alpha$ and $\beta$ are weighting coefficients.

If step S319 determines that $fm_{NG} = N_{fm}[m]$, the correction refers to the second-to-last and earlier frames (step S325). For example, $\omega_{m,k}[fm_{NG}] = \omega_{m,k}[fm_{OK,L}]$, where $1 \le k \le N_d$ and $fm_{OK,L}$ is the largest frame index smaller than $fm_{NG}$ that satisfies $cf[fm_{OK,L}] = 1$.

The pre-learning spectral parameter correction unit 17 repeats the above correction processing while advancing through the speech data items and frames in turn (step S315; if not finished, $fm \leftarrow fm + 1$; step S327; if not finished, $m \leftarrow m + 1$). When all frames of all speech data items have been processed, it exits the loops (step S315; finished, step S327; finished).
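The flag-and-correct procedure of steps S309 to S325 can be sketched for one speech data item as below. This is only an illustration under stated assumptions: the names `is_stable` and `correct_utterance` are hypothetical, the weights default to $\alpha = \beta = 0.5$ (the text does not fix their values), and two situations the text does not address are handled by assumption, namely an interior unstable frame that has a stable neighbor on only one side (copied from that side) and an item with no stable frame at all (returned unchanged).

```python
import math


def is_stable(frame):
    """Stability condition: 0 < w_1 < ... < w_Nd < pi for one frame."""
    bounds = [0.0, *frame, math.pi]
    return all(lo < hi for lo, hi in zip(bounds, bounds[1:]))


def correct_utterance(frames, alpha=0.5, beta=0.5):
    """Return a corrected copy of the frames of one speech data item.

    frames: list of per-frame LSP coefficient lists (assumed layout).
    alpha, beta: weighting coefficients; 0.5 each is an assumed default.
    """
    flags = [is_stable(f) for f in frames]            # cf[fm]
    ok = [i for i, good in enumerate(flags) if good]  # frames with cf = 1
    corrected = [list(f) for f in frames]
    if not ok:
        # Assumption: without any stable reference frame, leave data as is.
        return corrected
    for i, good in enumerate(flags):
        if good:
            continue
        lower = max((j for j in ok if j < i), default=None)  # fm_OK,L
        upper = min((j for j in ok if j > i), default=None)  # fm_OK,H
        if lower is None:
            # First frame (S321), or no earlier stable frame: copy fm_OK,H.
            corrected[i] = list(frames[upper])
        elif upper is None:
            # Last frame (S325), or no later stable frame: copy fm_OK,L.
            corrected[i] = list(frames[lower])
        else:
            # Interior frame (S323): weighted blend of the two references.
            corrected[i] = [alpha * a + beta * b
                            for a, b in zip(frames[lower], frames[upper])]
    return corrected
```

With non-negative weights that sum to 1, a blend of two stable frames is itself stable, because both the ordering and the (0, π) range are preserved under convex combination; other weight choices would need a re-check with `is_stable`.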

The pre-learning spectral parameter correction unit 17 supplies the corrected speech spectral parameter data sequence (the sequence of corrected LSP coefficient groups) to the first phoneme HMM learning unit 21.

The first phoneme HMM learning unit 21 associates the phoneme labels with the corrected speech spectral parameter data sequence (the sequence of corrected LSP coefficient groups) and performs HMM-based learning. Any known method may be used for the learning itself.

The in-learning spectral parameter correction unit 23 applies the correction processing shown in the specific examples described later to the phoneme HMMs supplied from the first phoneme HMM learning unit 21, and supplies them to the next-stage phoneme HMM learning unit.

The same processing is then repeated, and the final phoneme HMMs are supplied to the data writing unit 29 and written to the speech synthesis dictionary 33.
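The overall flow described so far (correct the features once before learning, then correct the model after every learning stage) can be sketched as a small driver function. Everything here is a hypothetical illustration: `build_dictionary`, the stage signature `(labels, features, model)`, and the two correction callbacks are assumed names, and the learning stages themselves are left to the caller because the text allows any known learning method.

```python
def build_dictionary(labels, frames, stages, correct_features, correct_model):
    """Run N learning stages with a correction step after each one.

    labels: phoneme label sequence(s) from the speech database.
    frames: LSP coefficient time-series data from spectrum analysis.
    stages: list of callables (labels, features, model) -> model, standing
            in for the 1st..Nth phoneme HMM learning units.
    correct_features: stands in for the pre-learning correction unit 17.
    correct_model: stands in for the in-learning correction unit 23.
    """
    features = correct_features(frames)   # correction before any learning
    model = None                          # the first stage starts from scratch
    for stage in stages:
        model = stage(labels, features, model)
        model = correct_model(model)      # correction before the next stage
    return model                          # what the data writing unit records
```

Passing the corrections in as callbacks keeps the sketch neutral about how stability is restored, which matches the statement that any operation satisfying the stability condition may be used.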

In the speech synthesis dictionary construction device of this embodiment, the pre-learning spectral parameter correction unit 17 and the in-learning spectral parameter correction unit 23 correct the spectral parameters each time they are generated so that they satisfy the predetermined stability condition. This makes it possible to construct a speech synthesis dictionary that contributes to synthesized speech of higher quality.

So far it has been assumed that, when $cf[fm_{NG}] = 0$, $\omega_{m,k}[fm_{NG}]$ is corrected for every order $k$ with $1 \le k \le N_d$. Alternatively, $\omega_{m,k}[fm_{NG}]$ may be corrected only for the orders $k$ that caused the stability condition to fail.

In that case, the LSP coefficients of the orders that need it are corrected and the stability condition is satisfied, while, unlike the case in which every LSP coefficient of frame $fm_{NG}$ is corrected, the LSP coefficients of orders that already satisfied the stability condition are spared unnecessary correction. A more appropriate correction is thus achieved.

(Specific example 1 of in-learning correction)
A specific example of the internals of the phoneme HMM learning unit 19 in the speech synthesis dictionary construction device 11 according to Embodiment 1 is described with reference to the flowchart shown in FIG. 4.

In the speech synthesis dictionary construction device of this specific example, the phoneme HMM learning unit includes first to fifth phoneme HMM learning units and an in-learning spectral parameter correction unit.

The first to fifth phoneme HMM learning units are responsible for, respectively, initialization learning of monophone HMMs, re-learning of monophone HMMs, initialization learning of triphone HMMs, re-learning of triphone HMMs, and clustering using decision trees.

A five-state HMM is adopted as the monophone HMM. State $S_0$ is the initial state and state $S_4$ is the final state; neither outputs LSP coefficients. LSP coefficients are output from states $S_1$, $S_2$, and $S_3$.

The first phoneme HMM learning unit performs initialization learning of a monophone HMM for the LSP coefficients for each phoneme label, using the phoneme label sequence and the speech spectral parameter data sequence $\omega_{m,k}[fm]$ as learning data (step S401).

The learning result is handed to the in-learning spectral parameter correction unit 23 shown in FIG. 1.

For the monophone HMM of each phoneme label, the in-learning spectral parameter correction unit determines whether the mean values $\omega_{k,\mathrm{Ave}}[S_i]$ of the LSP coefficients of each state (where $i$ is 1 to 3, $1 \le k \le N_d$, and $N_d$ is the order of the LSP coefficients) satisfy the stability condition $0 < \omega_{1,\mathrm{Ave}}[S_i] < \omega_{2,\mathrm{Ave}}[S_i] < \cdots < \omega_{N_d,\mathrm{Ave}}[S_i] < \pi$ (step S425). Because this determination must be made for all of $S_1$, $S_2$, and $S_3$, the states are processed in turn using a counter $i$ (steps S423 and S429).

If the stability condition is satisfied (step S425; satisfied) and processing has finished for all of states $S_1$, $S_2$, and $S_3$ (step S429; YES), the correction operation ends.

If the stability condition is not satisfied (step S425; not satisfied), a correction is applied so that it is satisfied (step S427).

The result of the correction operation is handed to the second phoneme HMM learning unit.

The second phoneme HMM learning unit re-learns the monophone HMM for the LSP coefficients for each phoneme label.

The learning result is handed to the in-learning spectral parameter correction unit 23.

The in-learning spectral parameter correction unit 23 performs the same correction operation as above and hands the result to the third phoneme HMM learning unit.

The third phoneme HMM learning unit performs initialization learning of triphone HMMs. That is, it copies the monophone HMMs for the LSP coefficients into triphone HMMs for the LSP coefficients, which take the preceding and following phoneme labels into account, and performs initialization learning.

Like the monophone HMM, the triphone HMM is a five-state HMM, and the mean value of the LSP coefficients in each state of the HMM is denoted $\omega_{k,\mathrm{Ave}}[S_i]$.

The in-learning spectral parameter correction unit 23 performs the same correction operation as above and hands the result to the fourth phoneme HMM learning unit.

The fourth phoneme HMM learning unit re-learns the triphone HMMs.

The in-learning spectral parameter correction unit 23 performs the same correction operation as above and hands the result to the fifth phoneme HMM learning unit.

The fifth phoneme HMM learning unit clusters the triphone HMMs for the LSP coefficients using decision trees, generates models that can also handle combinations of phoneme labels absent from the learning data, and then re-learns the triphone HMMs for the LSP coefficients once more.

The in-learning spectral parameter correction unit performs the same correction operation as above and hands the result to the data writing unit 29 shown in FIG. 1.

このように、音素ＨＭＭ学習部の内部では、各段階の学習が済む毎に、学習中スペクトルパラメータ補正部により、ＬＳＰ係数の補正操作が施される。以下では、図５に示すフローチャートを参照して、当該補正操作の具体的な内容を説明する。 In this manner, in the phoneme HMM learning unit, every time learning is completed, the LSP coefficient correction operation is performed by the in-learning spectral parameter correction unit. Below, the specific content of the said correction operation is demonstrated with reference to the flowchart shown in FIG.

The means ω_{k,Ave}[S_i] of the LSP coefficients in each state of the monophone HMM or triphone HMM are read (step S501), and it is determined whether they satisfy the stability condition 0 < ω_{1,Ave}[S_i] < ω_{2,Ave}[S_i] < ... < ω_{N_d,Ave}[S_i] < π.
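The stability check itself is compact. The following Python sketch is one possible formulation; the function name `is_stable` and the representation of the means of one state as a plain list ordered by k are assumptions made for illustration.

```python
import math


def is_stable(lsp_means):
    """Return True if 0 < w_1 < w_2 < ... < w_Nd < pi holds."""
    # Pad with the two bounds so that one pairwise comparison covers
    # both the range check and the ordering check.
    bounded = [0.0] + list(lsp_means) + [math.pi]
    return all(lo < hi for lo, hi in zip(bounded, bounded[1:]))


stable = is_stable([0.3, 1.2, 2.5])      # strictly increasing inside (0, pi)
unordered = is_stable([0.3, 0.2, 2.5])   # second value not above the first
out_of_range = is_stable([0.3, 1.2, 3.5])  # last value above pi
```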

If the stability condition is satisfied for all of the states S_1 to S_3 (step S503; YES, step S505; YES, step S509; YES), the correction operation ends without any correction.

If the stability condition is not satisfied for state S_1 (step S503; NO) but is satisfied for S_2 and S_3 (step S507; YES, step S513; YES), the means ω_{k,Ave}[S_1] of the LSP coefficients of state S_1 are corrected as
ω_{k,Ave}[S_1] = ω_{k,Ave}[S_2] (where 1 ≤ k ≤ N_d)
(step S521), and the correction operation ends without correcting the LSP coefficients of states S_2 and S_3.

If the stability condition is not satisfied for state S_2 but is satisfied for S_1 and S_3 (step S503; YES, step S505; NO, step S511; YES), the means ω_{k,Ave}[S_2] of the LSP coefficients of state S_2 are corrected as
ω_{k,Ave}[S_2] = (ω_{k,Ave}[S_1] + ω_{k,Ave}[S_3]) / 2 (where 1 ≤ k ≤ N_d)
(step S519), and the correction operation ends without correcting the LSP coefficients of states S_1 and S_3.

If the stability condition is not satisfied for state S_3 but is satisfied for S_1 and S_2 (step S503; YES, step S505; YES, step S509; NO), the means ω_{k,Ave}[S_3] of the LSP coefficients of state S_3 are corrected as
ω_{k,Ave}[S_3] = ω_{k,Ave}[S_2] (where 1 ≤ k ≤ N_d)
(step S517), and the correction operation ends without correcting the LSP coefficients of states S_1 and S_2.

If the stability condition is not satisfied for states S_1 and S_2 but is satisfied for S_3 (step S503; NO, step S507; NO, step S515; YES), the means ω_{k,Ave}[S_1] and ω_{k,Ave}[S_2] of the LSP coefficients of states S_1 and S_2 are corrected as
ω_{k,Ave}[S_1] = ω_{k,Ave}[S_3], ω_{k,Ave}[S_2] = ω_{k,Ave}[S_3] (where 1 ≤ k ≤ N_d)
(step S527), and the correction operation ends without correcting the LSP coefficients of state S_3.

If the stability condition is not satisfied for states S_1 and S_3 but is satisfied for S_2 (step S503; NO, step S507; YES, step S513; NO), the means ω_{k,Ave}[S_1] and ω_{k,Ave}[S_3] of the LSP coefficients of states S_1 and S_3 are corrected as
ω_{k,Ave}[S_1] = ω_{k,Ave}[S_2], ω_{k,Ave}[S_3] = ω_{k,Ave}[S_2] (where 1 ≤ k ≤ N_d)
(step S525), and the correction operation ends without correcting the LSP coefficients of state S_2.

If the stability condition is not satisfied for states S_2 and S_3 but is satisfied for S_1 (step S503; YES, step S505; NO, step S511; NO), the means ω_{k,Ave}[S_2] and ω_{k,Ave}[S_3] of the LSP coefficients of states S_2 and S_3 are corrected as
ω_{k,Ave}[S_2] = ω_{k,Ave}[S_1], ω_{k,Ave}[S_3] = ω_{k,Ave}[S_1] (where 1 ≤ k ≤ N_d)
(step S523), and the correction operation ends without correcting the LSP coefficients of state S_1.

If the stability condition is not satisfied for any of the states S_1 to S_3 (step S503; NO, step S507; NO, step S515; NO), the values are returned to their state before the learning (step S529), and the correction operation ends.
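The eight cases above can be summarized in a short Python sketch. It is an illustration only, not the implementation of the device. The names `is_stable` and `correct_means`, the list-based data layout, and the `previous` argument holding the means from before the learning stage are assumptions. The step numbers in the comments refer to the correction steps named above.

```python
import math


def is_stable(w):
    """True if 0 < w_1 < ... < w_Nd < pi."""
    bounded = [0.0] + list(w) + [math.pi]
    return all(a < b for a, b in zip(bounded, bounded[1:]))


def correct_means(means, previous):
    """means, previous: [w[S_1], w[S_2], w[S_3]], each a list of N_d values.

    previous holds the means from before the current learning stage.
    """
    s1, s2, s3 = means
    ok = tuple(is_stable(s) for s in means)
    if ok == (True, True, True):
        return [list(s1), list(s2), list(s3)]      # no correction
    if ok == (False, True, True):
        return [list(s2), list(s2), list(s3)]      # S521: S_1 <- S_2
    if ok == (True, False, True):
        mid = [(a + b) / 2 for a, b in zip(s1, s3)]
        return [list(s1), mid, list(s3)]           # S519: S_2 <- average
    if ok == (True, True, False):
        return [list(s1), list(s2), list(s2)]      # S517: S_3 <- S_2
    if ok == (False, False, True):
        return [list(s3), list(s3), list(s3)]      # S527: S_1, S_2 <- S_3
    if ok == (False, True, False):
        return [list(s2), list(s2), list(s2)]      # S525: S_1, S_3 <- S_2
    if ok == (True, False, False):
        return [list(s1), list(s1), list(s1)]      # S523: S_2, S_3 <- S_1
    return [list(p) for p in previous]             # S529: revert


good1, good2, good3 = [0.25, 1.0, 2.0], [0.5, 1.25, 2.25], [0.75, 1.5, 2.5]
bad = [0.25, 0.125, 2.0]  # second value is not above the first
previous = [good1, good2, good3]
corrected = correct_means([good1, bad, good3], previous)
```

In the averaging case the result is itself stable whenever S_1 and S_3 are, because the stability condition consists of strict linear inequalities and therefore also holds for the midpoint of two vectors that satisfy it.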

(Specific example 2 of in-learning correction)
In the preceding specific example, the correction operation performed when the stability condition is not satisfied is applied to every order k of the LSP coefficients, that is, to all k with 1 ≤ k ≤ N_d. In this example, by contrast, the correction is applied only to the orders k that caused the stability condition to be violated.

Limiting the correction target in this way prevents correction from being applied to orders k that did not need it. As a result, the speech synthesis dictionary that is constructed contributes to higher-quality synthesized speech than the speech synthesis dictionary constructed by the speech synthesis dictionary construction device according to the fourth embodiment.
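A per-order correction might look like the following Python sketch. The description does not state how the orders that caused the violation are identified, so the rule used here is an assumption: an order counts as offending if its value lies outside (0, π) or is not greater than the value of the preceding order. The helper names `offending_orders` and `patch_orders` are hypothetical, and the indices are 0-based, whereas k in the text starts at 1.

```python
import math


def offending_orders(w):
    """0-based indices whose value breaks 0 < w[0] < ... < w[-1] < pi.

    Assumed rule: an order offends if it is out of range or not
    greater than its predecessor.
    """
    bad = []
    for k, value in enumerate(w):
        lower = w[k - 1] if k > 0 else 0.0
        if not (lower < value < math.pi):
            bad.append(k)
    return bad


def patch_orders(target, donor):
    """Copy donor values into target only at target's offending orders."""
    bad = set(offending_orders(target))
    return [donor[k] if k in bad else target[k] for k in range(len(target))]


s1 = [0.25, 0.125, 2.0]   # second value violates the ordering
s2 = [0.5, 1.0, 1.5]      # stable neighbouring state
patched = patch_orders(s1, s2)
```

Replacing only some orders does not guarantee in general that the patched vector satisfies the stability condition, so an implementation along these lines would probably need to re-check the result.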

(Specific example 3 of in-learning correction)
In this example, in addition to the correction of the LSP coefficients in specific example 1, a predetermined appropriateness criterion is imposed on the variances ω_{k,Var}[S_i] of the LSP coefficients in each state of the HMM, and any variance of order k that has an inappropriate value is corrected to an appropriate value.
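The appropriateness criterion for the variances is not specified here. As a purely hypothetical illustration, the Python sketch below treats a variance as appropriate when it lies within a fixed interval and clamps values outside it. The function name `correct_variances` and the `floor` and `ceiling` values are assumptions, not values taken from this description.

```python
def correct_variances(variances, floor=1e-4, ceiling=1.0):
    """Clamp each per-order variance into [floor, ceiling].

    floor and ceiling are hypothetical bounds chosen for illustration.
    """
    return [min(max(v, floor), ceiling) for v in variances]


# A zero variance and an overly large one are pulled back into range.
fixed = correct_variances([0.0, 0.01, 5.0])
```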

(Embodiment 2)
The speech synthesis dictionary construction device according to this embodiment is the speech synthesis dictionary construction device according to the first embodiment with the pre-learning spectral parameter correction unit omitted.

Because the in-learning spectral parameter correction unit 23 repeatedly applies correction inside the phoneme HMM learning unit 19, a speech synthesizer that uses the speech synthesis dictionary constructed by the device according to this embodiment outputs higher-quality synthesized speech than one that uses a speech synthesis dictionary constructed by a conventional speech synthesis dictionary construction device.

Because the pre-learning spectral parameter correction unit is omitted, a speech synthesizer that uses the speech synthesis dictionary constructed by the device according to this embodiment outputs synthesized speech of lower quality than one that uses the speech synthesis dictionary constructed by the device according to the first embodiment.

However, omitting the pre-learning spectral parameter correction unit gives the device according to this embodiment the advantage of a simpler structure than the device according to the first embodiment.

The present invention is not limited to the above embodiments, and various modifications and applications are possible. For example, the hardware configuration, block configuration, and flowcharts described above are examples and are not limiting. Furthermore, the invention is not limited to a dedicated speech synthesis dictionary construction device and can be implemented using any computer. For example, a computer program for causing a computer to execute the above processing may be distributed on a recording medium or via communication, and installing and executing it on a computer allows that computer to function as the speech synthesis dictionary construction device of the invention.

FIG. 1 is a schematic block diagram of the speech synthesis dictionary construction device and method according to the present invention.
FIG. 2 is a diagram showing the physical configuration of the speech synthesis dictionary construction device according to Embodiment 1.
FIG. 3 is a flowchart showing the flow of operations in the pre-learning correction.
FIG. 4 is a flowchart showing the flow of learning in the first phoneme HMM learning unit.
FIG. 5 is a flowchart showing the flow of operations in the in-learning correction.

Explanation of symbols

11 ... speech synthesis dictionary construction device, 13 ... data extraction unit, 15 ... spectrum analysis unit, 17 ... pre-learning spectral parameter correction unit, 19 ... phoneme HMM learning unit, 21 ... first phoneme HMM learning unit, 23 ... in-learning spectral parameter correction unit, 25 ... second phoneme HMM learning unit, 27 ... N-th phoneme HMM learning unit, 29 ... data writing unit, 31 ... speech database, 33 ... speech synthesis dictionary, 41 ... computer device, 43 ... CPU, 45 ... ROM, 47 ... storage unit, 49 ... user I/F, 51 ... data input/output I/F, 55 ... hard disk containing source data, 57 ... hard disk for recording processed data, 61 ... keyboard, 63 ... monitor, 65 ... RAM, 67 ... hard disk, 71 ... bus

Claims

A speech synthesis dictionary construction device comprising:
an LSP coefficient group time-series data generation unit that performs LSP analysis on speech data and generates LSP coefficient group time-series data including multidimensional LSP (Line Spectrum Pair) coefficients;
a pre-learning spectral parameter correction unit that applies correction processing to the LSP coefficient group time-series data generated by the LSP coefficient group time-series data generation unit so that it satisfies a predetermined stability condition;
a phoneme HMM learning unit that receives a phoneme label sequence and the corrected LSP coefficient group time-series data output from the pre-learning spectral parameter correction unit, and performs, in a plurality of stages, phoneme HMM learning that associates a phoneme HMM with each phoneme label by learning based on a hidden Markov model;
an in-learning spectral parameter correction unit that applies correction processing to the LSP coefficients defining the phoneme HMM associated with each phoneme label, obtained at each stage of the phoneme HMM learning in the phoneme HMM learning unit, so that they satisfy a predetermined stability condition, and hands them over to the next stage of phoneme HMM learning; and
a data writing unit that records, in a speech synthesis dictionary, the phoneme labels and phoneme HMMs associated by the phoneme HMM learning unit.

The speech synthesis dictionary construction device according to claim 1, wherein the correction processing determines whether the LSP coefficient group satisfies the predetermined stability condition and, if it does not, replaces it with an LSP coefficient group that satisfies the predetermined stability condition.

The speech synthesis dictionary construction device according to claim 1, wherein
the correction processing includes replacing the LSP coefficient group with a coefficient group that satisfies the predetermined stability condition, and
the predetermined stability condition is that the LSP coefficients are all greater than 0 and less than π and are in ascending order of value when arranged by order.

A speech synthesis dictionary construction method comprising:
performing LSP analysis on speech data to generate LSP coefficient group time-series data including multidimensional LSP (Line Spectrum Pair) coefficients;
applying correction processing to the generated LSP coefficient group time-series data so that it satisfies a predetermined stability condition;
receiving a phoneme label sequence and the corrected LSP coefficient group time-series data, and performing phoneme HMM learning that associates a phoneme HMM with each phoneme label by learning based on a hidden Markov model;
in the phoneme HMM learning, applying correction processing to the LSP coefficients defining the phoneme HMM associated with each phoneme label, obtained at each stage of the phoneme HMM learning, so that they satisfy a predetermined stability condition, and handing them over to the next stage of phoneme HMM learning; and
recording the associated phoneme labels and phoneme HMMs in a speech synthesis dictionary.

A computer program that causes a computer to execute processing comprising:
performing LSP analysis on speech data to generate LSP coefficient group time-series data including multidimensional LSP (Line Spectrum Pair) coefficients;
applying correction processing to the generated LSP coefficient group time-series data so that it satisfies a predetermined stability condition;
receiving a phoneme label sequence and the corrected LSP coefficient group time-series data, and performing phoneme HMM learning that associates a phoneme HMM with each phoneme label by learning based on a hidden Markov model;
in the phoneme HMM learning, applying correction processing to the LSP coefficients defining the phoneme HMM associated with each phoneme label, obtained at each stage of the phoneme HMM learning, so that they satisfy a predetermined stability condition, and handing them over to the next stage of phoneme HMM learning; and
recording the associated phoneme labels and phoneme HMMs in a speech synthesis dictionary.