JP4816201B2

JP4816201B2 - Speech processing apparatus and method, text speech synthesis apparatus, and program

Info

Publication number: JP4816201B2
Application number: JP2006096139A
Authority: JP
Inventors: 勝彦佐藤
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2006-03-30
Filing date: 2006-03-30
Publication date: 2011-11-16
Anticipated expiration: 2026-03-30
Also published as: JP2007271829A

Abstract

PROBLEM TO BE SOLVED: To output high-quality synthesized speech even when an LSP coefficient group no longer satisfies the stability condition. SOLUTION: Time-series data of LSP coefficient groups are generated. If an abrupt LSP coefficient is obtained and the stability condition is no longer satisfied, the correction method is switched according to the position of the frame in which the abrupt LSP coefficient was output, while other LSP coefficients corresponding to the same phoneme HMM (hidden Markov model) and the same state are referred to, thereby minimizing deterioration in speech quality. COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a speech processing apparatus and method, a text-to-speech synthesizer, and a program that are applied in the process of creating speech data from a given text character string.
The present invention also relates to a speech synthesizer that creates speech data from a given text character string.

Speech recognition and speech synthesis technologies based on hidden Markov models (hereinafter, HMMs) have been very successful.

Speech recognition and speech synthesis techniques based on HMMs are disclosed in, for example, Patent Document 1.
Patent Document 1: JP 2002-62890 A

One spectral parameter used in HMM-based speech synthesis is the LSP (Line Spectrum Pair) coefficient group.

One LSP coefficient group is assigned to each time segment, called a frame. When an m-dimensional feature vector is used with the HMM, each LSP coefficient group consists of m LSP coefficients, one per dimension. Every LSP coefficient lies between 0 and π. Furthermore, if the m LSP coefficients in frame fm are written, in order from the first dimension, as ω_1[fm], ω_2[fm], ..., ω_m[fm], then the relationship ω_1[fm] < ω_2[fm] < ... < ω_m[fm] holds.

The above restrictions on the range of the LSP coefficients and on their ordering are hereinafter referred to as the stability condition.
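
The stability condition can be illustrated with a short Python sketch. The function name is_stable and the representation of one frame as a list of floats are assumptions made for illustration only.

```python
import math


def is_stable(lsp):
    """Return True if 0 < w_1 < w_2 < ... < w_m < pi holds for one frame.

    `lsp` is the LSP coefficient group of a single frame, ordered by
    dimension (hypothetical representation: a list of floats in radians).
    """
    # Pad with the two bounds so that range and ordering are checked together.
    bounded = [0.0] + list(lsp) + [math.pi]
    return all(a < b for a, b in zip(bounded, bounded[1:]))
```

Padding the group with 0 and π reduces the range restriction and the ordering restriction to a single chain of strict inequalities.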

Normally, the stability condition is satisfied. That is, in principle, when LSP coefficients of the same dimension in the LSP coefficient groups of adjacent frames are connected by lines along the time axis, the lines do not cross one another, and they do not leave the region between 0 and π.

In practice, however, LSP coefficients that do not satisfy the stability condition may be obtained in the process of maximizing the likelihood of a phoneme HMM sequence, for example sentence by sentence.

Hereinafter, this case is described by saying that "a sudden LSP coefficient has been obtained."

If sudden LSP data is left uncorrected, the quality of the synthesized speech deteriorates.

Patent Document 1 therefore suggests a method for suppressing speech degradation when a sudden LSP coefficient is obtained. This method takes the LSP coefficients of the two frames adjacent to the frame to which the sudden LSP coefficient is assigned, calculates their average for each dimension, and replaces the sudden LSP coefficient with that average, thereby correcting the coefficients so that the stability condition is satisfied.

Hereinafter, this replacement processing is referred to as simple adjacent average interpolation processing.

In simple adjacent average interpolation processing, whatever the frame fm containing the sudden LSP coefficient, an average is calculated for every dimension d (1 ≤ d ≤ m) from the adjacent LSP coefficient groups as ω_d[fm] = (ω_d[fm-1] + ω_d[fm+1]) / 2, and the frame is interpolated with these averages.
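
As a minimal sketch of simple adjacent average interpolation processing, the formula above can be written in Python as follows. The function name, the 0-based frame index, and the list-of-lists layout are assumptions for illustration; the frame is assumed to have a neighbor on both sides.

```python
def simple_adjacent_average(frames, fm):
    """Replace every dimension of frame `fm` by the mean of its two neighbors.

    `frames` is a list of LSP coefficient groups (one list per frame) and
    `fm` is a 0-based index with 0 < fm < len(frames) - 1.
    """
    prev_frame, next_frame = frames[fm - 1], frames[fm + 1]
    # w_d[fm] = (w_d[fm-1] + w_d[fm+1]) / 2 for every dimension d
    return [(a + b) / 2 for a, b in zip(prev_frame, next_frame)]


frames = [[0.25, 1.0], [2.0, 0.5], [0.75, 1.5]]
frames[1] = simple_adjacent_average(frames, 1)  # [0.5, 1.25]
```

Note that every dimension of the frame is overwritten, including dimensions that did not violate the stability condition.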

However, even though simple adjacent average interpolation processing satisfies the stability condition of the LSP coefficients, it may interpolate with inappropriate LSP coefficients when the difference between adjacent LSP coefficients of the same dimension is large. In particular, the difference between adjacent LSP coefficients may become large when the HMM state switches. In such cases, deterioration of the synthesized speech may not be suppressed.

In addition, simple adjacent average interpolation processing also interpolates LSP coefficients of dimensions other than those that caused the stability condition to be violated, that is, LSP coefficients that were not sudden in the first place and did not need interpolation. This may in fact lower the quality of the synthesized speech.

The present invention has been made in view of the above drawbacks of the prior art, and its object is to provide a speech processing apparatus and a text-to-speech synthesizer capable of generating high-quality synthesized speech by performing appropriate correction processing.

Another object of the present invention is to enable generation of synthesized speech with high sound quality.

In order to achieve the above object, a speech processing apparatus according to a first aspect of the present invention comprises: phoneme label string conversion means for converting a given text character string into a phoneme label string; time-series data generation means for generating, from the phoneme label string output by the phoneme label string conversion means, time-series data of LSP coefficient groups each including LSP (Line Spectrum Pair) coefficients of a plurality of dimensions; and correction means for determining, for each LSP coefficient group constituting the LSP coefficient group time-series data generated by the time-series data generation means, whether a predetermined stability condition is satisfied and, when it determines that the condition is not satisfied, correcting the LSP coefficient group so that it satisfies the predetermined stability condition, switching the correction method according to whether the LSP coefficient group is located at the head, at the tail, or elsewhere in the one state of the phoneme HMM (hidden Markov model) that output it. The predetermined stability condition is that the plurality of LSP coefficients constituting each LSP coefficient group are all greater than 0 and smaller than π and, when arranged in ascending order of dimension, are in ascending order of value. When one LSP coefficient group that does not satisfy the predetermined stability condition is located at the head of the one state, the correction means corrects it using at least the LSP coefficient group located second in that state; when it is located at the tail of the one state, the correction means corrects it using at least the LSP coefficient group located second from the tail in that state; and when it is located elsewhere in the one state, the correction means corrects it using at least the LSP coefficient groups adjacent to it in that state.

In order to achieve the above object, a text-to-speech synthesizer according to a second aspect of the present invention comprises: the speech processing apparatus; means for supplying text data to be synthesized to the phoneme label string conversion means of the speech processing apparatus; and synthesized speech generation means for generating synthesized speech by inputting excitation sound source data to an LSP synthesis filter whose LSP synthesis filter coefficients are the LSP coefficient group time-series data generated by the speech processing apparatus.

In order to achieve the above object, a speech processing method according to a third aspect of the present invention comprises: a phoneme label string conversion step of converting a text character string into a phoneme label string; a time-series data generation step of generating, from the phoneme label string, time-series data of LSP coefficient groups each including LSP (Line Spectrum Pair) coefficients; and a step of correcting, when a predetermined stability condition is not satisfied for an individual LSP coefficient group constituting the LSP coefficient group time-series data, that LSP coefficient group so that it satisfies the predetermined stability condition, switching the correction method according to whether the LSP coefficient group is located at the head, at the tail, or elsewhere in the one state of the phoneme HMM that output it. The predetermined stability condition is that the plurality of LSP coefficients constituting each LSP coefficient group are all greater than 0 and smaller than π and, when arranged in ascending order of dimension, are in ascending order of value. In the correcting step, when one LSP coefficient group that does not satisfy the predetermined stability condition is located at the head of the one state, it is corrected using at least the LSP coefficient group located second in that state; when it is located at the tail of the one state, it is corrected using at least the LSP coefficient group located second from the tail in that state; and when it is located elsewhere in the one state, it is corrected using at least the LSP coefficient groups adjacent to it in that state.

In order to achieve the above object, a computer program according to a fourth aspect of the present invention causes a computer to execute: a phoneme label string conversion step of converting a text character string into a phoneme label string; a time-series data generation step of generating, from the phoneme label string, time-series data of LSP coefficient groups each including LSP (Line Spectrum Pair) coefficients; and a step of correcting, when a predetermined stability condition is not satisfied for an individual LSP coefficient group constituting the LSP coefficient group time-series data, that LSP coefficient group so that it satisfies the predetermined stability condition, switching the correction method according to whether the LSP coefficient group is located at the head, at the tail, or elsewhere in the one state of the phoneme HMM that output it. The predetermined stability condition is that the plurality of LSP coefficients constituting each LSP coefficient group are all greater than 0 and smaller than π and, when arranged in ascending order of dimension, are in ascending order of value. In the correcting step, when one LSP coefficient group that does not satisfy the predetermined stability condition is located at the head of the one state, it is corrected using at least the LSP coefficient group located second in that state; when it is located at the tail of the one state, it is corrected using at least the LSP coefficient group located second from the tail in that state; and when it is located elsewhere in the one state, it is corrected using at least the LSP coefficient groups adjacent to it in that state.

According to the present invention, even if a sudden LSP coefficient is obtained in the course of HMM-based speech synthesis, a more appropriate correction is achieved and, as a result, higher-quality text-to-speech synthesis can be realized.

Hereinafter, a text-to-speech synthesizer according to an embodiment of the present invention will be described in detail.

(Embodiment 1)
First, the configuration of the text-to-speech synthesizer according to this embodiment will be described. FIG. 1 is a functional configuration diagram of the text-to-speech synthesizer 81 according to this embodiment.

As shown in the figure, the text-to-speech synthesizer 81 includes a text capturing unit 13, a phoneme label string conversion unit 15, a time-series data generation unit 83, a spectrum parameter correction unit 19, an excitation sound source generation unit 85, a synthesis filter 87, and a speaker unit 23.

The text-to-speech synthesizer 81 is connected to a speech synthesis dictionary 25. The speech synthesis dictionary 25 stores, for each phoneme label, a phoneme HMM for LSP coefficients and a phoneme HMM for pitch. The parameters that define the phoneme HMM for LSP coefficients include the mean, the variance, the transition probability of each state, and the output probability of the LSP coefficients in each state; the parameters that define the phoneme HMM for pitch include the mean, the variance, the transition probability of each state, and the output probability of the pitch in each state. These data are stored on a hard disk or the like.

The text capturing unit 13 of the text-to-speech synthesizer 81 captures a given text character string.

The phoneme label string conversion unit 15 converts the captured text character string into a phoneme label string by converting its text characters into phoneme labels one after another in the order of capture, and passes the phoneme label string to the time-series data generation unit 83, which generates the spectrum and pitch time-series data.

The specific form of the phoneme label is arbitrary as long as it is compatible with the specification of the speech synthesis dictionary 25. For example, with triphones, which treat a triplet of phonemes as one speech unit, a search of the speech synthesis dictionary 25 also requires information on the phonemes before and after the target phoneme. Each phoneme label therefore takes into account information on the preceding and following phonemes.
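
For illustration, context-dependent triphone labels of this kind might be built as in the following Python sketch. The "left-center+right" label notation, the "sil" boundary symbol, and the function name are assumptions; the actual label format depends on the speech synthesis dictionary 25.

```python
def to_triphone_labels(phonemes, boundary="sil"):
    """Attach the preceding and following phoneme to each phoneme.

    Labels use the hypothetical notation "left-center+right"; utterance
    boundaries are padded with the hypothetical symbol `boundary`.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


labels = to_triphone_labels(["k", "a", "i"])
# ['sil-k+a', 'k-a+i', 'a-i+sil']
```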

Using the received phoneme label string as a key, the time-series data generation unit 83 retrieves the corresponding phoneme HMMs for LSP coefficients and for pitch from the speech synthesis dictionary 25 and connects them in the order in which the phoneme labels appear. The connected results are the phoneme HMM sequences for LSP coefficients and for pitch, respectively. From these phoneme HMM sequences, the unit generates LSP coefficient group time-series data and pitch time-series data, respectively, so that the likelihood is maximized.

That is, the time-series data generation unit 83 generates LSP coefficient group time-series data and pitch time-series data from the received phoneme label string.

The time-series data generation unit 83 passes the generated LSP coefficient group time-series data to the spectrum parameter correction unit 19 and passes the generated pitch time-series data to the excitation sound source generation unit 85.

The spectrum parameter correction unit 19 performs predetermined correction processing on the received LSP coefficient group time-series data.

The predetermined correction processing checks, frame by frame, whether the LSP coefficient group time-series data satisfies the predetermined stability condition and, where it does not, corrects the LSP coefficient group so that it does, with reference to other LSP coefficient groups generated from the state of the phoneme HMM that output that LSP coefficient group. Details of this correction processing will be described later with reference to FIGS. 3 and 4.
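
The position-dependent switching of the correction method can be sketched in Python as follows. This is only an illustrative sketch: the function name correct_frame, the 0-based inclusive indices, and in particular the concrete formulas (copying the neighboring group at the head and tail of a state, averaging the two neighbors elsewhere, leaving a single-frame state unchanged) are assumptions. The text here states only which LSP coefficient groups of the same state are referred to, not how they are combined.

```python
def correct_frame(frames, fm, state_start, state_end):
    """Return a corrected LSP coefficient group for frame `fm`.

    `state_start` and `state_end` are the 0-based indices of the first and
    last frame output by the same HMM state as `fm` (inclusive). Only frames
    inside that state are referred to.
    """
    if state_start == state_end:
        # Assumption: a single-frame state has no other frame to refer to.
        return list(frames[fm])
    if fm == state_start:
        # Head of the state: refer to the second frame of the state.
        return list(frames[fm + 1])
    if fm == state_end:
        # Tail of the state: refer to the second-to-last frame of the state.
        return list(frames[fm - 1])
    # Elsewhere: refer to both adjacent frames, which lie in the same state.
    return [(a + b) / 2 for a, b in zip(frames[fm - 1], frames[fm + 1])]
```

Restricting the reference frames to the same state avoids averaging across a state boundary, where adjacent LSP coefficients of the same dimension may differ widely.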

The spectrum parameter correction unit 19 passes the corrected LSP coefficient group time-series data to the synthesis filter 87.

Meanwhile, the excitation sound source generation unit 85 receives the pitch time-series data from the time-series data generation unit 83, generates excitation sound source data, and passes it to the synthesis filter 87.
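
How the excitation sound source data is derived from the pitch time-series data is not specified here. One common approach, shown below purely as an assumed sketch, is a unit pulse train at the pitch period for voiced frames and low-level white noise for unvoiced frames. The function name, the convention that a pitch of 0 marks an unvoiced frame, the frame length, the sampling rate, and the noise level are all hypothetical.

```python
import random


def make_excitation(pitch_hz, frame_len=80, sample_rate=16000, seed=0):
    """Build an excitation signal from a per-frame pitch track (assumed scheme).

    `pitch_hz` holds one pitch value in Hz per frame; 0 marks an unvoiced frame.
    """
    rng = random.Random(seed)
    out = []
    next_pulse = 0.0  # sample position at which the next pulse is due
    for i, f0 in enumerate(pitch_hz):
        start = i * frame_len
        for n in range(start, start + frame_len):
            if f0 > 0:
                if n >= next_pulse:
                    out.append(1.0)                    # voiced: unit pulse
                    next_pulse = n + sample_rate / f0  # one pitch period later
                else:
                    out.append(0.0)
            else:
                out.append(0.1 * rng.gauss(0.0, 1.0))  # unvoiced: white noise
                next_pulse = n + 1                     # restart pulses after noise
    return out
```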

The corrected LSP coefficient group time-series data and the excitation sound source data are input to the synthesis filter 87. The synthesis filter 87, whose characteristics are defined by the corrected LSP coefficients, generates synthesized speech data from the input excitation sound source data and transmits it to the speaker unit 23.
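
The internal structure of the synthesis filter 87 is not detailed here. As a hedged sketch, the standard relationship between LSP coefficients and an all-pole filter 1/A(z) can be used: A(z) = (P(z) + Q(z)) / 2, where the odd-numbered LSP coefficients are the roots of the symmetric polynomial P(z) and the even-numbered ones the roots of the antisymmetric polynomial Q(z). The function names, the restriction to an even order, and the use of a single fixed coefficient group (rather than per-frame updates) are assumptions.

```python
import math


def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists in powers of z^-1."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def lsp_to_lpc(lsp):
    """Convert one LSP coefficient group (even order) to [1, a_1, ..., a_m]."""
    if len(lsp) % 2 != 0:
        raise ValueError("this sketch assumes an even number of LSP coefficients")
    p = [1.0, 1.0]   # P(z) starts with the factor (1 + z^-1)
    q = [1.0, -1.0]  # Q(z) starts with the factor (1 - z^-1)
    for k, w in enumerate(lsp):
        factor = [1.0, -2.0 * math.cos(w), 1.0]
        if k % 2 == 0:
            p = poly_mul(p, factor)  # w_1, w_3, ... belong to P(z)
        else:
            q = poly_mul(q, factor)  # w_2, w_4, ... belong to Q(z)
    # The highest-order terms of P and Q cancel, so the last entry is dropped.
    return [(x + y) / 2 for x, y in zip(p, q)][:-1]


def synthesize(excitation, lpc):
    """Run the excitation through the all-pole filter 1/A(z)."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k in range(1, len(lpc)):
            if n - k >= 0:
                acc -= lpc[k] * out[n - k]
        out.append(acc)
    return out
```

For a second-order group the conversion gives A(z) = 1 - (cos ω_1 + cos ω_2) z^-1 + (1 - cos ω_1 + cos ω_2) z^-2, whose last coefficient has magnitude below 1 only when ω_1 < ω_2. This is one way to see why a violated ordering degrades the synthesized speech.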

The speaker unit 23 outputs the synthesized speech data passed to it. The speaker unit 23 has the functions necessary for outputting synthesized speech, such as a sound source and an amplifier.

Physically, the text-to-speech synthesizer 81 shown in FIG. 1 is implemented by a computer 31 that includes, as shown in FIG. 2, a user I/F 39, a CPU 33, a ROM 35, a storage unit 37, a data input I/F 41, and a bus 47 that interconnects them.

Further, in order to emit synthesized speech, the computer 31 includes a speaker 45 and a speaker-driving sound board 43 connected to the speaker 45; the sound board 43 is connected to the bus 47.

The ROM 35 stores an operation program for text-to-speech synthesis; in this embodiment in particular, the operation program includes an operation for correcting the LSP coefficients of speech data.

The storage unit 37 consists of a RAM 51 and a hard disk 53, and stores text character strings, phoneme label strings, phoneme HMM sequences, phoneme spectrum sequences, pitch information, LSP coefficient group time-series data, pitch time-series data, corrected LSP coefficient group time-series data, excitation sound source data, and speech data.

The data input I/F 41 is connected to the speech synthesis dictionary 25 and is used to input data.

The user I/F 39 consists of a keyboard 55 and a monitor 57, through which the user gives a text character string to the text-to-speech synthesizer.

(Operation)
Next, the operation of the speech synthesizer shown in FIGS. 1 and 2 will be described with reference to FIGS. 2 to 4.

When a text character string is given through the user I/F 39 and an instruction to start speech synthesis is issued, the CPU 33 performs the speech synthesis operation by executing the operation program stored in the ROM 35.

First, the CPU 33 stores the given text character string in the storage unit 37.

Next, the CPU 33 sequentially reads the text character string stored in the storage unit 37, converts it into a phoneme label string, and temporarily stores the result in the storage unit 37. The method of converting a text character string into a phoneme label string is itself arbitrary, and any known method can be used.

Based on the phoneme label string stored in the storage unit 37, the CPU 33 retrieves phoneme HMMs one by one from the speech synthesis dictionary 25 and generates a phoneme HMM sequence (see FIG. 3(b)).

The CPU 33 also generates LSP coefficient groups from the phoneme HMM sequence (see FIG. 3(c)).

Usually, a plurality of LSP coefficient groups exist within the period of one state. The time span corresponding to each LSP coefficient group is called one frame.

Here, as shown in FIG. 3, each phoneme HMM is assumed to have three states that output LSP coefficient groups, and each state is assumed to output a plurality of frames fm. Ten LSP coefficients belong to each frame. The horizontal axis corresponds to the time axis.

Of the frames corresponding to state S_j of phoneme HMM λ_i, the first is called fm_{λi,Sj,S} and the last is called fm_{λi,Sj,E}. The storage unit 37 stores the frame identification information together with the data of the LSP coefficient groups output from each state.
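
As an illustrative sketch of this frame identification information, the head and tail frame numbers of each state can be derived from per-state frame counts. The function name, the tuple layout of the input, and the dictionary layout of the result are assumptions; frame numbers are 1-based, as in the text.

```python
def state_frame_ranges(durations):
    """Map each (phoneme index i, state index j) to its head and tail frame.

    `durations` is a list of (i, j, n_frames) tuples in time order. The
    result maps (i, j) to (fm_{λi,Sj,S}, fm_{λi,Sj,E}) as 1-based numbers.
    """
    ranges = {}
    fm = 1  # frame number of the next head frame
    for i, j, n_frames in durations:
        ranges[(i, j)] = (fm, fm + n_frames - 1)
        fm += n_frames
    return ranges


ranges = state_frame_ranges([(1, 1, 3), (1, 2, 2), (1, 3, 4)])
# {(1, 1): (1, 3), (1, 2): (4, 5), (1, 3): (6, 9)}
```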

That is, as shown in FIG. 3, if the number of frames is N_fm and the LSP coefficients have N_d dimensions, the LSP coefficient group time-series data is expressed, using the LSP coefficients ω_d[fm], as
{ω_d[1], ω_d[2], ..., ω_d[fm], ..., ω_d[N_fm-1], ω_d[N_fm]}
(where 1 ≤ d ≤ N_d = 10).

Next, the CPU 33 performs predetermined correction processing on the LSP coefficient group time-series data stored in the storage unit 37 and stores the result in the storage unit 37 as corrected LSP coefficient group time-series data. The predetermined correction processing will be described later with reference to FIG. 4.

The above description assumes that the user inputs the entire text character string to be read aloud into the text-to-speech synthesizer at once, then issues an instruction to start speech synthesis and waits for the synthesized speech to be output. This is called the batch method.

On the other hand, in text-to-speech synthesis there may also be a demand to have the text read aloud progressively in synthesized speech while the text character string is being read in one predetermined division unit at a time. This is called the real-time method.

The real-time method amounts to nothing more than interpreting the text character string to be read aloud as a series of shorter text character strings and processing those shorter strings one after another by the batch method.

In other words, the operation of the text-to-speech synthesizer is essentially the same whether the batch method or the real-time method is implemented. This embodiment can therefore implement both the batch method and the real-time method.

In implementing the real-time method, however, care is needed in choosing the division unit in which the text character string is read into the text-to-speech synthesizer. Depending on the structure of the speech synthesis dictionary 25 of FIG. 1, it may be necessary to input a certain amount of text as one division unit. For example, when the speech synthesis dictionary 25 requires information on context or the like for its search, the division unit must be a text character string long enough for that context to be determined.
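
The reduction of the real-time method to repeated batch processing can be sketched as follows. Splitting at sentence-ending punctuation is only one assumed choice of division unit, and the function names as well as the synthesize_batch callback are hypothetical.

```python
import re

# Assumed sentence-ending characters (ASCII and Japanese full-width forms).
_UNIT = re.compile(r"[^。．.!?！？]+[。．.!?！？]?")


def split_into_units(text):
    """Split text into division units, here one sentence per unit."""
    return [u.strip() for u in _UNIT.findall(text) if u.strip()]


def synthesize_realtime(text, synthesize_batch):
    """Process each division unit in order with the batch-method routine."""
    return [synthesize_batch(unit) for unit in split_into_units(text)]


units = split_into_units("Hello world. How are you?")
# ['Hello world.', 'How are you?']
```

A sentence is a plausible unit because it usually carries enough context for a context-dependent dictionary lookup, but a longer or shorter unit may suit a particular dictionary better.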

As shown in FIG. 1, the text-to-speech synthesizer 81 according to this embodiment is characterized in that the spectrum parameter correction unit 19 performs predetermined correction processing on the LSP coefficient group time-series data.

Next, the spectrum correction processing executed by the CPU 33 will be described with reference to FIGS. 3 and 4.
First, FIG. 3 summarizes the correspondence among the phoneme HMMs, the states of each phoneme HMM, the LSP coefficients, and the frames. To facilitate understanding, a phoneme HMM with five states is assumed (S_0 is the initial state and S_4 is the end state; neither outputs LSP coefficients), and the LSP coefficients are assumed to be 10-dimensional.

Each phoneme HMM has three states, S_1, S_2, and S_3, that output LSP coefficient groups, and each state outputs LSP coefficient groups for a plurality of frames (fm). Ten LSP coefficients belong to each frame. The horizontal axis corresponds to the time axis.

Of the frames corresponding to state S_j of phoneme HMM λ_i, the first is called fm_{λi,Sj,S} and the last is called fm_{λi,Sj,E}.

If the number of frames is N_fm and the LSP coefficients have N_d dimensions, the spectrum time-series data is expressed, using the LSP coefficients ω_d[fm], as
{ω_d[1], ω_d[2], ..., ω_d[fm], ..., ω_d[N_fm-1], ω_d[N_fm]}
(where 1 ≤ d ≤ N_d). In the case of FIG. 3, N_d = 10.

For some reason, the stability condition on the LSP coefficients may fail to be satisfied in some frame. If this is left uncorrected, the sound quality of the synthesized speech is adversely affected. The LSP coefficients are therefore corrected.

The correction processing will be described with reference to the flowchart shown in FIG. 4.
As a premise, it is assumed that the LSP coefficient group time-series data has been generated by the processing described above, and that all LSP coefficients ω_d[fm] have been obtained and stored in the storage unit 37 of FIG. 2.

First, the CPU 33 initializes to 1 the pointer fm, which designates the number identifying a frame (step S11).

The CPU 33 reads the LSP coefficients ω_1[fm], ω_2[fm], ..., ω_Nd[fm] for frame fm from the storage unit 37 (step S13).

The CPU 33 checks whether the stability condition of the LSP coefficients is satisfied. That is, it determines whether 0 < ω_1[fm] < ω_2[fm] < ... < ω_{Nd−1}[fm] < ω_Nd[fm] < π holds (step S15).
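
As an illustration of the check in step S15, the following Python sketch tests the stability condition for a single frame. The function name `is_stable` and the representation of a frame as a plain list of floats are assumptions made for this example; they do not appear in the original description.

```python
import math


def is_stable(omega):
    """Return True if 0 < omega[0] < omega[1] < ... < omega[-1] < pi.

    `omega` is assumed to hold the LSP coefficients of one frame,
    ordered by dimension d = 1 .. N_d.
    """
    # Pad with the two outer bounds so one pairwise comparison covers
    # both the range check and the strict ordering check.
    bounds = [0.0, *omega, math.pi]
    return all(lo < hi for lo, hi in zip(bounds, bounds[1:]))
```

A frame such as `[0.3, 1.2, 0.6]` fails the check because its second and third coefficients are out of order.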

If the stability condition is determined to be satisfied (step S15; satisfied), the LSP coefficient confirmation flag cf[fm] is set to 1, the value indicating that the stability condition is satisfied (cf[fm] = 1), and the flag is stored in the storage unit 37 of FIG. 2 in association with the frame number fm (step S17).

If it is determined in step S15 that the stability condition is not satisfied (step S15; not satisfied), cf[fm] = 0 is stored in the storage unit 37 (step S19).

The CPU 33 then determines whether processing has been completed for all fm (step S21).

If not, fm is incremented by 1 and the process returns to step S13. If so, the CPU 33 determines whether the generated LSP coefficient confirmation flag sequence {cf[1], cf[2], ..., cf[fm], ..., cf[N_fm − 1], cf[N_fm]} contains any fm_NG such that cf[fm_NG] = 0 (step S23).

If no such fm_NG exists, the correction process ends.
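
Steps S11 to S23 can be sketched as a single pass over the frames. This is a minimal sketch under stated assumptions: the names `confirmation_flags` and `failing_frames` are hypothetical, frames are plain lists of floats, and frame indices are 0-based here, whereas the description numbers frames from 1.

```python
import math


def confirmation_flags(frames):
    """Build the flag sequence cf: 1 if a frame is stable, 0 otherwise."""
    flags = []
    for omega in frames:  # steps S11, S13, S21: visit every frame in order
        bounds = [0.0, *omega, math.pi]
        stable = all(lo < hi for lo, hi in zip(bounds, bounds[1:]))  # step S15
        flags.append(1 if stable else 0)  # step S17 (1) or step S19 (0)
    return flags


def failing_frames(flags):
    """Step S23: list every 0-based index fm_NG whose flag is 0."""
    return [fm for fm, cf in enumerate(flags) if cf == 0]
```

An empty result from `failing_frames` corresponds to the case in which the correction process ends without changing anything.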

If it is determined in step S23 that an fm_NG with cf[fm_NG] = 0 exists, the CPU 33 determines which of the states S_1 to S_3 the position of frame fm_NG corresponds to (step S25).

That is, in FIG. 3, if fm_{λi,Sj,S} ≤ fm_NG ≤ fm_{λi,Sj,E}, frame fm_NG corresponds to state S_j.
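
The range test of step S25 can be sketched as a lookup over per-state frame ranges. The function name `find_state` and the dictionary layout, which maps a state index j to a pair (first frame, last frame), are assumptions for illustration; the description only says that the frame identification information makes these boundaries available.

```python
def find_state(fm_ng, state_ranges):
    """Return the state index j whose frame range contains fm_ng.

    `state_ranges` maps j to (start, end), i.e. to the frame numbers
    fm_{λi,Sj,S} and fm_{λi,Sj,E} of one phoneme HMM.
    Returns None if fm_ng lies outside every listed state.
    """
    for j, (start, end) in state_ranges.items():
        if start <= fm_ng <= end:
            return j
    return None
```

For example, with states S_1 to S_3 covering frames 1 to 4, 5 to 9, and 10 to 12, frame 7 is attributed to state S_2.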

Suppose the position determination in step S25 finds that frame fm_NG corresponds to state S_1. In that case, the CPU 33 next determines, based on the frame identification information, where more precisely frame fm_NG lies among the frames corresponding to state S_1 (step S27).

The determination in step S27 has three possible outcomes: frame fm_NG is the first of the frames corresponding to S_1 (fm_NG = fm_{λi,S1,S}); it is neither the first nor the last (fm_{λi,S1,S} < fm_NG < fm_{λi,S1,E}); or it is the last (fm_NG = fm_{λi,S1,E}).

If it is the first frame, the correction refers to the LSP coefficients of the second and subsequent frames corresponding to S_1 (step S29). If it is neither the first nor the last, the correction refers to the LSP coefficients of nearby frames that correspond to S_1 and are adjacent to fm_NG, or adjacent to those neighbors (step S31). If it is the last frame, the correction refers to the LSP coefficients of the second-to-last and earlier frames corresponding to S_1 (step S33).

When step S25 determines that the frame corresponds to S_2 or to S_3, processing similar to steps S29 to S33 is performed. That is, the correction uses adjacent LSP coefficient groups output from the same state.
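
The three-way branch of step S27 can be sketched as follows. The function name `classify_position` and the string labels are hypothetical. A state that spans a single frame is reported as "head" here; the description does not discuss that case, so this is purely a choice made for the sketch.

```python
def classify_position(fm_ng, start, end):
    """Classify fm_ng within one state's frame range [start, end].

    Returns "head" (leads to step S29), "middle" (step S31),
    or "tail" (step S33).
    """
    if not start <= fm_ng <= end:
        raise ValueError("fm_ng lies outside the given state")
    if fm_ng == start:
        return "head"
    if fm_ng == end:
        return "tail"
    return "middle"
```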

As described above, when the LSP coefficients corresponding to phoneme HMM λ_i include an abrupt LSP coefficient and the stability condition is consequently not satisfied, the CPU 33 corrects the LSP coefficients without referring to LSP coefficients corresponding to phoneme HMM λ_{i−1} and earlier phoneme HMMs or to phoneme HMM λ_{i+1} and later phoneme HMMs. It uses only LSP coefficients that were obtained from phoneme HMM λ_i and satisfy the stability condition.

Furthermore, when an LSP coefficient corresponding to state S_j (1 ≤ j ≤ 3) needs correction, the correction refers only to other LSP coefficients corresponding to S_j. LSP coefficients output from another state S_k (k ≠ j) are not referenced, even if they correspond to the same phoneme HMM λ_i.

Therefore, even when there is a large gap in the LSP coefficients at a state transition, no forced interpolation is performed across it, so the synthesized speech data is not distorted and high-quality synthesized speech can be provided. In other words, applying a more appropriate correction to the LSP coefficient time-series data yields higher-quality synthesized speech.

Next, preferred specific examples of the correction process of this embodiment are described, in particular the case classification of steps S25, S27, S29, S31, and S33 in FIG. 4 and the processing for each case.

(Specific example 1 of the case classification and the processing for each case)
The cases are (a) to (h) below. The LSP coefficients are corrected as follows, depending on the position of frame fm_NG. In every case, 1 ≤ d ≤ N_d.

(a) When fm_NG = fm_{λi,S1,S} (S25: S_1; S27: first),
ω_d[fm_NG] = ω_d[fm_{λi,S1,S} + 1]
(step S29). That is, the LSP group of the following frame is copied.
(b) When fm_{λi,S1,S} < fm_NG < fm_{λi,S1,E} (S25: S_1; S27: neither first nor last),
ω_d[fm_NG] = (ω_d[fm_NG − 1] + ω_d[fm_NG + 1]) / 2
(step S31). That is, the value is corrected to the average of the LSP groups of the preceding and following frames.
(c) When fm_NG = fm_{λi,S1,E} (S25: S_1; S27: last),
ω_d[fm_NG] = ω_d[fm_{λi,S1,E} − 1]
(d) When fm_{λi,S2,S} ≤ fm_NG < fm_{λi,S2,E},
ω_d[fm_NG] = ω_d[fm_{λi,S2,S} + 1]
(e) When fm_NG = fm_{λi,S2,E},
ω_d[fm_NG] = ω_d[fm_{λi,S2,E} − 1]
(f) When fm_NG = fm_{λi,S3,S},
ω_d[fm_NG] = ω_d[fm_{λi,S3,S} + 1]
(g) When fm_{λi,S3,S} < fm_NG < fm_{λi,S3,E},
ω_d[fm_NG] = (ω_d[fm_NG − 1] + ω_d[fm_NG + 1]) / 2
(h) When fm_NG = fm_{λi,S3,E},
ω_d[fm_NG] = ω_d[fm_{λi,S3,E} − 1]
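
The head, middle, and tail rules of cases (a) to (c) and (f) to (h) can be sketched as one function. This is a minimal sketch, not the patented implementation: the name `correct_frame` is hypothetical, frames are plain lists of floats, and the state is assumed to span at least two frames. Cases (d) and (e) for state S_2 are not covered here.

```python
def correct_frame(frames, fm_ng, start, end):
    """Return a corrected copy of frames[fm_ng].

    Only frames inside [start, end], the range of the state that
    output frames[fm_ng], are referenced.
    """
    if end <= start:
        raise ValueError("the state must span at least two frames")
    if not start <= fm_ng <= end:
        raise ValueError("fm_ng lies outside the given state")
    if fm_ng == start:  # cases (a), (f): copy the following frame
        return list(frames[start + 1])
    if fm_ng == end:  # cases (c), (h): copy the preceding frame
        return list(frames[end - 1])
    # cases (b), (g): element-wise average of the two neighbouring frames
    prev_frame, next_frame = frames[fm_ng - 1], frames[fm_ng + 1]
    return [(p + n) / 2 for p, n in zip(prev_frame, next_frame)]
```

When both neighbouring frames satisfy the stability condition, their element-wise average does as well, because averaging preserves both the strict ordering and the bounds 0 and π.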

This example is characterized in that an abrupt LSP coefficient is corrected with reference to the LSP coefficients of another frame generated from the same state of the phoneme HMM that output that coefficient.

(Specific example 2 of the case classification and the processing for each case)
Unlike the preceding example, this example is characterized in that, in some cases, it refers not only to the LSP coefficients of the frame adjacent to the frame containing the LSP coefficient to be corrected, but also to those of the frame adjacent to that adjacent frame, as follows. In every case, 1 ≤ d ≤ N_d.

In case (a), that is, when fm_NG = fm_{λi,S1,S},
ω_d[fm_NG] = ω_d[fm_{λi,S1,S} + 1] × 2 + ω_d[fm_{λi,S1,S} + 2] × (−1)
In case (c), that is, when fm_NG = fm_{λi,S1,E},
ω_d[fm_NG] = ω_d[fm_{λi,S1,E} − 2] × (−1) + ω_d[fm_{λi,S1,E} − 1] × 2
In case (f), that is, when fm_NG = fm_{λi,S3,S},
ω_d[fm_NG] = ω_d[fm_{λi,S3,S} + 1] × 2 + ω_d[fm_{λi,S3,S} + 2] × (−1)
In case (h), that is, when fm_NG = fm_{λi,S3,E},
ω_d[fm_NG] = ω_d[fm_{λi,S3,E} − 2] × (−1) + ω_d[fm_{λi,S3,E} − 1] × 2
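
The weights 2 and −1 amount to a linear extrapolation from the two nearest frames of the same state. The sketch below illustrates this under assumptions: the function names are hypothetical, frames are plain lists of floats, and the state is assumed to contain at least three frames.

```python
def extrapolate_head(frames, start):
    """Cases (a), (f): rebuild the first frame of a state from the next two."""
    near, far = frames[start + 1], frames[start + 2]
    return [2 * a - b for a, b in zip(near, far)]


def extrapolate_tail(frames, end):
    """Cases (c), (h): rebuild the last frame of a state from the previous two."""
    near, far = frames[end - 1], frames[end - 2]
    return [2 * a - b for a, b in zip(near, far)]
```

Unlike averaging, extrapolation does not by itself guarantee that the result satisfies the stability condition, so a caller may wish to re-check the corrected frame.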

(Specific example 3 of the case classification and the processing for each case)
This example is the same as the two preceding examples in that the LSP coefficient correction method is switched according to the position of frame fm_NG. In this example, however, only the LSP coefficients of the orders that need correction are corrected so that stability is satisfied, and the corrected LSP coefficient data is output. Suppose the orders d of the LSP coefficients that need correction satisfy α ≤ d ≤ β (where α ≤ β, 1 ≤ α ≤ N_d, and 1 ≤ β ≤ N_d). The LSP coefficients are then corrected as follows. In every case, α ≤ d ≤ β.

(a) When fm_NG = fm_{λi,S1,S},
ω_d[fm_NG] = ω_d[fm_{λi,S1,S} + 1]
(b) When fm_{λi,S1,S} < fm_NG < fm_{λi,S1,E},
ω_d[fm_NG] = (ω_d[fm_NG − 1] + ω_d[fm_NG + 1]) / 2
(c) When fm_NG = fm_{λi,S1,E},
ω_d[fm_NG] = ω_d[fm_{λi,S1,E} − 1]
(d) When fm_{λi,S2,S} ≤ fm_NG < fm_{λi,S2,E},
ω_d[fm_NG] = ω_d[fm_{λi,S2,S} + 1]
(e) When fm_NG = fm_{λi,S2,E},
ω_d[fm_NG] = ω_d[fm_{λi,S2,E} − 1]
(f) When fm_NG = fm_{λi,S3,S},
ω_d[fm_NG] = ω_d[fm_{λi,S3,S} + 1]
(g) When fm_{λi,S3,S} < fm_NG < fm_{λi,S3,E},
ω_d[fm_NG] = (ω_d[fm_NG − 1] + ω_d[fm_NG + 1]) / 2
(h) When fm_NG = fm_{λi,S3,E},
ω_d[fm_NG] = ω_d[fm_{λi,S3,E} − 1]

In this example, only the LSP coefficients of the orders that caused the instability are corrected. The LSP coefficients of the affected frame can therefore be corrected to values even more appropriate than in the two preceding examples, while still satisfying the stability condition.
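
Restricting the correction to the orders α to β can be sketched as follows. The name `correct_dimensions` is hypothetical, and `alpha` and `beta` are 0-based inclusive indices here, whereas the description counts orders from 1. How α and β are identified is not specified in the text, so they are simply passed in. The sketch covers the head, middle, and tail pattern of cases (a) to (c) and (f) to (h) only.

```python
def correct_dimensions(frames, fm_ng, start, end, alpha, beta):
    """Return a copy of frames[fm_ng] with only dims alpha..beta replaced."""
    if end <= start:
        raise ValueError("the state must span at least two frames")
    if fm_ng == start:  # first frame of the state: use the following frame
        reference = frames[start + 1]
    elif fm_ng == end:  # last frame of the state: use the preceding frame
        reference = frames[end - 1]
    else:  # otherwise: average of the two neighbouring frames
        reference = [
            (p + n) / 2 for p, n in zip(frames[fm_ng - 1], frames[fm_ng + 1])
        ]
    fixed = list(frames[fm_ng])
    # Overwrite only the offending orders; all other orders keep their values.
    fixed[alpha:beta + 1] = reference[alpha:beta + 1]
    return fixed
```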

Just as specific example 2 was given as a variation of specific example 1, a variation of this example is also possible in which the LSP coefficients of the frame adjacent to the frame containing the LSP coefficient to be corrected, and of the frame adjacent to that adjacent frame, are also referenced.

(Embodiment 2)
In specific example 3 and the like, abrupt LSP coefficients were interpolated using only LSP coefficients from the same state of the same phoneme HMM. In these cases, however, the influence is small, so interpolating with LSP coefficients from a different phoneme HMM or a different state poses no problem.

For example, if the 2nd- and 3rd-dimension LSP coefficients of the first frame of S_2 are reversed in order and thus abrupt, the average (a weighted average is also acceptable) of the 2nd- and 3rd-dimension LSP coefficients of the last frame of S_1 and those of the second frame of S_2 may be computed and used for interpolation. In this way, the effect of the interpolation is confined to the abrupt dimensions, so using LSP coefficients from a different state has no major adverse effect and may even be preferable.
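
The cross-state interpolation described above can be sketched as a weighted average applied to selected dimensions only. The function name `cross_state_average`, the `weight` parameter, and the 0-based dimension indices are assumptions for illustration; the description only states that a plain or weighted average may be used.

```python
def cross_state_average(prev_tail, next_frame, target, dims, weight=0.5):
    """Return a copy of `target` with the listed dimensions interpolated.

    prev_tail  : last frame of the preceding state (e.g. S_1)
    next_frame : second frame of the current state (e.g. S_2)
    target     : first frame of the current state, containing abrupt values
    dims       : 0-based indices of the dimensions to replace
    weight     : share given to prev_tail; 0.5 yields the plain average
    """
    fixed = list(target)
    for d in dims:
        fixed[d] = weight * prev_tail[d] + (1.0 - weight) * next_frame[d]
    return fixed
```

For the case in the text, where the 2nd and 3rd dimensions are reversed, `dims` would be `[1, 2]` in this 0-based convention.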

(Embodiment 3)
In Embodiment 1, abrupt LSP coefficients were corrected using only LSP coefficients generated from the same state of the same phoneme HMM. However, sufficiently high-quality text-to-speech synthesis may also be achievable when the correction uses LSP coefficients output from a different state of the same phoneme HMM.

According to this embodiment, a more flexible correction method can be adopted depending on the situation.

It is also conceivable to omit the position determination of fm_NG (steps S25 and S27) in the flowchart shown in FIG. 4. This yields a simpler LSP coefficient correction.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible.

For example, the hardware configuration, block configuration, and flowcharts are illustrative and not limiting. The state and frame position may also be determined by analyzing the LSP coefficients, without attaching state information or frame identification information.

The combination of case classification and corresponding processing is also arbitrary within the scope of the gist of the invention.

Furthermore, the present invention can be realized with a general-purpose computer rather than a dedicated device or system. That is, a computer program for causing a computer to execute the processing described above may be produced and distributed, or installed on a computer, so that the computer executes that processing.

FIG. 1 is a functional block diagram of a text-to-speech synthesizer provided with a spectrum correction unit according to Embodiment 1.
FIG. 2 is a diagram showing the physical configuration of the text-to-speech synthesizer according to Embodiment 1.
FIG. 3 is an explanatory diagram showing the correspondence among the phoneme HMM sequence, the number of states applied to each phoneme HMM, the LSP coefficients, and the frames.
FIG. 4 is a diagram showing the flow of operation of the spectrum correction process in Embodiment 1.

Explanation of symbols

13 ... text capture unit, 15 ... phoneme label string conversion unit, 19 ... spectral parameter correction unit, 23 ... speaker unit, 25 ... speech synthesis dictionary, 31 ... computer for text-to-speech synthesis, 33 ... CPU, 35 ... ROM, 37 ... storage unit, 39 ... user I/F, 41 ... data input I/F, 43 ... sound board, 45 ... speaker, 47 ... bus, 51 ... RAM, 53 ... hard disk, 55 ... keyboard, 57 ... monitor, 81 ... text-to-speech synthesizer according to Embodiment 1, 83 ... time-series data generation unit, 85 ... excitation sound source generation unit, 87 ... synthesis filter

Claims

A speech processing apparatus comprising:
phoneme label string conversion means for converting a given text string into a phoneme label string;
time-series data generating means for generating time-series data of LSP coefficient groups, each including multi-dimensional LSP (Line Spectrum Pair) coefficients, from the phoneme label string output from the phoneme label string conversion means; and
correction means for determining, for each LSP coefficient group constituting the LSP coefficient group time-series data generated by the time-series data generating means, whether a predetermined stability condition is satisfied and, when it is determined not to be satisfied, correcting the LSP coefficient group so that it satisfies the predetermined stability condition, switching the correction method depending on whether the LSP coefficient group is located at the head, at the tail, or elsewhere within one state of the phoneme HMM (Hidden Markov Model) that output it,
wherein the predetermined stability condition is that the plurality of LSP coefficients constituting each LSP coefficient group are all greater than 0 and less than π and, when arranged in ascending order of dimension, are in ascending order of value, and
wherein the correction means corrects the one LSP coefficient group that does not satisfy the predetermined stability condition using at least the LSP coefficient group located second in the state when the one LSP coefficient group is located at the head of one state, using at least the LSP coefficient group located second from the tail in the state when it is located at the tail of one state, and using at least one adjacent LSP coefficient group in the state when it is located elsewhere in one state.

The speech processing apparatus according to claim 1, wherein, when it is determined that the predetermined stability condition is not satisfied, the correction means corrects the LSP coefficient group, so that it satisfies the predetermined stability condition, using another LSP coefficient group generated by the phoneme HMM that output the LSP coefficient group.

The speech processing apparatus according to claim 1, wherein the correction means corrects, within the LSP coefficient group, only the LSP coefficients of the dimensions that cause the stability condition not to be satisfied.

A text-to-speech synthesizer comprising:
the speech processing apparatus according to any one of claims 1 to 3;
means for supplying text data to be synthesized into speech to the phoneme label string conversion means of the speech processing apparatus; and
synthesized speech generation means for generating synthesized speech by inputting excitation sound source data to an LSP synthesis filter that uses, as its filter coefficients, the LSP coefficient group time-series data generated by the speech processing apparatus.

A speech processing method comprising:
a phoneme label string conversion step of converting a text string into a phoneme label string;
a time-series data generation step of generating time-series data of LSP coefficient groups including LSP (Line Spectrum Pair) coefficients from the phoneme label string; and
a correction step of, for each LSP coefficient group constituting the LSP coefficient group time-series data, when a predetermined stability condition is not satisfied, correcting the LSP coefficient group so that it satisfies the predetermined stability condition, switching the correction method depending on whether the LSP coefficient group is located at the head, at the tail, or elsewhere within one state of the phoneme HMM that output it,
wherein the predetermined stability condition is that the plurality of LSP coefficients constituting each LSP coefficient group are all greater than 0 and less than π and, when arranged in ascending order of dimension, are in ascending order of value, and
wherein, in the correction step, the one LSP coefficient group that does not satisfy the predetermined stability condition is corrected using at least the LSP coefficient group located second in the state when the one LSP coefficient group is located at the head of one state, using at least the LSP coefficient group located second from the tail in the state when it is located at the tail of one state, and using at least one adjacent LSP coefficient group in the state when it is located elsewhere in one state.

A computer program for causing a computer to execute:
a phoneme label string conversion step of converting a text string into a phoneme label string;
a time-series data generation step of generating time-series data of LSP coefficient groups including LSP (Line Spectrum Pair) coefficients from the phoneme label string; and
a correction step of, for each LSP coefficient group constituting the LSP coefficient group time-series data, when a predetermined stability condition is not satisfied, correcting the LSP coefficient group so that it satisfies the predetermined stability condition, switching the correction method depending on whether the LSP coefficient group is located at the head, at the tail, or elsewhere within one state of the phoneme HMM that output it,
wherein the predetermined stability condition is that the plurality of LSP coefficients constituting each LSP coefficient group are all greater than 0 and less than π and, when arranged in ascending order of dimension, are in ascending order of value, and
wherein, in the correction step, the one LSP coefficient group that does not satisfy the predetermined stability condition is corrected using at least the LSP coefficient group located second in the state when the one LSP coefficient group is located at the head of one state, using at least the LSP coefficient group located second from the tail in the state when it is located at the tail of one state, and using at least one adjacent LSP coefficient group in the state when it is located elsewhere in one state.