JP3444396B2

JP3444396B2 - Speech synthesis method, its apparatus and program recording medium

Info

Publication number: JP3444396B2
Application number: JP23974597A
Authority: JP
Inventors: 公人田中; 匡伸阿部
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1996-09-11
Filing date: 1997-09-04
Publication date: 2003-09-08
Anticipated expiration: 2017-09-04
Also published as: JPH10143196A

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a speech synthesis method. In text-to-speech conversion using speech units, it aims to prevent the degradation of synthesized speech that occurs when the fundamental frequency pattern of the speech to be generated differs greatly from that of the speech units. In analysis-synthesis, it aims to prevent the degradation that occurs when synthesized speech is generated with a fundamental frequency pattern that differs greatly from that of the original speech.

[0002]

2. Description of the Related Art Conventionally, for example in text-to-speech conversion, one-period waveforms are cut out at every fundamental period from prerecorded speech units and rearranged to match the fundamental frequency pattern generated from the text analysis result. This is called the PSOLA method and is described in, for example, M. Moulines et al., "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Communication, vol. 9, pp. 453-467 (1990-12).

[0003] In analysis-synthesis, the original speech is analyzed, its spectral features are retained, and the original speech is synthesized from those features. With the conventional techniques, when the fundamental frequency pattern of the prerecorded speech units differs greatly from that of the speech to be synthesized, the quality of the synthesized speech degrades markedly. This is reported in, for example, T. Hirokawa et al., "Segment Selection and Pitch Modification for High Quality Speech Synthesis using Waveform Segments," ICSLP 90, pp. 337-340, and D. H. Klatt et al., "Analysis, synthesis, and perception of voice quality variations among female and male talkers," J. Acoust. Soc. Am. 87(2), February 1990, pp. 820-857. Because arranging the waveforms to follow the fundamental frequency pattern generated from text analysis as it is can degrade quality markedly, the conventional PSOLA method has sometimes used a flat fundamental frequency pattern with little variation instead.

[0004] The quality degradation that arises when the fundamental frequency of a speech unit is changed by a large amount is thought to occur because the fundamental frequency and the spectrum no longer match acoustically. Good-quality synthesized speech could therefore be obtained by preparing many speech units whose spectral structure is consistent with each fundamental frequency. However, it is difficult to have every speech unit uttered at each desired fundamental frequency, and even if that were possible, the required storage would be enormous, so the approach is impractical.

[0005] In view of this, Japanese Unexamined Patent Publication No. S57-171398 (published October 21, 1982) stores, for each phoneme, spectral envelope parameter values for a plurality of voices with different fundamental frequencies and uses the spectral envelope parameters of the closest fundamental frequency. Because only a few fundamental frequencies are covered, the quality improvement is slight, and the required storage capacity is considerably large.

[0006] Japanese Unexamined Patent Publication No. H7-104795 (published April 21, 1995) models the human voice, creates conversion rules, and transforms the spectrum according to the change in fundamental frequency. In this method the voice is not necessarily modeled accurately, so the conversion rules do not match human speech correctly, and good quality cannot be expected.

[0007] Furthermore, pages 337 to 338 of the proceedings of the March 1996 meeting of the Acoustical Society of Japan propose synthesizing speech by changing both the fundamental frequency and the spectrum. This method makes only a rough change, widening the spectral spacing as the fundamental frequency F0 is raised, and does not yield good-quality synthesized speech. In analysis-synthesis as well, the quality of the synthesized sound degrades when synthesized speech is generated with a pitch period that differs greatly from that of the original speech.

[0008] After the priority date of this application, September 11, 1996, the present inventors presented part or all of the invention of this application at the following conferences and in their proceedings.
A. Kimihito Tanaka and Masanobu Abe, "A New Fundamental Frequency Modification Algorithm With Transformation of Spectrum Envelope According to F0," 1997 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 97), Vol. II, pp. 951-954, The Institute of Electrical and Electronics Engineers (IEEE) Signal Processing Society, April 21-24, 1997.
B. Kimihito Tanaka and Masanobu Abe, "Text-to-speech synthesis system that transforms the spectral envelope according to the fundamental frequency," IEICE Technical Report, Vol. 96, No. 566, pp. 23-30, SP96-130, March 7, 1997 (presented on the 6th), The Institute of Electronics, Information and Communication Engineers.
C. Kimihito Tanaka and Masanobu Abe, "Speech synthesis method that transforms the spectral envelope according to F0," Proceedings of the 1997 Spring Meeting of the Acoustical Society of Japan, Vol. I, pp. 217-218, March 17, 1997, The Acoustical Society of Japan.
D. (Domestic presentation and proceedings) Kimihito Tanaka and Masanobu Abe, "Speech synthesis method that transforms the spectral envelope according to the fundamental frequency," Proceedings of the 1996 Autumn Meeting of the Acoustical Society of Japan, Vol. I, pp. 217-218, September 25, 1996, The Acoustical Society of Japan.

[0009]

SUMMARY OF THE INVENTION To solve the above problems, the present invention transforms the spectral envelope according to the difference between the fundamental frequency of the input speech (that is, a speech unit or the original speech) and the fundamental frequency of the speech to be synthesized, using the relationship between spectral envelope and fundamental frequency observed in natural speech.

[0010] For this purpose, codebooks are created in advance, one for each fundamental frequency range, from training speech data in which the same text is uttered in, for example, several fundamental frequency ranges. The code vectors of these codebooks are in one-to-one correspondence across the fundamental frequency ranges. At synthesis time, the spectral envelope feature extracted from the input speech is vector-quantized with the codebook of the input speech's own fundamental frequency range (the reference codebook) and decoded with the mapping codebook of the fundamental frequency range to be synthesized, which transforms the spectral envelope. Because the fundamental frequency and the spectrum of the transformed envelope match acoustically, using it makes high-quality speech synthesis possible.
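
The quantize-then-decode mapping described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `map_envelope`, the Euclidean distance, and the toy two-dimensional codebooks are assumptions made for the example.

```python
import numpy as np

def map_envelope(feature, reference_cb, mapping_cb):
    """Quantize `feature` on the reference codebook, decode on the mapping codebook."""
    # Index of the nearest reference code vector (Euclidean distance is assumed).
    distances = np.linalg.norm(reference_cb - feature, axis=1)
    code = int(np.argmin(distances))
    # Code vectors share indices across codebooks, so the same index decodes.
    return code, mapping_cb[code]

# Toy codebooks with three code vectors each (illustrative values only).
cb_mid = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
cb_high = np.array([[0.1, 0.3], [1.2, 1.4], [2.1, 0.5]])
code, mapped = map_envelope(np.array([0.9, 1.1]), cb_mid, cb_high)
```

The one-to-one correspondence between code vectors is what lets a single index serve both codebooks.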

[0011] In addition, a difference vector codebook is prepared by taking the difference vector between each pair of corresponding code vectors in the reference codebook and the codebook of another fundamental frequency range. A difference frequency codebook is also prepared by taking, for each pair of corresponding classes in the two codebooks, the difference between the mean fundamental frequencies of the element vectors belonging to those classes. The spectral envelope of the input speech is vector-quantized with the reference codebook; the difference vector corresponding to the resulting quantization code is read from the difference vector codebook, and the corresponding difference frequency is read from the difference frequency codebook. From this difference frequency, the fundamental frequency of the input speech, and the desired fundamental frequency, a stretch ratio corresponding to the difference between the two fundamental frequencies is obtained, and the difference vector is stretched or shrunk by that ratio. The scaled difference vector is added to the spectral envelope of the input speech, and the sum is converted to the time domain to obtain a speech unit with a transformed spectral envelope. In this way the spectral envelope can be transformed to match an arbitrary fundamental frequency, not only those of the ranges for which codebooks were created.

According to the invention of claim 1, in a speech synthesis method that synthesizes an input speech unit waveform (hereinafter, the input speech) into speech with a desired fundamental frequency different from its own, two codebooks built from training speech data of different fundamental frequency ranges are used: a codebook created for the spectral envelope of the fundamental frequency range of the input speech (hereinafter, the reference codebook), and a difference vector codebook consisting of the difference vectors between each code vector of the reference codebook and the corresponding code vector of a codebook whose fundamental frequency range differs from that of the input speech. The spectral envelope of the input speech is vector-quantized with the reference codebook; the difference vector corresponding to the quantized code is obtained from the difference vector codebook; that difference vector is stretched or shrunk according to the deviation of the desired fundamental frequency from the fundamental frequency of the input speech; and the transformed spectral envelope of the input speech is obtained from the sum of the scaled difference vector and the vector of the quantized code.
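
A rough sketch of the scaled-difference-vector idea follows. The text does not give the formula for the stretch ratio at this point, so the linear ratio used here (requested F0 shift divided by the stored difference frequency) is an assumption, as are the function name `modify_envelope` and all numeric values.

```python
import numpy as np

def modify_envelope(envelope, reference_cb, diff_vector_cb, diff_freq_cb,
                    f0_input, f0_target):
    """Add a scaled difference vector to the input spectral envelope."""
    # Quantize the envelope on the reference codebook.
    code = int(np.argmin(np.linalg.norm(reference_cb - envelope, axis=1)))
    # ASSUMPTION: the stretch ratio is the requested F0 shift relative to the
    # F0 gap stored for this class in the difference frequency codebook.
    ratio = (f0_target - f0_input) / diff_freq_cb[code]
    return envelope + ratio * diff_vector_cb[code]

reference_cb = np.array([[0.0, 0.0], [1.0, 1.0]])
diff_vector_cb = np.array([[0.2, 0.4], [0.4, -0.2]])  # toy difference vectors
diff_freq_cb = np.array([50.0, 40.0])                 # toy F0 gaps in Hz
result = modify_envelope(np.array([0.9, 1.2]), reference_cb, diff_vector_cb,
                         diff_freq_cb, f0_input=120.0, f0_target=140.0)
```

With a 20 Hz shift against a stored 40 Hz gap, half of the difference vector is applied.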

[0012]

DESCRIPTION OF THE PREFERRED EMBODIMENTS FIG. 1 shows the basic procedure of the present invention. In step S1, spectral features are extracted from the input speech. In step S2, the spectral envelope of the input speech is transformed according to the fundamental frequency difference between the input speech and the synthesized speech, using the relationship between fundamental frequency and spectral envelope, and the synthesized speech is obtained.

[0013] An embodiment in which the invention is applied to text-to-speech synthesis is described below. A text-to-speech synthesis system based on speech units analyzes the input text to obtain the sequence of speech units to be used for synthesis and a fundamental frequency pattern. When the fundamental frequency pattern of the speech to be synthesized differs greatly from the pattern the speech units originally have, the present invention transforms the spectral envelope of each speech unit according to how far its fundamental frequency pattern must be modified to follow the given pattern. For this transformation, spectral features are extracted from the speech unit, that is, from the input speech waveform, as shown in FIG. 2. All speech data used here are assumed to carry phoneme boundaries and pitch marks that indicate the fundamental periods.

[0014] FIG. 2 shows the procedure for extracting speech features that represent spectral envelope information so that the speech signal can be expressed efficiently. The procedure improves on a method that samples the maxima of the log spectrum near integer multiples of the fundamental frequency and estimates the spectral envelope by least-squares fitting of a cosine model (H. Matsumoto et al., "A Minimum Distortion Spectral Mapping Applied to Voice Quality Conversion," ICSLP 90, 5.9, pp. 161-164 (1990)).

[0015] When a speech waveform is input, step S101 applies a window function whose length is, for example, five times the fundamental period, centered on a pitch mark, and cuts out the waveform. Step S102 applies an FFT (fast Fourier transform) to the cut-out waveform and computes the log power spectrum. Step S103 samples the maximum of that log power spectrum in the neighborhood of each integer multiple of the fundamental frequency F0,

$$ nF_0 - \frac{F_0}{2} < f_n < nF_0 + \frac{F_0}{2}, $$

where n is an integer. That is, as shown in FIG. 3, the maximum of the log power spectrum is taken within each interval of width F0 centered on F0, 2F0, 3F0, and so on. In addition, if, for example, the frequency f3 of the maximum found in the interval centered on 3F0 is at or below 3F0, the frequency f4 of the maximum found in the adjacent interval centered on 4F0 is above 4F0, and the difference ΔF between f3 and f4 (the spacing between adjacent samples) exceeds 1.5 F0, then the local maxima of the log power spectrum between f3 and f4 are also sampled.
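
Steps S101 to S103 can be sketched roughly as below. The Hann window, the function names, and the synthetic five-harmonic test signal are assumptions for illustration; the extra rule for gaps wider than 1.5 F0 is deliberately left out to keep the sketch short.

```python
import numpy as np

def log_power_spectrum(wave, mark, period, fs):
    """Steps S101-S102: window five periods around a pitch mark, then FFT."""
    half = (5 * period) // 2
    # A Hann window is an assumption; the text only says "a window function".
    segment = wave[mark - half:mark + half] * np.hanning(2 * half)
    power = np.abs(np.fft.rfft(segment)) ** 2
    freqs = np.fft.rfftfreq(2 * half, d=1.0 / fs)
    return freqs, 10.0 * np.log10(power + 1e-12)  # small floor avoids log(0)

def sample_harmonic_peaks(freqs, log_power, f0):
    """Step S103: take the spectral maximum in each band n*F0 +/- F0/2."""
    peak_freqs, peak_values = [], []
    n = 1
    while (n + 0.5) * f0 <= freqs[-1]:
        band = np.flatnonzero((freqs > (n - 0.5) * f0) & (freqs < (n + 0.5) * f0))
        if band.size:
            best = band[np.argmax(log_power[band])]
            peak_freqs.append(freqs[best])
            peak_values.append(log_power[best])
        n += 1
    return np.array(peak_freqs), np.array(peak_values)

# Synthetic voiced signal: five harmonics of 200 Hz sampled at 8 kHz.
fs, f0 = 8000, 200.0
t = np.arange(800) / fs
wave = sum(np.cos(2 * np.pi * h * f0 * t) / h for h in range(1, 6))
freqs, spectrum = log_power_spectrum(wave, mark=400, period=40, fs=fs)
peak_freqs, peak_values = sample_harmonic_peaks(freqs, spectrum, f0)
```

For this signal the first five sampled peaks fall on the harmonics themselves; the higher bands contain only window leakage.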

[0016] Step S104 linearly interpolates between the sampling points obtained in step S103. Step S105 resamples the linearly interpolated pattern from step S104 at intervals of F0/m, using the largest F0/m that satisfies F0/m < 50 Hz, where m is an integer.
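
The interpolation and resampling of steps S104 and S105 might look like the following sketch. Starting the grid at the first sampled peak and the name `resample_envelope` are assumptions; the choice of m follows from the condition F0/m < 50 Hz, whose largest admissible step corresponds to the smallest such integer m.

```python
import numpy as np

def resample_envelope(peak_freqs, peak_values, f0, max_step_hz=50.0):
    """Steps S104-S105: linear interpolation, then sampling every F0/m Hz."""
    # Smallest integer m with F0/m < 50 Hz, which gives the largest valid step.
    m = int(np.floor(f0 / max_step_hz)) + 1
    step = f0 / m
    # ASSUMPTION: the grid spans from the first to the last sampled peak.
    grid = np.arange(peak_freqs[0], peak_freqs[-1] + 1e-9, step)
    return grid, np.interp(grid, peak_freqs, peak_values), m

grid, values, m = resample_envelope(np.array([120.0, 240.0, 360.0]),
                                    np.array([0.0, 12.0, 6.0]), f0=120.0)
```

For F0 = 120 Hz this gives m = 3 and a 40 Hz step, since m = 2 would give 60 Hz.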

[0017] Step S106 fits the points sampled in step S105 by least squares with the cosine model of equation (1):

$$ Y(\lambda) = \sum_{i=1}^{M} A_i \cos i\lambda, \qquad 0 \le \lambda \le \pi \tag{1} $$

Equation (1) yields the speech features (cepstrum) A_i. This feature extraction faithfully represents the peaks of the power spectrum. The method of extracting the features A_i is called the IPSE method.
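
The least-squares fit of equation (1) can be illustrated with an ordinary linear solve. This sketch assumes the sampled frequencies have already been mapped onto λ in [0, π]; the function name `fit_cosine_model` and the test data are hypothetical.

```python
import numpy as np

def fit_cosine_model(lam, y, order):
    """Least-squares fit of y(lam) ~ sum_{i=1..order} A_i * cos(i * lam)."""
    # One basis column per term cos(i * lam), i = 1..order, as in equation (1).
    basis = np.cos(np.outer(lam, np.arange(1, order + 1)))
    coeffs, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return coeffs

# Data generated from known coefficients A_1 = 2 and A_3 = -0.5.
lam = np.linspace(0.0, np.pi, 64)
y = 2.0 * np.cos(lam) - 0.5 * np.cos(3.0 * lam)
coeffs = fit_cosine_model(lam, y, order=4)
```

The fit recovers the generating coefficients because the data lie exactly in the span of the basis.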

[0018] Next, the algorithm for creating the codebooks for different fundamental frequency ranges, which are used to transform the spectral envelope, is described with reference to FIG. 5. As an example, consider three fundamental frequency ranges: "high", "medium", and "low". The input speech data (training speech data) consist of the same text uttered by a single speaker in each of the three fundamental frequency ranges.

[0019] In FIG. 5, steps S201, S202, and S203 extract speech features, IPSE cepstra in this example, at every pitch mark from the "high", "medium", and "low" speech data, respectively, using the algorithm of FIG. 2. Steps S204, S205, and S206 convert the frequency scale of the IPSE cepstra extracted in steps S201, S202, and S203 to the mel scale to better reflect auditory characteristics, giving mel IPSE cepstra. The mel scale is described in, for example, "Computation of Spectra with Unequal Resolution Using the Fast Fourier Transform," Proceedings of the IEEE, February 1971, pp. 299-301.

[0020] In step S207, as shown in FIG. 4, linear time-warping matching is performed for each voiced phoneme of the same text between the pitch mark sequence of the "high" speech data and that of the "medium" speech data, to find the correspondence between the pitch marks of the two data sets. For example, if the pitch mark sequence of voiced phoneme A is H1, H2, H3, H4, H5 in the "high" data and M1, M2, M3, M4 in the "medium" data, then H1 is paired with M1, H2 with M2, H3 and H4 with M3, and H5 with M4. In this way, within corresponding phoneme sections of the "high" and "medium" ranges, the time axis is linearly stretched or compressed and each pitch mark is paired with the one at the closest relative position in the section. Step S208 likewise finds the correspondence between the pitch marks of the "low" speech data and those of the "medium" speech data.
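
The linear time-warping match of step S207 can be sketched by normalizing each pitch mark to its relative position inside the phoneme and picking the closest counterpart. The pitch mark times and phoneme spans below are made-up values chosen to reproduce the H1 to H5 and M1 to M4 pairing of the example; `align_pitch_marks` is a hypothetical name.

```python
import numpy as np

def align_pitch_marks(src_marks, src_span, ref_marks, ref_span):
    """For each source mark, return the index of the closest reference mark
    after linearly rescaling both phoneme sections to the interval [0, 1]."""
    src = (np.asarray(src_marks, float) - src_span[0]) / (src_span[1] - src_span[0])
    ref = (np.asarray(ref_marks, float) - ref_span[0]) / (ref_span[1] - ref_span[0])
    return np.argmin(np.abs(src[:, None] - ref[None, :]), axis=1)

high_marks = [5, 25, 48, 68, 92]  # H1..H5 in ms, phoneme spans 0-100 ms
mid_marks = [6, 36, 70, 108]      # M1..M4 in ms, phoneme spans 0-120 ms
mapping = align_pitch_marks(high_marks, (0, 100), mid_marks, (0, 120))
```

Here `mapping` holds zero-based indices, so the value 2 for both H3 and H4 means that both are paired with M3.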

[0021] In step S209, the speech features (mel IPSE cepstra) extracted at each pitch mark from the "medium" speech data are clustered with the LBG algorithm to create the codebook CB_M for the "medium" fundamental frequency range. The LBG algorithm is described in detail in, for example, Linde et al., "An Algorithm for Vector Quantization Design" (IEEE COM-28 (1980-01), pp. 84-95).
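
For orientation, a compact LBG-style clustering is sketched below: the codebook grows by splitting each code vector and is refined by alternating nearest-neighbor assignment and centroid updates. The additive split offset, the fixed iteration count, and the restriction to power-of-two sizes are simplifying assumptions, not details taken from the cited paper.

```python
import numpy as np

def lbg(data, size, iterations=20, eps=1e-3):
    """Grow a codebook by repeated splitting; `size` should be a power of two."""
    codebook = data.mean(axis=0, keepdims=True)
    while len(codebook) < size:
        # Split every code vector into a slightly perturbed pair.
        codebook = np.vstack([codebook + eps, codebook - eps])
        for _ in range(iterations):
            dist = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
            labels = dist.argmin(axis=1)
            for c in range(len(codebook)):
                members = data[labels == c]
                if len(members):  # keep the old vector if a class is empty
                    codebook[c] = members.mean(axis=0)
    return codebook

data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
codebook = lbg(data, size=2)
ordered = codebook[np.argsort(codebook[:, 0])]
```

On this toy data the two code vectors settle on the means of the two obvious clusters.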

[0022] In step S210, the mel IPSE cepstra of the "medium" range are vector-quantized with the "medium" codebook created in step S209; that is, the cluster to which each "medium" mel IPSE cepstrum belongs is determined. In step S211, using the pitch mark correspondence between the "high" and "medium" speech data found in step S207, each speech feature (mel IPSE cepstrum) extracted from the "high" speech data is assigned to the class of the code vector of the step S209 codebook that corresponds to it. For example, the feature at pitch mark H1 of voiced phoneme A (FIG. 4) is assigned to the class of the code vector number to which the feature at pitch mark M1 was quantized; the feature at H2 is assigned to the class of M2's code vector number; the features at H3 and H4 are each assigned to the class of M3's code vector number; and the feature at H5 is assigned to the class of M4's code vector number. Every "high" feature is classified in the same way under the quantization code vector number of its corresponding "medium" feature, which clusters the features of the "high" speech data.

[0023] In step S212, the centroid (mean) vector of the features in each class of the clustered "high" mel IPSE cepstra is computed and used as the code vector of the "high" range, giving the codebook CB_H. In this way, a time correspondence is established for each one-period waveform, and a mapping codebook, to which the spectral parameters of the "high" speech data are mapped, is created by referring to the clustering result of the "medium" codebook (the reference codebook) CB_M. Step S213 clusters the features (mel IPSE cepstra) of the "low" speech data in the same way as step S211, and step S214 computes the centroid vector of the features in each class to create the codebook CB_L for the "low" range.
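
Steps S210 to S212 can be sketched as follows: each "medium" frame is quantized on the reference codebook, each aligned "high" frame inherits that class, and the class centroids form the mapped codebook. The function name, the fallback for empty classes, and the toy features are assumptions for illustration.

```python
import numpy as np

def build_mapped_codebook(ref_features, ref_cb, other_features, alignment):
    """alignment[i] is the index of the reference frame paired with frame i
    of the other fundamental frequency range."""
    # Class of each reference frame = nearest code vector of the reference codebook.
    dist = np.linalg.norm(ref_features[:, None, :] - ref_cb[None, :, :], axis=2)
    ref_codes = dist.argmin(axis=1)
    # Each frame of the other range inherits the class of its aligned frame.
    other_codes = ref_codes[alignment]
    mapped = np.zeros_like(ref_cb)
    for c in range(len(ref_cb)):
        members = other_features[other_codes == c]
        # ASSUMPTION: an empty class falls back to the reference code vector.
        mapped[c] = members.mean(axis=0) if len(members) else ref_cb[c]
    return mapped, other_codes

cb_mid = np.array([[0.0, 0.0], [1.0, 1.0]])
mid_feats = np.array([[0.1, 0.0], [0.0, 0.1], [0.9, 1.0], [1.0, 1.1]])  # M1..M4
high_feats = np.array([[0.2, 0.2], [0.4, 0.2], [1.2, 1.5],
                       [1.4, 1.3], [1.0, 1.4]])                         # H1..H5
alignment = np.array([0, 1, 2, 2, 3])  # H1-M1, H2-M2, H3-M3, H4-M3, H5-M4
cb_high, high_codes = build_mapped_codebook(mid_feats, cb_mid, high_feats, alignment)
```

Because the classes are inherited rather than re-clustered, code vector c of the mapped codebook always corresponds to code vector c of the reference codebook.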

[0024] This produces three codebooks, CB_L, CB_M, and CB_H, for the "low", "medium", and "high" fundamental frequency ranges, in which code vectors with the same code number correspond one to one. Next, step S215 computes the difference between each pair of corresponding code vectors in the "high" codebook CB_H and the "medium" codebook CB_M to create the difference vector codebook CB_MH. Similarly, step S216 computes the difference between each pair of corresponding code vectors in the "low" codebook CB_L and the "medium" codebook CB_M to create the difference vector codebook CB_ML.

[0025] In this embodiment, steps S217, S218, and S219 further compute the mean fundamental frequencies F_H, F_M, and F_L of the element vectors belonging to each class of the codebooks CB_H, CB_M, and CB_L, respectively. Step S220 computes the difference ΔF_HM between the mean frequencies F_H and F_M of corresponding code vectors in CB_H and CB_M to create the mean frequency difference codebook CB_FMH. Similarly, step S221 computes the difference ΔF_LM between the mean frequencies F_M and F_L of corresponding code vectors in CB_M and CB_L to create the mean frequency difference codebook CB_FML.
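
The difference codebooks of steps S215 to S221 reduce to element-wise subtraction once the classes are aligned. In this sketch the sign convention (other range minus "medium") is an assumption, as are the helper name `class_mean_f0` and the numeric values; only the "high" side is shown.

```python
import numpy as np

def class_mean_f0(f0_values, codes, n_classes):
    """Mean F0 of the frames assigned to each class (NaN for empty classes)."""
    means = np.full(n_classes, np.nan)
    for c in range(n_classes):
        members = f0_values[codes == c]
        if len(members):
            means[c] = members.mean()
    return means

cb_mid = np.array([[0.0, 0.0], [1.0, 1.0]])
cb_high = np.array([[0.3, 0.2], [1.2, 1.4]])
# Difference vector codebook; ASSUMED sign: "high" minus "medium".
cb_mh = cb_high - cb_mid

# Per-frame F0 in Hz with the class of each frame (toy values).
f_mid = class_mean_f0(np.array([118.0, 122.0, 131.0, 129.0]),
                      np.array([0, 0, 1, 1]), 2)
f_high = class_mean_f0(np.array([170.0, 174.0, 181.0, 186.0, 185.0]),
                       np.array([0, 0, 1, 1, 1]), 2)
cb_fmh = f_high - f_mid  # mean frequency difference codebook
```

Storing differences rather than full codebooks keeps one entry per class for both the spectral shift and the F0 gap it was trained on.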

[0026] This embodiment thus prepares five codebooks: the "medium" codebook CB_M, the two difference vector codebooks CB_MH and CB_ML, and the two mean frequency difference codebooks CB_FMH and CB_FML. Next, the speech synthesis procedure that transforms the spectral envelope according to the fundamental frequency, using the five mapping codebooks created by the method of FIG. 5, is described with reference to FIG. 6. The inputs to this algorithm are the speech unit waveform selected by the text-to-speech synthesis section, the fundamental frequency F_0t of the speech to be synthesized, and the fundamental frequency F_0u of the selected speech unit waveform. The output is synthesized speech. Each processing step is described in detail below.

[0027] In step S401, speech features, IPSE cepstra in this example, are extracted from the input speech unit by the same method as the algorithm of FIG. 2 used in steps S201 to S203. In step S402, the frequency scale of the extracted IPSE cepstra is converted to the mel scale to give mel IPSE cepstra.

[0028] In step S403, the speech features obtained in step S402 are fuzzy-vector-quantized using the codebook CB_M for the "medium" fundamental frequency range, created by the algorithm shown in FIG. 5, to obtain the k-nearest-neighbor fuzzy membership function μ_k given by equation (2):

$$\mu_k = \frac{1}{\sum_{j=1}^{k} (d_k / d_j)^{1/(f-1)}} \tag{2}$$

Here d_j is the distance between the input vector and a code vector, f is the fuzziness, and the sum runs from j = 1 to j = k. For details of fuzzy vector quantization see, for example, Nakamura and Shikano, "Spectrogram normalization using fuzzy vector quantization," Journal of the Acoustical Society of Japan, vol. 45, no. 2 (1989), or Ho-Ping Tseng, Michael J. Sabin and Edward A. Lee, "Fuzzy Vector Quantization Applied to Hidden Markov Modeling," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. 641-644, April 1987.
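The membership computation of equation (2) can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `fuzzy_memberships`, the use of Euclidean distance, and the handling of an exact match (distance zero) are assumptions, and the index `i` is used for the neighbor whose membership is computed so that it is not confused with the neighbor count `k`.

```python
import math

def fuzzy_memberships(x, codebook, k, f):
    """Return [(code_index, mu)] for the k code vectors nearest to x.

    Implements mu_i = 1 / sum_j (d_i / d_j) ** (1 / (f - 1)) over the
    k nearest neighbours. Requires f > 1.
    """
    # Euclidean distance is an assumption; the text only says "distance".
    nearest = sorted((math.dist(x, c), i) for i, c in enumerate(codebook))[:k]
    # Assumed special case: an exact hit gets full membership (avoids 0-division).
    if nearest[0][0] == 0.0:
        return [(nearest[0][1], 1.0)] + [(i, 0.0) for _, i in nearest[1:]]
    exponent = 1.0 / (f - 1.0)
    result = []
    for d_i, i in nearest:
        denom = sum((d_i / d_j) ** exponent for d_j, _ in nearest)
        result.append((i, 1.0 / denom))
    return result
```

Because each membership equals d_i^(-e) divided by the sum of d_j^(-e) over the neighbors, the memberships returned by this sketch always add up to 1.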

[0029] In step S404, as shown in equation (3), the difference-vector codebook CB_MH or CB_ML is used to form a weighted combination of the difference vectors V_j of the k nearest neighbors, weighted by the fuzzy membership function μ_j, giving the difference vector V for the input vector:

$$V = \frac{\sum_{j=1}^{k} \mu_j V_j}{\sum_{j=1}^{k} \mu_j} \tag{3}$$

If the fundamental frequency F_0t of the speech to be synthesized is higher than F_0u of the input speech unit, the codebook CB_MH is used; if it is lower, the codebook CB_ML is used. This way of obtaining the difference vector V is the same as in the so-called moving vector field smoothing method, which is described, for example, in Hashimoto and Higuchi, "Spectral mapping for voice conversion using speaker selection and moving vector field smoothing," IEICE Technical Report SP95-1 (1995-05), and in English in Makoto Hasimoto and Norio Higuchi, "Spectral Mapping for Voice Conversion Using Speaker Selection and Vector Field Smoothing," Proceedings of 4th European Conference on Speech Communication and Technology (EUROSPEECH), Vol. 1, pp. 431-434, Sept. 1995.
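Equation (3) and the codebook choice in step S404 can be sketched as below. The function names, the list-of-lists codebook layout, and the `(code_index, mu)` membership pairs are assumptions made for illustration. The text does not say which codebook applies when the two fundamental frequencies are equal, so the sketch arbitrarily falls back to the "low" codebook in that case.

```python
def pick_difference_codebook(f0_target, f0_unit, cb_mh, cb_ml):
    """Choose CB_MH when the target F0 is above the unit's F0, else CB_ML."""
    # Equal F0 is not covered by the text; falling back to cb_ml is an assumption.
    return cb_mh if f0_target > f0_unit else cb_ml

def weighted_difference(memberships, diff_codebook):
    """Equation (3): V = sum_j mu_j * V_j / sum_j mu_j.

    memberships is a list of (code_index, mu) pairs for the k neighbours;
    diff_codebook[i] is the difference vector stored for code i.
    """
    total = sum(mu for _, mu in memberships)
    dim = len(diff_codebook[0])
    v = [0.0] * dim
    for i, mu in memberships:
        for n in range(dim):
            v[n] += mu * diff_codebook[i][n]
    return [component / total for component in v]
```

Dividing by the sum of the memberships keeps the result a proper weighted mean even if the weights passed in are not normalized.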

[0030] In step S405, the scaling ratio r for the difference vector V is obtained by equation (4) from the fundamental frequency F_0t of the speech to be synthesized, the fundamental frequency F_0u of the input speech unit, and the mean-frequency-difference codebook CB_FMH or CB_FML obtained in FIG. 5:

$$r = \frac{F_{0t} - F_{0u}}{\Delta F} \tag{4}$$

$$\Delta F = \frac{\sum_{j=1}^{k} \mu_j \, \Delta F_j}{\sum_{j=1}^{k} \mu_j} \tag{5}$$

The sums run from j = 1 to k, and ΔF_j is the difference in code mean fundamental frequency stored in the codebook CB_FMH or CB_FML.
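Equations (4) and (5) amount to a few lines of arithmetic, sketched here. The function name and data layout are hypothetical. The sketch also assumes that each stored ΔF_j is signed as "other range minus medium range", so that entries of CB_FML are negative and r comes out positive when the pitch is lowered; the text does not state the sign convention.

```python
def scaling_ratio(f0_target, f0_unit, memberships, freq_diff_codebook):
    """Return r = (F0t - F0u) / dF, with dF the membership-weighted mean of dF_j.

    memberships is a list of (code_index, mu) pairs; freq_diff_codebook[i]
    is the mean-F0 difference (in Hz) stored for code i.
    """
    total = sum(mu for _, mu in memberships)
    delta_f = sum(mu * freq_diff_codebook[i] for i, mu in memberships) / total
    return (f0_target - f0_unit) / delta_f
```

With this definition, r = 1 means the target lies exactly one "codebook distance" away from the unit's own fundamental frequency, and r = 0 leaves the spectral envelope unchanged.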

[0031] In step S406, the difference vector V obtained in step S404 is linearly scaled by the scaling ratio r obtained in step S405. In step S407, the difference vector scaled in step S406 is added to the mel IPSE cepstrum (input vector) obtained in step S402, giving a mel IPSE cepstrum transformed according to the fundamental frequency F_0t of the speech to be synthesized.
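Steps S406 and S407 together reduce to one scaled vector addition. The following is only a sketch, with a hypothetical function name and plain Python lists standing in for the cepstral vectors.

```python
def transform_cepstrum(mel_cepstrum, diff_vector, r):
    """Scale the difference vector by r (S406) and add it to the input (S407)."""
    if len(mel_cepstrum) != len(diff_vector):
        raise ValueError("cepstrum and difference vector must have equal length")
    return [c + r * v for c, v in zip(mel_cepstrum, diff_vector)]
```

With r = 0 the input cepstrum is returned unchanged, and with r = 1 the full difference vector is applied, which corresponds to the fixed setting r = 1.0 used in the listening experiment of paragraph [0045].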

[0032] In step S408, the frequency scale of this transformed mel IPSE cepstrum is converted from the mel scale back to the linear scale using Oppenheim's recurrence formula. In step S409, the linear-scale IPSE cepstrum is inverse fast Fourier transformed (zero phase) to obtain a speech waveform whose spectral envelope has been transformed according to F_0t.

[0033] In step S410, the speech waveform obtained in step S409 is passed through a low-pass filter to obtain a waveform containing only the low-frequency components. In step S411, only the high-frequency components are extracted from the speech waveform obtained in step S409 by a high-pass filter. The cutoff frequency of this high-pass filter is made equal to that of the low-pass filter used in step S410.
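The band split with a shared cutoff can be illustrated as follows. The patent does not specify the filter design, so this sketch substitutes a first-order (one-pole) low-pass filter and takes the high band as the residual, which guarantees that both bands share one cutoff and sum back to the input. The function name and the filter choice are assumptions.

```python
import math

def split_bands(signal, cutoff_hz, sample_rate_hz):
    """Split signal into (low, high) such that low[n] + high[n] == signal[n]."""
    # One-pole low-pass coefficient for the requested cutoff (assumed design).
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    low, high = [], []
    state = 0.0
    for x in signal:
        state += alpha * (x - state)   # low-pass update
        low.append(state)
        high.append(x - state)         # complementary high-pass output
    return low, high
```

For example, `split_bands(waveform, 500.0, 12000.0)` would use the 500 Hz band-separation frequency and 12 kHz sampling frequency listed among the experimental conditions in paragraph [0044]. A practical implementation would likely use steeper filters than this first-order stand-in.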

[0034] In step S412, a waveform is cut out of the input speech unit by applying a Hamming window twice the fundamental period in length, centered on the pitch mark position. In step S413, the input waveform cut out in step S412 is passed through the same high-pass filter as used in step S411 to extract its high-frequency components. In step S414, the level of the high-frequency components of the input waveform obtained in step S413 is adjusted to match the level of the high-frequency components, obtained in step S411, of the speech waveform whose spectral envelope was transformed.
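The window cut-out of step S412 and the level adjustment of step S414 might look like the sketch below. The function names are hypothetical, the period is assumed to be given in samples, samples outside the signal are treated as zero, and RMS is assumed as the measure of "level" since the text does not define it.

```python
import math

def cut_pitch_waveform(signal, pitch_mark, period):
    """Cut 2*period samples centred on pitch_mark, weighted by a Hamming window."""
    length = 2 * period
    start = pitch_mark - period
    out = []
    for n in range(length):
        idx = start + n
        sample = signal[idx] if 0 <= idx < len(signal) else 0.0  # zero-pad edges
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
        out.append(sample * window)
    return out

def match_level(source, reference):
    """Scale source so that its RMS equals the RMS of reference."""
    rms_src = math.sqrt(sum(s * s for s in source) / len(source))
    rms_ref = math.sqrt(sum(s * s for s in reference) / len(reference))
    if rms_src == 0.0:
        return list(source)  # nothing to scale
    gain = rms_ref / rms_src
    return [s * gain for s in source]
```

In the flow of FIG. 6, `source` would be the high band of the windowed input waveform (S413) and `reference` the high band of the envelope-transformed waveform (S411).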

[0035] In step S415, the level-adjusted high-frequency component from step S414 and the low-frequency component extracted in step S410 are added together. In step S416, the waveforms obtained in step S415 are arranged according to the desired fundamental frequency F_0t to obtain the synthesized speech. The spectral envelope transformation described above is shown conceptually in FIG. 7. For the vector 11 obtained by fuzzy-vector-quantizing the input vector (the mel IPSE cepstrum obtained in step S402) with the codebook CB_M, the k neighboring code vectors 12 are determined; the difference vectors V_j between these and the corresponding code vectors of the codebook CB_H are obtained from the codebook CB_MH; the difference vector V for the fuzzy-vector-quantized vector 11 is then obtained by equation (3); V is linearly scaled by the scaling ratio r based on equation (4); and the input vector is added to the scaled vector 13 to give the desired transformed vector (mel IPSE cepstrum) 14. The codebooks CB_H and CB_L can also be used instead of the difference-vector codebooks CB_MH and CB_ML. An embodiment for that case is shown in FIG. 8, where processing identical to that of FIG. 6 carries the same step numbers.
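The arrangement in step S416 is a pitch-synchronous overlap-add. A minimal sketch follows, under the simplifying assumption of a constant target period in samples (a real fundamental frequency pattern varies over time); the function name is hypothetical.

```python
def overlap_add(pitch_waveforms, target_period):
    """Place successive one-pitch waveforms target_period samples apart and sum."""
    if not pitch_waveforms:
        return []
    total = max(m * target_period + len(w) for m, w in enumerate(pitch_waveforms))
    out = [0.0] * total
    for m, w in enumerate(pitch_waveforms):
        offset = m * target_period
        for n, sample in enumerate(w):
            out[offset + n] += sample  # overlapping regions accumulate
    return out
```

A shorter `target_period` than the original period raises the pitch of the result, and a longer one lowers it.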

[0036] In this case the mel-scale conversion is omitted to simplify the processing, although it may be performed. In step S801, of the codebooks for the "high" and "low" fundamental frequency ranges, the one whose range is closest to the fundamental frequency of the speech to be synthesized is selected. In step S802, the speech features fuzzy-vector-quantized in step S403 are decoded using the codebook of the fundamental frequency range selected in step S801, for example the codebook CB_H for "high".
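Steps S801 and S802 can be sketched as below. Two points are assumptions rather than statements of the patent: each range is represented by a single mean fundamental frequency for the "closest" comparison, and decoding is taken to be the membership-weighted average of the corresponding code vectors in the selected codebook, which is the usual way to decode a fuzzy vector quantization. The function names are hypothetical.

```python
def select_range(f0_target, range_mean_f0):
    """Step S801: name of the range whose mean F0 is closest to f0_target."""
    return min(range_mean_f0, key=lambda name: abs(range_mean_f0[name] - f0_target))

def fuzzy_decode(memberships, codebook):
    """Step S802: weighted mean of the target-range code vectors.

    memberships holds (code_index, mu) pairs computed against CB_M;
    codebook is the selected range's codebook, indexed like CB_M.
    """
    total = sum(mu for _, mu in memberships)
    dim = len(codebook[0])
    return [
        sum(mu * codebook[i][n] for i, mu in memberships) / total
        for n in range(dim)
    ]
```

As a usage example, `select_range(300.0, {"high": 310.0, "low": 172.0})` picks `"high"`; the two mean values are the ones reported for the speaker in paragraph [0045].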

[0037] In step S409, the vector (speech features) decoded in step S802 is subjected to an IFFT (inverse fast Fourier transform) to obtain a speech waveform. In step S410, the speech waveform obtained in step S409 is passed through a low-pass filter to obtain a waveform containing only the low-frequency components.

[0038] In this example, steps S411 and S414 in FIG. 6 are omitted for simplicity. In step S415, the low-frequency-only waveform obtained in step S410 and the high-frequency-only waveform obtained in step S413 are added together. The subsequent processing is the same as in FIG. 6. A technique that changes the character of speech by taking, from another codebook CB_H, the code vector corresponding to a code vector in one codebook CB_M is described, for example, in H. Matsumoto, "A Minimum Distortion Spectral Mapping Applied to Voice Quality Conversion," ICSLP 90, pp. 161-164.

[0039] In the speech synthesis algorithm shown in FIG. 8, instead of fuzzy-vector-quantizing the speech features in S403, the moving vector field smoothing method may be used: speech data in the "medium" fundamental frequency range is vector-quantized with the codebook for the "medium" range, the movement vector to the codebook of the fundamental frequency range to be synthesized is obtained, and decoding is performed at the destination.

[0040] Furthermore, step S403 is not limited to obtaining the movement vector to the codebook by fuzzy vector quantization or moving vector field smoothing; as in ordinary vector quantization, one input feature vector may be quantized to a single vector code. However, using fuzzy vector quantization or moving vector field smoothing gives better continuity of the time-domain signal obtained in step S416.

[0041] The low-pass filter of step S410 extracts the components in which the difference between the fundamental frequency pattern of the input speech unit and the fundamental frequency pattern to be synthesized affects the spectral envelope. Conversely, the high-pass filter of step S413 extracts the high-frequency components, whose spectral envelope is almost unaffected by a difference (change) in the fundamental frequency pattern. The boundary frequency between the low-frequency and high-frequency components is chosen in the range of about 500 to 2000 Hz.

[0042] The input speech waveform may also first be separated into low-frequency and high-frequency components, which are then passed to steps S401 and S412, respectively, of FIG. 6 or FIG. 8. Above, the invention was applied to text-to-speech synthesis so that the fundamental frequency and the spectrum of the synthesized speech match when the input speech unit and the input fundamental frequency pattern differ greatly. The invention is not limited to this case. It can also be applied to waveform synthesis in general, and in analysis-synthesis, applying it when the fundamental frequency of the synthesized speech is made to differ considerably from that of the analyzed original speech yields good-quality synthesized speech. In that case, the original speech is used as the input speech waveform of FIG. 6, and the codebook for the "medium" fundamental frequency range, that is, the reference codebook, is created for the fundamental frequency range of the original speech in the same way as described above.

[0043] In analysis-synthesis, the original speech corresponds to the input speech unit (input speech waveform) of the above embodiments. The original speech is usually quantized as vector codes of feature values, and speech is synthesized by decoding them. To apply the invention to analysis-synthesis, therefore, the vector code may be decoded in step S802 of FIG. 8, for example, using the codebook corresponding to the fundamental frequency of the synthesized speech. To apply the method of FIG. 6 to analysis-synthesis, the code vector and the difference vector corresponding to the vector code of the speech to be synthesized are taken from the codebook CB_M and the difference-vector codebook CB_MH or CB_ML, respectively; a scaling ratio is obtained from the difference between the fundamental frequency of the original speech and that of the speech to be synthesized; the difference vector is scaled by this ratio; and the result is added to the code vector.

[0044] Each of the speech synthesis processes described above is normally carried out by a DSP (Digital Signal Processor) or the like interpreting and executing a program; the program for this purpose is recorded on a recording medium. A listening experiment in which the invention was applied to text-to-speech synthesis is described next. From the 520 words of the ATR phoneme-balanced word set, uttered by one female speaker at three pitch levels (high, medium and low), 327 words per pitch were used to create the codebooks and 74 words per pitch were used as evaluation data. The experimental conditions were: sampling frequency 12 kHz; band-separation frequency 500 Hz (the cutoff frequency of the filters in steps S410, S411 and S413); codebook size 512; cepstrum order 30 (for the features obtained by the method of FIG. 2); number of nearest neighbors k = 12; and fuzziness 1.5.

[0045] Next, to evaluate whether transforming the spectral envelope by codebook mapping is effective in improving the quality of synthesized speech, a listening experiment on fundamental-frequency-modified speech was conducted. For five words, three synthesized sounds were evaluated by the ABX method: natural speech B, which has the same text as natural speech A but a different fundamental frequency range, with its fundamental frequency pattern changed to that of natural speech A by the conventional PSOLA method (prior art: synthesized sound (1)); the correct speech (natural speech A) used as input (synthesized sound (2)); and natural speech B with its fundamental frequency pattern changed to that of natural speech A by the method shown in FIG. 6 (synthesized sound (3)). Synthesized sounds (1) and (3) were used as A and B, respectively, and synthesized sounds (1) to (3) as X; the subjects judged whether X was closer to A or to B. The fundamental frequency pattern was modified from medium pitch (mean fundamental frequency 216 Hz) to low pitch (mean fundamental frequency 172 Hz) and from medium pitch to high pitch (mean fundamental frequency 310 Hz), by swapping the fundamental frequency patterns of utterances of the same word in different pitch ranges. The scaling ratio r of the difference vector was fixed at 1.0, and power and phoneme duration were matched to those of the word serving as the target of the fundamental frequency modification. There were 12 subjects. From the results of the listening experiment, the judgment rate CR (CR = Pj/Pa × 100 (%)) was obtained, where Pj is the number of times X was judged to be closer to synthesized sound (3) and Pa is the number of presentations. The results are shown in FIGS. 9A and 9B.
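The judgment rate defined above is a simple percentage. The short sketch below only illustrates the formula CR = Pj/Pa × 100; the function name is hypothetical, and the patent reports rates but not the underlying counts.

```python
def judgment_rate(pj, pa):
    """CR in percent: share of presentations where X was judged closer to sound (3)."""
    if pa <= 0:
        raise ValueError("number of presentations must be positive")
    return pj / pa * 100.0
```

For instance, with purely hypothetical counts, 51 such judgments out of 60 presentations would give a rate of 85%.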

[0046] FIG. 9A is for conversion from medium pitch to low pitch, where the judgment rate for the natural speech (2) is 85%; when the pitch is raised from medium to high (FIG. 9B), the judgment rate for the natural speech is 59%. This shows that the invention can synthesize fundamental-frequency-modified speech that is closer to natural speech than the conventional PSOLA method can. The invention is found to be particularly effective when the fundamental frequency is lowered.

[0047] The method of FIG. 6 applied to text-to-speech synthesis was compared with the conventional PSOLA method. Five sentences taken from the 503 ATR phoneme-balanced sentences were synthesized in the pitch ranges "low", "medium" and "high" and evaluated by a preference test. To keep unnaturalness in rule-generated pitch patterns from influencing the test, pitch patterns extracted from natural speech were used as the "medium" pitch fundamental frequency patterns. The "high" and "low" pitch patterns were created by raising and lowering that pitch range, and were used for synthesis. The codebooks used for spectral envelope transformation were the same as in the preceding experiment, as were the experimental conditions. The results are shown in FIGS. 10A, 10B and 10C: A is the low pitch range, B the medium pitch range, and C the high pitch range. The results show that, for synthesized speech in the "low" and "medium" pitch ranges, the subjects preferred the method of the invention over the PSOLA method.

[0048] A listening experiment comparing the method of the invention shown in FIG. 8 with the conventional method (PSOLA) is described next. The experimental conditions were the same as before except that the band-separation frequency was 1500 Hz. In this experiment, fundamental-frequency-modified speech synthesized by the conventional waveform synthesis method was compared by listening with speech produced by the method of the invention. To see the maximum potential of the method of the invention, the transformation of the low-band spectral envelope (IPSE) was assumed to be perfect, and the spectral envelope extracted from the word that was the target of the fundamental frequency pattern modification (the correct spectral envelope) was input. The fundamental frequency pattern was modified from high pitch to low pitch and from low pitch to high pitch, by swapping the fundamental frequency patterns of utterances of the same word in different pitch ranges. Power and phoneme duration were matched to those of the F0 modification target word. Five words were evaluated by paired comparison on a five-point scale. There were eight subjects. The result is shown in FIG. 11A, from which it can be seen that speech synthesized by the method of the invention is of considerably higher quality than speech synthesized by conventional waveform synthesis.

[0049] In FIG. 11A, rating 1 means that conventional waveform synthesis is much better; rating 2, that conventional waveform synthesis is slightly better; rating 3, no difference; rating 4, that the method of the invention is slightly better; and rating 5, that the method of the invention is much better. An experiment similar to the one whose results are shown in FIG. 9 was also carried out. The experimental conditions were the same as before except that the band-separation frequency was 1500 Hz. The results are shown in FIGS. 11B and 11C: B is the modification from medium pitch to low pitch, and C is the modification from medium pitch to high pitch.

[0050] The judgment rates for synthesized sounds (1) and (2) were 21% and 91%, respectively, when the fundamental frequency was modified from medium pitch to low pitch, and 10% and 94% from medium pitch to high pitch. The judgment rate for synthesized sound (3) was 90% from medium to low pitch and 85% from medium to high pitch, showing that codebook mapping transformed the low-band spectral envelope appropriately. Taken together with the result of FIG. 10A, this shows that the speech synthesis method of the invention can synthesize higher-quality fundamental-frequency-modified speech than the conventional waveform synthesis method.

[0051]

[Effects of the Invention] As described above, according to the invention, in a text-to-speech synthesis system, for example, the degradation in synthesized speech quality caused by synthesizing with a greatly changed speech-unit fundamental frequency pattern can be prevented. As a result, higher-quality speech can be synthesized than with conventional text-to-speech synthesis systems. In analysis-synthesis, too, high-quality synthesized speech can be obtained even when the fundamental frequency differs considerably from that of the original speech. In other words, synthesizing more human-like, emotionally expressive speech requires modifying the fundamental frequency pattern in many ways, and the invention makes it possible to synthesize such speech with high quality.

[Brief description of drawings]

FIG. 1 is a diagram showing the basic procedure of the principle of the invention.

FIG. 2 is a flowchart showing the algorithm used in the invention to extract the spectral envelope from a speech waveform.

FIG. 3 is a diagram explaining the sampling points of the maximum values in the algorithm of FIG. 2.

FIG. 4 is a diagram explaining the correspondence of pitch marks between speech data of different fundamental frequency ranges in the invention.

FIG. 5 is a flowchart showing, for one embodiment of the invention, how the three mapping codebooks that are built into the text-to-speech synthesis system in advance are created.

FIG. 6 is a flowchart showing, for an embodiment of the invention, the algorithm that transforms the spectral envelope of a speech unit according to the fundamental frequency pattern to be synthesized.

FIG. 7 is a diagram showing the concept of the spectral envelope transformation by difference vectors shown in FIG. 6.

FIG. 8 is a flowchart showing, for another embodiment of the invention, the algorithm that transforms the spectral envelope of a speech unit according to the fundamental frequency pattern to be synthesized.

FIGS. 9A and 9B are diagrams showing experimental results that illustrate the effect of the embodiment shown in FIG. 6.

FIGS. 10A and 10B are diagrams showing other experimental results that illustrate the effect of the embodiment shown in FIG. 6.

FIGS. 11A to 11C are diagrams showing experimental results that illustrate the effect of the embodiment shown in FIG. 8.

Continuation of front page

Application filed for the exception under Article 30(1) of the Patent Act: Kimihito Tanaka and Masanobu Abe, "Application of a speech synthesis method that transforms the spectral envelope according to F0 to a synthesis-by-rule system," Proceedings of the 1997 Spring Meeting of the Acoustical Society of Japan, 2-7-1, pp. 217-218, March 17, 1997.

Application filed for the exception under Article 30(1) of the Patent Act: Kimihito Tanaka and Masanobu Abe, "A New Fundamental Frequency Modification Algorithm with Transformation of Spectrum Envelope According to F0," Proc. ICASSP 97, Vol. II, pp. 951-954, April 21, 1997.

(56) References: JP S56-55999 (JP, A); JP H1-237600 (JP, A); JP H1-97997 (JP, A); JP H7-104792 (JP, A); JP H8-248994 (JP, A); JP H9-152892 (JP, A); Kimihito Tanaka and Masanobu Abe, "A speech synthesis method that transforms the spectral envelope according to the fundamental frequency," Proceedings of the 1996 Autumn Meeting of the Acoustical Society of Japan, September 25, 1996, 1-4-14, pp. 217-218.

(58) Fields searched (Int. Cl.7, DB name): G10L 13/00; G10L 13/06; G10L 21/04; JICST file (JOIS)

Claims

(57) [Claims]

1. A speech synthesis method for synthesizing, from an input speech-unit waveform (hereinafter referred to as the input speech), speech having a desired fundamental frequency different from the fundamental frequency of the input speech, characterized by: using a codebook created, from learning speech data of different fundamental frequency ranges, for the spectral envelope of the fundamental frequency range of the input speech (hereinafter, this codebook is referred to as the reference codebook), and a difference-vector codebook consisting of the difference vectors between each code vector of the reference codebook and the corresponding code vector of a codebook whose fundamental frequency range differs from that of the input speech; vector-quantizing the spectral envelope of the input speech using the reference codebook; obtaining the difference vector corresponding to the vector-quantized code from the difference-vector codebook; scaling the difference vector according to the amount by which the desired fundamental frequency deviates from the fundamental frequency of the input speech; and obtaining, from the sum of the scaled difference vector and the vector-quantized code vector, the input speech with its spectral envelope transformed.

2. The speech synthesis method according to claim 1, wherein the vector quantization is fuzzy vector quantization.

3. A one speech synthesis method according to claim 1 or 2, wherein creating the reference codebook clustered by statistical methods the spectral envelope of learning speech data in the same basic frequency range and the input speech , Between the input voice and the learning voice data having a different fundamental frequency range, and the learning voice data having the same fundamental frequency range as the input voice, the time axis of the pitch mark in each voiced phoneme in the same text. The linear expansion / contraction matching is performed above, the time correspondence is obtained for each period waveform, and the codebook having a different fundamental frequency range from the input speech is created with reference to the clustering result in the reference codebook. Characteristic speech synthesis method.

4. The speech synthesis method according to any one of claims 1 to 3, wherein, on a logarithmic power spectrum, the maximum value in the vicinity of each integral multiple of the fundamental frequency is sampled, the sampling points are interpolated by straight lines, the interpolated linear pattern is sampled at equal intervals, the resulting sample sequence is approximated with a cosine model, and the coefficients of the model are used as the spectral envelope.
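
The envelope extraction of claim 4 can be sketched in Python as follows. This is an illustrative sketch only: the size of the search window around each harmonic (a quarter of the fundamental frequency), the number of samples, the model order, and the use of the orthogonal DCT-II sample positions to obtain the cosine coefficients are assumptions that the claim does not specify, and all function names are hypothetical.

```python
import math


def harmonic_peaks(log_spec, bin_hz, f0):
    """Maximum of the log power spectrum near each integral multiple of f0.

    Assumes f0 spans at least four bins so that search windows do not overlap.
    """
    half = max(1, int(0.25 * f0 / bin_hz))          # assumed search half-width
    peaks, h = [], 1
    while round(h * f0 / bin_hz) < len(log_spec):
        centre = int(round(h * f0 / bin_hz))
        lo, hi = max(0, centre - half), min(len(log_spec), centre + half + 1)
        k = max(range(lo, hi), key=lambda i: log_spec[i])
        peaks.append((k * bin_hz, log_spec[k]))
        h += 1
    return peaks


def interp(points, x):
    """Straight-line interpolation through (x, y) points; flat outside their range."""
    if x <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1 and x1 > x0:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return points[-1][1]


def cosine_envelope(log_spec, bin_hz, f0, n_samples=64, order=20):
    """Coefficients a_m of y(lam) = sum_m a_m * cos(m * lam), lam in [0, pi]."""
    peaks = harmonic_peaks(log_spec, bin_hz, f0)
    f_max = (len(log_spec) - 1) * bin_hz
    # Equal-interval samples of the interpolated line (DCT-II midpoints).
    ys = [interp(peaks, f_max * (n + 0.5) / n_samples) for n in range(n_samples)]
    coeffs = []
    for m in range(order + 1):
        s = sum(y * math.cos(math.pi * m * (n + 0.5) / n_samples)
                for n, y in enumerate(ys))
        coeffs.append(s / n_samples if m == 0 else 2.0 * s / n_samples)
    return coeffs


def evaluate_envelope(coeffs, freq_hz, f_max):
    """Value of the cosine model at `freq_hz`, with f_max mapped to lam = pi."""
    lam = math.pi * freq_hz / f_max
    return sum(a * math.cos(m * lam) for m, a in enumerate(coeffs))


# Synthetic spectrum: harmonics every 8 bins (f0 = 250 Hz), levels falling with frequency.
spec = [-0.01 * i if i % 8 == 0 else -10.0 for i in range(257)]
coeffs = cosine_envelope(spec, bin_hz=31.25, f0=250.0)
```

Because the samples are taken at the DCT-II midpoints, the truncated coefficient set is the least-squares fit of the cosine model to the sample sequence, which is one reasonable reading of "approximating the sampling sequence with a cosine model".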

5. The speech synthesis method according to any one of claims 1 to 4, wherein the deformation processing of the spectral envelope is performed only on components of the spectral region lower than a predetermined frequency.

6. The speech synthesis method according to any one of claims 1 to 4, wherein the spectral envelope of the input speech is converted to the mel scale, the spectral envelope deformation processing is then performed, and the spectral envelope resulting from the deformation processing is converted back to a linear scale.

7. The speech synthesis method according to any one of claims 1 to 6, wherein the different codebooks are a codebook for a fundamental frequency range higher than the fundamental frequency range of the input speech and a codebook for a fundamental frequency range lower than that of the input speech.

8. A speech synthesizing device for synthesizing, from an input speech segment waveform (hereinafter referred to as input speech), speech having a desired fundamental frequency different from its fundamental frequency, comprising: a reference codebook created by clustering, by a statistical method, the spectral envelopes of learning speech data having the same fundamental frequency range as the input speech; an other-range codebook created, in association with the code vectors of the reference codebook, from learning speech data having a fundamental frequency range different from that of the input speech and the same text as the above learning speech data; a difference vector codebook consisting of the difference vectors between corresponding code vectors of the other-range codebook and the reference codebook; a difference frequency codebook consisting of the differences between the average fundamental frequencies of the element vectors of corresponding classes of the reference codebook and the other-range codebook; quantization means for vector quantizing the spectral envelope of the input speech using the reference codebook; difference vector evaluation means for obtaining, using the difference vector codebook, the difference vector corresponding to the code quantized by the quantization means; expansion/contraction ratio calculation means for calculating an expansion/contraction ratio from the fundamental frequency of the input speech, the desired fundamental frequency, and the difference frequency obtained from the difference frequency codebook for the quantized code; expansion/contraction means for expanding or contracting the difference vector according to the expansion/contraction ratio; means for adding the expanded or contracted difference vector and the spectral envelope of the input speech; and time domain conversion means for converting the added spectral envelope into the time domain.
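
The expansion/contraction ratio calculation means of claim 8 takes three quantities: the fundamental frequency of the input speech, the desired fundamental frequency, and the difference frequency read from the difference frequency codebook. The claim does not give the formula; the Python sketch below assumes the simplest linear rule, in which the ratio is the requested frequency shift divided by the difference frequency of the selected class. The function name and all numerical values are hypothetical.

```python
def stretch_ratio(f0_in, f0_target, diff_f0):
    """Expansion/contraction ratio under an assumed linear rule."""
    if diff_f0 == 0.0:
        raise ValueError("difference frequency must be non-zero")
    return (f0_target - f0_in) / diff_f0


# Hypothetical difference frequency codebook: per-class gap in mean fundamental
# frequency (Hz) between the other-range codebook and the reference codebook.
diff_f0_cb = [78.0, 85.0, 80.0]
code = 2                                    # index returned by the quantization means
ratio = stretch_ratio(f0_in=120.0, f0_target=160.0, diff_f0=diff_f0_cb[code])

# Scale the difference vector of the same class and add it to the input envelope.
diff_vec = [0.5, 0.2, 0.0]
input_env = [0.9, 0.1, 0.0]
modified_env = [e + ratio * d for e, d in zip(input_env, diff_vec)]
```

Under this rule a ratio of 1 corresponds to moving the full distance to the other fundamental frequency range, and a ratio of 0 leaves the spectral envelope of the input speech unchanged.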

9. The speech synthesizing device according to claim 8, further comprising: a low-pass filter for extracting the low-frequency component of the signal converted into the time domain; a high-pass filter, having the same cut-off frequency as the low-pass filter, for extracting the high-frequency component of the input speech; and means for adding the output of the low-pass filter and the output of the high-pass filter.
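
The band combination of claim 9 is sketched below in Python. The claim does not specify the filter type; a first-order recursive low-pass filter and its complementary high-pass filter (the signal minus its own low-pass output, which shares the same cut-off) are assumed here purely for illustration, and the function names and the `alpha` parameter are hypothetical.

```python
def one_pole_lowpass(signal, alpha):
    """First-order low-pass; `alpha` in (0, 1] sets the cut-off (larger = higher)."""
    out, state = [], 0.0
    for x in signal:
        state += alpha * (x - state)
        out.append(state)
    return out


def combine_bands(modified, original, alpha=0.2):
    """Low band of the modified signal plus high band of the original signal."""
    low = one_pole_lowpass(modified, alpha)
    # Complementary high-pass with the same cut-off: input minus its low-pass.
    high = [x - l for x, l in zip(original, one_pole_lowpass(original, alpha))]
    return [l + h for l, h in zip(low, high)]


# If the envelope modification changed nothing, the input is reproduced.
original = [0.0, 1.0, 0.5, -0.5, -1.0, 0.0, 0.25]
unchanged = combine_bands(original, original)
```

Choosing complementary filters means that the two bands sum back to the original signal whenever the modified and original signals coincide, so the band split itself introduces no coloration.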

10. A computer-readable recording medium on which is recorded a program for causing a computer to function as the speech synthesizing device according to claim 8 or 9.