JP2006276660A

JP2006276660A - Method of indicating features of intonation variation by modification of tone and computer program thereof

Info

Publication number: JP2006276660A
Application number: JP2005098067A
Authority: JP
Inventors: Ni Jinfu; ジンフ・ニ; Hisashi Kawai; 恒河井
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2005-03-30
Filing date: 2005-03-30
Publication date: 2006-10-12
Anticipated expiration: 2025-03-30
Also published as: JP4793776B2

Abstract

PROBLEM TO BE SOLVED: To provide a method capable of measuring the intonation variations underlying speech under natural conditions.

SOLUTION: The method of measuring intonation variations includes a step 80 of preparing a prescribed set of reference (citation) values 56 of fundamental frequency (F0) targets for each lexical tone, obtained from isolated syllables 50 of a speaker. The set of reference values of the F0 targets characterizes the corresponding lexical tone. The method further includes a step 82 of extracting F0 target values for each syllable in sample speech data 52 of the speaker, and a step of calculating, for each F0 target value of each syllable in the sample speech data 52, a prescribed first parameter 58 that measures the change from the reference value of the syllable's lexical tone to that F0 target value.

COPYRIGHT: (C) 2007, JPO & INPIT

Description

The present invention relates to spoken language processing and, more particularly, to measuring intonation variation in spoken language in order to synthesize speech with a desired intonation.

The fundamental frequency (F0) contour of Chinese (intonation in the general sense) manifests both the lexical tones and the actual intonation (excluding the lexical tones), such as the contrast between declarative and interrogative sentences. There are four lexical tones, traditionally called the first, second, third and fourth tones (Tone 1 to Tone 4), each with distinctive characteristics that set it apart from the others, and a neutral tone (Tone 0) that has no such salient characteristics.

The tone type is a direct constituent of a Chinese syllable. For example, "ma" has the following five different meanings depending on the tone type.

This raises an important problem: when synthesizing intonation in text-to-speech (TTS) synthesis, how should the interaction between the lexical tones and the actual intonation be accounted for? This is critical when applying TTS to dialogue systems, where, for example, questions, message confirmations, and emotions are realized by humans with intonation patterns that are normally distinct from syllable-level intonation (that is, the lexical tones) and also from those of ordinary declarative sentences [see Non-Patent Document 1].

A possible solution is the Fujisaki model, which decomposes the F0 contour into accent and phrase components [see Non-Patent Document 2]. Intonation variation could be distributed over both the accent and the phrase components, but the number of model parameters is limited. To address the effect of the actual intonation on the lexical tones, linguists generally focus on changes in pitch range at the level of the syllable [Non-Patent Document 3] or the phrase [Non-Patent Document 4].

[Non-Patent Document 1] G. Kochanski and C. Shih, "Prosody modeling with soft templates," Speech Communication, Vol. 39, pp. 311-352, 2003.
[Non-Patent Document 2] H. Fujisaki and K. Hirose, "Analysis of voice fundamental frequency contours for declarative sentences of Japanese," J. Acoust. Soc. Japan, Vol. 5, No. 4, pp. 233-242, 1984.
[Non-Patent Document 3] J. Shen, "Pitch range of tone and intonation in Beijing dialect," in Working papers in experimental phonetics, ed. by T. Lin and L. J. Wang, Beijing Univ. Press, pp. 73-130, 1985. (in Chinese)
[Non-Patent Document 4] Z. Wu, "A new method of intonation analysis for standard Chinese: frequency transposition processing of phrasal contours in a sentence," Analysis, perception and processing of spoken language, ed. by G. Fant, et al., pp. 255-268, 1996.
[Non-Patent Document 5] Y. R. Chao, A grammar of spoken Chinese. Berkeley, University of California Press, 1968.
[Non-Patent Document 6] P. Kratochvil, Intonation in Beijing Chinese, in Intonation systems, a survey of twenty languages, ed. by D. Hirst and A. D. Cristo, Cambridge Uni. Press, 417-431, 1998.
[Non-Patent Document 7] J. Ni and K. Hirose, "Experimental evaluation of a functional modeling of fundamental frequency contours of standard Chinese sentences," ISCSLP2000, Beijing, pp. 319-322, 2000.
[Non-Patent Document 8] J. Ni and H. Kawai, "Pitch targets anchor Chinese tone and intonation patterns," Speech Prosody 2004, Nara, pp. 95-98, 2004.
[Non-Patent Document 9] J. Ni and H. Kawai, "Tone feature extraction through parametric modeling and analysis-by-synthesis-based pattern matching," ICASSP2003, pp. 72-75, 2003.
[Non-Patent Document 10] J. Ni and H. Kawai, "Skeletonising Chinese fundamental frequency contours with a functional model and its evaluation," TAL2004, pp. 151-154, Beijing, 2004.
[Non-Patent Document 11] J. 't Hart, R. Collier and A. Cohen, A perceptual study of intonation: an experimental-phonetic approach to speech melody, Cambridge University Press, 1990.

The limitation of such approaches is that the measured pitch range still contains, to some degree, the effect of the lexical tones. Furthermore, if all the lexical tones in an utterance happen to be Tone 1, the pitch range cannot be calculated, because Tone 1 is a high-level tone and offers no low-register feature that could serve as a reference for estimating the pitch range.

The present invention approaches the problem of measuring intonation variation from a different direction, so as to avoid the difficulty that arises when separating the dependence on tone type, including the tonal variation within the reference values taken from isolated syllables, from the undulation of the F0 contour.

Accordingly, one object of the present invention is to provide a method capable of measuring the intonation variation underlying speech under natural conditions.

Another object of the present invention is to provide a method capable of measuring the intonation variation underlying speech without being affected by the lexical tones.

According to a first aspect of the present invention, a method of characterizing intonation types by tone transformation includes the step of preparing, for each lexical tone, a predetermined set of reference values for fundamental frequency (F0) targets obtained from isolated syllables of a speaker, the set of reference values of the F0 targets characterizing the corresponding lexical tone. The method further includes the steps of extracting F0 target values for each syllable in sample speech data of the speaker, and calculating, for each F0 target value of each syllable in the sample speech data, a predetermined first parameter representing the degree of change from the reference value for the lexical tone of that syllable to the F0 target value.

Preferably, the preparing step includes the steps of: recording a plurality of isolated syllables uttered by the speaker for each lexical tone; extracting F0 target values of the recorded syllables according to their respective lexical tones; and, for each lexical tone, averaging the F0 target values of each F0 target that characterizes that tone.

More preferably, the method further includes the step of normalizing the predetermined first parameter into a predetermined second parameter such that the distribution of the second parameter is balanced on both sides of a predetermined reference value of the second parameter.

A second aspect of the present invention relates to a computer program that, when executed on a computer, causes the computer to perform all the steps of any of the methods described above.

A. Overview of the Method
A.1 Transformation
The transformation built on the functional model treated in Non-Patent Document 7 makes it possible to map F0 contours in various vocal ranges into a normalized space called the λ-time space. Here, f0 denotes F0 in hertz, and λ denotes F0 expressed in λ (normalized frequency). The transformation between f0 and λ is given by the following equation.

Here, A(λ, ζ) represents the amplitude-frequency response of a simple resonant system.

ζ represents the damping ratio of the resonant system. Physically, the damping ratio corresponds to the viscous resistance in the resonant system. The other model parameters are as follows.

[f0_b, f0_t]: the bottom (lowest) and top (highest) frequencies of the vocal range
[λ_b, λ_t]: the bottom and top of the vocal range expressed in λ

The vocal range [f0_b, f0_t] depends on the speaker. In practice, it can be measured as the frequency range of the target speaker's utterances. In most cases, λ_t and λ_b can be fixed at 1 and 2, respectively.

Given λ and ζ, f0 can be calculated directly from the above transformation. For convenience, let T_f0() denote the transformation from λ to f0 at ζ.

f0 = T_f0(λ, ζ)   (3)

Conversely, λ (or ζ) can be determined by an iterative procedure when f0 and ζ (or λ) are given. Let T_λ() denote the transformation from f0 to λ at ζ. The larger f0 is, the smaller the corresponding value of λ.

λ = T_λ(f0, ζ)   (4)

Further, let T_ζ() denote the ζ that transforms λ into f0.

ζ = T_ζ(λ, f0)   (5)
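
The three transformations T_f0(), T_λ() and T_ζ() can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment. The equations that define the mapping and A(λ, ζ) are not reproduced in this text, so the resonator response in `amplitude` and the log-frequency interpolation in `t_f0` are assumed forms. The function names, the bisection solver, and the search bounds for ζ are hypothetical, and the sketch assumes that f0 rises monotonically with ζ at a fixed λ.

```python
import math

LAM_T, LAM_B = 1.0, 2.0  # λ_t and λ_b, fixed at 1 and 2 as in the text


def amplitude(lam, zeta):
    # ASSUMED form of A(λ, ζ): second-order resonator response whose
    # peak lies at λ = 1 (λ read as a normalised squared frequency).
    c = 1.0 - 2.0 * zeta ** 2
    return 1.0 / math.sqrt((1.0 - c * lam) ** 2 + 4.0 * zeta ** 2 * c * lam)


def t_f0(lam, zeta, f0_b, f0_t):
    # T_f0: λ -> f0 in Hz; the log-frequency interpolation is an assumption.
    a, a_b, a_t = (amplitude(x, zeta) for x in (lam, LAM_B, LAM_T))
    ratio = (a - a_b) / (a_t - a_b)
    return math.exp(math.log(f0_b) + ratio * math.log(f0_t / f0_b))


def solve(func, target, lo, hi, increasing, iters=80):
    # Plain bisection for a monotone func on [lo, hi].
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def t_lambda(f0, zeta, f0_b, f0_t):
    # T_λ: f0 -> λ by iteration; f0 falls as λ rises, hence increasing=False.
    return solve(lambda lam: t_f0(lam, zeta, f0_b, f0_t), f0, LAM_T, LAM_B, False)


def t_zeta(lam, f0, f0_b, f0_t, zeta_lo=0.01, zeta_hi=0.7):
    # T_ζ: the ζ that carries λ to f0; assumes f0 rises with ζ at fixed λ.
    return solve(lambda z: t_f0(lam, z, f0_b, f0_t), f0, zeta_lo, zeta_hi, True)
```

With an illustrative vocal range of 80 Hz to 300 Hz, `t_lambda(150.0, 0.156, 80.0, 300.0)` returns a λ between 1 and 2, and passing that λ back through `t_f0` with the same ζ recovers 150 Hz. Bisection is used only because the text says the inverse directions are obtained iteratively; any root finder for a monotone function would serve.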
A.2 Tone Transformation
This transformation provides a way to measure the change from f0_1 to f0_2 within [f0_b, f0_t], expressed by ζ as follows.

ζ = T_ζ(T_λ(f0_1, ζ_0), f0_2)   (6)

Here, ζ_0 is the reference value of ζ at which both f0_1 and f0_2 are mapped to λ values. Preferably, ζ_0 is fixed at 0.156.

To guarantee a one-to-one mapping between f0_1 and f0_2, ζ must belong to the interval (0, 0.7]. This leads to a constraint between f0_1 and f0_2 for each ζ, as shown in FIG. 1, under the condition f0_1 = T_f0(λ_1, ζ_0).

λ_2 = T_λ(T_f0(λ_1, ζ_0), ζ)   (7)

As ζ moves away from the reference ζ_0 (= 0.156), λ_1 is mapped nonlinearly and monotonically to λ_2, and the range narrows sharply toward both ends of the interval [1, 2].

To balance ζ on both sides of ζ_0, the normalized damping ratio ζ_n, with ζ_n ∈ [-1, 1], is defined as follows.
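
The idea of balancing ζ about ζ_0 can be illustrated with a short Python sketch. The defining equation for ζ_n is not reproduced in this text, so the piecewise-linear mapping below is purely an assumption chosen for illustration; the names `normalize_zeta` and `denormalize_zeta` are hypothetical. It maps ζ_0 to 0, the upper bound 0.7 to +1, and the limit ζ → 0 to -1.

```python
ZETA_0 = 0.156   # reference damping ratio (value given in the text)
ZETA_MAX = 0.7   # upper end of the admissible interval (0, 0.7]


def normalize_zeta(zeta):
    # ASSUMED mapping, not the patent's equation: linear on each side of
    # ZETA_0 so that both sides span the same normalised distance.
    if zeta >= ZETA_0:
        return (zeta - ZETA_0) / (ZETA_MAX - ZETA_0)
    return (zeta - ZETA_0) / ZETA_0


def denormalize_zeta(zeta_n):
    # Inverse of the assumed mapping above.
    if zeta_n >= 0.0:
        return ZETA_0 + zeta_n * (ZETA_MAX - ZETA_0)
    return ZETA_0 + zeta_n * ZETA_0
```

Under this assumed mapping a syllable whose targets match the reference values gets ζ_n = 0, and deviations on either side of ζ_0 are reported on the same [-1, 1] scale even though the raw intervals (0, 0.156) and (0.156, 0.7] differ in width.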

This method can be extended to measure the change between two sequences of F0 targets, such as lexical tones and pitch accents. All F0 targets within a tone are expressed as relative quantities in λ at the same ζ_0. The advantage of using this method to measure the change between two tones is that the internal variation within a tone is visible, so that the actual tonal change can be measured.

FIGS. 2 to 4 show examples of applying this tone transformation to Mandarin tones. FIG. 2(a) shows the four lexical tones (Tone 1 to Tone 4 overlaid on the same time axis, as shown in box 30) repeated six times, and FIG. 2(b) shows ζ_n = 0, which represents the reference lexical tones with no tonal change. As shown in FIG. 3(b), when ζ_n changes linearly from 0 to -1 over 2 seconds, the tone sequence of FIG. 2(a) changes into that shown in FIG. 3(a). When ζ_n follows the thick line in FIG. 4(b), the tone sequence of FIG. 2(a) changes into the thick line shown in FIG. 4(a). Indeed, the narrowing of the pitch range in the very high and very low regions of the vocal range is often observed in actual speech.

A.3 Measuring Intonation Variation
The intonation of a syllable is called a tone. The time-F0 contour that coincides with a syllable is known as a tone pattern. Following Chao's tone theory [see Non-Patent Document 5], the four lexical tones are represented as four tone patterns, each of which is further represented by a few selected F0 targets, as shown in FIG. 5. Each tone is characterized by its main targets [see Non-Patent Document 6]. In FIG. 5, the main targets are indicated by filled circles.

The tonal change manifested in an F0 contour is a modification, in a particular manner, of the underlying lexical tones [see Non-Patent Document 6]. An F0 contour can be reliably represented by a sequence of F0 targets, and the number and types of the F0 targets can be determined from the underlying lexical tones according to the tone patterns [see Non-Patent Document 8]. Accordingly, an algorithm that measures tonal change from an F0 contour using the tone transformation basically includes the following steps.

・Initialization: Determine the reference values of the F0 targets for the four tone patterns from the average F0 targets measured on isolated syllables uttered by the speaker.

・Step 1: Extract the F0 targets (observed values) from the F0 contour according to the tone patterns of FIG. 5. The algorithms described in Non-Patent Documents 9 and 10 can be used to estimate the F0 targets from the F0 contour; they first extract tone features and then convert them into F0 targets.

・Step 2: Form the pairs (f0_i, ^f0_i) according to the tone patterns. Here, f0_i denotes the observed value of the i-th F0 target and ^f0_i denotes its reference value (the "^" symbol preceding "f" should properly be written above the f). For Tone 0, the reference value of its F0 target is simply taken to be the reference value of the last F0 target of the preceding tone.

・Step 3: Calculate ζ_i = T_ζ(T_λ(^f0_i, ζ_0), f0_i) and ζ_n for i = 1, ..., N (the number of F0 targets). These values characterize the intonation variation.
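
Steps 2 and 3 above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the embodiment's code: the form of A(λ, ζ) in `amplitude` and the log-frequency interpolation in `t_f0` are assumed, because the defining equations are not reproduced in this text, and all function names, the bisection solver, the ζ search bounds, and the example values in hertz are hypothetical. Extraction of the observed F0 targets (Step 1) and the normalization to ζ_n are outside the sketch.

```python
import math

LAM_T, LAM_B = 1.0, 2.0  # λ_t and λ_b as fixed in the text


def amplitude(lam, zeta):
    # ASSUMED form of A(λ, ζ): resonator response peaking at λ = 1.
    c = 1.0 - 2.0 * zeta ** 2
    return 1.0 / math.sqrt((1.0 - c * lam) ** 2 + 4.0 * zeta ** 2 * c * lam)


def t_f0(lam, zeta, f0_b, f0_t):
    # T_f0 with an assumed log-frequency interpolation.
    a, a_b, a_t = (amplitude(x, zeta) for x in (lam, LAM_B, LAM_T))
    return math.exp(math.log(f0_b) + (a - a_b) / (a_t - a_b) * math.log(f0_t / f0_b))


def solve(func, target, lo, hi, increasing, iters=80):
    # Bisection for a monotone func on [lo, hi].
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def t_lambda(f0, zeta, f0_b, f0_t):
    return solve(lambda lam: t_f0(lam, zeta, f0_b, f0_t), f0, LAM_T, LAM_B, False)


def t_zeta(lam, f0, f0_b, f0_t):
    # Assumes f0 rises with ζ at fixed λ; bounds 0.01 and 0.7 are illustrative.
    return solve(lambda z: t_f0(lam, z, f0_b, f0_t), f0, 0.01, 0.7, True)


def reference_targets(tones, ref_by_tone):
    # Step 2: line up reference values with the tone sequence. Tone 0
    # reuses the last reference target of the preceding tone.
    refs, last = [], None
    for tone in tones:
        if tone == 0:
            if last is None:
                raise ValueError("Tone 0 needs a preceding tone")
            targets = [last]
        else:
            targets = list(ref_by_tone[tone])
        refs.extend(targets)
        last = targets[-1]
    return refs


def measure_zeta(observed, reference, f0_b, f0_t, zeta_0=0.156):
    # Step 3: ζ_i = T_ζ(T_λ(^f0_i, ζ_0), f0_i) for each target pair.
    return [t_zeta(t_lambda(ref, zeta_0, f0_b, f0_t), obs, f0_b, f0_t)
            for obs, ref in zip(observed, reference)]
```

When an observed target equals its reference value, the measured ζ_i comes out at ζ_0; in this sketch an observed target above the reference yields ζ_i above ζ_0 and one below it yields ζ_i below ζ_0. Observed values that cannot be reached within the ζ search bounds are clamped to the nearest bound, which a fuller implementation would need to report.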

FIG. 6 plots (a) the estimated intonation pattern characterized by ζ_n (circles) and (b) the F0 target pairs, namely the reference values (triangles) and the observed values (circles), for the F0 contour obtained from the corresponding utterance data. Line P0P4 represents ζ_n = -1.045t + 0.686, and line P5P7 represents ζ_n = -0.809t + 1.198.

B. Description of the Embodiment
B.1 Structure
B.1.1 Functional Blocks
FIG. 7 is a block diagram of a speech synthesis system 40 according to one embodiment of the present invention. Referring to FIG. 7, speech synthesis system 40 includes a storage device 50 for storing reference utterances of a given speaker, a storage device 52 for storing sample utterances of the speaker, and an intonation extraction module 54 for extracting reference F0 targets for each tone of the reference utterances and for extracting, for each sample utterance stored in storage device 52, a sequence of normalized damping ratios ζ_n indicating the intonation variation.

Speech synthesis system 40 further includes a storage device 56 for storing the reference F0 targets of the reference utterances and a storage device 58 for storing the sequences of ζ_n. A sequence of damping ratios ζ_n characterizes the intonation variation of a sample utterance. The user can therefore specify a desired intonation by using the sequences of ζ_n stored in storage device 58.

Speech synthesis system 40 further includes an F0 synthesizer 64 for receiving intonation information 60 associated with an input text 62 to be synthesized and synthesizing F0 for each syllable in input text 62, and a speech synthesizer 66 for synthesizing a speech signal according to input text 62 and the F0 output from F0 synthesizer 64.

Intonation extraction module 54 includes a first target extraction module 80 for extracting F0 targets from each syllable of the reference utterances in storage device 50 and storing the extracted F0 targets in storage device 56, a second target extraction module 82 for extracting F0 targets from each syllable of the sample utterances in storage device 52, and a ζ_n calculation module 84 for calculating the damping ratio ζ_n for each F0 target output from second target extraction module 82 and outputting the sequence of ζ_n to storage device 58.

F0 synthesizer 64 includes a ζ calculation module 90 for calculating ζ from the sequence of ζ_n in the intonation information, and an F0 calculation module 90 for calculating f0_i for each syllable of input text 62 according to the following equation and outputting the calculated f0_i to speech synthesizer 66.

f0_i = T_f0(T_λ(^f0_i, ζ_0), ζ)   (9)

B.1.2 Computer Implementation
The modules shown in FIG. 7 are implemented by computer software in this embodiment. FIG. 8 shows the control structure of a computer program that implements first target extraction module 80. Referring to FIG. 8, the program starts at step 100, and steps 102 to 120 are repeated for each of Tone 1 to Tone 4 found in the reference utterances.

In step 102, a variable SUM is initialized to zero.

In step 110, steps 112 to 116 are repeated for all the data of the tone of interest in the reference utterances. In step 114, the F0 target is extracted from the speech data of the syllable. In step 116, the extracted F0 is added to SUM.

After steps 112 to 116 have been repeated for all syllables of the tone of interest, the average of SUM is calculated in step 118. In step 118, this average is stored in memory in association with the tone of interest.

At the end of this process, the average F0 of each of Tone 1 to Tone 4 is stored in memory.
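
The averaging performed by the program of FIG. 8 can be sketched in Python as follows. This is a minimal illustration, not the program itself: the function name, the input layout (a tone label paired with a list of already extracted F0 targets in hertz), and the per-target averaging are assumptions, and the sketch makes a single pass over the data instead of looping tone by tone. F0 target extraction (step 114) is taken as given.

```python
def average_reference_targets(isolated_syllables):
    # isolated_syllables: iterable of (tone, [F0 targets in Hz]) pairs,
    # one pair per recorded isolated syllable (hypothetical layout).
    sums, counts = {}, {}
    for tone, targets in isolated_syllables:
        if tone not in sums:
            sums[tone] = [0.0] * len(targets)  # SUM starts at zero (step 102)
            counts[tone] = 0
        if len(targets) != len(sums[tone]):
            raise ValueError("inconsistent number of F0 targets for tone %r" % tone)
        for k, value in enumerate(targets):
            sums[tone][k] += value             # accumulate (step 116)
        counts[tone] += 1
    # Average each target position over the syllables of that tone (step 118).
    return {tone: [s / counts[tone] for s in sums[tone]] for tone in sums}
```

For example, two Tone 1 syllables with targets [200.0, 210.0] and [220.0, 230.0] give the reference values [210.0, 220.0] for Tone 1. Keeping one average per target position, rather than one number per tone, preserves the shape of the tone pattern that the later ζ measurement relies on.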

FIG. 9 shows the control structure of a computer program that implements second target extraction module 82 and ζ_n calculation module 84 shown in FIG. 7. Referring to FIG. 9, in step 140, the F0 contours are calculated for all the sample utterances stored in storage device 52. In step 142, steps 144 to 152 are repeated for all syllables of input text 62 (see FIG. 7).

In this iteration, the F0 targets of the tone of the syllable being processed are first extracted. Let the i-th extracted F0 target be f0_i, 1 ≤ i ≤ N (the number of targets in the utterance).

In step 146, f0_i extracted in step 144 is paired with ^f0_i of the syllable's tone pattern, where ^f0_i denotes the reference value for f0_i. For Tone 0, the reference value of its F0 target is simply taken to be the reference value of the last F0 target of the preceding tone.

In step 148, ζ_i is calculated according to the following equation.

ζ_i = T_ζ(T_λ(^f0_i, ζ_0), f0_i)   (10)

In step 150, the normalized ζ_ni (1 ≤ i ≤ N) is calculated according to the following equation.

In step 152, the resulting ζ_ni is stored in storage device 58 (see FIG. 7).

After the above process has been repeated for all syllables of the sample utterances stored in storage device 52, the user can describe any intonation by using the normalized ζ_n. Intonation information 60 can therefore be prepared in the form of a sequence of ζ_n.

In this embodiment, F0 synthesizer 64 shown in FIG. 7 is also implemented by computer software. The control structure of this computer program is shown in FIG. 10.

Referring to FIG. 10, when F0 synthesizer 64 is activated, it first reads the intonation data ζ_ni in intonation information 60. Next, in step 172, steps 174 to 178 are repeated for all syllables of input text 62. Here, ζ_ni (1 ≤ i ≤ N) is the sequence of normalized damping ratios in intonation information 60.

In step 174, ζ_i is calculated from ζ_ni according to the inverse function of equation (11).

In step 176, the F0 target f0_i of the i-th syllable (tone) is calculated according to the following equation.

f0_i = T_f0(T_λ(^f0_i, ζ_0), ζ_i)   (12)

Here, ^f0_i denotes the reference value (F0 target) extracted from the reference utterances, and ζ_0 is a constant (preferably 0.156).

In step 178, the f0_i calculated in this way is stored in memory.

After steps 174 to 178 have been repeated for all syllables of input text 62, the sequence of f0_i is output in step 180 as the F0 targets of the tone sequence in input text 62, whose intonation pattern is specified by intonation information 60.
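
The synthesis direction of equation (12) can be sketched in Python as follows. As before, this is an illustration under stated assumptions and not the program of FIG. 10: the form of A(λ, ζ) and the log-frequency interpolation are assumed because the defining equations are not reproduced in this text, and the function names, the bisection solver, and the example values are hypothetical. The sketch starts from ζ_i directly, since recovering ζ_i from ζ_ni (step 174) depends on equation (11), which is also not reproduced here.

```python
import math

LAM_T, LAM_B = 1.0, 2.0  # λ_t and λ_b as fixed in the text


def amplitude(lam, zeta):
    # ASSUMED form of A(λ, ζ): resonator response peaking at λ = 1.
    c = 1.0 - 2.0 * zeta ** 2
    return 1.0 / math.sqrt((1.0 - c * lam) ** 2 + 4.0 * zeta ** 2 * c * lam)


def t_f0(lam, zeta, f0_b, f0_t):
    # T_f0 with an assumed log-frequency interpolation.
    a, a_b, a_t = (amplitude(x, zeta) for x in (lam, LAM_B, LAM_T))
    return math.exp(math.log(f0_b) + (a - a_b) / (a_t - a_b) * math.log(f0_t / f0_b))


def t_lambda(f0, zeta, f0_b, f0_t, iters=80):
    # T_λ by bisection on [λ_t, λ_b]; f0 falls as λ rises.
    lo, hi = LAM_T, LAM_B
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if t_f0(mid, zeta, f0_b, f0_t) > f0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def synthesize_targets(reference, zetas, f0_b, f0_t, zeta_0=0.156):
    # Equation (12): f0_i = T_f0(T_λ(^f0_i, ζ_0), ζ_i) for each target.
    return [t_f0(t_lambda(ref, zeta_0, f0_b, f0_t), z, f0_b, f0_t)
            for ref, z in zip(reference, zetas)]
```

Passing ζ_i = ζ_0 for every target returns the reference values unchanged, which corresponds to an utterance with no tonal change. In this sketch, ζ_i above ζ_0 raises a target and ζ_i below ζ_0 lowers it, while the result always stays inside [f0_b, f0_t].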

Ｂ．１．３コンピュータハードウェア
図１１は上述のコンピュータプログラムを実行するこの実施例のコンピュータシステム３３０の外観を示し、図１２はこのシステム３３０をブロック図で示す。 B. 1.3 Computer Hardware FIG. 11 shows the appearance of a computer system 330 of this embodiment that executes the above-described computer program, and FIG. 12 shows this system 330 in a block diagram.

図１１を参照して、このコンピュータシステム３３０は、ＦＤ（フレキシブルディスク）ドライブ３５２およびＣＤ−ＲＯＭ（コンパクトディスク読出専用メモリ）ドライブ３５０を有するコンピュータ３４０と、キーボード３４６と、マウス３４８と、モニタ３４２と、一対のスピーカ３７２と、マイクロフォン３７０と、を含む。 Referring to FIG. 11, this computer system 330 includes a computer 340 having an FD (flexible disk) drive 352 and a CD-ROM (compact disk read only memory) drive 350, a keyboard 346, a mouse 348, and a monitor 342. And a pair of speakers 372 and a microphone 370.

図１２を参照して、コンピュータ３４０はさらに、ＣＰＵ（中央処理装置）３５６と、ＣＰＵ３５６、ＦＤドライブ３５２およびＣＤ−ＲＯＭドライブ３５０に接続されたバス３６６と、ハードディスク３５４と、ブートアッププログラム等を記憶する読出専用メモリ（ＲＯＭ）３５８と、ＣＰＵ３５６に接続され、アプリケーションプログラム命令、システムプログラム、及びデータ等を記憶するランダムアクセスメモリ（ＲＡＭ）３６０とを含む。 Referring to FIG. 12, computer 340 further stores a CPU (central processing unit) 356, a bus 366 connected to CPU 356, FD drive 352, and CD-ROM drive 350, a hard disk 354, a boot-up program, and the like. A read only memory (ROM) 358, and a random access memory (RAM) 360 connected to the CPU 356 and storing application program instructions, system programs, data, and the like.

Although not shown here, the computer 340 may further include a network adapter board that provides a connection to a local area network (LAN).

The computer program that causes the computer system 330 to implement the speech synthesis system described above is stored on a CD-ROM 362 or FD 364 inserted into the CD-ROM drive 350 or the FD drive 352, and is then transferred to the hard disk 354. Alternatively, the program may be transmitted to the computer 340 over a network (not shown) and stored on the hard disk 354. The program is loaded into the RAM 360 at execution time. The program may also be loaded into the RAM 360 directly from the CD-ROM 362, from the FD 364, or via the network.

The program described with reference to FIGS. 8 to 10 includes a plurality of instructions that cause the computer 340 to implement the functional blocks of the speech synthesis system 40 of this embodiment. Some of the basic functions needed to carry out this method are provided by the operating system (OS) running on the computer 340 or by third-party programs installed on the computer 340. The program therefore need not include every function required to implement the system and method of this embodiment. It may include only those instructions that perform the processing described above by calling the appropriate functions or "tools" in a controlled manner so as to obtain the desired result. The operation of the computer system 330 is well known and is not repeated here.

B.2 Operation
The speech synthesis system 40 of this embodiment (see FIG. 7) operates in three stages: extraction of F0 targets from the reference utterances, calculation of ζ_n from the reference utterances, and F0 target generation and speech synthesis. The operation of the speech synthesis system 40 in each stage is described below.

B.2.1 Extraction of F0 Targets from the Reference Utterances
Referring to FIG. 7, speech data of a given speaker is recorded for all of tones 1 to 4 and stored in the storage device 50 as reference utterances. For each of tones 1 to 4, the first target extraction module 80 extracts F0 targets from the reference utterances. The average F0 targets for each of tones 1 to 4 are stored in the storage device 56.
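The averaging step can be sketched as below. This is only an illustration under stated assumptions: each recorded syllable is assumed to arrive as a (tone, F0 targets) pair with the same number of targets for every syllable of a given tone, and the name `average_targets_by_tone` is hypothetical. How the targets themselves are extracted is not covered.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def average_targets_by_tone(
    syllables: Iterable[Tuple[int, Sequence[float]]],
) -> Dict[int, List[float]]:
    """Average the F0 targets of isolated syllables, separately per tone.

    Each input item is (tone, targets), where targets holds the F0 target
    values (in Hz) extracted from one recorded syllable.
    """
    grouped: Dict[int, List[List[float]]] = defaultdict(list)
    for tone, targets in syllables:
        grouped[tone].append(list(targets))

    averages: Dict[int, List[float]] = {}
    for tone, rows in grouped.items():
        n_targets = len(rows[0])
        if any(len(row) != n_targets for row in rows):
            raise ValueError(f"tone {tone}: inconsistent number of F0 targets")
        # Average target by target (first with first, second with second, ...).
        averages[tone] = [sum(column) / len(rows) for column in zip(*rows)]
    return averages
```

The resulting dictionary plays the role of the per-tone reference values held in the storage device 56.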

B.2.2 Calculation of ζ_n from the Reference Utterances
Sample utterances by the same speaker as the reference utterances are recorded and stored in the storage device 52. For each syllable of each sample utterance, the second target extraction module 82 extracts F0 targets. The ζ_n calculation module 84 then calculates ζ_n for each of the F0 targets output from the module 82, generating a sequence of ζ_n for each sample utterance.

B.2.3 F0 Targets and Speech Synthesis
The user prepares the input text 62 together with the associated intonation information 60, which specifies the intonation with which the input text is to be synthesized. The user can prepare the intonation information by examining the sequences of ζ_n stored in the storage device 58.

Once the intonation information 60 and the input text 62 have been prepared, the ζ calculation module 90 calculates ζ for each syllable of the input text 62 and outputs it to the F0 calculation module 92. For the i-th syllable, for example, the ζ calculation module 90 calculates ζ_i from ζ_ni using the inverse function of equation (11).

For each syllable, the F0 calculation module 92 calculates the F0 target f0_i by applying the following function to the ζ_i calculated in this way, the ^f0_i stored in the storage device 56, and the constant ζ_0 = 0.156.

$$f0_i = T_{f0}\bigl(T_{\lambda}(\hat{f0}_i, \zeta_0), \zeta_i\bigr) \tag{13}$$
As a result, the F0 calculation module 92 outputs a sequence of f0_i for the syllables in the input text 62. This sequence is supplied to the speech synthesizer 66.

Given the sequence of f0_i from the F0 calculation module 92, the speech synthesizer 66 can synthesize the speech signal 68 of the input text 62 with the intonation specified by the intonation information 60.

C. Experimental Results
Two experimental results are reported to show that the proposed method can reveal intonation variation at a level above the lexical tones within measured F0 contours. The speech samples were selected from a Chinese speech corpus read by a professional narrator. The narrator's vocal range [f0_b, f0_t] corresponds to [100 Hz, 500 Hz], and the narrator's reference values for the lexical tones are shown in Table 1. Bold type indicates the main targets. The tone patterns corresponding to these reference values can be seen in FIG. 2(a).

The results shown in FIGS. 13 to 16 were obtained by analyzing the intonation variation of four conventional greetings. The actual intonation of the four greetings is phonologically the same, but the F0 contours undulate considerably because of the lexical tones. As an example of the calculation, Table 2 lists the observed values f0_i, i = 1, ..., 5, from the sample shown in FIG. 13(a), the corresponding reference values ^f0_i, and the resulting parameters ζ_i and ζ_ni. These results are shown in FIG. 13(b).

In this example, the sentence accent is indicated by the fall of ζ_n from 0.024 at the main target of tone 2 (the surface tone of the first tone 3) to −0.423 at the second tone 3. The sentence accents of the other sentences also appear to fall consistently, regardless of the type of underlying tone. The basic features shown by these four greetings are: (1) the sentence accent is located at the end of the utterance and also falls on another syllable; and (2) the final tone (tones 1 to 4) keeps its reference tone pattern (that is, ζ_n does not change). Tone 0 is regarded as a continuation of the last tone that is not tone 0. This result is consistent with the assumptions stated above. The phenomenon of intonation variation closely resembles the so-called "hat pattern" commonly used to describe intonation in non-tonal languages, as exemplified in Non-Patent Document 11.
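One simple way to locate such a fall in a ζ_n sequence is sketched below. Treating the steepest fall between consecutive values as the accent location is an assumption made here for illustration, not a procedure stated in the description, and the name `largest_fall` is hypothetical.

```python
from typing import Sequence, Tuple


def largest_fall(zeta_n: Sequence[float]) -> Tuple[int, float]:
    """Return (index, drop) for the steepest fall between neighbours.

    index is the position of the value the fall starts from, and drop is
    zeta_n[index] - zeta_n[index + 1] (positive when the sequence falls).
    """
    if len(zeta_n) < 2:
        raise ValueError("need at least two zeta_n values")
    drops = [a - b for a, b in zip(zeta_n, zeta_n[1:])]
    index = max(range(len(drops)), key=drops.__getitem__)
    return index, drops[index]


# The two values quoted above: 0.024 followed by -0.423.
index, drop = largest_fall([0.024, -0.423])
```

For the two values quoted above, the fall starts at index 0 and has a magnitude of about 0.447.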

FIG. 17 shows an example of synthesizing tone and intonation. FIG. 17(a) shows the reference values of the underlying lexical tones. FIG. 17(b) plots the intonation pattern in terms of ζ_n(t). FIG. 17(c) shows the resulting F0 targets (circles) and the contour generated by the model from these targets (continuous line). The sequence of "+" marks shows the measured F0 contour of the sample utterance.

As is apparent from FIG. 17, the F0 contour generated by the model is very close to the original F0 contour.

FIG. 18 shows further results obtained by having the same speaker read several digit strings. The digit strings read aloud are neutral because they carry no linguistic meaning. For clarity, only the ζ_n values of the main tone targets are plotted in the figure. In addition, these utterances contain no pauses. The intonation variation takes two shapes. One is a line that falls from beginning to end (left side). The other is a line consisting of a falling portion followed by a flat portion; this fall occurs between the first two syllables. The intonation variation revealed in this way is systematic at a level above the lexical tones.

About 200 Chinese samples from three speakers were analyzed. Although the actual intonation varies somewhat across these samples, the analysis showed that the method can clearly reveal intonation variation, as illustrated above.

D. Conclusion
The embodiment of the present invention relates to a method of measuring intonation variation, with the lexical tones excluded, from measured F0 contours. The intonation variation is sampled using selected F0 targets that make up the lexical tone patterns and is characterized by a single-point parameter on the time axis. The experimental results show that the proposed method is very promising for analyzing the actual intonation of Mandarin, which is buried in the F0 contour and intermixed with the lexical tones. The actual intonation revealed in this way resembles the intonation reported for non-tonal languages. The proposed method attempts automatic analysis of F0 contours together with the underlying lexical tones, which is critically important in speech synthesis, speech recognition, and speech understanding.

The embodiment disclosed here is merely an example, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a diagram showing the conditions among λ_1, λ_2, and ζ.
FIG. 2 is a diagram showing an example of applying tone transformation to Mandarin tones.
FIG. 3 is a diagram showing another example of applying tone transformation to Mandarin tones.
FIG. 4 is a diagram showing another example of applying tone transformation to Mandarin tones.
FIG. 5 is a diagram representing Mandarin tones on the basis of F0 targets.
FIG. 6 is a plot of the F0 target pairs, reference values (triangles) and observed values (circles), used to estimate intonation variation in terms of ζ_n (circles), together with the original F0 contour.
FIG. 7 is a block diagram of the speech synthesis system 40 according to an embodiment of the present invention.
FIG. 8 is a flowchart showing the control structure of a computer program that implements the first F0 target extraction module 80.
FIG. 9 is a flowchart showing the control structure of a computer program that implements the second target extraction module 82 and the ζ_n calculation module 84.
FIG. 10 is a flowchart showing the control structure of a computer program that implements the F0 synthesizer 64.
FIG. 11 is a perspective view of the computer system 330 that executes the computer program according to the embodiment.
FIG. 12 is a block diagram of the system 330.
FIG. 13 is a diagram showing the F0 contour of the conventional greeting "ni3hao3" (Hello).
FIG. 14 is a diagram showing the F0 contour of the conventional greeting "zen3me0yang4a0?" (How are you?).
FIG. 15 is a diagram showing the F0 contour of the conventional greeting "ni3mang2ma0?" (Are you busy?).
FIG. 16 is a diagram showing the F0 contour of the conventional greeting "ni3shen1ti3hao3ma0?" (How is your health?).
FIG. 17 is a diagram showing an example of synthesizing lexical and non-lexical prosodic features.
FIG. 18 is a diagram showing the variation of neutral intonation in digit strings read aloud.

Explanation of symbols

40 speech synthesis system
50, 52, 56, 58 storage device
54 intonation extraction module
60 intonation information
62 input text
64 F0 synthesizer
66 speech synthesizer
68 speech signal with intonation
80 first F0 target extraction module
82 second F0 target extraction module
84 ζ_n calculation module
90 ζ calculation module
92 F0 calculation module

Claims

1. A method of expressing the characteristics of intonation variation by tone transformation, comprising the steps of:
preparing a predetermined set of reference values of fundamental frequency (F0) targets for each lexical tone of a speaker, obtained from isolated syllables, wherein the set of reference values of the F0 targets characterizes the corresponding lexical tone;
extracting F0 target values for each syllable in sample speech data of the speaker; and
calculating, for each of the F0 target values of each syllable in the sample speech data, a predetermined first parameter representing the degree of change from the reference value of the lexical tone of that syllable to the F0 target value.

2. The method of claim 1, wherein the step of preparing comprises:
recording a plurality of isolated syllables uttered by the speaker for each lexical tone;
extracting the F0 target values of each recorded syllable according to its lexical tone; and
averaging, for each lexical tone, the F0 target values of each F0 target characterizing that lexical tone.

3. The method of claim 1, further comprising normalizing the predetermined first parameter into a predetermined second parameter such that the distribution of the second parameter is balanced on both sides of a predetermined reference value of the second parameter.

4. A computer program that, when executed on a computer, causes the computer to perform all the steps of the method according to any one of claims 1 to 3.