JP4617092B2 - Chinese tone classification device and Chinese F0 generator

Info

Publication number: JP4617092B2
Application number: JP2004074594A
Authority: JP
Inventors: キンソンチョウ; 哲中村; 啓吉広瀬
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2004-03-16
Filing date: 2004-03-16
Publication date: 2011-01-19
Anticipated expiration: 2024-03-16
Also published as: JP2005265955A

Abstract

PROBLEM TO BE SOLVED: To make it easier to determine tones in Chinese speech.

SOLUTION: A Chinese tone classification apparatus 110 includes a tone model 106 trained to output probabilities of tone classifications. The anchor-based tone discriminating features include an onset gap and an offset gap and, preferably, an F0 slope coefficient. The apparatus 110 further includes: a phonetic segmentation module 150 for segmenting Chinese speech 108 into a series of tone nuclei; a feature extraction module 152 and an anchoring feature extraction module 154 for extracting the anchor-based tone discriminating features; and a pattern matching module 156 that outputs the tone classification achieving the highest probability.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to tone classification for tonal languages and, more particularly, to identifying contextual tone variation with high accuracy in tonal languages such as Chinese.

The four basic lexical tones of Chinese (Tone 1, Tone 2, Tone 3, and Tone 4) are usually characterized by their different fundamental frequency ("F0") contour patterns: Tone 1 has a high level contour, Tone 2 a mid-rising contour, Tone 3 a low dipping contour, and Tone 4 a high falling contour. Earlier studies report that F0 height is decisive for distinguishing Tone 1 (high F0) from Tone 3 (low F0) (see Non-Patent Document 1) and Tone 3 (low F0) from Tone 4 (high F0) (see Non-Patent Document 2).

The F0 slope is effective for distinguishing tones that differ in direction of movement, such as Tone 1 (level), Tone 2 (rising), and Tone 4 (falling) (see Non-Patent Document 2).

In continuous speech, the F0 contours of tones vary considerably compared with those of syllables spoken in isolation. Because both F0 height and F0 slope vary substantially, the underlying tone cannot be identified from the surface F0 contour alone. Perceptual experiments, on the other hand, show that given the tonal context, listeners perceive the underlying lexical tone with high consistency even when F0 varies considerably (see Non-Patent Document 3). This indicates that the tonal context carries discriminating cues beyond F0 height and F0 slope.

Non-Patent Document 1: D. Whalen and Y. Xu, "Information for Mandarin tones in the amplitude contour and in brief segments", Phonetica 49, 1992, pp. 25-47.
Non-Patent Document 2: E. Garding et al., "Tone 4 and tone 3 discrimination in modern standard Chinese", Language and Speech, Vol. 29, Part 3, 1986, pp. 281-293.
Non-Patent Document 3: Y. Xu, "Production and perception of coarticulated tones", J.A.S.A. (4), 1994, pp. 2240-2253.
Non-Patent Document 4: Y.-R. Chao, "A grammar of spoken Chinese", Berkeley: University of California Press, 1968.
Non-Patent Document 5: T. Tsumura et al., "Auditory detection of frequency transition", J.A.S.A., Vol. 53, No. 1, 1973, pp. 17-25.
Non-Patent Document 6: S. Shigeno, H. Fujisaki, "Effect of a preceding anchor upon the categorical judgment of speech and non-speech stimuli", Japanese Psychological Research, 1979, Vol. 21, No. 4, pp. 165-173.
Non-Patent Document 7: J.-S. Zhang, K. Hirose, "Anchoring hypothesis and its application to tone recognition of Chinese continuous speech", ICASSP2000, Vol. III, pp. 1419-1422.
Non-Patent Document 8: J.-S. Zhang, S. Nakamura, and K. Hirose, "Discriminating Chinese lexical tones by anchoring F0 Features", Proc. of ICSLP2000, Vol. II, pp. 87-90.

To date, no study has clearly identified discriminating cues in the tonal context other than F0 height and F0 slope. If such cues were available, they would greatly help in classifying tones in tonal languages such as Chinese, and a suitable model trained on them should further improve the performance of speech recognition and speech synthesis devices.

Accordingly, one object of the present invention is to make Chinese tones easier to determine.

Another object is to synthesize Chinese speech of more natural quality.

A Chinese tone classification device according to the present invention includes means for storing a tone model trained with a training data set so that, given a set of features including anchoring-based tone-discriminating features, it outputs the probabilities of the tone classes used in Chinese. The anchoring-based tone-discriminating features include an onset gap and an offset gap. The tone classification device further includes: means for segmenting input Chinese speech data into a series of tone nuclei; means for extracting the anchoring-based tone-discriminating features from each tone nucleus; and pattern matching means for applying the acoustic features extracted by the extracting means to the tone model and selecting the tone class that achieves the highest probability output by the tone model.

Preferably, the anchoring-based tone-discriminating features further include the slope of the fundamental frequency (F0) contour of the tone nucleus.

More preferably, the anchoring-based tone-discriminating features further include the onset F0 and the offset F0 of the tone nucleus.

Still more preferably, the anchoring-based tone-discriminating features further include the normalized power of the tone nucleus.

A Chinese tone classification device according to another aspect of the present invention includes: means for storing a tone model trained with a training data set so that, given a set of features including anchoring-based tone-discriminating features, it outputs the probabilities of the tone classes used in Chinese; means for segmenting input Chinese speech data into a series of tone nuclei; and means for extracting the anchoring-based tone-discriminating features from each tone nucleus. The anchoring-based tone-discriminating features include an onset gap and an offset gap. The device further includes tone classification means for determining the tone class according to the combination of the signs of the onset gap and the offset gap extracted by the extracting means, and for outputting the determined class.

Preferably, the tone classification means includes means for determining that a tone is Chinese Tone 1 when its onset gap is positive and its offset gap is positive.

The tone classification means may further include means for determining that a tone is Chinese Tone 2 when its onset gap is negative and its offset gap is positive.

Preferably, the tone classification means further includes means for determining that a tone is Chinese Tone 3 when its onset gap is negative and its offset gap is negative.

More preferably, the tone classification means further includes means for determining that a tone is Chinese Tone 4 when its onset gap is positive and its offset gap is negative.
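The four sign combinations above amount to a small decision table. The following is a minimal sketch of that table, not the claimed implementation; the function name is hypothetical. Treating a gap of exactly zero as positive is an assumption: the description states r ≥ 0 only for the onset gap of Tone 4.

```python
def classify_by_gap_signs(onset_gap: float, offset_gap: float) -> int:
    """Return the tone number (1-4) implied by the signs of the two gaps.

    Assumption: a gap of exactly zero is treated as positive.
    """
    onset_positive = onset_gap >= 0
    offset_positive = offset_gap >= 0
    if onset_positive and offset_positive:
        return 1  # positive onset gap, positive offset gap
    if not onset_positive and offset_positive:
        return 2  # negative onset gap, positive offset gap
    if not onset_positive and not offset_positive:
        return 3  # negative onset gap, negative offset gap
    return 4      # positive onset gap, negative offset gap
```

For example, `classify_by_gap_signs(-30.0, 25.0)` falls in the negative/positive cell and yields Tone 2.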

A Chinese F0 generation device according to still another aspect of the present invention includes means for storing a tone model trained with a training data set so that, given a set of features including anchoring-based tone-discriminating features, it outputs the probability of each tone class used in Chinese. The anchoring-based tone-discriminating features include an onset gap and an offset gap. The device further includes: means for storing a probabilistic F0 model that, given parsed Chinese text, outputs the probabilities of the possible Chinese tones for each speech unit in the text; means for generating an F0 sequence that fits the input Chinese text according to the output of the probabilistic F0 model; and means for determining whether the F0 sequence output by the generating means is consistent with the tone model.

[Prerequisites]
(F0 variation due to tonal context)
The contextual variations the inventors examined are two well-known phenomena: "downstep" and "contextual assimilation". Strictly speaking, downstep is a kind of contextual assimilation that occurs in the specific context where a high pitch target follows a low pitch target.

- Downstep effect
The downstep effect is the phenomenon in which, in an "HLH" tone sequence, the F0 height of the second "H" tone becomes lower than that of the first "H" because of the intervening "L". If this operates repeatedly over a sequence in which H and L alternate, the F0 contour becomes a staircase-like function. The F0 height of an H tone late in an utterance can even be lower than that of an L tone earlier in the same utterance.

Three of the four basic lexical tones of Chinese have a low target at the onset (Tone 2), at the offset (Tone 4), or at both onset and offset (Tone 3), as shown in FIG. 1.

FIG. 2 shows an example of downstep. The spoken text is "you3 qing1 wei1 e4 hua4 xian4 xiang4" ("There is a phenomenon that gets worse little by little."). Two interesting phenomena can be seen in the F0 sequence indicated by dotted line 20 in FIG. 2.

1. The downstep effect occurs three times in succession over the last three Tone 4 syllables, forming a descending staircase.

2. Moreover, the H onset of the last Tone 4 is at the same level as the F0 height of the Tone 3 at the beginning of the utterance.

It is well known that although the F0 height of lexical tones is considerably affected by downstep, this does not interfere with human pitch perception at all. That is, the Tone 4 syllables in FIG. 2 are perceived not only as the same tone but also as having nearly the same height at their onsets. Furthermore, speech synthesis research shows that if two adjacent tones that should be affected by downstep are assigned the same F0 value at their high pitch points, the second lexical tone in the synthesized speech sounds unnaturally stressed.

- Tonal variation due to contextual assimilation
Tonal variation due to contextual assimilation, as the term is used here, refers to the phenomenon in which assimilation is so strong that the direction of a tone's F0 slope can even change to that of another tone. FIG. 3 shows an example in which the spoken text is "si1 zhu2 guan3 xian2" (traditional stringed and woodwind instruments). In this example, the F0 contour enclosed by line 30 belongs to a Tone 2, whose characteristic pattern is a rising slope. Here, however, the F0 contour has become flat, which is characteristic of Tone 1.

One important question is whether a tone has changed into another tone when its F0 contour changes. The question is subtle enough that quite different views were once held. Non-Patent Document 4 went as far as proposing an acoustic tone sandhi rule to explain cases like the one in FIG. 3. The proposal suggests that when a Tone 2 is preceded by a tone with a high onset and followed by one of the four basic tones, the Tone 2 in between changes to Tone 1 in speech at normal speed.

However, more recent research has shown that as long as the tonal context following the contextual F0 variation is present, listeners still perceive the tone as the original tone (Non-Patent Document 3). This means that the Tone 2 in FIG. 3, despite its changed F0, is still perceived as Tone 2 rather than Tone 1. Several native speakers of Chinese have confirmed this.

- Two findings from perceptual experiments
Non-Patent Document 3 reports a more systematic investigation of F0 variation due to tonal context and its effect on human pitch perception. One interesting experiment there examined how listeners perceive the middle tone in a three-syllable sequence, and whether the same tone is perceived in the same way when the first and third syllables surrounding it are swapped. Swapping the first and third syllables gives the middle tone a considerably different tonal environment, so it was expected to offer a distinctive way to test the effect of tonal context on pitch perception.

FIG. 4 illustrates two findings obtained from the experiment of Non-Patent Document 4.

1. When the original tonal context is natural and the swapped context is unnatural, the middle tone tends to be perceived as the same tone as in the original context (see the first row of FIG. 4).

2. When the original tonal context is unnatural and the swapped context is natural, the middle tone tends to be perceived, in the swapped context, as a tone whose contour runs in the opposite direction to the original tone (see the bottom row of FIG. 4).

(Anchoring hypothesis and anchoring-based F0 features)
To find more effective features for discriminating lexical tones, the present inventors drew on the psychoacoustic perception findings of Non-Patent Document 5 or Non-Patent Document 6 and formulated the following anchoring hypothesis (see Non-Patent Documents 7 and 8).

- The relative F0 difference between the offset of the first lexical tone and the onset of the second lexical tone can be an important cue for discriminating high or low pitch, in addition to the direct cue of the falling slope of the F0 contour.

- There should be a timing allocation mechanism for the competing effects (see Non-Patent Document 6).

Based on this hypothesis, the lexical tones of continuous speech can be characterized acoustically not only by the F0 patterns of level, rising, dipping, and falling, but also by the anchoring patterns given in the table of FIG. 5. The terms used in FIG. 5 are as follows.

- Onset gap: the difference between the onset F0 of a tone and the offset F0 of the preceding lexical tone.
- Offset gap: the difference between the offset F0 of a tone and the onset F0 of the following lexical tone.
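The two gap definitions can be sketched as follows. This is an illustration under assumptions rather than the patented procedure: each tone nucleus is represented by a hypothetical `(onset_f0, offset_f0)` pair, the F0 scale (Hz or a log scale) is left open, and the sign convention (own value minus neighbor's value) is chosen so that a tone starting high after a tone ending low gets a positive onset (leading) gap, as the description of Tone 4 requires. A tone with no neighbor on one side gets `None` for that gap.

```python
from typing import List, Optional, Tuple

Nucleus = Tuple[float, float]  # (onset F0, offset F0) of one tone nucleus
Gaps = Tuple[Optional[float], Optional[float]]  # (onset gap, offset gap)


def anchoring_gaps(nuclei: List[Nucleus]) -> List[Gaps]:
    """Compute the onset and offset gaps for each tone nucleus in an utterance."""
    gaps: List[Gaps] = []
    for i, (onset_f0, offset_f0) in enumerate(nuclei):
        # Onset gap: own onset F0 minus the preceding tone's offset F0.
        onset_gap = onset_f0 - nuclei[i - 1][1] if i > 0 else None
        # Offset gap: own offset F0 minus the following tone's onset F0.
        offset_gap = offset_f0 - nuclei[i + 1][0] if i + 1 < len(nuclei) else None
        gaps.append((onset_gap, offset_gap))
    return gaps
```

Because both gaps are differences against the neighboring tones, they are unaffected by a shift in the overall F0 level shared with those neighbors.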

(Anchoring-based discrimination of tonal context variation)
The present inventors found that the anchoring hypothesis can be used to predict consistently the tones that undergo the tonal variations described above.

- Anchoring-based discrimination of downstepped tones
FIG. 6 illustrates how the onset and offset gaps are estimated, specifically for the downstepped tones of FIG. 2.

In FIG. 6, thin vertical lines mark syllable boundaries. The feature "r" denotes the onset and offset gaps, and the feature "d" denotes the duration-normalized F0 slope. The onset and offset points are those of the tone nucleus.

Tone 4 has an "HL" pitch pattern. Under the anchoring tone discrimination hypothesis, a tone is judged to be Tone 4 when the following conditions hold.

1. The onset gap is positive, that is, r ≥ 0. Because Tone 4 begins with H, r > 0 if the preceding tone ends in L, whereas r ≈ 0 if the preceding tone ends in H.

2. By the same mechanism, the offset gap is negative.

3. The F0 slope is negative, that is, d < 0 (falling pitch).

FIG. 6 shows that the four Tone 4 syllables differ little in how they satisfy these conditions, even though their absolute F0 heights differ greatly. Each has a positive onset gap of similar size, a negative offset gap of similar size, and a similar falling F0 contour. Because these three features are crucial for discriminating tones under the pitch anchoring hypothesis, the four Tone 4 syllables should be perceived as the same tone, and listening confirmed this.
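To make the point concrete, the sketch below checks the three Tone 4 conditions on a run of falling tones whose absolute F0 steps downward. The F0 values are invented for illustration and are not read from FIG. 6. The tolerance `tol` (which admits r ≈ 0) and the slope formula (offset F0 minus onset F0, divided by duration) are likewise assumptions.

```python
from typing import Optional


def satisfies_tone4(onset_gap: Optional[float],
                    offset_gap: Optional[float],
                    slope: float,
                    tol: float = 5.0) -> bool:
    """Check the three Tone 4 conditions; a gap of None (no neighbor) is skipped."""
    onset_ok = onset_gap is None or onset_gap >= -tol   # r >= 0, allowing r close to 0
    offset_ok = offset_gap is None or offset_gap <= tol  # negative, allowing close to 0
    return onset_ok and offset_ok and slope < 0          # d < 0


# Invented nuclei: (onset F0, offset F0, duration in seconds), stepping downward.
nuclei = [(260.0, 180.0, 0.12), (240.0, 165.0, 0.12), (220.0, 150.0, 0.12)]

results = []
for i, (onset_f0, offset_f0, duration) in enumerate(nuclei):
    onset_gap = onset_f0 - nuclei[i - 1][1] if i > 0 else None
    offset_gap = offset_f0 - nuclei[i + 1][0] if i + 1 < len(nuclei) else None
    slope = (offset_f0 - onset_f0) / duration  # assumed duration normalization
    results.append(satisfies_tone4(onset_gap, offset_gap, slope))
```

All three invented syllables pass the check even though the onset F0 falls from 260 to 220, which mirrors the observation that the relative features stay stable while absolute height drifts.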

- Anchoring-based discrimination of contextually assimilated tones
FIG. 7 illustrates the estimation of the onset and offset gaps r and the F0 slope d for the Tone 2. FIG. 7 shows the following.

1. The onset gap is, if anything, negative, that is, r1 < 0.

2. The offset gap is, if anything, positive, that is, r2 > 0.

3. The F0 contour slope is flat, that is, d ≈ 0.

From FIG. 5, Tone 2 fits these features best, so the tone is predicted to be Tone 2. What determines the Tone 2 prediction under the anchoring hypothesis is the negative onset gap and the positive offset gap.

(Predicting tones in swapped contexts)
The anchoring tone discrimination hypothesis also makes it easy to predict accurately which tone is perceived in the swapped-context experiment described above. FIGS. 8 and 9 show the prediction procedure and the predicted results for the middle tone in the swapped contexts. Each target is first predicted to be high (+) or low (-) from the onset gap, the offset gap, and the falling pitch slope; the tone can then be predicted from the two targets at its onset and offset.

FIGS. 8 and 9 show that the results predicted in this way are consistent with the two reported findings.

A Chinese tone discrimination pattern is a pattern of acoustic features that characterizes a tone in speech. A good pattern should allow consistent, reliable tone identification from the acoustic features of actual speech, whether the utterance consists of isolated syllables or continuous natural speech.

The anchoring-based tone discrimination pattern is the first discrimination pattern proposed for Chinese tones in continuous speech. Compared with traditional Chinese tone patterns, it discriminates tones in continuous speech better. The proposal should have many applications; its application to automatic Chinese tone classification and to Chinese speech synthesis is described briefly below.

[First Embodiment]
FIG. 10 illustrates a tone classification system 100 according to a first embodiment of the present invention. System 100 is based on the anchoring-based tone-discriminating features described above.

Referring to FIG. 10, system 100 includes: a tone model training unit 104 for training a tone model 106 with training data 102 so that, given a set of anchoring-based features, the model outputs the most probable tone class; and a tone classification unit 110 that, in response to input speech 108, outputs a tone classification 112 using the tone model 106 trained by the tone model training unit 104.

Tone model 106 includes a number of Gaussian mixture models (GMMs), one prepared for each tone. If only context-independent GMMs are used, there are five GMMs in total: four for the basic four tones and one for the neutral tone. If context-dependent GMMs are used, up to 175 GMMs can be used (5*5*5 = 125 for three-tone contexts, 4*5 = 20 for left-boundary tones, 5*5 = 25 for right-boundary tones, and 5 for isolated tones).
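The count of 175 follows directly from the four terms listed. The short check below reproduces the arithmetic; the variable names are illustrative, and the factors are taken as given in the text without further interpretation.

```python
N_TONES = 5  # four basic tones plus the neutral tone

three_tone_context = N_TONES * N_TONES * N_TONES  # 5*5*5 = 125
left_boundary = 4 * N_TONES                       # 4*5 = 20, as stated in the text
right_boundary = N_TONES * N_TONES                # 5*5 = 25
isolated = N_TONES                                # 5

max_context_dependent_gmms = (
    three_tone_context + left_boundary + right_boundary + isolated
)
```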

Tone model training unit 104 includes: an acoustic segmentation module 130 for acoustically segmenting the tone nuclei of each utterance in training data 102; a feature extraction module 132 for extracting acoustic features, including F0 and power, of the tone nuclei of the speech signal segmented by acoustic segmentation module 130; an anchoring feature extraction module 134 for extracting anchoring-based features such as the onset gap, the offset gap, and the F0 contour slope from the acoustic features extracted by feature extraction module 132; and a tone model estimation module 136 for estimating the model parameters of tone model 106 from the anchoring-based features extracted by anchoring feature extraction module 134.

The acoustic segmentation of tone nuclei in acoustic segmentation module 130 can be achieved by a statistical method, or simply by using the phonetic segmentation produced by a speech recognizer.

Tone classification unit 110 includes: an acoustic segmentation module 150 for segmenting input speech 108 and outputting the tone nuclei of the segmented speech signal; a feature extraction module 152 for extracting acoustic features, including F0 and power, of the tone nuclei of the speech signal segmented by acoustic segmentation module 150; an anchoring feature extraction module 154 for extracting anchoring-based features from the acoustic features extracted by feature extraction module 152; and a pattern matching module 156 for searching tone model 106 for the feature pattern that best matches the anchoring-based features extracted by anchoring feature extraction module 154 and outputting the tone classification 112 corresponding to the best-matching pattern.
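The pattern matching step can be pictured as scoring a feature vector against one GMM per tone and returning the best-scoring tone. The following is a minimal sketch under assumptions, not the module's actual implementation: diagonal covariances, equal tone priors (so the highest likelihood is also the highest posterior), and a hypothetical model layout of `(weight, mean, variance)` tuples. The toy models use only two features (onset gap, offset gap) with invented parameters.

```python
import math
from typing import Dict, List, Sequence, Tuple

# One mixture component: (weight, mean vector, diagonal variance vector).
Component = Tuple[float, Sequence[float], Sequence[float]]


def gmm_log_likelihood(x: Sequence[float], components: List[Component]) -> float:
    """Log-likelihood of x under a diagonal-covariance Gaussian mixture."""
    log_terms = []
    for weight, mean, var in components:
        log_p = math.log(weight)
        for xi, mi, vi in zip(x, mean, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(log_p)
    peak = max(log_terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def best_tone(x: Sequence[float], tone_models: Dict[int, List[Component]]) -> int:
    """Return the tone whose GMM gives x the highest log-likelihood."""
    return max(tone_models, key=lambda tone: gmm_log_likelihood(x, tone_models[tone]))


# Invented single-component models over (onset gap, offset gap).
toy_models: Dict[int, List[Component]] = {
    1: [(1.0, (50.0, 50.0), (900.0, 900.0))],
    2: [(1.0, (-50.0, 50.0), (900.0, 900.0))],
    3: [(1.0, (-50.0, -50.0), (900.0, 900.0))],
    4: [(1.0, (50.0, -50.0), (900.0, 900.0))],
}
```

With these toy models, `best_tone((40.0, -60.0), toy_models)` selects Tone 4, since that vector lies closest to the Tone 4 mean.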

システム１００は以下のように動作する。このシステムには二つの動作局面がある。トレーニングの局面と動作の局面とである。 System 100 operates as follows. This system has two operating phases. A training aspect and an operational aspect.

In the training phase, the utterances in the training data 102 are acoustically segmented at tone boundaries by the acoustic segmentation module 130. The tone nuclei of the segmented speech signal are supplied to the feature extraction module 132, which extracts the acoustic features of the segmented speech signal, including F0 and power.

The anchoring feature extraction module 134 extracts, from the acoustic features of the tone nuclei, the anchor-based features used for tone discrimination. The two most important anchoring features are the onset gap and the offset gap (the gaps at the head and tail of the tone nucleus), which are calculated as r_{1,1} and r_{2,2} as shown in FIG. 11. Other important features are the slope of the F0 contour, the onset F0, and the offset F0. The normalized power of the tone nucleus may also be included if necessary. The feature vector of one tone is then five- or six-dimensional.
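The assembly of the five- or six-dimensional vector described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the onset (head) gap and offset (tail) gap are taken as already-computed inputs because their definition depends on FIG. 11, and fitting the slope by ordinary least squares is an assumption.

```python
from typing import List, Optional, Sequence


def f0_slope(times: Sequence[float], f0: Sequence[float]) -> float:
    """Least-squares slope of the F0 contour over one tone nucleus (assumed method)."""
    n = len(times)
    if n < 2 or n != len(f0):
        raise ValueError("need at least two aligned (time, F0) samples")
    mean_t = sum(times) / n
    mean_f = sum(f0) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("time stamps must not all be equal")
    sxy = sum((t - mean_t) * (f - mean_f) for t, f in zip(times, f0))
    return sxy / sxx


def tone_feature_vector(
    onset_gap: float,
    offset_gap: float,
    times: Sequence[float],
    f0: Sequence[float],
    norm_power: Optional[float] = None,
) -> List[float]:
    """Build [onset gap, offset gap, F0 slope, onset F0, offset F0 (, power)]."""
    vec = [onset_gap, offset_gap, f0_slope(times, f0), f0[0], f0[-1]]
    if norm_power is not None:
        # Optional sixth dimension: normalized power of the tone nucleus.
        vec.append(norm_power)
    return vec
```

Calling `tone_feature_vector` without `norm_power` gives the five-dimensional vector; passing it gives the six-dimensional one.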

In the tone model training process, the parameters of the Gaussian mixture model for each tone in the tone model 106 are estimated from the training data 102.

In the subsequent tone recognition/classification phase, the input speech 108 is segmented by the acoustic segmentation module 150, and the feature extraction module 152 extracts acoustic features from the segmented speech signal. The anchoring feature extraction module 154 then extracts the anchoring features from the features extracted by the feature extraction module 152.

The pattern of anchoring features is compared with the tone model. The tone sequence with the highest probability is output as the recognized tone sequence (tone classification 112).
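The comparison against per-tone Gaussian mixture models can be sketched as below. This is only an illustration under stated assumptions: diagonal covariances, equal tone priors, and an independent decision for each tone nucleus are all assumptions, and `TOY_MODELS` holds made-up parameters over two features (onset gap, offset gap), not trained values.

```python
import math
from typing import Dict, List, Sequence, Tuple

# One mixture component: (weight, means, variances), diagonal covariance assumed.
Component = Tuple[float, Sequence[float], Sequence[float]]


def gmm_log_likelihood(x: Sequence[float], components: List[Component]) -> float:
    """Log-likelihood of feature vector x under a diagonal-covariance GMM."""
    log_terms = []
    for weight, means, variances in components:
        log_p = math.log(weight)
        for xi, mu, var in zip(x, means, variances):
            log_p += -0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
        log_terms.append(log_p)
    peak = max(log_terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def classify_tone(x: Sequence[float], tone_models: Dict[str, List[Component]]) -> str:
    """Return the tone whose model gives x the highest likelihood (equal priors assumed)."""
    return max(tone_models, key=lambda tone: gmm_log_likelihood(x, tone_models[tone]))


# Hypothetical toy parameters: component means only follow a +/- sign pattern.
_SIGNS = {"T1": (1, 1), "T2": (-1, 1), "T3": (-1, -1), "T4": (1, -1)}
TOY_MODELS: Dict[str, List[Component]] = {
    tone: [
        (0.6, [1.0 * sx, 1.0 * sy], [0.25, 0.25]),
        (0.4, [2.0 * sx, 2.0 * sy], [1.0, 1.0]),
    ]
    for tone, (sx, sy) in _SIGNS.items()
}
```

With these toy parameters, `classify_tone([0.8, -1.2], TOY_MODELS)` selects `"T4"`. In practice the mixture parameters would come from the training phase rather than being written by hand.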

[Second Embodiment]
The anchoring patterns can not only be used directly to generate F0 patterns from text, as in a conventional rule-based synthesis system, but can also be integrated with a probabilistic F0 generation system to improve the tonal intelligibility of the synthesized speech.

A rule-based intonation generation system predicts the intonation F0 pattern of the text to be synthesized using a large number of rules developed from a training corpus. The anchoring tone-discrimination patterns can easily be integrated with this rule set so that the generated F0 carries cues sufficient for identifying the tones. In other words, compared with the canonical tone patterns, the anchoring patterns leave more freedom in placing the F0 trajectory. This should improve the naturalness of the synthesized speech.

Current probabilistic F0 generation systems require a training stage before synthesis. A characteristic problem of this training is that the number of relevant contextual factors is so large that enough training data cannot be obtained to build a robust intonation model.

Taking Chinese as an example, there are about 1300 tonal syllables, so the number of possible three-syllable sequences exceeds 2.2 billion. If further factors are taken into account, such as emphasis, position within the prosodic phrase (initial, medial, final), and whether the sentence is interrogative or declarative, the number of possible combinations can reach the trillions. It is impossible to collect enough data to train a model for such an enormous number of contextual combinations. Because of this severe data sparseness problem, the generated intonation F0 often sounds unnatural.
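The size of the three-syllable context space follows from simple counting, as the short sketch below shows. The function name is hypothetical, and an inventory of exactly 1300 is an assumption standing in for the approximate figure quoted above.

```python
def trisyllable_contexts(n_tonal_syllables: int) -> int:
    """Number of ordered three-syllable sequences over the tonal syllable inventory."""
    return n_tonal_syllables ** 3


# With exactly 1300 tonal syllables (assumed), the count is about 2.2 billion.
print(f"{trisyllable_contexts(1300):,}")  # prints 2,197,000,000
```

Each additional contextual factor multiplies this count again, which is why the training data cannot cover the space.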

However, with the anchoring tone-discrimination patterns as an aid, a post-check module for examining the probabilistically synthesized F0 contour can easily be employed.

FIG. 12 shows an F0 generation unit 202 according to the second embodiment of the present invention, which generates an F0 sequence 204 from input text 200. The F0 generation unit 202 is used together with a probabilistic F0 model 206 of the kind used in prior-art F0 generators and with a tone model 208 of the kind used in the first embodiment of the present invention.

The F0 generation unit 202 includes: a text parsing module 220 for parsing the input text 200; an intonation F0 generation module 222 for generating, using the probabilistic F0 model 206, the intonation F0 of the text parsed by the text parsing module 220; and a post-check module 224 for checking the F0 sequence output by the intonation F0 generation module 222 against the tone model 208, correcting any F0 that does not fit the patterns, and outputting the corrected F0 sequence 204.
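The checking half of such a post-check step can be sketched as follows. This is a simplified illustration rather than the module 224 itself: in place of the trained tone model it substitutes the sign rule given in claims 5 to 8 (the signs of the onset (head) gap and offset (tail) gap), the function names are hypothetical, treating a gap of exactly zero as non-positive is an assumption, and the correction of flagged F0 values is left out because the text does not specify how it is done.

```python
from typing import List, Sequence, Tuple

# (onset gap positive?, offset gap positive?) -> tone number, per claims 5 to 8.
SIGN_RULE = {
    (True, True): 1,
    (False, True): 2,
    (False, False): 3,
    (True, False): 4,
}


def tone_from_gaps(onset_gap: float, offset_gap: float) -> int:
    """Tone implied by the gap signs; a zero gap counts as non-positive (assumption)."""
    return SIGN_RULE[(onset_gap > 0, offset_gap > 0)]


def post_check(
    intended_tones: Sequence[int],
    gaps: Sequence[Tuple[float, float]],
) -> List[int]:
    """Return indices of syllables whose generated F0 implies a different tone."""
    if len(intended_tones) != len(gaps):
        raise ValueError("one (onset gap, offset gap) pair is needed per syllable")
    return [
        i
        for i, (tone, (onset_gap, offset_gap)) in enumerate(zip(intended_tones, gaps))
        if tone_from_gaps(onset_gap, offset_gap) != tone
    ]
```

For example, `post_check([1, 4, 2], [(0.5, 0.4), (0.3, -0.6), (0.2, 0.1)])` flags only the third syllable, whose gaps imply Tone 1 instead of the intended Tone 2. A fuller version would score each nucleus with the tone model 208 and then adjust the flagged F0 values.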

As described above, the anchoring feature patterns ensure that the tones remain identifiable when the synthesized F0 contour is modified to reflect high-level speaking style, such as changes in speaking rate or emotion. This is something that prior-art Chinese TTS (text-to-speech) systems could hardly achieve.

The embodiments described above are merely illustrative and should not be construed as restrictive. The scope of the present invention is defined by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and range of equivalency of the wording of the claims.

FIG. 1 is a diagram showing Chinese tone patterns.
FIG. 2 is a diagram showing an example of the step-down effect observed in four consecutive fourth tones.
FIG. 3 is a diagram showing an example of contextual assimilation.
FIG. 4 is a diagram showing the effect of context exchange on the perception of the lexical tone placed in between.
FIG. 5 is a diagram showing the anchoring features of the four basic lexical tones in continuous speech.
FIG. 6 is a diagram showing the estimation of the onset gap and the offset gap for the step-down tones of FIG. 2.
FIG. 7 is a diagram showing the estimation of the onset gap and the offset gap for the second tone of FIG. 3.
FIG. 8 is a diagram showing the prediction of tone change in the context-exchange perception experiment.
FIG. 9 is a diagram showing the prediction of tone change in the context-exchange perception experiment.
FIG. 10 is a block diagram showing a tone classification system 100 according to the first embodiment of the present invention.
FIG. 11 is a diagram showing the extraction of the anchoring tone-discrimination features.
FIG. 12 is a diagram showing an F0 generation unit 202 according to the second embodiment of the present invention, which generates an F0 sequence checked and corrected against the anchoring tone patterns of the tone model.

100: tone classification system; 102: training data; 104: tone model training unit; 106: tone model; 108: input speech; 110: tone classification unit; 112: tone classification; 130, 150: acoustic segmentation module; 132, 152: feature extraction module; 134, 154: anchoring feature extraction module; 136: tone model estimation module; 156: pattern matching module; 200: input text; 202: F0 generation unit; 204: F0 sequence; 206: probabilistic F0 model; 208: tone model; 220: text parsing module; 222: intonation F0 generation module; 224: post-check module

Claims

1. A Chinese tone classification device, comprising means for storing a tone model trained, using a training data set, to output probabilities of the tone classes used in Chinese when given a set of features including features that identify a tone from the fundamental frequency (F0) at the tone nucleus boundaries of Chinese speech, wherein the features that identify a tone from F0 at the tone nucleus boundaries include a combination of an onset gap, an offset gap, and the sign of the slope of the F0 contour of the tone nucleus,
the tone classification device further comprising:
means for segmenting input Chinese speech data into a series of tone nuclei;
means for extracting, from each of the tone nuclei, the features that identify a tone from F0 at the tone nucleus boundaries; and
pattern matching means for applying the features extracted by the means for extracting to the tone model and selecting the tone classification that achieves the highest probability output by the tone model.

2. The Chinese tone classification device according to claim 1, wherein the features that identify a tone from F0 at the tone nucleus boundaries further include the onset F0 and the offset F0 of the tone nucleus.

3. The Chinese tone classification device according to claim 1, wherein the features that identify a tone from F0 at the tone nucleus boundaries further include the normalized power of the tone nucleus.

4. A Chinese tone classification device, comprising: means for storing a tone model trained, using a training data set, to output probabilities of the tone classes used in Chinese when given a set of features including features that identify a tone from F0 at the tone nucleus boundaries;
means for segmenting input Chinese speech data into a series of tone nuclei; and
means for extracting, from each of the tone nuclei, features that identify a tone from F0 at the tone nucleus boundaries, the features including a combination of an onset gap, an offset gap, and the sign of the slope of the F0 contour of the tone nucleus,
the tone classification device further comprising:
tone classification means for determining the tone class according to the combination of the onset gap, the offset gap, and the sign of the slope of the F0 contour of the tone nucleus extracted by the means for extracting, and outputting the determined class.

5. The Chinese tone classification device according to claim 4, wherein the tone classification means includes means for determining that the tone is the first tone of Chinese when the onset gap of the tone is positive and the offset gap is positive.

6. The Chinese tone classification device according to claim 5, wherein the tone classification means further includes means for determining that the tone is the second tone of Chinese when the onset gap of the tone is negative and the offset gap is positive.

7. The Chinese tone classification device according to claim 6, wherein the tone classification means further includes means for determining that the tone is the third tone of Chinese when the onset gap of the tone is negative and the offset gap is negative.

8. The Chinese tone classification device according to claim 7, wherein the tone classification means further includes means for determining that the tone is the fourth tone of Chinese when the onset gap of the tone is positive and the offset gap is negative.

9. A Chinese F0 generation device,
comprising means for storing a tone model trained, using a training data set, to output probabilities of the tone classes used in Chinese when given a set of features including features that identify a tone from F0 at the tone nucleus boundaries, wherein the features that identify a tone include a combination of an onset gap, an offset gap, and the sign of the slope of the F0 contour of the tone nucleus,
the F0 generation device further comprising:
means for storing a probabilistic F0 model that, given parsed Chinese text, outputs the probabilities of the possible Chinese tones for each speech unit in the Chinese text;
means for generating a sequence of F0 matching the input Chinese text according to the output of the probabilistic F0 model;
means for determining whether the F0 sequence output by the means for generating is consistent with the tone model;