JP5687611B2

JP5687611B2 - Phrase-end tone prediction device

Info

Publication number: JP5687611B2
Application number: JP2011269228A
Authority: JP
Inventors: 秀治中嶋; 歩相名神山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-12-08
Filing date: 2011-12-08
Publication date: 2015-03-18
Anticipated expiration: 2031-12-08
Also published as: JP2013120351A

Description

The present invention relates to a phrase-end tone prediction device that predicts, from text information, the phrase-end tones that occur frequently in conversation.

In conventional read-aloud speech synthesis intended for providing information, most of the sentences to be synthesized are declarative, and the fundamental frequency within each accent phrase generally tends to fall toward the end of the phrase: from the position of the accent nucleus onward when the accent phrase has one, and from the head of the accent phrase when it has none. In conversational speech between people, however, the fundamental frequency may fall toward the end of the accent phrase and then rise again, for example to convey the intention to continue the conversation.

This movement, in which the fundamental frequency rises again at some position before the end of the phrase after having fallen within the accent phrase, is called a "phrase-end tone". It is seen, for example, in utterances where the speaker wants to keep talking and raises the fundamental frequency again at each phrase end, as in 「〜して↑、〜して↑」. A label indicating the presence of such a re-rise is called a "phrase-end tone label". For a speech synthesizer to synthesize more natural conversational speech, it is essential to determine whether each accent phrase is one in which such a phrase-end tone occurs (the fundamental frequency rises again) at the phrase end, that is, to assign phrase-end tone labels.

Conventionally, for English news speech for example, whether the fundamental frequency rises again at a phrase end has been classified using as features the lengths and positions of various units such as phrases, sentences, and paragraphs, together with the parts of speech of several words before and after the phrase boundary (Non-Patent Document 1). As Non-Patent Document 1 makes clear, such prediction models are generally constructed automatically from a large amount of data.

K. Ross and M. Ostendorf, "Prediction of abstract prosodic labels for speech synthesis", Computer Speech and Language, vol. 10, issue 3, pp. 155-185 (1996).

However, conversation is not necessarily spoken with nearly correct grammar, as informational news text is; even when function words such as particles are omitted, the intent is conveyed by the content words. Phrase-end tone processing therefore needs to be based on the content of the utterance, not only on parts of speech as before. In addition, although a phrase-end tone occurs near a phrase boundary, its distance from the boundary is not always constant, so as many words as possible before and after the phrase boundary need to be incorporated as features for classification.

In the prior art, however, phrase-end tones were predicted using features obtained from part-of-speech information limited to, for example, a few words before and after the phrase boundary, as described above, so prediction accuracy was poor.

The present invention has been made in view of this problem, and its object is to provide a phrase-end tone prediction device that accurately predicts phrase-end tones based on a large amount of information beyond parts of speech.

The phrase-end tone prediction device of the present invention comprises a feature information extraction unit, a word information database, a feature conversion unit, a phrase-end tone prediction model, and a prediction unit. The feature information extraction unit receives morpheme information and accent phrase information and extracts from them the feature information required by the phrase-end tone prediction model: surface form, part of speech, reading, accent type of the accent phrase, and presence or absence of a pause at the end of the accent phrase. The word information database stores bit strings corresponding to the surface forms and parts of speech of all words handled. The feature conversion unit receives the feature information output by the feature information extraction unit and, referring to the bit strings stored in the word information database, generates a feature vector corresponding to the feature information. The phrase-end tone prediction model is a prediction model that performs binary classification of the presence or absence of a phrase-end tone. The prediction unit receives the feature vector, classifies it with the phrase-end tone prediction model, and, when a phrase-end tone is present, assigns a phrase-end tone label to the accent phrase.

According to the phrase-end tone prediction device of the present invention, phrase-end tone labels, which were conventionally assigned by hand, can be assigned automatically and with high accuracy by using a prediction model that performs binary classification of the presence or absence of a phrase-end tone from features that include surface forms, which are far more numerous than parts of speech. Consequently, when conversational speech is to be synthesized in large quantities from, for example, conversational sentences expressed as text, large amounts of input data for speech synthesis with accurate phrase-end tone labels can be generated.

FIG. 1 is a diagram showing an example functional configuration of the phrase-end tone prediction device 100 of the present invention.
FIG. 2 is a diagram showing the operation flow of the phrase-end tone prediction device 100.
FIG. 3 is a diagram showing an example of feature information.
FIG. 4 is a diagram showing an example of an accent nucleus.
FIG. 5 is a diagram showing the operation flow of the feature conversion unit 30.
FIG. 6 is a diagram showing an example structure of the feature vector for one accent phrase.
FIG. 7 is a diagram showing an example of the surface-form bit string 60.
FIG. 8 is a diagram showing an example of support vectors.
FIG. 9 is a diagram showing another example of a feature vector.
FIG. 10 is a diagram showing an example of the feature vector of Embodiment 4.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in multiple drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows an example functional configuration of the phrase-end tone prediction device 100 of the present invention, and FIG. 2 shows its operation flow. The phrase-end tone prediction device 100 comprises a feature information extraction unit 10, a word information database 20, a feature conversion unit 30, a phrase-end tone prediction model storage unit 40, a prediction unit 50, and a control unit 60. The device 100 is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

The feature information extraction unit 10 receives morpheme information and accent phrase information and extracts from them the feature information required by the phrase-end tone prediction model: surface form, part of speech, reading, accent type of the accent phrase, and presence or absence of a phrase-final pause (step S10). The word information database 20 stores bit strings corresponding to the surface forms and parts of speech of all words handled. These bit strings are stored in advance.

The feature conversion unit 30 receives the feature information output by the feature information extraction unit 10 and, referring to the bit strings stored in the word information database 20, generates a feature vector corresponding to the feature information (step S30).

The prediction unit 50 receives the feature vector, performs binary classification on it using the phrase-end tone prediction model stored in the phrase-end tone prediction model storage unit 40, and, when a phrase-end tone is present, assigns a phrase-end tone label to the accent phrase (step S50).

The feature information extraction process of step S10, the feature conversion process of step S30, and the prediction process of step S50 are repeated until all accent phrases of the input text data have been processed (no at step S60). The control unit 60 controls this repetition and the time-series operation among the functional units described above.
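
The repetition over accent phrases described above can be sketched as a simple loop. This is only an illustrative outline: the callables `extract`, `convert`, and `predict` and the dictionary-based phrase representation are hypothetical stand-ins for units 10, 30, and 50, not the implementation disclosed here.

```python
from typing import Callable, Iterable, List


def label_phrase_end_tones(
    accent_phrases: Iterable[dict],
    extract: Callable[[dict], dict],         # stand-in for unit 10 (step S10)
    convert: Callable[[dict], List[float]],  # stand-in for unit 30 (step S30)
    predict: Callable[[List[float]], bool],  # stand-in for unit 50 (step S50)
) -> List[bool]:
    """Return one label per accent phrase (True = phrase-end tone present)."""
    labels = []
    for phrase in accent_phrases:  # repeat until every phrase is processed (S60)
        info = extract(phrase)
        vector = convert(info)
        labels.append(predict(vector))
    return labels


# Toy stand-ins, used only to show the data flow between the three stages.
demo = label_phrase_end_tones(
    [{"length": 8}, {"length": 3}],
    extract=lambda p: p,
    convert=lambda info: [float(info["length"])],
    predict=lambda v: v[0] > 5,
)
print(demo)  # [True, False]
```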

As described above, the phrase-end tone prediction device 100 of the present invention can assign phrase-end tone labels automatically. The operation of each unit is described in more detail below with specific examples.

[Feature information extraction unit]
The feature information extraction unit 10 receives morpheme information and accent phrase information as input. Morpheme information is information such as surface form, part of speech, and reading. Accent phrase information is information such as the positions of accent phrase boundaries, the accent type of each accent phrase, and whether a pause follows the end of the accent phrase. From these inputs, the feature information extraction unit 10 extracts the feature information needed to classify the accent phrase for which the phrase-end tone is being predicted.

As the feature information to be extracted, the sequence of surface forms and the sequence of parts of speech of the words contained in the accent phrase under prediction are taken out using the position information of the accent phrase boundaries. Next, the readings contained in the accent phrase are concatenated and the length of the whole accent phrase is extracted. The accent position within the accent phrase (the position of the accent nucleus) is extracted from the accent type information of the accent phrase, as is the information on whether a pause follows the end of the accent phrase.

FIG. 3 shows an example of the feature information extracted from one accent phrase. From the left of FIG. 3, it lists the sequences of surface forms, parts of speech, and readings, followed by the accent nucleus position of the accent phrase and the presence or absence of a phrase-final pause. FIG. 4 shows the position of the accent nucleus. The horizontal direction in FIG. 4 is elapsed time, and L and H denote low and high frequency. In this example the accent nucleus is the third reading from the beginning. The accent nucleus position may instead be expressed as the number of readings counted from the end of the accent phrase.
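
As a minimal sketch of the extraction just described, the per-phrase feature information can be held in a small record. The class and function names are hypothetical, and the example uses the accent phrase 「出てくるんですね」 that the description works through; the kana segmentation of the readings is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FeatureInfo:
    surface_forms: List[str]
    parts_of_speech: List[str]
    readings: List[str]      # one entry per reading unit
    nucleus_position: int    # counted from the start of the phrase
    pause_after: bool        # True if a pause follows the phrase


def extract_feature_info(
    morphemes: List[Tuple[str, str, List[str]]],
    nucleus_position: int,
    pause_after: bool,
) -> FeatureInfo:
    """Collect per-phrase sequences from (surface, POS, readings) triples."""
    surfaces = [m[0] for m in morphemes]
    pos_tags = [m[1] for m in morphemes]
    readings = [r for m in morphemes for r in m[2]]  # concatenate the readings
    return FeatureInfo(surfaces, pos_tags, readings, nucleus_position, pause_after)


# Hypothetical analysis of one accent phrase (kana segmentation assumed).
info = extract_feature_info(
    [
        ("出", "verb stem", ["デ"]),
        ("て", "inflectional ending", ["テ"]),
        ("くる", "auxiliary verb", ["ク", "ル"]),
        ("ん", "auxiliary noun", ["ン"]),
        ("ですね", "copula", ["デ", "ス", "ネ"]),
    ],
    nucleus_position=3,
    pause_after=True,
)
print(len(info.surface_forms), len(info.readings))  # 5 8
```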

The morpheme information and accent phrase information are obtained as the result of the language analysis processing performed by a conventional speech synthesizer.

[Feature conversion unit]
FIG. 5 shows the operation flow of the feature conversion unit 30. The feature conversion unit 30 uses the feature information of all words (surface forms) constituting the accent phrase and of their parts of speech. Instead of building the feature vector under a cap on the number of words, as was done in the prior art, the feature vector is sized by the number of words (surface forms) and their corresponding parts of speech.

As a result, even when the number of words in an accent phrase varies, the feature vector always has a size equal to the sum of the number of surface forms and the number of corresponding parts of speech. The prior art excluded from the features any words beyond a fixed upper limit on their number; this representation makes it possible to include words and parts of speech that could not previously be included.

FIG. 6 shows an example structure of the feature vector for one accent phrase. The feature vector consists of a surface-form bit string 60, a part-of-speech bit string 61, an accent phrase length 62, an accent nucleus position 63, and a phrase-final pause flag 64. The order is not limited to this example; the surface-form bit string 60 and the part-of-speech bit string 61 may be swapped. The numeric sequence in FIG. 6 is one example of a feature vector.

The feature conversion unit 30 receives the feature information for one accent phrase extracted by the feature information extraction unit 10 and, referring to the bit strings stored in the word information database 20, generates the corresponding feature vector. When the input feature information is surface-form information of the accent phrase (yes at step S31), the feature conversion unit 30 refers to the surface-form bit string stored in the word information database 20 and sets the bit corresponding to the input surface form to "1" (step S32).

FIG. 7 illustrates part of the surface-form bit string 60. Bits corresponding to the surface forms 「出」, 「買」, 「売」, 「借」, and so on are arranged, one for every surface form handled by the phrase-end tone prediction device 100. If the vocabulary contains 60,000 words, for example, the surface-form bit string 60 is an array of 60,000 bits.

When the accent phrase is 「出てくるんですね」, the surface form 「出」 is located at the head of the surface-form bit string 60 in this example, so the first bit of the bit string is set to "1". The process of setting the bit for each input surface form is repeated until the bits at the positions of all surface forms in the accent phrase have been set to "1". The accent phrase 「出てくるんですね」 contains five surface forms, so the five bits at the corresponding positions among the 60,000 bits are set to "1".
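
The bit-setting procedure for the surface-form bit string can be sketched as a multi-hot encoding. The eight-entry vocabulary below is a toy stand-in for the word information database (which would hold, for example, 60,000 entries), and the function name `multi_hot` is hypothetical.

```python
from typing import Dict, Iterable, List


def multi_hot(items: Iterable[str], index: Dict[str, int]) -> List[int]:
    """Set to 1 the bit at each item's position in `index`; all others stay 0."""
    bits = [0] * len(index)
    for item in items:
        bits[index[item]] = 1
    return bits


# Toy vocabulary: the first four entries follow FIG. 7, the rest are assumed.
surface_index = {w: i for i, w in enumerate(
    ["出", "買", "売", "借", "て", "くる", "ん", "ですね"])}

surface_bits = multi_hot(["出", "て", "くる", "ん", "ですね"], surface_index)
print(surface_bits)  # [1, 0, 0, 0, 1, 1, 1, 1]
```

Because each word only switches on its own position, the length of the bit string depends on the vocabulary size and not on how many words the accent phrase contains.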

Parts of speech are handled in the same way as surface forms: the bit at the position of each part of speech in the part-of-speech bit string 61 is set to "1" (step S34). The part-of-speech bit string 61 has one bit per part of speech.

Next, the feature conversion unit 30 counts the number of readings in the feature information and takes it as the accent phrase length (step S36). In this example the accent phrase ends after eight readings have been counted (yes at step S37), so the unit appends, after the surface-form bit string 60 and the part-of-speech bit string 61, the accent phrase length (here 8), the accent nucleus position from the feature information (here 3), and a bit indicating the phrase-final pause (here "present"), which completes the feature conversion of one accent phrase (step S38). The reading count from step S36 is then reset in preparation for the next accent phrase. Steps S31 to S38 are repeated until all accent phrases have been processed. The numeric sequence shown in FIG. 6 is an example of the feature vector of the accent phrase 「出てくるんですね」; in it, the "0" bits of the surface-form bit string 60 and the "0" and "1" bits of the part-of-speech bit string 61 are abbreviated. Although the feature conversion unit 30 counts readings in this example, the readings may be replaced by phonemes, syllables, or morae.
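
Putting steps S31 to S38 together, the feature vector of one accent phrase can be sketched as the concatenation of the two bit strings and three scalar values. The function name, the toy vocabularies, and the kana segmentation of the readings are assumptions for illustration; the values 8, 3, and "pause present" follow the example in the text.

```python
from typing import Dict, List, Sequence


def build_feature_vector(
    surfaces: Sequence[str],
    pos_tags: Sequence[str],
    readings: Sequence[str],
    nucleus_position: int,
    pause_after: bool,
    surface_index: Dict[str, int],
    pos_index: Dict[str, int],
) -> List[int]:
    """Concatenate surface bits, POS bits, phrase length, nucleus, and pause."""
    surface_bits = [0] * len(surface_index)
    for s in surfaces:                      # steps S31-S32
        surface_bits[surface_index[s]] = 1
    pos_bits = [0] * len(pos_index)
    for p in pos_tags:                      # step S34
        pos_bits[pos_index[p]] = 1
    phrase_length = len(readings)           # step S36
    tail = [phrase_length, nucleus_position, int(pause_after)]  # step S38
    return surface_bits + pos_bits + tail


# Toy stand-ins for the word information database.
surface_index = {w: i for i, w in enumerate(
    ["出", "買", "売", "借", "て", "くる", "ん", "ですね"])}
pos_index = {p: i for i, p in enumerate(
    ["verb stem", "inflectional ending", "auxiliary verb",
     "auxiliary noun", "copula"])}

vec = build_feature_vector(
    surfaces=["出", "て", "くる", "ん", "ですね"],
    pos_tags=["verb stem", "inflectional ending", "auxiliary verb",
              "auxiliary noun", "copula"],
    readings=["デ", "テ", "ク", "ル", "ン", "デ", "ス", "ネ"],
    nucleus_position=3,
    pause_after=True,
    surface_index=surface_index,
    pos_index=pos_index,
)
print(len(vec), vec[-3:])  # 16 [8, 3, 1]
```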

[Phrase-end tone prediction model storage unit]
The phrase-end tone prediction model storage unit 40 stores the phrase-end tone prediction model, which is a prediction model that performs binary classification of the presence or absence of a phrase-end tone; support vectors, which are feature vectors that characterize the classification boundary; and the classification category of each support vector. To predict accurately when little data is available, the phrase-end tone prediction model does not model the shape of each category's distribution but models only the classification boundary between the categories. A classifier such as a support vector machine can be used for this, for example.
In general, classification with a support vector machine uses the following equation.
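
The equation itself does not appear in this text. As an assumption based only on the symbol definitions that follow (support vectors s_k, classification categories y_k, offset b, and kernel function K), a decision function of the usual support-vector-machine form would be:

```latex
\hat{y}(x) = \operatorname{sgn}\!\left( \sum_{k} y_k \, K(x, s_k) + b \right)
```

Standard support vector machine formulations also weight each term by a learned coefficient \alpha_k \ge 0. The description does not mention such weights, so whether the original equation includes them, and which sign convention it uses for b, is left open here.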

Here, s_k is a support vector, that is, a feature vector that characterizes the classification boundary. y_k is the classification category of the support vector s_k with which it is paired. b is a value corresponding to the intercept of the classification boundary surface; it is obtained during training, at the same time as the support vectors s_k are selected from the training data.

FIG. 8 shows an example of the support vectors s_k and classification categories y_k stored in the phrase-end tone prediction model storage unit 40. Each classification category y_k is set to +1 or -1. +1 means that a phrase-end tone is present, that is, the fundamental frequency rises again at the end of the accent phrase. -1 means that no phrase-end tone is present, that is, the fundamental frequency does not rise again at the phrase end.

The function K(·) takes the feature vector x and a support vector s_k as inputs and computes a non-negative value corresponding to the similarity between the two vectors. A polynomial function or a Gaussian function is used for K(·). When the output of K(·) is high, the classification category y_k assigned to the support vector s_k is weighted strongly. As in equation (1), the product of y_k and K(·) is computed and summed over all support vectors s_k to determine the classification result, that is, whether or not a phrase-end tone is present.
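
The summation over support vectors described above can be sketched as follows. This is a minimal illustration, not the disclosed model: the Gaussian kernel width `gamma`, the unweighted sum (no per-vector coefficients), the `+ b` sign convention, and the two made-up support vectors are all assumptions.

```python
import math
from typing import List, Sequence


def gaussian_kernel(x: Sequence[float], s: Sequence[float], gamma: float = 0.5) -> float:
    """Non-negative similarity: 1 when x equals s, decaying with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, s))
    return math.exp(-gamma * sq_dist)


def classify(
    x: Sequence[float],
    support_vectors: List[Sequence[float]],
    categories: List[int],   # y_k, each +1 or -1
    b: float,
) -> int:
    """Return +1 (phrase-end tone present) or -1 (absent)."""
    score = sum(
        y * gaussian_kernel(x, s) for s, y in zip(support_vectors, categories)
    ) + b
    return 1 if score > 0 else -1


# Two made-up support vectors, one per category.
svs = [[1.0, 0.0], [0.0, 1.0]]
ys = [+1, -1]
print(classify([0.9, 0.1], svs, ys, b=0.0))  # 1
print(classify([0.1, 0.9], svs, ys, b=0.0))  # -1
```

An input close to a support vector of category +1 makes that term dominate the sum, which is the sense in which a high kernel output favors the category attached to that support vector.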

[Prediction unit]
The prediction unit 50 reads the phrase-end tone prediction model, the support vectors s_k, and the classification categories y_k stored in the phrase-end tone prediction model storage unit 40. Taking the feature vector produced by the feature conversion unit 30 as input, it uses the phrase-end tone prediction model to determine whether the fundamental frequency rises again at the end of the accent phrase, and assigns a phrase-end tone label when a phrase-end tone is present.

The feature vector may also be composed of a partial combination of the elements described above. For example, it may consist only of the surface-form bit string 60 and the part-of-speech bit string 61, or of the surface-form bit string 60, the accent phrase length 62, the accent nucleus position 63, and the phrase-final pause flag 64; the composition of the feature vector is not limited to the example shown in FIG. 6.

The surface-form bit string 60 and the part-of-speech bit string 61 may also be merged into a single element of the feature vector. For example, only the elements corresponding to each of the following pairs are set to "1": the surface form 「出」 with the part of speech "verb stem", and likewise 「て」 with "inflectional ending", 「くる」 with "auxiliary verb", 「ん」 with "auxiliary noun", and 「ですね」 with "copula".

FIG. 9 shows a feature vector in which each surface form and its part of speech are treated as one pair and encoded in a surface-form/part-of-speech bit string 80. The accent phrase length 62, accent nucleus position 63, and phrase-final pause flag 64 are the same as in the feature vector of Embodiment 1 shown in FIG. 6. The bits of the surface-form/part-of-speech bit string 80 correspond to, for example, "surface form_出 : part of speech_verb stem", "surface form_買 : part of speech_verb stem", ..., "surface form_ですね : part of speech_copula", and so on.

Handling the feature vector in this way makes the co-occurrence relationship between surface forms and parts of speech explicit. It can also reduce the number of dimensions of the feature vector.
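
The paired encoding can be sketched by indexing (surface form, part of speech) pairs instead of indexing the two separately. The pair inventory below is a toy assumption and the names `Pair`, `pair_bits`, and `pair_index` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (surface form, part of speech)


def pair_bits(pairs: Sequence[Pair], pair_index: Dict[Pair, int]) -> List[int]:
    """Multi-hot encode (surface form, POS) pairs as a single bit string."""
    bits = [0] * len(pair_index)
    for pair in pairs:
        bits[pair_index[pair]] = 1
    return bits


# Toy inventory: only the pairs listed here receive a dimension.
pair_index = {p: i for i, p in enumerate([
    ("出", "verb stem"),
    ("買", "verb stem"),
    ("て", "inflectional ending"),
    ("くる", "auxiliary verb"),
    ("ん", "auxiliary noun"),
    ("ですね", "copula"),
])}

phrase = [
    ("出", "verb stem"),
    ("て", "inflectional ending"),
    ("くる", "auxiliary verb"),
    ("ん", "auxiliary noun"),
    ("ですね", "copula"),
]
print(pair_bits(phrase, pair_index))  # [1, 0, 1, 1, 1, 1]
```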

As in Embodiment 1, the feature vector may be composed of a partial combination of its elements. For example, it may consist only of the surface-form/part-of-speech bit string 80 and the accent phrase length 62, or of the surface-form/part-of-speech bit string 80, the accent phrase length 62, and the accent nucleus position 63; the composition of the feature vector is not limited to the example shown in FIG. 9.

The embodiments above generate a feature vector covering all surface forms in the accent phrase, but not all surface forms need to be covered. The words may be limited to N words counted from the end of the accent phrase toward its beginning. For example, the feature vector may be limited to N = 2 surface forms.

Taking the accent phrase 「出てくるんですね」 again and limiting it to, for example, two words from the phrase end: if the feature vector uses the surface-form/part-of-speech bit string 80 of Embodiment 2, only the two bits of bit string 80 corresponding to the pair 「ん」 and "auxiliary noun" and the pair 「ですね」 and "copula" are set to "1". All elements corresponding to the surface-form/part-of-speech pairs of the other words are 0. The remaining information, namely the accent phrase length, the accent nucleus position, and the presence or absence of a phrase-final pause, is converted and set in the same way as in Embodiment 1 or 2. In this embodiment the amount of processing in the feature conversion unit 30 is reduced, which lightens the load on the computer. The same idea also applies to the feature vector of Embodiment 1 (FIG. 6).
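
Restricting the encoding to the last N words can be sketched by slicing the pair sequence before setting bits. The function name and the toy pair inventory are hypothetical; N = 2 follows the example in the text.

```python
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (surface form, part of speech)


def last_n_pair_bits(
    pairs: Sequence[Pair], pair_index: Dict[Pair, int], n: int
) -> List[int]:
    """Multi-hot encode only the last `n` pairs of the accent phrase."""
    if n < 1:
        raise ValueError("n must be at least 1")
    bits = [0] * len(pair_index)
    for pair in pairs[-n:]:  # words counted from the phrase end
        bits[pair_index[pair]] = 1
    return bits


# Toy inventory of (surface form, POS) pairs.
pair_index = {p: i for i, p in enumerate([
    ("出", "verb stem"),
    ("買", "verb stem"),
    ("て", "inflectional ending"),
    ("くる", "auxiliary verb"),
    ("ん", "auxiliary noun"),
    ("ですね", "copula"),
])}

phrase = [
    ("出", "verb stem"),
    ("て", "inflectional ending"),
    ("くる", "auxiliary verb"),
    ("ん", "auxiliary noun"),
    ("ですね", "copula"),
]
print(last_n_pair_bits(phrase, pair_index, n=2))  # [0, 0, 0, 0, 1, 1]
```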

The N words from the end of the accent phrase may also be represented position by position. That is, as the information for the i-th word from the phrase end, a partial vector having elements for all combinations of surface form and part of speech is prepared, and N such partial vectors are concatenated to form the feature vector. To represent the case where the phrase has fewer than N words, an element indicating that there is no word is added at each position i.

FIG. 10 shows an example for N = 3 in which there is no third word from the phrase end. The surface-form/part-of-speech bit string 80_1 is the first partial feature vector counted from the phrase end, and the surface-form/part-of-speech bit string 80_2 is the second. Because this example has words only up to the second position from the phrase end, the third partial feature vector (surface-form/part-of-speech bit string) 80_3 has, at its end, a bit "1" indicating that the third word from the phrase end does not exist.
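
The position-wise encoding can be sketched as N concatenated partial vectors, each with one extra trailing element that flags a missing word. The function name and the three-pair inventory are hypothetical; the case shown (N = 3, two words present) mirrors the situation described for FIG. 10.

```python
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (surface form, part of speech)


def positional_pair_vector(
    pairs: Sequence[Pair], pair_index: Dict[Pair, int], n: int
) -> List[int]:
    """Concatenate n partial vectors, one per position from the phrase end."""
    width = len(pair_index) + 1       # +1 for the "no word here" flag
    vector: List[int] = []
    for i in range(1, n + 1):         # i-th word counted from the phrase end
        part = [0] * width
        if i <= len(pairs):
            part[pair_index[pairs[-i]]] = 1
        else:
            part[-1] = 1              # the phrase has fewer than i words
        vector.extend(part)
    return vector


# Toy inventory of (surface form, POS) pairs.
pair_index = {p: i for i, p in enumerate([
    ("くる", "auxiliary verb"),
    ("ん", "auxiliary noun"),
    ("ですね", "copula"),
])}

two_words = [("ん", "auxiliary noun"), ("ですね", "copula")]
vec = positional_pair_vector(two_words, pair_index, n=3)
print(vec)  # [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]
```

The result always has n times (inventory size + 1) elements, regardless of how many words the accent phrase contains.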

In this way, a fixed-length feature vector can be created with a limited number of words (N) for the accent phrase whose phrase-end tone is to be determined. Making the feature vector fixed-length simplifies the program that causes a computer to function as the phrase-end tone prediction device 100 of the present invention. It also simplifies the hardware configuration when the phrase-end tone prediction device 100 is implemented in hardware.

Although the above embodiment predicts the presence or absence of a phrase end tone for each accent phrase individually, that is, by looking at only one accent phrase, information on the accent phrases before and after the target accent phrase may also be used to predict whether the target accent phrase has a phrase end tone.
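One way such neighboring information could be used is to concatenate the feature vectors of the surrounding accent phrases with that of the target. The description does not specify how the context is combined, so the following is only a sketch under assumptions: the function name `add_context`, the `window` parameter, and the zero padding at the start and end of the utterance are hypothetical, and all per-phrase vectors are assumed to have the same length.

```python
from typing import List, Sequence


def add_context(vectors: Sequence[Sequence[int]],
                index: int,
                window: int = 1) -> List[int]:
    """Concatenate the feature vectors of the phrases around `index`.

    Phrases from index - window to index + window are joined in order.
    Positions outside the utterance are filled with zeros (an assumption),
    so the result always has (2 * window + 1) * width elements.
    """
    width = len(vectors[index])
    out: List[int] = []
    for j in range(index - window, index + window + 1):
        if 0 <= j < len(vectors):
            out.extend(vectors[j])
        else:
            out.extend([0] * width)   # no neighboring phrase at this side
    return out
```

Because the result is again fixed-length, it can be passed to the same kind of binary classifier as the single-phrase feature vector.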

According to the phrase end tone prediction apparatus 100 of the present invention described above, when conversational speech is to be synthesized in large quantities from conversational sentences expressed as text, a large amount of text data with accurate phrase end tone labels can be generated. The text data generated by the phrase end tone prediction apparatus 100 can be used as input data for a speech synthesizer, and because the presence or absence of a phrase end tone is accurately marked in that text data, the synthesized speech can also be made expressive.

The processes described for the above method and apparatus need not be executed only in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus executing them or as needed.

When the processing means of the above apparatus are implemented by a computer, the processing content of the functions that each apparatus should have is described by a program. Executing this program on the computer implements the processing means of each apparatus on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be implemented in hardware.

Claims

A phrase end tone prediction apparatus comprising:
a feature information extraction unit that receives morpheme information and accent phrase information as input and extracts the feature information required by a phrase end tone prediction model, such as surface form, part of speech, reading, accent type of the accent phrase, and presence or absence of a pause at the end of the accent phrase;
a word information database that stores bit strings corresponding to the surface forms and parts of speech of all words to be handled;
a feature conversion unit that receives the feature information output by the feature information extraction unit and, referring to the bit strings stored in the word information database, generates a feature vector corresponding to the feature information;
the phrase end tone prediction model, which models the decision boundary between presence and absence of a phrase end tone and classifies an input feature vector into one of two classes, presence or absence of a phrase end tone; and
a prediction unit that receives the feature vector output by the feature conversion unit, performs binary classification of the feature vector using the phrase end tone prediction model, and, when a phrase end tone is present, assigns a phrase end tone label to the accent phrase.