JP6002598B2 - Emphasized position prediction apparatus, method thereof, and program

Publication number: JP6002598B2
Application number: JP2013032129A
Authority: JP
Inventors: Hideharu Nakajima (中嶋 秀治); Hideyuki Mizuno (水野 秀之); Osamu Yoshioka (吉岡 理); Yusuke Ijima (井島 勇祐)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-02-21
Filing date: 2013-02-21
Publication date: 2016-10-05
Anticipated expiration: 2033-02-21
Also published as: JP2014163978A

Description

The present invention relates to speech synthesis technology, and in particular to a technique for predicting the positions at which speech is emphasized.

Emphasis occurs frequently in naturally produced "expressive speech", for example when delivering lines suited to a movie scene, telling a fairy tale aloud, advertising a product through television or other media, or answering the telephone at a call center. In synthesized speech generated by speech synthesis as well, applying appropriate emphasis increases the naturalness of the synthesized speech.

As described in Non-Patent Document 1, when a specific section is uttered with emphasis, the fundamental frequency of the emphasized section is higher than that of portions uttered in a read-aloud style. Even when a conventional speech synthesizer builds a speech synthesis model from a variety of everyday expressive speech that differs from read-aloud speech and synthesizes speech with that model, it cannot adequately reproduce the pitch of the voice in such emphasized sections.

In Non-Patent Document 1, a section in which the difference in fundamental frequency between the original speech and the synthesized speech exceeds a threshold is estimated to be an emphasized section, a mark (emphasis mark) is assigned to each estimated emphasized section, and the speech synthesis model is rebuilt using training data that includes those emphasis marks. This improves how well voice pitch is reproduced.
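The thresholding step attributed to Non-Patent Document 1 can be pictured with a minimal sketch. It assumes that a mean fundamental frequency per section is already available for both the original and the synthesized speech; the function name, the per-section representation, the sign of the difference, and the threshold value are illustrative assumptions, not details given in this document.

```python
from typing import List, Sequence


def mark_emphasized_sections(
    original_f0: Sequence[float],
    synthesized_f0: Sequence[float],
    threshold: float,
) -> List[bool]:
    """Return True for each section whose F0 difference exceeds the threshold.

    Assumption: each entry is the mean F0 (Hz) of one section, and the
    difference is taken as original minus synthesized.
    """
    if len(original_f0) != len(synthesized_f0):
        raise ValueError("both F0 sequences must cover the same sections")
    return [
        (orig - synth) > threshold
        for orig, synth in zip(original_f0, synthesized_f0)
    ]


# Hypothetical values: only the second section differs by more than 30 Hz.
marks = mark_emphasized_sections(
    [200.0, 260.0, 180.0], [195.0, 210.0, 185.0], threshold=30.0
)
```

Sections flagged in this way would then carry emphasis marks in the training data used to rebuild the synthesis model.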

Non-Patent Document 1: Maeno, Nose, Kobayashi, Ijima, Nakajima, Mizuno, Yoshioka, "A study on automatic assignment of local prosodic context for emphasized speech synthesis" (強調音声合成のための局所韻律コンテキスト自動付与の検討), IEICE Technical Report (Speech), 112(81), pp. 1-6, 2012.

To perform speech synthesis with the method of Non-Patent Document 1, it must be decided, for each section of the input text to be synthesized, whether to assign an emphasis mark. However, because the method of Non-Patent Document 1 decides whether to assign an emphasis mark from the difference between the original speech and the synthesized speech, the properties of the sections marked in this way are not necessarily the same as the properties of emphasized sections that conventional research in language processing and phonetics has estimated from the linguistic information of the text.

The present invention has been made in view of this problem, and its object is to provide a technique that can predict the emphasized sections of an input text on the basis of the linguistic information of the input text.

In the present invention, an emphasis position prediction model is stored that represents the relationship between a feature composed from at least one of the morphological analysis result and the dependency analysis result of a text and the emphasis positions of that text. When speech synthesis is performed, a feature of the same kind is obtained from at least one of the morphological analysis result and the dependency analysis result of the input text, and the emphasis positions of the input text are identified using that feature and the emphasis position prediction model.

According to the present invention, the emphasized sections of an input text can be predicted on the basis of the linguistic information of the input text. With a method such as that of Non-Patent Document 1, the emphasized sections vary depending on the training data used when the model is created. The present invention can predict, from the text, emphasized sections that accommodate such variation.

FIG. 1 is a diagram showing an example of the functional configuration of the emphasis position prediction apparatus.
FIG. 2 is a diagram showing the operation flow of the emphasis position prediction apparatus.
FIG. 3 shows an example of the set of features composed by the feature construction unit.

Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 is a block diagram of the emphasis position prediction apparatus 100 of this embodiment, and FIG. 2 shows its operation flow.

The emphasis position prediction apparatus 100 of this embodiment has a control unit 111, a feature construction unit 112, an emphasis position prediction unit 113, and an emphasis position prediction model storage unit 114. The apparatus 100 may further have a category name dictionary storage unit 115 that stores a category name dictionary. The emphasis position prediction apparatus 100 is, for example, an apparatus configured by loading a predetermined program into a known computer equipped with a CPU (central processing unit), RAM (random-access memory), and the like. At least part of the apparatus 100 may be implemented in hardware such as an integrated circuit. The apparatus 100 executes each process under the control of the control unit 111.

An analysis result storage unit 101, external to the emphasis position prediction apparatus 100, stores input text to which morphological analysis results and dependency analysis results have been attached as linguistic information. The input text is a sequence of one or more words to be synthesized, and this sequence is divided into emphasis occurrence ranges.

Morphological analysis is a technique that splits text into words and attaches dictionary information, such as part of speech and reading, to each word. It can be carried out, for example, by the method of Reference 1 (Yuji Matsumoto, "Morphological analysis system ChaSen" (形態素解析システム『茶筅』), Joho Shori (IPSJ Magazine), 41(11), pp. 1208-1214, 2000).

Dependency analysis is a technique that takes morphological analysis results as input, forms bunsetsu (phrasal units, each consisting of a sequence of words), and predicts the dependency relations between bunsetsu from relations among parts of speech, surface forms of words, and word IDs. It can be carried out, for example, by the method of Reference 2 (Taku Kudo, Yuji Matsumoto, "Japanese dependency analysis using cascaded chunking" (チャンキングの段階適用による日本語係り受け解析), IPSJ Journal, 43(6), pp. 1834-1842, 2002). These techniques are themselves prior art, and a detailed description of them is omitted.

Prosodic processing, such as the processing of voice pitch, is often performed in units of accent phrases. An accent phrase is a sequence of one or more bunsetsu. The dependency information of the first bunsetsu and of the last bunsetsu of an accent phrase can be used as dependency information between accent phrases. It is known that accent phrase boundaries can be predicted, for example, as in Reference 3 (Nakajima, H., Miyazaki, N., Yoshida, A., Nakamura, T., Mizuno, H., "Creation and Analysis of a Japanese Speaking Style Parallel Database for Expressive Speech Synthesis", in Proc. Oriental COCOSDA, 2010, http://desceco.org/O-COCOSDA2010/proceedings/paper_30.pdf).

An emphasis occurrence range is a section consisting of a word or word sequence, determined according to a predetermined criterion. Any kind of section may serve as an emphasis occurrence range; for example, the word or word sequence in a section bounded by two pause positions (an intonation phrase) can be treated as one emphasis occurrence range. Pause positions can be predicted, for example, by the method described in Reference 4 (Kogure (supervising editor), Yamamori (editor), "Mirai Net Technology Series 4: Media Processing Technology" (未来ねっと技術シリーズ４ メディア処理技術), pp. 76-77, The Telecommunications Association). Alternatively, an accent phrase may be used as the emphasis occurrence range.
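As a minimal sketch of the pause-bounded case, the following groups a word sequence into emphasis occurrence ranges. It assumes that pause positions have already been predicted and are supplied as one flag per word; the function name and this flag representation are illustrative assumptions.

```python
from typing import List, Sequence


def split_into_ranges(
    words: Sequence[str], pause_after: Sequence[bool]
) -> List[List[str]]:
    """Group words into ranges, closing a range wherever a pause follows a word.

    Assumption: pause_after[i] is True when a pause is predicted right after
    words[i].
    """
    if len(words) != len(pause_after):
        raise ValueError("one pause flag is required per word")
    ranges: List[List[str]] = []
    current: List[str] = []
    for word, has_pause in zip(words, pause_after):
        current.append(word)
        if has_pause:
            ranges.append(current)
            current = []
    if current:  # the sentence end closes the last range even without a pause
        ranges.append(current)
    return ranges


# Hypothetical five-word sentence with pauses after the 2nd and 5th words.
ranges = split_into_ranges(
    ["w0", "w1", "w2", "w3", "w4"], [False, True, False, False, True]
)
```

Using accent phrases as ranges instead would only change where the boundary flags come from.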

The feature construction unit 112 reads, from the analysis result storage unit 101, at least one of the morphological analysis result and the dependency analysis result corresponding to each emphasis occurrence range of the input text (steps S11, S12), and generates and outputs a feature corresponding to the information read. The feature construction unit 112 generates a feature for every emphasis occurrence range of the input text. A feature is information that can be extracted from at least one of the morphological analysis result and the dependency analysis result. For example, the feature is a sequence (for example, a vector or a concatenation of element values) consisting of all of elements a) to e) below or a combination of some of them. From the standpoint of emphasis position prediction performance, it is desirable that the feature be a sequence that includes both elements a) and b); more preferably, including both a) and b) is treated as mandatory. A sequence containing some or all of elements c-1) to c-4) below may also be added to form the feature.

a) Information representing the surface forms (also called appearance forms) of the word of interest within the emphasis occurrence range and of the N words before and the N words after it may be used as an element of the feature. This information can be taken from the morphological analysis result. Conditions such as which word position relative to the emphasis occurrence range is taken as the "word of interest", how many words are taken as "words of interest" for each emphasis occurrence range, and what value N takes are determined in advance. For example, a preliminary experiment with training data is run beforehand, and the conditions that gave the best performance are adopted. For example, among the words that stand in a specific relation to the emphasis occurrence range, only the head word may be taken as the "word of interest", or all such words may be. A "word that stands in a specific relation to the emphasis occurrence range" may be a word belonging to the range, a word within a predetermined range from it, or a word a predetermined distance away from it. N is an integer greater than or equal to 0; it may be the same for all input data, or it may differ according to, for example, the type of input data. These conditions are set in the same way for the other elements below. The conditions used in a) to e) may, however, be the same or different.

b) Information representing the parts of speech of the word of interest and of the N words before and the N words after it (that is, the 2N+1 words of a) above) may be used as an element of the feature. This information, too, can be taken from the morphological analysis result.

c) With respect to the words around the emphasis occurrence range, information representing the relation between the word of interest and other words that stand in a dependency relation with it may be used as an element of the feature. This information can be taken from the dependency analysis result. For example, the following can be used as elements of the feature.
c-1) Information representing the distance, counted in words, from the word of interest to the other word on the sentence-end side on which the word of interest depends.
c-2) Information representing the minimum distance, counted in words, from the word of interest to the other words on the sentence-start side that depend on it; that is, the distance from the word of interest to the nearest sentence-start-side word that depends on it.
c-3) Information representing the maximum distance, counted in words, from the word of interest to the other words on the sentence-start side that depend on it; that is, the distance from the word of interest to the farthest sentence-start-side word that depends on it.
c-4) Information representing the number of words on the sentence-start side that depend on the word of interest.
In c-1), when the word of interest is not the sentence-final word and there is no word toward the sentence end on which it depends, the value of the element is 0. In c-2) and c-3), when the word of interest is not the sentence-initial word and no word on the sentence-start side depends on it, the value of the element is 0. Information obtained by replacing "sentence-end side" in c-1) with "sentence-start side" may also be used as an element, as may information obtained by replacing "sentence-start side" in c-2) to c-4) with "sentence-end side". The dependencies in c) are normally obtained per bunsetsu, but when dependency information is to be obtained per accent phrase, the unit frequently used in speech synthesis, the distances and counts described above, measured from the words before and after an accent phrase boundary to the "word of interest", can be used as elements of the feature. Accent phrase boundaries may be identified from information, input to the feature construction unit 112, that indicates accent phrase boundaries and the presence or absence of pauses, or they may be predicted by the feature construction unit 112 on the basis of, for example, Reference 3 above. The "word of interest" is each individual word within the emphasis occurrence range.

d) Information representing the presence or absence of a pause at the positions immediately before and after the accent phrase of interest within the emphasis occurrence range may be used as an element. This information is obtained on the basis of the results of References 3 and 4. The presence or absence of a pause at the position before the accent phrase located M1 phrases ahead of the accent phrase of interest, and at the position after the accent phrase located M2 phrases behind it, may also be included as elements. M1 and M2 are determined in advance; for example, a preliminary experiment with training data is run beforehand, and the conditions that gave the best performance are adopted. The "accent phrase of interest" is the accent phrase that contains the word of interest within the emphasis occurrence range.

e) When the emphasis position prediction apparatus 100 has the category name dictionary storage unit 115, information representing the "category of a word", obtained by looking up the category name dictionary with a word corresponding to the emphasis occurrence range as the key, may be used as an element of the feature. For example, information representing all, or a partial combination, of the category names other than part of speech of the head word among the 2N+1 words of a) above may be used as an element of the feature. For example, information representing a fine-grained category to which the word belongs, such as organization name or company name, may be used. Such fine-grained categories can be assigned using, for example, Reference 5 (Nihongo Goi-Taikei (日本語語彙大系), supervised by NTT Communication Science Laboratories, edited by Ikehara et al., Iwanami Shoten, 1997). Category names also include, for example, distinctions such as loanword versus native Japanese word. Information representing word categories obtained by consulting an independently constructed category name dictionary may also be used as an element of the feature. For example, positive categories such as "bright" and "fun" and negative categories such as "dark" and "painful" can be defined.
Here the "main part" (主要部) means the head (主辞), that is, the word that receives a dependency (the word on the receiving side). When the 2N+1 words above contain more than one head, the feature element may be composed for only one of the head words or for several of them. When the 2N+1 words contain no head, information indicating that no head exists may be used as an element of the feature.
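The dependency-based elements c-1) to c-4) above can be sketched as follows. The sketch assumes a word-level dependency representation in which `heads[i]` is the index of the word that word `i` depends on (or `None`); this representation, the function name, and the dictionary keys are illustrative assumptions, since dependencies are normally obtained per bunsetsu. Returning 0 for the sentence-final and sentence-initial words is also an assumption, because the text defines the 0 value only for the other words.

```python
from typing import Dict, List, Optional


def dependency_features(heads: List[Optional[int]], i: int) -> Dict[str, int]:
    """Compute c-1) to c-4) for the word of interest at index i."""
    head = heads[i]
    # c-1) distance to the sentence-end-side word that word i depends on.
    c1 = head - i if head is not None and head > i else 0
    # Distances to sentence-start-side words that depend on word i.
    distances = [i - j for j, h in enumerate(heads) if h == i and j < i]
    return {
        "c1": c1,
        "c2": min(distances, default=0),  # nearest dependent
        "c3": max(distances, default=0),  # farthest dependent
        "c4": len(distances),             # number of dependents
    }


# Hypothetical 4-word sentence: words 0 and 1 depend on word 2,
# word 2 depends on word 3, and word 3 depends on nothing.
heads = [2, 2, 3, None]
features_word2 = dependency_features(heads, 2)
```

Swapping the direction tests would give the sentence-start and sentence-end variants mentioned in the notes to c).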

FIG. 3 shows an example in which a sequence of several elements is used as the feature. In the example of FIG. 3, each accent phrase is an emphasis occurrence range, and the elements of the feature are: information representing the surface form of the head word of the accent phrase (column 2); information representing the part of speech of that head word (column 3); information representing whether there is a pause before the accent phrase (column 4); information representing whether there is a pause after the accent phrase (column 5); information representing the minimum distance from the head word to the preceding words that depend on it (column 6); and information representing the category name, other than part of speech, of the head word (column 7). For example, the feature for accent phrase number 2 is a vector whose elements are: as a), information representing the word "soft" (ソフト) in column 2 of FIG. 3 and the words before and after it, "clean" (クリーン) and "warm air" (温風); as b), information representing "adjective" in column 3 and, before and after it, "adjective" and "noun"; as c), information representing "1" in column 6; as d), information representing "absent" and "absent" in columns 4 and 5; and as e), information representing "loanword" in column 7.
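The feature for accent phrase number 2 can be assembled as in the minimal sketch below, with N = 1. Only the values stated in the text are used; the function names, the padding symbol for positions outside the sentence, the ordering of the elements, and the string 有 for "pause present" are illustrative assumptions.

```python
from typing import List, Sequence

PAD = "<none>"  # hypothetical placeholder for positions outside the sentence


def window(items: Sequence[str], index: int, n: int) -> List[str]:
    """Return the item at `index` with n neighbours on each side (2n+1 items)."""
    return [
        items[j] if 0 <= j < len(items) else PAD
        for j in range(index - n, index + n + 1)
    ]


def pause_label(has_pause: bool) -> str:
    """Encode pause presence; 無 (absent) follows the text, 有 is assumed."""
    return "有" if has_pause else "無"


def build_feature(
    surfaces: Sequence[str],
    pos_tags: Sequence[str],
    index: int,
    n: int,
    min_dep_distance: int,
    pause_before: bool,
    pause_after: bool,
    category: str,
) -> List[str]:
    """Concatenate elements a), b), c), d), e) into one feature sequence."""
    return (
        window(surfaces, index, n)          # a) surface forms
        + window(pos_tags, index, n)        # b) parts of speech
        + [str(min_dep_distance)]           # c) minimum dependency distance
        + [pause_label(pause_before), pause_label(pause_after)]  # d) pauses
        + [category]                        # e) category name
    )


surfaces = ["クリーン", "ソフト", "温風"]   # clean, soft, warm air
pos_tags = ["形容詞", "形容詞", "名詞"]     # adjective, adjective, noun
feature = build_feature(surfaces, pos_tags, 1, 1, 1, False, False, "外来語")
```

A model that needs numeric input would map each of these symbolic elements to an index or one-hot vector afterwards.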

The feature output from the feature construction unit 112 is input to the emphasis position prediction unit 113. The emphasis position prediction unit 113 identifies the emphasis positions of the input text using the emphasis position prediction model stored in the emphasis position prediction model storage unit 114 and the feature obtained by the feature construction unit 112 (step S114). The emphasis position prediction model represents the relationship between a feature composed from at least one of the morphological analysis result and the dependency analysis result of a text and the emphasis positions of that text. In other words, it is a model that represents the relationship between the feature and whether an emphasis occurrence range (section) of the text is an emphasis position, and the emphasis position prediction unit 113 uses the feature obtained by the feature construction unit 112 to identify whether each emphasis occurrence range of the input text is an emphasis position.

Details are illustrated below. Let $x_i$ be the feature obtained for each emphasis occurrence range $i = 0, \ldots, I-1$, and let $y_i$ be identification information indicating whether emphasis occurrence range $i$ is an emphasis position (that is, whether an emphasis mark is assigned). Here $I$ is an integer greater than or equal to 1; for example, $I$ is the number of all emphasis occurrence ranges in the input text. The feature $x_i$ is a vector or the like. An example of the identification information $y_i$ is binary information with $y_i = 1$ when range $i$ is an emphasis position and $y_i = 0$ when it is not. Let $x = (x_0, \ldots, x_{I-1})$ be the sequence of features and $y = (y_0, \ldots, y_{I-1})$ the sequence of identification information. The emphasis position prediction model then associates the feature sequence $x$ of the input text with the identification information sequence $y$. For example, it is a model with the feature sequence $x$ as the input variable and the identification information sequence $y$ as the output variable.

The emphasis position prediction model is not limited to any particular form, but a specific example is a probabilistic model such as a hidden Markov model used for sequence labeling. The model can be built, for example, by applying a general machine learning method to training data consisting of pairs of a feature composed from at least one of the morphological analysis result and the dependency analysis result of a training text and identification information indicating whether each emphasis occurrence range of that text is an emphasis position. The composition of the features used to train the model is the same as that of the features obtained by the feature construction unit 112. The identification information in the training data may be assigned by hand, or it may be extracted automatically by prior art such as Non-Patent Document 1. Details of machine learning methods are described in, for example, Reference 6 (Hiroya Takamura et al., "Introduction to Machine Learning for Natural Language Processing" (言語処理のための機械学習入門), Corona Publishing). The emphasis position prediction model is, for example, a model that takes the emphasis occurrence range as the processing unit, receives as input a sequence of features composed from the features of that unit and of the units before and after it, and predicts the sequence of identification information indicating whether an emphasis mark is present in each unit. An emphasis position prediction model built as a hidden Markov model is illustrated below.

$$P(x, y) = \prod_{i=0}^{I-1} P(y_i \mid x_i)\, P(x_i \mid x_{i-1})$$

Here $P(x, y)$ is the joint probability of $x = (x_0, \ldots, x_{I-1})$ and $y = (y_0, \ldots, y_{I-1})$, $P(y_i \mid x_i)$ is the conditional probability of $y_i$ given $x_i$, and $P(x_i \mid x_{i-1})$ is the conditional probability of $x_i$ given $x_{i-1}$. $x_{-1}$ is a vector with constant elements: the feature obtained for a notional emphasis prediction range, the sentence start position, that is assumed to lie before the first word of the sentence.

The emphasized position prediction unit 113 reads the emphasized position prediction model from the emphasized position prediction model storage unit 114, sets the feature amount sequence x output from the feature amount configuration unit 112 as the input variable of the model, predicts the identification information sequence y, and outputs that sequence y as the emphasized position prediction result. For example, when the emphasized position prediction model is a hidden Markov model, the identification information sequence y that maximizes P(x, y) for the input feature amount sequence x is output. Such a search for the identification information sequence y can be performed using, for example, the well-known Viterbi algorithm. With the emphasis occurrence range as the processing unit, the identification information sequence y, which indicates the presence or absence of an emphasis mark in each processing unit, may be obtained as follows. The emphasized position prediction model receives a sequence of feature amounts, each composed of the feature amounts corresponding to a unit and to the units before and after it. Using this model, all possibilities of assigning or not assigning an emphasis mark to each unit from the beginning to the end of the sentence are enumerated, and the sequence to which the model gives the maximum probability globally, over the whole sentence, is selected by the Viterbi algorithm. Alternatively, instead of the emphasis occurrence range, the identification information sequence y may be obtained using the word of interest, a phrase (bunsetsu), or an accent phrase as the processing unit. That is, the processing unit for obtaining the feature amount and the processing unit for obtaining the identification information may be the same or different. Alternatively, the search may be narrowed to identification information sequences that have a high probability over the several units before and after the processing unit. It is also possible to predict the emphasized position only for the range involved in composing the feature amount, so that the prediction itself is performed locally. These search methods are well known and can be carried out, for example, by the method described in Reference 6. Further, the emphasized position prediction model is not limited to a hidden Markov model. The emphasized position prediction model may be configured by any model that can assign scores or probability values; for example, it may be a decision tree, a log-linear model, or a neural network.
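The global search described above can be illustrated with a minimal sketch. It assumes a generic HMM-style scoring in which each processing unit has a per-label log score given its features, plus label-to-label transition log scores; this factorization, the function names `viterbi` and `brute_force`, and the label coding (0 = no emphasis mark, 1 = emphasis mark) are illustrative assumptions and are not taken from the specification. The brute-force function enumerates every assignment of emphasis marks, and the Viterbi function finds the same maximizing sequence by dynamic programming.

```python
import itertools
import math


def sequence_score(path, emission_logp, transition_logp, initial_logp):
    """Total log score of one label sequence (assumed HMM-style factorization)."""
    total = initial_logp[path[0]] + emission_logp[0][path[0]]
    for i in range(1, len(path)):
        total += transition_logp[path[i - 1]][path[i]] + emission_logp[i][path[i]]
    return total


def brute_force(emission_logp, transition_logp, initial_logp):
    """Enumerate every labeling of every unit and return the best-scoring one."""
    n = len(emission_logp)
    num_labels = len(initial_logp)
    candidates = itertools.product(range(num_labels), repeat=n)
    best = max(
        candidates,
        key=lambda p: sequence_score(p, emission_logp, transition_logp, initial_logp),
    )
    return list(best)


def viterbi(emission_logp, transition_logp, initial_logp):
    """Return the globally best label sequence without enumerating all sequences.

    emission_logp[i][k]:   log score of label k at unit i given its features
    transition_logp[j][k]: log score of label j being followed by label k
    initial_logp[k]:       log score of label k at the first unit
    """
    n = len(emission_logp)
    if n == 0:
        return []
    num_labels = len(initial_logp)
    score = [[-math.inf] * num_labels for _ in range(n)]
    back = [[0] * num_labels for _ in range(n)]
    for k in range(num_labels):
        score[0][k] = initial_logp[k] + emission_logp[0][k]
    for i in range(1, n):
        for k in range(num_labels):
            # Best previous label for reaching label k at unit i.
            best_j = max(
                range(num_labels),
                key=lambda j: score[i - 1][j] + transition_logp[j][k],
            )
            score[i][k] = (
                score[i - 1][best_j] + transition_logp[best_j][k] + emission_logp[i][k]
            )
            back[i][k] = best_j
    # Trace back from the best final label to recover the whole sequence.
    last = max(range(num_labels), key=lambda k: score[n - 1][k])
    path = [last]
    for i in range(n - 1, 0, -1):
        last = back[i][last]
        path.append(last)
    path.reverse()
    return path
```

Enumeration grows exponentially with the number of units, whereas the dynamic-programming search grows linearly with it, which is why the latter is the practical choice for whole-sentence decoding.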

As described above, in the present embodiment, the emphasized position of the input text can be identified using an emphasized position prediction model that represents the relationship between the feature amount corresponding to the linguistic information of a text and the emphasized position of that text. Further, the emphasized positions predicted with the emphasized position prediction model of the present embodiment have properties consistent with the model of Non-Patent Document 1. As a result, natural speech in which the pitch of the voice is reproduced with high accuracy can be synthesized.

The present invention is not limited to the embodiment described above. For example, the emphasized position prediction apparatus may include means for performing morphological analysis and dependency analysis of the input text, or it may include an analysis result storage unit that stores morphological analysis results and dependency analysis results. The various processes described above need not be executed only in time series as described; they may also be executed in parallel or individually, depending on the processing capacity of the apparatus that executes them or as required. Needless to say, other modifications are possible as appropriate without departing from the spirit of the present invention.

When the above configuration is realized by a computer, the processing contents of the functions that each apparatus should have are described by a program. By executing this program on a computer, the above processing functions are realized on the computer. The program describing these processing contents can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such recording media include magnetic recording devices, optical discs, magneto-optical recording media, and semiconductor memories.

This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program, for example, first temporarily stores the program recorded on the portable recording medium, or the program transferred from the server computer, in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to the read program. As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program sequentially, each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized only through execution instructions and result acquisition.

In the above embodiment, the processing functions of the apparatus are realized by executing a predetermined program on a computer; however, at least some of these processing functions may be realized by hardware.

100 Emphasized position prediction apparatus

Claims

An emphasized position prediction apparatus comprising:
an emphasized position prediction model storage unit that stores an emphasized position prediction model representing a relationship between a feature amount, composed of at least one of a morphological analysis result and a dependency analysis result of a text, and an emphasized position of the text;
a feature amount configuration unit that obtains a feature amount composed of at least one of a morphological analysis result and a dependency analysis result of an input text; and
an emphasized position prediction unit that identifies an emphasized position of the input text using the emphasized position prediction model and the feature amount obtained by the feature amount configuration unit,
wherein the feature amount includes information representing the appearance form of a word of interest and information representing the part of speech of the word of interest, or includes information representing the respective appearance forms, and information representing the respective parts of speech, of the word of interest and a predetermined number of words before and after the word of interest.

The emphasized position prediction apparatus according to claim 1,
wherein the feature amount includes information representing the minimum value of the distance from the word of interest to a preceding word that has a dependency relation with the word of interest.

The emphasized position prediction apparatus according to claim 1 or 2,
wherein the feature amount includes information representing a word category.

The emphasized position prediction apparatus according to any one of claims 1 to 3,
wherein the emphasized position prediction model represents a relationship between the feature amount and whether each section constituting the text is an emphasized position, and
the emphasized position prediction unit identifies whether each section constituting the input text is an emphasized position.

An emphasized position prediction method, wherein an emphasized position prediction model representing a relationship between a feature amount, composed of at least one of a morphological analysis result and a dependency analysis result of a text, and an emphasized position of the text is stored in an emphasized position prediction model storage unit, the method executing:
a feature amount configuration step of obtaining a feature amount composed of at least one of a morphological analysis result and a dependency analysis result of an input text; and
an emphasized position prediction step of identifying an emphasized position of the input text using the emphasized position prediction model and the feature amount obtained in the feature amount configuration step,
wherein the feature amount includes information representing the appearance form of a word of interest and information representing the part of speech of the word of interest, or includes information representing the respective appearance forms, and information representing the respective parts of speech, of the word of interest and a predetermined number of words before and after the word of interest.

The emphasized position prediction method according to claim 5,
wherein the feature amount includes information representing the minimum value of the distance from the word of interest to a preceding word that has a dependency relation with the word of interest.

The emphasized position prediction method according to claim 5 or 6,
wherein the feature amount includes information representing a word category.

The emphasized position prediction method according to any one of claims 5 to 7,
wherein the emphasized position prediction model represents a relationship between the feature amount and whether each section constituting the text is an emphasized position, and
the emphasized position prediction step identifies whether each section constituting the input text is an emphasized position.

A program for causing a computer to function as the emphasized position prediction apparatus according to any one of claims 1 to 4.