JP6370749B2

JP6370749B2 - Utterance intention model learning device, utterance intention extraction device, utterance intention model learning method, utterance intention extraction method, program

Info

Publication number: JP6370749B2
Application number: JP2015151648A
Authority: JP
Inventors: 厚志安藤; 太一浅見
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-07-31
Filing date: 2015-07-31
Publication date: 2018-08-08
Anticipated expiration: 2035-07-31
Also published as: JP2017032738A

Description

The present invention relates to an utterance intention model learning device and an utterance intention model learning method for learning a model used to extract an utterance intention from an utterance, to an utterance intention extraction device and an utterance intention extraction method for extracting an utterance intention from an utterance, and to a program.

In spoken dialogue systems and minutes creation support systems, there is a need for technology that extracts not only the speech recognition result of an utterance but also its utterance intention (for example, positive or negative). A spoken dialogue system is expected to act according to the utterance intention. For example, for an utterance such as "Ashita nee..." ("Tomorrow, hmm..."), which looks like a mere backchannel from the text alone, the system may read an utterance intention such as "negative" and offer a different proposal. By extracting the utterance intention, a spoken dialogue system can generate appropriate responses even to user requests that are not expressed in words.

In a minutes creation support system, on the other hand, it becomes possible to automatically extract important utterances in a meeting, such as utterances of agreement and disagreement, which helps in grasping the overall picture of the meeting and in generating summaries of the minutes.

A conventional technique for such utterance intention extraction is disclosed in Non-Patent Document 1. In Non-Patent Document 1, a section that contains at least one word and in which consecutive words are separated by no more than a fixed time interval (for example, 0.5 seconds or less) is defined as an utterance section, and the speech in an utterance section is defined as an utterance. Each utterance is assumed to carry one utterance intention. Non-Patent Document 1 extracts the utterance intention of each utterance by using the relationship between the utterance intention and the prosodic information (voice pitch, use of pauses, and so on) and linguistic information (words and parts of speech contained in the utterance) that appear in the utterance. The relationship between prosodic features, linguistic features, and utterance intentions is learned in advance from training data consisting of pairs of an utterance and its correct utterance intention.

An outline of the utterance intention extraction device of Non-Patent Document 1 is described below with reference to FIGS. 1 to 3. FIG. 1 is a block diagram showing the configuration of the utterance intention extraction device 9 of Non-Patent Document 1. FIG. 2 is a flowchart showing the operation of the utterance intention extraction device 9 of Non-Patent Document 1. FIG. 3 is a diagram showing an example of utterance intention extraction by the utterance intention extraction device 9 of Non-Patent Document 1. As shown in FIG. 1, the utterance intention extraction device 9 of Non-Patent Document 1 includes a prosody extraction unit 901, a recognition result analysis unit 902, a prosody normalization unit 903, a prosodic feature extraction unit 904, a linguistic feature extraction unit 905, an utterance intention model learning unit 908, and an utterance intention extraction unit 909.

The prosody extraction unit 901 extracts the prosody (fundamental frequency for each short time, sound pressure level for each short time) from the utterance input as the target of utterance intention extraction (S901). The recognition result analysis unit 902 analyzes the recognition result and obtains the words and phonemes contained in it together with their start and end times (S902). The prosody normalization unit 903 normalizes the extracted prosody (fundamental frequency for each short time, sound pressure level for each short time) for each speaker (S903). The prosodic feature extraction unit 904 extracts prosodic features (statistics such as the mean and slope of voice pitch and pause length) for each utterance (S904). The linguistic feature extraction unit 905 extracts linguistic features (such as the words and parts of speech at the beginning of the utterance) for each utterance (S905). The utterance intention model learning unit 908 learns the utterance intention model in advance, using as training data the prosodic and linguistic features of each utterance and the corresponding manually assigned correct utterance intention labels (S908). The utterance intention extraction unit 909 uses the learned utterance intention model to extract the utterance intention of each utterance from its prosodic and linguistic features (S909).

In FIG. 3, for the example utterance "Watashi mo sou omoimasu" ("I think so too"), the utterance intention "positive" is extracted in step S909 because the average voice pitch is high (prosodic feature) and the first two words of the utterance are "watashi" and "mo" (linguistic feature).

D. Hillard, M. Ostendorf, E. Shriberg, "Detection of agreement vs. disagreement in meetings: training with unlabeled data," Proc. of the HLT-NAACL Conference, May 2003

An utterance intention may be expressed in only part of an utterance. Because the utterance intention extraction device 9 of Non-Patent Document 1 computes prosodic features over the entire utterance section, it cannot represent prosodic changes that appear in only part of the utterance, and in some cases it cannot extract the utterance intention correctly. FIG. 4 shows an example. FIG. 4 is a diagram showing an example of changes in prosodic features when the utterance intention is expressed in only part of the utterance. It is known that the average voice pitch is higher in a section where a "positive" utterance intention appears. When, as in the example of FIG. 4, the utterance intention "positive" appears in only part of the utterance (the dot-hatched region), computing the average voice pitch from only that section yields the feature characteristic of a positive intention (a high average), whereas computing the average over the entire utterance section may not yield it (the average is low). For this reason, the utterance intention extraction device 9 of Non-Patent Document 1 sometimes fails to extract the utterance intention correctly.
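The averaging problem described above can be illustrated with a small numeric sketch. The pitch values and section lengths below are invented for illustration only and are not taken from FIG. 4; they simply show how a locally high pitch can disappear in a whole-utterance mean.

```python
from statistics import mean

# Hypothetical normalized pitch values, one per analysis frame (illustrative numbers only).
neutral_part = [-0.6] * 80    # frames where no particular intention is expressed
positive_part = [1.2] * 20    # frames where the "positive" intention is expressed
utterance = neutral_part + positive_part

whole_mean = mean(utterance)        # averaged over the entire utterance section
partial_mean = mean(positive_part)  # averaged over the marked partial section only

print(f"whole utterance: {whole_mean:.2f}, partial section: {partial_mean:.2f}")
```

With these made-up numbers the partial-section mean is clearly high while the whole-utterance mean is negative, which is the situation in which a feature computed over the entire utterance hides the cue.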

Accordingly, an object of the present invention is to provide an utterance intention model learning device that learns a model for correctly extracting the utterance intention even when the utterance intention is expressed in only part of an utterance.

In one aspect of the present invention, an utterance section is a section that contains at least one word and in which consecutive words are separated by no more than a fixed time interval, and an utterance is the speech in an utterance section. Under these definitions, an utterance intention N-gram model, which is an N-gram model used to extract the utterance intention of each utterance, is learned using the following as training data: a per-partial-section utterance intention index sequence, which is a sequence of indices corresponding to the utterance intentions extracted for each partial section contained in the utterance, and an utterance intention label manually assigned to each utterance.

According to the present invention, a model for correctly extracting the utterance intention can be learned even when the utterance intention is expressed in only part of an utterance.

FIG. 1 is a block diagram showing the configuration of the utterance intention extraction device 9 of Non-Patent Document 1.
FIG. 2 is a flowchart showing the operation of the utterance intention extraction device 9 of Non-Patent Document 1.
FIG. 3 is a diagram showing an example of utterance intention extraction by the utterance intention extraction device 9 of Non-Patent Document 1.
FIG. 4 is a diagram showing an example of changes in prosodic features when the utterance intention is expressed in only part of an utterance.
FIG. 5 is a diagram showing a list of the local prosodic features used in Embodiment 1.
FIG. 6 is a block diagram showing the configuration of the utterance intention extraction device 1 of Embodiment 1.
FIG. 7 is a flowchart showing the operation of the utterance intention extraction device 1 of Embodiment 1.
FIG. 8 is a block diagram showing the detailed configuration of the local prosodic feature extraction unit of Embodiment 1.
FIG. 9 is a flowchart showing the detailed operation of the local prosodic feature extraction unit of Embodiment 1.
FIG. 10 is a diagram showing an example of the extraction of local prosodic sequence features.
FIG. 11 is a diagram showing an example in which utterance intention labels are manually assigned to each accent phrase.
FIG. 12 is a diagram showing an example in which the utterance intention model is learned as a decision tree.
FIG. 13 is a diagram showing an example of utterance intention extraction by the utterance intention extraction device 9 of Non-Patent Document 1 from an utterance in which features of two or more utterance intentions appear.
FIG. 14 is a diagram showing an example of utterance intention extraction by the utterance intention extraction device 1 of Embodiment 1 from an utterance in which features of two or more utterance intentions appear.
FIG. 15 is a diagram showing an example of per-utterance utterance intention extraction based on the utterance intention N-gram model (without probability vectors).
FIG. 16 is a diagram showing an example of per-utterance utterance intention extraction based on the utterance intention N-gram model (with probability vectors).
FIG. 17 is a block diagram showing the configuration of the utterance intention extraction device 2 of Embodiment 2.
FIG. 18 is a flowchart showing the operation of the utterance intention extraction device 2 of Embodiment 2.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numerals, and redundant description is omitted.

In the following description, a part of an utterance is called a partial section. Examples of partial sections include words, accent phrases, and intonation phrases.

<Key points of the invention of Embodiment 1>
Analysis of speech in which the utterance intention is contained in only part of the utterance showed that prosodic changes extend over sections longer than a word. In particular, differences were found in the slope of rises and falls in voice pitch within an accent phrase and in the timing at which a rise or fall begins. The present invention therefore focuses on prosodic changes at the level of accent phrases. Simply changing the section over which the conventional technique extracts prosodic features from the entire utterance to each accent phrase would yield only the mean and slope over the whole accent phrase section. Local prosodic changes, such as the slope of a rise or fall in voice pitch and the timing at which it begins, could not be expressed as features. To address this problem, the present invention represents local prosodic changes by computing prosodic features for each word section of the utterance, and represents local prosodic changes within an accent phrase by using, for utterance intention extraction, features obtained by concatenating these per-word features for each accent phrase section. Below, the prosodic features of each word section of an utterance are called local prosodic features, and the features obtained by concatenating the local prosodic features for each accent phrase section are called local prosodic sequence features.

<Detailed description of Embodiment 1>
The utterance intention extraction device of Embodiment 1, which extracts utterance intentions based on local prosodic sequence features, is described below. The utterance intention extraction device of this embodiment takes as input an utterance and the speech recognition result for each utterance. The definition of an utterance is the same as in Non-Patent Document 1 described above. The configuration and operation of the utterance intention extraction device of this embodiment are described with reference to FIGS. 5, 6, and 7. FIG. 5 is a diagram showing a list of the local prosodic features used in this embodiment. FIG. 6 is a block diagram showing the configuration of the utterance intention extraction device 1 of this embodiment. FIG. 7 is a flowchart showing the operation of the utterance intention extraction device 1 of this embodiment. As shown in FIG. 6, the utterance intention extraction device 1 of this embodiment includes a prosody extraction unit 901, a recognition result analysis unit 902, a prosody normalization unit 903, a local prosodic feature extraction unit 104, an accent phrase boundary estimation unit 105, a local prosodic sequence feature extraction unit 106, a per-accent-phrase utterance intention label creation unit 107, an utterance intention model learning unit 108, and an utterance intention extraction unit 109.

<Prosody extraction unit 901>
Input: utterance (the utterance input as the target of utterance intention extraction)
Output: fundamental frequency for each short time, sound pressure level for each short time
The prosody extraction unit 901 obtains physical quantities for voice pitch and voice loudness from the utterance. The fundamental frequency can be used as the physical quantity representing voice pitch, and the sound pressure level as the physical quantity representing voice loudness. The prosody extraction unit 901 obtains these physical quantities (fundamental frequency, sound pressure level) at short intervals. That is, the prosody extraction unit 901 analyzes the utterance at a predetermined short interval (for example, every 10 ms) and extracts the fundamental frequency and sound pressure level for each short time (S901). In this embodiment, the prosody extraction unit 901 obtains the fundamental frequency by the autocorrelation method and the sound pressure level as the logarithm of the root mean square of the amplitude. The extraction methods are not limited to these, however, and any conventional fundamental frequency extraction method or sound pressure level extraction method may be used.
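As a rough illustration of step S901, the following sketch estimates a fundamental frequency by picking the autocorrelation peak and computes the logarithm of the RMS amplitude, advancing 10 ms per frame. The 40 ms analysis window, the 60 to 400 Hz search range, the natural logarithm, and all function names are assumptions made for this example; the source specifies only the 10 ms interval, the autocorrelation method, and the log-RMS definition.

```python
import math

def frame_signal(samples, sample_rate, frame_ms=10.0, window_ms=40.0):
    """Yield analysis windows, advancing by frame_ms (window_ms is an assumed value)."""
    hop = int(sample_rate * frame_ms / 1000)
    win = int(sample_rate * window_ms / 1000)
    for start in range(0, len(samples) - win + 1, hop):
        yield samples[start:start + win]

def sound_pressure_level(frame, eps=1e-12):
    """Logarithm of the root mean square amplitude of one frame."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return math.log(rms + eps)

def f0_autocorrelation(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate F0 as sample_rate / lag, where lag maximizes the autocorrelation."""
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag if best_lag else 0.0  # 0.0: no periodicity found

# Usage on a synthetic 200 Hz tone (0.1 s at 8 kHz, amplitude 0.5).
sr = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(800)]
track = [(f0_autocorrelation(f, sr), sound_pressure_level(f))
         for f in frame_signal(tone, sr)]
```

A practical extractor would also need a voicing decision and protection against octave errors, which this sketch omits.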

<Recognition result analysis unit 902>
Input: utterance, speech recognition result for each utterance
Output: word sequence, start and end times of each word, phoneme sequence, start and end times of each phoneme
The recognition result analysis unit 902 obtains the words and phonemes contained in the recognition result together with their start and end times (S902). For example, the word sequence can be obtained by morphological analysis of the speech recognition result for each utterance. The start and end times of the words, the phoneme sequence, and the start and end times of the phonemes can be obtained by creating, from the speech recognition result for each utterance, a network grammar that accepts only that recognition result, and then performing word segmentation or phoneme segmentation (see Reference Non-Patent Document 1). However, if the input speech recognition result for each utterance already contains the word sequence, the start and end times of the words, the phoneme sequence, and the start and end times of the phonemes, the values in the speech recognition result may be used.
(Reference Non-Patent Document 1: Kiyohiro Shikano, Tatsuya Kawahara, Mikio Yamamoto, Katsunobu Ito, Kazuya Takeda, IT Text Speech Recognition System, pp. 47-49, 169-170, Ohmsha, 2001)

<Prosody normalization unit 903>
Input: fundamental frequency for each short time, sound pressure level for each short time, mean and standard deviation of the fundamental frequency for each speaker, mean and standard deviation of the sound pressure level for each speaker
Output: normalized fundamental frequency for each short time, normalized sound pressure level for each short time
The prosody normalization unit 903 normalizes the fundamental frequency for each short time and the sound pressure level for each short time for each speaker so that they have mean 0 and standard deviation 1 (S903). This corresponds to absorbing differences in voice pitch and voice loudness between speakers. The prosody normalization unit 903 allows the utterance intention extraction unit 109 to apply the same utterance intention extraction criteria to any speaker.

At a given time t, the normalized fundamental frequency \bar{f}(t) and the normalized sound pressure level \bar{P}(t) for each short time are given by the following equations.

\bar{f}(t) = \frac{f_m(t) - \mu_{f,m}}{\sigma_{f,m}}, \qquad \bar{P}(t) = \frac{P_m(t) - \mu_{P,m}}{\sigma_{P,m}}

Here f_m(t) and P_m(t) are the fundamental frequency and the sound pressure level of speaker m for each short time, and \mu_{f,m}, \sigma_{f,m}, \mu_{P,m}, \sigma_{P,m} are, respectively, the mean and standard deviation of the fundamental frequency and the mean and standard deviation of the sound pressure level over all utterances of speaker m. \mu_{f,m}, \sigma_{f,m}, \mu_{P,m}, and \sigma_{P,m} are calculated from all utterances of speaker m collected in advance.
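The per-speaker normalization to mean 0 and standard deviation 1 can be sketched as follows. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which estimator is used. For brevity the example computes the statistics from the same track it normalizes, whereas the source computes them from utterances collected in advance.

```python
from statistics import mean, pstdev

def speaker_stats(values):
    """Mean and (population) standard deviation over a speaker's collected frames."""
    return mean(values), pstdev(values)

def normalize(track, mu, sigma):
    """Shift and scale a frame-level track to zero mean and unit standard deviation."""
    if sigma == 0:
        raise ValueError("standard deviation is zero; cannot normalize")
    return [(x - mu) / sigma for x in track]

# Usage: hypothetical F0 values in Hz for one speaker.
f0_track = [100.0, 120.0, 140.0, 160.0]
mu, sigma = speaker_stats(f0_track)
normalized = normalize(f0_track, mu, sigma)
```

The same two functions apply unchanged to the sound pressure level track.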

<Local prosodic feature extraction unit 104>
Input: normalized fundamental frequency for each short time, normalized sound pressure level for each short time, word sequence, start and end times of each word, phoneme sequence, start and end times of each phoneme
Output: local prosodic features (all elements in the rightmost column of FIG. 5)

The local prosodic feature extraction unit 104 obtains local prosodic features for each word contained in the recognition result (S104). Local prosodic features represent the local prosodic changes that accompany the expression of an utterance intention. They represent features relating to the voice pitch of each word section of the utterance, the voice loudness of each word section, the use of pauses (before the next word or after the previous word), the speech rate of each word section, and the way sounds are lengthened in each word section. It is sufficient for the local prosodic features to represent at least one of these. In this embodiment, the local prosodic features include all elements in the rightmost column of FIG. 5. The detailed configuration and operation of the local prosodic feature extraction unit 104 are described below with reference to FIGS. 8 and 9. FIG. 8 is a block diagram showing the detailed configuration of the local prosodic feature extraction unit 104 of this embodiment. FIG. 9 is a flowchart showing the detailed operation of the local prosodic feature extraction unit 104 of this embodiment. As shown in FIG. 8, the local prosodic feature extraction unit 104 of this embodiment includes an F0 local prosodic feature extraction unit 1041, a power local prosodic feature extraction unit 1042, a pause local prosodic feature extraction unit 1043, a speech rate local prosodic feature extraction unit 1044, and a duration local prosodic feature extraction unit 1045.

<F0 local prosodic feature extraction unit 1041>
Input: normalized fundamental frequency for each short time, start and end times of each word
Output: mean, standard deviation, maximum, minimum, and slope of the fundamental frequency for the first half and the second half of each word
The F0 local prosodic feature extraction unit 1041 extracts local prosodic features relating to voice pitch (S1041). These features comprise the mean, standard deviation, maximum, minimum, and slope of the fundamental frequency for the first half and the second half of each word.

Based on the start and end times of each word, the F0 local prosodic feature extraction unit 1041 cuts out the fundamental frequency sequences for the first half and the second half of the word from the normalized fundamental frequency for each short time. All features other than the slope are obtained as statistics of these sequences, and the slope is obtained by regression analysis of them. Because the normalized fundamental frequency for each short time takes accurate values only in vowel sections, only the normalized fundamental frequency within vowel sections is used. In this embodiment, vowel sections estimated by phoneme alignment are used, but vowel sections obtained by another vowel section estimation method may be used instead.
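The per-word F0 features can be sketched as below. The split of a word into halves at its temporal midpoint, the least-squares line as the regression, the population standard deviation, and all names and data layouts are assumptions for this example; the source states only that statistics and a regression-based slope are computed for each half, using vowel frames only.

```python
from statistics import mean, pstdev

def slope(times, values):
    """Least-squares regression slope of values against times."""
    t_bar, v_bar = mean(times), mean(values)
    denom = sum((t - t_bar) ** 2 for t in times)
    if denom == 0:
        return 0.0  # a single frame (or identical times) has no defined slope
    return sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values)) / denom

def half_stats(frames):
    """Statistics for one half of a word; frames is a list of (time, value) pairs."""
    if not frames:
        return None  # no vowel frame fell into this half
    times, values = zip(*frames)
    return {"mean": mean(values), "std": pstdev(values), "max": max(values),
            "min": min(values), "slope": slope(times, values)}

def f0_local_features(track, word_start, word_end, vowel_spans):
    """track: list of (time, normalized_f0); vowel_spans: list of (start, end)."""
    mid = (word_start + word_end) / 2  # assumed definition of first/second half
    def in_vowel(t):
        return any(s <= t < e for s, e in vowel_spans)
    voiced = [(t, v) for t, v in track
              if word_start <= t < word_end and in_vowel(t)]
    return {"first_half": half_stats([(t, v) for t, v in voiced if t < mid]),
            "second_half": half_stats([(t, v) for t, v in voiced if t >= mid])}

# Usage: a 0.1 s word with 10 ms frames and a vowel starting at 0.02 s.
track = [(i / 100, float(i)) for i in range(10)]
features = f0_local_features(track, 0.0, 0.1, vowel_spans=[(0.02, 0.1)])
```

Passing a single span covering the whole word as `vowel_spans` gives the behavior the source describes for the sound pressure level, where non-vowel frames are also used.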

<Power local prosodic feature extraction unit 1042>
Input: normalized sound pressure level for each short time, start and end times of each word
Output: mean, standard deviation, maximum, minimum, and slope of the sound pressure level for the first half and the second half of each word
The power local prosodic feature extraction unit 1042 extracts local prosodic features relating to voice loudness (S1042). These features comprise the mean, standard deviation, maximum, minimum, and slope of the sound pressure level for the first half and the second half of each word.

Like the F0 local prosodic feature extraction unit 1041, the power local prosodic feature extraction unit 1042 cuts out the sound pressure level sequences for the first half and the second half of each word from the sound pressure level for each short time, based on the start and end times of the word, and extracts the local prosodic features relating to voice loudness from statistics or from the result of regression analysis. Unlike the F0 local prosodic feature extraction unit 1041, however, the power local prosodic feature extraction unit 1042 uses all sound pressure level values in the first half and the second half of the word, including sections other than vowels.

<Pause local prosodic feature extraction unit 1043>
Input: word sequence, start and end times of each word
Output: length of the pause before the next word
The pause local prosodic feature extraction unit 1043 extracts local prosodic features relating to the use of pauses between words (S1043). These features comprise the length of the pause before the next word (or after the previous word). In this embodiment, the following two kinds of sections are defined as pauses: <1> the section from the end time of a word to the start time of the next word, and <2> the section of a punctuation mark or pause contained in the speech recognition result. The length of the pause before the next word is obtained for each word as (start time of the next word - end time of the word). When words are uttered without a break, the length of the pause before the next word is 0 seconds. Punctuation marks and pauses are regarded as pauses and are therefore not counted as words. For the last word of the utterance, the length of the pause before the next word is taken to be 0 seconds.
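The pause computation can be sketched as follows. The token layout `(surface, start, end)`, the set of pause symbols, and the function name are hypothetical; real recognizer output may mark punctuation and pauses differently.

```python
def pause_lengths(tokens, pause_symbols=frozenset({"、", "。", "<sp>"})):
    """Return (word, pause_to_next_word) pairs for one utterance.

    tokens: list of (surface, start, end) in time order. Tokens whose surface is
    in pause_symbols (an assumed set of markers) are treated as pauses, not words.
    """
    words = [tok for tok in tokens if tok[0] not in pause_symbols]
    result = []
    for i, (surface, _, end) in enumerate(words):
        if i + 1 < len(words):
            next_start = words[i + 1][1]
            pause = max(0.0, next_start - end)  # guard against overlapping times
        else:
            pause = 0.0  # utterance-final word: defined as 0 seconds
        result.append((surface, pause))
    return result

# Usage: a punctuation token between two words counts toward the pause.
tokens = [("明日", 0.0, 0.5), ("、", 0.5, 0.75), ("ね", 0.75, 1.0)]
pauses = pause_lengths(tokens)
```

Because punctuation and pause tokens are removed before the subtraction, their duration is included in the gap between the surrounding words, which covers both kinds of pause section defined above.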

<Speech rate local prosodic feature extraction unit 1044>
Input: phoneme sequence, start and end times of each word
Output: speech rate of each word
The speech rate local prosodic feature extraction unit 1044 extracts local prosodic features relating to speech rate (S1044). These features comprise the speech rate of each word. The speech rate is defined as the number of phonemes uttered per unit time and is obtained by calculating, for each word, number of phonemes / (end time of the word - start time of the word). The number of phonemes is the number of phonemes contained in the phoneme sequence of the word.
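The speech rate formula is simple enough to state directly in code. The function name, the phoneme segmentation of the example word, and the handling of a zero-length word are assumptions.

```python
def speech_rate(phonemes, word_start, word_end):
    """Phonemes per second for one word: count / (end time - start time)."""
    duration = word_end - word_start
    if duration <= 0:
        return 0.0  # assumed fallback for degenerate timings
    return len(phonemes) / duration

# Usage: a hypothetical six-phoneme word spoken in 0.5 s.
rate = speech_rate(["w", "a", "t", "a", "sh", "i"], 0.0, 0.5)
```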

＜デュレーション局所韻律特徴抽出部１０４５＞
入力：各音素の開始・終了時刻、各単語の開始・終了時刻
出力：単語ごとの音素継続長の平均値、標準偏差、最大値、最小値、単語末尾の音素の音素継続長
デュレーション局所韻律特徴抽出部１０４５は、音の伸ばし方に関する局所韻律特徴を抽出する（Ｓ１０４５）。音の伸ばし方に関する局所韻律特徴として、単語ごとの音素継続長の平均値、標準偏差、最大値、最小値、単語末尾の音素の音素継続長が含まれる。音素継続長は、音素の終了時刻−音素の開始時刻を音素ごとに計算することで得られる。デュレーション局所韻律特徴抽出部１０４５は、単語に含まれる全音素に対し音素継続長を求め、それらの値から単語ごとの音素継続長の平均値、標準偏差、最大値、最小値、単語末尾の音素の音素継続長を取得できる。 <Duration Local Prosodic Feature Extraction Unit 1045>
Input: start and end times of each phoneme, start and end times of each word
Output: for each word, the mean, standard deviation, maximum, and minimum of the phoneme durations, and the duration of the word-final phoneme
The duration local prosodic feature extraction unit 1045 extracts local prosodic features related to how sounds are lengthened (S1045). These features include, for each word, the mean, standard deviation, maximum, and minimum of the phoneme durations, and the duration of the word-final phoneme. The phoneme duration is obtained by computing (end time of the phoneme − start time of the phoneme) for each phoneme. The duration local prosodic feature extraction unit 1045 computes the duration of every phoneme in a word and derives from these values the per-word mean, standard deviation, maximum, and minimum of the phoneme durations and the duration of the word-final phoneme.
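
The per-word duration statistics can be sketched as below. The dictionary keys are illustrative names, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the source does not state which estimator is used.

```python
import statistics


def duration_features(phoneme_times):
    """Duration statistics for one word.

    phoneme_times: list of (start, end) pairs, in seconds, for the phonemes
    of the word in spoken order.
    """
    durations = [end - start for start, end in phoneme_times]
    return {
        "mean": statistics.mean(durations),
        "std": statistics.pstdev(durations),  # population std (assumption)
        "max": max(durations),
        "min": min(durations),
        "final": durations[-1],  # duration of the word-final phoneme
    }
```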

＜アクセント句境界推定部１０５＞
入力：単語系列
出力：アクセント句境界
アクセント句境界推定部１０５は、単語系列からアクセント句境界を推定する（Ｓ１０５）。ここで、アクセント句境界とは、あるアクセント句と別のアクセント句との境界地点を指し、アクセント句境界に挟まれた区間を一つのアクセント句区間とする。本実施例ではアクセント句境界推定手法に、参考非特許文献２の手法を用いるが、本発明はこれに限定されず、他のどのアクセント句境界推定手法を用いてもよい。
（参考非特許文献２：浅野、松岡、市井、大山、“テキスト音声変換における読み・韻律付与処理の評価:ニュース文を対象として、”第51回情報処理学会全国大会講演論文集、pp.109-100、1995） <Accent phrase boundary estimation unit 105>
Input: word sequence
Output: accent phrase boundaries
The accent phrase boundary estimation unit 105 estimates accent phrase boundaries from the word sequence (S105). Here, an accent phrase boundary is the boundary point between one accent phrase and another, and the interval enclosed by accent phrase boundaries is treated as one accent phrase interval. This embodiment uses the method of Reference Non-Patent Document 2 for accent phrase boundary estimation, but the present invention is not limited to it, and any other accent phrase boundary estimation method may be used.
(Reference Non-Patent Document 2: Asano, Matsuoka, Ichii, Oyama, “Evaluation of reading and prosody assignment processing in text-to-speech conversion: For news sentences,” Proceedings of the 51st National Convention of the Information Processing Society of Japan, pp.109-100, 1995)

＜局所韻律系列特徴抽出部１０６＞
入力：局所韻律特徴、アクセント句境界
出力：局所韻律系列特徴
局所韻律系列特徴抽出部１０６は、アクセント句区間に含まれる単語区間ごとの局所韻律特徴を連結し、アクセント句単位での局所韻律系列特徴を抽出する（Ｓ１０６）。局所韻律系列特徴はアクセント句区間にわたる韻律の局所的な変化を表現する。連結とは、単語ごとの局所韻律特徴ベクトル（局所韻律特徴のベクトル表現）を、アクセント句に含まれる単語数ｎだけ連結し、局所韻律系列特徴ベクトルを作成することを指す。このとき、ｎを連結数と呼ぶ。局所韻律系列特徴の抽出の例を図１０に示す。図１０の例では、アクセント句「そうですね」に含まれる３つの局所韻律特徴ベクトルが連結されて単語連結数３の局所韻律系列特徴ベクトルが生成される。これに対し、アクセント句「うーん」には１つの局所韻律特徴ベクトルのみが含まれるため、この局所韻律特徴ベクトルがそのまま単語連結数１の局所韻律系列特徴ベクトルとされる。一方、アクセント句「わたしですか」に含まれる３つの局所韻律特徴ベクトルは連結されて単語連結数３の局所韻律系列特徴ベクトルが生成される。 <Local Prosodic Sequence Feature Extraction Unit 106>
Input: local prosodic features, accent phrase boundaries
Output: local prosodic sequence features
The local prosodic sequence feature extraction unit 106 concatenates the local prosodic features of the word intervals contained in each accent phrase interval and thereby extracts a local prosodic sequence feature for each accent phrase (S106). A local prosodic sequence feature represents the local change in prosody across an accent phrase interval. Concatenation means joining the per-word local prosodic feature vectors (the vector representations of the local prosodic features) of the n words contained in the accent phrase to create a local prosodic sequence feature vector. Here, n is called the concatenation count. FIG. 10 shows an example of local prosodic sequence feature extraction. In the example of FIG. 10, the three local prosodic feature vectors contained in the accent phrase “そうですね” (“That's right”) are concatenated to generate a local prosodic sequence feature vector with a concatenation count of 3. The accent phrase “うーん” (“Hmm”), in contrast, contains only one local prosodic feature vector, so that vector is used as it is as a local prosodic sequence feature vector with a concatenation count of 1. The three local prosodic feature vectors contained in the accent phrase “わたしですか” (“Me?”) are likewise concatenated to generate a local prosodic sequence feature vector with a concatenation count of 3.
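
The concatenation step can be sketched as follows. It assumes, for illustration only, that accent phrase boundaries are given as the index of the first word of each accent phrase and that each word already has a fixed-length feature vector; neither representation is prescribed by the source.

```python
def sequence_features(word_vectors, phrase_starts):
    """Concatenate per-word feature vectors within each accent phrase.

    word_vectors: list of per-word local prosodic feature vectors (lists)
    phrase_starts: sorted word indices at which each accent phrase begins,
                   starting with 0 (hypothetical boundary representation)
    Returns a list of (n, vector) pairs, where n is the number of words
    concatenated for that accent phrase.
    """
    bounds = list(phrase_starts) + [len(word_vectors)]
    features = []
    for begin, end in zip(bounds, bounds[1:]):
        n = end - begin
        # Flatten the n word vectors into one sequence feature vector.
        flat = [value for vec in word_vectors[begin:end] for value in vec]
        features.append((n, flat))
    return features
```

With seven words grouped as 3 + 1 + 3, as in the FIG. 10 example, the function yields vectors for n = 3, n = 1, and n = 3, whose lengths are n times the per-word dimension.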

＜アクセント句毎発話意図ラベル作成部１０７＞
入力：アクセント句境界、発話意図ラベル
出力：アクセント句ごと発話意図ラベル
アクセント句毎発話意図ラベル作成部１０７は、アクセント句ごとの発話意図ラベルを作成する（Ｓ１０７）。このステップでは、アクセント句境界と、発話意図ラベルが用いられる。発話意図ラベルは、人間が音声を聴取し、発話意図を感じた音声区間にラベルを付与することで得られる。本実施例では、人間が音声を聴取し、「肯定的」「否定的」の二つのラベルのうちの何れかのラベルをアクセント句ごとに付与する。あるアクセント句に対し、各ラベルは高々一つしか付与されないものとし、どのラベルも付与されなかったアクセント句は「どちらでもない」ラベルが付与されたものとする。 <Accent phrase utterance intention label creation unit 107>
Input: accent phrase boundaries, utterance intention labels
Output: utterance intention label for each accent phrase
The per-accent-phrase utterance intention label creation unit 107 creates an utterance intention label for each accent phrase (S107). This step uses the accent phrase boundaries and the utterance intention labels. The utterance intention labels are obtained by having a human listen to the speech and label the speech intervals in which an utterance intention was perceived. In this embodiment, a human listens to the speech and assigns one of the two labels “positive” and “negative” to each accent phrase. Each label is assigned to a given accent phrase at most once, and an accent phrase to which no label was assigned is regarded as labeled “neither”.

例えばアクセント句ごとに各ラベルが占める区間の割合を求め、最も割合が大きいラベルをそのアクセント句の発話意図ラベルとすることができる。図１１にアクセント句ごとに人手で発話意図ラベルを付与した例を示す。図１１の例において、先頭アクセント句「そうですね」の区間については、「肯定的」ラベルを付与された割合が最も高かったものとする。この場合、先頭アクセント句「そうですね」の発話意図ラベルは人手で付与された割合が最も高かった「肯定的」に決定される。一方、二番目のアクセント句「うーん」最後のアクセント句「わたしですか」については、どのラベルも付与されなかった割合が最も高くなったものとする。この場合、二番目および最後のアクセント句の発話意図ラベルは「どちらでもない」に決定される。 For example, the proportion of the interval occupied by each label can be computed for each accent phrase, and the label with the largest proportion can be taken as the utterance intention label of that accent phrase. FIG. 11 shows an example in which utterance intention labels were assigned manually to each accent phrase. In the example of FIG. 11, assume that the “positive” label covers the largest proportion of the interval of the first accent phrase “そうですね” (“That's right”). In this case, the utterance intention label of the first accent phrase is determined to be “positive”, the label with the largest manually assigned proportion. For the second accent phrase “うーん” (“Hmm”) and the last accent phrase “わたしですか” (“Me?”), assume that the proportion with no label assigned is the largest. In this case, the utterance intention labels of the second and last accent phrases are determined to be “neither”.
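
The proportion-based label decision can be sketched as below. The tuple representations, the assumption that labeled spans do not overlap one another, and the tie-breaking order (positive, then negative, then neither) are all illustrative choices that the source does not specify.

```python
def phrase_label(phrase, labeled_spans):
    """Pick the label covering the largest share of one accent phrase.

    phrase: (start, end) of the accent phrase in seconds
    labeled_spans: list of (start, end, label) with label in
                   {"positive", "negative"}; spans are assumed not to overlap
    """
    p_start, p_end = phrase
    covered = {"positive": 0.0, "negative": 0.0}
    for start, end, label in labeled_spans:
        overlap = min(p_end, end) - max(p_start, start)
        if overlap > 0:
            covered[label] += overlap
    # Time inside the phrase that carries no label counts toward "neither".
    unlabeled = (p_end - p_start) - covered["positive"] - covered["negative"]
    covered["neither"] = max(0.0, unlabeled)
    # max() returns the first key among equals, so ties resolve in the
    # insertion order above (an assumption, not part of the source).
    return max(covered, key=covered.get)
```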

＜発話意図モデル学習部１０８＞
入力：局所韻律系列特徴、アクセント句ごと発話意図ラベル
出力：発話意図モデル
発話意図モデル学習部１０８は、アクセント句ごとの局所韻律系列特徴と、これに対応するアクセント句ごとの発話意図ラベルとを学習データとし、発話意図抽出を行うための発話意図モデルを予め学習する（Ｓ１０８）。発話意図モデルは、連結数ｎごとに学習する。すなわち、アクセント句ごとの局所韻律系列特徴とそれに対応する発話意図ラベルの集合から、同一の連結数を持つ局所韻律系列特徴とそれに対応する発話意図ラベルを選び、発話意図モデルを学習する。発話意図モデルは、例えば決定木であってもよい。図１２に発話意図モデルを決定木として学習した例（連結数２の例）を示す。 <Speech intention model learning unit 108>
Input: local prosodic sequence features, utterance intention label for each accent phrase
Output: utterance intention model
The utterance intention model learning unit 108 uses the local prosodic sequence feature of each accent phrase and the corresponding per-accent-phrase utterance intention label as training data to learn, in advance, an utterance intention model for utterance intention extraction (S108). An utterance intention model is learned for each concatenation count n. That is, from the set of per-accent-phrase local prosodic sequence features and their corresponding utterance intention labels, the features with the same concatenation count and their labels are selected, and an utterance intention model is learned from them. The utterance intention model may be, for example, a decision tree. FIG. 12 shows an example in which the utterance intention model was learned as a decision tree (for a concatenation count of 2).

決定木は、アクセント句ごとの（同一の連結数を持つ）局所韻律系列特徴とそれに対応する発話意図ラベルの集合を入力とし、ＣＡＲＴなどの公知の決定木学習アルゴリズムを用いて学習してもよいし、人手で決定木の構造としきい値を決めて学習してもよい。発話意図モデルは、条件付き確率場やサポートベクターマシンなどの機械学習により学習してもよい。 The decision tree may be learned with a known decision tree learning algorithm such as CART, taking as input the set of per-accent-phrase local prosodic sequence features (with the same concatenation count n) and their corresponding utterance intention labels, or its structure and thresholds may be determined by hand. The utterance intention model may also be learned with other machine learning methods such as conditional random fields or support vector machines.
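
The idea of learning one model per concatenation count n can be sketched as follows. To stay self-contained, the sketch fits a depth-1 decision stump chosen by weighted Gini impurity as a stand-in for a full CART tree; it is not the tree of FIG. 12, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def fit_stump(samples):
    """Fit a one-split tree. samples: list of (feature_vector, label).

    Returns (feature_index, threshold, left_label, right_label).
    """
    best = None
    for j in range(len(samples[0][0])):
        for threshold in sorted({x[j] for x, _ in samples}):
            left = [y for x, y in samples if x[j] <= threshold]
            right = [y for x, y in samples if x[j] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(samples)
            if best is None or score < best[0]:
                best = (score, j, threshold, majority(left), majority(right))
    if best is None:  # no usable split: always predict the majority label
        label = majority([y for _, y in samples])
        return (0, float("inf"), label, label)
    return best[1:]


def train_models_by_n(training_data):
    """training_data: list of (n, feature_vector, label). One model per n."""
    groups = defaultdict(list)
    for n, x, y in training_data:
        groups[n].append((x, y))
    return {n: fit_stump(samples) for n, samples in groups.items()}


def predict(models, n, x):
    """Classify x with the model that matches its concatenation count n."""
    j, threshold, left_label, right_label = models[n]
    return left_label if x[j] <= threshold else right_label
```

Grouping by n matters because feature vectors with different concatenation counts have different lengths and cannot share a single fixed-dimension model.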

なお、上述した発話意図モデル学習部１０８のみを抜き出して単独の装置（発話意図モデル学習装置）としてもよい。この場合、発話意図モデル学習装置は、前述の局所韻律系列特徴と、アクセント句区間ごとに人手で付与された発話意図ラベルとを学習データとして、アクセント句区間ごとの発話意図の抽出に用いる発話意図モデルを学習する装置として構成される。 Note that the utterance intention model learning unit 108 described above may be extracted on its own and used as a standalone apparatus (an utterance intention model learning apparatus). In this case, the utterance intention model learning apparatus is configured as an apparatus that uses the above-described local prosodic sequence features and the utterance intention labels manually assigned to each accent phrase interval as training data to learn an utterance intention model used to extract the utterance intention of each accent phrase interval.

＜発話意図抽出部１０９＞
入力：局所韻律系列特徴、発話意図モデル
出力：発話ごとの発話意図
発話意図抽出部１０９は、局所韻律系列特徴と、ステップＳ１０８で学習した発話意図モデルに基づいて、アクセント句ごとの発話意図を抽出し、抽出されたアクセント句ごとの発話意図に基づいて、発話ごとの発話意図を抽出する（Ｓ１０９）。 <Speech intention extraction unit 109>
Input: local prosodic sequence features, utterance intention model
Output: utterance intention for each utterance
The utterance intention extraction unit 109 extracts the utterance intention of each accent phrase based on the local prosodic sequence features and the utterance intention model learned in step S108, and then extracts the utterance intention of each utterance based on the extracted per-accent-phrase utterance intentions (S109).

本実施例では、「肯定的」「否定的」「どちらでもない」の三種を発話意図とみなす。発話意図抽出部１０９は、局所韻律系列特徴を発話意図モデルに入力することで、アクセント句ごとの発話意図を得る。このとき発話意図抽出部１０９は、局所韻律系列特徴の連結数ｎに合った発話意図モデルを用いるものとする。 In this embodiment, the three types “positive”, “negative”, and “neither” are regarded as utterance intentions. The utterance intention extraction unit 109 obtains the utterance intention of each accent phrase by inputting its local prosodic sequence feature into the utterance intention model. In doing so, the unit uses the utterance intention model that matches the concatenation count n of the local prosodic sequence feature.

発話意図抽出部１０９は、発話に含まれる全てのアクセント句ごとの発話意図を求めたのち、後述するように発話ごとの発話意図を決定する。一つの発話中に、アクセント句ごとの発話意図「肯定的」「否定的」が一つも含まれない場合、発話意図抽出部１０９は当該発話の発話意図を「どちらでもない」とする。一つの発話中に、アクセント句ごとの発話意図「肯定的」「否定的」のどちらか一方のみ含まれる場合、発話意図抽出部１０９は当該含まれる発話意図を発話ごとの発話意図とする。一つの発話中に、アクセント句ごとの発話意図「肯定的」「否定的」の双方が含まれる場合、発話意図抽出部１０９は「肯定的」「否定的」それぞれの発話意図の区間の総和が大きい方を発話ごとの発話意図とする。 After obtaining the utterance intention of every accent phrase contained in an utterance, the utterance intention extraction unit 109 determines the utterance intention of the utterance as follows. If none of the accent phrases in an utterance has the intention “positive” or “negative”, the unit sets the utterance intention of that utterance to “neither”. If only one of “positive” and “negative” occurs among the accent phrases of an utterance, the unit takes that intention as the utterance intention of the utterance. If both occur, the unit takes whichever of “positive” and “negative” has the larger total interval length as the utterance intention of the utterance.
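
The three-case decision rule above can be sketched as follows. The input representation (a list of intention and duration pairs) and the handling of an exact tie in total length (resolved as “positive”) are assumptions; the source does not address ties.

```python
def utterance_intention(phrases):
    """Combine per-accent-phrase intentions into one utterance intention.

    phrases: list of (intention, duration_seconds) pairs, with intention in
             {"positive", "negative", "neither"}
    """
    totals = {"positive": 0.0, "negative": 0.0}
    present = set()
    for intention, duration in phrases:
        if intention in totals:
            totals[intention] += duration
            present.add(intention)
    if not present:
        return "neither"               # no positive or negative phrase at all
    if len(present) == 1:
        return next(iter(present))     # only one of the two occurs
    # Both occur: the larger total interval length wins.
    # Assumption: an exact tie is resolved as "positive".
    return "positive" if totals["positive"] >= totals["negative"] else "negative"
```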

本実施例の発話意図抽出装置１によれば、発話の一部の区間にのみ発話意図が含まれる場合には、当該音声のアクセント句単位での韻律変化に着目すべきであるという新たな知見を利用し、アクセント句単位での韻律変化を局所韻律系列特徴として抽出し、当該局所韻律系列特徴に基づいて学習された発話意図モデルを用いて発話意図を抽出するように構成したため、発話の一部区間にのみ発話意図が表出する場合でも、当該発話意図を正しく抽出することができる。 The utterance intention extraction apparatus 1 of this embodiment builds on the new finding that, when an utterance intention is contained in only part of an utterance, attention should be paid to the prosodic changes of the speech at the accent phrase level. It extracts prosodic changes at the accent phrase level as local prosodic sequence features and extracts utterance intentions using an utterance intention model learned from those features. As a result, the utterance intention can be extracted correctly even when it appears in only part of the utterance.

＜実施例２の発明の要点＞
機械との音声対話や打合せでは、一人の話者が継続して話すことがある。このような場合、一つの発話に二つ以上の発話意図の特徴が表れる発話が発生することもある。例えば、発話単位では否定の発話意図である「その通りです。しかし私は反対です。」という発話には、発話の前半に肯定的な発話意図の特徴が、発話の後半に否定的な発話意図の特徴が表れる（図１３参照）。しかし、非特許文献１の発話意図抽出装置９は、一つの発話には一つの発話意図のみが表れると仮定し、発話全体から求めた韻律特徴や言語特徴に基づいて発話意図抽出を行う。そのため、非特許文献１の発話意図抽出装置９で求めた韻律特徴や言語特徴には異なる種類の発話意図の特徴が含まれることがあると考えられ、非特許文献１の発話意図抽出装置９では正しく発話意図を抽出することが困難な場合があった。 <Summary of Invention of Example 2>
In spoken dialogue with a machine or in meetings, a single speaker may keep talking. In such cases, an utterance can occur in which the characteristics of two or more utterance intentions appear. For example, in the utterance “その通りです。しかし私は反対です。” (“That's right. But I disagree.”), whose intention at the utterance level is negative, characteristics of a positive intention appear in the first half and characteristics of a negative intention appear in the second half (see FIG. 13). However, the utterance intention extraction device 9 of Non-Patent Document 1 assumes that only one utterance intention appears in one utterance and performs extraction based on prosodic and linguistic features obtained from the whole utterance. The prosodic and linguistic features obtained by that device may therefore contain characteristics of different kinds of utterance intention, and in some cases it was difficult for the device to extract the utterance intention correctly.

一つの発話に二つ以上の発話意図の特徴が表れる発話が発生する場合、発話全体での発話意図は部分区間ごとに求めた発話意図の順序と高い関係性があると考えられる。以下、図１４を参照しながら説明する。例えば、発話前半に肯定、後半に否定が表れる場合は発話全体として否定的な発話意図を感じることが多い。また、発話前半に否定、後半に肯定が表れる場合は発話全体として肯定的な発話意図を感じることが多い。 When an utterance occurs in which the characteristics of two or more utterance intentions appear, the utterance intention of the whole utterance is considered to be strongly related to the order of the utterance intentions obtained for the partial sections. This is explained below with reference to FIG. 14. For example, when a positive intention appears in the first half of an utterance and a negative one in the second half, the utterance as a whole is often perceived as negative. Conversely, when a negative intention appears in the first half and a positive one in the second half, the utterance as a whole is often perceived as positive.

実施例１では、部分区間ごとに表れる発話意図の順序を考慮することなく、肯定的/否定的な発話意図の部分区間の長さのみに基づいて部分区間ごとの発話意図抽出結果を統合し、発話ごとの発話意図を抽出している。このため、発話ごとの発話意図抽出精度が低下し、図１４の例のように、発話単位では否定の発話意図である「その通りです。しかし私は反対です。」という発話を肯定的な意図を有すると認識してしまう場合もある。 In Example 1, the per-partial-section utterance intention extraction results are integrated based only on the lengths of the partial sections with positive or negative intention, without considering the order in which the intentions appear, and the utterance intention of each utterance is extracted from that. This lowers the accuracy of per-utterance extraction, and, as in the example of FIG. 14, the utterance “その通りです。しかし私は反対です。” (“That's right. But I disagree.”), whose intention at the utterance level is negative, may be recognized as having a positive intention.

実施例２の発明の要点は、部分区間ごとの発話意図の時系列情報と発話ごとの発話意図の関係性をＮ−ｇｒａｍモデルとして学習する点にある。Ｎ−ｇｒａｍモデルとは、ある文（単語の系列）の出現確率をＮ単語の連鎖の出現確率の積として表現するモデルであり（参考非特許文献３）、単語の順序が文の出現確率に反映される。これを発話意図に適用する。以下、図１５を参照しながら説明する（本図ではＮ＝３である）。すなわち、発話ごとの発話意図別に部分区間ごとの発話意図Ｎ連鎖のモデル（以下、発話意図Ｎ−ｇｒａｍモデルという）を事前に作成し、発話意図の抽出対象として入力された発話に対応する部分区間ごとの発話意図の系列に対して、その出現確率が最大となるような発話ごとの発話意図の発話意図Ｎ−ｇｒａｍモデルを選択することで発話ごとの発話意図を抽出する。これにより、部分区間ごとの発話意図の順序情報を利用した発話ごとの発話意図の抽出が可能となる。 The main point of the invention of Example 2 is that the relationship between the time-series information of the per-partial-section utterance intentions and the per-utterance utterance intention is learned as an N-gram model. An N-gram model expresses the occurrence probability of a sentence (a sequence of words) as the product of the occurrence probabilities of chains of N words (Reference Non-Patent Document 3), so that word order is reflected in the occurrence probability of the sentence. This is applied to utterance intentions, as explained below with reference to FIG. 15 (N = 3 in the figure). Specifically, a model of N-chains of per-partial-section utterance intentions (hereinafter, an utterance intention N-gram model) is created in advance for each per-utterance utterance intention. For the sequence of per-partial-section utterance intentions corresponding to an utterance input as the extraction target, the per-utterance utterance intention whose utterance intention N-gram model gives the sequence the highest occurrence probability is selected, which yields the utterance intention of the utterance. This makes it possible to extract per-utterance utterance intentions using the order information of the per-partial-section utterance intentions.

また、実施例１や図１５の例では部分区間ごとに３つの発話意図（肯定的、否定的、どちらでもない）を抽出している。この抽出結果をそのまま発話意図Ｎ−ｇｒａｍモデルに利用してもよいが、部分区間ごとの発話意図抽出結果をより多くの分類に分け（例えば、強く肯定的、やや肯定的、など）、それらを発話意図Ｎ−ｇｒａｍモデルに利用する方が発話意図Ｎ−ｇｒａｍモデルの表現精度が増し、発話ごとの発話意図抽出の精度が向上すると考えられる。そこで、本実施例では、部分区間ごとの発話意図の抽出結果を３つの発話意図ラベルではなく各発話意図の確率のベクトルとして表現し、当該ベクトルをベクトル量子化しインデクス（以下、発話意図インデクスという）に変換することで、各発話意図の分類の多様化と発話意図抽出精度の向上を図る（図１６参照）。（参考非特許文献３：鹿野清宏、河原達也、山本幹雄、伊藤克亘、武田一哉、IT Text音声認識システム、pp.53-69、オーム社、2001） In Example 1 and the example of FIG. 15, three utterance intentions (positive, negative, neither) are extracted for each partial section. These results may be used as they are in the utterance intention N-gram model. However, dividing the per-partial-section results into more classes (for example, strongly positive, slightly positive, and so on) and using those in the utterance intention N-gram model is expected to increase the expressive precision of the model and improve the accuracy of per-utterance extraction. In this embodiment, therefore, the extraction result for each partial section is expressed not as one of three utterance intention labels but as a vector of the probabilities of each utterance intention, and this vector is vector-quantized and converted into an index (hereinafter, an utterance intention index), with the aim of diversifying the classification of utterance intentions and improving extraction accuracy (see FIG. 16). (Reference Non-Patent Document 3: Kiyohiro Shikano, Tatsuya Kawahara, Mikio Yamamoto, Katsunobu Ito, Kazuya Takeda, IT Text Speech Recognition System, pp.53-69, Ohmsha, 2001)

＜実施例２の具体的説明＞
以下、Ｎ−ｇｒａｍモデルを用いて発話意図を抽出する実施例２の発話意図抽出装置について説明する。本実施例の発話意図抽出装置は、発話と、発話ごとの音声認識結果を入力とする。発話の定義は上述の非特許文献１における定義と同一とする。図１７、図１８を参照して、本実施例の発話意図抽出装置の構成、および動作について説明する。図１７は、本実施例の発話意図抽出装置２の構成を示すブロック図である。図１８は、本実施例の発話意図抽出装置２の動作を示すフローチャートである。図１７に示すように、本実施例の発話意図抽出装置２は、部分区間毎特徴量抽出部２０１と、部分区間毎発話意図モデル学習部２０２と、部分区間毎発話意図抽出部２０３と、発話意図インデクスコードブック作成部２０４と、発話意図インデクス変換部２０５と、Ｎ−ｇｒａｍモデル学習部２０６と、発話毎発話意図抽出部２０７を含む。 <Specific Explanation of Example 2>
The utterance intention extraction apparatus of Example 2, which extracts utterance intentions using an N-gram model, is described below. The apparatus receives utterances and the speech recognition result of each utterance as input. An utterance is defined in the same way as in Non-Patent Document 1 described above. The configuration and operation of the apparatus are described with reference to FIGS. 17 and 18. FIG. 17 is a block diagram showing the configuration of the utterance intention extraction apparatus 2 of this embodiment. FIG. 18 is a flowchart showing the operation of the utterance intention extraction apparatus 2 of this embodiment. As shown in FIG. 17, the utterance intention extraction apparatus 2 includes a per-partial-section feature extraction unit 201, a per-partial-section utterance intention model learning unit 202, a per-partial-section utterance intention extraction unit 203, an utterance intention index codebook creation unit 204, an utterance intention index conversion unit 205, an N-gram model learning unit 206, and a per-utterance utterance intention extraction unit 207.

なお、部分区間毎発話意図モデル学習部２０２、Ｎ−ｇｒａｍモデル学習部２０６部で学習に用いる発話は、同一のものでもよいし、異なるものでもよい。 Note that the utterances used for learning by the partial section utterance intention model learning unit 202 and the N-gram model learning unit 206 may be the same or different.

＜部分区間毎特徴量抽出部２０１＞
入力：発話、発話ごとの音声認識結果
出力：部分区間ごとの特徴量
部分区間毎特徴量抽出部２０１は、部分区間ごとの特徴量を抽出する（Ｓ２０１）。例えば、部分区間をアクセント句とし、実施例１の９０１〜９０３、１０４〜１０６と同様の方法で特徴量として局所韻律系列特徴を抽出してもよい。また、部分区間ごとの特徴量は、韻律特徴または言語特徴の少なくとも一つを含む。韻律特徴は、実施例１の局所韻律特徴の少なくとも一つを含む。言語特徴は、例えば部分区間内の単語列のＢａｇ−ｏｆ−Ｗｏｒｄｓを用いることができるが、部分区間に含まれる単語から決定可能な特徴量であればどのような特徴量を用いてもよい。 <Partial section feature extraction unit 201>
Input: utterance, speech recognition result for each utterance
Output: feature for each partial section
The per-partial-section feature extraction unit 201 extracts a feature for each partial section (S201). For example, the partial sections may be accent phrases, and local prosodic sequence features may be extracted as the features in the same manner as by components 901 to 903 and 104 to 106 of Example 1. The feature for each partial section includes at least one of a prosodic feature and a linguistic feature. The prosodic feature includes at least one of the local prosodic features of Example 1. As the linguistic feature, for example, the Bag-of-Words of the word string in the partial section can be used, but any feature that can be determined from the words contained in the partial section may be used.

＜部分区間毎発話意図モデル学習部２０２＞
入力：部分区間ごとの特徴量、部分区間ごとの発話意図ラベル
出力：部分区間ごとの発話意図モデル
部分区間毎発話意図モデル学習部２０２は、部分区間ごとの特徴量と、それに対応する部分区間ごとの発話意図ラベルを用いて、部分区間ごとの特徴量と部分区間ごとの発話意図との関係性を表現するモデルを学習する（Ｓ２０２）。ここでは学習手法としてニューラルネットワークを用いるが、クラス分類が可能な他の学習手法を用いてもよい。また、学習を行わず、人手で部分区間ごとの特徴量と部分区間ごとの発話意図との関係性を表現するルールを作成してもよい。 <Speech intention model learning unit 202 for each partial section>
Input: feature for each partial section, utterance intention label for each partial section
Output: per-partial-section utterance intention model
The per-partial-section utterance intention model learning unit 202 uses the feature of each partial section and the corresponding per-partial-section utterance intention label to learn a model that expresses the relationship between the feature of a partial section and its utterance intention (S202). A neural network is used here as the learning method, but any other learning method capable of classification may be used. Alternatively, instead of learning, rules expressing the relationship between the feature of a partial section and its utterance intention may be created by hand.

＜部分区間毎発話意図抽出部２０３＞
入力：部分区間ごとの特徴量、部分区間ごとの発話意図モデル
出力：部分区間ごとの発話意図の確率ベクトル
部分区間毎発話意図抽出部２０３は、部分区間ごとの発話意図モデルを用いて、部分区間ごとの特徴量からその部分区間の発話意図の確率を求める（Ｓ２０３）。部分区間の発話意図の確率は、例えばニューラルネットワークであれば出力層の活性化関数にソフトマックス関数を用いた際の出力値などを用いる。部分区間の発話意図の確率を結合し、部分区間ごとの発話意図の確率ベクトルとして出力する。 <Speech intention extraction unit 203 for each partial section>
Input: feature for each partial section, per-partial-section utterance intention model
Output: utterance intention probability vector for each partial section
The per-partial-section utterance intention extraction unit 203 uses the per-partial-section utterance intention model to obtain, from the feature of each partial section, the probabilities of the utterance intentions of that partial section (S203). With a neural network, for example, the output values obtained when a softmax function is used as the activation function of the output layer are used as these probabilities. The probabilities of the individual utterance intentions of a partial section are combined and output as the utterance intention probability vector of that partial section.
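
As an illustration of how softmax outputs form a probability vector, the sketch below converts three raw output-layer scores into probabilities. The ordering of the three classes (positive, negative, neither) and the example scores are assumptions made for illustration; the network itself is not shown.

```python
import math


def softmax(scores):
    """Convert raw output-layer scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for (positive, negative, neither) in one section.
probability_vector = softmax([2.0, 1.0, 0.1])
```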

なお、確率ベクトルを出力する代わりに、部分区間ごとの発話意図、つまり、肯定的、否定的、どちらでもない、のいずれかの値をそのまま出力してもよい。 Instead of outputting the probability vector, the utterance intention for each partial section, that is, any value of positive, negative, or neither may be output as it is.

＜発話意図インデクスコードブック作成部２０４＞
入力：部分区間ごとの発話意図の確率ベクトル
出力：発話意図インデクスコードブック
発話意図インデクスコードブック作成部２０４は、部分区間ごとの発話意図の確率ベクトルを発話意図インデクスに変換するための、コードブックを作成する（Ｓ２０４）。ここでは、ベクトル量子化のためのコードブック作成方法としてｋ平均法を用いる。部分区間ごとの発話意図の確率ベクトルの集合を用意し、クラスタ数をｋ個としてｋ平均法を適用することで、部分区間ごとの発話意図の確率ベクトルのセントロイドがｋ個得られる。各セントロイドに発話意図インデクスを割り当て、コードブックとする。ｋの数は発話意図の分類の数であり、ｋが多いほど発話意図の分類を細かくすることに相当する。例えば、ｋ＝２０とする。また、発話意図インデクス変換部２０４にてベクトル量子化が可能であるならば、既存のどのコードブック作成方法を用いてもよい。 <Speech intention index code book creation unit 204>
Input: utterance intention probability vector for each partial section
Output: utterance intention index codebook
The utterance intention index codebook creation unit 204 creates a codebook for converting the utterance intention probability vector of each partial section into an utterance intention index (S204). Here, the k-means method is used to create the codebook for vector quantization. A set of per-partial-section utterance intention probability vectors is prepared, and applying the k-means method with k clusters yields k centroids of these probability vectors. An utterance intention index is assigned to each centroid, and the result is used as the codebook. The value of k is the number of utterance intention classes; a larger k corresponds to a finer classification of utterance intentions. For example, k = 20. Any existing codebook creation method may be used as long as it allows the utterance intention index conversion unit 205 to perform vector quantization.
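
Codebook creation with the k-means method can be sketched in plain Python as follows. The random initialization from the data, the fixed seed, the iteration cap, and the policy of keeping the centroid of an empty cluster are implementation assumptions, not details given in the source.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_codebook(vectors, k, iterations=50, seed=0):
    """Cluster probability vectors with k-means; return {index: centroid}.

    Requires k <= len(vectors).
    """
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: squared_distance(v, centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                updated.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
            else:
                updated.append(centroids[i])  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # assignments no longer change
        centroids = updated
    # The position of each centroid serves as its utterance intention index.
    return dict(enumerate(centroids))
```

Because each centroid is a mean of probability vectors, its components still sum to 1, so a centroid can be read as a typical mixture of positive, negative, and neither.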

なお、部分区間毎発話意図抽出部２０３で確率ベクトルの代わりに部分区間ごとの発話意図の値をそのまま出力することとした場合は、肯定的、否定的、どちらでもない、のそれぞれに１、２、３のインデクスを付与するなどとすればよい。また、図１５のように、肯定的、否定的、どちらでもない、をそのままインデクスとするのでもよい。 If the per-partial-section utterance intention extraction unit 203 outputs the utterance intention value of each partial section as it is instead of a probability vector, the indexes 1, 2, and 3 may, for example, be assigned to positive, negative, and neither, respectively. Alternatively, as in FIG. 15, positive, negative, and neither may themselves be used as the indexes.

＜発話意図インデクス変換部２０５＞
入力：部分区間ごとの発話意図の確率ベクトル、発話意図インデクスコードブック
出力：部分区間ごとの発話意図インデクス
発話意図インデクス変換部２０５は、部分区間ごとの発話意図の確率ベクトルを部分区間ごとの発話意図インデクスに変換する（Ｓ２０５）。ｋ平均法を用いて発話意図インデクスコードブックを作成した場合、ある部分区間の発話意図の確率ベクトルから最もユークリッド距離の近いセントロイドの発話意図インデクスを、その部分区間における発話意図インデクスとする。 <Speech intention index conversion unit 205>
Input: utterance intention probability vector for each partial section, utterance intention index codebook
Output: utterance intention index for each partial section
The utterance intention index conversion unit 205 converts the utterance intention probability vector of each partial section into an utterance intention index (S205). When the utterance intention index codebook was created with the k-means method, the utterance intention index of the centroid with the smallest Euclidean distance from the utterance intention probability vector of a partial section is taken as the utterance intention index of that partial section.
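
The nearest-centroid conversion can be sketched as below. The codebook is assumed to be a dictionary from utterance intention index to centroid, and the two centroids in the usage example are made-up values, not figures from the source.

```python
import math


def to_index(probability_vector, codebook):
    """Return the index of the centroid closest in Euclidean distance."""
    return min(codebook, key=lambda idx: math.dist(probability_vector, codebook[idx]))


def to_index_sequence(probability_vectors, codebook):
    """Convert a sequence of probability vectors into a sequence of indexes."""
    return [to_index(v, codebook) for v in probability_vectors]


# Hypothetical two-entry codebook over (positive, negative, neither).
example_codebook = {0: [0.85, 0.075, 0.075], 1: [0.075, 0.85, 0.075]}
example_indexes = to_index_sequence([[0.7, 0.2, 0.1], [0.2, 0.6, 0.2]], example_codebook)
```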

なお、部分区間ごとの発話意図の確率ベクトルの系列が入力される場合は、部分区間ごとの発話意図インデクスの系列が出力される。 When a sequence of probability vectors of utterance intention for each partial section is input, a sequence of utterance intention indexes for each partial section is output.

＜Ｎ−ｇｒａｍモデル学習部２０６＞
入力：部分区間ごとの発話意図インデクス（の系列）、発話ごとの発話意図ラベル
出力：発話意図Ｎ−ｇｒａｍモデル
Ｎ−ｇｒａｍモデル学習部２０６は、発話ごとの発話意図別に、部分区間ごとの発話意図インデクスのＮ−ｇｒａｍである発話意図Ｎ−ｇｒａｍを学習する（Ｓ２０６）。ここでは、Ｎ＝３としてモデル学習を行う。発話意図Ｎ−ｇｒａｍの学習は、Ｎ−ｇｒａｍ言語モデルの学習と同様の枠組みで行う。すなわち、Ｎ−ｇｒａｍ確率は最尤推定により決定し、その後学習データに含まれなかった発話意図Ｎ−ｇｒａｍへの対処としてバックオフ平滑化を実施する。出力として、発話ごとの発話意図が肯定的、否定的、どちらでもない、のそれぞれにおける発話意図Ｎ−ｇｒａｍモデルを得る（図１５、図１６参照）。すなわち、３つの発話意図Ｎ−ｇｒａｍモデルを得る。 <N-gram model learning unit 206>
Input: (sequence of) utterance intention indexes for each partial section, utterance intention label for each utterance
Output: utterance intention N-gram models
The N-gram model learning unit 206 learns, separately for each per-utterance utterance intention, utterance intention N-grams, that is, N-grams of the per-partial-section utterance intention indexes (S206). Here, the models are learned with N = 3. Utterance intention N-grams are learned within the same framework as N-gram language models. That is, the N-gram probabilities are determined by maximum likelihood estimation, and backoff smoothing is then applied to handle utterance intention N-grams that did not occur in the training data. The output is one utterance intention N-gram model for each of the per-utterance utterance intentions positive, negative, and neither (see FIGS. 15 and 16), that is, three utterance intention N-gram models in total.
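
The maximum likelihood step for N = 3 can be sketched as follows. The sentence-boundary padding symbols are an assumption borrowed from common language modeling practice, and the backoff smoothing mentioned above is deliberately omitted, so unseen trigrams simply have no entry in the resulting tables.

```python
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"  # hypothetical boundary symbols


def trigram_counts(sequences):
    """Count trigrams and their two-symbol contexts over index sequences."""
    trigrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = [BOS, BOS] + list(seq) + [EOS]
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            trigrams[(a, b, c)] += 1
            contexts[(a, b)] += 1
    return trigrams, contexts


def train_intention_ngrams(labeled_sequences):
    """labeled_sequences: list of (index_sequence, utterance_label).

    Returns {utterance_label: {trigram: P(c | a, b)}} using maximum
    likelihood estimates, with one model per utterance-level label.
    """
    by_label = defaultdict(list)
    for seq, label in labeled_sequences:
        by_label[label].append(seq)
    models = {}
    for label, seqs in by_label.items():
        trigrams, contexts = trigram_counts(seqs)
        models[label] = {g: n / contexts[g[:2]] for g, n in trigrams.items()}
    return models
```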

＜発話毎発話意図抽出部２０７＞
入力：部分区間ごとの発話意図インデクス（の系列）、発話意図Ｎ−ｇｒａｍモデル
出力：発話ごとの発話意図の抽出結果
発話毎発話意図抽出部２０７は、発話意図Ｎ−ｇｒａｍモデルを用いて部分区間ごとの発話意図インデクス（の系列）から発話ごとの発話意図を抽出する（Ｓ２０７）。ある発話全体の部分区間ごとの発話意図インデクスの出現確率を、Ｎ−ｇｒａｍモデル学習部２０６の出力の発話意図Ｎ−ｇｒａｍモデルごとに求める。ある発話全体の部分区間ごとの発話意図インデクスの出現確率が最も高くなるような発話意図Ｎ−ｇｒａｍモデルが発話ごとの発話意図の抽出結果となる（図１５、図１６参照）。 <Speech intention extraction unit 207 for each utterance>
Input: (sequence of) utterance intention indexes for each partial section, utterance intention N-gram models
Output: utterance intention extraction result for each utterance
The per-utterance utterance intention extraction unit 207 uses the utterance intention N-gram models to extract the utterance intention of each utterance from the (sequence of) per-partial-section utterance intention indexes (S207). The occurrence probability of the per-partial-section utterance intention index sequence of a whole utterance is computed under each utterance intention N-gram model output by the N-gram model learning unit 206. The utterance intention whose N-gram model gives this sequence the highest occurrence probability is the extraction result for the utterance (see FIGS. 15 and 16).

具体的には、以下のようにして発話ごとの発話意図を求める。なお、ここでは部分区間ごとの発話意図インデクスの系列の代わりに部分区間ごとの発話意図の系列を用いて説明する。 Specifically, the utterance intention for each utterance is obtained as follows. Here, the description is made using the utterance intention sequence for each partial section instead of the utterance intention index sequence for each partial section.

部分区間ごとの発話意図の系列、発話ごとの発話意図をそれぞれx=(x₁,x₂,x₃,…,x_n)、y（ただし、x_i、yは、肯定的、否定的、どちらでもない、のいずれかの値をとる）とする。部分区間ごとの発話意図の系列がx=(x₁,x₂,x₃,…,x_n)であるときの発話ごとの発話意図がyである確率を条件付き確率P(y|x)を用いて表現すると、発話ごとの発話意図抽出結果Yは以下のようにして求まる。 Let the sequence of per-partial-section utterance intentions and the per-utterance utterance intention be x = (x₁, x₂, x₃, …, x_n) and y, respectively (where each x_i and y takes one of the values positive, negative, or neither). Expressing the probability that the per-utterance utterance intention is y when the sequence of per-partial-section utterance intentions is x = (x₁, x₂, x₃, …, x_n) as the conditional probability P(y|x), the per-utterance utterance intention extraction result Y is obtained as follows.
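
The equation itself is not present in this text. Based on the conditional probability P(y|x) defined above and the accompanying assumption that P(x) and P(y) are constant, it presumably takes the following form via Bayes' theorem; this is a reconstruction from the surrounding definitions, not a copy of the original equation.

```latex
Y = \operatorname*{arg\,max}_{y} P(y \mid x)
  = \operatorname*{arg\,max}_{y} \frac{P(x \mid y)\,P(y)}{P(x)}
  = \operatorname*{arg\,max}_{y} P(x \mid y)
```

Under this reading, P(x|y) is the occurrence probability of the sequence x under the utterance intention N-gram model for y.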

ここで、発話意図の出現確率は一様と考えられることから、P(x)とP(y)は一定であると仮定した。 Here, since the appearance probability of the utterance intention is considered to be uniform, it is assumed that P (x) and P (y) are constant.

The N-gram model learning unit 206 creates one utterance intention N-gram model for each per-utterance utterance intention y. Using these models, the per-utterance utterance intention extraction unit 207 selects the model of the y under which the sequence x of utterance intentions for each partial section has the highest occurrence probability, and thereby extracts the utterance intention Y for the utterance.
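The selection rule above can be illustrated with a minimal Python sketch. It is not the implementation of units 206 and 207: the choice of N = 2 (bigrams), the add-one smoothing, the toy training sequences, and the names `train_bigram` and `extract_intention` are all assumptions made for illustration only.

```python
import math
from collections import Counter

LABELS = ("positive", "negative", "neither")
BOS, EOS = "<s>", "</s>"  # hypothetical sequence boundary symbols


def train_bigram(sequences, vocab_size=len(LABELS) + 1):
    """Return a function giving log P(x | y) under an add-one smoothed bigram model.

    `sequences` holds the partial-section intention sequences of all training
    utterances that share one manually assigned label y. `vocab_size` counts
    the symbols that can follow a context (the three labels plus EOS).
    """
    bigram, context = Counter(), Counter()
    for seq in sequences:
        padded = [BOS, *seq, EOS]
        for prev, cur in zip(padded, padded[1:]):
            bigram[(prev, cur)] += 1
            context[prev] += 1

    def log_prob(seq):
        padded = [BOS, *seq, EOS]
        return sum(
            math.log((bigram[(p, c)] + 1) / (context[p] + vocab_size))
            for p, c in zip(padded, padded[1:])
        )

    return log_prob


def extract_intention(models, x):
    """Pick the label y whose model gives the sequence x the highest probability."""
    return max(models, key=lambda y: models[y](x))


# Toy training data (assumed): sequences grouped by per-utterance label.
training = {
    "positive": [["neither", "positive"], ["positive", "positive"],
                 ["neither", "neither", "positive"]],
    "negative": [["neither", "negative"], ["negative", "neither"],
                 ["positive", "negative"]],
    "neither": [["neither", "neither"], ["neither"]],
}
models = {y: train_bigram(seqs) for y, seqs in training.items()}

print(extract_intention(models, ["neither", "neither", "positive"]))
print(extract_intention(models, ["positive", "negative"]))
```

Because one model is trained per label, the order in which intentions appear within an utterance affects the score, which is the property the embodiment relies on.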

According to the utterance intention extraction device 2 of this embodiment, the utterance intention for each utterance is extracted using the sequence of utterance intentions (indexes) extracted for each partial section of the target utterance together with the utterance intention N-gram models. As a result, even when the utterance intention is expressed in only part of the utterance, the order in which utterance intentions are expressed is taken into account and the utterance intention can be extracted correctly. In particular, the utterance intention can be extracted correctly even for an utterance in which features of two or more utterance intentions appear.

Furthermore, according to the utterance intention extraction device 2 of this embodiment, the utterance intention for each partial section is expressed as a probability (vector). This makes it possible to improve the accuracy of per-utterance utterance intention extraction compared with expressing the utterance intention as one of three values (positive, negative, neither).
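As a rough illustration of how per-section probability vectors could be turned into an index sequence, the following Python sketch quantizes each vector to the nearest entry of a codebook. The codebook contents, the nearest-prototype rule, and the names `CODEBOOK`, `to_index`, and `to_index_sequence` are hypothetical; this passage does not specify how the classifications are determined.

```python
# Hypothetical codebook: index -> prototype probability vector, ordered as
# (positive, negative, neither). Entries 3 to 5 represent mixed intentions
# that a hard three-way label could not distinguish.
CODEBOOK = {
    0: (1.0, 0.0, 0.0),
    1: (0.0, 1.0, 0.0),
    2: (0.0, 0.0, 1.0),
    3: (0.5, 0.0, 0.5),
    4: (0.0, 0.5, 0.5),
    5: (0.5, 0.5, 0.0),
}


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def to_index(prob):
    """Map one partial-section probability vector to its nearest codebook index."""
    return min(CODEBOOK, key=lambda i: squared_distance(CODEBOOK[i], prob))


def to_index_sequence(probs):
    """Convert the probability vectors of one utterance to an index sequence."""
    return [to_index(p) for p in probs]


sections = [(0.9, 0.05, 0.05), (0.45, 0.05, 0.5), (0.1, 0.5, 0.4)]
print(to_index_sequence(sections))  # prints [0, 3, 4]
```

In this sketch the second and third sections map to mixed classes rather than being forced into a single hard label, which is one way the probability representation can preserve information that a three-valued label discards.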

<Supplementary note>
The device of the present invention has, for example, as a single hardware entity: an input unit to which a keyboard or the like can be connected; an output unit to which a liquid crystal display or the like can be connected; a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected; a CPU (Central Processing Unit, which may include a cache memory, registers, and the like); RAM and ROM as memory; an external storage device such as a hard disk; and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity equipped with such hardware resources.

The external storage device of the hardware entity stores the program needed to implement the functions described above and the data needed for the processing of this program. (Storage is not limited to the external storage device; for example, the program may be stored in a ROM, which is a read-only storage device.) Data obtained through the processing of these programs is stored as appropriate in the RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM or the like) and the data needed for the processing of each program are read into memory as needed, and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU implements the predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiments described above, and may be modified as appropriate without departing from the spirit of the present invention. In addition, the processes described in the above embodiments need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.

As already stated, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are implemented by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on a computer implements the processing functions of the hardware entity on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own storage device and executes the process according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute the process according to that program. The computer may also execute processes according to the received program one after another, each time the program is transferred to it from the server computer. Alternatively, the processes described above may be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are implemented solely through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be implemented in hardware.

Claims

With a continuous section that includes at least one word and in which the time interval between words is no longer than a fixed time defined as an utterance section, and the speech of the utterance section defined as an utterance,
an utterance intention model learning device that learns an utterance intention N-gram model, which is an N-gram model used to extract the utterance intention for each utterance, using as learning data a partial-section utterance intention index sequence, which is a sequence of indexes corresponding to the partial-section utterance intentions extracted for each partial section included in the utterance, and an utterance intention label manually assigned to each utterance.

The utterance intention model learning device according to claim 1, having an utterance intention index codebook that associates each utterance intention classification with an index indicating that classification, wherein
the partial-section utterance intention is expressed using the probability that each utterance intention appears, and
the partial-section utterance intention index sequence is a sequence of indexes obtained by converting, using the utterance intention index codebook, the utterance intention classifications determined from the probabilities of the partial-section utterance intentions.

With a continuous section that includes at least one word and in which the time interval between words is no longer than a fixed time defined as an utterance section, and the speech of the utterance section defined as an utterance,
an utterance intention extraction device that extracts an utterance intention for each utterance from an utterance input as an utterance intention extraction target, comprising:
a partial-section utterance intention index sequence generation unit that generates, from the utterance, a partial-section utterance intention index sequence, which is a sequence of indexes corresponding to the partial-section utterance intentions extracted for each partial section included in the utterance; and
a per-utterance utterance intention extraction unit that extracts the utterance intention based on the partial-section utterance intention index sequence and an utterance intention N-gram model,
wherein the utterance intention N-gram model is learned, as an N-gram model used to extract the utterance intention for each utterance, using as learning data the partial-section utterance intention index sequence and an utterance intention label manually assigned to each utterance.

The utterance intention extraction device according to claim 3, having an utterance intention index codebook that associates each utterance intention classification with an index indicating that classification, wherein
the partial-section utterance intention is expressed using the probability that each utterance intention appears, and
the partial-section utterance intention index sequence is a sequence of indexes obtained by converting, using the utterance intention index codebook, the utterance intention classifications determined from the probabilities of the partial-section utterance intentions.

With a continuous section that includes at least one word and in which the time interval between words is no longer than a fixed time defined as an utterance section, and the speech of the utterance section defined as an utterance,
an utterance intention model learning method that learns an utterance intention N-gram model, which is an N-gram model used to extract the utterance intention for each utterance, using as learning data a partial-section utterance intention index sequence, which is a sequence of indexes corresponding to the partial-section utterance intentions extracted for each partial section included in the utterance, and an utterance intention label manually assigned to each utterance.

With a continuous section that includes at least one word and in which the time interval between words is no longer than a fixed time defined as an utterance section, and the speech of the utterance section defined as an utterance,
an utterance intention extraction method for extracting an utterance intention for each utterance from an utterance input as an utterance intention extraction target, comprising:
a step of generating, from the utterance, a partial-section utterance intention index sequence, which is a sequence of indexes corresponding to the partial-section utterance intentions extracted for each partial section included in the utterance; and
a step of extracting the utterance intention based on the partial-section utterance intention index sequence and an utterance intention N-gram model,
wherein the utterance intention N-gram model is learned, as an N-gram model used to extract the utterance intention for each utterance, using as learning data the partial-section utterance intention index sequence and an utterance intention label manually assigned to each utterance.

A program for causing a computer to function as any one of the utterance intention model learning device according to claim 1 or 2, or the utterance intention extraction device according to claim 3 or 4.