JP2018031851A

JP2018031851A - Discourse function estimation device and computer program for the same

Info

Publication number: JP2018031851A
Application number: JP2016162927A
Authority: JP
Inventors: Carlos Toshinori Ishi; Chaoran Liu; Hiroshi Ishiguro
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2016-08-23
Filing date: 2016-08-23
Publication date: 2018-03-01
Anticipated expiration: 2036-08-23
Also published as: JP6712754B2

Abstract

PROBLEM TO BE SOLVED: To provide a discourse function detection device that determines a discourse function with high accuracy. SOLUTION: A discourse function estimation device 44 includes: a morpheme analysis part 72 and a time-series part-of-speech information storage part 74 that receive text data of an utterance 42, perform morpheme analysis, and generate a first vector; an F0 extraction part 76, an F0 average storage part 78, and a speaker normalization part 80 that extract F0 from the 150 milliseconds of the voice signal immediately before a phrase end in the voice signal corresponding to the utterance, and generate a second vector representing the change in F0 by normalizing F0; and a classifier 82, composed of an SVM or DNN trained in advance by machine learning, that receives a feature vector composed of the first vector and the second vector as input and classifies the discourse function of the utterance at the phrase end into one of a plurality of predetermined discourse functions. SELECTED DRAWING: Figure 2

Description

The present invention relates to human-machine dialogue systems, and more particularly to a technique for detecting the turn end point of an utterance in order to enable natural interaction between a human and a robot.

In speech recognition tasks, it is important to detect the end point of the user's utterance appropriately. A human-machine dialogue system is additionally required to detect the turn end point appropriately, not only the end of the utterance. Unlike a one-question, one-answer conversation or a single voice command, free dialogue (discourse), which is natural communication, often contains several utterance sentences in a single speaker turn. Estimating the turn end point in free-dialogue speech is therefore difficult.

When we converse with another person, the aim is to understand the speaker's intention and to respond appropriately to the speaker's utterance. Understanding what an utterance intends is as important as understanding every word in it, and it has a great influence on how smoothly the conversation proceeds.

Accurate estimation of the turn end point can be used to segment utterance sentences and also helps in understanding the utterance content, which leads to more natural interaction control. In addition, a relationship between head motion and turn-taking in Japanese speakers has been reported (Non-Patent Document 1). Accordingly, controlling the motion of a communication robot in accordance with the estimated utterance end timing can make the robot appear more human-like.

Non-Patent Document 1: C. Liu, C. T. Ishi, H. Ishiguro, Proc. of HRI 2012, pp. 285-292, 2012.
Non-Patent Document 2: R. Hariharan, J. Hakkinen, and K. Laurila, ICASSP 2001, vol. 1, pp. 249-252, May 2001.
Non-Patent Document 3: Q. Li, J. Zheng, Q. R. Zhou, and C. Lee, ICASSP 2001, vol. 1, pp. 233-236, May 2001.
Non-Patent Document 4: L. Huang and C. Yang, Proc. ICASSP 2000, vol. 3, pp. 1751-1754, 2000.

Previous studies have used silent intervals (Non-Patent Document 2) and zero-crossings and entropy (Non-Patent Documents 3 and 4) to detect the turn end point of an utterance. However, the accuracy of sentence-end detection with these methods is easily affected by the environment, so the accuracy of turn end point detection is also low. Moreover, natural dialogue contains many overlaps and simultaneous utterances, so accurate detection of the turn end point is difficult with conventional techniques that rely on silent intervals, zero-crossings, or entropy.

Furthermore, the end of an utterance does not simply mean that the turn is yielded to the other party. The current speaker may keep the turn, or may be asking the other party a question or requesting a response. In other words, an utterance does not merely state something; it also has the function of steering the conversation between the speakers in a certain direction. Such a function is referred to here as a discourse function.

In a human-machine dialogue system, natural dialogue is not possible unless the discourse function at the phrase end of an utterance is detected with high accuracy. With conventional techniques, it is difficult to determine such discourse functions with high accuracy.

Therefore, an object of the present invention is to provide a discourse function detection device that determines a discourse function with high accuracy.

A discourse function estimation device according to a first aspect of the present invention includes: first vector generation means for receiving text data of an utterance and performing morphological analysis on the text data to generate a first vector (a first feature vector) for estimating the discourse function of the utterance; second vector generation means for extracting a fundamental frequency component from the speech signal in a predetermined section immediately before a phrase end detected in the utterance, within the speech signal corresponding to the utterance, and generating a second vector representing the change in the fundamental frequency component; and classification means, trained in advance by machine learning, for receiving as input a feature vector composed of the first vector and the second vector and classifying the discourse function of the utterance at the phrase end into one of a plurality of predetermined discourse functions.

Preferably, the second vector generation means includes dividing means for dividing the predetermined section immediately before the phrase end detected in the utterance into a plurality of subsections, and means for generating the second vector with the fundamental frequency of each subsection produced by the dividing means as an element.

More preferably, the first vector generation means includes: morphological analysis means for receiving the text data of the utterance, performing morphological analysis on the text data, and outputting a morpheme sequence; morpheme sequence storage means for storing the morpheme sequence output by the morphological analysis means in time series; and means for generating the first vector, using as elements at least the part-of-speech information obtained from each of the latest predetermined number of morphemes stored in the morpheme sequence storage means, and outputting it to the classifier.

Still more preferably, the first vector generation means includes: morphological analysis means for receiving the text data of the utterance, performing morphological analysis on the text data, and outputting a morpheme sequence; BOW vector generation means for generating a BOW vector representing the set of words (bag of words, BOW) appearing in the morpheme sequence; BOW vector normalization means for normalizing the elements of the BOW vector by the frequency of occurrence of each word in a predetermined data set and the frequency of occurrence of each word in the utterance, and outputting a normalized BOW vector; and dimension reduction means for reducing the dimension of the normalized BOW vector output by the BOW vector normalization means and outputting the result as the first vector.

The dimension reduction means may include means for generating the first vector by reducing the dimension of the normalized BOW vector output by the BOW vector normalization means using latent Dirichlet allocation (LDA).

The dimension reduction means may include: a bottleneck neural network, connected so as to receive the normalized BOW vector output by the BOW vector normalization means and trained in advance so that its output equals its input; and means for generating the first vector using as elements the values output from the nodes of the bottleneck layer of the bottleneck neural network in response to the normalized BOW vector being given.

Preferably, the classification means includes a support vector machine trained to receive the feature vector as input and classify the discourse function of the utterance at the phrase end into one of the plurality of predetermined discourse functions.

More preferably, the classification means includes: a hidden Markov model representing the transition paths of hidden states corresponding to the discourse functions of the utterance and the output probability of each element of the feature vector in each state; a deep neural network, trained in advance by machine learning, that receives the feature vector as input and outputs the probability that the hidden Markov model transitions to each of the states after the state that output the feature vector; maximum likelihood estimation means for estimating, on the basis of the feature vector, the hidden Markov model, and the output of the deep neural network, the maximum likelihood path as the transition path of the unobservable states of the utterance; and means for estimating the discourse function of the utterance on the basis of the path estimated by the maximum likelihood estimation means.

Still more preferably, the second vector generation means includes: fundamental frequency extraction means for extracting a fundamental frequency component from the speech signal in the predetermined section immediately before the phrase end detected in the utterance, within the speech signal corresponding to the utterance, and storing it as a logarithmic fundamental frequency component; fundamental frequency average storage means for storing a previously extracted average value of the logarithm of the fundamental frequency of the speech of the speaker of the utterance; and means for normalizing the logarithmic fundamental frequency component by subtracting the average value stored in the fundamental frequency average storage means from the logarithmic fundamental frequency component extracted by the fundamental frequency extraction means, and generating the second vector with the normalized logarithmic fundamental frequency component as elements.

Preferably, the discourse function estimation device further includes: fundamental frequency calculation means for calculating the logarithm of the fundamental frequency of the speaker's speech in the utterance at predetermined time intervals; and means for calculating the average of the logarithms of the fundamental frequency calculated at the predetermined time intervals by the fundamental frequency calculation means and storing it in the fundamental frequency average storage means.

More preferably, the discourse function estimation device further includes phrase end detection means for detecting a phrase end of the utterance and outputting a phrase end signal. The first vector generation means and the second vector generation means generate and output the first vector and the second vector, respectively, from the text data and the speech signal immediately before the phrase end detected by the phrase end detection means.

Still more preferably, the phrase end detection means includes: a speech recognition device that performs speech recognition on the utterance and outputs text data; and phrase end specifying means for specifying, from the phoneme information immediately before the phrase end in the text data output by the speech recognition device, a phrase end section to be treated as the phrase end. The second vector generation means includes: means for dividing the phrase end section into subsections of a predetermined length and extracting the logarithm of the fundamental frequency of each subsection; and means for generating a fixed-length second vector on the basis of the relationships among the logarithms of the fundamental frequency of the subsections extracted by the extracting means.

A computer program according to a second aspect of the present invention causes a computer to function as any of the discourse function estimation devices described above.

FIG. 1 is a diagram showing the configuration of a human-machine dialogue system including the discourse function estimation device according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing the schematic configuration of the discourse function estimation device shown in FIG. 1.
FIG. 3 is a block diagram showing the schematic configuration of the discourse function estimation device according to the second embodiment of the present invention.
FIG. 4 is a block diagram showing the schematic configuration of the discourse function estimation device according to the third embodiment of the present invention.
FIG. 5 is a schematic diagram showing the schematic configuration of the bottleneck network used in the third embodiment.
FIG. 6 is a schematic diagram showing the configuration of the deep neural network used in the classifier of the third embodiment.
FIG. 7 is a schematic diagram for explaining the process by which the classifier of the third embodiment estimates the maximum likelihood sequence of discourse functions.
FIG. 8 is a block diagram showing the schematic configuration of the discourse function estimation device according to the fourth embodiment of the present invention.
FIG. 9 is a graph for explaining the effect of using LDA as the classifier in the embodiments of the present invention.
FIG. 10 is a graph for explaining the effect of using a DNN as the classifier in the embodiments of the present invention.
FIG. 11 is an external view of a computer system for realizing the discourse function estimation device according to each embodiment of the present invention.
FIG. 12 is a block diagram showing the internal configuration of the computer system whose external appearance is shown in FIG. 11.

In the following description and drawings, identical parts are given identical reference numerals. Detailed descriptions of them are therefore not repeated.

[First Embodiment]
<Outline>
Referring to FIG. 1, the discourse function estimation device 44 according to the first embodiment of the present invention detects a discourse function using not only the linguistic information obtained from the speech 42 uttered by an operator 40 but also its prosodic information. The detected discourse function is used to control the head motion 46 of a robot 48 and the responses of the robot 48, thereby realizing natural dialogue between the operator 40 and the robot 48.

<Configuration>
Referring to FIG. 2, the discourse function estimation device 44 according to this embodiment includes: a speech recognition device 70 that receives the speaker's speech 42, performs speech recognition, and outputs text data of the recognition result; a morphological analysis unit 72 that performs morphological analysis on this text data and outputs a morpheme sequence annotated with part-of-speech information and the like; and a time-series part-of-speech information storage unit 74 that stores the time series of part-of-speech information of the morpheme sequence output by the morphological analysis unit 72 and outputs a vector whose elements are those parts of speech.

The language handled in this embodiment is Japanese, for which 11 parts of speech are used. Together with "other", this gives 12 kinds of part-of-speech information. The part-of-speech vector is a fixed-length vector with one element per kind of part of speech (12 elements), in which the element corresponding to the applicable part of speech has the value 1 and all other elements have the value 0.
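The 12-element part-of-speech vector can be illustrated with the following minimal sketch. The tag names in `POS_TAGS` are hypothetical (the text only states that there are 11 parts of speech plus "other"), and the left zero-padding and the `history` parameter used when concatenating the latest morphemes are assumptions made for illustration.

```python
# Hypothetical tag inventory: the text gives only the count (11 parts of speech + "other").
POS_TAGS = [
    "noun", "verb", "adjective", "adverb", "particle", "auxiliary_verb",
    "conjunction", "interjection", "prefix", "adnominal", "symbol", "other",
]


def pos_one_hot(tag):
    """Return a 12-element vector with 1 at the tag's position and 0 elsewhere."""
    # Tags outside the inventory fall back to "other".
    index = POS_TAGS.index(tag) if tag in POS_TAGS else POS_TAGS.index("other")
    return [1 if i == index else 0 for i in range(len(POS_TAGS))]


def pos_sequence_vector(tags, history=3):
    """Concatenate one-hot vectors for the latest `history` morphemes (history >= 1).

    When fewer than `history` morphemes are available, the vector is
    zero-padded on the left (an assumption, not stated in the text).
    """
    recent = tags[-history:]
    vector = []
    for _ in range(history - len(recent)):
        vector.extend([0] * len(POS_TAGS))
    for tag in recent:
        vector.extend(pos_one_hot(tag))
    return vector
```

With `history=3`, the resulting vector always has 36 elements, so the classifier input keeps a fixed length regardless of the utterance length.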

The discourse function estimation device 44 further includes: an F0 extraction unit 76 that extracts the log-scale fundamental frequency (F0) from the speech 42 every 10 milliseconds; an F0 average storage unit 78 that stores in advance the average F0 of utterance data of the speaker of the speech 42, prepared beforehand separately from the speech 42; a speaker normalization unit 80 that stores the F0 values extracted by the F0 extraction unit 76 every 10 milliseconds for a predetermined duration (150 milliseconds in this embodiment) and, in response to phrase boundary information 84, outputs a vector whose elements are the F0 values of the speaker's speech 42 normalized by subtracting the speaker's average F0 stored in the F0 average storage unit 78; and a classifier 82 that receives, as a feature vector, the concatenation of the vector output by the time-series part-of-speech information storage unit 74 and the vector output by the speaker normalization unit 80, and estimates and outputs the discourse function 50. The phrase boundaries from which the phrase boundary information 84 is generated are assumed to be known in this embodiment and in the second and third embodiments described later. F0 information covering 150 milliseconds is used because one Japanese mora corresponds to roughly 150 milliseconds. This length may be changed as appropriate.

The classifier 82 is an SVM in this embodiment and has been trained in advance by machine learning, using training data, to estimate the discourse function 50 from the feature vector. In this embodiment, the classifier 82 has been trained to identify the discourse function 50 as one of k (turn keeping), g (turn giving), and o (other).

<Operation>
When the speaker's speech 42 is input, the speech recognition device 70 performs speech recognition on the speech 42 and outputs text data corresponding to the content of the utterance. The morphological analysis unit 72 receives this output, performs morphological analysis with reference to an attached morphological analysis dictionary (not shown), and outputs a morpheme sequence annotated with part-of-speech information. The time-series part-of-speech information storage unit 74 stores the part-of-speech information of a predetermined number of the morphemes in this sequence as a time series.

The F0 extraction unit 76 extracts F0 from the speech 42 every 10 milliseconds and passes it to the speaker normalization unit 80. The speaker normalization unit 80 normalizes each F0 value received from the F0 extraction unit 76 by subtracting the average F0 stored in the F0 average storage unit 78, and stores the latest 150 milliseconds of values. In response to receiving the phrase boundary information 84, it generates a vector whose elements are the stored 150 milliseconds of F0 values and passes it to the classifier 82.
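The buffering and normalization performed by the speaker normalization unit 80 might be sketched as follows. The class name `SpeakerNormalizer` and its methods are hypothetical; the 10 ms frame period and the 150 ms window come from the text. The sketch assumes that every frame is voiced (F0 > 0), because the text does not say how unvoiced frames are handled.

```python
import math
from collections import deque

FRAME_MS = 10    # F0 is extracted every 10 ms (from the text)
WINDOW_MS = 150  # the latest 150 ms are kept (from the text)


class SpeakerNormalizer:
    """Hypothetical stand-in for the speaker normalization unit 80."""

    def __init__(self, speaker_mean_log_f0):
        # Speaker's average log F0, prepared in advance from separate speech.
        self.mean = speaker_mean_log_f0
        # A bounded deque keeps only the latest 15 frames (150 ms / 10 ms).
        self.buffer = deque(maxlen=WINDOW_MS // FRAME_MS)

    def push(self, f0_hz):
        """Normalize one frame: log F0 minus the speaker's mean log F0.

        Assumes f0_hz > 0; math.log raises ValueError otherwise.
        """
        self.buffer.append(math.log(f0_hz) - self.mean)

    def on_phrase_boundary(self):
        """Return the buffered frames as the second vector."""
        return list(self.buffer)
```

Subtracting the mean in the log domain removes speaker-dependent pitch level, so the vector mainly reflects the pitch movement before the phrase end.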

The input to the classifier 82 is the feature vector formed by concatenating the vector from the time-series part-of-speech information storage unit 74 and the vector from the speaker normalization unit 80. On the basis of this feature vector, the classifier 82 estimates and outputs the discourse function (one of k, g, and o) at the phrase end of the utterance represented by the speech 42. The head motion 46 shown in FIG. 1 can be controlled on the basis of this discourse function.
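As a rough illustration of this step, the sketch below concatenates the two vectors and applies a one-vs-rest linear decision rule of the kind a linear SVM produces after training. The text does not specify the SVM kernel or the multi-class scheme, so both are assumptions, and the `weights` and `biases` arguments stand for already-trained parameters that this sketch does not compute.

```python
LABELS = ["k", "g", "o"]  # turn keeping, turn giving, other (from the text)


def build_feature_vector(pos_vector, f0_vector):
    """Concatenate the part-of-speech vector and the normalized F0 vector."""
    return list(pos_vector) + list(f0_vector)


def classify(feature, weights, biases):
    """Pick the label whose linear score w.x + b is largest (one-vs-rest).

    `weights` maps each label to a weight list of the same length as
    `feature`; `biases` maps each label to a scalar. Both are assumed to
    come from prior training, which is outside this sketch.
    """
    scores = [
        sum(w * x for w, x in zip(weights[label], feature)) + biases[label]
        for label in LABELS
    ]
    return LABELS[scores.index(max(scores))]
```

A kernel SVM would replace the dot product with a kernel expansion over support vectors, but the final step of choosing the highest-scoring class stays the same.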

In an experiment, the discourse function estimation device 44 according to this embodiment achieved a discourse function recognition accuracy of 69%.

[Second Embodiment]
<Configuration>
FIG. 3 shows a block diagram of a discourse function estimation device 100 according to the second embodiment of the present invention. The discourse function estimation device 100 differs from the discourse function estimation device 44 of the first embodiment in the following respects. In place of the time-series part-of-speech information storage unit 74, it includes a vector generation unit 110 that, on the basis of the morpheme sequence output by the morphological analysis unit 72, generates a vector representing the n-grams of the latest morphemes in a bag-of-words representation (BOW vector) rather than time-series part-of-speech information. It includes a vector normalization unit 111 that normalizes the BOW vector output by the vector generation unit 110. It includes a vector dimension reduction processing unit 112 that reduces the dimension of the BOW vector normalized by the vector normalization unit 111 using LDA (latent Dirichlet allocation) and outputs the reduced-dimension vector. Finally, in place of the classifier 82 of FIG. 2, it includes a classifier 114, consisting of an SVM, that receives as a feature vector the concatenation of the vector from the vector dimension reduction processing unit 112 and the vector output by the speaker normalization unit 80, and classifies the discourse function of the utterance represented by the speech 42 as one of k, g, q (question or response request), and bc (backchannel). The classifier 114 has been trained in advance using training data labeled with these four tags and the normalized F0 that the speaker normalization unit 80 outputs for that training data.

The BOW vector has as many elements as there are words appearing in the entire data used for training. The value of each element is the frequency with which the corresponding word appears in the last phrase of the speech recognition result for the utterance data being processed. Most elements of this vector are therefore 0.
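A minimal sketch of building such a BOW vector is shown below. The function names are hypothetical, phrases are assumed to be already split into words by morphological analysis, and words that never appeared in the training data are simply ignored (an assumption, since the text does not address them).

```python
from collections import Counter


def build_vocabulary(training_phrases):
    """Map each word seen in the training data to a fixed vector index."""
    vocab = sorted({word for phrase in training_phrases for word in phrase})
    return {word: index for index, word in enumerate(vocab)}


def bow_vector(phrase_words, vocab_index):
    """Count how often each vocabulary word occurs in the last phrase."""
    counts = Counter(word for word in phrase_words if word in vocab_index)
    vector = [0] * len(vocab_index)
    for word, count in counts.items():
        vector[vocab_index[word]] = count
    return vector
```

For example, with a four-word vocabulary built from the phrases `["sou", "desu", "ne"]` and `["sou", "desu", "ka"]`, the phrase `["sou", "sou", "ne"]` yields a vector with a 2 at the index of "sou", a 1 at the index of "ne", and 0 elsewhere.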

The vector normalization unit 111 normalizes this BOW vector as follows, using so-called tf-idf. The frequency of occurrence of each word in the entire data used for training is calculated in advance. Each element of the BOW vector is then divided by the frequency of occurrence, in the entire data, of the word corresponding to that element, after which the vector is normalized so that its magnitude is 1.
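The normalization described here can be sketched as below. The function name `normalize_bow` is hypothetical. Note that this follows the procedure as written in the text (plain division by the corpus frequency, then scaling to unit length); common tf-idf variants use a logarithmic inverse document frequency instead, which the text does not mention.

```python
import math


def normalize_bow(bow, corpus_frequency):
    """Divide each count by the word's corpus frequency, then scale to unit length.

    `corpus_frequency[i]` is the number of occurrences, in the whole
    training data, of the word at index i. An all-zero vector is returned
    unchanged to avoid dividing by zero (an assumption for robustness).
    """
    scaled = [
        tf / cf if cf > 0 else 0.0
        for tf, cf in zip(bow, corpus_frequency)
    ]
    norm = math.sqrt(sum(value * value for value in scaled))
    if norm == 0:
        return scaled
    return [value / norm for value in scaled]
```

Dividing by the corpus frequency reduces the weight of words that are common everywhere, and the unit-length scaling makes phrases of different lengths comparable.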

The vector dimension reduction processing unit 112 reduces the dimension of this normalized vector by processing it with LDA. LDA is a probabilistic generative model for large collections of discrete data. It is a hierarchical Bayesian model in which each phrase is regarded as a finite mixture over a set of topics, so each phrase can be represented as a set of topic probabilities. In general, the number of topics handled by LDA is around 100 to 300. Using LDA reduces the size of the vector from the size of the vocabulary to the number of topics. In the experiments described later, the number of topics was set to 512, 1024, 1536, and 2048.

<Operation>
The discourse function estimation device 100 according to the second embodiment operates as follows. The speech recognition device 70, the morphological analysis unit 72, the F0 extraction unit 76, the F0 average storage unit 78, and the speaker normalization unit 80 operate in the same way as in the first embodiment. The vector generation unit 110 generates the BOW vector of the last phrase on the basis of the morpheme sequence output by the morphological analysis unit 72 and passes it to the vector normalization unit 111. The vector normalization unit 111 normalizes the BOW vector according to the procedure described above and passes it to the vector dimension reduction processing unit 112. The vector dimension reduction processing unit 112 applies LDA processing to the normalized BOW vector to generate a vector of reduced dimension.

The classifier 114 receives, as a feature vector, the concatenation of the vector output by the vector dimension reduction processing unit 112 and the vector output by the speaker normalization unit 80, and estimates and outputs the discourse function 102 according to the parameters learned in advance.

In this embodiment, an SVM is used as the classifier 114. However, the present invention is not limited to such an embodiment; any classifier capable of discrimination can be applied. For example, a DNN can be used in place of the SVM.

[Third Embodiment]
<Configuration>
FIG. 4 shows the schematic configuration of a discourse function estimation device 130 according to the third embodiment. Referring to FIG. 4, the discourse function estimation device 130 differs from the discourse function estimation device 100 shown in FIG. 3 in that it includes a bottleneck neural network 140 in place of the vector dimension reduction processing unit 112 shown in FIG. 3, and in that it includes a classifier 142 combining a deep neural network (DNN) and a hidden Markov model (HMM) in place of the SVM-based classifier 114 of FIG. 3. In other respects, the discourse function estimation device 130 is identical to the discourse function estimation device 100. In this embodiment, however, the length of the section over which the F0 of the speech is used is between 150 and 200 milliseconds, and an appropriate value is selected through prior experiments.

The classifier 142 according to this embodiment discriminates among the discourse functions sil (silence), listening, k, g, i, and q.

Like the vector dimension reduction processing unit 112 shown in FIG. 2, the bottleneck neural network 140 serves to reduce the dimension of a vector. The bottleneck neural network 140 has, for example, the configuration shown in FIG. 5. This configuration is merely an example.

Referring to FIG. 5, the bottleneck neural network 140 includes an input layer 152 that receives an input vector 150, an output layer 160 having the same number of nodes as the input layer 152, and a plurality of hidden layers 154, 156, and 158 provided between the input layer 152 and the output layer 160. Of the hidden layers 154, 156, and 158, the hidden layer 156 has fewer nodes than the other layers and is therefore called a bottleneck layer. The bottleneck neural network 140 is trained on a large amount of training data so that, when the input vector 150 is given to the input layer 152, the output vector 162 output from the output layer 160 equals the input vector 150. The trained bottleneck neural network 140 reduces the number of elements of the vector given to the input layer 152 to the number of nodes in the bottleneck layer 156 and then reproduces the same vector as the input vector. In other words, the output of the bottleneck layer 156 can be regarded as carrying, in a small number of values, enough information to reproduce the contents of the input vector 150. Extracting the output of the bottleneck layer 156 as the bottleneck feature 164 therefore reduces the dimension of the input vector.
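The dimension reduction performed by the bottleneck layer can be sketched as follows. This is only a minimal illustration in plain Python: the layer sizes, the tanh activation, the random weights standing in for trained ones, and names such as `BottleneckNet` are assumptions that do not appear in the source, and the training step (making the output reproduce the input) is omitted.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return (weights, biases); random values stand in for trained parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, x):
    """Apply one dense layer with tanh activation (the activation is an assumption)."""
    weights, biases = layer
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


class BottleneckNet:
    """Hypothetical layout: input -> hidden -> bottleneck -> hidden -> output."""

    def __init__(self, n_in, n_hidden, n_bottleneck, seed=0):
        rng = random.Random(seed)
        sizes = [n_in, n_hidden, n_bottleneck, n_hidden, n_in]
        self.layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

    def bottleneck_features(self, x):
        """Run only the first two layers and return the bottleneck layer output."""
        for layer in self.layers[:2]:
            x = dense(layer, x)
        return x

    def reconstruct(self, x):
        """Full forward pass; after training this should approximate the input."""
        for layer in self.layers:
            x = dense(layer, x)
        return x


net = BottleneckNet(n_in=20, n_hidden=12, n_bottleneck=4)
features = net.bottleneck_features([0.5] * 20)
print(len(features))  # 4
```

In this sketch only `bottleneck_features` would be used at estimation time; `reconstruct` exists only to show what the training objective compares against the input.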

FIG. 6 schematically shows the configuration of a DNN 180, the part of the classifier 142 of FIG. 4 that determines the state transition probabilities of the hidden Markov model. Referring to FIG. 6, the DNN 180 includes: an input layer 190 having a plurality of nodes connected to receive a feature vector 182; an output layer 194 having six nodes, provided in correspondence with silence, listening, k, g, i, and q, each of which outputs the probability that, when the feature vector 182 is given, the hidden Markov model transitions to the corresponding state immediately afterward; and a plurality of hidden layers 192 provided between the input layer 190 and the output layer 194.

The upper part of FIG. 7 illustrates the relationship between state transitions of the hidden Markov model and the outputs (features) at each state. The example shows the state moving from silence through states s1 and s2 and back to silence. When the state is silence, the feature sil is obtained as the output. Similarly, the feature f1 is obtained when the state is s1, and the feature f2 is obtained when the state is s2. The features f1 and f2 are vectors output according to probability density functions obtained in advance by learning for the states s1 and s2, respectively.

When the feature vector 182 is given to the input layer 190 of the DNN 180, the DNN 180 outputs a probability vector indicating with what probability the hidden Markov model will next transition to each state.

As shown in the lower part of FIG. 7, the classifier 142 selects, by likelihood calculation, the maximum likelihood sequence 218, that is, the sequence of transitions from state 210 to state 212 with the highest probability. In the example shown in FIG. 7, the path through states 210, 214, 216, and 212 has a higher probability than the other paths and is therefore selected as the maximum likelihood path. In this case, if state 212 represents the end of a phrase, the discourse function corresponding to state 216 is estimated to be the discourse function at the end of the phrase.
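Selecting the maximum likelihood state sequence can be sketched with a standard Viterbi search. The source does not say which algorithm the classifier 142 uses, so Viterbi is an assumption here, as are the fixed transition table, the uniform initial distribution, and the made-up per-frame probabilities standing in for DNN outputs.

```python
import math

STATES = ["sil", "listening", "k", "g", "i", "q"]


def viterbi(frame_probs, transition, initial):
    """Return the most likely state sequence.

    frame_probs: list of dicts, state -> probability for each frame
                 (stand-in for DNN outputs; values here are made up).
    transition:  transition[a][b] = P(next state is b | current state is a).
    initial:     state -> probability of starting in that state.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[s] = (log-probability of the best path ending in s, that path)
    best = {s: (log(initial[s]) + log(frame_probs[0][s]), [s]) for s in STATES}
    for probs in frame_probs[1:]:
        new_best = {}
        for s in STATES:
            score, path = max(
                ((best[p][0] + log(transition[p][s]), best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            new_best[s] = (score + log(probs[s]), path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


def peaked(state):
    """Toy frame distribution: 0.9 on one state, 0.02 on each of the others."""
    return {s: (0.9 if s == state else 0.02) for s in STATES}


# Toy model: staying in a state is more likely than switching.
transition = {a: {b: (0.5 if a == b else 0.1) for b in STATES} for a in STATES}
initial = {s: 1.0 / len(STATES) for s in STATES}
frames = [peaked(s) for s in ["sil", "k", "k", "q", "sil"]]

path = viterbi(frames, transition, initial)
print(path)      # ['sil', 'k', 'k', 'q', 'sil']
print(path[-2])  # q
```

If the final state is taken to mark the phrase end, the state just before it (`path[-2]`) plays the role of the phrase-final discourse function in this toy example.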

<Operation>
The discourse function estimation device 130 according to the third embodiment operates as follows. The speech recognition device 70, morphological analysis unit 72, vector generation unit 110, vector normalization unit 111, F0 extraction unit 76, and F0 average storage unit 78 operate in the same manner as in the second embodiment. The bottleneck neural network 140 receives the vector output by the vector normalization unit 111 and outputs bottleneck features. The speaker normalization unit 80 normalizes the F0 values obtained every 10 milliseconds from the speech of the immediately preceding predetermined period, and supplies the classifier 142 with a vector whose elements are the latest predetermined number of those normalized values.
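The behavior of the speaker normalization unit 80 can be sketched as a fixed-size rolling window. The class name `SpeakerNormalizer`, the default window of 15 frames (150 ms at a 10 ms step), and the sample values are assumptions for illustration. The claims describe normalization of the logarithmic fundamental frequency; this sketch leaves the choice of scale to the caller and does not handle unvoiced frames.

```python
from collections import deque


class SpeakerNormalizer:
    """Keeps the latest `window` normalized F0 values, one per 10 ms frame."""

    def __init__(self, mean_f0, window=15):
        # window=15 frames corresponds to 150 ms at a 10 ms frame step
        self.mean_f0 = mean_f0
        self.frames = deque(maxlen=window)

    def push(self, f0):
        """Normalize one F0 sample by subtracting the stored speaker mean."""
        self.frames.append(f0 - self.mean_f0)

    def vector(self):
        """Return the prosodic feature vector, oldest frame first."""
        return list(self.frames)


normalizer = SpeakerNormalizer(mean_f0=100)
for i in range(20):
    normalizer.push(100 + i)  # made-up F0 values

prosody = normalizer.vector()
print(len(prosody))  # 15
print(prosody[0], prosody[-1])  # 5 19
```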

The classifier 142 receives the feature vector formed by concatenating the vector from the bottleneck neural network 140 and the vector from the speaker normalization unit 80, estimates the sequence of discourse function states immediately before the phrase boundary, and outputs the last discourse function in that sequence as the discourse function 132.

[Fourth Embodiment]
<Configuration>
FIG. 8 shows a schematic configuration of a discourse function estimation device 250 according to the fourth embodiment of the present invention. Referring to FIG. 8, the discourse function estimation device 250 differs from the discourse function estimation device 130 according to the third embodiment in the following respects. In place of the speech recognition device 70, it includes a speech recognition device 260 that, in addition to performing speech recognition and outputting text data, detects the phrase end of an utterance and outputs a signal specifying the phrase-end section. In place of the speaker normalization unit 80, it includes a speaker normalization unit 262 and a phrase-end section normalization unit 264. The speaker normalization unit 262 normalizes the F0 output by the F0 extraction unit 76 every 10 milliseconds by subtracting the F0 average value stored in the F0 average storage unit 78, stores a plurality of the normalized values, and, in response to the signal output by the speech recognition device 260, outputs as a vector the F0 sequence corresponding to the period indicated by that signal. The phrase-end section normalization unit 264 uses the F0 output by the speaker normalization unit 262 to output a fixed-length vector representing the prosody of the phrase end. Finally, in place of the classifier 142 of the third embodiment, it includes a classifier 266 that receives, as a feature vector, the concatenation of the vector of bottleneck features output by the bottleneck neural network 140 and the fixed-length vector representing the prosody of the phrase-end section output by the phrase-end section normalization unit 264, and estimates and outputs the discourse function 252 corresponding to that feature vector.

The number of elements in the vector output by the speaker normalization unit 262 varies with the length of the phrase-end section. The phrase-end section normalization unit 264 normalizes this variable-length vector to a fixed length. For example, the phrase-end section normalization unit 264 classifies the intonation represented by the input variable-length vector according to a first category (rising, falling, flat, falling-rising, and so on) and a second category (short, long, very long), and outputs a fixed-length vector representing the combination of these categories. Alternatively, the phrase-end section normalization unit 264 may output a fixed-length vector by downsampling the variable number of F0 values to a fixed number.
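Both ways of obtaining a fixed-length vector can be sketched as follows. The thresholds (`flat_tol`, `long_ms`, `very_long_ms`), the restriction to three tone shapes, the one-hot layout, and the use of linear interpolation for downsampling are all assumptions made for illustration; the source names the categories but gives no decision rules.

```python
def categorize(f0_seq, frame_ms=10, flat_tol=0.05, long_ms=200, very_long_ms=400):
    """Return [rising, falling, flat, short, long, very_long] as 0/1 flags.

    Only three tone shapes are modeled here; the source also mentions
    falling-rising and others. All thresholds are hypothetical.
    """
    if not f0_seq:
        raise ValueError("empty F0 sequence")
    delta = f0_seq[-1] - f0_seq[0]
    if delta > flat_tol:
        tone = 0  # rising
    elif delta < -flat_tol:
        tone = 1  # falling
    else:
        tone = 2  # flat
    duration_ms = len(f0_seq) * frame_ms
    if duration_ms < long_ms:
        length = 0  # short
    elif duration_ms < very_long_ms:
        length = 1  # long
    else:
        length = 2  # very long
    vec = [0] * 6
    vec[tone] = 1
    vec[3 + length] = 1
    return vec


def downsample(f0_seq, n_out):
    """Resample a variable-length F0 sequence to n_out points by linear interpolation."""
    if not f0_seq:
        raise ValueError("empty F0 sequence")
    if len(f0_seq) == 1 or n_out == 1:
        return [f0_seq[0]] * n_out
    out = []
    for k in range(n_out):
        pos = k * (len(f0_seq) - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, len(f0_seq) - 1)
        frac = pos - lo
        out.append(f0_seq[lo] * (1 - frac) + f0_seq[hi] * frac)
    return out


print(categorize([0.0, 0.1, 0.3]))                # [1, 0, 0, 1, 0, 0]
print(downsample([0.0, 1.0, 2.0, 3.0, 4.0], 3))   # [0.0, 2.0, 4.0]
```

Either function yields a vector whose length does not depend on the length of the phrase-end section, which is what allows it to be concatenated with the linguistic features.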

If the dimension of the vector output by the phrase-end section normalization unit 264 is the same as in the third embodiment, the classifier 266 may have the same configuration as the classifier 142. Needless to say, however, the training data must be changed.

<Operation>
The speech recognition device 260 of the discourse function estimation device 250 performs speech recognition on the speech 42 and outputs text data, and also detects the phrase end and supplies the speaker normalization unit 262 with a signal specifying the phrase-end section. The morphological analysis unit 72, vector generation unit 110, vector normalization unit 111, and bottleneck neural network 140 operate as in the third embodiment and supply the classifier 266 with the feature vector obtained from the linguistic information. The F0 extraction unit 76 calculates the F0 of the speech 42 every 10 milliseconds and supplies it to the speaker normalization unit 262. The speaker normalization unit 262 normalizes each value by subtracting the F0 average stored in the F0 average storage unit 78 and stores the result as a time series. When the speech recognition device 260 supplies the signal specifying the phrase-end period, the speaker normalization unit 262 supplies the normalized F0 sequence for that period to the phrase-end section normalization unit 264. The phrase-end section normalization unit 264 classifies this F0 sequence according to the two kinds of categories described above and supplies data indicating the resulting categories to the classifier 266 in vector form. The classifier 266 receives, as a feature vector, the concatenation of the linguistic feature vector from the bottleneck neural network 140 and the prosodic feature vector from the phrase-end section normalization unit 264, estimates the discourse function at the phrase end according to the learned parameters, and outputs the discourse function 252.
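The storage and retrieval steps of the speaker normalization unit 262, followed by the concatenation performed for the classifier 266, can be sketched as below. The class name `F0Buffer`, the millisecond-based `segment` interface, and all numeric values are hypothetical; the source specifies only the 10 ms step, the mean subtraction, and that the sequence for the signaled period is passed on.

```python
class F0Buffer:
    """Stores normalized F0 per 10 ms frame and returns the frames for a period."""

    def __init__(self, mean_f0, frame_ms=10):
        self.mean_f0 = mean_f0
        self.frame_ms = frame_ms
        self.series = []

    def push(self, f0):
        """Normalize one F0 sample by mean subtraction and append it to the series."""
        self.series.append(f0 - self.mean_f0)

    def segment(self, start_ms, end_ms):
        """Return normalized F0 for frames starting in [start_ms, end_ms)."""
        first = start_ms // self.frame_ms
        last = end_ms // self.frame_ms
        return self.series[first:last]


buf = F0Buffer(mean_f0=100)
for i in range(50):
    buf.push(100 + i)  # made-up F0 values, one per 10 ms

# Hypothetical phrase-end signal: the section from 300 ms to 350 ms.
phrase_end_f0 = buf.segment(300, 350)
print(phrase_end_f0)  # [30, 31, 32, 33, 34]

# Concatenate a (made-up) linguistic feature vector with the prosodic vector.
linguistic = [0.2, -0.1, 0.7]
feature_vector = linguistic + phrase_end_f0
print(len(feature_vector))  # 8
```

In the device described above, the variable-length segment would first pass through the phrase-end section normalization unit 264 before concatenation; that step is skipped here to keep the sketch short.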

[Experimental Results]
The following experiments were conducted using the configuration of the discourse function estimation device 100 of the second embodiment described above. In a preliminary experiment that estimated the discourse function from linguistic information alone, without prosodic information (F0), using LDA in the vector dimension reduction processing unit 112 gave higher accuracy than using any of BOW or POS unigrams, bigrams, and trigrams. The accuracy obtained with LDA in the vector dimension reduction processing unit 112 using linguistic information only was therefore compared with the accuracy obtained with LDA when prosodic information was added to the linguistic information. The results are shown in FIG. 9.

Referring to FIG. 9, the horizontal axis represents the number of LDA topics and the vertical axis represents the prediction accuracy. Graph 300 shows the prediction accuracy when only linguistic information is used, and graph 302 shows the prediction accuracy when prosodic information is added to the linguistic information. As the graphs make clear, adding prosodic information substantially increased the prediction accuracy. In addition, because accuracy with 100 topics differs clearly from accuracy with the other topic counts, reducing the number of topics to 100 evidently loses part of the information and lowers the accuracy.

FIG. 10 shows the results of an experiment in which LDA was used as the vector dimension reduction processing unit 112 and a DNN was used in place of the SVM as the classifier 114 of FIG. 3. In the experiment, the number of nodes in each hidden layer of the DNN was varied from 512 to 2048 in steps of 512, and the change in prediction accuracy was examined. As the graph shows, when a DNN is used, accuracy improves as the number of hidden-layer nodes increases. Moreover, except when the number of hidden-layer nodes is 512, the accuracy is higher than when an SVM is used as the classifier 114.

[Effects of the Embodiments]
As described above, according to the embodiments of the present invention, the discourse function at the end of a phrase is estimated by taking into account not only linguistic information but also prosodic information at the end of the phrase. The discourse function can therefore be estimated with higher accuracy than when only linguistic information is used. Furthermore, by using an SVM, a DNN, or a combination of a hidden Markov model and a DNN as the classifier, the discourse function at the end of a phrase can be estimated with stable, high accuracy that reflects the training results. Building a human-machine interface on this discourse function therefore enables more natural interaction.

[Implementation by Computer]
The discourse function estimation device according to each embodiment of the present invention can be realized by computer hardware and a computer program executed on that hardware. FIG. 11 shows the external appearance of such a computer system 530, and FIG. 12 shows the internal configuration of the computer system 530.

Referring to FIG. 11, the computer system 530 includes a computer 540 having a memory port 552 and a DVD (Digital Versatile Disc) drive 550, a keyboard 546, a mouse 548, and a monitor 542.

Referring to FIG. 12, in addition to the memory port 552 and the DVD drive 550, the computer 540 includes a CPU (central processing unit) 556; a bus 566 connected to the CPU 556, the memory port 552, and the DVD drive 550; a read-only memory (ROM) 558 that stores a boot program and the like; a random access memory (RAM) 560 that is connected to the bus 566 and stores program instructions, system programs, working data, and the like; and a hard disk 554. The computer system 530 further includes a sound board 568 that is connected to the bus 566 and digitizes audio signals into a form the computer can process, and a network interface card (NIC) 574 that provides a connection to a network 568 enabling communication with other terminals. A microphone 570 is connected to the sound board 568.

The computer program that causes the computer system 530 to function as the functional units of the discourse function estimation device according to each of the above embodiments is stored on a DVD 562 or a removable memory 564 loaded into the DVD drive 550 or the memory port 552, and is then transferred to the hard disk 554. Alternatively, the program may be transmitted to the computer 540 through the network 568 and stored on the hard disk 554. The program is loaded into the RAM 560 at execution time. The program may also be loaded into the RAM 560 directly from the DVD 562, from the removable memory 564, or via the network 568.

This program includes an instruction sequence consisting of a plurality of instructions for causing the computer 540 to function as the functional units of the discourse function estimation devices 44, 100, 130, and 250 according to the above embodiments. Some of the basic functions needed to make the computer 540 perform this operation are provided by the operating system running on the computer 540, by third-party programs, or by various dynamically linkable programming toolkits or program libraries installed on the computer 540. The program itself therefore need not include all of the functions necessary to realize the system, device, and method of this embodiment. The program need only include those instructions that realize the functions of the system, device, or method described above by dynamically calling, at run time and in a controlled manner that yields the desired result, the appropriate functions or the appropriate programs in a programming toolkit or program library. Of course, the program alone may provide all of the necessary functions.

The embodiments disclosed herein are merely illustrative, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each claim of the claims, with due consideration of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording recited therein.

40 Operator
42 Speech
44, 100, 130, 250 Discourse function estimation device
46 Head movement
48 Robot
50, 102, 132, 252 Discourse function
70, 260 Speech recognition device
72 Morphological analysis unit
74 Time-series part-of-speech information storage unit
76 F0 extraction unit
78 F0 average storage unit
80, 262 Speaker normalization unit
82, 114, 142, 266 Classifier
84 Phrase boundary information
110 Vector generation unit
111 Vector normalization unit
112 Vector dimension reduction processing unit
140 Bottleneck neural network
150 Input vector
152 Input layer
154, 158, 192 Hidden layer
156 Bottleneck layer
160, 194 Output layer
162 Output vector
164 Bottleneck feature
180 DNN
182 Feature vector
190 Input layer
210, 212, 214, 216 Discourse function state
218 Maximum likelihood sequence
264 Phrase-end section normalization unit

Claims

A discourse function estimation device comprising:
first vector generation means for receiving text data of an utterance and performing morphological analysis on the text data to generate a first vector for estimating a discourse function of the utterance;
second vector generation means for extracting a fundamental frequency component from the speech signal corresponding to the utterance, within a predetermined section immediately before a phrase end detected during the utterance, and generating a second vector representing the change in the fundamental frequency component; and
classification means, trained in advance by machine learning, for receiving as input a feature vector composed of the first vector and the second vector and classifying the discourse function of the utterance at the phrase end into one of a plurality of predetermined discourse functions.

The second vector generation means includes
Dividing means for dividing the predetermined section immediately before the end of the phrase detected during utterance into a plurality of divided sections;
The discourse function estimation apparatus according to claim 1, further comprising means for generating the second vector using the fundamental frequency of each divided section divided by the dividing means as an element.

The discourse function estimation device according to claim 1 or 2, wherein the first vector generation means includes:
morphological analysis means for receiving text data of the utterance, performing morphological analysis on the text data, and outputting a morpheme sequence;
morpheme sequence storage means for storing, in time series, the morpheme sequence output by the morphological analysis means; and
means for generating the first vector, whose elements include at least part-of-speech information obtained from each of the latest predetermined number of morphemes stored in the morpheme sequence storage means, and outputting the first vector to the classification means.

The discourse function estimation device according to claim 1, wherein the first vector generation means includes:
morphological analysis means for receiving text data of the utterance, performing morphological analysis on the text data, and outputting a morpheme sequence;
BOW vector generation means for generating a BOW vector representing the bag of words (BOW) appearing in the morpheme sequence;
BOW vector normalization means for normalizing the elements of the BOW vector generated by the BOW vector generation means by the appearance frequency of each word in a predetermined data set and the appearance frequency of each word in the utterance, and outputting a normalized BOW vector; and
dimension reduction means for reducing the dimension of the normalized BOW vector output by the BOW vector normalization means and outputting the result as the first vector.

The dimension reduction means includes means for reducing the dimension of the normalized BOW vector output from the BOW vector normalization means by a latent Dirichlet allocation method (LDA) to generate the first vector. The discourse function estimation apparatus according to claim 4.

The dimension reduction means includes
A bottleneck neural network trained in advance so that the input and the output are equal, connected to receive the normalized BOW vector output from the BOW vector normalizing means;
Means for generating the first vector using as an element the value output from each node of the bottleneck layer of the bottleneck neural network in response to the provision of the normalized BOW vector. The discourse function estimation apparatus according to claim 4.

The classification means includes a support vector machine that has received the feature vector as an input and has been learned to classify the discourse function of the utterance at the end of the phrase into one of a plurality of predetermined discourse functions. The discourse function estimation apparatus according to claim 6.

The discourse function estimation device according to any one of claims 1 to 6, wherein the classification means includes:
a hidden Markov model expressing the transition paths of hidden states corresponding to the discourse functions of the utterance and the output probability of each element of the feature vector in each hidden state;
a deep neural network trained in advance by machine learning to receive the feature vector as input and output the probability that the hidden Markov model transitions to each of the hidden states following the hidden state that output the feature vector;
maximum likelihood estimation means for estimating, based on the feature vector, the hidden Markov model, and the output of the deep neural network, a maximum likelihood path as the transition path of the hidden states of the utterance; and
means for estimating the discourse function of the utterance based on the path estimated by the maximum likelihood estimation means.

The second vector generation means includes
A fundamental frequency extracting means for extracting a fundamental frequency component from the speech signal in a predetermined section immediately before the end of a phrase detected during the speech and storing it as a logarithmic fundamental frequency component in the speech signal corresponding to the utterance;
Basic frequency average storage means for storing an average value of logarithm of the fundamental frequency of the voice of the speaker of the utterance extracted in advance;
The logarithmic fundamental frequency component is normalized by subtracting the average value stored in the fundamental frequency average storage means from the logarithmic fundamental frequency component extracted by the fundamental frequency extracting means, and the normalized logarithmic fundamental frequency The discourse function estimation apparatus according to any one of claims 1 to 8, further comprising: means for generating the second vector using a component as an element.

Fundamental frequency calculation means for calculating a logarithm of the fundamental frequency of the voice of the speaker in the utterance every predetermined time;
The discourse function according to claim 9, further comprising means for calculating an average value of logarithm of the fundamental frequency calculated every predetermined time by the fundamental frequency calculating means and storing the average value in the fundamental frequency average storage means. Estimating device.

The discourse function estimation device according to claim 1, further comprising phrase end detection means for detecting the phrase end of the utterance and outputting a phrase end signal,
wherein the first vector generation means and the second vector generation means generate and output the first vector and the second vector, respectively, from the text data and the speech signal immediately before the phrase end detected by the phrase end detection means.

The discourse function estimation device, wherein the phrase end detection means includes:
a speech recognition device that performs speech recognition on the utterance and outputs the text data; and
phrase end specifying means for specifying, from the phoneme information immediately before the phrase end of the text data output by the speech recognition device, a phrase end section to be treated as the phrase end,
and wherein the second vector generation means includes:
means for dividing the phrase end section into partial sections each having a predetermined length and extracting the logarithm of the fundamental frequency of each partial section; and
means for generating the second vector of fixed length based on the relationship among the logarithms of the fundamental frequency of the partial sections extracted by the means for extracting.

A computer program that causes a computer to function as the discourse function estimation device according to any one of claims 1 to 12.