JP5385677B2

JP5385677B2 - Dialog state dividing apparatus and method, program and recording medium

Info

Publication number: JP5385677B2
Application number: JP2009115499A
Authority: JP
Inventors: 済央野本; 敏高橋; 理吉岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-05-12
Filing date: 2009-05-12
Publication date: 2014-01-08
Anticipated expiration: 2029-05-12
Also published as: JP2010266522A

Description

The present invention relates to an apparatus and a method for classifying speech data of a conversation between two speakers according to the state of the dialogue, and to a program and a recording medium therefor.

In recent years, mining techniques for acquiring knowledge from large amounts of accumulated data have attracted attention. For example, text mining is used to investigate the general public's opinion of a product, and trends in that opinion, from blogs written on the Web by unspecified authors and from free-text questionnaires about the product.

Text mining techniques include word ranking and topic classification. For example, when multiple text documents such as free-text product questionnaires or blog articles are collected and their topic trends are examined, word ranking by document frequency (hereinafter, DF) is used to find out which topics appear and how often. DF is the number of documents that contain a given word.

Such mining techniques are attracting attention in the field of CRM (Customer Relationship Management), where attempts are being made to analyze records of customer interactions in order to uncover customer needs and improve CS (Customer Satisfaction). One kind of CRM analysis data is recordings of telephone conversations between operators and customers at call centers (hereinafter, call recordings).

When a document obtained by transcribing call recordings, either by speech recognition or by hand, is analyzed with word ranking or similar techniques, simply analyzing the whole range from the start to the end of the call rarely yields the intended call content. A single call can be divided into several states: the customer explains to the operator why they called, the operator verifies the customer's identity, the operator explains something in response to the customer's request, and so on. Dividing the conversation into such states and analyzing each state separately can therefore be expected to improve the accuracy of data mining. In other words, the accuracy of data analysis can be expected to improve if the dialogue is divided according to whether the customer is stating their request, the operator is eliciting information from the customer, or the operator is answering.

As related prior art, a text segmentation technique that divides text based on word occurrence tendencies is known, as disclosed for example in Non-Patent Document 1. Text segmentation divides documents such as newspaper articles and novels into semantically coherent units. However, no technique has been reported so far that divides and classifies the dialogue states of a conversation between two parties, such as a call recording.

Marti A. Hearst. Multi-Paragraph Segmentation of Expository Text. 32nd Annual Meeting of the Association for Computational Linguistics, pp. 9-16, 1994.

To use conventional text segmentation, the call recording must first be transcribed into text. Manual transcription is costly. If the recording is instead converted to text automatically by speech recognition, recognition errors in the result may degrade segmentation accuracy.

The present invention has been made in view of these points. Its object is to provide a dialog state dividing apparatus, a method therefor, a program therefor, and a recording medium that can reduce the cost of transcription and avoid the loss of segmentation accuracy caused by recognition errors.

The dialog state dividing apparatus of the present invention comprises an utterance section detection unit, a frame extraction unit, an intra-frame utterance time ratio calculation unit, a frame representative score calculation unit, and a dialog state classification unit. The utterance section detection unit takes as input speech data of a conversation between two parties, speaker A and speaker B, and detects the utterance sections of each. The frame extraction unit arranges the utterance sections in order of elapsed time and outputs a predetermined number of utterance sections as one frame. The intra-frame utterance time ratio calculation unit calculates, for each frame, the intra-frame utterance time ratio R_j, which is the total utterance time of speaker A or speaker B within the frame divided by the sum of the total utterance times of speaker A and speaker B within that frame. The frame representative score calculation unit determines the intra-frame utterance time ratio R_j, or a smoothed R_j, as the representative score of the frame. The dialog state classification unit classifies each frame into one of at least three dialog states by comparing the utterance time ratio with at least two thresholds.

According to the dialog state dividing apparatus of the present invention, a dialogue can be divided, without using text information, into at least three states: "the customer is stating their request", "the operator is eliciting information from the customer", and "the operator is answering". Because the dialogue is divided using the speakers' utterance time ratio, there is no cost for converting the conversation into text. Nor is the result affected by the recognition errors that would arise if speech recognition were used for transcription.

FIG. 1 is a diagram showing an example of a customer service conversation at a call center.
FIG. 2 is a diagram showing an example functional configuration of the dialog state dividing apparatus 100 of the present invention.
FIG. 3 is a diagram showing the operation flow of the dialog state dividing apparatus 100.
FIG. 4 is a diagram showing an example functional configuration of the utterance section detection unit 10.
FIG. 5 is a diagram showing an example functional configuration of the utterance separation unit 20.
FIG. 6 is a diagram showing an example functional configuration of the frame extraction units 11 and 11′.
FIG. 7 is a diagram showing an example in which the utterance sections of the two parties are arranged in order of utterance time.
FIG. 8 is a diagram showing an example functional configuration of the intra-frame utterance time ratio calculation unit 12.
FIG. 9 is a diagram conceptually showing an example of the output signal of the frame representative score calculation unit 13.
FIG. 10 is a diagram showing the operation flow of the frame representative score calculation unit 13′.
FIG. 11 is a diagram conceptually showing an example of the output signal of the frame representative score calculation unit 13′.
FIG. 12 is a diagram showing the operation flow of the dialog state dividing unit 14.
FIG. 13 is a diagram showing an example of the result of operating the dialog state dividing unit 14.
FIG. 14 is a diagram showing an example of a frame made up of backchannel utterance sections.

Before describing embodiments of the present invention, the idea behind the invention is explained.
[Concept of this invention]
The dialog state dividing method of the present invention divides a dialogue by focusing on which of the two speakers holds the initiative in the conversation. FIG. 1 shows an example of a dialogue between a customer and an operator at a call center. The horizontal direction in FIG. 1 is elapsed time; the operator's utterance sections are shown above the center line representing elapsed time, and the customer's utterance sections below it.

The conversation between a customer and an operator at a call center or the like generally proceeds in the following order: the state in which the customer is stating their request, U_R (hereinafter, state U_R); the state in which the operator is eliciting customer information, O_H (hereinafter, state O_H); and the state in which the operator is answering, O_A (hereinafter, state O_A). This flow can be divided by looking at which speaker is talking and for how long.

The dialog state dividing method of the present invention obtains the ratio of the speakers' utterance times and uses it to divide the dialogue into three states: sections in which the customer speaks longer than the operator are state U_R, sections in which the operator and the customer speak for about the same time are state O_H, and sections in which the operator speaks longer than the customer are state O_A. This method requires no conversion of the conversation into text. The cost of such conversion is therefore avoided, and the dialogue can be divided without being affected by conversion errors.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in multiple drawings are given the same reference numerals, and their description is not repeated.

FIG. 2 shows an example functional configuration of the dialog state dividing apparatus 100 of the present invention, and FIG. 3 shows its operation flow. The dialog state dividing apparatus 100 comprises an utterance section detection unit 10, a frame extraction unit 11, an intra-frame utterance time ratio calculation unit 12, a frame representative score calculation unit 13, a dialog state dividing unit 14, and a control unit 15. The dialog state dividing apparatus 100 is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program.

The utterance section detection unit 10 takes as input speech data of a conversation between two parties and detects the utterance sections of each party (step S10). The frame extraction unit 11 takes as input the utterance sections of one speaker and of the other speaker output by the utterance section detection unit 10, arranges them in order of elapsed time, and outputs each group of a predetermined number of utterance sections as one frame (step S11).

The intra-frame utterance time ratio calculation unit 12 calculates, for each frame, the ratio of the speakers' utterance times within the frame (step S12). The frame representative score calculation unit 13 determines, from that time ratio, the representative score R_j^ of each frame in units of utterance sections (step S13); the hat ^ is positioned correctly in the figures. The dialog state dividing unit 14 classifies each frame into one of three dialog states by comparing the representative score R_j^ with two thresholds (step S14). The control unit 15 controls the operation of each unit so that steps S10 to S14 are repeated until all frames have been classified (step S15).

With this dialog state dividing apparatus 100, a predetermined number of the two parties' utterance sections are arranged in order of elapsed time to form one frame, and the ratio of the two parties' utterance times is obtained for each frame. The representative score R_j^ of each frame is then determined from that time ratio in units of utterance sections. By comparing the representative score R_j^ with two thresholds, the speech data of the conversation can be divided into three dialog states: state U_R, state O_H, and state O_A.

Because text information is not needed, unlike in the prior art, the cost is low, and dialog states can be divided and classified without being affected by the conversion errors that occur when speech data is converted into text.

Below, an example functional configuration of each unit of the dialog state dividing apparatus 100 is shown and its operation is described in more detail.

[Utterance section detection unit]
FIG. 4 shows an example functional configuration of the utterance section detection unit 10. The example in FIG. 4 is for the case where the speech data of the conversation is given as a two-channel (stereo) signal in which the two speakers are already separated.

The utterance section detection unit 10 comprises power calculation means 101a and 101b and speech section detection means 102a and 102b. The power calculation means 101a and 101b take as input the speech data of one speaker and of the other speaker, respectively, and calculate the speech power of each. The speech section detection means 102a and 102b take the respective speech power as input, compare it with a predetermined threshold, and output each section in which speech power persists for a predetermined time or longer as an utterance section. The utterance section detection unit 10 can be realized with the same configuration as a conventionally known voice switch.
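The power-threshold detection described above can be sketched roughly as follows for one channel. This is only an illustration under stated assumptions, not the patented implementation: the function name `detect_utterance_sections`, the 20 ms analysis window, and the default threshold and minimum duration are all hypothetical. The source specifies only that speech power is compared with a threshold and that a section must persist for a set time.

```python
def detect_utterance_sections(samples, sample_rate, window_sec=0.02,
                              power_threshold=0.01, min_duration_sec=0.2):
    """Return (start_sec, end_sec) sections whose short-time power stays at or
    above power_threshold for at least min_duration_sec.

    All default values are illustrative assumptions.
    """
    window = max(1, round(sample_rate * window_sec))  # samples per window
    n_windows = len(samples) // window
    sections = []
    start = None  # start time of the section currently being tracked
    for w in range(n_windows):
        chunk = samples[w * window:(w + 1) * window]
        power = sum(x * x for x in chunk) / len(chunk)  # mean squared amplitude
        t = w * window / sample_rate
        if power >= power_threshold:
            if start is None:
                start = t
        elif start is not None:
            # The section just ended; keep it only if it lasted long enough.
            if t - start >= min_duration_sec:
                sections.append((start, t))
            start = None
    end = n_windows * window / sample_rate
    if start is not None and end - start >= min_duration_sec:
        sections.append((start, end))
    return sections
```

For a stereo recording, the function would be called once per channel, giving one list of sections for the operator and one for the customer.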

When the speech data of the conversation is given on a single channel, the two parties' utterances must be separated. FIG. 5 shows an example functional configuration of an utterance separation unit 20 for this purpose. The utterance separation unit 20 classifies speakers by applying speech recognition to the speech data.

The utterance separation unit 20 comprises AD conversion means 21, feature extraction means 22, speaker classification means 23, a model parameter recording unit 24, and DA conversion means 25 and 26. The AD conversion means 21 converts the single-channel analog speech data of the conversation into a digital signal. The feature extraction means 22 converts the digitized speech data into a frequency-domain signal, for example by a short-time Fourier transform, and extracts features of the speech data.

The speaker classification means 23 recognizes the speaker by comparing the features with the acoustic model and the language model recorded in the model parameter recording unit 24, and separates the utterances. The separated speech data is converted into an analog signal for each speaker by the DA conversion means 25 and 26.

After the speech data of each speaker has been converted into an analog signal, operation is the same as that of the utterance section detection unit 10 (FIG. 4) described above. Although the utterance separation unit 20 has been described using an example in which the conversation is speech-recognized, speakers may instead be classified by the acoustic characteristics of their voices using only an acoustic model. If the speakers' voice frequencies differ by a sufficient margin, speakers can also be classified with a simple frequency filter.

[Frame extraction unit]
FIG. 6 shows an example functional configuration of the frame extraction unit 11. The frame extraction unit 11 comprises utterance time order arrangement means 111 and frame generation means 112. The utterance time order arrangement means 111 arranges the utterance sections of the two parties in order of utterance start time.

FIG. 7 shows an example in which the utterance sections of the two parties are arranged in order of utterance time. OPE1, enclosed in an ellipse in FIG. 7, denotes the operator's first utterance section; likewise, USR1 denotes the customer's first utterance section. The utterance sections of the conversation shown in FIG. 7, which begins with the operator's "How may I help you today?", are arranged in order of elapsed time by the operation of the utterance section detection unit 10 and the utterance time order arrangement means 111.

The frame generation means 112 groups the utterance sections, arranged in order of elapsed time, into sets of a predetermined number k of utterances, for example k = 3, and outputs each set as one frame F_j (1 ≤ j ≤ N − k + 1). Here N is the total of the number of operator utterances and the number of customer utterances in the conversation. The frame generation means 112 generates frames for the entire call by sliding over the N utterances at a fixed interval, for example one utterance at a time. In the example of FIG. 7, adjacent frames share two utterance sections. Overlapping the frames in this way can be expected to stabilize the time ratio values. Frames may also be formed so that no utterance sections overlap.
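The sliding-window frame generation can be sketched as below. This is a minimal illustration, assuming each utterance section is represented as a `(speaker, start_sec, end_sec)` tuple; that representation and the name `make_frames` are not from the source.

```python
def make_frames(utterances, k=3, step=1):
    """Group time-ordered utterance sections into frames of k sections.

    utterances: iterable of (speaker, start_sec, end_sec) tuples (assumed format).
    step=1 gives overlapping frames that share k-1 sections with their neighbor;
    step=k gives frames with no shared sections.
    """
    ordered = sorted(utterances, key=lambda u: u[1])  # order by start time
    return [ordered[i:i + k] for i in range(0, len(ordered) - k + 1, step)]
```

With N = 5 utterances, k = 3, and a step of one utterance, this yields N − k + 1 = 3 frames, which matches the index range given for F_j.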

[Intra-frame utterance time ratio calculation unit]
FIG. 8 shows an example functional configuration of the intra-frame utterance time ratio calculation unit 12. The intra-frame utterance time ratio calculation unit 12 comprises per-speaker utterance time totaling means 121 and utterance time ratio calculation means 122. The per-speaker utterance time totaling means 121 totals the utterance time for each speaker. For frame F_1 shown in FIG. 7, it calculates the operator's utterance time OPE1 + OPE2 and keeps it separate from the customer's utterance time USR1.

The utterance time ratio calculation means 122 calculates, for each frame F_j, the intra-frame utterance time ratio R_j, which expresses the ratio of the operator's and the customer's utterance times within the frame, using equation (1).
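The equation itself is not reproduced in this text. A reconstruction consistent with the definition of R_j in the summary of the invention, and with the worked example for frame F_1, would be the following, where the symbols T_j^OPE and T_j^USR (the operator's and the customer's total utterance times within frame F_j) are notation assumed here rather than taken from the source:

```latex
R_j = \frac{T_j^{\mathrm{OPE}}}{T_j^{\mathrm{OPE}} + T_j^{\mathrm{USR}}} \qquad (1)
```

The summary allows either speaker's total in the numerator; the operator is used here because the worked example and the interpretation of values near 0 and 1 both assume it.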

For frame F_1 shown in FIG. 7, R_1 = (OPE1 + OPE2) / (OPE1 + USR1 + OPE2).

If the intra-frame utterance time ratio R_j is close to 1, the operator is speaking at length to the customer within that frame. If it is close to 0, the customer is speaking at length to the operator. If it is close to 0.5, the operator and the customer are speaking for about the same amount of time.
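The ratio calculation for one frame can be sketched as follows. It is an illustration under assumptions: utterance sections are `(speaker, start_sec, end_sec)` tuples, the speaker tags `"OPE"` and `"USR"` borrow the labels used in FIG. 7, and the function name is hypothetical.

```python
def utterance_time_ratio(frame):
    """Operator's share of the total utterance time in one frame.

    frame: list of (speaker, start_sec, end_sec) tuples, speaker being
    "OPE" (operator) or "USR" (customer); this format is an assumption.
    Returns a value in [0, 1]: near 1 means the operator dominates,
    near 0 means the customer dominates.
    """
    ope = sum(end - start for spk, start, end in frame if spk == "OPE")
    usr = sum(end - start for spk, start, end in frame if spk == "USR")
    total = ope + usr
    if total == 0:
        raise ValueError("frame contains no utterance time")
    return ope / total
```

For a frame with OPE1 = 2 s, USR1 = 6 s, and OPE2 = 2 s, the ratio is (2 + 2) / (2 + 6 + 2) = 0.4.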

[Frame representative score calculation unit]
The frame representative score calculation unit 13 determines the representative score of each frame in units of utterance sections. FIG. 9 conceptually shows an example of the output signal of the frame representative score calculation unit 13. In FIG. 9, the utterance sections U_i are arranged horizontally in order of elapsed time, and the frames F_j are arranged vertically in order of elapsed time. FIG. 9 is an example in which one frame consists of three utterance sections.

In FIG. 9, the representative score of a frame is assigned as the value of the utterance section in the middle of the frame. Which utterance section in the frame carries the representative score is arbitrary; the frame representative score calculation unit 13 may instead assign it to the first or the last utterance section of the frame.

To suppress fluctuation of the representative score, the average of the utterance time ratios of multiple frames may be computed and used as the representative score. FIG. 10 shows the operation flow of a frame representative score calculation unit 13′ that uses this average as the representative score.

The frame representative score calculation unit 13′ includes storage means that stores, for example, three utterance time ratios in order of elapsed time. Each time a new time ratio is input, the storage means discards the oldest one, so it always holds the three most recent time ratios. Since the storage means can be built from an ordinary memory circuit, its functional configuration is not illustrated.

In step S130, it is determined whether time ratios for three frames have been stored in the storage means. Until time ratios for three frames have been stored (N in step S130), each frame's own time ratio is used as its representative score (step S131).

Once time ratios for three frames are stored in the storage means (Y in step S130), the average of the three time ratios is calculated (step S132) and assigned as the representative score of the middle frame (step S133). Steps S130 to S133 are repeated for all frames (N in step S150).

As a result of this operation, the representative scores of the frames shown in FIG. 9 change as shown in FIG. 11: frame F_{j+1} changes from 0.2 to 0.3, frame F_{j+2} from 0.4 to 0.3, and frame F_{j+3} from 0.3 to 0.4. The value shown for frame F_{j+4} is the value before the moving average, owing to the limits of the drawing. In this way, a moving average of the utterance time ratios of multiple frames may be used as the representative score. Using the average suppresses local fluctuations of the representative score.
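A centered three-frame moving average of the kind described in steps S130 to S133 might look like the sketch below. The function name is hypothetical, and leaving the first and last frames at their raw ratios is an assumption based on the statements that a frame keeps its own ratio until three values are stored and that F_{j+4} is shown before averaging.

```python
def smooth_scores(ratios):
    """Centered 3-point moving average over per-frame time ratios.

    The first and last frames keep their raw ratios because they have no
    neighbor on one side (an assumption about how the edges are handled).
    """
    smoothed = list(ratios)
    for i in range(1, len(ratios) - 1):
        smoothed[i] = (ratios[i - 1] + ratios[i] + ratios[i + 1]) / 3
    return smoothed


# Raw ratios for F_j .. F_{j+4}. The middle three (0.2, 0.4, 0.3) are from the
# text; the outer two (0.3 and 0.5) are values inferred here so that the
# averages reproduce the stated results, not values given in the source.
raw = [0.3, 0.2, 0.4, 0.3, 0.5]
smoothed = smooth_scores(raw)
```

With these inputs the three middle frames come out at approximately 0.3, 0.3, and 0.4, in line with the change described for FIG. 11.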

[Dialog state dividing unit]
FIG. 12 shows the operation flow of the dialog state dividing unit 14. The dialog state dividing unit 14 classifies each frame into one of at least three dialog states by comparing the representative score output by the frame representative score calculation unit 13 with at least two thresholds X and Y.

The dialog state dividing unit 14 first compares the per-frame representative score R_j^ output by the frame representative score calculation unit 13 with threshold X. Threshold X is a predetermined value smaller than 0.5, for example 0.4. If the representative score R_j^ is less than 0.4 (Y in step S140), the frame is classified as the state in which the customer is stating their request, U_R (step S141).

If the representative score R_j^ is greater than threshold X, it is next compared with threshold Y. Threshold Y is a value larger than 0.5, for example 0.6. If the representative score R_j^ is greater than threshold Y (Y in step S142), the frame is classified as the state in which the operator is answering, O_A (step S143).

If the representative score R_j^ is smaller than threshold Y (N in step S142), the frame is classified as the state in which the operator is eliciting customer information, O_H. This classification may be carried out by storing each frame in a memory circuit, or by attaching one of the state labels U_R, O_A, O_H to each frame.
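The two-threshold decision of steps S140 to S143 can be sketched as follows, using the example values X = 0.4 and Y = 0.6 from the text. The function name is hypothetical, and treating a score exactly equal to X or Y as state O_H is an assumption, since the text does not say how ties are handled.

```python
def classify_frame(score, x=0.4, y=0.6):
    """Map a representative score to a dialog state label.

    score < x  -> "U_R" (customer stating the request)
    score > y  -> "O_A" (operator answering)
    otherwise  -> "O_H" (operator eliciting information)
    Scores exactly equal to x or y fall into "O_H" (assumed tie handling).
    """
    if score < x:
        return "U_R"
    if score > y:
        return "O_A"
    return "O_H"
```

Attaching the returned label to each frame corresponds to the labeling option mentioned above.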

With the dialog state dividing unit 14 operating as described, each frame can be assigned to one of three dialog states in units of utterance sections. FIG. 13 shows an example of such a division. The horizontal axis shows the utterance sections U_i arranged in order of elapsed time, and the vertical axis shows the representative score R_j^ of each frame.

Utterance sections up to U_5 are assigned to state U_R, utterance sections U_6 to U_11 to state O_H, and utterance sections from U_12 onward to state O_A. Because the representative score of each frame is determined in units of utterance sections, the dialogue can be divided in units of utterance sections. If one frame consists of three utterance sections, the utterance section U_5, at which state U_R changes to state O_H, corresponds to the middle utterance section of the second frame from the origin in FIG. 13. As noted above, which utterance section in the frame carries the representative score is arbitrary.
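Once every utterance section carries a state label, consecutive sections with the same label can be merged into dialog segments. The sketch below shows one way to do this; the function name and the output format are assumptions, and the label sequence in the usage note assumes a total of 14 utterance sections, which the text does not state.

```python
from itertools import groupby


def merge_labels(labels):
    """Collapse per-utterance state labels into (state, first, last) runs.

    Indices are 1-based so that they line up with U_1, U_2, ... in the text.
    """
    segments = []
    index = 1
    for state, run in groupby(labels):
        length = len(list(run))
        segments.append((state, index, index + length - 1))
        index += length
    return segments
```

For five U_R labels followed by six O_H labels and three O_A labels, the result is `[("U_R", 1, 5), ("O_H", 6, 11), ("O_A", 12, 14)]`, which mirrors the division described for FIG. 13.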

As described above, the dialog state dividing apparatus 100 makes it possible to divide a conversation into dialog states without converting the voice data of the two-party conversation into text information.

[Modification 1]
In the dialog state dividing apparatus 100, the frame extraction unit 11 arranges the utterance sections detected by the utterance section detection unit 10 in order of elapsed time and groups a predetermined number of them into one frame. The utterance sections also include backchannel responses such as "hai" ("yes") and "ee" ("uh-huh").

FIG. 14 shows an example of a frame made up of backchannel utterance sections. The horizontal direction of FIG. 14 represents elapsed time. The figure shows a portion of a conversation that begins with the customer saying "I moved the other day, so I filed a change of address," to which the operator responds with backchannels. When the utterances are arranged in order of elapsed time, the operator's backchannel utterance sections OPE1, OPE2, and OPE3 may be treated as one frame, and the operator may be judged to be speaking at length. Because such a frame can cause misclassification, backchannel utterance sections may be deleted.

FIG. 6 shows, with broken lines, an example of the functional configuration of a frame extraction unit 11′ that deletes backchannels. The frame generation means 112′ of the frame extraction unit 11′ includes backchannel utterance section deletion means 1120.

The backchannel utterance section deletion means 1120 deletes, for example, short utterance sections as backchannels. For example, an utterance section shorter than 1 second may be regarded as a backchannel. An utterance section in which one party starts speaking while the other party is speaking and finishes within a short time may also be regarded as a backchannel. Alternatively, "hai" or "ee" may be detected by speech recognition and the corresponding utterance section deleted.
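The first two deletion heuristics can be sketched as below. This is an illustrative sketch under stated assumptions, not the actual means 1120: the `Utterance` type, the function name, and the parameter `overlap_max_duration` are hypothetical. The text gives 1 second as an example for the duration rule but no value for "a short time" in the overlap rule, so that limit is left optional.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    """One detected utterance section (hypothetical structure)."""
    speaker: str   # e.g. "customer" or "operator"
    start: float   # start time in seconds
    end: float     # end time in seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def remove_backchannels(utterances: List[Utterance],
                        min_duration: float = 1.0,
                        overlap_max_duration: Optional[float] = None
                        ) -> List[Utterance]:
    """Drop utterance sections that look like backchannels."""
    kept = []
    for u in utterances:
        # Rule 1: sections shorter than min_duration count as backchannels.
        if u.duration < min_duration:
            continue
        # Rule 2 (optional): a section that starts while the other party is
        # speaking and ends within overlap_max_duration counts as a backchannel.
        if overlap_max_duration is not None and u.duration < overlap_max_duration:
            starts_during_other = any(
                o.speaker != u.speaker and o.start <= u.start < o.end
                for o in utterances
            )
            if starts_during_other:
                continue
        kept.append(u)
    return kept


# Example loosely modelled on FIG. 14: three short operator responses
# (OPE1 to OPE3) occur while the customer is talking. Times are made up.
example = [
    Utterance("customer", 0.0, 6.0),
    Utterance("operator", 1.0, 1.4),   # OPE1
    Utterance("operator", 3.0, 3.5),   # OPE2
    Utterance("operator", 5.0, 5.3),   # OPE3
    Utterance("operator", 6.5, 9.0),
]
filtered = remove_backchannels(example)
```

Applying the filter before frames are formed keeps the three short operator responses from being grouped into a frame of their own.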

As described above, the dialog state dividing apparatus 100 of the present invention can divide a conversation, without using text information, into at least three states: "the customer is stating a request," "the operator is eliciting information from the customer," and "the operator is answering." Because the dialog states are divided using the speakers' utterance time ratio, no cost is incurred for converting the conversation into text information. In addition, the result is not affected by the recognition errors that would arise if speech recognition were used for transcription.

The method and apparatus of the present invention are not limited to the embodiments described above, and may be modified as appropriate without departing from the spirit of the invention. For example, the dialog state dividing unit 14 was described using an example in which the representative score is compared with two thresholds to divide the conversation into three dialog states, but the representative score may instead be compared with N thresholds to divide the conversation into N + 1 states.
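The generalization to N thresholds can be sketched with a binary search over a sorted threshold list. This is an assumption-laden illustration: the function name and the label strings are hypothetical, and a score exactly equal to a threshold is assigned to the lower state, a boundary convention the text does not specify.

```python
from bisect import bisect_left
from typing import Sequence


def classify_score(score: float,
                   thresholds: Sequence[float],
                   labels: Sequence[str]) -> str:
    """Map a representative score to one of N + 1 states using N thresholds."""
    if len(labels) != len(thresholds) + 1:
        raise ValueError("need exactly one more label than thresholds")
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in ascending order")
    # bisect_left counts the thresholds strictly below the score,
    # which is the index of the state the score falls into.
    return labels[bisect_left(thresholds, score)]
```

With `thresholds=[0.4, 0.6]` and `labels=["U_R", "O_H", "O_A"]` this reduces to the three-state case (0.4 being an assumed example value for the lower threshold); adding a third threshold yields four states without changing the function.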

Note that the processes described for the above method and apparatus need not be executed only in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as needed.

When the processing means of the above apparatus are implemented by a computer, the processing content of the functions that each apparatus should have is described by a program. By executing this program on the computer, the processing means of each apparatus are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), or a CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be implemented in hardware.

Claims

A dialog state dividing apparatus comprising:
an utterance section detection unit that receives voice data of a conversation between two parties, speaker A and speaker B, and detects the utterance sections of each of the two parties;
a frame extraction unit that arranges the utterance sections in order of elapsed time and outputs a predetermined number of utterance sections as one frame;
an intra-frame utterance time ratio calculation unit that calculates, for each frame, an intra-frame utterance time ratio R_j, which is the total utterance time of speaker A or speaker B in the frame divided by the sum of the total utterance time of speaker A and the total utterance time of speaker B in the frame;
a frame representative score calculation unit that determines the intra-frame utterance time ratio R_j, or a smoothed version of the intra-frame utterance time ratio R_j, as the representative score of the frame; and
a dialog state dividing unit that classifies each frame into at least three dialog states by comparing the representative score with at least two thresholds.

The dialog state dividing apparatus according to claim 1,
wherein the frame extraction unit further comprises backchannel utterance section deletion means for deleting utterance sections shorter than a predetermined time width.

The dialog state dividing apparatus according to claim 1 or 2,
wherein one of the two thresholds of the dialog state dividing unit is less than 0.5 and the other threshold is greater than 0.5.

A dialog state dividing method comprising:
an utterance section detection process in which an utterance section detection unit receives voice data of a conversation between two parties, speaker A and speaker B, and detects the utterance sections of each of the two parties;
a frame extraction process in which a frame extraction unit arranges the utterance sections in order of elapsed time and outputs a predetermined number of utterance sections as one frame;
an intra-frame utterance time ratio calculation process in which an intra-frame utterance time ratio calculation unit calculates, for each frame, an intra-frame utterance time ratio R_j, which is the total utterance time of speaker A or speaker B in the frame divided by the sum of the total utterance time of speaker A and the total utterance time of speaker B in the frame;
a frame representative score calculation process in which a frame representative score calculation unit determines the intra-frame utterance time ratio R_j, or a smoothed version of the intra-frame utterance time ratio R_j, as the representative score of the frame; and
a dialog state dividing process in which a dialog state dividing unit classifies each frame into at least three dialog states by comparing the representative score with at least two thresholds.

A program for causing a computer to function as the dialog state dividing apparatus according to any one of claims 1 to 3.

A computer-readable recording medium on which the program according to claim 5 is recorded.