JP5099218B2

JP5099218B2 - Problem solving time estimation processing program, processing apparatus, and processing method

Info

Publication number: JP5099218B2
Application number: JP2010509012A
Authority: JP
Inventors: 功難波; 佐知子小野寺
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2008-04-25
Filing date: 2008-04-25
Publication date: 2012-12-19
Anticipated expiration: 2028-04-25
Also published as: WO2009130785A1; JPWO2009130785A1

Description

The present invention relates to problem-solving time estimation processing that causes a computer to estimate the time required to solve a problem raised by a customer, using voice dialogue data in which the conversation between the customer and the operator handling the customer's telephone call is recorded.

Customer inquiries to a call center vary widely in difficulty, from those the responding operator can answer immediately to those for which the operator must call back because investigating the issue or presenting a solution takes time.

Call center managers and administrators want to treat a customer inquiry as a problem, grasp the time required to resolve it (the problem-solving time), and thereby improve the efficiency of call handling and customer satisfaction.

Conventionally, a response history management system has been provided that, when a customer calls in, accumulates response history information separately from the voice dialogue data, recording the operator's response, the reception time, and handling steps such as investigation and transfer to a second line (another department). Such a system records callback history information for customer inquiries, or records callback information in association with the response history information.

Meanwhile, call centers store voice dialogue data recording the entire conversation so that the exchange between customer and operator can be listened to later. This large body of stored voice dialogue data can serve not only for confirming what was said but also as analysis material for various purposes; however, the relevant portions must be identified according to the purpose of use.

To identify and make playable the portion of the voice dialogue data that contains the inquiry, which is the core of the exchange between operator and customer, there is a conventional method that attaches keywords extracted by speech recognition, operation information from the operator's terminal screen, and the like to the recorded voice dialogue data as an index, so that the index can be used to locate the playback start position when the data is played back (see, for example, Patent Document 1).
Patent Document 1: Japanese Patent Laid-Open No. 11-25112

When an inquiry from a customer is treated as a problem and the time taken to resolve it is to be measured, if the answer can be given during the call in which the customer phoned in, the start and end times of the call can be recorded and the call duration regarded as the time required for resolution.

However, for some problems the operator must end the call, obtain a solution, and call the customer back once an answer is ready. No system had been realized that could estimate the time from recognition of the problem to presentation of the answer in such callback cases.

A conventional response history management system merely manages response history information in a database. The callback time is recorded together with information such as the content of each response, the reception time, the end time, and the person in charge, but an operation or process for recording the callback information in the history of the corresponding response was required.

Furthermore, recorded and stored voice dialogue data carries no association between the dialogue from the customer's incoming call and the operator's callback dialogue, so no technique existed for estimating the problem-solving time from voice dialogue data.

An object of the present invention is to provide a processing technique that can estimate the problem-solving time from voice dialogue data with recording times alone.

First, the principle of the present invention will be described.

In voice dialogue data, a problem appears as the portion in which the customer utters a question, and its solution appears as the portion in which the operator, having received that utterance, utters an answer. The time from recognition of the problem to completion of the answer can therefore be regarded as the span from the start of the dialogue in which the customer asks the question to the end of the dialogue in which the operator answers that customer.

In general, among the speakers in a conversation, the one leading the conversation tends to speak continuously at a certain volume compared with the one responding. In a dialogue between customer and operator, when the customer is making an inquiry, the customer can be expected to take the lead first. In a dialogue in which the operator calls the customer back with an answer, which follows the dialogue containing the customer's question, the operator can be expected to take the lead first.

When there is only a dialogue in which the customer leads first and no subsequent dialogue in which the operator leads, it can be assumed that the answer was given during that customer dialogue.

Accordingly, the present invention extracts, from a set of voice dialogue data between customers and operators, the voice dialogue data that can be regarded as dialogues between the same customer and operator, and determines whether each dialogue is of type "question" or "non-question" by identifying its preceding lead speaker. It then picks out voice dialogue data in which a "question" dialogue is followed by a "non-question" dialogue, judges them to be a question and its answer, and estimates the problem-solving time as the time from the start of the question dialogue to the end of the answer dialogue.

Specifically, the disclosed program causes a computer to function as the following processing units:
1) a processing unit that inputs a set of voice dialogue data with recording times, each consisting of a first channel on which the operator's voice is recorded and a second channel on which the customer's voice is recorded;
2) a processing unit that, for each item of voice dialogue data, identifies lead utterance sections in which one speaker speaks louder than the other continuously for at least a predetermined length, takes the earliest lead utterance section as the preceding lead utterance section, and sets the type of the voice dialogue data to "question" when the preceding lead utterance section belongs to the second channel and to "non-question" when it belongs to the first channel;
3) a processing unit that, for each item of voice dialogue data, learns the speaker features of each channel's voice data by a predetermined machine learning method;
4) a processing unit that, based on the speaker features, collects voice dialogue data whose first-channel and second-channel speaker features each fall within a certain similarity range into a similar speaker set;
5) a processing unit that, for each similar speaker set, orders the voice dialogue data by recording time, takes out one item of voice dialogue data of type "question", and associates it with temporally subsequent voice dialogue data of type "non-question"; and
6) a processing unit that calculates the time from the earliest recording start time to the latest recording end time of the associated voice dialogue data and takes it as the problem-solving time.

A computer on which the disclosed program is installed and run first inputs a set of voice dialogue data with recording times, each consisting of a first channel on which the operator's voice is recorded and a second channel on which the customer's voice is recorded. For each item of input voice dialogue data, it identifies, from the voice data of the first and second channels, lead utterance sections in which one speaker speaks louder than the other continuously for at least a predetermined length. It then takes the earliest lead utterance section as the preceding lead utterance section, and sets the type of the voice dialogue data to "question" when that section belongs to the second channel and to "non-question" when it belongs to the first channel.

In addition, for each item of voice dialogue data, the speaker features of each channel's voice data are learned by a predetermined machine learning method. Based on the learned speaker features, voice dialogue data whose first-channel and second-channel speaker features each fall within a certain similarity range are collected into a similar speaker dialogue set.

Next, for each similar speaker dialogue set, the voice dialogue data in the set are ordered by recording time, one item of voice dialogue data of type "question" is taken out, and it is associated with temporally subsequent voice dialogue data of type "non-question".

Finally, the time from the earliest recording start time to the latest recording end time of the associated voice dialogue data is calculated and taken as the problem-solving time.

As a result, the problem-solving time can be calculated using only voice dialogue data with recording times, even in cases where the operator did not answer the customer's question during the incoming call but hung up, investigated, and then called back with the answer.

The disclosed processing apparatus has processing means that implement the processes the program causes the computer to execute.

The disclosed processing method consists of the processing steps that the computer executes under the program.

According to the present invention, the time required to solve a problem raised by a customer can be estimated automatically from voice dialogue data with recording time information that is stored for confirming calls.

In particular, even when the answer is given by callback, no work or processing is needed to associate the callback information with the response information for the customer's incoming call, so the problem-solving time can be estimated easily.

The measured problem-solving times allow the problem-solving time to be estimated at the level of individual calls and can be used as material for analyzing and planning call center operations.

Furthermore, by combining the disclosed problem-solving time estimation with the technique for estimating customer inquiry trends from voice dialogue data, which is the subject of a separate application by the same applicant, the results can serve as material for various analyses, such as analyzing the problem-solving time for each type of question.

A diagram showing a configuration example of the problem-solving time estimation processing apparatus.
A diagram showing the overall processing flow of the problem-solving time estimation processing apparatus.
A diagram showing an example of operator and customer utterances that make up the voice dialogue data.
A diagram showing a data structure example of the voice dialogue data.
A diagram showing a more detailed processing flow of step S2.
A diagram showing a more detailed processing flow of step S21.
A diagram showing an example of the voice power information of voice dialogue data (recording 1).
An explanatory diagram of the total response time.
A diagram showing an example of the total response time of each item of voice dialogue data (recordings 1, 2, ...).
An explanatory diagram of the preceding speaker (preceding utterance channel).
A diagram showing an example of the preceding speaker (preceding utterance channel) of each item of voice dialogue data (recordings 1, 2, ...).
An explanatory diagram of the preceding lead speaker (preceding lead utterance channel).
A processing flow diagram (part 1) for obtaining the preceding lead speaker and the preceding lead utterance time.
A processing flow diagram (part 2) for obtaining the preceding lead speaker and the preceding lead utterance time.
A diagram showing an example of the calculated preceding lead utterance time of voice dialogue data (recording 1).
A processing flow diagram of rule-based determination of the question utterance part.
A diagram showing an example of the input data.
A diagram showing an example of the rules.
A diagram showing the processing flow of learning data extraction in step S3.
A diagram showing the processing flow of the learning process in step S3.
A diagram showing the processing flow of the similarity calculation in step S3.
A diagram showing an example of reliability information.
A diagram showing an example of a calculated similarity matrix.
A diagram showing a more detailed processing flow of step S4.
A diagram showing an example of setting "indefinite" in the similarity matrix.
A diagram showing an example of setting "no similarity" in the similarity matrix.
A diagram showing an example of correction coefficients and normalized similarities.
A diagram showing an example of a calculated normalized similarity matrix.
A diagram showing an example of a calculated average similarity matrix.
A diagram showing an example of determining the voice data that form a similar speaker set.
A more detailed processing flow diagram of step S5.
A diagram showing an example of the data input to the same-speaker processing.
A diagram showing an example of associating voice data.
An explanatory diagram of the problem-solving time.

Explanation of symbols

1: Problem-solving time estimation processing apparatus
11: Data input unit
13: Dialogue type estimation unit
15: Similar speaker set calculation unit
17: Dialogue data association unit
19: Problem-solving time calculation unit
3: Voice dialogue data (with time information)
4: Voice power information
5: Problem-solving time

FIG. 1 is a diagram showing a configuration example of the problem-solving time estimation processing apparatus.

The problem-solving time estimation processing apparatus 1 is an apparatus that estimates the problem-solving time required to solve a problem raised by a customer, from voice dialogue data in which conversations between operators and customers are recorded at a call center that receives and answers customer inquiries.

The problem-solving time estimation processing apparatus 1 includes a data input unit 11, a dialogue type estimation unit 13, a similar speaker set calculation unit 15, a dialogue data association unit 17, and a problem-solving time calculation unit 19.

The data input unit 11 inputs a set of voice dialogue data 3.

From each item of input voice dialogue data 3, the dialogue type estimation unit 13 identifies sections in which one speaker speaks louder than the other continuously for at least a predetermined length (hereinafter, lead utterance sections) and takes the first lead utterance section to appear as the preceding lead utterance section. It then determines whether the preceding lead utterance section belongs to the customer or to the operator. If the speaker of the preceding lead utterance section (the preceding lead speaker) is the customer (second channel), the type of the voice dialogue data 3 is set to "question". If the preceding lead speaker is the operator (first channel), the type is set to "non-question".

For each item of input voice dialogue data 3, the similar speaker set calculation unit 15 learns the speaker features of the customer and of the operator by a predetermined machine learning method. Based on the learned speaker features, it collects and classifies the voice dialogue data 3 in which the speaker features of both the customer and the operator fall within a certain similarity range, and treats each resulting class as a similar speaker set.

For each similar speaker dialogue set, the dialogue data association unit 17 orders the voice dialogue data 3 in the set by reception time (recording time) and associates the matching items of voice dialogue data with one another using predetermined association conditions.

For example, the following conditions are provided.

・Association starts from voice dialogue data of type "question".

・A "question" may be followed by a "non-question" in time order.

・A "non-question" may be followed by another "non-question" in time order.

・No other combination of types is linked.

Under these conditions, one item of voice dialogue data 3x of type "question" is taken out, and the unit checks whether there is temporally subsequent voice dialogue data 3y of type "non-question". If such voice dialogue data 3y exists, the voice dialogue data 3x and 3y are associated. If not, no association is made.
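The association conditions can be illustrated with a minimal Python sketch. It reads the conditions as "a chain starts at a question and absorbs the non-question dialogues that directly follow it in time order"; this reading, the `Recording` class, and the `associate` function are assumptions made for illustration and are not names taken from the patent.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Recording:
    rec_id: str      # hypothetical identifier for one item of voice dialogue data
    start: datetime  # recording start time
    end: datetime    # recording end time
    kind: str        # "question" or "non-question"


def associate(recordings):
    """Group one similar speaker set into chains that each start at a question."""
    ordered = sorted(recordings, key=lambda r: r.start)
    chains = []
    current = None
    for rec in ordered:
        if rec.kind == "question":
            # A question always starts a new chain.
            current = [rec]
            chains.append(current)
        elif current is not None:
            # question -> non-question and non-question -> non-question are linked.
            current.append(rec)
        # A non-question with no earlier question stays unassociated.
    return chains


day = (2005, 10, 11)  # hypothetical date
n0 = Recording("n0", datetime(*day, 9, 0), datetime(*day, 9, 5), "non-question")
q1 = Recording("q1", datetime(*day, 10, 0), datetime(*day, 10, 5), "question")
n1 = Recording("n1", datetime(*day, 11, 0), datetime(*day, 11, 5), "non-question")
q2 = Recording("q2", datetime(*day, 12, 0), datetime(*day, 12, 5), "question")
chains = associate([n1, q2, n0, q1])
chain_ids = [[r.rec_id for r in chain] for chain in chains]
```

Here `q1` is linked with `n1`, `q2` forms a chain by itself (a question with no later non-question), and `n0` is left out because no question precedes it.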

The problem-solving time calculation unit 19 calculates the time from the earliest recording start time to the latest recording end time of the associated voice dialogue data 3x and 3y, takes it as the problem-solving time 5, and outputs the problem-solving time 5.

FIG. 2 is a diagram showing the processing flow of the problem-solving time estimation processing apparatus 1.

Step S1: The data input unit 11 of the problem-solving time estimation processing apparatus 1 inputs the set of voice dialogue data 3.

FIG. 3 shows an example of the operator and customer utterances that make up the voice dialogue data 3, and FIG. 4 shows the data structure of the voice dialogue data 3.

The voice dialogue data 3 is voice data in which a conversation between an operator and a customer, such as that shown in FIG. 3, has been recorded by a known recording device. The voice dialogue data 3 consists of two channels: the operator's voice is recorded on the first channel (for example, the L channel) and the customer's voice on the second channel (for example, the R channel), each independently.

At the head of the voice dialogue data 3, the following are stored as a data index: the data identifier (recording 1), the operator name (Yamada), the recording date (05/10/11), the recording start time (15:25:20), and the recording end time (15:31:32).
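As a rough illustration, the data index could be held in a small Python record like the one below. The `DataIndex` class and `make_index` function are hypothetical names, the date is assumed to be written YY/MM/DD, and a recording is assumed not to run past midnight; none of these details are stated in the text.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DataIndex:
    rec_id: str      # data identifier, e.g. "recording 1"
    operator: str    # operator name
    start: datetime  # recording start (used as response start)
    end: datetime    # recording end (used as response end)


def make_index(rec_id, operator, date, start_time, end_time):
    """Build a DataIndex from header strings (assumed YY/MM/DD and HH:MM:SS)."""
    fmt = "%y/%m/%d %H:%M:%S"
    return DataIndex(
        rec_id=rec_id,
        operator=operator,
        start=datetime.strptime(f"{date} {start_time}", fmt),
        end=datetime.strptime(f"{date} {end_time}", fmt),
    )


# Values taken from the example index in the text.
index = make_index("recording 1", "Yamada", "05/10/11", "15:25:20", "15:31:32")
response_seconds = (index.end - index.start).total_seconds()
```

With the example values, the response lasts 6 minutes 12 seconds, so `response_seconds` is 372.0.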

The recording start and end times are used as the response start and end times.

Step S2: The dialogue type estimation unit 13 estimates the dialogue type of the input voice dialogue data 3.

Two dialogue types are defined in advance: "question", indicating a dialogue in which the customer asks a question, and "other (non-question)", indicating a dialogue that is not a customer question.

Specifically, for the voice dialogue data 3, the dialogue type estimation unit 13 identifies, from the voice data of the L and R channels, lead utterance sections in which one speaker speaks louder than the other continuously for at least a predetermined length. It then determines which channel contains the preceding lead utterance section, that is, the first lead utterance section to appear, and takes that channel as the preceding lead speaker (preceding lead utterance channel).

In more detail, the dialogue type estimation unit 13 calculates, for each channel of the voice dialogue data 3, the voice power value of each predetermined unit section and generates voice power information in which the power values are arranged in time series. It then compares the voice power information of the two channels from the beginning in time series and, in each predetermined judgment unit section, judges the channel whose power values have the larger total or proportion within that judgment unit section to be the lead speaker of that section.
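The per-section comparison can be sketched in Python as follows. The function name `lead_speakers`, the `window` parameter (number of unit sections per judgment unit section), and the choice to report no lead speaker on a tie are assumptions for illustration; the text does not say how ties or silence are handled.

```python
def lead_speakers(power_l, power_r, window):
    """Return the lead channel ("L", "R" or None) for each judgment unit section.

    power_l / power_r: equal-length time series of per-unit-section power
    values (or 0/1 bits) for the operator (L) and customer (R) channels.
    window: number of unit sections per judgment unit section (assumed).
    """
    if len(power_l) != len(power_r):
        raise ValueError("channels must have the same number of unit sections")
    if window <= 0:
        raise ValueError("window must be positive")
    leads = []
    for i in range(0, len(power_l), window):
        total_l = sum(power_l[i:i + window])
        total_r = sum(power_r[i:i + window])
        if total_l > total_r:
            leads.append("L")
        elif total_r > total_l:
            leads.append("R")
        else:
            leads.append(None)  # tie or silence: no lead speaker (assumption)
    return leads


leads = lead_speakers([1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0], window=2)
```

In this toy input the operator leads the first judgment unit section, the customer leads the second, and the third has no lead speaker.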

Next, the lead speaker of the judgment unit section nearest the beginning of the time series is identified as the preceding lead speaker. The judgment unit sections that continue from that section and have the same lead speaker as the preceding lead speaker form the preceding lead utterance section.

Then, if the preceding lead utterance section is on the R channel, the type of the voice dialogue data 3 is set to "question"; if it is on the L channel, the type is set to "non-question".
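A minimal Python sketch of this step is given below. It assumes the lead speaker of each judgment unit section is already available as "L" (operator), "R" (customer), or `None`; the function name `classify_dialogue` and the return shape are hypothetical.

```python
def classify_dialogue(leads):
    """Return (kind, first_section, length) for one item of voice dialogue data.

    leads: lead speaker per judgment unit section, in time order.
    kind is "question", "non-question", or None when no section has a lead.
    """
    for first, speaker in enumerate(leads):
        if speaker is not None:
            break
    else:
        return None, None, 0  # no lead speaker anywhere in the recording

    # Extend the preceding lead utterance section while the same speaker leads.
    length = 0
    for s in leads[first:]:
        if s != speaker:
            break
        length += 1

    kind = "question" if speaker == "R" else "non-question"
    return kind, first, length


incoming = classify_dialogue([None, "R", "R", "L"])
callback = classify_dialogue(["L", "L", "R"])
```

The first call models an incoming call in which the customer leads from the second judgment unit section; the second models a callback in which the operator leads from the start.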

The processing of the dialogue type estimation unit 13 is described in detail later.

Step S3: For each item of voice dialogue data 3, the similar speaker set calculation unit 15 learns the speaker features of each channel's voice data by a predetermined processing method.

Step S4: Based on the speaker features obtained by learning, the similar speaker set calculation unit 15 then obtains the similarity relationships among the voice dialogue data 3 for the L channel and the R channel separately, and collects and classifies the voice dialogue data 3 whose speaker features fall within a certain similarity range on both the L channel and the R channel into similar speaker dialogue sets.

In this way, the set of voice dialogue data 3 is classified into dialogues between the same customer and operator.
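One simple way to picture this grouping is sketched below in Python. It assumes pairwise similarity scores are already available per channel and groups recordings by connected components under a single threshold; the threshold test, the union-find grouping, and the name `similar_speaker_sets` are illustrative assumptions and are not the procedure specified in the patent.

```python
def similar_speaker_sets(sim_l, sim_r, threshold):
    """Group recording indices that are similar on both the L and R channels.

    sim_l / sim_r: square, symmetric lists of lists with pairwise similarity
    scores for the operator (L) and customer (R) channels.
    """
    n = len(sim_l)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # Both the operator and the customer must be judged similar.
            if sim_l[i][j] >= threshold and sim_r[i][j] >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Recordings 0 and 1: same operator and same customer.
# Recording 2: same operator, different customer.
sim_l = [[1.0, 0.9, 0.9], [0.9, 1.0, 0.9], [0.9, 0.9, 1.0]]
sim_r = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
sets = similar_speaker_sets(sim_l, sim_r, threshold=0.7)
```

Requiring both channels to match is what keeps recording 2 apart: the operator is the same, but the customer is not.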

Step S5: For each similar speaker dialogue set, the dialogue data association unit 17 orders the voice dialogue data 3 by recording time (for example, recording start time) and takes out one item of voice dialogue data of type "question". It then takes out and associates the voice dialogue data that satisfy predetermined conditions. For example, if there is temporally subsequent voice dialogue data of type "non-question", it is taken out and associated.

As a result, among multiple dialogues between the same customer and operator, the dialogue in which the customer asked a question is associated with the dialogue in which the operator answered it.

Step S6: The problem-solving time calculation unit 19 calculates the time from the recording start time of the earlier of the associated dialogues to the recording end time of the later one and takes it as the problem-solving time.

The problem-solving time calculation unit 19 also takes out the voice dialogue data 3 of type "question" that have not been associated, calculates the time from the recording start time to the recording end time of each, and takes it as the problem-solving time.

これにより，顧客の質問に対してコールバックして回答したケース，質問された応対中に回答したケースの問題解決時間を推定することができる。 As a result, it is possible to estimate the problem solving time of the case where the customer's question is called back and answered and the case where the question was answered during the answering.

Each step of the processing flow of FIG. 2 performed by the problem solving time estimation processing device 1 is described in more detail below.
[Dialogue type estimation process]
FIG. 5 is a diagram showing a more detailed processing flow of step S2.

Step S20: The voice dialogue data 3 is divided into predetermined unit intervals. The unit interval is, for example, 1 to 2 seconds.

Step S21: The average voice power in each unit interval is calculated, and the data is converted into voice power information 4, a time series of power values.

The voice power information 4 is a bit string arranged in time series, obtained by converting the average magnitude (power) of each channel's audio data in each unit interval into a bit using a predetermined threshold th. If the voice power of the utterance is equal to or greater than the threshold th, “1” is stored in the bit; otherwise the bit remains “0”.

FIG. 6 shows the processing flow for generating the voice power information 4 in step S21.

A Fourier transform is applied to each channel of the voice dialogue data 3 to obtain a sequence of [power, pitch] values (step S111). Next, a unit interval m, the minimum time unit of the power sequence, is determined (step S112). As the voice power information 4, the average power value is calculated for each unit interval m from the beginning of the voice dialogue data 3, and a bit string is output in which “1” is assigned if the average power value is equal to or greater than the threshold th and “0” if it is less than th (step S113).
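The thresholding of steps S112 and S113 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the channel is available as a list of amplitude samples and uses the mean squared amplitude as a stand-in for the Fourier-derived power of step S111. The function name `power_bits` and its parameters are hypothetical.

```python
from typing import List, Sequence


def power_bits(samples: Sequence[float], sample_rate: int,
               unit_seconds: float, th: float) -> List[int]:
    """Return one bit per unit interval m: 1 if the mean power >= th, else 0.

    Assumption: power is approximated by the mean squared amplitude,
    not by the Fourier-based [power, pitch] sequence of step S111.
    """
    step = int(sample_rate * unit_seconds)  # samples per unit interval m
    bits = []
    for start in range(0, len(samples), step):
        frame = samples[start:start + step]
        mean_power = sum(x * x for x in frame) / len(frame)
        bits.append(1 if mean_power >= th else 0)
    return bits


# Three one-second intervals at 4 samples per second: silence, speech, faint noise.
channel = [0.0] * 4 + [1.0] * 4 + [0.1] * 4
bits = power_bits(channel, sample_rate=4, unit_seconds=1.0, th=0.5)
```

Here only the middle interval reaches the threshold, so the resulting bit string marks a single active unit interval.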

FIG. 7 shows the voice power information 4 of voice dialogue data (recording 1) 3. The voice power information 4 in FIG. 7 expresses, in the form [utterance start : utterance end], the bit runs to which the value “1” is assigned between an utterance start time and an utterance end time. For example, with unit interval m = 1 second, [utterance start = 0 : utterance end = 3] means that the value “1” is assigned from second 0 to second 3, that is, an utterance at or above the threshold th occurred during that time.

Step S22: From the converted voice power information 4, the total response time, the preceding speech channel, the preceding lead speaker (channel), and the preceding lead utterance time are obtained as attribute information.

The total response time is the total duration of the actual dialogue in the voice dialogue data 3. As shown in FIG. 8, it is obtained as the difference between the recording start time and the recording end time in the index information of the voice dialogue data. FIG. 9 shows examples of the total response time of each item of voice dialogue data (recordings 1, 2, ...) 3.

The preceding speech channel is the channel on which the first utterance occurred in the dialogue between the customer and the operator. In the bit strings of the voice power information 4, the channel that has the earliest unit interval with a bit set to “1” is taken as the preceding speech channel. Its value is “L”, “R”, or “LR”.

In voice dialogue data 3 recorded at a call center, the party receiving the call generally starts the dialogue, that is, speaks first. For a normal inquiry, where the customer places the call, the first utterance is therefore by the operator. Conversely, when the operator calls the customer back, the operator places the call and the first utterance is by the customer. Because a callback dialogue rarely contains a customer question, identifying whether the channel carrying the operator's voice or the channel carrying the customer's voice is the preceding speech channel makes it possible to identify the operator's callback dialogues.

In the bit strings of the voice power information 4 shown in FIG. 10, the first unit interval with a “1” on the L channel is 0 and the first unit interval with a “1” on the R channel is 3, so the preceding speech channel is L. FIG. 11 shows the preceding speech channel of each item of voice dialogue data (recordings 1, 2, ...) 3.
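The selection of the preceding speech channel can be sketched as below. The helper names are hypothetical, and treating a tie as “LR” is an assumption: the text lists “LR” as a possible value but does not spell out when it is assigned.

```python
from typing import List, Optional


def first_active(bits: List[int]) -> Optional[int]:
    """Index of the earliest unit interval whose bit is 1, or None if silent."""
    for i, b in enumerate(bits):
        if b == 1:
            return i
    return None


def preceding_speech_channel(l_bits: List[int], r_bits: List[int]) -> Optional[str]:
    """Return "L", "R", or "LR" (assumed to mean both start in the same interval)."""
    l, r = first_active(l_bits), first_active(r_bits)
    if l is None and r is None:
        return None  # neither channel ever reaches the threshold
    if r is None or (l is not None and l < r):
        return "L"
    if l is None or r < l:
        return "R"
    return "LR"


# Mirrors the FIG. 10 description: L first active at interval 0, R at interval 3.
channel = preceding_speech_channel([1, 1, 0, 0, 0], [0, 0, 0, 1, 1])
```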

The preceding lead speaker (preceding lead utterance channel) is the lead speaker (channel) of the determination unit interval closest to the beginning, among the lead speakers determined for the predetermined determination unit intervals.

The dialogue type estimation unit 13 determines as the lead speaker the channel with the larger total number (or higher proportion) of unit intervals whose power bit in the voice power information 4 is “1” within a predetermined determination unit interval. The lead speaker in the determination unit interval closest to the beginning (the earliest in the time series) is then identified as the preceding lead speaker.

Further, in the voice power information 4 of the channel set as the preceding speech channel, the run of determination unit intervals, starting from the unit interval in which the power value is first “1”, over which the preceding lead utterance channel continues to be determined as the lead speaker is taken as the preceding lead utterance time.

FIG. 12 is a diagram for explaining the preceding lead speaker and the preceding lead utterance time.

The dialogue type estimation unit 13 performs the determination while shifting a window, which indicates the range of unit intervals subject to the determination, by a predetermined movement unit. With a power unit interval m = 1 second, a window size n = 15 seconds (unit intervals) corresponding to the determination unit time, and a movement unit k = 3 seconds (unit intervals), it counts, for each channel, the number of unit intervals within the window size n whose power value is “1” and determines the channel with the larger count as the lead speaker. It then shifts the window by the movement unit k = 3 seconds and, within the window size n, again determines the channel with the larger count of “1” unit intervals as the lead speaker.

In FIG. 12, the “R channel” is determined as the lead speaker in the first to fifth determinations, the “L channel” in the sixth, and “LR” in the seventh. The “R channel”, determined as the lead speaker in the earliest determination unit interval, is therefore determined as the preceding lead speaker (preceding lead utterance channel).

Next, in the L channel identified as the preceding speech channel, the continuous run of determination unit intervals, starting from the earliest one in which a power bit is “1”, over which the preceding lead utterance channel is determined as the lead speaker is taken as the preceding lead utterance time.

Here, when the lead speaker changes from the R channel to the L channel, the continuous section up to the unit interval obtained by adding half of the window size n at that point is calculated as the preceding lead utterance period.

FIGS. 13 and 14 are processing flow diagrams for obtaining the preceding lead speaker and the preceding lead utterance time.

The dialogue type estimation unit 13 selects the L channel identified as the preceding speech channel (step S131). It sets the window size n (step S132) and sets a pointer to the beginning of the bit string of the voice power information (step S133).

The number of unit intervals in the window whose bit on the L channel side is “1” is counted and taken as value A (step S134). The number of unit intervals in the window whose bit on the R channel side is “1” is counted and taken as value B (step S135).

It is determined whether value A is greater than value B (step S136); if so, the lead speaker is set to the L channel (step S137). If value A is not greater than value B, it is further determined whether value A equals value B (step S138); if so, the lead speaker is set to the LR channel (step S139). If value A does not equal value B, the lead speaker is set to the R channel (step S1310).

The pair [pointer position, lead speaker value] is then output (step S1311).

Next, the window is shifted by the movement unit k (step S1312). If the window has reached the end of the bit string of the voice power information 4 (FIG. 14: step S1313), processing proceeds to step S1314; otherwise it returns to step S134. In step S1314, the lead speaker value at pointer position “0” is taken as the value of the preceding lead speaker.

The range (L) of unit intervals over which the lead speaker value continuously equals that of the preceding lead speaker is then obtained (step S1315). The section from pointer position = 0 to pointer position = L is converted into utterance time and taken as the preceding lead utterance time (step S1316).
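The window loop of steps S133 to S1316 can be sketched roughly as follows. The names `lead_speakers` and `preceding_lead` are hypothetical, the stop condition is one reading of "the window has reached the end of the bit string", and the half-window adjustment described for a change of lead speaker is deliberately left out, so the durations will not match FIG. 15.

```python
from typing import List, Tuple


def lead_speakers(l_bits: List[int], r_bits: List[int],
                  n: int = 15, k: int = 3) -> List[Tuple[int, str]]:
    """Return [pointer position, lead speaker] pairs for windows of size n moved by k."""
    length = min(len(l_bits), len(r_bits))
    results = []
    pos = 0
    while True:
        a = sum(l_bits[pos:pos + n])  # value A: active L intervals in the window
        b = sum(r_bits[pos:pos + n])  # value B: active R intervals in the window
        lead = "L" if a > b else ("LR" if a == b else "R")
        results.append((pos, lead))
        pos += k
        if pos + n > length:  # assumed meaning of "window reached the end"
            break
    return results


def preceding_lead(results: List[Tuple[int, str]], m: float = 1.0) -> Tuple[str, float]:
    """Preceding lead speaker and the time span over which it stays the lead speaker."""
    first_lead = results[0][1]
    last_pos = results[0][0]
    for pos, lead in results[1:]:
        if lead != first_lead:
            break
        last_pos = pos
    return first_lead, last_pos * m  # half-window adjustment omitted in this sketch


# R (customer) talks for the first 18 seconds, then L (operator) takes over.
l_bits = [0] * 18 + [1] * 12
r_bits = [1] * 18 + [0] * 12
windows = lead_speakers(l_bits, r_bits)
speaker, seconds = preceding_lead(windows)
```

In this toy input the first four windows are led by the R channel and the fifth by the L channel, so the R channel is the preceding lead speaker.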

FIG. 15 shows the calculation result of the preceding lead utterance time for voice dialogue data (recording 1) 3. In FIG. 15, the start second indicates the start position of the window, and the window size indicates the window size n. The lead channel is the channel determined to be the lead speaker, and the L ratio and the R ratio indicate the number of unit intervals to which “1” is assigned within the window.

For voice dialogue data (recording 1) 3, the preceding lead speaker (channel) is the R channel and the preceding lead utterance time is 55.5 seconds.

Step S23: The dialogue type estimation unit 13 determines the question utterance part from the preceding lead speaker (channel) and the preceding lead utterance time. For example, when the preceding lead utterance channel is the R channel, that is, the channel on which the customer's voice is recorded, the time corresponding to the preceding lead utterance time is identified as the question utterance part.

FIG. 16 is a processing flow diagram for determining the question utterance part using a rule base.

The dialogue type estimation unit 13 takes as input, for the voice dialogue data to be judged, the tuple [preceding speaker (channel), preceding lead speaker (channel), preceding lead utterance time, total response time] as shown in FIG. 17 (step S141).

Then, based on the rule base shown in FIG. 18, the determination processing of steps S142 to S147 is performed.

The rule base of FIG. 18 defines the following determination conditions.

Rule 1: If preceding speaker = preceding lead speaker, “reject”;
Rule 2: If preceding speaker = LR, “reject”;
Rule 3: If preceding lead speaker = L or preceding lead speaker = LR, “reject”;
Rule 4: If the total response time is 1/3 or less of the average response time, “reject”;
Rule 5: If the preceding lead utterance time is 5 seconds or less, “reject”;
Default: If none of rules 1 to 5 applies, “accept”.
Here, “reject” means that no question utterance part exists, and “accept” means that the preceding lead utterance part is taken as the question utterance part.

The input of step S141 is checked in turn against rule 1 (step S142), rule 2 (step S143), rule 3 (step S144), rule 4 (step S145), and rule 5 (step S146); if it matches a rule, it is determined that there is no question utterance part (reject) (step S147). If none of rules 1 to 5 applies, it is determined that a question utterance part is included (step S148).

As a result of this determination, among the voice dialogue data of FIG. 17, for example, the voice dialogue data of recording 1 and recording 2 are determined to include a question utterance part (accept), whereas the voice dialogue data of recording 3 and recording 4 are determined not to include one (reject).
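The rule base of FIG. 18 can be written as a short chain of checks. This is a sketch under the reading that any matching rule yields “reject” and only an input matching none of the rules yields “accept”; the class and function names are hypothetical, and the sample values are invented for illustration rather than taken from FIG. 17.

```python
from dataclasses import dataclass


@dataclass
class CallAttributes:
    preceding_speaker: str        # "L", "R", or "LR"
    preceding_lead_speaker: str   # "L", "R", or "LR"
    preceding_lead_seconds: float
    total_seconds: float


def has_question(attrs: CallAttributes, average_total_seconds: float) -> bool:
    """True for "accept" (question utterance part exists), False for "reject"."""
    if attrs.preceding_speaker == attrs.preceding_lead_speaker:  # rule 1
        return False
    if attrs.preceding_speaker == "LR":                          # rule 2
        return False
    if attrs.preceding_lead_speaker in ("L", "LR"):              # rule 3
        return False
    if attrs.total_seconds <= average_total_seconds / 3:         # rule 4
        return False
    if attrs.preceding_lead_seconds <= 5:                        # rule 5
        return False
    return True                                                  # default: accept


# Operator (L) answers the phone, customer (R) then leads for 55.5 seconds.
inquiry = CallAttributes("L", "R", 55.5, 300.0)
# Customer (R) picks up and keeps leading: typical of a callback.
callback = CallAttributes("R", "R", 40.0, 300.0)
accepted = has_question(inquiry, average_total_seconds=300.0)
rejected = has_question(callback, average_total_seconds=300.0)
```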

Step S24: The dialogue type estimation unit 13 determines whether a question utterance part exists in the voice dialogue data 3.

Step S25: If the result is “accept”, that is, if a question utterance part exists in the voice dialogue data 3 (YES in step S24), the dialogue type of that voice dialogue data 3 is set to “question”.

Step S26: If the result is “reject”, that is, if no question utterance part exists in the voice dialogue data 3 (NO in step S24), the dialogue type of that voice dialogue data 3 is set to “other”.
[Speaker feature learning process]
As learning data for the speaker feature learning process, the similar speaker set calculation unit 15 extracts, from the voice dialogue data, the audio data of each channel that is recorded with a certain power and length. It learns speaker features for each item of extracted audio data and calculates the similarity of the speaker features against all audio data. Similarities are calculated for all pairs of audio data in round-robin fashion.

FIG. 19 is a diagram showing the processing flow of learning data extraction in step S3.

Step S300: The similar speaker set calculation unit 15 applies a Fourier transform to each channel of the voice dialogue data 3 to obtain a sequence of [power, pitch] values.

Step S301: A unit interval m, the minimum time unit of the power sequence, is determined.

Step S302: For the audio data of each channel of the voice dialogue data 3, the average power value is calculated for each unit interval m from the beginning, and the portions where the average power value is equal to or greater than the threshold th2 are output. The audio data corresponding to the output portions is stored in audio set A.

Step S303: The total recording time of the output audio data is recorded in association with each item of audio data.

FIG. 20 is a diagram showing the processing flow of the learning process in step S3. The following processing is performed for each item of audio data in audio set A.

Any known learning method may be used for the learning process; for example, Mahalanobis distance determination is used. Determination based on the Mahalanobis distance is described in detail in the reference (P.C. Mahalanobis, "On the generalized distance in statistics", Proceedings of the National Institute of Science of India, 12 (1936) 49-55, 1936).
Step S310: The similar speaker set calculation unit 15 takes one item of audio data from audio set A.

Step S311: Speaker features are learned using the extracted audio data, and the learning result (speaker feature data set) is stored in learning set B.

Step S312: The used audio data is removed from audio set A.

Step S313: If audio data remains in audio set A, processing returns to step S310; once all audio data has been used, processing ends.
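For reference, the Mahalanobis distance named above as one possible criterion has the following standard form. The symbols are not taken from this description and are assumptions for illustration: x is a feature vector extracted from one item of audio data, and μ and Σ are the mean vector and covariance matrix of a learned speaker feature data set.

```latex
d_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^{\top}\, \Sigma^{-1}\, (\mathbf{x} - \boldsymbol{\mu})}
```

A distance of 0 means that x coincides with the learned mean, and larger distances indicate a poorer match. How the description derives its similarity values from such a distance is not specified here.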

FIG. 21 is a diagram showing the processing flow of the similarity calculation in step S3. The following processing is performed for each item of audio data in audio set A.

Step S320: The similar speaker set calculation unit 15 takes one speaker feature data set from learning set B.

Step S321: The similarity to every item of audio data in audio set A is calculated.

When the similarity is calculated, the total learning target time and the average similarity of each item of audio data are also calculated as reliability information. FIG. 22 shows the total learning target time and the average similarity of the audio data (A, B, C, ...) constituting audio set A.

The total learning target time is the recording time of the audio data used to learn the speaker features. The average similarity is the mean of the similarity values output against the other audio data.

Step S322: The used speaker feature data set is removed from learning set B.

Step S323: If speaker feature data sets remain in learning set B, processing returns to step S320; once all speaker feature data sets have been used, processing ends.

FIG. 23 shows an example of the calculated similarity matrix. In this matrix, a similarity value of “0” indicates maximum similarity (minimum distance), and larger values indicate lower similarity.

Column A of the similarity matrix in FIG. 23 gives the similarity of the other audio data (B, C, ...) to audio data A. It shows that audio data A has similarity values of “0, 30, 1500, 25000, 230, ...” with respect to audio data A, B, C, D, E, ..., respectively.

[Similar speaker set calculation]
Based on the similarity of each item of audio data, the similar speaker set calculation unit 15 performs speaker verification, that is, it calculates sets of audio data that can be regarded as belonging to the same speaker (similar speaker sets). To improve the accuracy of the determination, the similarity values are corrected.

FIG. 24 is a diagram showing a more detailed processing flow of step S4.

Step S40: The similar speaker set calculation unit 15 determines a threshold th3 for judging which audio data are regarded as similar speaker candidates. For example, the threshold th3 is set to “similarity = 1000”.

Step S41: The average similarity over all audio data is calculated.

Step S42: Audio data whose average similarity is lower than the overall average similarity by a certain degree or more is set to “undefined” and excluded. When the overall average similarity is “23000000”, audio data with “1/4 or less of the average similarity” is set to “undefined”, as shown in FIG. 25.

Step S43: For two items of audio data whose corresponding voice dialogue data have L channel (operator) speaker features that are not regarded as the same speaker, “no similarity” is set.

Assuming that the L channel audio data of audio data B and audio data D are not within the given similarity range, “no similarity (−)” is set in the corresponding entries, as in the similarity matrix of FIG. 26.

Step S44: The similarity of the audio data is corrected according to the corresponding average similarity value, and the correction coefficient and the normalized similarity are calculated as shown in FIG. 27(A).

Correction coefficient = (average similarity of each voice dialogue data / average similarity of all voice dialogue data)^2
Normalized similarity = original similarity * correction coefficient
In this way, the correction coefficient of each item of voice dialogue data is calculated as shown in FIG. 27(B), and the normalized similarity matrix of the audio data shown in FIG. 28 is obtained.
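The two formulas of step S44 translate directly into code. The function names and the numbers in the usage lines are hypothetical and are not taken from FIG. 27.

```python
def correction_coefficient(item_average: float, overall_average: float) -> float:
    """Square of the ratio between one item's average similarity and the overall average."""
    return (item_average / overall_average) ** 2


def normalized_similarity(original: float, item_average: float,
                          overall_average: float) -> float:
    """Original similarity scaled by the item's correction coefficient."""
    return original * correction_coefficient(item_average, overall_average)


# An item whose average similarity is half the overall average gets coefficient 0.25.
coefficient = correction_coefficient(2.0, 4.0)
scaled = normalized_similarity(1000.0, 2.0, 4.0)
```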

Step S45: The similarity of the pair formed by the audio data being processed and another item of audio data is calculated as the mean of the normalized similarities of the two items.

That is, the similarity between audio data A and B is the mean of the similarity of audio data A with respect to audio data B (A→B similarity) and the similarity of audio data B with respect to audio data A (B→A similarity).

If one of the “A→B similarity” and the “B→A similarity” is “undefined”, the value that is not undefined is used as the similarity. If both are “undefined”, the two items are determined not to be the same speaker.

FIG. 29 shows an example of the average similarity matrix of the voice dialogue data.

Step S46: The average similarity matrix of the voice dialogue data is used to determine whether two items belong to the same speaker. If the average similarity of the audio data is equal to or greater than the threshold th3 (= 1000), the pair is judged to be different speakers (1); if it is less than th3, the pair is judged to be the same speaker (0).
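Steps S45 and S46 can be sketched as follows, with `None` standing in for “undefined”. That representation and the function names are assumptions made for this illustration. Smaller values mean closer speakers, as in the similarity matrix.

```python
from typing import Optional


def pair_similarity(a_to_b: Optional[float], b_to_a: Optional[float]) -> Optional[float]:
    """Mean of the two directed similarities; None represents "undefined"."""
    if a_to_b is None and b_to_a is None:
        return None            # both undefined: treated as not the same speaker
    if a_to_b is None:
        return b_to_a          # use whichever value is defined
    if b_to_a is None:
        return a_to_b
    return (a_to_b + b_to_a) / 2


def same_speaker(similarity: Optional[float], th3: float = 1000) -> bool:
    """Same speaker only if the pair similarity is defined and below th3."""
    return similarity is not None and similarity < th3


close_pair = pair_similarity(30, 50)
one_sided = pair_similarity(None, 50)
```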

As shown in FIG. 30, for audio data A, audio data B, C, and E can be regarded as the same speaker, and the similar speaker set {A, B, C, E} is calculated.

Similarly, for audio data B, audio data E can be regarded as the same speaker, giving the similar speaker set {B, E}; for audio data C, audio data D can be regarded as the same speaker, giving the similar speaker set {C, D}.
[Dialogue data association process]
The dialogue data association unit 17 sets the following conditions in advance as the conditions for associating voice dialogue data.

- Association starts from data of dialogue type “question”.

- Dialogue types “question” and “other” are consecutive in time order.

- Dialogue types “other” and “other” are consecutive in time order.

- Pairs of dialogue types other than the above are not consecutive.

FIG. 31 is a more detailed processing flow diagram of step S5.

Step S50: For the audio data constituting a similar speaker set, the dialogue data association unit 17 keeps the audio data that satisfies the conditions, based on the dialogue type of each item.

Assume that the association is performed based on the same-speaker determination results shown in FIG. 32(A).

For the similar speaker set {A, B, C, E}, the dialogue types shown in FIG. 32(B) indicate that audio data A is a “question”, so A is taken as the starting point. Audio data B and C are “question” and do not satisfy the conditions, so they are excluded. Audio data E is “other” and satisfies the conditions, so it is kept, leaving {A, E} in this similar speaker set.

For the similar speaker set {B, E}, audio data B is a “question”, so with B as the starting point, audio data E is “other” and satisfies the conditions. However, audio data E is more similar to audio data A, so E is excluded and {B} remains in this similar speaker set.

For the similar speaker set {C, D}, audio data C is a “question”, so with C as the starting point, audio data D is “other” and satisfies the conditions and is kept. {C, D} therefore remains in this similar speaker set.

Step S51: The voice dialogue data corresponding to the audio data remaining in each similar speaker set is output in time order.

From the result of step S50, when the recording times of the voice dialogue data are in the order shown in FIG. 32(C), the three associations shown in FIG. 33, “audio data A→E”, “audio data B”, and “audio data C→D”, are output.
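The filtering of step S50 and the ordering of step S51 can be sketched for a single similar speaker set as below. The function `associate` and its arguments are hypothetical, and the recording start times are invented. The sketch keeps only later items of type “other” and does not resolve the case where one item (such as E) belongs to several sets and must be assigned to the most similar starting point.

```python
from typing import Dict, List


def associate(start: str, members: List[str],
              kinds: Dict[str, str], start_times: Dict[str, float]) -> List[str]:
    """Keep the starting "question" plus later members of type "other", in time order."""
    followers = [
        m for m in members
        if m != start
        and kinds[m] == "other"                  # "question" members are excluded
        and start_times[m] > start_times[start]  # must follow the question in time
    ]
    return sorted([start] + followers, key=lambda m: start_times[m])


kinds = {"A": "question", "B": "question", "C": "question", "D": "other", "E": "other"}
start_times = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 3.0, "E": 4.0}  # hypothetical order

chain_a = associate("A", ["A", "B", "C", "E"], kinds, start_times)
chain_c = associate("C", ["C", "D"], kinds, start_times)
```

With these inputs the first set reduces to A followed by E and the second to C followed by D, matching two of the three associations described for FIG. 33.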

[Problem solving time calculation process]
The problem solving time calculation unit 19 receives the output of the dialogue data association unit 17 and calculates the problem solving time based on the recording start time and recording end time recorded in the header of each item of voice dialogue data.

As shown in FIG. 34(A), for “audio data A→E”, the time t1 from the recording start time of voice dialogue data A to the recording end time of voice dialogue data E is taken as the problem solving time. Similarly, as shown in FIG. 34(B), for “voice dialogue data C→D”, the time t2 from the recording start time of voice dialogue data C to the recording end time of voice dialogue data D is taken as the problem solving time. As shown in FIG. 34(C), for “voice dialogue data B”, the time t3 from the recording start time to the recording end time of voice dialogue data B is taken as the problem solving time.
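The calculation of t1, t2, and t3 reduces to a subtraction of timestamps. The sketch below assumes each recording is given as a (start, end) pair of `datetime` values in chronological order; the function name and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Recording = Tuple[datetime, datetime]  # (recording start, recording end)


def problem_solving_time(chain: List[Recording]) -> timedelta:
    """Time from the start of the first recording to the end of the last one."""
    first_start = chain[0][0]
    last_end = chain[-1][1]
    return last_end - first_start


day = datetime(2008, 4, 25)
question = (day.replace(hour=10, minute=0), day.replace(hour=10, minute=5))
callback = (day.replace(hour=14, minute=0), day.replace(hour=14, minute=10))
answered_in_call = (day.replace(hour=11, minute=0), day.replace(hour=11, minute=7))

t1 = problem_solving_time([question, callback])  # question followed by a callback
t3 = problem_solving_time([answered_in_call])    # question answered within one call
```

A chain of one recording covers the case where the question was never associated with a later dialogue, so the same function serves both cases.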

Although the present invention has been described above by way of its embodiments, various modifications are naturally possible within the scope of its gist.

The problem solving time estimation processing apparatus 1 can also be realized as a computer program. This program can be stored on a suitable computer-readable recording medium such as a portable medium memory, a semiconductor memory, or a hard disk, and can be provided either recorded on such a recording medium or by transmission and reception over various communication networks via a communication interface.

Claims

A problem solving time estimation processing program for estimating, from voice dialogue data in which a dialogue between an operator and a customer is recorded, a problem solving time required to solve a problem presented by the customer, the program causing a computer to function as:
a processing unit that inputs a set of voice dialogue data with recording times, each voice dialogue data being composed of a first channel in which the voice of the operator is recorded and a second channel in which the voice of the customer is recorded;
a processing unit that, for each of the voice dialogue data, identifies initiative utterance sections in which one speaker speaks continuously for a predetermined length or longer with a louder voice than the other speaker, identifies the initiative utterance section that appears first as a preceding initiative utterance section, sets the type of the voice dialogue data to question when the preceding initiative utterance section is included in the second channel, and sets the type of the voice dialogue data to non-question when the preceding initiative utterance section is included in the first channel;
a processing unit that, for each of the voice dialogue data, learns speaker features of the voice data of each channel by a predetermined machine learning processing method;
a processing unit that, based on the speaker features, collects voice dialogue data in which the speaker features of the first channel and of the second channel are within a certain similarity range to form a similar speaker set;
a processing unit that, for each similar speaker set, arranges the voice dialogue data according to recording time, takes out one voice dialogue data whose type is question, and associates with it voice dialogue data that is temporally subsequent to that question and whose type is non-question; and
a processing unit that calculates the time from the earliest recording start time to the latest recording end time of the associated set of voice dialogue data and sets it as the problem solving time.

The problem solving time estimation processing program according to claim 1, wherein, when a plurality of voice dialogue data whose type is question exist before the temporally subsequent data, the processing unit that associates the voice dialogue data selects and associates the voice dialogue data having the highest degree of speaker similarity.

The problem solving time estimation processing program according to claim 1, wherein the processing unit that calculates the problem solving time takes out voice dialogue data whose type is question and which is not associated with any other voice dialogue data, calculates the time from the recording start time to the recording end time of that voice dialogue data, and sets it as the problem solving time.

A problem solving time estimation processing device for estimating, from voice dialogue data in which a dialogue between an operator and a customer is recorded, a problem solving time required to solve a problem presented by the customer, the device comprising:
a data input unit that inputs a set of voice dialogue data with recording times, each voice dialogue data being composed of a first channel in which the voice of the operator is recorded and a second channel in which the voice of the customer is recorded;
a dialogue type estimation unit that, for each of the voice dialogue data, identifies initiative utterance sections in which one speaker speaks continuously for a predetermined length or longer with a louder voice than the other speaker, identifies the initiative utterance section that appears first as a preceding initiative utterance section, sets the type of the voice dialogue data to question when the preceding initiative utterance section is included in the second channel, and sets the type of the voice dialogue data to non-question when the preceding initiative utterance section is included in the first channel;
a speaker feature learning unit that, for each of the voice dialogue data, learns speaker features of the voice data of each channel by a predetermined machine learning processing method;
a similar speaker calculation unit that, based on the speaker features, collects voice dialogue data in which the speaker features of the first channel and of the second channel are within a certain similarity range to form a similar speaker set;
a dialogue data association unit that, for each similar speaker set, arranges the voice dialogue data according to recording time, takes out one voice dialogue data whose type is question, and associates with it voice dialogue data that is temporally subsequent to that question and whose type is non-question; and
a problem solving time calculation unit that calculates the time from the earliest recording start time to the latest recording end time of the associated set of voice dialogue data and sets it as the problem solving time.

The problem solving time estimation processing device according to claim 4, wherein, when a plurality of voice dialogue data whose type is question exist before the temporally subsequent data, the dialogue data association unit selects and associates the voice dialogue data having the highest degree of speaker similarity.

The problem solving time estimation processing device according to claim 4, wherein the problem solving time calculation unit takes out voice dialogue data whose type is question and which is not associated with any other voice dialogue data, calculates the time from the recording start time to the recording end time of that voice dialogue data, and sets it as the problem solving time.

A problem solving time estimation processing method executed by a computer to estimate, from voice dialogue data in which a dialogue between an operator and a customer is recorded, a problem solving time required to solve a problem presented by the customer, the method comprising:
a processing step of inputting a set of voice dialogue data with recording times, each voice dialogue data being composed of a first channel in which the voice of the operator is recorded and a second channel in which the voice of the customer is recorded;
a processing step of, for each of the voice dialogue data, identifying initiative utterance sections in which one speaker speaks continuously for a predetermined length or longer with a louder voice than the other speaker, identifying the initiative utterance section that appears first as a preceding initiative utterance section, setting the type of the voice dialogue data to question when the preceding initiative utterance section is included in the second channel, and setting the type of the voice dialogue data to non-question when the preceding initiative utterance section is included in the first channel;
a processing step of, for each of the voice dialogue data, learning speaker features of the voice data of each channel by a predetermined machine learning processing method;
a processing step of, based on the speaker features, collecting voice dialogue data in which the speaker features of the first channel and of the second channel are within a certain similarity range to form a similar speaker set;
a processing step of, for each similar speaker set, arranging the voice dialogue data according to recording time, taking out one voice dialogue data whose type is question, and associating with it voice dialogue data that is temporally subsequent to that question and whose type is non-question; and
a processing step of calculating the time from the earliest recording start time to the latest recording end time of the associated set of voice dialogue data and setting it as the problem solving time.

The problem solving time estimation processing method according to claim 7, wherein, in the processing step of associating the voice dialogue data, when a plurality of voice dialogue data whose type is question exist before the temporally subsequent data, the voice dialogue data having the highest degree of speaker similarity is selected and associated.

The problem solving time estimation processing method according to claim 7 or 8, wherein, in the processing step of calculating the problem solving time, voice dialogue data whose type is question and which is not associated with any other voice dialogue data is taken out, the time from the recording start time to the recording end time of that voice dialogue data is calculated, and that time is set as the problem solving time.