JP6590617B2 - Information processing method and apparatus

Info

Publication number: JP6590617B2
Application number: JP2015188907A
Authority: JP
Inventors: 晋太木村; 淳宏桜井; 斎藤　稔; 稔斎藤
Original assignee: 株式会社アニモ
Priority date: 2015-09-25
Filing date: 2015-09-25
Publication date: 2019-10-16
Anticipated expiration: 2035-09-25
Also published as: JP2017062749A

Description

The present invention relates to a technique for generating a dialogue corpus of dialogues between an automatic dialogue system and a human.

It is common practice to record dialogues between an automatic dialogue system and humans to generate a dialogue corpus, and to analyze that corpus in order to reveal human characteristics and problems in the automatic dialogue system, thereby improving the quality of the dialogue the system provides. Indicators of dialogue quality here include efficiency and time, ease of understanding, enjoyment, and comfort.

As a conventional technique for generating a dialogue corpus, there is a technique that sequentially outputs utterance texts based on a script of pre-registered dialogue examples, simultaneously registers them in a dialogue log database that holds the dialogue history as dialogue log data, records the speaker's voice data produced in response to the synthesized speech output of each utterance text, and later converts the speaker's part of the dialogue in the dialogue log database into text based on that voice data.

This conventional technique also records the output start time and output end time of the synthesized speech, but those times are recorded directly by the unit that performs the speech synthesis output.

In general, however, a buffer exists inside the unit that performs speech synthesis output (more precisely, inside the device driver for speech synthesis output), so a difference of several tens to several hundreds of milliseconds arises between the time at which the speech synthesis output event occurs and the time at which the sound is actually output to the human. Because this difference fluctuates with the load on the automatic dialogue system, the actual output time cannot be determined from the logged event occurrence time of the speech synthesis output.

In evaluating spoken dialogue, it is important to know the time from the sound output of the automatic dialogue system to the human's reaction, but the conventional technique cannot measure that time correctly.

Japanese Unexamined Patent Application Publication No. 2001-166785 (JP 2001-166785 A)

Accordingly, an object of the present invention, in one aspect, is to provide a technique for accurately determining the output time of sound from an automatic dialogue system.

An information processing method according to the present invention includes (A) a specifying step of specifying the output time of an output sound from an automatic dialogue system from recording data obtained by recording the output sound from the automatic dialogue system and a user's voice, and (B) a step of storing the specified output time of the output sound in a data storage unit in association with the data of the event relating to that output sound from the automatic dialogue system.

According to one aspect, the output time of sound from the automatic dialogue system can be determined accurately.

FIG. 1 is a diagram illustrating a configuration example of a system according to an embodiment.
FIG. 2 is a diagram illustrating an example of data.
FIG. 3 is a diagram illustrating an example of event data before correction.
FIG. 4 is a diagram illustrating a processing flow according to the embodiment.
FIG. 5 is a diagram illustrating an example of event data after correction.

FIG. 1 shows a configuration example of a system according to an embodiment of the present invention.

The system according to the present embodiment includes an automatic dialogue system 100 and a recording system 200 connected to the automatic dialogue system 100 via a network or the like. The automatic dialogue system 100 includes a dialogue control unit 101, a dialogue script storage unit 102, a touch panel display unit 103, a touch panel input unit 104, a touch panel 105, a speech recognition unit 106, a microphone 107, a speech synthesis unit 108, a sound effect reproduction unit 109, a speaker 110, a voice DB 111, a sound effect DB 112, an event recording unit 113, and an output unit 114.

The dialogue script storage unit 102 stores data representing dialogue scenarios. A dialogue scenario is, for example, "display a button on the touch panel and, when the button is pressed, play a 'ping-pong' sound effect".

The dialogue control unit 101 instructs the touch panel display unit 103, the speech recognition unit 106, the speech synthesis unit 108, the sound effect reproduction unit 109, and the like to perform processing according to the dialogue scenario stored in the dialogue script storage unit 102, and performs control in line with the dialogue scenario based on input data from the touch panel input unit 104, the speech recognition unit 106, and the like.

The touch panel display unit 103 displays images and text on the touch panel 105 based on instructions from the dialogue control unit 101. The touch panel input unit 104 outputs the touch content entered on the touch panel 105 to the dialogue control unit 101. The speech recognition unit 106 takes in voice data from the microphone 107 based on an instruction from the dialogue control unit 101, converts the voice data into text data by speech recognition, and outputs the text data to the dialogue control unit 101. In the present embodiment, the microphone 107 captures not only the user's utterances but also the sound output from the speaker 110 as sound data, and outputs that sound data to the speech recognition unit 106 and the output unit 114.

The speech synthesis unit 108 causes the speaker 110 to output the voice data of a voice message stored in the voice DB 111 based on an instruction from the dialogue control unit 101. The sound effect reproduction unit 109 causes the speaker 110 to output a sound effect stored in the sound effect DB 112 based on an instruction from the dialogue control unit 101.

The event recording unit 113 receives, from the touch panel display unit 103, the touch panel input unit 104, the speech recognition unit 106, the speech synthesis unit 108, the sound effect reproduction unit 109, and the like, data on the processing content and processing time (start time and end time) of each event they processed in response to an instruction from the dialogue control unit 101 (this data is also called event data), records it, and outputs it to the output unit 114. The processing content is, for example, message output, sound effect output, data display, data input, or voice input (more specifically, execution of speech recognition).
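As a rough illustration of the event data described above, one record could be modeled as follows. This is only a sketch: the names `EventKind`, `Event`, `content`, `start`, and `end` are assumptions made for the example, since the description specifies only that each record carries a processing content and a start and end time.

```python
from dataclasses import dataclass
from enum import Enum


class EventKind(Enum):
    # Hypothetical labels for the processing contents listed in the description.
    MESSAGE_OUTPUT = "message output"
    SOUND_EFFECT_OUTPUT = "sound effect output"
    DATA_DISPLAY = "data display"
    DATA_INPUT = "data input"
    VOICE_INPUT = "voice input"


# Events whose sound leaves the speaker (the ones affected by buffering delay).
SOUND_OUTPUT_KINDS = {EventKind.MESSAGE_OUTPUT, EventKind.SOUND_EFFECT_OUTPUT}


@dataclass
class Event:
    kind: EventKind
    content: str  # e.g. a message ID, a sound effect ID, or recognized text
    start: float  # start time logged by the processing unit, in seconds
    end: float    # end time logged by the processing unit, in seconds

    def is_sound_output(self) -> bool:
        return self.kind in SOUND_OUTPUT_KINDS

    def is_sound(self) -> bool:
        # Sound events are sound outputs plus voice input.
        return self.is_sound_output() or self.kind is EventKind.VOICE_INPUT
```

Keeping the event kind explicit makes it straightforward to separate sound output events, whose logged times may lag the real output, from display and touch events.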

Based on the event data from the event recording unit 113, the output unit 114 reads the voice data stored in the voice DB 111 and the sound effect data stored in the sound effect DB 112, and outputs them to the recording system 200 together with the sound data from the microphone 107 (hereinafter called recording data) and the event data.

As described above, the automatic dialogue system 100 according to the present embodiment differs from the conventional one mainly in two points: the microphone 107 captures not only the user's utterances but also the sound output from the speaker 110 as sound data to generate the recording data, and the recording data, the event data, and the sound data (the voice data stored in the voice DB 111 and the sound effect data stored in the sound effect DB 112) are output to the recording system 200.

The recording system 200 includes a data acquisition unit 201, a recording data storage unit 202, an event data storage unit 203, a sound data storage unit 204, a selection unit 205, a collation unit 206, a data correction unit 207, an extraction unit 208, a speech recognition unit 209, and a corrected event data storage unit 210.

The data acquisition unit 201 acquires the recording data, event data, and sound data from the output unit 114 of the automatic dialogue system 100, and stores the recording data in the recording data storage unit 202, the event data in the event data storage unit 203, and the sound data in the sound data storage unit 204.

The selection unit 205 reads the sound data stored in the sound data storage unit 204 based on the event data stored in the event data storage unit 203, and outputs it to the collation unit 206. The collation unit 206 collates the recording data stored in the recording data storage unit 202 with the sound data, specifies the actual start time and end time of the sound output from the matching portion, and outputs them to the data correction unit 207. The data correction unit 207 corrects the event data output from the selection unit 205 based on the actual start time and end time of the sound output received from the collation unit 206, and stores the result in the corrected event data storage unit 210.

The collation unit 206 also outputs the actual start time and end time of the sound output to the extraction unit 208. The extraction unit 208 then extracts, from the recording data, the portion in which neither the voice data of a voice message nor the sound data of a sound effect is present and in which voice is recorded, and outputs the recording data of the extracted portion to the speech recognition unit 209.

The speech recognition unit 209 performs speech recognition processing (including speaker recognition) on the recording data output from the extraction unit 208, and outputs the processing result to the data correction unit 207. The data correction unit 207 corrects the event data from the selection unit 205 based on the processing result of the speech recognition unit 209, and stores it in the corrected event data storage unit 210.

More specific processing is described below with reference to FIGS. 2 to 5.

The following describes, as an example, the case where event data as shown in FIG. 2(a) and recording data as shown in FIG. 2(b) are acquired. FIG. 2(a) represents the event data shown in FIG. 3 arranged in time series.

Specifically, a "message 1 output" event, which outputs message 1 "Please say your address after the 'pon' sound.", starts at time a and ends at time b. Next, a "sound effect x output" event, which outputs the sound effect x "pon", starts at time c and ends at time d. After that, the user performs voice input, and a "voice input 1" event starts at time e and ends at time f. Because the speech recognition unit 106 of the automatic dialogue system 100 performs speech recognition processing, the content A of voice input 1, "Um, it's 2-27 Onoe-cho, Naka-ku, Yokohama.", is obtained as the recognition result.

Comparing the event timing shown in FIG. 2(a) with the timing of the audio waveform shown in FIG. 2(b) shows that, for the voice output by the speech synthesis unit 108 and the sound effect output by the sound effect reproduction unit 109, the actual sound output occurs later than the start and end times those units recorded.

For voice input, on the other hand, the speech recognition unit 106 specifies the start time and end time of the voice input before executing speech recognition processing, so in this example the start time and end time match the recording data exactly.

In the present embodiment, as shown in FIG. 2(c), the recording system 200 corrects the start time and end time of the "message 1 output" event and the "sound effect x output" event based on the recording data. It also corrects the speech recognition result.

FIG. 4 shows the processing flow for this. First, the data acquisition unit 201 receives the uncorrected event data, the recording data, and the sound data from the output unit 114 of the automatic dialogue system 100, and stores them in the event data storage unit 203, the recording data storage unit 202, and the sound data storage unit 204, respectively (step S1).

The selection unit 205 then reads the event data of one unprocessed event from the event data storage unit 203 (step S3), and determines whether the event type of the read event data relates to sound (step S5). The event data may also include data of events relating to display on the touch panel 105 or input to the touch panel 105. Therefore, when the event type of the read event data does not relate to sound, the selection unit 205 outputs the read event data unchanged to the data correction unit 207, which stores it as is in the corrected event data storage unit 210 (step S21). The processing then proceeds to step S23.

If, on the other hand, the event type of the read event data relates to sound, the selection unit 205 determines whether it relates to sound output (step S7). A sound output event is either a voice output event produced by speech synthesis or a sound effect output event.

If the event type of the read event data relates to sound output, the selection unit 205 extracts the sound data corresponding to that event from the sound data storage unit 204 and outputs it to the collation unit 206 (step S9). The selection unit 205 also outputs the read event data to the data correction unit 207.

The collation unit 206 reads the recording data from the recording data storage unit 202, collates it with the sound data received from the selection unit 205, and specifies the matching portion in the recording data (step S11). That is, the collation unit 206 specifies the start time and end time of the sound output within the recording data and outputs them to the data correction unit 207. The start time and end time of the matching portion in the recording data are also output to the extraction unit 208.
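The description does not say how the collation in step S11 is carried out. One common way to locate a known sound inside a recording is normalized cross-correlation, sketched below in plain Python under that assumption. The function names, the `threshold` parameter, and the representation of audio as lists of float samples are all hypothetical choices for this example.

```python
import math


def find_best_match(recording, template):
    """Return (index, score) of the window that best matches the template.

    The score is the normalized cross-correlation, in the range [-1, 1].
    """
    n = len(template)
    if n == 0 or n > len(recording):
        raise ValueError("template must be non-empty and no longer than recording")
    t_norm = math.sqrt(sum(x * x for x in template))
    if t_norm == 0.0:
        raise ValueError("template is silent")
    best_index, best_score = 0, -1.0
    for i in range(len(recording) - n + 1):
        window = recording[i:i + n]
        w_norm = math.sqrt(sum(x * x for x in window))
        if w_norm == 0.0:
            continue  # a silent window cannot match
        score = sum(a * b for a, b in zip(window, template)) / (w_norm * t_norm)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


def locate_output(recording, template, sample_rate, threshold=0.8):
    """Return (start, end) in seconds of the template in the recording, or None."""
    index, score = find_best_match(recording, template)
    if score < threshold:
        return None  # no sufficiently similar portion was found
    return index / sample_rate, (index + len(template)) / sample_rate
```

This brute-force search costs time proportional to the product of the two lengths. A practical system would more likely use an FFT-based correlation and restrict the search to a window shortly after the logged start time.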

The data correction unit 207 corrects the event data received from the selection unit 205 with the start time and end time of the sound output (sound effect or voice output) in the recording data, and stores the result in the corrected event data storage unit 210 (step S13). The processing then proceeds to step S23.

For the first event in FIG. 3, message 1 "Please say your address after the 'pon' sound." is output, so the sound data of message 1 is collated with the recording data (FIG. 2(b)), the start time aa and end time bb of the matching portion are specified, and the first event data is corrected and stored in the corrected event data storage unit 210. It is corrected, for example, as shown in the first event data in FIG. 5.

Similarly, for the second event in FIG. 3, sound effect x is output, so the sound data of sound effect x is collated with the recording data (FIG. 2(b)), the start time cc and end time dd of the matching portion are specified, and the second event data is corrected and stored in the corrected event data storage unit 210. It is corrected, for example, as shown in the second event data in FIG. 5.

If, on the other hand, the event type of the read event data relates to sound input rather than sound output, the selection unit 205 outputs the read event data to the data correction unit 207 and the extraction unit 208. The extraction unit 208 then extracts, from the recording data stored in the recording data storage unit 202, the recording data other than the sound effect portions and output voice portions, and outputs it to the speech recognition unit 209 (step S15). More specifically, it extracts the recording data that lies between approximately the start time and approximately the end time contained in the event data being processed, that contains none of the sound effects or output voice specified by the collation unit 206, and in which voice is recorded. This improves the recognition accuracy of the speech recognition processing.
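The time-range part of step S15 can be sketched as interval subtraction: take the window around the voice input event and remove the intervals in which the system's own sound was found. The function name, the `margin` parameter (standing in for "approximately"), and the tuple representation are assumptions for this example. Checking that voice is actually present in the remaining intervals is not shown.

```python
def user_speech_intervals(event_start, event_end, system_outputs, margin=0.0):
    """Return the parts of the event window not covered by system output.

    system_outputs is a list of (start, end) tuples in seconds, as located in
    the recording. margin widens the event window on both sides.
    """
    window_start = event_start - margin
    window_end = event_end + margin
    result = []
    cursor = window_start
    for out_start, out_end in sorted(system_outputs):
        if out_end <= cursor or out_start >= window_end:
            continue  # this output does not overlap the remaining window
        if out_start > cursor:
            result.append((cursor, out_start))
        cursor = max(cursor, out_end)
    if cursor < window_end:
        result.append((cursor, window_end))
    return result
```

For instance, with a voice input event logged from 5.0 s to 9.0 s and a sound effect found from 4.0 s to 5.5 s, only the interval from 5.5 s to 9.0 s would be passed on for recognition.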

The speech recognition unit 209 then performs speech recognition processing on the recording data from the extraction unit 208, and outputs the processing result to the data correction unit 207 (step S17). The speech recognition processing is assumed to include well-known speaker recognition processing and the like.

The data correction unit 207 then corrects the event data received from the selection unit 205 based on the speech recognition result from the speech recognition unit 209, and stores it in the corrected event data storage unit 210 (step S19). The processing then proceeds to step S23.

In the example of FIG. 2, the speech recognition result "Um, it's 2-27 Onoe-cho, Naka-ku, Yokohama." is obtained, which is the same as the result in the event data. However, if speaker recognition identifies the speaker as user A, the speaker recognition result "user A" is stored as well, so in the example of FIG. 5 the content of the third event data is corrected to "AA".

The selection unit 205 determines whether unprocessed event data remains in the event data storage unit 203 (step S23). If unprocessed event data remains, the processing returns to step S3. If none remains, the processing ends.
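The branching of steps S3 to S23 can be summarized as a simple dispatch loop. The sketch below is an assumption-laden outline, not the patented implementation: events are plain dictionaries with a hypothetical `kind` key, and `correct_output` and `recognize_input` are caller-supplied callbacks standing in for steps S9 to S13 and S15 to S19.

```python
# Hypothetical event kind labels used only in this sketch.
OUTPUT_KINDS = {"message output", "sound effect output"}
SOUND_KINDS = OUTPUT_KINDS | {"voice input"}


def process_events(events, correct_output, recognize_input):
    """Return corrected copies of the events, following the flow of FIG. 4."""
    corrected = []
    for event in events:  # S3: read one event; S23: repeat until none remain
        kind = event["kind"]
        if kind not in SOUND_KINDS:  # S5: not a sound event
            corrected.append(dict(event))  # S21: store unchanged
        elif kind in OUTPUT_KINDS:  # S7: sound output event
            corrected.append(correct_output(event))  # S9 to S13
        else:  # sound input event
            corrected.append(recognize_input(event))  # S15 to S19
    return corrected
```

Passing the two corrections in as callbacks keeps the control flow separate from the signal processing, which also makes the loop easy to test with stubs.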

Executing the processing described above makes it possible to generate event data in which the sound output delay caused by the buffers inside the speech synthesis unit 108 and the sound effect reproduction unit 109 has been corrected.

If such accurate event data can be generated automatically, it becomes easier to pinpoint problems in the dialogue of the automatic dialogue system and therefore easier to improve the quality of that dialogue.

An embodiment of the present invention has been described above, but the present invention is not limited to it. The system configuration shown in FIG. 1 is an example and may not match the program module configuration or the file configuration. The automatic dialogue system 100 and the recording system 200 may be integrated. In the processing flow as well, the order of the steps may be changed, or steps may be executed in parallel, as long as the processing result does not change.

The event data has been described as containing a start time and an end time, but it may instead contain a start time and a length, or an end time and a length. Furthermore, the example above discards the data before correction, but that data may be kept rather than discarded. The times may also be relative times measured from some reference point.
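The alternative representations mentioned above carry the same information, as the following small sketch shows. The function name and keyword arguments are assumptions for illustration. Any two of start, end, and length determine the third.

```python
def normalize_span(start=None, end=None, length=None):
    """Return (start, end) given exactly two of start, end, and length."""
    given = sum(value is not None for value in (start, end, length))
    if given != 2:
        raise ValueError("exactly two of start, end, and length are required")
    if start is None:
        start = end - length  # end time and length
    elif end is None:
        end = start + length  # start time and length
    if end < start:
        raise ValueError("end must not precede start")
    return start, end
```

Normalizing to one form at the point where event data is read means the rest of the processing needs to handle only a single representation.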

The automatic dialogue system 100 and the recording system 200 described above are computer devices in which a memory, a CPU (Central Processing Unit), a hard disk drive (HDD), a display control unit connected to a display device, a drive device for a removable disk, an input device, and a communication control unit for connecting to a network are connected by a bus. An operating system (OS) and an application program for carrying out the processing of this embodiment are stored in the HDD and are read from the HDD into the memory when executed by the CPU. The CPU controls the display control unit, the communication control unit, and the drive device according to the processing content of the application program so that they perform predetermined operations. Data in the middle of processing is stored mainly in the memory, but may be stored in the HDD. In the embodiment of the present invention, the application program for carrying out the processing described above is stored and distributed on a computer-readable removable disk and installed from the drive device onto the HDD. It may also be installed onto the HDD via a network such as the Internet and the communication control unit. Such a computer device realizes the various functions described above through the organic cooperation of hardware such as the CPU and memory with programs such as the OS and the application program.

The embodiment described above can be summarized as follows.

An information processing method according to the embodiment includes (A) a specifying step of specifying the output time of an output sound from an automatic dialogue system (for example, the output start time and output end time, whichever of the two is more important, or either of them together with the length) from recording data obtained by recording the output sound from the automatic dialogue system and a user's voice, and (B) a step of storing the specified output time of the output sound in a data storage unit in association with the data of the event relating to that output sound from the automatic dialogue system.

Specifying the output time of the output sound from the automatic dialogue system on the basis of the recording data makes it possible to identify when the user actually heard the output sound, so the dialogue can be analyzed in a way that more closely reflects what actually happened.
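To see why this matters for analysis, consider the user's reaction time, measured from the end of the system's sound to the start of the user's speech. The numbers below are invented purely for illustration and do not come from the patent.

```python
def reaction_time(output_end, user_start):
    """Time in seconds from the end of the system output to the user's reply."""
    return user_start - output_end


# Hypothetical values: the logged end time precedes the real one by 0.25 s.
logged_output_end = 2.25  # end time recorded by the output unit
actual_output_end = 2.5   # end time located in the recording
user_speech_start = 3.0   # start time of the user's voice input

apparent = reaction_time(logged_output_end, user_speech_start)
actual = reaction_time(actual_output_end, user_speech_start)
print(f"apparent reaction time: {apparent} s, actual: {actual} s")
```

With these made-up values the log alone would overstate the user's reaction time by the size of the buffering delay, which is the kind of error the corrected event data is meant to remove.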

The information processing method described above may further include (C) a step of extracting the data of the user's voice from the recording data by using data on the output sound from the automatic dialogue system, and (D) a step of performing speech recognition processing on the extracted data of the user's voice and storing the result of the speech recognition processing in the data storage unit in association with the data of the event relating to the user's voice. This improves the accuracy of the speech recognition processing. Speaker recognition may also be performed at the same time.

The specifying step described above may include (a1) a step of acquiring the data of the output sound from the automatic dialogue system based on the output sound type contained in the data of the event relating to the output sound from the automatic dialogue system, and (a2) a step of identifying the output sound from the automatic dialogue system within the recording data by comparison with the data of that output sound. This allows the timing to be specified accurately.

A program for causing a computer to perform the above method can be created, and the program is stored in a computer-readable storage medium or storage device such as a flexible disk, a CD-ROM, a magneto-optical disk, a semiconductor memory, or a hard disk. Intermediate processing results are temporarily held in a storage device such as a main memory.

Description of reference signs

200 Recording system
201 Data acquisition unit
202 Recording data storage unit
203 Event data storage unit
204 Sound data storage unit
205 Selection unit
206 Collation unit
207 Data correction unit
208 Extraction unit
209 Speech recognition unit
210 Corrected event data storage unit

Claims

1. A program that causes a computer to execute:
a specifying step of specifying the output time of an output sound from an automatic dialogue system from recording data obtained by recording the output sound from the automatic dialogue system and a user's voice; and
a step of storing the specified output time of the output sound in a data storage unit in association with, among data of events that is output from the automatic dialogue system and includes the processing type of each event, the data of the event relating to the output sound from the automatic dialogue system.

2. The program according to claim 1, further causing the computer to execute:
a step of extracting data of the user's voice from the recording data by using data on the output sound from the automatic dialogue system; and
a step of performing speech recognition processing on the extracted data of the user's voice, and storing the result of the speech recognition processing in the data storage unit in association with the data of the event relating to the user's voice.

3. The program according to claim 1 or 2, wherein the specifying step includes:
a step of acquiring data of the output sound from the automatic dialogue system based on the output sound type contained in the data of the event relating to the output sound from the automatic dialogue system; and
a step of identifying the output sound from the automatic dialogue system within the recording data by comparison with the data of the output sound from the automatic dialogue system.

4. An information processing method in which a computer executes:
a specifying step of specifying the output time of an output sound from an automatic dialogue system from recording data obtained by recording the output sound from the automatic dialogue system and a user's voice; and
a step of storing the specified output time of the output sound in a data storage unit in association with, among data of events that is output from the automatic dialogue system and includes the processing type of each event, the data of the event relating to the output sound from the automatic dialogue system.

5. An information processing apparatus comprising:
means for specifying the output time of an output sound from an automatic dialogue system from recording data obtained by recording the output sound from the automatic dialogue system and a user's voice; and
means for storing the specified output time of the output sound in a data storage unit in association with, among data of events that is output from the automatic dialogue system and includes the processing type of each event, the data of the event relating to the output sound from the automatic dialogue system.