JP2007129626A - System for conversation between remote places

Info

Publication number: JP2007129626A
Application number: JP2005322141A
Authority: JP
Inventors: Satoshi Koizumi (小泉 智史); Masahiro Shiomi (塩見 昌裕); Takayuki Kanda (神田 崇行); Takahiro Miyashita (宮下 敬宏); Hiroshi Ishiguro (石黒 浩)
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2005-11-07
Filing date: 2005-11-07
Publication date: 2007-05-24
Anticipated expiration: 2025-11-07
Also published as: JP4735965B2

Abstract

PROBLEM TO BE SOLVED: To provide a system for conversation between remote places in which proper pauses can be taken so that a smooth conversation is established.

SOLUTION: A system 10 for conversation between remote places includes two conversation devices 12, which exchange speech with each other via, for example, a network, and a speech timing control server 14. When a silent state is detected in a conversation, the server 14 collates the condition of the current pause against a plurality of pause patterns stored in a pause pattern DB 94. When a pause pattern matching the condition of the current pause is found, a speech corresponding to this pause pattern is output from the conversation device 12 corresponding to the pause pattern. When that conversation device 12 is a robot 12b capable of physical movement, it is also caused to make a gesture corresponding to the pause pattern.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a remote conversation system, and more particularly to a remote conversation system in which, for example, the voices of interlocutors at remote locations are exchanged over a network.

When a conversation is held between remote locations, transmission delay lengthens the blank time in the conversation and makes utterances hard to time, so the utterances of the two interlocutors tend to overlap. Conventionally, no technique existed for preventing such overlapping utterances and producing a smooth conversation between remote locations.

Patent Document 1, for example, discloses a technique for detecting overlap between the voice output by a voice input/output device and the voice input by a user's utterance.

Patent Documents 2 and 3 disclose examples of techniques that do not merely output the conversation voice but also place a robot in front of the interlocutor and have it perform gestures. In the technique of Patent Document 2, the robot on the speaker's side gestures based on the speaker's voice, while the robot on the listener's side gestures based on the voice received on the listener's side, which heightens the sense of actually conversing. In the technique of Patent Document 3, a gesture is reproduced by the robot on the other side in response to the transmission of gesture information from the speaker's side.

Patent Document 1: JP 7-264103 A
Patent Document 2: JP 2000-349920 A
Patent Document 3: JP 2001-156930 A

In the technique of Patent Document 1, overlap between the output voice and the input voice is detected and the operation of the echo canceller is changed, but the speaker's utterance timing cannot be controlled. The techniques of Patent Documents 2 and 3 aim at a smooth conversation by having a robot placed in front of the interlocutor gesture, but they also cannot control the speaker's utterance timing. Because the prior art could not control utterance timing, it could not cope with abnormally long blank times that delay introduces into a conversation. Consequently, it could not prevent the utterances of the two interlocutors from overlapping, and a smooth conversation could not be achieved.

Therefore, a main object of the present invention is to provide a remote conversation system that can give appropriate pauses to a conversation between remote locations and thereby achieve a smooth conversation.

The invention of claim 1 is a system for holding a conversation between remote locations that includes two conversation devices connected via a network. Each conversation device includes acquisition means for acquiring voice, transmission means for transmitting the voice acquired by the acquisition means to the other conversation device, reception means for receiving the voice transmitted from the other conversation device, and output means for outputting the voice received by the reception means. The system comprises: pause pattern storage means for storing a plurality of pause patterns, each including at least information on a blank time and on features of uttered speech; history recording means for recording a history of the conversation state, including at least the voice acquisition state and the voice output state of each conversation device; collation means for collating, when both conversation devices are determined to be in a no-utterance state based on the history recorded by the history recording means, the situation of the pause, which includes at least the blank time and the features of the speech uttered before the blank, against the plurality of pause patterns; and pause control means for outputting, when the collation by the collation means finds a matching pause pattern, a predetermined voice corresponding to that pause pattern from the output means of the conversation device corresponding to that pause pattern.

In the invention of claim 1, the remote conversation system includes two conversation devices, and interlocutors at remote locations converse with each other because the voice acquired by each conversation device is transmitted and output on the other side. The pause pattern storage means stores a plurality of pause patterns. A pause pattern indicates an appropriate way of pausing in a conversation and includes at least information on a blank time and on features of uttered speech. For example, the features of uttered speech may include the fundamental frequency (pitch), the amplitude, and the average duration of syllables. The history recording means records a history of the conversation state, including at least the voice acquisition state and the voice output state of each conversation device. In the embodiment described later, the conversation state history is an utterance flag table, in which the presence or absence of speech at each time (SPEAKING flag, SILENT flag), the processing state (RECORDING flag, INTERPOLATING flag), and the like are recorded.

When both conversation devices are determined to be in a no-utterance state, the collation means collates the situation of the pause, which includes at least the blank time and the features of the speech uttered before the blank, against the plurality of pause patterns. That is, when the conversation is silent, it is checked whether the current pause situation matches any of the plurality of pause patterns. When the collation finds a matching pause pattern, the pause control means outputs a predetermined voice corresponding to that pause pattern from the output means of the conversation device corresponding to that pause pattern. The pause pattern storage means, history recording means, collation means, and pause control means may be provided in either of the two conversation devices or in another device included in the system (the speech timing control server in the embodiment). Alternatively, these means may be distributed between the two conversation devices.
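The collation step described above can be illustrated with a minimal Python sketch. It assumes that each stored pattern holds an inclusive range per feature and that a pattern matches when every feature of the current pause falls inside its range; the source does not specify the matching rule, so this rule, the class names, the field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Range = Tuple[float, float]  # inclusive (low, high) bounds; an assumed representation


@dataclass(frozen=True)
class PauseSituation:
    """Current pause: blank time plus features of the speech before the blank."""
    blank_time_s: float
    pitch_hz: float
    amplitude: float
    syllable_duration_s: float


@dataclass(frozen=True)
class PausePattern:
    """Hypothetical stored pattern; every field name here is an assumption."""
    blank_time_s: Range
    pitch_hz: Range
    amplitude: Range
    syllable_duration_s: Range
    target_device: str  # conversation device that should output the voice
    voice: str          # predetermined voice associated with the pattern


def _within(value: float, bounds: Range) -> bool:
    low, high = bounds
    return low <= value <= high


def find_matching_pattern(
    situation: PauseSituation,
    patterns: Sequence[PausePattern],
    both_silent: bool,
) -> Optional[PausePattern]:
    """Return the first pattern whose ranges all contain the situation, else None."""
    if not both_silent:  # collation only runs in the no-utterance state
        return None
    for p in patterns:
        if (_within(situation.blank_time_s, p.blank_time_s)
                and _within(situation.pitch_hz, p.pitch_hz)
                and _within(situation.amplitude, p.amplitude)
                and _within(situation.syllable_duration_s, p.syllable_duration_s)):
            return p
    return None
```

Returning the first match keeps the sketch simple; a real system might instead score the candidates and pick the closest one.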

According to the invention of claim 1, when a silent state is detected in the conversation, a voice corresponding to a pause pattern that provides an appropriate pause can be output. Overlapping utterances can therefore be prevented, and a smooth conversation can be established.

The invention of claim 2 depends on the invention of claim 1 and further comprises delayed reproduction means for recording one of the voices when it is determined, based on the history recorded by the history recording means, that utterances have overlapped at both conversation devices, and for outputting the recorded voice from the output means of the other conversation device once the utterance has ended.

In the invention of claim 2, when utterances have overlapped, the delayed reproduction means records one of the voices and, once the utterance has ended, outputs the recorded voice from the other conversation device. This delayed reproduction means may also be provided in either of the two conversation devices or in another device included in the system. Therefore, even if utterances do overlap, both utterances are prevented from being output simultaneously on the other side.
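As a rough illustration of delayed reproduction, the following Python sketch processes one voice frame per side per time step. It assumes that interlocutor B's voice is the one held back and that the recording is released once both sides are silent; the source leaves both choices open, and the class and method names are hypothetical.

```python
from typing import List, Optional, Tuple


class DelayedPlayback:
    """Holds back one side's voice while utterances overlap (illustrative only)."""

    def __init__(self) -> None:
        self._recorded: List[str] = []  # frames of B's voice being held back

    def step(
        self, frame_a: Optional[str], frame_b: Optional[str]
    ) -> Tuple[List[str], List[str]]:
        """Return (frames to output at B's device, frames to output at A's device)."""
        out_for_b = [frame_a] if frame_a is not None else []
        out_for_a: List[str] = []
        if frame_b is not None and (frame_a is not None or self._recorded):
            # Overlap, or B is still finishing an overlapped utterance: record it.
            self._recorded.append(frame_b)
        elif frame_b is not None:
            # No overlap: B's voice passes straight through to A.
            out_for_a = [frame_b]
        elif frame_a is None and self._recorded:
            # Both sides are silent: release the recording to A's device.
            out_for_a, self._recorded = self._recorded, []
        return out_for_b, out_for_a
```

Feeding the steps `("a1", None)`, `("a2", "b1")`, `(None, "b2")`, `(None, None)` passes A's frames through immediately and delivers `["b1", "b2"]` to A's device only at the final, silent step.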

The invention of claim 3 depends on the invention of claim 1 or 2, wherein, when at least one of the conversation devices is a robot capable of performing gestures, the pause control means causes that conversation device to perform a predetermined gesture corresponding to the pause pattern in addition to outputting the voice.

In the invention of claim 3, at least one of the conversation devices may be a robot capable of performing gestures. The pause control means causes the robot to output the voice corresponding to the pause pattern and to perform the gesture corresponding to the pause pattern. Because both voice and gesture are used to give the conversation an appropriate pause, a smoother conversation between remote locations can be established.
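The voice-plus-gesture behavior can be sketched as follows. This is a minimal Python illustration in which a device records what it was asked to do; the `ConversationDevice` class, its `can_gesture` flag, and the example voice and gesture strings are assumptions, not part of the source.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ConversationDevice:
    """Stand-in for a conversation device; it only logs the requested actions."""
    name: str
    can_gesture: bool = False  # True for a robot such as 12b, False for a PC such as 12a
    log: List[Tuple[str, str]] = field(default_factory=list)

    def play_voice(self, voice: str) -> None:
        self.log.append(("voice", voice))

    def perform_gesture(self, gesture: str) -> None:
        self.log.append(("gesture", gesture))


def apply_pause_action(
    device: ConversationDevice, voice: str, gesture: Optional[str]
) -> None:
    """Output the pattern's voice, and its gesture only if the device is a robot."""
    device.play_voice(voice)
    if gesture is not None and device.can_gesture:
        device.perform_gesture(gesture)
```

With this shape, the same pause pattern can be applied to either device type: a computer simply ignores the gesture part.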

According to the present invention, words are inserted to create an appropriate pause when a silent state is detected in the conversation, so the blank time of the conversation can be kept to an appropriate length. This avoids the situation in which delay lengthens the blank time and makes the interlocutors feel uneasy. The interlocutors can therefore time their utterances more easily, so overlapping utterances can be prevented and a smooth conversation can be established.

The above object and other objects, features, and advantages of the present invention will become more apparent from the following detailed description of the embodiments, given with reference to the drawings.

Referring to FIG. 1, a remote conversation system (hereinafter also simply called the "system") 10 of this embodiment allows interlocutors at remote locations to converse with each other. The system 10 includes at least two conversation devices 12 (12a, 12b). The two conversation devices 12 are connected via a network, for example the public Internet, and exchange voice data of the speech uttered by interlocutor A on the conversation device 12a side and interlocutor B on the conversation device 12b side. The system 10 of this embodiment also includes a speech timing control server (hereinafter also simply called the "server") 14, which is communicably connected to the at least two conversation devices 12 via the network.

This embodiment describes the case where a computer 12a is used as one conversation device 12 and a communication robot (hereinafter also simply called the "robot") 12b is used as the other conversation device 12.

The conversation device 12a includes a microphone 16 and a speaker 18. The conversation device 12a is, for example, a personal computer and includes a CPU, a main memory, a communication device, an input device, and the like. The main memory stores the programs and data needed for the computer to function as the conversation device 12 of the present invention. The programs and data may be stored in the main memory in advance in a fixed manner, or may be acquired from an information storage medium or the network. In accordance with the programs, the CPU executes the processing for the conversation while generating or acquiring temporary data in the working area of the main memory.

The microphone 16 acquires the voice uttered by the interlocutor; the voice is converted into data by a voice input/output board and stored in the main memory as voice data. The speaker 18 outputs the voice of the conversation partner and the voices provided by the system 10. The CPU passes received voice data to the voice input/output board, and the voice is output from the speaker 18. The communication device exchanges data with the other conversation device 12 and the server 14 via the network. The input device is a keyboard, a pointing device, or the like.

In this embodiment, the robot 12b is used as the conversation device 12 on the other side, so the user can instruct the gestures of the robot 12b with the input device. A gesture of the robot 12b may be selected from a list displayed on the screen of a display device, or gestures may be assigned to individual keys of the input device. By instructing a motion with the input device while speaking, interlocutor A can have his or her own voice output from the robot 12b on the other side and can also have the robot 12b perform a desired gesture.

The conversation device 12a may be any computer capable of voice input/output and communication. It is not limited to a PC and may be another computer such as a game console, a mobile phone, or a portable game console.

The other conversation device 12b is a robot with human-like body parts and can present predetermined gestures to interlocutor B by moving those body parts. The robot 12b includes a microphone 20 and a speaker 22. In detail, FIG. 2 shows an example of the external appearance of the robot 12b, and FIG. 3 shows an example of the electrical configuration of the robot 12b.

Referring to FIG. 2, the robot 12b includes a carriage 24, and wheels 26 that let the robot 12b move autonomously are provided on the underside of the carriage 24. The wheels 26 are driven by wheel motors (indicated by reference numeral "28" in FIG. 3) and can move the carriage 24, that is, the robot 12b, in any direction: forward, backward, left, or right. Although not shown in FIG. 2, a collision sensor (indicated by reference numeral "30" in FIG. 3) is attached to the front of the carriage 24 and detects contact between the carriage 24 and a person or other obstacle. If contact is detected while the robot 12b is moving, the driving of the wheels 26 can be stopped immediately.

A sensor mounting panel 32 in the form of a polygonal column is provided on the carriage 24, and an ultrasonic distance sensor 34 is attached to each face of the sensor mounting panel 32. The ultrasonic distance sensors 34 measure the distance between the mounting panel 32, that is, the robot 12b, and its surroundings, mainly people.

The body of the robot 12b is also mounted upright on the carriage 24, with its lower part surrounded by the mounting panel 32 described above. The body consists of a lower body 36 and an upper body 38, which are joined by a connecting part 40. Although not shown, the connecting part 40 contains a lifting mechanism, which can be used to change the height of the upper body 38, that is, the height of the robot 12b. The lifting mechanism is driven by a waist motor (indicated by reference numeral "42" in FIG. 3).

One omnidirectional camera 44 and one microphone 20 are provided at approximately the center of the upper body 38. The omnidirectional camera 44 captures the surroundings of the robot 12b and is distinct from the eye cameras 46 described later. As described above, the microphone 20 picks up ambient sound, in particular human voices.

Upper arms 50R and 50L are attached to the two shoulders of the upper body 38 by shoulder joints 48R and 48L, respectively. The shoulder joints 48R and 48L each have three degrees of freedom. The right shoulder joint 48R can control the angle of the upper arm 50R about each of the X, Y, and Z axes. The Y axis is parallel to the longitudinal direction (or axis) of the upper arm 50R, and the X and Z axes are orthogonal to the Y axis from mutually different directions. The left shoulder joint 48L can control the angle of the upper arm 50L about each of the A, B, and C axes. The B axis is parallel to the longitudinal direction (or axis) of the upper arm 50L, and the A and C axes are orthogonal to the B axis from mutually different directions.

Forearms 54R and 54L are attached to the ends of the upper arms 50R and 50L via elbow joints 52R and 52L, respectively. The elbow joints 52R and 52L can control the angles of the forearms 54R and 54L about the W axis and the D axis, respectively.

For the X, Y, Z, and W axes and the A, B, C, and D axes, which control the displacement of the upper arms 50R and 50L and the forearms 54R and 54L, "0 degrees" is the home position. In this home position, the upper arms 50R and 50L and the forearms 54R and 54L point downward.

Although not shown in FIG. 2, touch sensors (indicated collectively by reference numeral "56" in FIG. 3) are provided on the shoulder portions of the upper body 38, including the shoulder joints 48R and 48L, and on the arm portions, including the upper arms 50R and 50L and the forearms 54R and 54L. These touch sensors 56 detect whether a person has touched these parts of the robot 12b.

Spheres 58R and 58L, which correspond to hands, are fixedly attached to the ends of the forearms 54R and 54L, respectively. When finger functions are required, unlike in the robot 12b of this embodiment, "hands" shaped like human hands may be used in place of the spheres 58R and 58L.

A head 62 is attached above the center of the upper body 38 via a neck joint 60. The neck joint 60 has three degrees of freedom and its angle can be controlled about each of the S, T, and U axes. The S axis points straight up from the neck, and the T and U axes are orthogonal to the S axis in mutually different directions. The speaker 22 described above is provided on the head 62 at a position corresponding to a human mouth. The speaker 22 may also be used by the robot 12b to communicate by sound or voice with people around it. The speaker 22 may instead be provided on another part of the robot 12b, for example the body.

The head 62 is also provided with eyeball parts 64R and 64L at positions corresponding to the eyes. The eyeball parts 64R and 64L include eye cameras 46R and 46L, respectively. The left and right eyeball parts 64R and 64L may be referred to collectively by reference numeral "64", and the left and right eye cameras 46R and 46L by reference numeral "46". The eye cameras 46 capture the face or other parts of a person approaching the robot 12b, or other objects, and take in the resulting video signal.

Both the omnidirectional camera 44 and the eye cameras 46 described above may be cameras that use a solid-state image sensor such as a CCD or CMOS sensor.

For example, each eye camera 46 is fixed inside an eyeball part 64, and the eyeball part 64 is mounted at a predetermined position in the head 62 via an eyeball support (not shown). The eyeball support has two degrees of freedom and its angle can be controlled about each of the α and β axes. The α and β axes are defined relative to the head 62: the α axis points toward the top of the head 62, and the β axis is orthogonal both to the α axis and to the direction in which the front (face) of the head 62 points. In this embodiment, when the head 62 is in its home position, the α axis is set parallel to the S axis and the β axis parallel to the U axis. When the eyeball support in such a head 62 is rotated about the α and β axes, the front end of the eyeball part 64, and hence of the eye camera 46, is displaced, and the camera axis, that is, the gaze direction, moves.

For the α and β axes, which control the displacement of the eye cameras 46, "0 degrees" is the home position. In this home position, as shown in FIG. 2, the camera axis of each eye camera 46 points in the direction in which the front (face) of the head 62 faces, and the gaze is directed straight ahead.

Referring to FIG. 3, the robot 12b includes a microcomputer or CPU 66 for overall control. A memory 70, a motor control board 72, a sensor input/output board 74, and a voice input/output board 76 are connected to the CPU 66 through a bus 68.

Although not shown, the memory 70 includes a ROM, an HDD, a RAM, and the like. The ROM or HDD stores in advance the programs and data that make the robot 12b function as the conversation device 12 of the present invention. The CPU 66 executes processing in accordance with these programs. The RAM is used as buffer memory and working memory.

The motor control board 72 is composed of, for example, a DSP (Digital Signal Processor) and controls the motors that drive body parts such as the right arm, left arm, head, and eyes. Specifically, the motor control board 72 receives control data from the CPU 66 and adjusts the rotation angles of a total of four motors 78 (shown collectively as the "right arm motor" in FIG. 3): three motors that control the angles about the X, Y, and Z axes of the right shoulder joint 48R and one motor that controls the angle about the W axis of the right elbow joint 52R. The motor control board 72 likewise adjusts the rotation angles of a total of four motors 80 (shown collectively as the "left arm motor" in FIG. 3): three motors that control the angles about the A, B, and C axes of the left shoulder joint 48L and one motor that controls the angle about the D axis of the left elbow joint 52L. The motor control board 72 also adjusts the rotation angles of three motors 82 (shown collectively as the "head motor" in FIG. 3) that control the angles about the S, T, and U axes of the neck joint 60. The motor control board 72 further controls the waist motor 42 and the two motors 28 (shown collectively as the "wheel motor" in FIG. 3) that drive the wheels 26.

In addition, the motor control board 72 adjusts the rotation angles of two motors 84 (shown collectively as the "right eyeball motor" in FIG. 3) that control the angles about the α and β axes of the right eyeball part 64R, and of two motors 86 (shown collectively as the "left eyeball motor" in FIG. 3) that control the angles about the α and β axes of the left eyeball part 64L.
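The grouping of angle-controlled motors described above can be summarized as a small lookup table. The following Python sketch is only an illustration: the dictionary layout, key names, and the `home_pose` helper are assumptions, and treating 0 degrees as the home angle of the head axes is also an assumption (the text states it explicitly only for the arm and eye axes). The waist motor 42 and wheel motors 28 are left out because the text does not describe them as angle-controlled joints.

```python
from typing import Dict, Tuple

# Motor group -> (reference numeral in FIG. 3, joint axes driven by the group).
MOTOR_GROUPS: Dict[str, Tuple[int, Tuple[str, ...]]] = {
    "right_arm": (78, ("X", "Y", "Z", "W")),   # shoulder 48R (X, Y, Z) + elbow 52R (W)
    "left_arm": (80, ("A", "B", "C", "D")),    # shoulder 48L (A, B, C) + elbow 52L (D)
    "head": (82, ("S", "T", "U")),             # neck joint 60
    "right_eye": (84, ("alpha", "beta")),      # eyeball part 64R
    "left_eye": (86, ("alpha", "beta")),       # eyeball part 64L
}


def home_pose() -> Dict[str, Dict[str, float]]:
    """Return a pose with every listed axis at 0 degrees (assumed home angle)."""
    return {
        group: {axis: 0.0 for axis in axes}
        for group, (_ref, axes) in MOTOR_GROUPS.items()
    }


def count_angle_motors() -> int:
    """One motor per axis, so the total is the number of listed axes."""
    return sum(len(axes) for _ref, axes in MOTOR_GROUPS.values())
```

A gesture could then be represented as a partial pose, for example `{"head": {"T": 15.0}}`, layered over the home pose; that representation is likewise hypothetical.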

In this embodiment, the motors described above, except for the wheel motors 28, are stepping motors or pulse motors in order to simplify control, but they may be DC motors like the wheel motors 28. Also, this embodiment uses electrically powered motors as the actuators that drive body parts of the robot 12b such as the arms, head, and eyes. However, a robot whose body parts are driven by other actuators, for example pneumatic (or negative-pressure) actuators, hydraulic actuators, piezoelectric elements, or shape memory alloys, may also be used as the robot 12b.

The sensor input/output board 74 is likewise composed of a DSP; it takes in signals from the sensors and cameras and passes them to the CPU 66. Specifically, data on the reflection time from each of the ultrasonic distance sensors 34 is input to the CPU 66 through the sensor input/output board 74. The video signal from the omnidirectional camera 44 is input to the CPU 66 after undergoing predetermined processing in the sensor input/output board 74 as necessary. The video signal from the eye cameras 46 is supplied to the CPU 66 in the same way. Signals from the touch sensors 56 are also supplied to the CPU 66 via the sensor input/output board 74.
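The text does not say how reflection-time data is turned into a distance. A common approach, sketched below in Python, assumes the reflection time is the round-trip time of the ultrasonic pulse, so the one-way distance is half the time multiplied by the speed of sound. The function name and the constant of roughly 343 m/s (dry air at about 20 °C) are assumptions of this sketch.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # approximate value in dry air at about 20 degrees C


def distance_from_reflection_time(round_trip_s: float) -> float:
    """Convert an echo round-trip time in seconds to a one-way distance in meters."""
    if round_trip_s < 0:
        raise ValueError("reflection time must be non-negative")
    # The pulse travels to the object and back, hence the division by two.
    return SPEED_OF_SOUND_M_PER_S * round_trip_s / 2.0
```

Under these assumptions a 10 ms echo corresponds to an object a little over 1.7 m away.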

Voice data is supplied from the CPU 66 to the speaker 22 via the voice input/output board 76, and the speaker 22 outputs sound or voice according to that data. Voice input from the microphone 20 is taken into the CPU 66 as voice data via the voice input/output board 76.

Similarly, the communication LAN board 88 is constituted by a DSP; it passes transmission data supplied from the CPU 66 to the wireless communication device 90, which transmits the data. The communication LAN board 88 also receives data via the wireless communication device 90 and supplies the received data to the CPU 66.

Returning to FIG. 1, the server 14 is provided to control the timing of the utterances of both interlocutors. The server 14 includes a CPU, a main memory, a communication device, and the like. The main memory stores the program and data for controlling the server 14, and the CPU executes processing according to that program.

The server 14 also includes a voice analysis history database (DB) 92 and a pause pattern DB 94. The voice analysis history DB 92 stores a history of analysis data for the interlocutors' speech acquired by the dialogue devices 12.

As described later, the pause pattern DB 94 stores pause pattern data (see FIG. 4) for giving the dialogue appropriate pauses. The pause pattern data is extracted by pattern recognition from utterance data obtained by measuring utterances in advance. If the measurement is performed on the actual users, the individual characteristics of how they take pauses can be extracted. However, since standard or general ways of taking pauses are also thought to exist, the pause pattern data may be extracted by measuring the utterances of arbitrary subjects.

In this system 10, each dialogue device 12 detects voice with the microphone 16 or 20 at fixed intervals ΔT. ΔT may be, for example, one frame or a predetermined number of frames; one frame is, for example, 1/30 second. The dialogue device 12 performs processing according to the detection result, that is, the presence or absence of an utterance. The dialogue device 12 transmits to the server 14 information on the dialogue state at that device 12, such as the utterance state (voice acquisition state) at the detection time and the processing executed (voice output state). The server 14 receives this information and stores the state of the dialogue device 12 at each detection time. This history of dialogue states is stored in memory as an utterance flag table (see FIG. 5). As described later, the utterance flag table stores, for each detection time, a flag indicating the dialogue state of the dialogue device 12, including at least the voice acquisition state and the voice output state.
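The per-ΔT reporting described above can be pictured as a small keyed table. The following is only a sketch under stated assumptions: the class name `UtteranceFlagTable`, its methods, the integer tick index, and the rule that an unreported time reads as SILENT are all hypothetical, and the flag names are the ones the description uses for FIG. 5.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Tuple


class Flag(Enum):
    SILENT = "SILENT"
    SPEAKING = "SPEAKING"
    RECORDING = "RECORDING"


@dataclass
class UtteranceFlagTable:
    # Hypothetical layout: (tick, user) -> flag, where tick counts
    # detection times in units of ΔT (e.g. 1/30 s per tick).
    rows: Dict[Tuple[int, str], Flag] = field(default_factory=dict)

    def record(self, tick: int, user: str, flag: Flag) -> None:
        """Store the state a dialogue device reported for one detection time."""
        self.rows[(tick, user)] = flag

    def flag_at(self, tick: int, user: str) -> Flag:
        """Look up a stored state; assumption: no report reads as SILENT."""
        return self.rows.get((tick, user), Flag.SILENT)


table = UtteranceFlagTable()
table.record(0, "A", Flag.SPEAKING)
table.record(0, "B", Flag.SILENT)
```

Keying on the pair (detection time, user) makes the lookup of the partner's state at the previous detection time a single dictionary access.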

For convenience, the operation is described here from the viewpoint of the dialogue device 12a. However, since the dialogue devices 12a and 12b differ mainly only in the functions for presenting body movements, the dialogue device 12b operates in the same way as the dialogue device 12a.

When the dialogue device 12a detects voice from the microphone 16, it refers to the utterance flag table of the server 14 and executes processing according to the partner's state at the previous detection time (that is, ΔT earlier). Specifically, if the partner's state flag at the previous detection time is not the SPEAKING flag, that is, if the partner was not transmitting voice to this side at the previous detection time, the dialogue device 12a transmits the voice data detected by the microphone 16 to the partner's dialogue device 12b while recording it in memory as a local file. In response, the dialogue device 12b receives the voice data and outputs the voice from the speaker 22. Thus, when one interlocutor A speaks and the voice of the other interlocutor B was not being transmitted at the previous detection time, A's voice data is transmitted immediately, output by the partner's dialogue device 12b, and heard by B.

In this case, the dialogue device 12a also transmits the SPEAKING flag to the server 14, and the server 14 stores the SPEAKING flag in the utterance flag table as the state of the dialogue device 12a at the detection time t. The SPEAKING flag means that one interlocutor is speaking, that is, that the spoken voice is being transmitted directly to the partner and reproduced there. Furthermore, the dialogue device 12a turns on the SEND flag in its memory to record, as its own processing state, that it is transmitting voice to the partner.

The local file in which the voice was recorded is subjected to voice analysis when the utterance ends, and the features or state of the uttered voice are stored in the voice analysis history DB 92 of the server 14.

On the other hand, when the dialogue device 12a does not detect voice from the microphone 16, it transmits the SILENT flag to the server 14, and the server 14 stores the SILENT flag in the utterance flag table. The SILENT flag means that the interlocutor is not speaking at the detection time t. Thus, whenever no voice is detected, the SILENT flag is recorded in the utterance flag table.

When the server 14 detects a state in which neither interlocutor is speaking (SILENT flags), it collates the pause patterns with the utterance situation in the dialogue up to the present (the pause situation). The pause situation includes at least the blank time (silent time) in the dialogue and the features (voice analysis results) of the speech in the dialogue before that blank.

FIG. 4 shows an example of the pause pattern data stored in the pause pattern DB 94. The pause pattern DB 94 stores a plurality of pause pattern data entries that can give a conversation appropriate pauses. A pause pattern includes at least information on a blank time and on the features of the speech preceding it. In this embodiment, the pause pattern data includes information such as the blank time in the conversation, the last speaker, a condition (I), a pause-function word, a speaker, and an action command. The blank time (t) is a condition on how long both sides have remained silent. The last speaker is a condition on who, A or B, spoke last in the dialogue before the blank. The condition (I) is a condition on the analysis results of the speech in the dialogue before the blank, and includes elements such as the fundamental frequency (pitch), the amplitude, and the average syllable duration.

When a pause situation that fits a pause pattern defined by such a blank time and the features of the preceding speech is detected, words and gestures based on that pause pattern are inserted into the dialogue. Specifically, in this embodiment, when a blank time arises after dialogue that satisfies the last speaker, the condition (I), and the blank time condition (t), the pause-function word is uttered from the side of the speaker specified by that pattern. If the dialogue device 12 that utters the pause-function word on the speaker's behalf is the robot 12b, the gesture corresponding to the output word is also reproduced according to the action command.
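The collation of the current pause situation against stored pause patterns can be sketched as a linear search. This is a minimal illustration, not the patent's implementation: the names `SpeechFeatures`, `PausePattern`, and `find_pattern`, the reading of the blank time condition (t) as a minimum duration, and every threshold value are assumptions, since the text does not give the contents of FIG. 4 beyond the words and gestures.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class SpeechFeatures:
    pitch_hz: float       # fundamental frequency
    amplitude: float
    syllable_sec: float   # average syllable duration


@dataclass(frozen=True)
class PausePattern:
    min_blank_sec: float                         # blank time condition (t), assumed to be a minimum
    last_speaker: str                            # "A" or "B"
    condition: Callable[[SpeechFeatures], bool]  # condition (I)
    word: str                                    # pause-function word
    speaker: str                                 # speaker on whose behalf the word is uttered
    command: str                                 # action command for a robot


def find_pattern(patterns: Sequence[PausePattern], blank_sec: float,
                 last_speaker: str, features: SpeechFeatures) -> Optional[PausePattern]:
    """Return the first stored pattern that the current pause situation satisfies."""
    for p in patterns:
        if (p.last_speaker == last_speaker
                and blank_sec >= p.min_blank_sec
                and p.condition(features)):
            return p
    return None


# Illustrative entries only; all thresholds are made up.
patterns = [
    PausePattern(0.5, "A", lambda f: f.pitch_hz < 150.0, "うんうん", "B", "nod"),
    PausePattern(1.0, "A", lambda f: f.syllable_sec > 0.3, "うーん", "A", "tilt head"),
]
hit = find_pattern(patterns, 0.6, "A", SpeechFeatures(120.0, 0.4, 0.2))
miss = find_pattern(patterns, 0.2, "A", SpeechFeatures(120.0, 0.4, 0.2))
```

A match yields both the word to output and the action command, so a device without a movable body can simply ignore the command.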

A pause-function word is a word that, inserted into a silent period, makes the silence function as a pause and gives the conversation an appropriate pause. It may be, for example, a response, a backchannel, or an interjection. FIG. 4 shows “うんうん” (“uh-huh”), “うーん” (“hmm”), “はいはい” (“yes, yes”), and “えーと” (“um”) as pause-function words. Each pause-function word is associated with an action command for executing a body movement presented together with the word. In FIG. 4, “uh-huh” is associated with a “nod” command, “hmm” with a “tilt head” command, “yes, yes” with a “nod” command, and “um” with a “look upward” command.

In FIG. 4, the speaker means a human interlocutor, and the dialogue device 12 that outputs the pause-function word is the dialogue device 12 located at the site of that speaker's partner. For example, in the topmost pattern of FIG. 4, the last speaker is interlocutor A, and the dialogue device that outputs the pause-function word is the dialogue device 12a on the side of interlocutor A, the partner of speaker B. The topmost pattern is associated with the word “うんうん” (“uh-huh”) and the “nod” action; a pause is provided by having the robot 12b located on the last speaker's side express a response by the speaker, who is the last speaker's partner. The second pattern from the top, on the other hand, is associated with the word “うーん” (“hmm”) and the “tilt head” action; a pause is provided by having the robot 12b located on the side of the last speaker's partner continue the utterance and gesture further.

If, as a result of collation with the pause patterns, pause pattern data matching the current pause situation exists in the pause pattern DB 94, that is, if it is judged that an appropriate pause needs to be taken according to the pause pattern, the server 14 instructs the relevant dialogue device 12 to reproduce the word for taking the pause. In this embodiment, the voice data of the word and a reproduction instruction are transmitted, and the dialogue device 12 outputs the voice in response. Furthermore, if the dialogue device 12 is the dialogue device 12b capable of bodily expression, the server 14 instructs it to execute the gesture corresponding to the word. In this embodiment, the action command for the gesture and a reproduction instruction are transmitted. In response, the dialogue device 12b moves the corresponding body parts and executes the gesture.

Thus, when a state with no utterance (silent state) is detected in the dialogue, the current pause situation is collated with the pause patterns. If necessary, words and gestures are inserted into the blank time so that an appropriate pause is taken, which keeps the blank time in the dialogue at an appropriate length. This avoids the situation in which delay lengthens the blank time in the dialogue and makes the interlocutors feel uncomfortable. The interlocutors can therefore time their utterances more easily, and overlapping utterances can be prevented. Because inserting words and gestures in this way controls the interlocutors' utterance timing, the dialogue can be kept going, utterances can be encouraged, and a natural conversational flow can be created. A smooth dialogue can therefore be established.

In addition, the system 10 has a function that, should the utterances of both interlocutors overlap, delays the output of one utterance so that the two utterances do not completely overlap.

Specifically, when voice is detected at the dialogue device 12a and the partner's state flag at the previous detection time is the SPEAKING flag, that is, when the two utterances overlap, the dialogue device 12a starts recording the voice detected by the microphone 16 and stores the voice data in memory as a voice file. The dialogue device 12a also transmits the RECORDING flag to the server 14, and the server 14 stores the RECORDING flag in the utterance flag table as the state of the dialogue device 12a at the detection time t. The RECORDING flag means that voice data is being recorded and that the voice is not being transmitted to the partner.

The dialogue device 12a also turns on the RECORD flag in its memory to record, as its own processing state, that it is recording voice.

The utterance flag table of the server 14 also stores a PLAY flag as information for controlling reproduction of recorded voice files. In this embodiment, while recording, the dialogue device 12a instructs the server 14 to add 1 to the value of the PLAY flag. The initial value of the PLAY flag is 0; while recording is in progress, 1 is added to the previous detection time's value at every detection time, and while recording is not in progress, the previous detection time's value is maintained.
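The PLAY flag rule above amounts to a running counter. The sketch below only restates that rule; the function names `next_play_value` and `play_history` are hypothetical, and what happens to the value after a recording has been reproduced is not modelled here.

```python
from typing import Iterable, List


def next_play_value(previous: int, recording: bool) -> int:
    """Add 1 while recording is in progress; otherwise keep the previous value."""
    return previous + 1 if recording else previous


def play_history(recording_per_tick: Iterable[bool], initial: int = 0) -> List[int]:
    """PLAY flag value at each detection time, starting from the initial value 0."""
    values = []
    current = initial
    for recording in recording_per_tick:
        current = next_play_value(current, recording)
        values.append(current)
    return values


# No recording, then two detection times of recording, then recording stops.
print(play_history([False, True, True, False]))  # [0, 1, 2, 2]
```

A value of 1 or more therefore signals that some recorded speech exists.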

Thereafter, when voice is no longer detected at the dialogue device 12a, the recorded voice file is transmitted to the server 14. The server 14 stores the received voice file in memory in association with the detection time t at which the recording was made. The utterance flag table stores the location where the voice file is stored. Before being transmitted to the server 14, the voice file is subjected to voice analysis, and the analysis data is transmitted to the server 14 and stored in the voice analysis history DB 92.

When the server 14 detects that neither dialogue device 12 is outputting voice, that is, that the SPEAKING flag is not stored as the state of either interlocutor, it determines whether a voice file of either interlocutor is stored without having been reproduced. If an unreproduced recording remains, that is, if the value of the PLAY flag is 1 or more, the voice file whose recording started earlier is reproduced. Specifically, the server 14 transmits the voice data and a reproduction instruction to the partner's dialogue device 12 until reproduction of the file is finished. In response, the dialogue device 12 outputs the voice based on the received voice data.
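The server-side choice of which deferred recording to reproduce can be sketched as follows. This is an illustration under assumptions: the `Recording` type, the function name `recording_to_play`, the string flag values, and the file paths are hypothetical, and ties in start time simply keep the order of the `pending` list.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Recording:
    user: str         # whose speech was recorded
    start_tick: int   # detection time at which recording started
    path: str         # storage location of the voice file (hypothetical)


def recording_to_play(flag_a: str, flag_b: str, play_value: int,
                      pending: Sequence[Recording]) -> Optional[Recording]:
    """Pick the deferred recording to reproduce now, or None."""
    if "SPEAKING" in (flag_a, flag_b):
        return None   # someone's voice is still being output
    if play_value < 1 or not pending:
        return None   # nothing unreproduced remains
    # Earliest recording start wins; min() keeps the first item on ties.
    return min(pending, key=lambda r: r.start_tick)


pending = [Recording("A", 7, "rec/a_0007.wav"), Recording("B", 4, "rec/b_0004.wav")]
chosen = recording_to_play("SILENT", "SILENT", 2, pending)
```

Here `chosen` is B's recording, because its recording started at the earlier detection time.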

In this way, when the utterances of both interlocutors overlap, the voice of the side that started speaking later is recorded, and once both utterances have ended, the recorded voice can be output on the partner's side. If both utterances start at the same time, the voice can be reproduced with a delay according to a priority order. Since the output of an overlapping utterance can thus be delayed, a smooth dialogue between remote places can be established.

FIG. 5 shows an example of the utterance flag table stored in the server 14. For each detection time t, the utterance flag table stores information such as the user, the state flag, the target, the storage location of the saved voice file, the storage location of the saved command file, and the PLAY flag. The user information identifies the subject of the data; for example, A means that the data is the state of the dialogue device 12a, and B means that the data is the state of the dialogue device 12b. The target indicates whom the user's utterance is addressed to.

The state flag indicates the voice acquisition state and voice output state at the dialogue device 12; as described above, the SPEAKING, SILENT, and RECORDING flags are stored. At time t = T + 2ΔT in FIG. 5, however, the state flag is the INTERPOLATING flag. As described above, when the state flags of both interlocutors are SILENT and a pause-function word is inserted according to a pause pattern, the state flag for that time is overwritten with the INTERPOLATING flag. As a result, that detection time is no longer counted as blank time in the dialogue.
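The effect of the INTERPOLATING overwrite on the measured blank time can be illustrated with a short counter. This is a sketch under assumptions: the function name `blank_ticks`, the representation of the history as pairs of flag strings, and the choice of which side's flag is overwritten in the example are all hypothetical.

```python
from typing import List, Tuple


def blank_ticks(history: List[Tuple[str, str]]) -> int:
    """Count the most recent consecutive detection times at which both sides are SILENT.

    history holds (flag of A, flag of B) per detection time, oldest first.
    """
    count = 0
    for flag_a, flag_b in reversed(history):
        if flag_a == "SILENT" and flag_b == "SILENT":
            count += 1
        else:
            break   # SPEAKING, RECORDING or INTERPOLATING ends the blank
    return count


history = [
    ("SPEAKING", "SILENT"),
    ("SILENT", "SILENT"),
    ("SILENT", "INTERPOLATING"),   # a pause-function word was inserted here
    ("SILENT", "SILENT"),
    ("SILENT", "SILENT"),
]
ticks = blank_ticks(history)       # 2 detection times
blank_seconds = ticks * (1 / 30)   # with ΔT = one frame of 1/30 s
```

Without the overwrite, the same history would count four silent detection times, so the inserted word effectively restarts the blank time.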

The saved voice file column indicates the storage location of the recorded voice file. For example, in FIG. 5, the RECORDING flag is stored as user A's state flag at times T+4ΔT and T+5ΔT, together with the storage location of the voice file corresponding to the recording at those times. The PLAY flag at time T+4ΔT is 1, meaning that recording has started; the PLAY flag at the next time T+5ΔT is 2, meaning that recording is continuing; and the PLAY flag at the following time T+6ΔT remains 2, meaning that recording has ended.

The saved command file column indicates the storage location of a file recording the action commands that the user entered at the dialogue device 12a while recording was in progress. This command file is transmitted to the partner's dialogue device 12b together with the voice file, so the dialogue device 12b executes the entered gestures along with the recorded voice.

FIGS. 6 to 9 show an example of the operation of the CPU of the dialogue device 12 in input processing. When input processing starts, initialization is performed in the first step S1 of FIG. 6. For example, the SEND flag and the RECORD flag are turned off, and the current time T (or an initial value T) is assigned to the time (or frame number) t. The processing from the following step S3 to step S69 in FIG. 9 is repeated at fixed intervals ΔT, for example every frame.

In step S3, the input of the microphone 16 or 20 is checked, and in step S5 it is determined from the input data whether there is voice input. If “YES” in step S5, that is, if the interlocutor is speaking, the utterance flag table of the server 14 is referred to in step S7. For example, the dialogue device 12 transmits a request for the utterance flags to the server 14; the server 14 responds by transmitting the utterance flag table data to the dialogue device 12; and the dialogue device 12 receives the data and stores it in memory. Since there is no voice input for a while after the start, the utterance flag table holds SILENT flags as the states of both interlocutors before the first utterance.

Subsequently, in step S9, pause measurement processing is executed. An example of this processing is shown in detail in FIG. 10. When pause measurement processing starts, the first step S81 of FIG. 10 determines, based on the utterance flag table, whether the device's own state flag at the time ΔT before the current time t is the SILENT flag. Step S81 thus determines whether there is voice input at the current detection time and there was none at the previous detection time, that is, whether this is the moment at which the user on this dialogue device 12's side has started speaking.

If “YES” in step S81, two kinds of pause preceding the current start of speech are measured based on the utterance flag table. Specifically, step S83 measures the blank time from when the user last finished speaking until the user started speaking this time. Step S85 measures the blank time from when the partner finished speaking until the user started speaking. Then, in step S87, the measurement data is transmitted to the server 14, and the server 14 stores the pause measurement data. After step S87, or if “NO” in step S81, the processing returns to step S11 in FIG. 6.
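Steps S81 to S85 can be sketched as two backward scans over the flag histories. This is a rough illustration under assumptions: the function names, the representation of each side's history as a list of flag strings up to the previous detection time, and the treatment of both SPEAKING and RECORDING as "voice present" are hypothetical choices, not details given in the text.

```python
from typing import List, Optional, Tuple


def ticks_since_last_voice(flags: List[str]) -> Optional[int]:
    """Number of detection times since this side last produced voice, or None if it never did."""
    gap = 0
    for flag in reversed(flags):
        if flag in ("SPEAKING", "RECORDING"):
            return gap
        gap += 1
    return None


def measure_pauses(own: List[str], partner: List[str]) -> Optional[Tuple[Optional[int], Optional[int]]]:
    """Called when voice is detected at the current time; histories end at the previous time.

    Returns (own gap, partner gap) in detection times, or None when this is
    not the start of an utterance (step S81 "NO").
    """
    if not own or own[-1] != "SILENT":   # S81: did this side just start speaking?
        return None
    return ticks_since_last_voice(own), ticks_since_last_voice(partner)   # S83, S85


own = ["SPEAKING", "SILENT", "SILENT", "SILENT"]
partner = ["SILENT", "SPEAKING", "SPEAKING", "SILENT"]
gaps = measure_pauses(own, partner)   # (3, 1)
```

Multiplying each gap by ΔT gives the blank times that would be sent to the server in step S87.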

By storing the history of pause measurement data in the server 14 in this way, the server 14 can record what kinds of pauses the interlocutors take as they converse. Pause patterns can be extracted from this pause history data and the voice analysis history data.

Subsequently, step S11 in FIG. 6 determines, based on the utterance flag table, whether the dialogue partner's flag at the time ΔT earlier is the SPEAKING flag. This step determines whether the interlocutor on this dialogue device 12's side is speaking even though the partner is speaking. If “YES” in step S11, that is, if the utterances of both interlocutors overlap, step S13 determines whether the RECORD flag is on.

If “NO” in step S13, that is, if the overlap has only just begun, voice recording is started in step S15, and the acquired voice data is converted into a voice file and stored in memory. For example, the voice data may be PCM data and the voice file may be in WAVE format. The voice data may be compressed by an appropriate method before transmission and decoded before reproduction. In step S17, the RECORD flag in memory is turned on to record that recording is in progress.

When the partner has started speaking first, the dialogue device 12 is receiving the partner's voice and outputting it from the speaker, so the interlocutor on this dialogue device 12's side will normally stop speaking and listen to the partner rather than force the utterance to continue. The recorded voice is therefore expected to be very short, and in this embodiment the voice is transmitted to the server 14 in one batch after recording is complete. In other embodiments, however, the voice may be transmitted to the server 14 as it is captured.

On the other hand, if “YES” in step S13, that is, if recording has already started, the recording started in step S15 is continued in step S19, and the acquired voice data is stored in the voice file.

After step S17 or S19, the RECORDING flag is recorded in the utterance flag table of the server 14 in step S21. Specifically, the dialogue device 12 transmits to the server 14 information indicating that recording is in progress (the RECORDING flag), together with information such as the time t, the speaker (identification information of this dialogue device 12), and the target (identification information of the partner's dialogue device 12). Based on the received information, the server 14 stores the time, speaker, target, and RECORDING flag in the utterance flag table.

When the system 10 includes three or more dialogue devices 12, the target of an utterance (identification information of the partner's dialogue device 12) may be made selectable, for example by operating an input device.

Further, in step S23, the value obtained by adding 1 to the value at ΔT earlier is recorded in the PLAY flag of the utterance flag table of the server 14. Specifically, the dialogue device 12 transmits an instruction to increment the PLAY flag to the server 14, together with information such as the time t, the speaker, and the target. In response, the server 14 reads the PLAY flag value immediately preceding time t from the utterance flag table, adds 1 to it, and stores the result as the PLAY flag value at time t. When recording starts with no unreproduced voice file remaining, 1 is stored in the PLAY flag, and as long as recording continues the value of the PLAY flag increases by 1 with each advance of the time t. After step S23, the processing proceeds to step S59 in FIG. 9.

On the other hand, if “NO” in step S11, the processing proceeds to step S25 in FIG. 7. That is, when the interlocutor on this dialogue device 12's side is speaking and, at the previous time, the partner was either not speaking or recording, the voice of the interlocutor on this side is played to the partner.

また、ステップＳ５で“ＮＯ”である場合には、つまり、この対話装置１２側で対話者が発話していない場合には、処理は図８のステップＳ３５に進む。 If “NO” in the step S5, that is, if the dialog person does not speak on the dialog device 12, the process proceeds to a step S35 in FIG.

図７のステップＳ２５では、サーバ１４の発話フラグテーブルにＳＰＥＡＫＩＮＧフラグを記録する。具体的には、対話装置１２は、時刻ｔ、発話者、対象等の情報とともに、発話中であることを示す情報（ＳＰＥＡＫＩＮＧフラグ）をサーバ１４に送信する。これに応じて、サーバ１４は、受信した情報に基づいて、発話フラグテーブルに、時刻、発話者、対象およびＳＰＥＡＫＩＮＧフラグを記憶する。 In step S25 of FIG. 7, the SPEAKING flag is recorded in the utterance flag table of the server 14. Specifically, the dialogue apparatus 12 transmits information indicating that the user is speaking (SPEAKING flag) to the server 14 together with information such as the time t, the speaker, and the subject. In response to this, the server 14 stores the time, speaker, target, and SPEAKING flag in the utterance flag table based on the received information.

続くステップＳ２７で、取得した音声データを音声ファイル化して、メモリにローカルファイルとして記憶する。また、ステップＳ２７で、取得した音声データとその再生指示を相手側の対話装置１２に直接（すなわちサーバ１４を介さずに）送信する。相手側の対話装置１２は、音声データと再生指示を受信すると、当該音声データの再生処理を実行して、当該音声をスピーカ１８または２２から出力する。このようにして、この対話装置１２側のみで発話が行われている場合、あるいは相手側が録音中である場合には、この対話装置１２側の音声がローカルファイルに記録されつつ相手側に直接送信され、相手側の対話装置１２で直ちに当該音声が再生されて出力される。 In the subsequent step S27, the acquired audio data is converted into an audio file and stored as a local file in the memory. In step S27, the acquired voice data and the reproduction instruction thereof are transmitted directly (that is, not via the server 14) to the interactive apparatus 12 on the other side. Upon receiving the voice data and the playback instruction, the other party's dialogue apparatus 12 executes a playback process of the voice data and outputs the voice from the speaker 18 or 22. In this way, when the utterance is being performed only on the dialog device 12 side, or when the other party is recording, the voice on the dialog device 12 side is recorded in the local file and directly transmitted to the other side. Then, the voice is immediately reproduced and output by the other-side dialog device 12.

また、ステップＳ３１では、メモリのＳＥＮＤフラグをオンにして、送信中であることを記憶する。さらに、ステップＳ３３では、サーバ１４の発話フラグテーブルのＰＬＡＹフラグに、ΔＴ前の値をそのまま記録する。具体的には、対話装置１２は、時刻ｔ、発話者および対象等の情報とともに、ＰＬＡＹフラグの維持指示をサーバ１４に送信する。これに応じて、サーバ１４は、発話フラグテーブルのＰＬＡＹフラグの時刻ｔの１つ前の値を読み出して、この値を時刻ｔのＰＬＡＹフラグの値として記憶する。このように、録音中でない場合には、ＰＬＡＹフラグの値として前回の値が維持される。ステップＳ３３を終了すると、処理は図９のステップＳ５９に進む。 In step S31, the SEND flag of the memory is turned on to store that transmission is in progress. In step S33, the value before ΔT is recorded as it is in the PLAY flag of the utterance flag table of the server 14. Specifically, the dialogue apparatus 12 transmits a PLAY flag maintenance instruction to the server 14 together with information such as the time t, the speaker, and the subject. In response to this, the server 14 reads the value immediately before the time t of the PLAY flag in the utterance flag table, and stores this value as the value of the PLAY flag at the time t. Thus, when recording is not in progress, the previous value is maintained as the value of the PLAY flag. When step S33 ends, the process proceeds to step S59 in FIG.
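The PLAY flag bookkeeping of steps S23 and S33 can be sketched as follows. This is only an illustrative sketch, not the implementation described in the specification: it assumes integer frame indices (ΔT equal to one frame), a per-talker dictionary `play` that maps a frame to its PLAY value, and that a missing previous value counts as 0. The name `update_play_flag` is hypothetical.

```python
from typing import Dict, List


def update_play_flag(play: Dict[int, int], t: int, recording: bool) -> int:
    """Store the PLAY value for frame t, derived from the value one frame earlier."""
    previous = play.get(t - 1, 0)  # assumption: no history yet counts as 0
    # S23: increment while recording; S33/S57: otherwise carry the value over.
    play[t] = previous + 1 if recording else previous
    return play[t]


# Hypothetical timeline: three recorded frames, then no recording.
recording_states: List[bool] = [False, True, True, True, False, False]
play_a: Dict[int, int] = {}
for frame, is_recording in enumerate(recording_states):
    update_play_flag(play_a, frame, is_recording)
print(play_a)
```

Under these assumptions the counter rises by one per recorded frame and then stays at its last value, which is what lets the server later tell how many frames of a recording remain to be played.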

この対話装置１２で音声入力が行われていない場合には、図８のステップＳ３５で、ＲＥＣＯＲＤフラグがオンであるか否かを判断する。ステップＳ３５で“ＹＥＳ”であれば、つまり、１つ前の時刻まで録音が行われていた場合には、ステップＳ３７で、音声ファイルへの音声の録音を終了する。また、ステップＳ３９で、音声の録音中に入力装置を用いて入力された動作コマンドのコマンドファイルへの記録を終了する。さらに、ステップＳ４１で、メモリのＲＥＣＯＲＤフラグをオフにする。 If no voice input is performed on the interactive apparatus 12, it is determined in step S35 in FIG. 8 whether or not the RECORD flag is on. If “YES” in the step S35, that is, if the recording has been performed up to the previous time, the recording of the sound to the sound file is ended in a step S37. In step S39, the recording of the operation command input using the input device during the recording of the voice to the command file is ended. In step S41, the RECORD flag of the memory is turned off.

そして、ステップＳ４３で、録音した音声ファイルに対する音声解析処理を実行する。この音声解析処理の動作の一例が図１１に詳細に示される。なお、図８のステップＳ５３で実行される音声解析処理の動作も同じである。 In step S43, voice analysis processing is performed on the recorded voice file. An example of this voice analysis processing operation is shown in detail in FIG. The operation of the voice analysis process executed in step S53 in FIG. 8 is the same.

音声解析処理を開始すると、図１１のステップＳ９１で、メモリに録音された音声ファイルを読み込む。なお、図８のステップＳ５３で実行される場合には、このステップＳ９１では、ローカルファイルの音声データを読み込む。 When the voice analysis process is started, the voice file recorded in the memory is read in step S91 in FIG. Note that when executed in step S53 of FIG. 8, the audio data of the local file is read in step S91.

次に、ステップＳ９３で、読み込んだ音源の基本周波数（ピッチ）および振幅を算出する。また、ステップＳ９５では、音声データを音節に分割する処理を試みる。そして、ステップＳ９７で、分割した音節が存在するか否かを判断する。ステップＳ９７で“ＹＥＳ”であれば、続くステップＳ９９で、当該音節の持続時間を算出する。さらに、当該音節の持続時間の平均を算出する。ステップＳ９９を終了すると、ステップＳ９７に戻って、分割した音節が残っている場合には、当該音節についてステップＳ９９の処理を繰返す。ステップＳ９７で“ＮＯ”であれば、ステップＳ１０１で、音声解析データをサーバ１４に送信する。したがって、音声解析データは、基本周波数、振幅、および音節の平均持続時間等の情報を含む。この音声解析データは、たとえば、時刻、発話者、対象等の情報に対応付けられてサーバ１４に送信される。これに応じて、サーバ１４は、受信した音声解析データを音声解析履歴ＤＢ９２に記憶する。このようにして、発話音声の特徴が抽出されて、その履歴が記録される。ステップＳ１０１を終了すると、この音声解析処理を終了して、図８のステップＳ４５（ステップＳ４３の場合）、またはステップＳ５５（ステップＳ５３の場合）へ戻る。 Next, in step S93, the fundamental frequency (pitch) and amplitude of the read sound source are calculated. In step S95, an attempt is made to divide the audio data into syllables. In step S97, it is determined whether there is a divided syllable. If “YES” in the step S97, the duration of the syllable is calculated in a succeeding step S99. Further, the average duration of the syllable is calculated. When step S99 ends, the process returns to step S97, and if the divided syllables remain, the process of step S99 is repeated for the syllable. If “NO” in the step S97, the voice analysis data is transmitted to the server 14 in a step S101. Therefore, the voice analysis data includes information such as fundamental frequency, amplitude, and average duration of syllables. This voice analysis data is transmitted to the server 14 in association with information such as the time, the speaker, and the object, for example. In response to this, the server 14 stores the received voice analysis data in the voice analysis history DB 92. In this way, the features of the uttered voice are extracted and the history is recorded. When step S101 ends, the voice analysis process ends, and the process returns to step S45 (in the case of step S43) or step S55 (in the case of step S53) in FIG.
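The loop of steps S97 to S101 can be illustrated with a minimal sketch. It assumes that pitch and amplitude have already been estimated (step S93) and that syllable segmentation (step S95) yields a list of `(start, end)` times in seconds; the specification does not say how either is computed. The names `VoiceAnalysis` and `analyze` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class VoiceAnalysis:
    pitch_hz: float
    amplitude: float
    mean_syllable_sec: Optional[float]  # None when no syllable was found


def analyze(pitch_hz: float, amplitude: float,
            syllables: List[Tuple[float, float]]) -> VoiceAnalysis:
    """Build the analysis record that would be sent to the server in S101."""
    durations: List[float] = []
    remaining = list(syllables)
    while remaining:                      # S97: is there a syllable left?
        start, end = remaining.pop(0)
        durations.append(end - start)     # S99: duration of this syllable
    mean = sum(durations) / len(durations) if durations else None
    return VoiceAnalysis(pitch_hz, amplitude, mean)


# Hypothetical input values, for illustration only.
result = analyze(180.0, 0.4, [(0.0, 0.2), (0.25, 0.55), (0.6, 0.7)])
empty = analyze(180.0, 0.4, [])
print(result, empty)
```

Returning `None` for an utterance with no detectable syllables is a design choice of this sketch; it avoids a division by zero while keeping the record shape unchanged.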

When step S43 ends, the recorded audio file and command file are transmitted to the server 14 in step S45. The audio file and the command file are transmitted to the server 14 in association with information such as the time, the speaker, and the target. In response, the server 14 saves the received audio file and command file in a predetermined area of its memory. In the utterance flag table, the storage location of the audio file is registered as the saved audio file information for the time of recording, and the storage location of the command file is registered as the saved command file information for the same time (see FIG. 5). When step S45 ends, the process proceeds to step S55.

一方、ステップＳ３５で“ＮＯ”であれば、ステップＳ４７でＳＥＮＤフラグがオンであるか否かを判断する。ステップＳ４７で“ＹＥＳ”であれば、つまり、１つ前の時刻まで音声を相手側の対話装置１２に送信していた場合には、ステップＳ４９で、音声のローカルファイルへの記録を終了し、ステップＳ５１で、メモリのＳＥＮＤフラグをオフにする。そして、ステップＳ５３で、ローカルファイルの音声データに対して、上述のような図１１の音声解析処理を実行する。ステップＳ５３を終了すると、または、ステップＳ４７で“ＮＯ”である場合には、処理はステップＳ５５へ進む。 On the other hand, if “NO” in the step S35, it is determined whether or not the SEND flag is turned on in a step S47. If “YES” in the step S47, that is, if the voice has been transmitted to the interactive apparatus 12 on the other side until the previous time, the recording of the voice in the local file is ended in a step S49. In step S51, the SEND flag of the memory is turned off. In step S53, the above-described voice analysis processing of FIG. 11 is executed on the voice data of the local file. When step S53 ends or if “NO” in the step S47, the process proceeds to a step S55.

ステップＳ５５では、サーバ１４の発話フラグテーブルにＳＩＬＥＮＴフラグを記録する。具体的には、対話装置１２は、時刻ｔ、発話者、対象等の情報とともに、音声入力が無いことを示す情報（ＳＩＬＥＮＴフラグ）をサーバ１４に送信する。これに応じて、サーバ１４は、受信した情報に基づいて、発話フラグテーブルに、時刻、発話者、対象およびＳＩＬＥＮＴフラグを記憶する。 In step S55, the SILENT flag is recorded in the utterance flag table of the server 14. Specifically, the dialogue apparatus 12 transmits information indicating that there is no voice input (SILENT flag) to the server 14 together with information such as the time t, the speaker, and the target. In response to this, the server 14 stores the time, speaker, object, and SILENT flag in the utterance flag table based on the received information.

また、ステップＳ５７で、図７のステップＳ３３と同様にして、サーバの発話フラグテーブルのＰＬＡＹフラグに、ΔＴ前の値をそのまま記録する。ステップＳ５７を終了すると、処理は図９のステップＳ５９へ進む。 In step S57, the value before ΔT is recorded as it is in the PLAY flag of the utterance flag table of the server in the same manner as in step S33 of FIG. When step S57 ends, the process proceeds to step S59 in FIG.

図９のステップＳ５９からＳ６７では、相手側の対話装置１２がロボット１２ｂである場合の処理である。したがって、相手側がロボット１２ｂでない場合には、これらの処理は行われなくてよい。 Steps S59 to S67 in FIG. 9 are processing when the partner interactive device 12 is the robot 12b. Therefore, when the other party is not the robot 12b, these processes need not be performed.

図９のステップＳ５９では、動作コマンドの入力をチェックする。具体的には、入力装置からの入力データを取得して、ロボット１２ｂの身振りのための動作コマンドが選択されたか否かを判定する。たとえば、動作コマンドはディスプレイに選択可能なリストとして表示されてよい。なお、この対話装置１２がロボット１２ｂである場合には、入力装置とディスプレイを設ける必要がある。 In step S59 of FIG. 9, input of an operation command is checked. Specifically, input data from the input device is acquired, and it is determined whether or not an operation command for gesture of the robot 12b has been selected. For example, the operation commands may be displayed as a selectable list on the display. If the interactive device 12 is a robot 12b, it is necessary to provide an input device and a display.

そして、ステップＳ６１で、選択された動作コマンドがあるかどうかを判断し、“ＹＥＳ”であれば、ステップＳ６３で、メモリのＲＥＣＯＲＤフラグがオンであるか否かを判断する。ステップＳ６３で“ＹＥＳ”であれば、つまり、音声録音中の場合には、ステップＳ６５で、動作コマンドをメモリのコマンドファイルに記憶する。このように、音声を録音している場合には、動作コマンドの入力も同時に記録して、録音終了後に上述のステップＳ４５でサーバ１４に送信するようにしているので、両対話者の発話が重複した場合には、発話と身振りに対して同時に遅延を与えてから相手側で再生することができる。 In step S61, it is determined whether there is a selected operation command. If “YES”, it is determined in step S63 whether the RECORD flag of the memory is on. If “YES” in the step S63, that is, if voice recording is being performed, the operation command is stored in a command file in the memory in a step S65. As described above, when the voice is recorded, the input of the operation command is also recorded at the same time and transmitted to the server 14 at the above-described step S45 after the recording is completed. In this case, it is possible to play back on the other party side after giving a delay to speech and gesture at the same time.

一方、ステップＳ６３で“ＮＯ”であれば、ステップＳ６７で、動作コマンドと再生指示とを相手側の対話装置１２ｂに直接送信する。相手側の対話装置１２ｂは、動作コマンドと再生指示を受信すると、当該動作コマンドに対応するプログラムおよびデータに従って動作し、その身振りを実行する。 On the other hand, if “NO” in the step S63, the operation command and the reproduction instruction are directly transmitted to the partner interactive apparatus 12b in a step S67. Upon receiving the operation command and the reproduction instruction, the other-side dialog device 12b operates according to the program and data corresponding to the operation command and performs the gesture.
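The branch at step S63 amounts to choosing between buffering a gesture command with the ongoing recording and sending it at once. A minimal sketch under stated assumptions: the command file is modeled as a list, direct transmission is modeled as a callback, and the command names are invented for illustration.

```python
from typing import Callable, List


def route_command(command: str, record_flag: bool,
                  command_file: List[str],
                  send_direct: Callable[[str], None]) -> None:
    """Route one selected operation command (steps S63 to S67)."""
    if record_flag:
        command_file.append(command)  # S65: keep it with the recorded audio
    else:
        send_direct(command)          # S67: partner robot gestures immediately


buffered: List[str] = []   # stands in for the command file in memory
sent: List[str] = []       # stands in for the link to the partner device
route_command("nod", True, buffered, sent.append)
route_command("wave", False, buffered, sent.append)
print(buffered, sent)
```

Buffering the command alongside the audio is what allows speech and gesture to be delayed together when the two talkers overlap.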

ステップＳ６５またはＳ６７を終了したとき、またはステップＳ６１で“ＮＯ”の場合には、ステップＳ６９で、所定時間ΔＴ（たとえば１フレーム）を加算することで時刻（あるいはフレーム番号）ｔを更新する。そして、図６のステップＳ３に戻って、次の時刻ｔにおける処理を繰返す。このようにして、対話装置１２では、この対話装置１２側の対話者の発話の状態および相手側の発話の状態に応じた処理が実行される。 When step S65 or S67 is completed, or if “NO” in the step S61, the time (or frame number) t is updated by adding a predetermined time ΔT (for example, one frame) in a step S69. Then, the process returns to step S3 in FIG. 6 to repeat the process at the next time t. In this way, in the dialog device 12, processing according to the state of the utterance of the conversation person on the dialog device 12 side and the state of the utterance on the other party side is executed.

図１２にはサーバ１４の継続促進処理における動作の一例が示される。また、図１３には、サーバ１４の遅延再生処理における動作の一例が示される。 FIG. 12 shows an example of the operation of the server 14 in the continuation promotion process. FIG. 13 shows an example of the operation in the delayed reproduction process of the server 14.

Flow diagrams of the other processes of the server 14, such as the reception process, the utterance flag table creation process, and the transmission process, are omitted. The server 14 executes these processes in parallel. As described above, when the server 14 receives data from a dialogue device 12, it stores the data in memory and, as necessary, executes predetermined processing corresponding to the data. For example, when the server 14 receives data on the state of utterance or processing from a dialogue device 12, it creates the utterance flag table. When it receives an audio file, an operation command file, or the like, it stores these files and writes their storage locations into the utterance flag table. When it receives voice analysis data, it stores the data in the voice analysis history DB 92. When a dialogue device 12 requests the utterance flag table, the server 14 transmits the table to that dialogue device 12.

図１２に示す継続促進処理では、サーバ１４のＣＰＵは、ステップＳ１１１で初期化を実行し、たとえば変数ｔに初期値Ｔを設定する。この初期値Ｔは発話フラグテーブルの時刻ｔの最初の値Ｔであり、つまり、対話装置１２における時刻ｔの初期値Ｔである。したがって、継続促進処理は発話フラグテーブルの作成後に実行される。続くステップＳ１１３からＳ１３５の処理をサーバ１４のＣＰＵは一定時間ΔＴごとに、たとえば１フレームごとに繰り返し実行する。 In the continuation promotion process shown in FIG. 12, the CPU of the server 14 executes initialization in step S111, and sets an initial value T for a variable t, for example. This initial value T is the first value T at time t in the utterance flag table, that is, the initial value T at time t in the interactive device 12. Therefore, the continuation promotion process is executed after the utterance flag table is created. The CPU of the server 14 repeatedly executes the processing of subsequent steps S113 to S135 every predetermined time ΔT, for example, every frame.

In step S113, the utterance flag table in memory is referenced; for example, the data for the current time t is read. Then, in step S115, it is determined whether both talkers have the SILENT flag. For example, it is determined whether, at the current time t, there are two talkers whose user and utterance-target entries are paired with each other and whose state flags are both the SILENT flag. In FIG. 5, for example, time T + ΔT corresponds to this state.

If “YES” in step S115, that is, if the dialogue is in a silent state, the blank time is calculated in step S117. For example, the utterance flag table data at and before the current time t is read, and the elapsed time (or number of frames) is calculated by going back from the current time t until the state flag of either talker is no longer the SILENT flag.
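The blank-time calculation of step S117 can be sketched as a backward scan over the utterance flag table. The sketch assumes the table is a list indexed by frame, each entry mapping a talker to a state string, and that the current frame itself is counted; neither detail is specified in the text. The function name `blank_frames` is hypothetical.

```python
from typing import Dict, List

SILENT = "SILENT"


def blank_frames(table: List[Dict[str, str]], t: int) -> int:
    """Count consecutive frames, ending at frame t, in which every talker is SILENT."""
    count = 0
    while t - count >= 0 and all(
        state == SILENT for state in table[t - count].values()
    ):
        count += 1
    return count


# Hypothetical table for talkers A and B.
table = [
    {"A": "SPEAKING", "B": "SILENT"},
    {"A": "SILENT", "B": "SILENT"},
    {"A": "SILENT", "B": "SILENT"},
]
print(blank_frames(table, 2))
```

Because the scan stops at the first frame that is not SILENT for both talkers, a frame whose state has been overwritten with another flag also ends the count.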

Subsequently, in step S119, the talkers' latest data is extracted from the voice analysis history DB 92. Specifically, the fundamental frequency, amplitude, average syllable duration, and so on are read from the speaker's voice analysis data at the time closest to the current time t. In this way, steps S117 and S119 detect the state of the pause, which includes at least the blank time and the features of the speech uttered before the blank.

Then, in step S121, the current pause state is collated with the pause patterns, and in step S123 it is determined whether there is a pause pattern that matches the pause state of the current dialogue. As shown in FIG. 4 described above, a blank time (t) and a condition (I) are set in the pause pattern data, so it is determined whether a pause state (that is, a silent state following the last speaker's utterance) has arisen whose blank time and speech features (fundamental frequency, amplitude, average syllable duration, and so on) fit such a pause pattern. If there is a matching pause pattern, the pause-function word corresponding to that pattern is selected. In addition, the dialogue device 12 that is to utter the pause-function word is identified based on the relationship between the last speaker and the speaker set in the pause pattern data (the partner or oneself).
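The collation of steps S121 and S123 can be sketched as a first-match search over the stored patterns. Everything about the record layout here is an assumption: the fields of `PausePattern`, the use of a minimum blank time as the comparison, the feature keys, and the example words are invented for illustration and are not taken from FIG. 4.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class PausePattern:
    min_blank_sec: float                           # blank time (t), assumed a lower bound
    condition: Callable[[Dict[str, float]], bool]  # condition (I) on speech features
    word: str                                      # pause-function word to utter
    speaker: str                                   # "self" or "other", relative to the last speaker


def match_pattern(patterns: List[PausePattern], blank_sec: float,
                  features: Dict[str, float]) -> Optional[PausePattern]:
    """Return the first pattern whose blank time and condition both fit, else None."""
    for pattern in patterns:
        if blank_sec >= pattern.min_blank_sec and pattern.condition(features):
            return pattern
    return None


# Hypothetical patterns and features.
patterns = [
    PausePattern(1.0, lambda f: f["mean_syllable_sec"] > 0.25, "well", "self"),
    PausePattern(2.0, lambda f: True, "uh-huh", "other"),
]
features = {"pitch_hz": 180.0, "amplitude": 0.4, "mean_syllable_sec": 0.2}
short_pause = match_pattern(patterns, 0.5, features)
long_pause = match_pattern(patterns, 2.5, features)
print(short_pause, long_pause.word if long_pause else None)
```

The `speaker` field of the matched pattern is what would select the dialogue device that utters the word; a real pattern store might also rank several matches rather than take the first.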

If “YES” in step S123, that is, if it is determined that the pause state of the current dialogue calls for inserting a pause based on a pause pattern, then in step S125 the audio file of the selected pause-function word is read into the work area of memory, its pitch and intonation pattern are adjusted, and the adjusted audio file of the pause-function word is generated. This makes it possible to output a pause-function word matched to the characteristics of the speaker's utterance (for example, an excited tone or a flat delivery). Consequently, even when synthesized speech is inserted into the conversation, the talkers are unlikely to find it unnatural, and the tone and flow of the conversation up to that point can be maintained.

In step S127, an operation command suited to the selected pause-function word is selected. In this embodiment, an operation command corresponding to each pause-function word is registered in the pause pattern data, so that operation command is selected.

そして、ステップＳ１２９で、音声ファイルと動作コマンドファイルを、発話させる対話装置１２に送信する。ファイル送信後、ステップＳ１３１で、音声と動作の再生指示を同じ対話装置１２に送信する。これによって、対話における無音領域に言葉や身振りを挿入することができる。なお、その対話装置１２は、音声ファイルの再生を実行し、当該音声を出力する。また、対話装置１２がロボット１２ｂである場合には、さらに動作コマンドの再生を実行し、当該動作コマンドに対応する身振りを行う。 In step S129, the voice file and the operation command file are transmitted to the dialogue apparatus 12 that speaks. After the file transmission, in step S131, the voice and operation playback instructions are transmitted to the same interactive device 12. As a result, words and gestures can be inserted into the silent area in the dialogue. The dialog device 12 executes reproduction of the audio file and outputs the audio. When the interactive device 12 is the robot 12b, the operation command is further reproduced, and the gesture corresponding to the operation command is performed.

さらに、ステップＳ１３３で、発話フラグテーブルにおいて、現時刻ｔの状態フラグにＩＮＴＥＲＰＯＬＡＴＩＮＧフラグを上書きする（図５の時刻Ｔ＋２ΔＴを参照）。これによって、以降のステップＳ１１７では、当該時刻ｔが無音であるとは見なされないようにすることができる。 In step S133, the state flag at the current time t is overwritten with the INTERPOLATING flag in the utterance flag table (see time T + 2ΔT in FIG. 5). Thereby, in the subsequent step S117, the time t can be prevented from being regarded as silent.

On the other hand, if “NO” in step S123, that is, if a pause according to a pause pattern does not yet need to be provided, the process proceeds directly to step S135. In step S135, the time (or frame number) t is updated by adding the predetermined time ΔT (for example, one frame). ΔT in the server 14 is the same as ΔT in the dialogue devices 12. The process then returns to step S113, and the processing for the next time t is repeated. In this way, when silence is detected in the dialogue, words and gestures are inserted as necessary, so that the silent time can be turned into an appropriate pause.

図１３に示す遅延再生処理では、サーバ１４のＣＰＵは、ステップＳ１５１で初期化を実行する。たとえば、ＰＬＡＹＩＮＧフラグをオフにする。ＰＬＡＹＩＮＧフラグは録音された音声ファイルおよび動作コマンドファイルを再生中であるか否かを示す。また、図１２の継続促進処理と同様に、変数ｔに初期値Ｔを設定する。この初期値Ｔは発話フラグテーブルの時刻ｔの最初の値Ｔであり、つまり、対話装置１２における時刻ｔの初期値Ｔである。したがって、この遅延再生処理も発話フラグテーブルの作成後に実行される。続くステップＳ１５３からＳ１７９の処理をサーバ１４のＣＰＵは一定時間ΔＴごとに、たとえば１フレームごとに繰り返し実行する。 In the delayed playback process shown in FIG. 13, the CPU of the server 14 performs initialization in step S151. For example, the PLAYING flag is turned off. The PLAYING flag indicates whether or not the recorded audio file and operation command file are being reproduced. Further, as in the continuation promotion process of FIG. 12, an initial value T is set to the variable t. This initial value T is the first value T at time t in the utterance flag table, that is, the initial value T at time t in the interactive device 12. Therefore, this delayed reproduction process is also executed after the utterance flag table is created. The CPU of the server 14 repeatedly executes the processing of subsequent steps S153 to S179 every predetermined time ΔT, for example, every frame.

In step S153, the utterance flag table in memory is referenced. In step S155, it is determined whether the PLAYING flag in memory is on. If “NO” in step S155, that is, if playback is not in progress, it is determined in step S157 whether the state flag of either talker at the current time t is the SPEAKING flag. If “YES” in step S157, one talker is speaking and that voice should be being output from the other dialogue device 12. Therefore, delayed playback is not performed, and the process proceeds to step S179.

一方、ステップＳ１５７で“ＮＯ”であれば、つまり、両対話装置１２で音声が出力されていない場合には、ステップＳ１５９で、両対話者のどちらかのＰＬＡＹフラグが１以上であるか否かを判断する。ステップＳ１５９で“ＮＯ”であれば、録音されたが未再生である音声ファイルが存在しないので、処理はそのままステップＳ１７９に進む。 On the other hand, if “NO” in the step S157, that is, if no sound is output from both the interactive devices 12, it is determined whether or not the PLAY flag of either of the interactive parties is 1 or more in a step S159. Judging. If “NO” in the step S159, since there is no audio file that has been recorded but not reproduced, the process proceeds to a step S179 as it is.

However, if “YES” in step S159, that is, if an audio file that was recorded but not yet played remains, then in step S161 the utterance flag table is consulted to find the user with the earliest time t at which the PLAY flag is 1. In other words, the user who started recording earliest is identified. If both talkers started recording at the same time, the user is identified based on a preset priority order (for example, B > A).
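The selection in step S161 can be sketched as follows. The sketch assumes that each talker's PLAY values are held in a list indexed by frame and that the priority order is given as a list with the highest priority first; the function name `pick_user` is hypothetical.

```python
from typing import Dict, List, Optional


def pick_user(play: Dict[str, List[int]], priority: List[str]) -> Optional[str]:
    """Return the user whose PLAY flag first equals 1 at the earliest frame."""
    best_user: Optional[str] = None
    best_frame: Optional[int] = None
    for user in priority:  # visiting in priority order makes ties favor earlier entries
        values = play.get(user, [])
        if 1 in values:
            frame = values.index(1)  # first frame of that user's recording
            if best_frame is None or frame < best_frame:
                best_user, best_frame = user, frame
    return best_user


# Hypothetical PLAY histories with priority B > A.
earlier_a = pick_user({"A": [0, 1, 2, 2], "B": [0, 0, 1, 1]}, ["B", "A"])
tie = pick_user({"A": [0, 1], "B": [0, 1]}, ["B", "A"])
nothing = pick_user({"A": [0, 0], "B": [0]}, ["B", "A"])
print(earlier_a, tie, nothing)
```

A strict less-than comparison is used so that, on a tie, the user visited first in the priority list is kept.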

Subsequently, in step S164, settings for playback are made: 1 is set in the variable F, and the identified user is set in the variable U. The variable F is a frame counter for audio playback. In step S165, the PLAYING flag in memory is turned on to record that playback is in progress. Then, in step S167, the audio and motion for which the PLAY flag of the variable U equals the value of the variable F are played. Specifically, the corresponding audio file is read, and the audio file and a playback instruction are transmitted to the dialogue device 12 of that user's partner. If an operation command file has also been saved, it is read as well and transmitted to the partner's dialogue device 12 together with the audio file. In response, that dialogue device 12 stores the audio file and the operation command file and plays them. As a result, the audio is output from the speaker 18 or 22 and, if there was also an operation command, the corresponding gesture is executed. In this way, playback of the recorded audio and the stored motion begins.

ステップＳ１６７を終了すると、処理はステップＳ１７９へ進む。ステップＳ１７９では、時刻ｔに所定時間ΔＴが加算されて時刻ｔが更新される。ステップＳ１７９を終了すると、処理はステップＳ１５３へ戻って、次の時刻ｔにおける処理を繰返す。 When step S167 ends, the process proceeds to step S179. In step S179, the predetermined time ΔT is added to the time t, and the time t is updated. When step S179 ends, the process returns to step S153, and the process at the next time t is repeated.

Once playback has started, “YES” is determined in step S155, and in the following step S169 it is determined whether the PLAY flag of the variable U at time t is equal to the value of the variable F. As described above, once recording has ended, the PLAY flag keeps the value of the preceding time, so step S169 determines whether playback of the audio file being played has completed.

ステップＳ１６９で“ＮＯ”であれば、つまり、音声ファイルの再生が未だ完了していない場合には、ステップＳ１７１で、変数Ｆをインクリメントする。その後、ステップＳ１７３で、変数ＵのＰＬＡＹフラグが変数Ｆの値である音声および動作を再生する。これによって、上述のステップＳ１６７と同様にデータが送信され、次のフレームの音声および動作が対話装置１２で再生される。ステップＳ１７３を終了すると、処理はステップＳ１７９へ進む。 If “NO” in the step S169, that is, if the reproduction of the audio file is not yet completed, the variable F is incremented in a step S171. After that, in step S173, the voice and operation in which the PLAY flag of the variable U is the value of the variable F are reproduced. As a result, data is transmitted in the same manner as in step S167 described above, and the voice and operation of the next frame are reproduced by the dialogue apparatus 12. When step S173 ends, the process proceeds to step S179.

On the other hand, if “YES” in step S169, that is, if playback of the audio file has completed, the PLAYING flag in memory is turned off in step S175. In step S177, the value of the variable F is subtracted from every PLAY flag value of the variable U. If a value becomes negative as a result of the subtraction, that PLAY flag value is set to 0. As a result, the PLAY flag values of the variable U at the times that were played all become 0. If the user of the variable U still has an unplayed audio file, the PLAY flag value at the earliest time of that user's oldest remaining recording becomes 1, so that unplayed audio can be played next time. When step S177 ends, the process proceeds to step S179.
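The subtraction in step S177 can be checked with a short worked example. The sketch assumes one talker's PLAY values are held in a list indexed by frame; the function name `consume` and the sample history are hypothetical.

```python
from typing import List


def consume(play_values: List[int], played_frames: int) -> List[int]:
    """Subtract the number of played frames (variable F) and clamp negatives to 0."""
    return [max(0, value - played_frames) for value in play_values]


# Hypothetical history: a 3-frame recording (1, 2, 3), a hold at 3,
# then a second 2-frame recording (4, 5) that is held at 5.
history = [0, 1, 2, 3, 3, 3, 4, 5, 5]
after_first_playback = consume(history, 3)
print(after_first_playback)
```

After the first file is played, its frames drop to 0 and the first frame of the second recording now carries the value 1, so the same search for a PLAY flag equal to 1 finds the next unplayed recording.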

このようにして、両対話者の発話の重複によって録音された音声および記録された動作コマンドを、後から再生することができる。 In this way, it is possible to reproduce the voice recorded by the overlap of the utterances of the two interactors and the recorded operation command later.

図１４には、対話装置１２のＣＰＵの出力処理の動作の一例が示される。この出力処理は上述の図６から図９の入力処理と並列的に実行される。また、この出力処理は一定時間ごと、たとえば１フレームごとに繰り返し実行される。 FIG. 14 shows an example of the output processing operation of the CPU of the interactive apparatus 12. This output process is executed in parallel with the input processes shown in FIGS. Further, this output process is repeatedly executed at regular intervals, for example, every frame.

ステップＳ１９１では、音声を受信したか否かが判断され、“ＹＥＳ”であれば、ステップＳ１９３で、受信した音声ファイルないし音声データをメモリに記憶する。 In step S191, it is determined whether or not voice is received. If “YES”, the received voice file or voice data is stored in the memory in step S193.

続いて、ステップＳ１９５では、動作コマンドを受信したか否かが判断され、“ＹＥＳ”であれば、ステップＳ１９７で、受信した動作コマンドファイルをメモリに記憶する。 Subsequently, in step S195, it is determined whether or not an operation command has been received. If “YES”, the received operation command file is stored in the memory in step S197.

続いて、ステップＳ１９９では、再生指示を受信したか否かが判断され、“ＹＥＳ”であれば、ステップＳ２０１で、音声を再生する。具体的には、対話装置１２のＣＰＵは、受信した音声ファイルを再生を開始し、当該音声データを音声入出力ボードに与えてスピーカから当該音声を出力する。また、当該対話装置１２が身体動作機能を有する対話装置１２ｂである場合には、ステップＳ２０３で、動作を再生する。具体的には、当該動作コマンドに従って対応する身振りを実行する。動作コマンドに対応する身振りを実行するためのプログラムおよび制御データは、対話装置１２ｂのメモリ７０に予め記憶されている。ＣＰＵ６６は動作コマンドに対応するプログラムに従って制御データをモータ制御ボード７２に与えて、対応するモータを制御する。これによって対応する身体部位が動かされて所定の身振りが表現される。 Subsequently, in step S199, it is determined whether or not a reproduction instruction is received. If “YES”, the audio is reproduced in step S201. Specifically, the CPU of the dialogue apparatus 12 starts playing the received audio file, gives the audio data to the audio input / output board, and outputs the audio from the speaker. If the interactive device 12 is an interactive device 12b having a physical motion function, the motion is reproduced in step S203. Specifically, the corresponding gesture is executed according to the operation command. A program and control data for executing gestures corresponding to the operation commands are stored in advance in the memory 70 of the interactive device 12b. The CPU 66 gives control data to the motor control board 72 according to a program corresponding to the operation command to control the corresponding motor. As a result, the corresponding body part is moved to express a predetermined gesture.

In the embodiment described above, when the utterances of the two talkers overlap, the voice of the one who spoke later is recorded and, once neither talker is speaking, the recorded voice is output on the partner's side. In other embodiments, however, when the utterances of the two talkers overlap, the voice of the one who spoke later may be cancelled.

In each of the embodiments described above, the server 14 stores the audio data of the pause-function words and transmits it to the dialogue device 12. In other embodiments, however, the audio data of the pause-function words may be stored in each dialogue device 12 in advance, and the server 14 may transmit information designating the pause-function word to be played.

In each of the embodiments described above, the system 10 includes a dialogue device 12a having no body motion function and a dialogue device 12b having a body motion function. In another embodiment, however, only dialogue devices 12a having no body motion function may be used, in which case the processing related to motion commands is unnecessary. Conversely, only dialogue devices 12b having a body motion function may be used.

In each of the embodiments described above, the system 10 includes, separately from the dialogue devices 12, a server 14 that manages information indicating the voice acquisition state and the voice output state of each dialogue device 12 (that is, the utterance flag table). In another embodiment, however, the server 14 need not be provided separately: the functions of the server 14 (management of the utterance flag table, continuation promotion processing, delayed playback processing, and so on) may be provided in one of the dialogue devices 12, or may be distributed between the two dialogue devices 12.

FIG. 1 is an illustrative view showing the configuration of a remote-place dialogue system according to one embodiment of the present invention.
FIG. 2 is an illustrative view showing an example of the external appearance of a dialogue device having a body motion function.
FIG. 3 is a block diagram showing an example of the electrical configuration of the dialogue device of FIG. 2.
FIG. 4 is an illustrative view showing an example of the pause pattern data stored in the pause pattern DB.
FIG. 5 is an illustrative view showing an example of the utterance flag table stored in the server.
FIG. 6 is a flowchart showing part of an example of the input processing of the dialogue device.
FIG. 7 is a flowchart showing part of the continuation of FIG. 6.
FIG. 8 is a flowchart showing part of the continuation of FIG. 6.
FIG. 9 is a flowchart showing the continuation of FIGS. 6, 7, and 8.
FIG. 10 is a flowchart showing an example of the pause measurement processing of FIG. 6.
FIG. 11 is a flowchart showing an example of the voice analysis processing of FIG. 8.
FIG. 12 is a flowchart showing an example of the continuation promotion processing of the server.
FIG. 13 is a flowchart showing an example of the delayed playback processing of the server.
FIG. 14 is a flowchart showing an example of the output processing of the dialogue device.

Explanation of symbols

10 … Remote-place dialogue system
12, 12a, 12b … Dialogue device
14 … Speech timing control server
16, 20 … Microphone
18, 22 … Speaker
92 … Voice analysis history database
94 … Pause pattern database

Claims

A remote-place dialogue system for conducting a dialogue between remote locations, including two dialogue devices connected via a network,
each of the dialogue devices comprising:
acquisition means for acquiring voice;
transmitting means for transmitting the voice acquired by the acquisition means to the dialogue device on the other side;
receiving means for receiving the voice transmitted from the dialogue device on the other side; and output means for outputting the voice received by the receiving means,
the system further comprising:
pause pattern storage means for storing a plurality of pause patterns, each including at least information on a blank time and on features of uttered voice;
history recording means for recording a history of dialogue states including at least the voice acquisition state and the voice output state of each of the dialogue devices;
collating means for collating, when it is determined based on the history recorded by the history recording means that both of the dialogue devices are in a no-utterance state, the current pause situation, including at least the blank time and the features of the voice uttered before the blank, with the plurality of pause patterns; and control means for causing, when the collation by the collating means finds a matching pause pattern, a predetermined voice corresponding to that pause pattern to be output from the output means of the dialogue device corresponding to that pause pattern.

The remote-place dialogue system according to claim 1, further comprising delayed playback means for recording one of the voices when it is determined, based on the history recorded by the history recording means, that utterances overlap between the two dialogue devices, and for thereafter causing the recorded voice to be output from the output means of the dialogue device on the other side.

The remote-place dialogue system according to claim 2, wherein, when at least one of the dialogue devices is a robot capable of performing gestures, the control means causes that dialogue device to execute a predetermined gesture corresponding to the pause pattern together with the output of the voice.