JP2017123027A

JP2017123027A - Conversation support system, conversation support device, and conversation support program

Info

Publication number: JP2017123027A
Application number: JP2016001340A
Authority: JP
Inventors: 亮石井; Akira Ishii; 和弘大塚; Kazuhiro Otsuka; 史朗熊野; Shiro Kumano
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-01-06
Filing date: 2016-01-06
Publication date: 2017-07-13
Anticipated expiration: 2036-01-06
Also published as: JP6445473B2

Abstract

PROBLEM TO BE SOLVED: To provide a conversation support system, a conversation support device, and a conversation support program that prompt a conversation participant to speak when that participant has missed the appropriate timing to speak during the conversation. SOLUTION: In a conversation support system 100, a next speaker probability estimation unit 108 estimates, based on measurements of the non-verbal behavior of each participant in the conversation, the probability that each participant becomes the next speaker at an arbitrary time. A control unit 109 estimates, based on each participant's next speaker probability, a predicted next speaker (the participant who should speak next) and the timing at which the predicted next speaker starts to speak. When the control unit determines that the predicted next speaker did not speak at the estimated timing, it issues an instruction to prompt the predicted next speaker to speak. Speech inducing units 111 to 119 receive the instruction from the control unit and prompt the target person to speak. SELECTED DRAWING: Figure 1

Description

The present invention relates to a conversation support system, a conversation support device, and a conversation support program.

When the participants in a conversation speak at appropriate times, the conversation proceeds in a good atmosphere. Conversations have various purposes, but a conversation with a good atmosphere is effective in achieving its purpose and leaves the participants more satisfied. However, speaking at an appropriate time during a conversation requires advanced communication skills. As a result, even in situations where a participant is expected to speak, some participants miss the opportunity because they are not good at catching the right moment, or because another participant started speaking first. In addition, when the participant who is most expected to speak, judging from the atmosphere of the conversation, does not speak, the other participants may wait for that participant and hesitate to speak themselves, and the conversation may stall as a result.

Meanwhile, there is a technique for making the next speaker speak in a conference. In this technique, in a multi-party TV (video) conference, each participant's desire to speak is estimated from body movements and utterance information, and the next speaker is determined based on that desire. Then, to ensure that this next speaker actually speaks, control is performed so that the other participants hear that person's fillers and the like. There is also a technique that, in a multi-party TV conference, detects a person who speaks a lot and generates sound that suppresses that person's speech, so that everyone can speak equally (see, for example, Patent Document 2).

Patent Document 1: JP 2012-146072 A
Patent Document 2: JP 2007-158526 A

The technique of Patent Document 1 described above prevents participants other than the one the system intends to make the next speaker from starting to speak. The technique of Patent Document 2 promotes speech by other participants by blocking (obstructing) the speech of a specific participant. However, neither of these conventional techniques prompts that participant or the other participants to speak when a participant has missed the timing to speak.

In view of the above circumstances, an object of the present invention is to provide a conversation support system, a conversation support device, and a conversation support program capable of prompting speech when a conversation participant has missed the appropriate timing to speak during the conversation.

One aspect of the present invention is a conversation support system comprising: a next speaker probability estimation unit that estimates, based on measurements of the non-verbal behavior of each participant in a conversation, a next speaker probability, which is the probability that each participant becomes the next speaker at an arbitrary time; a control unit that estimates, based on the participants' next speaker probabilities, a predicted next speaker, who is the participant who should speak next, and the timing at which the predicted next speaker starts to speak, and that, upon detecting that the predicted next speaker did not speak at the estimated timing, issues an instruction to prompt speech with the predicted next speaker as the target person; and an utterance guidance unit that receives the instruction from the control unit and performs processing to prompt the target person to speak.

One aspect of the present invention is the conversation support system described above, wherein, upon detecting that the predicted next speaker did not speak at the estimated timing, the control unit instructs the utterance guidance unit to prompt speech with a speaker other than the predicted next speaker as the target person.

One aspect of the present invention is the conversation support system described above, wherein the utterance guidance unit controls a robot, or a speaker displayed on a display device, to perform an action indicating transfer of the right to speak to the target person.

One aspect of the present invention is the conversation support system described above, wherein the utterance guidance unit controls one or more of the eyes, head, or torso of a robot, or of a speaker displayed on a display device, so as to direct its gaze toward the target person.

One aspect of the present invention is the conversation support system described above, wherein the utterance guidance unit controls an upper limb of a robot, or of a speaker displayed on a display device, so that it is extended toward the target person.

One aspect of the present invention is the conversation support system described above, wherein the utterance guidance unit outputs speech that prompts the target person to speak.

One aspect of the present invention is a conversation support device comprising: a next speaker probability estimation unit that estimates, based on measurements of the non-verbal behavior of each participant in a conversation, a next speaker probability, which is the probability that each participant becomes the next speaker at an arbitrary time; and a control unit that estimates, based on the participants' next speaker probabilities, a predicted next speaker, who is the participant who should speak next, and the timing at which the predicted next speaker starts to speak, and that, upon detecting that the predicted next speaker did not speak at the estimated timing, instructs an utterance guidance unit, which performs processing to prompt speech, to prompt speech with the predicted next speaker as the target person.

One aspect of the present invention is a conversation support program for causing a computer to execute: a next speaker probability estimation step of estimating, based on measurements of the non-verbal behavior of each participant in a conversation, a next speaker probability, which is the probability that each participant becomes the next speaker at an arbitrary time; and a control step of estimating, based on the participants' next speaker probabilities, a predicted next speaker, who is the participant who should speak next, and the timing at which the predicted next speaker starts to speak, and, upon detecting that the predicted next speaker did not speak at the estimated timing, instructing an utterance guidance unit, which performs processing to prompt speech, to prompt speech with the predicted next speaker as the target person.

According to the present invention, speech can be prompted when a conversation participant has missed the appropriate timing to speak during the conversation.
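The control step summarized in the aspects above (detect that the predicted next speaker did not start speaking at the estimated timing, then choose whom to prompt) might be sketched as follows. This is only an illustrative sketch: the function names, the `(speaker, start, end)` tuple layout, and the `tolerance` parameter are assumptions introduced here and are not specified in the text.

```python
def speech_started(sections, speaker, t_expected, tolerance):
    """Return True if `speaker` began an utterance near the expected time.

    sections: list of (speaker, start_sec, end_sec) tuples, standing in for
    the utterance section information (hypothetical layout).
    tolerance: allowed deviation in seconds (an assumption; the text does
    not say how strictly "at the estimated timing" is judged).
    """
    return any(s == speaker and abs(start - t_expected) <= tolerance
               for s, start, _end in sections)


def decide_prompt(sections, predicted_speaker, t_expected, tolerance):
    """Return the participant to prompt, or None if no prompt is needed."""
    if speech_started(sections, predicted_speaker, t_expected, tolerance):
        return None               # the predicted next speaker took the turn
    return predicted_speaker      # the turn was missed, so prompt this person
```

For example, `decide_prompt([("A", 0.0, 2.0)], "B", 2.5, 0.3)` returns `"B"`, since no utterance by B starts near 2.5 s. The variant in which a different participant is prompted would return another name at the last line.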

A diagram showing an outline of the functional configuration of the robot 100 in the first embodiment.
A diagram showing a specific configuration example of the sensor 103 in the first embodiment.
A diagram showing an example of the next speaker probability $P^{ns}_i(t)$ output by the next speaker probability estimation unit 108 in the first embodiment.
A diagram showing a specific example of the detailed configuration of the sound control unit 110 in the first embodiment.
A diagram showing a specific example of the appearance and configuration of the robot 100 in the first embodiment.
A flowchart showing the operation of the robot 100 in the first embodiment.
A diagram showing an outline of the functional configuration of the robot 100A in the second embodiment.
A flowchart showing the operation of the robot 100A in the second embodiment.
A diagram showing an example of a breath inhalation section.
A diagram illustrating gaze target transition patterns.
A diagram illustrating time structure information.

Embodiments of the present invention will be described below with reference to the drawings.
(First embodiment)
FIG. 1 is a diagram showing an outline of the functional configuration of the robot 100 in the first embodiment. The robot 100 is an example of a conversation support system. The robot 100 in the first embodiment is a robot that converses with a plurality of participants. As shown in FIG. 1, the robot 100 includes a microphone 101, a camera 102, a sensor 103, a voice input unit 104, a video input unit 105, a sensor input unit 106, an utterance section detection unit 107, a next speaker probability estimation unit 108, a control unit 109, a sound control unit 110, a mouth control unit 111, a gaze control unit 112, a head control unit 113, a torso control unit 114, a speaker 115, a mouth drive unit 116, an eye drive unit 117, a head drive unit 118, and a torso drive unit 119.

The microphone 101 collects the sound around the robot 100, including the voices of the conversing participants, and outputs a sound signal that includes a voice signal (hereinafter simply called a voice signal). As long as the microphone 101 can collect at least the participants' voices, its installation position and number are arbitrary. For example, the microphone 101 consists of a plurality of microphones, one worn by each participant. Attaching a microphone to each participant individually, close to the participant's mouth, allows sound to be collected accurately. Alternatively, the microphone 101 may be mounted on the robot 100, or may be installed in the environment, apart from the participants and the robot 100. In the robot 100, the plurality of microphones 101 and the voice input unit 104 are connected by wire or wirelessly so that voice signals can be transmitted and received.

The camera 102 captures video of the conversing participants and outputs a video signal. As long as the camera 102 can capture all participants, its installation position and number are arbitrary. For example, the camera 102 is an imaging device with an angle of view wide enough to include all participants. Alternatively, the camera 102 may consist of as many cameras as there are participants, each capturing one participant. In this case, in the robot 100, the video input unit 105 and the plurality of cameras are connected by wire or wirelessly so that video signals can be transmitted and received.

The sensor 103 includes a plurality of sensors, such as a first sensor that measures the positions of the conversing participants relative to the position of the robot 100, a second sensor that measures the participants' breathing motion, a third sensor that detects the participants' gaze targets, and a fourth sensor that detects the participants' head motion, and outputs the sensor signals from these sensors to the sensor input unit 106.

FIG. 2 is a diagram showing a specific configuration example of the sensor 103 in the first embodiment.
As shown in FIG. 2, the sensor 103 includes a position measurement device (first sensor) 201 that measures the positions (in particular the face positions) of the conversing participants relative to the position of the robot 100, a breathing motion measurement device (second sensor) 202 that measures the participants' breathing motion, a gaze target detection device (third sensor) 203 that detects the participants' gaze targets, and a head motion detection device (fourth sensor) 204 that detects the participants' head motion. The position measurement device 201 is installed, for example, inside the robot 100. The breathing motion measurement device 202 is worn on the participant's trunk or the like, and the gaze target detection device 203 and the head motion detection device 204 are worn on the participant's head or the like. The position measurement device 201 is connected to the sensor input unit 106. The breathing motion measurement device 202, the gaze target detection device 203, and the head motion detection device 204 are connected to the sensor input unit 106 by wire or wirelessly so that sensor signals can be transmitted and received.

The voice input unit 104 in FIG. 1 receives the voice signal from the microphone 101 and outputs it to the utterance section detection unit 107, the next speaker probability estimation unit 108, and the sound control unit 110. The voice input unit 104 performs processing such as converting the voice signal from the microphone 101 into a signal format that can be processed within the robot 100. The video input unit 105 receives the video signal from the camera 102 and outputs it to the next speaker probability estimation unit 108. The video input unit 105 performs processing such as converting the video signal from the camera 102 into a signal format that can be processed within the robot 100. The sensor input unit 106 receives the sensor signals from the sensor 103 and outputs them to the next speaker probability estimation unit 108. The sensor input unit 106 performs processing such as converting the sensor signals from the sensor 103 into a signal format that can be processed within the robot 100.

The utterance section detection unit 107 uses any existing technique to detect the sections in which each participant spoke, based on voice features obtained from the voice signal from the voice input unit 104. For example, the utterance section detection unit 107 sets a window of arbitrary width on the voice signal from the voice input unit 104 and calculates the power, the number of zero crossings, the frequency, and so on of the voice signal within that window as voice features, that is, values that characterize the speech. The utterance section detection unit 107 detects utterance sections by comparing the calculated voice features with predetermined thresholds. The utterance section detection unit 107 outputs utterance section information, which is information about the detected utterance sections, to the next speaker probability estimation unit 108, the control unit 109, and the sound control unit 110. The utterance section information includes the start and end times of each utterance and information about the speaker. VAD (Voice Activity Detection), which automatically detects the sections of a voice signal acquired from the microphone 101 in which speech is present (utterance sections) and absent (non-utterance sections), is a known technique, as shown in Reference 1 below. The utterance section detection unit 107 detects utterance sections using a known VAD technique.
Reference 1: Hiroshi Sawada and four others, "Utterance section detection with many speakers and many microphones: a case study with pin microphones", Acoustical Society of Japan Spring Meeting, pp. 679-680, March 2007
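The window-and-threshold detection described above might be sketched as follows. This is a minimal illustration and not the VAD method of Reference 1: the function names, the non-overlapping frames, and the use of power alone for the decision are assumptions made here for brevity (the zero-crossing count is computed but only returned as a feature).

```python
def frame_features(samples, frame_len):
    """Return (power, zero_crossings) for each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Count sign changes between neighbouring samples.
        zc = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        feats.append((power, zc))
    return feats


def detect_sections(samples, sample_rate, frame_len, power_threshold):
    """Return (start_sec, end_sec) pairs for runs of frames above threshold.

    Assumption: a frame counts as speech when its mean power reaches
    power_threshold; a practical detector would combine more features.
    """
    feats = frame_features(samples, frame_len)
    sections, start = [], None
    for idx, (power, _zc) in enumerate(feats):
        active = power >= power_threshold
        if active and start is None:
            start = idx                       # a speech run begins
        elif not active and start is not None:
            sections.append((start * frame_len / sample_rate,
                             idx * frame_len / sample_rate))
            start = None                      # the speech run ended
    if start is not None:                     # signal ended during speech
        sections.append((start * frame_len / sample_rate,
                         len(feats) * frame_len / sample_rate))
    return sections
```

With one microphone per participant, running such a detector on each channel would also give the speaker label that the utterance section information carries.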

The next speaker probability estimation unit 108 receives the voice signal from the voice input unit 104, the video signal from the video input unit 105, the sensor signals from the sensor input unit 106, and the utterance section information from the utterance section detection unit 107, and outputs the next speaker probability, which is the probability that each participant becomes the next speaker at time t. Based on the voice signal, the video signal, the sensor signals, and the utterance section information, the next speaker probability estimation unit 108 acquires speaker information indicating the speaker of the utterance section identified by the utterance section information. Based on the voice signal, the video signal, the sensor signals, and the acquired speaker information, the next speaker probability estimation unit 108 calculates the next speaker probability $P^{ns}_i(t)$, which is the probability that each participant i becomes the next speaker at time t, and outputs it to the control unit 109. The next speaker probability estimation unit 108 calculates the next speaker probability $P^{ns}_i(t)$ based on the participants' non-verbal behavior. In other words, to calculate $P^{ns}_i(t)$, the next speaker probability estimation unit 108 does not need to obtain information about the participants' verbal behavior, for example by analyzing the content of their utterances. In addition to the next speaker probability $P^{ns}_i(t)$, the next speaker probability estimation unit 108 outputs the speaker information and the participants' position information to the control unit 109.

The next speaker probability estimation unit 108 may acquire the participants' position information based on, for example, the sensor signal from the sensor 103 that measures the participants' positions, the video signal, or both.

FIG. 3 is a diagram showing an example of the next speaker probability $P^{ns}_i(t)$ output by the next speaker probability estimation unit 108 in the first embodiment. FIG. 3 shows, for four participants A to D, an example of how the next speaker probability $P^{ns}_i(t)$ changes after time $t_{bue}$, at which participant A's utterance comes to a break. The rectangle labeled 31 indicates participant A's utterance section. The utterance section 31 ends at the utterance end time $t_{bue}$. The dotted line labeled as next speaker probability $P^{ns}_A(t)$ 32 shows the change in participant A's next speaker probability at times t after the utterance end time $t_{bue}$. The dotted line labeled $P^{ns}_B(t)$ 33 shows the corresponding change for participant B, the dotted line labeled $P^{ns}_C(t)$ 34 shows it for participant C, and the dotted line labeled $P^{ns}_D(t)$ 35 shows it for participant D. In this way, the next speaker probability estimation unit 108 calculates the change in the next speaker probability $P^{ns}_i(t)$ of each participant i ($i \in \{A, B, C, D\}$) at times t after the utterance end time $t_{bue}$. The details of the next speaker estimation processing in the next speaker probability estimation unit 108 will be described later.

The control unit 109 in FIG. 1 receives the next speaker probabilities from the next speaker probability estimation unit 108 and, based on them, estimates the predicted next speaker, who is the participant predicted to speak next, and the timing at which the predicted next speaker starts to speak (the utterance start timing). The control unit 109 includes a motion pattern information storage unit 1091. The motion pattern information storage unit 1091 stores motion pattern information describing the motions with which the robot 100 prompts speech.

The control unit 109 selects the predicted next speaker using one of the first to fifth next speaker selection methods described below. The following description covers the case in which four participants A, B, C, and D converse with the robot 100. The control unit 109 acquires the next speaker probabilities $P^{ns}_i(t)$ ($i \in \{A, B, C, D\}$) of A to D from the next speaker probability estimation unit 108.

(First next speaker selection method)
The control unit 109 compares the next speaker probabilities $P^{ns}_i(t)$ ($i \in \{A, B, C, D\}$) of the participants A to D. The control unit 109 determines that the participant among A to D whose next speaker probability $P^{ns}_i(t)$ has the highest maximum value is the predicted next speaker. The control unit 109 takes the time t at which the predicted next speaker's next speaker probability $P^{ns}_i(t)$ reaches its maximum as the utterance start timing of the predicted next speaker. If none of the next speaker probabilities $P^{ns}_i(t)$ of the participants A to D exceeds a first threshold, the control unit 109 may determine that the predicted next speaker is the robot 100.
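The first next speaker selection method might be sketched as follows, assuming each participant's next speaker probability is available as a list of samples on a common time grid. The function name, the dictionary layout, and the `"robot"` return value with no timing are illustrative assumptions.

```python
def select_by_highest_peak(times, probs, threshold):
    """Pick the participant whose probability curve has the highest peak.

    times: sample times in seconds after the utterance end (assumed grid).
    probs: dict mapping participant -> list of probabilities at `times`.
    Returns (predicted_next_speaker, utterance_start_time).
    """
    best, best_peak, best_time = None, float("-inf"), None
    for participant, series in probs.items():
        peak = max(series)
        if peak > best_peak:
            best, best_peak = participant, peak
            best_time = times[series.index(peak)]   # time of the maximum
    if best_peak <= threshold:       # nobody exceeds the first threshold
        return "robot", None         # assumption: no timing for the robot
    return best, best_time
```

With curves peaking at 0.2 for A, 0.6 for B, and 0.4 for C, the function selects B together with the time of B's peak, provided the threshold is below 0.6.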

(Second next speaker selection method)
The control unit 109 determines that, among the participants A to D, the participant whose next speaker probability $P^{ns}_i(t)$ ($i \in \{A, B, C, D\}$) reaches a maximum value equal to or greater than a second threshold at the earliest time is the predicted next speaker. The control unit 109 takes the time t at which the predicted next speaker's next speaker probability $P^{ns}_i(t)$ reaches its maximum as the utterance start timing of the predicted next speaker. If none of the next speaker probabilities $P^{ns}_i(t)$ of the participants A to D exceeds the second threshold, the control unit 109 may determine that the predicted next speaker is the robot 100.
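The second next speaker selection method might be sketched as follows. The sketch reads "maximum value" as each participant's overall maximum on the sampled grid and treats a participant as a candidate when that maximum is at least the threshold. Both readings, and the names used, are assumptions.

```python
def select_by_earliest_peak(times, probs, threshold):
    """Pick the participant whose qualifying peak occurs earliest.

    probs: dict mapping participant -> list of probabilities at `times`.
    A participant qualifies when the maximum of its curve is >= threshold.
    Returns (predicted_next_speaker, utterance_start_time).
    """
    candidates = []
    for participant, series in probs.items():
        peak = max(series)
        if peak >= threshold:
            candidates.append((times[series.index(peak)], participant))
    if not candidates:
        return "robot", None          # assumption: no timing for the robot
    peak_time, participant = min(candidates)   # earliest peak time wins
    return participant, peak_time
```

Compared with the first method, a lower but earlier peak can win here, which favors the participant who is ready to speak first.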

(Third next speaker selection method)
The control unit 109 integrates each of the next speaker probabilities $P^{ns}_i(t)$ ($i \in \{A, B, C, D\}$) of the participants A to D over time t for a predetermined period (for example, a period of 3 to 4 seconds or more from the utterance end time) to obtain the integral value $P^{ns}_i$. The integration interval may extend from the utterance end time to infinity, or up to the time at which the next speaker probabilities $P^{ns}_i(t)$ of all participants A to D fall below a predetermined value and are no longer significant. The control unit 109 determines that the participant among A to D with the largest integral value $P^{ns}_i$ is the predicted next speaker. The control unit 109 takes the time t at which the predicted next speaker's next speaker probability $P^{ns}_i(t)$ reaches its maximum as the utterance start timing of the predicted next speaker. If the integral value $P^{ns}_i$ does not exceed a third threshold for any of the participants A to D, the control unit 109 may determine that the predicted next speaker is the robot 100.
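The third next speaker selection method might be sketched as follows. The integral is approximated with the trapezoidal rule over the sampled curve, which is a choice made for this sketch. The text does not specify a numerical integration scheme, and the names are hypothetical.

```python
def trapezoid(times, series):
    """Approximate the time integral of a sampled curve (trapezoidal rule)."""
    return sum((t1 - t0) * (y0 + y1) / 2.0
               for t0, t1, y0, y1 in zip(times, times[1:], series, series[1:]))


def select_by_integral(times, probs, threshold):
    """Pick the participant with the largest integrated probability.

    probs: dict mapping participant -> list of probabilities at `times`,
    where `times` is assumed to span the chosen integration interval.
    Returns (predicted_next_speaker, utterance_start_time).
    """
    integrals = {p: trapezoid(times, s) for p, s in probs.items()}
    best = max(integrals, key=integrals.get)
    if integrals[best] <= threshold:   # nobody exceeds the third threshold
        return "robot", None           # assumption: no timing for the robot
    series = probs[best]
    # The start timing is still the time of the curve's maximum.
    return best, times[series.index(max(series))]
```

Integration rewards a participant whose probability stays moderately high for a long time over one with a single sharp peak.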

(Fourth next speaker selection method)
The control unit 109 obtains the sum of the next speaker probabilities $P^{ns}_i(t)$ ($i \in \{A, B, C, D\}$) of the participants A to D, namely $P^{ns}_A(t) + P^{ns}_B(t) + P^{ns}_C(t) + P^{ns}_D(t)$, and compares it with an arbitrary probability $P_\gamma$, which is a fourth threshold. If the sum of the next speaker probabilities of all participants A to D is equal to or greater than $P_\gamma$, that is, $P^{ns}_A(t) + P^{ns}_B(t) + P^{ns}_C(t) + P^{ns}_D(t) \geq P_\gamma$, the control unit 109 obtains the predicted next speaker and the utterance start timing using any of the first to third next speaker selection methods described above. In this case, the comparison with the first to third thresholds in the first to third next speaker selection methods may be omitted. If the sum of the next speaker probabilities of all participants A to D is less than $P_\gamma$, the control unit 109 determines that the predicted next speaker is the robot 100.
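The fourth next speaker selection method might be sketched as a gate in front of one of the earlier methods. The text does not state at which time t the sum is evaluated, so this sketch takes a sample index `k` as an explicit, assumed parameter. The helper `highest_peak` stands in for the first method without its threshold, and all names are hypothetical.

```python
def highest_peak(times, probs):
    """First method without the threshold check (used after the gate)."""
    best = max(probs, key=lambda p: max(probs[p]))
    series = probs[best]
    return best, times[series.index(max(series))]


def select_with_sum_gate(times, probs, k, p_gamma, selector=highest_peak):
    """Apply the summed-probability gate, then delegate to `selector`.

    k: index into `times` at which the sum is evaluated (an assumption).
    p_gamma: the fourth threshold.
    """
    total = sum(series[k] for series in probs.values())
    if total >= p_gamma:
        return selector(times, probs)   # some participant is likely to speak
    return "robot", None                # nobody is likely to speak
```

The gate separates the question of whether any participant is likely to speak from the question of which one.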

（第５の次話者選択方法）
制御部１０９は、参加者Ａ〜Ｄの次話者確率Ｐ^ｎｓ _ｉ（ｔ），（ｉ∈｛Ａ，Ｂ，Ｃ，Ｄ｝）のそれぞれを、時刻ｔについて所定時間（例えば、３〜４秒以上の時間）積分して、積分値Ｐ^ｎｓ _ｉを取得する。なお、積分区間を発話終了から無限時間としてもよく、全参加者の次話者確率Ｐ^ｎｓ _ｉ（ｔ）が所定値未満となる時間までとしてもよい。制御部１０９は、参加者Ａ〜Ｄの全員の積分値Ｐ^ｎｓ _ｉを加算した加算値（Ｐ^ｎｓ _Ａ＋Ｐ^ｎｓ _Ｂ＋Ｐ^ｎｓ _Ｃ＋Ｐ^ｎｓ _Ｄ）を取得し、第５の閾値である任意の確率Ｐ_θと比較する。制御部１０９は、参加者Ａ〜Ｄの積分値の加算値が確率Ｐ_θ以上である（（Ｐ^ｎｓ _Ａ＋Ｐ^ｎｓ _Ｂ＋Ｐ^ｎｓ _Ｃ＋Ｐ^ｎｓ _Ｄ）≧Ｐ_θ）場合は、上記の第１〜第３のいずれかの次話者選択方法によって、予測次話者と発話開始タイミングを得る。ただし、第１〜第３の次話者選択方法において、第１〜第３の閾値との比較は行わなくてもよい。制御部１０９は、参加者Ａ〜Ｄの全員の積分値の加算値が確率Ｐ_θ未満である（（Ｐ^ｎｓ _Ａ＋Ｐ^ｎｓ _Ｂ＋Ｐ^ｎｓ _Ｃ＋Ｐ^ｎｓ _Ｄ）＜Ｐ_θ）場合は、予測次話者がロボット１００であると判断する。 (Fifth speaker selection method)
The control unit 109 integrates each of the next speaker probabilities P ^ns _i (t) (i ∈ {A, B, C, D}) of the participants A to D with respect to time t over a predetermined period (for example, 3 to 4 seconds or more) to obtain an integral value P ^ns _i . The integration interval may instead extend indefinitely from the end of the utterance, or may run until the next speaker probabilities P ^ns _i (t) of all the participants fall below a predetermined value. The control unit 109 obtains the sum (P ^ns _A + P ^ns _B + P ^ns _C + P ^ns _D ) of the integral values P ^ns _i of all the participants A to D and compares it with an arbitrary probability P _θ that serves as the fifth threshold value. When the sum of the integral values of the participants A to D is equal to or greater than the probability P _θ ((P ^ns _A + P ^ns _B + P ^ns _C + P ^ns _D ) ≧ P _θ ), the control unit 109 obtains the predicted next speaker and the utterance start timing by any one of the first to third next speaker selection methods described above. In that case, the comparison with the first to third threshold values in those methods may be omitted. When the sum of the integral values of all the participants A to D is less than the probability P _θ ((P ^ns _A + P ^ns _B + P ^ns _C + P ^ns _D ) < P _θ ), the control unit 109 determines that the predicted next speaker is the robot 100.
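Written compactly, the fifth method amounts to the decision rule below. The notation is introduced only for this sketch: t_e denotes the utterance end time, T the length of the integration period, and the bar distinguishes the integral value from the instantaneous probability; none of these symbols appear in the specification.

```latex
\bar{P}^{ns}_i = \int_{t_e}^{t_e + T} P^{ns}_i(t)\,dt, \qquad i \in \{A, B, C, D\}

\text{predicted next speaker} =
\begin{cases}
\text{chosen by the first, second, or third method} & \text{if } \sum_{i} \bar{P}^{ns}_i \ge P_\theta \\
\text{robot 100} & \text{if } \sum_{i} \bar{P}^{ns}_i < P_\theta
\end{cases}
```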

次話者確率Ｐ^ｎｓ _ｉ（ｔ），（ｉ∈｛Ａ，Ｂ，Ｃ，Ｄ｝）は、図３に示したように、発話終了から所定時間後にピークを有する場合が多い。そこで、制御部１０９は、第１〜第５の次話者選択方法において、次話者確率Ｐ^ｎｓ _ｉ（ｔ）を求める時刻ｔを含む窓幅を設けて、その窓幅の中における次話者確率の最大値を、時刻ｔにおける次話者確率Ｐ^ｎｓ _ｉ（ｔ）として用いるようにしてもよい。また、制御部１０９は、第１〜第５の次話者選択方法において、次話者確率Ｐ^ｎｓ _ｉ（ｔ）を求める時刻ｔを含む窓幅を設けて、その窓幅の中における次話者確率に複数のピークがある場合に、ｎ番目（ｎは１以上の整数）のピークの次話者確率を、時刻ｔにおける次話者確率Ｐ^ｎｓ _ｉ（ｔ）として用いるようにしてもよい。 The next speaker probability P ^ns _i (t) (i ∈ {A, B, C, D}) often peaks a predetermined time after the end of the utterance, as shown in FIG. 3. Therefore, in the first to fifth next speaker selection methods, the control unit 109 may set a window containing the time t for which the next speaker probability P ^ns _i (t) is to be obtained, and use the maximum next speaker probability within that window as the next speaker probability P ^ns _i (t) at time t. Alternatively, when the next speaker probability has a plurality of peaks within such a window, the control unit 109 may use the next speaker probability of the n-th peak (n is an integer of 1 or more) as the next speaker probability P ^ns _i (t) at time t.
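The two windowing variants can be illustrated with a short sketch. Several details are assumptions not stated in the text: the probability is a uniformly sampled list, the window is symmetric around the sample index, a peak is a sample strictly greater than its left neighbour and at least as large as its right neighbour, and peaks are numbered in time order. The function names are hypothetical.

```python
from typing import List, Optional, Tuple


def _window(length: int, index: int, half_width: int) -> Tuple[int, int]:
    """Clamp a symmetric window around index to the valid sample range."""
    return max(0, index - half_width), min(length, index + half_width + 1)


def window_max(series: List[float], index: int, half_width: int) -> float:
    """Maximum next speaker probability within the window around index."""
    lo, hi = _window(len(series), index, half_width)
    return max(series[lo:hi])


def nth_peak(
    series: List[float], index: int, half_width: int, n: int
) -> Optional[float]:
    """Value of the n-th local peak (1-based, in time order) in the window."""
    lo, hi = _window(len(series), index, half_width)
    peaks = [
        series[k]
        for k in range(max(lo, 1), min(hi, len(series) - 1))
        if series[k - 1] < series[k] >= series[k + 1]
    ]
    return peaks[n - 1] if 1 <= n <= len(peaks) else None
```

Either value would then be used in place of the raw probability at time t in whichever selection method is in use.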

制御部１０９は、第１〜第５の次話者選択方法により予測次話者がロボット１００であると判断した場合、音制御部１１０に対して発話を行うよう指示する発話制御信号を出力する。制御部１０９は、予測次話者が参加者Ａ〜Ｄのいずれかであると判断した場合、音制御部１１０に対して発話を抑制するよう指示する発話制御信号を出力するとともに、推定された発話開始タイミングに予測次話者が発話を行ったか否かを判断する。制御部１０９は、推定された発話開始タイミングに予測次話者が発話を行わなかったことを検出すると、動作パターン情報格納部１０９１から動作パターン情報を読み出す。制御部１０９は、読み出した動作パターン情報が示す動作を行わせるよう指示する発話誘導動作制御信号を、音制御部１１０、口部制御部１１１、視線制御部１１２、頭部制御部１１３、及び、胴部制御部１１４のうち１以上に出力する。発話誘導動作制御信号は、発話誘導対象者に対して発話を促す動作を行うよう指示する信号である。動作パターン情報は、例えば、発話誘導対象者に対して発話を促す内容の発話の音声を出力する、視線を発話誘導対象者の方向に向ける、発話誘導対象者の方向に上肢を差し出す、などの動作を示す。発話誘導動作制御信号には、発話誘導対象者を特定する情報が含まれる。制御部１０９は、発話誘導対象者を、予測次話者又は予測次話者とは異なる参加者とする。視線制御部１１２、頭部制御部１１３、又は、胴部制御部１１４に出力する発話誘導動作制御信号には、発話誘導対象者の位置の情報がさらに含まれる。 When the control unit 109 determines by any of the first to fifth next speaker selection methods that the predicted next speaker is the robot 100, it outputs to the sound control unit 110 an utterance control signal instructing it to speak. When the control unit 109 determines that the predicted next speaker is one of the participants A to D, it outputs to the sound control unit 110 an utterance control signal instructing it to suppress speech, and also determines whether the predicted next speaker has spoken at the estimated utterance start timing. On detecting that the predicted next speaker did not speak at the estimated utterance start timing, the control unit 109 reads operation pattern information from the operation pattern information storage unit 1091. The control unit 109 then outputs an utterance guidance operation control signal, which instructs execution of the operation indicated by the read operation pattern information, to one or more of the sound control unit 110, the mouth control unit 111, the line-of-sight control unit 112, the head control unit 113, and the torso control unit 114. The utterance guidance operation control signal is a signal instructing that an operation prompting the utterance guidance target person to speak be performed. The operation pattern information indicates operations such as outputting the audio of an utterance whose content prompts the utterance guidance target person to speak, directing the line of sight toward the utterance guidance target person, and extending an upper limb toward the utterance guidance target person. The utterance guidance operation control signal includes information identifying the utterance guidance target person. The control unit 109 sets the utterance guidance target person to either the predicted next speaker or a participant other than the predicted next speaker. The utterance guidance operation control signal output to the line-of-sight control unit 112, the head control unit 113, or the torso control unit 114 further includes information on the position of the utterance guidance target person.

制御部１０９は、発話誘導動作制御信号を出力したのち所定のタイミングまでに発話区間の開始を検出しなかった場合、新たな発話誘導対象者を選択する。制御部１０９は、新たな発話誘導対象者に対して発話を促す動作を行うよう指示する発話誘導動作制御信号を生成し、発話誘導動作制御信号を音制御部１１０、口部制御部１１１、視線制御部１１２、頭部制御部１１３、及び、胴部制御部１１４のうち一以上に出力する。 When the control unit 109 does not detect the start of the utterance section by a predetermined timing after outputting the utterance induction operation control signal, the control unit 109 selects a new utterance induction target person. The control unit 109 generates an utterance guidance operation control signal for instructing a new utterance guidance target person to perform an utterance urging operation, and the utterance guidance operation control signal is transmitted to the sound control unit 110, the mouth control unit 111, and the line of sight. Output to one or more of the control unit 112, the head control unit 113, and the torso control unit 114.

口部制御部１１１と、視線制御部１１２と、頭部制御部１１３と、胴部制御部１１４と、スピーカ１１５と、口部駆動部１１６と、眼部駆動部１１７と、頭部駆動部１１８と、胴部駆動部１１９とは、制御部１０９からの指示を受け、発話誘導対象者に発話を促す処理を行う発話誘導部として動作する。 The mouth control unit 111, the line-of-sight control unit 112, the head control unit 113, the torso control unit 114, the speaker 115, the mouth drive unit 116, the eye drive unit 117, the head drive unit 118, and the torso drive unit 119 operate as an utterance guidance unit that receives instructions from the control unit 109 and performs processing to prompt the utterance guidance target person to speak.

音制御部１１０は、制御部１０９からの発話制御信号又は発話誘導動作制御信号に基づいて、スピーカ１１５に対して音信号を出力する。音制御部１１０は、発話制御信号に基づいて、ロボット１００に発話を行わせるか否かを判断する。音制御部１１０は、発話制御信号に基づいて、ロボット１００に発話を行わせると判断した場合には、ロボット１００に発話させる会話内容（言葉）を含む会話情報を生成し、生成した会話情報に基づいた音信号を出力する。音制御部１１０は、例えば、音声信号及び発話区間情報に基づいて参加者の会話内容を解析し、解析結果に基づいて、ロボット１００に発話させるための会話情報を生成する。また、音制御部１１０は、発話誘導動作制御信号を受信した場合、発話誘導動作制御信号に設定されている発話誘導対象者に発話を促す内容の会話情報を生成し、生成した会話情報に基づいた音信号を出力する。 The sound control unit 110 outputs a sound signal to the speaker 115 based on the utterance control signal or the utterance guidance operation control signal from the control unit 109. Based on the utterance control signal, the sound control unit 110 determines whether to have the robot 100 speak. When it determines that the robot 100 is to speak, the sound control unit 110 generates conversation information containing the conversation content (words) for the robot 100 to utter, and outputs a sound signal based on the generated conversation information. For example, the sound control unit 110 analyzes the content of the participants' conversation based on the audio signal and the utterance section information, and generates the conversation information for the robot 100 based on the analysis result. In addition, on receiving the utterance guidance operation control signal, the sound control unit 110 generates conversation information whose content prompts the utterance guidance target person set in that signal to speak, and outputs a sound signal based on the generated conversation information.

ここで、第１の実施形態における音制御部１１０の構成の詳細について一例を示して説明する。
図４は、第１の実施形態における音制御部１１０の構成の詳細の具体例を示す図である。音制御部１１０は、音声解析部４０１と、会話情報生成部４０２と、会話情報ＤＢ（データベース）４０３と、発声情報生成部４０４と、音信号生成部４０５とを備える。 Here, the details of the configuration of the sound control unit 110 in the first embodiment will be described with reference to an example.
FIG. 4 is a diagram illustrating a specific example of details of the configuration of the sound control unit 110 according to the first embodiment. The sound control unit 110 includes a voice analysis unit 401, a conversation information generation unit 402, a conversation information DB (database) 403, an utterance information generation unit 404, and a sound signal generation unit 405.

会話情報ＤＢ４０３は、ロボット１００に会話させるための会話サンプル情報を格納する。会話サンプル情報とは、日常の会話でよく使われる名詞、「こんにちは」等の挨拶及び「ありがとうございます」、「大丈夫ですか」等の日常会話でよく利用するフレーズの音声信号を含む情報である。さらに、会話情報ＤＢ４０３は、各話者の名前の音声信号と、「〜さんは、どう思いますか」、「〜さんは、何かありますか」などの発話を促すフレーズの音声信号を記憶する。 The conversation information DB 403 stores conversation sample information for allowing the robot 100 to converse. The conversation sample information contains audio signals of nouns often used in everyday conversation, greetings such as "Hello", and phrases frequently used in everyday conversation such as "Thank you" and "Are you okay?". Furthermore, the conversation information DB 403 stores an audio signal of each speaker's name and audio signals of phrases that prompt an utterance, such as "What do you think, ~-san?" and "Do you have anything to add, ~-san?".

音声解析部４０１は、音声入力部１０４からの音声信号と、発話区間検出部１０７からの発話区間情報とに基づいて、音声信号を解析して、その内容（言葉）を特定し、解析結果を出力する。 The voice analysis unit 401 analyzes the voice signal based on the voice signal from the voice input unit 104 and the utterance section information from the utterance section detection unit 107, specifies the contents (words), and determines the analysis result. Output.

会話情報生成部４０２は、発話制御信号を受信した場合、音声解析部４０１の解析結果に基づいて、ロボット１００の発話内容となる会話情報を生成する。会話情報生成部４０２は、音声解析部４０１の解析結果に基づいて、会話する内容に応じた会話サンプル情報を会話情報ＤＢ４０３から取得する。会話情報生成部４０２は、取得した会話サンプル情報に基づいて、会話情報を生成する。会話情報生成部４０２は、発声情報生成部４０４からの会話情報の要求に応じて、会話情報を生成し、発声情報生成部４０４へ出力する。
また、会話情報生成部４０２は、発話誘導動作制御信号を受信した場合、その発話誘導動作制御信号に設定されている発話誘導対象者の名前の音声信号と、発話を促すフレーズの音声信号とを会話情報ＤＢ４０３から取得する。会話情報生成部４０２は、これらの音声信号を続けて出力する会話情報を生成し、発声情報生成部４０４へ出力する。 When the conversation information generation unit 402 receives the utterance control signal, the conversation information generation unit 402 generates conversation information that is the utterance content of the robot 100 based on the analysis result of the voice analysis unit 401. The conversation information generation unit 402 acquires conversation sample information corresponding to the content of conversation from the conversation information DB 403 based on the analysis result of the voice analysis unit 401. The conversation information generation unit 402 generates conversation information based on the acquired conversation sample information. The conversation information generation unit 402 generates conversation information in response to a request for conversation information from the utterance information generation unit 404 and outputs the conversation information to the utterance information generation unit 404.
In addition, when the conversation information generation unit 402 receives the utterance guidance operation control signal, it acquires from the conversation information DB 403 the audio signal of the name of the utterance guidance target person set in that signal and the audio signal of a phrase that prompts an utterance. The conversation information generation unit 402 generates conversation information for outputting these audio signals in succession, and outputs the conversation information to the utterance information generation unit 404.

発声情報生成部４０４は、会話情報生成部４０２からの会話情報と、制御部１０９からの発話制御信号又は発話誘導動作制御信号とを入力として、発話信号を出力する。発声情報生成部４０４は、制御部１０９からの発話制御信号又は発話誘導動作制御信号に基づいて、会話情報生成部４０２に対して会話情報を要求する。発声情報生成部４０４は、要求に応じて会話情報生成部４０２から取得した会話情報と、制御部１０９からの発話制御信号又は発話誘導動作制御信号とに基づいて、ロボット１００が発声するための発話信号を生成する。発声情報生成部４０４は、生成した発話信号を音信号生成部４０５へ出力する。 The utterance information generation unit 404 receives the conversation information from the conversation information generation unit 402 and the utterance control signal or utterance guidance operation control signal from the control unit 109 and outputs an utterance signal. The utterance information generation unit 404 requests the conversation information generation unit 402 for conversation information based on the utterance control signal or the utterance guidance operation control signal from the control unit 109. The utterance information generation unit 404 generates an utterance for the robot 100 to utter based on the conversation information acquired from the conversation information generation unit 402 upon request and the utterance control signal or utterance guidance operation control signal from the control unit 109. Generate a signal. The utterance information generation unit 404 outputs the generated utterance signal to the sound signal generation unit 405.

音信号生成部４０５は、発声情報生成部４０４からの発話信号を入力とし、音信号を出力する。音信号生成部４０５は、発声情報生成部４０４からの発話信号に基づいてスピーカ１１５から発話させるための音信号を生成して、スピーカ１１５へ出力する。 The sound signal generation unit 405 receives the utterance signal from the utterance information generation unit 404 and outputs a sound signal. The sound signal generation unit 405 generates a sound signal for uttering from the speaker 115 based on the utterance signal from the utterance information generation unit 404 and outputs the sound signal to the speaker 115.

図１に示す口部制御部１１１は、制御部１０９からの発話誘導動作制御信号に基づいて、口部駆動部１１６に対して口部駆動信号を出力する。視線制御部１１２は、制御部１０９からの発話誘導動作制御信号に基づいて、眼部駆動部１１７に対して眼部駆動信号を出力する。頭部制御部１１３は、制御部１０９からの発話誘導動作制御信号に基づいて、頭部駆動部１１８に対して頭部駆動信号を出力する。胴部制御部１１４は、制御部１０９からの発話誘導動作制御信号に基づいて、胴部駆動部１１９に対して胴部駆動信号を出力する。 The mouth control unit 111 shown in FIG. 1 outputs a mouth drive signal to the mouth drive unit 116 based on the speech guidance operation control signal from the control unit 109. The line-of-sight control unit 112 outputs an eye part drive signal to the eye part drive unit 117 based on the speech guidance operation control signal from the control unit 109. The head control unit 113 outputs a head drive signal to the head drive unit 118 based on the speech guidance operation control signal from the control unit 109. The torso control unit 114 outputs a torso drive signal to the torso drive unit 119 based on the speech guidance operation control signal from the control unit 109.

図５は、第１の実施形態におけるロボット１００の外観及び構成の具体例を示す図である。第１の実施形態におけるロボット１００は、例えば図５に示す外観を有し、図１に示す機能構成を有する。 FIG. 5 is a diagram illustrating a specific example of the appearance and configuration of the robot 100 according to the first embodiment. The robot 100 in the first embodiment has, for example, the appearance shown in FIG. 5 and the functional configuration shown in FIG.

図５に示すように、ロボット１００は、例えば、人間の上半身をモデルとした形状のヒューマノイドロボット（人型ロボット）である。ロボット１００は、発話を行う発話機能、人の音声を認識する音声認識機能、参加者を撮影するカメラ機能を少なくとも備える。ロボット１００は、右目５１ａ及び左目５１ｂと、口部５２とが配置された顔を有する頭部５３を備える。 As shown in FIG. 5, the robot 100 is, for example, a humanoid robot (humanoid robot) having a shape modeled on a human upper body. The robot 100 includes at least a speech function for speaking, a voice recognition function for recognizing a human voice, and a camera function for photographing a participant. The robot 100 includes a head 53 having a face on which a right eye 51a and a left eye 51b and a mouth portion 52 are arranged.

ロボット１００は、頭部５３を支持する頸部５４と、頸部５４を支える胴部５５とを備える。胴部５５は、上肢である右腕５５ａと左腕５５ｂとが側面上部に設けられている。また、頭部５３の右目５１ａ、左目５１ｂの間には、カメラ１０２が設置されている。以下の説明において、右目５１ａ、左目５１ｂをまとめて説明する場合は、眼部５１と称する。 The robot 100 includes a neck 54 that supports the head 53 and a body 55 that supports the neck 54. The torso 55 has a right arm 55a and a left arm 55b, which are upper limbs, provided on the upper side. A camera 102 is installed between the right eye 51 a and the left eye 51 b of the head 53. In the following description, the right eye 51a and the left eye 51b are collectively referred to as the eye part 51.

図１に示す構成の内、図５に示しているのは、カメラ１０２のみであるので、カメラ１０２以外の図１に示す構成の設置位置の一例について説明する。マイク１０１及びセンサ１０３は、ロボット１００の胴部５５内における任意の位置又は胴部５５から離れた位置（例えば参加者の位置）に設置される。図１に示すマイク１０１、カメラ１０２及びセンサ１０３以外の構成は、ロボット１００内部に設置されるものであり、例えば、スピーカ１１５は、図５に示した口部５２の内部に設置されている。 Of the components shown in FIG. 1, only the camera 102 appears in FIG. 5, so an example of the installation positions of the other components of FIG. 1 is described here. The microphone 101 and the sensor 103 are installed at an arbitrary position inside the torso 55 of the robot 100 or at a position away from the torso 55 (for example, at a participant's position). The components of FIG. 1 other than the microphone 101, the camera 102, and the sensor 103 are installed inside the robot 100; for example, the speaker 115 is installed inside the mouth 52 shown in FIG. 5.

ここで、ロボット１００が備える口部駆動部１１６、眼部駆動部１１７、頭部駆動部１１８及び胴部駆動部１１９の配置と駆動する対象について説明する。頭部５３は、右目５１ａ及び左目５１ｂの黒目（視線）を移動させる眼部駆動部１１７と、口部５２の開閉を行う口部駆動部１１６とを備える。 Here, the arrangement of the mouth drive unit 116, the eye drive unit 117, the head drive unit 118, and the torso drive unit 119 included in the robot 100 and the objects to be driven will be described. The head 53 includes an eye drive unit 117 that moves the black eyes (line of sight) of the right eye 51 a and the left eye 51 b, and a mouth drive unit 116 that opens and closes the mouth 52.

頸部５４は、頭部５３に対して所定の動き（例えば、頷かせたり、顔の方向を変えたりする動き）を行わせる頭部駆動部１１８を備え、頭部５３を支持する。胴部５５は、呼吸をしているかのように、肩を動かしたり、胸の部分を膨らませたりする胴部駆動部１１９を備える。口部駆動部１１６は、口部制御部１１１からの口部駆動信号に基づいてロボット１００の口部５２の開閉を行う。眼部駆動部１１７は、視線制御部１１２からの眼部駆動信号に基づいてロボット１００の眼部５１における黒目の方向（＝ロボット１００の視線の方向）を制御する。 The neck 54 includes a head drive unit 118 that causes the head 53 to perform predetermined movements (for example, nodding or changing the direction of the face), and supports the head 53. The torso 55 includes a torso drive unit 119 that moves the shoulders and expands the chest as if the robot were breathing. The mouth drive unit 116 opens and closes the mouth 52 of the robot 100 based on the mouth drive signal from the mouth control unit 111. The eye drive unit 117 controls the direction of the pupils in the eye part 51 of the robot 100 (that is, the direction of the line of sight of the robot 100) based on the eye drive signal from the line-of-sight control unit 112.

頭部駆動部１１８は、頭部制御部１１３からの頭部駆動信号に基づいてロボット１００の頭部５３の動きを制御する。胴部駆動部１１９は、胴部制御部１１４からの胴部駆動信号に基づいてロボット１００の胴部５５の形状を制御する。また、胴部駆動部１１９は、胴部制御部１１４からの胴部駆動信号に基づいてロボット１００の右腕５５ａと左腕５５ｂの動きも制御する。 The head drive unit 118 controls the movement of the head 53 of the robot 100 based on the head drive signal from the head control unit 113. The torso drive unit 119 controls the shape of the torso 55 of the robot 100 based on the torso drive signal from the torso controller 114. The torso driving unit 119 also controls the movement of the right arm 55a and the left arm 55b of the robot 100 based on the torso driving signal from the torso control unit 114.

次に、第１の実施形態におけるロボット１００の動作について説明する。
図６は、第１の実施形態におけるロボット１００の動作を示すフロー図である。図６に示す処理は、ロボット１００において、複数の参加者と会話を行う動作を開始した際に行う処理である。以下では、参加者Ａ〜Ｄとロボット１００が会話に参加している場合を例に説明する。 Next, the operation of the robot 100 in the first embodiment will be described.
FIG. 6 is a flowchart showing the operation of the robot 100 according to the first embodiment. The process shown in FIG. 6 is a process that is performed when the robot 100 starts an operation of having a conversation with a plurality of participants. Hereinafter, a case where the participants A to D and the robot 100 are participating in the conversation will be described as an example.

音声入力部１０４は、マイク１０１からの音声信号が入力され、映像入力部１０５は、カメラ１０２からの映像信号が入力され、センサ入力部１０６は、センサ１０３からのセンサ信号が入力される（ステップＳ１０１）。発話区間検出部１０７は、音声入力部１０４からの音声信号に基づいて、音声特徴量を算出し、算出した音声特徴量と所定の閾値を比較して発話区間を検出する（ステップＳ１０２）。 The audio input unit 104 receives the audio signal from the microphone 101, the video input unit 105 receives the video signal from the camera 102, and the sensor input unit 106 receives the sensor signal from the sensor 103 (step). S101). The utterance section detection unit 107 calculates a speech feature amount based on the speech signal from the speech input unit 104, and compares the calculated speech feature amount with a predetermined threshold value to detect a speech section (step S102).
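Step S102 can be pictured with the following sketch, which assumes that a per-frame speech feature (for example, frame energy) has already been computed and that a frame counts as speech when the feature is at or above the threshold. The specification does not say which feature is used or how frames are grouped, so both are assumptions, and the function name is hypothetical.

```python
from typing import List, Tuple


def detect_utterance_sections(
    features: List[float], threshold: float
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges where feature >= threshold."""
    sections: List[Tuple[int, int]] = []
    start = None
    for k, value in enumerate(features):
        if value >= threshold and start is None:
            start = k  # an utterance section begins at this frame
        elif value < threshold and start is not None:
            sections.append((start, k))  # the section ended before frame k
            start = None
    if start is not None:
        # The signal ended while a section was still open.
        sections.append((start, len(features)))
    return sections
```

The start of each returned range corresponds to the "start of an utterance section" that later steps wait for.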

次話者確率推定部１０８は、音声信号、映像信号、センサ信号及び取得した発話者情報に基づいて、各参加者ｉ（ｉ∈｛Ａ，Ｂ，Ｃ，Ｄ｝）が時刻ｔに次話者となる確率である次話者確率Ｐ^ｎｓ _ｉ（ｔ）を算出する（ステップＳ１０３）。制御部１０９は、次話者確率推定部１０８が算出した各参加者の次話者確率に基づいて、上述した第１〜第５の次話者選択方法のいずれかを用いて、予測次話者と予測次話者の発話開始タイミングを得る（ステップＳ１０４）。 The next speaker probability estimation unit 108 determines that each participant i (iε {A, B, C, D}) at the time t based on the audio signal, the video signal, the sensor signal, and the acquired speaker information. Next speaker probability P ^ns _i (t), which is the probability of becoming a speaker, is calculated (step S103). Based on the next-speaker probability of each participant calculated by the next-speaker probability estimating unit 108, the control unit 109 uses one of the first to fifth next-speaker selection methods described above to predict a predicted next talk. The utterance start timing of the speaker and the predicted next speaker is obtained (step S104).

制御部１０９は、予測次話者が参加者Ａ〜Ｄのいずれかであるかを判断する（ステップＳ１０５）。制御部１０９は、予測次話者が参加者Ａ〜Ｄのいずれかであると判断した場合（ステップＳ１０５のＮＯ）、音制御部１１０に、発話を行わないよう指示する発話制御信号を出力する。制御部１０９は、発話誘導タイミングが経過するまでの間に参加者Ａ〜Ｄのいずれかが発話したか否かを判断する（ステップＳ１０６）。この発話誘導タイミングは、発話開始タイミング以降のタイミングであり、発話開始タイミングの直後であってもよく、会話中に沈黙が継続した場合に不自然と感じる時間に基づいて決められたタイミングであってもよい。後者のタイミングの場合、例えば、発話終了時刻から所定時間（例えば、２〜３秒）経過後としてもよく、推定された発話開始タイミングから所定時間経過後としてもよい。また、発話誘導タイミングは、予測次話者の次話者確率が所定値以下となる時刻であってもよい。 The control unit 109 determines whether the predicted next speaker is one of the participants A to D (step S105). When the control unit 109 determines that the predicted next speaker is one of the participants A to D (NO in step S105), the control unit 109 outputs an utterance control signal that instructs the sound control unit 110 not to utter. . The control unit 109 determines whether any of the participants A to D has uttered before the utterance induction timing elapses (step S106). This utterance induction timing is a timing after the utterance start timing, may be immediately after the utterance start timing, and is a timing determined based on a time when it feels unnatural when silence continues during a conversation. Also good. In the latter timing, for example, a predetermined time (for example, 2 to 3 seconds) may elapse from the utterance end time, or a predetermined time may elapse from the estimated utterance start timing. Further, the utterance induction timing may be a time at which the next speaker probability of the predicted next speaker becomes a predetermined value or less.

制御部１０９は、発話区間検出部１０７が発話誘導タイミングまでに発話区間の開始を検出した場合、参加者Ａ〜Ｄのいずれかが発話したと判断し（ステップＳ１０６のＹＥＳ）、ステップＳ１０７の処理を実行する。 When the utterance section detection unit 107 detects the start of the utterance section by the utterance guidance timing, the control unit 109 determines that any of the participants A to D has uttered (YES in step S106), and performs the process in step S107. Execute.

一方、制御部１０９は、発話区間検出部１０７が発話誘導タイミングまでに発話区間の開始を検出しない場合（ステップＳ１０６のＮＯ）、発話誘導処理を行う（ステップＳ１０８）。発話誘導処理において、制御部１０９は、発話誘導対象者を、予測次話者、又は、予測次話者の次に次話者確率が高い話者とする。発話誘導対象者を、予測次話者にするか、予測次話者の次に次話者確率が高い話者とするかは予め決められてもよく、動的に決定してもよい。動的に決定する場合、例えば、予測次話者である参加者ｘ（ｘはＡ〜Ｄのいずれか）に対して過去に発話を促したときに参加者ｘが発話を行った確率Ｐｘや、予測次話者の次に次話者確率が高い参加者ｙ（ｙ≠ｘ、ｙはＡ〜Ｄのいずれか）に対して過去に発話を促したときに参加者ｙが発話を行った確率Ｐｙに基づいて決定することができる。具体的には、Ｐｘが所定の閾値以上である場合や、Ｐｘ＞Ｐｙの場合に参加者ｘを予測次話者とし、Ｐｘが所定の閾値よりも低い場合や、Ｐｘ＜Ｐｙの場合に参加者ｙを予測次話者とする。 On the other hand, when the utterance section detection unit 107 does not detect the start of an utterance section by the utterance guidance timing (NO in step S106), the control unit 109 performs the utterance guidance process (step S108). In the utterance guidance process, the control unit 109 sets the utterance guidance target person to either the predicted next speaker or the participant with the next highest next speaker probability after the predicted next speaker. Which of the two is used may be determined in advance or decided dynamically. When decided dynamically, the choice can be based on, for example, the probability Px that participant x, the predicted next speaker (x is any one of A to D), spoke when prompted in the past, and the probability Py that participant y, who has the next highest next speaker probability after the predicted next speaker (y ≠ x, y is any one of A to D), spoke when prompted in the past. Specifically, participant x is taken as the predicted next speaker when Px is equal to or greater than a predetermined threshold or when Px > Py, and participant y is taken as the predicted next speaker when Px is lower than the predetermined threshold or when Px < Py.
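The dynamic choice between participant x and participant y can be sketched as below. The text offers the threshold test and the Px versus Py comparison as alternative criteria, so the sketch applies the threshold when one is supplied and the comparison otherwise; treating the result as the utterance guidance target, and favouring x when Px equals Py, are assumptions of this illustration. The function name is hypothetical.

```python
from typing import Optional


def choose_guidance_target(
    x: str, y: str, p_x: float, p_y: float, threshold: Optional[float] = None
) -> str:
    """Pick the participant to prompt, given past response probabilities.

    p_x, p_y: fraction of past prompts after which x (or y) actually spoke.
    threshold: if given, use the threshold criterion instead of comparing.
    """
    if threshold is not None:
        return x if p_x >= threshold else y
    # Tie-breaking toward x is an assumption; the text leaves Px == Py open.
    return x if p_x >= p_y else y
```

For instance, a participant who has rarely responded to earlier prompts would be passed over in favour of the runner-up.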

制御部１０９は、発話誘導対象者を特定する情報を設定した発話誘導動作制御信号を音制御部１１０、口部制御部１１１、視線制御部１１２、頭部制御部１１３、及び、胴部制御部１１４のうち１以上に出力する。制御部１０９は、視線制御部１１２、頭部制御部１１３、又は、胴部制御部１１４に出力する発話誘導動作制御信号に、発話誘導対象者の位置の情報をさらに設定する。これにより、ロボット１００は、以下の（動作１）〜（動作５）いずれかまたは複数の動作を行い、発話誘導対象者への発話権の委譲を合図する。 The control unit 109 uses the sound control unit 110, the mouth control unit 111, the line-of-sight control unit 112, the head control unit 113, and the torso control unit to generate an utterance guide operation control signal in which information for specifying the utterance guide target is set. Output to one or more of 114. The control unit 109 further sets information on the position of the speech guidance target person in the speech guidance operation control signal output to the line-of-sight control unit 112, the head control unit 113, or the torso control unit 114. Thereby, the robot 100 performs any one or a plurality of operations (Operation 1) to (Operation 5) below, and signals the transfer of the utterance right to the utterance induction target person.

（動作１）音制御部１１０は、発話誘導対象者に対して発話を促す内容の発話の音声をスピーカ１１５から出力する。例えば、発話誘導対象者に対して質問や要求を行う内容の発話を出力する。具体的には、「ＸＸさんはどう思いますか？」（「ＸＸさん」は、発話誘導対象者の名前）といった発話を行う。同時に、口部制御部１１１は、口部駆動信号を口部駆動部１１６に出力し、音声をスピーカ１１５から出力している間、口部５２を開閉するよう制御する。 (Operation 1) The sound control unit 110 outputs, from the speaker 115, the audio of an utterance whose content prompts the utterance guidance target person to speak. For example, it outputs an utterance that asks the utterance guidance target person a question or makes a request. Specifically, it produces an utterance such as "What do you think, XX-san?" ("XX-san" being the name of the utterance guidance target person). At the same time, the mouth control unit 111 outputs a mouth drive signal to the mouth drive unit 116 and controls the mouth 52 to open and close while the audio is being output from the speaker 115.

（動作２）視線制御部１１２は、眼部駆動信号を眼部駆動部１１７に出力し、眼部２１における黒目の方向を、発話誘導対象者の方向となるように制御する。なお、視線を向けることは発話促進になることが知られている（参考文献２）。
参考文献２：石井亮、外２名、“アバタ音声チャットシステムにおける会話促進のための注視制御”、ヒューマンインタフェース学会論文誌、Ｖｏｌ．１０、Ｎｏ．１、ｐ．８７−９４、２００８年 (Operation 2) The line-of-sight control unit 112 outputs an eye part drive signal to the eye part drive unit 117, and controls the direction of the black eye in the eye part 21 to be the direction of the speech guidance target person. In addition, it is known that turning the line of sight will promote speech (Reference 2).
Reference 2: Ryo Ishii and two others, “Gaze control for conversation promotion in avatar voice chat system”, Journal of Human Interface Society, Vol. 10, no. 1, p. 87-94, 2008

（動作３）頭部制御部１１３は、頭部駆動信号を頭部駆動部１１８に出力し、頸部５４を動かして頭部５３を発話誘導対象者の方向に向けるように制御する。これにより、頭部５３と視線を予測次話者の方向となるように制御する。 (Operation 3) The head control unit 113 outputs a head driving signal to the head driving unit 118 and moves the neck 54 to control the head 53 toward the utterance guidance target person. Thus, the head 53 and the line of sight are controlled to be in the direction of the predicted next speaker.

（動作４）胴部制御部１１４は、胴部駆動信号を胴部駆動部１１９に出力し、胴部５５を発話誘導対象者の方向に回転させるように制御する。これにより、胴部、頭部、及び、視線を発話誘導対象者の方向となるように制御する。 (Operation 4) The torso control unit 114 outputs a torso drive signal to the torso drive unit 119 and controls the torso 55 to rotate toward the utterance guidance target person. As a result, the torso, the head, and the line of sight are all directed toward the utterance guidance target person.

（動作５）胴部制御部１１４は、胴部駆動信号を胴部駆動部１１９に出力し、右腕５５ａと左腕５５ｂの一方又は両方を発話誘導対象者の方向に差し出すように制御する。 (Operation 5) The torso control unit 114 outputs a torso drive signal to the torso drive unit 119, and controls so that one or both of the right arm 55a and the left arm 55b are directed toward the utterance guidance target person.
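The way the control unit 109 fills in the utterance guidance operation control signal before these operations run can be sketched as a small data structure. The text states only that every signal identifies the target person and that signals sent to the line-of-sight, head, or torso control units also carry the target's position; the class, the controller labels, and the two-dimensional position are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical labels for the control units that need the target's position.
NEEDS_POSITION = {"gaze", "head", "torso"}


@dataclass(frozen=True)
class GuidanceSignal:
    target: str                                     # who should be prompted
    position: Optional[Tuple[float, float]] = None  # where that person is


def build_guidance_signals(
    target: str, position: Tuple[float, float], controllers: Iterable[str]
) -> Dict[str, GuidanceSignal]:
    """Create one signal per selected controller, adding position if needed."""
    return {
        name: GuidanceSignal(target, position if name in NEEDS_POSITION else None)
        for name in controllers
    }
```

Selecting, say, the sound and line-of-sight controllers would correspond to combining (Operation 1) and (Operation 2).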

制御部１０９は、ステップＳ１０８において発話誘導処理を行った後、次の発話誘導タイミングが経過するまでの間に参加者Ａ〜Ｄのいずれかが発話したか否かを判断する（ステップＳ１０９）。制御部１０９は、次の発話誘導タイミングが経過するまでの間に、発話区間検出部１０７が発話区間の開始を検出しない場合（ステップＳ１０９のＮＯ）、再び、発話誘導処理を行う（ステップＳ１０８）。 The control unit 109 determines whether any of the participants A to D has uttered before the next utterance guidance timing has elapsed after performing the utterance guidance process in step S108 (step S109). When the utterance section detection unit 107 does not detect the start of the utterance section until the next utterance guidance timing elapses (NO in step S109), the control unit 109 performs the utterance guidance process again (step S108). .

制御部１０９は、ステップＳ１０９でＮＯと判断した後に発話誘導処理を行う場合、発話誘導対象者を、直前の発話誘導処理における発話誘導対象者としてもよく、直前の発話誘導処理において発話誘導対象者とした参加者の次に次話者確率が高い参加者としてもよい。例えば、制御部１０９は、同じ参加者がｍ回（ｍは１以上の整数）以上連続して発話誘導対象者となった場合に、その参加者の次に次話者確率が高い話者としてもよい。また、制御部１０９は、発話誘導対象者を、次話者確率が最大値となる時刻が直前の発話誘導処理における発話誘導対象者の次の参加者としてもよい。また、あるいは、制御部１０９は、予測次話者がまだ発話誘導対象者となっていない場合、発話誘導対象者を予測次話者としてもよい。 When performing the utterance guidance process after determining NO in step S109, the control unit 109 may keep the utterance guidance target person of the immediately preceding utterance guidance process, or may instead select the participant whose next speaker probability is the next highest after that person's. For example, when the same participant has been the utterance guidance target person m or more times in succession (m is an integer of 1 or more), the control unit 109 may select the participant with the next highest next speaker probability after that participant. The control unit 109 may also select, as the utterance guidance target person, the participant whose next speaker probability reaches its maximum at the next time after that of the utterance guidance target person in the immediately preceding utterance guidance process. Alternatively, when the predicted next speaker has not yet been the utterance guidance target person, the control unit 109 may select the predicted next speaker.
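One of the retry policies above, switching targets after m consecutive unsuccessful prompts, might look like the following sketch. It assumes the participants are supplied already ranked by descending next speaker probability and that the history lists past targets from oldest to newest; wrapping around to the top of the ranking after the last participant is an added assumption, as is the function name.

```python
from typing import List


def next_guidance_target(ranking: List[str], history: List[str], m: int) -> str:
    """Choose the next participant to prompt.

    ranking: participants sorted by descending next speaker probability.
    history: past guidance targets, oldest first.
    m: number of consecutive prompts allowed for one participant.
    """
    if not history:
        return ranking[0]
    last = history[-1]
    # Count how many times in a row the most recent target was prompted.
    streak = 0
    for who in reversed(history):
        if who != last:
            break
        streak += 1
    if streak < m:
        return last
    # Move to the next most likely speaker (wrap-around is an assumption).
    return ranking[(ranking.index(last) + 1) % len(ranking)]
```

With m set to 1 the robot moves on after every unanswered prompt; larger values give the same participant repeated chances.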

具体的には、第１又は第４の次話者選択方法において、参加者ｘの次話者確率Ｐ^ｎｓ _ｉ（ｔ）が最も高く、発話開始タイミングが時刻ｔ１であったとき、時刻ｔ１に参加者ｘが発話を開始しない条件下で、次話者確率Ｐ^ｎｓ _ｘ（ｔ）がある任意の確率ｏを下回る時刻をｔ２（Ｐ^ｎｓ _ｘ（ｔ２）＜ｏ）とする。時刻ｔ２において次話者確率Ｐ^ｎｓ _ｘ（ｔ２）を上回る他の参加者ｙがいるとき（Ｐ^ｎｓ _ｘ（ｔ２）＜Ｐ^ｎｓ _ｙ（ｔ２））、ロボット１００は参加者ｙに時刻ｔ２で発話を促す（ｔ２≧ｔ１）。 Specifically, in the first or fourth next speaker selection method, suppose that the next speaker probability P ^ns _i (t) of participant x is the highest and the utterance start timing is time t1. If participant x does not start speaking at time t1, let t2 be the time at which the next speaker probability P ^ns _x (t) falls below an arbitrary probability o (P ^ns _x (t2) < o). When there is another participant y whose next speaker probability exceeds P ^ns _x (t2) at time t2 (P ^ns _x (t2) < P ^ns _y (t2)), the robot 100 prompts participant y to speak at time t2 (t2 ≧ t1).

また、第３又は第５の次話者選択方法において、参加者ｘの積分値Ｐ^ｎｓ _ｉが最も高く、次話者確率Ｐ^ｎｓ _ｘ（ｔ）が最大となる時刻ｔ１（発話開始タイミング）に参加者ｘが発話を開始しない条件下で、次話者確率Ｐ^ｎｓ _ｘ（ｔ）がある任意の確率ｏを下回る時刻をｔ２（Ｐ^ｎｓ _ｘ（ｔ２）＜ｏ）とする。時刻ｔ２において次話者確率Ｐ^ｎｓ _ｘ（ｔ２）を上回る他の参加者ｙがいるとき（Ｐ^ｎｓ _ｘ（ｔ２）＜Ｐ^ｎｓ _ｙ（ｔ２））、ロボット１００は参加者ｙに時刻ｔ２で発話を促す（ｔ２≧ｔ１）。 Further, in the third or fifth next speaker selection method, suppose that the integral value P ^ns _i of the participant x is the highest and that the participant x does not start speaking at time t1 (the utterance start timing), at which the next speaker probability P ^ns _x (t) is maximized. In that case, let t2 be the time at which the next speaker probability P ^ns _x (t) falls below an arbitrary probability o (P ^ns _x (t2) < o). When there is another participant y whose next speaker probability exceeds P ^ns _x (t2) at time t2 (P ^ns _x (t2) < P ^ns _y (t2)), the robot 100 prompts the participant y to speak at time t2 (t2 ≧ t1).

なお、第２の次話者選択方法において、参加者ｘの次話者確率Ｐ^ｎｓ _ｘ（ｔ）が最大となる時刻ｔ１の次に、次話者確率が最大値をとる他の参加者ｙがいるとき、ロボット１００は参加者ｙに時刻ｔ２で発話を促す（ｔ２≧ｔ１）。 In the second next speaker selection method, when there is another participant y whose next speaker probability reaches its maximum next after time t1, at which the next speaker probability P ^ns _x (t) of the participant x is maximized, the robot 100 prompts the participant y to speak at time t2 (t2 ≧ t1).
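
The fallback rule for the first and fourth selection methods can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `find_fallback_target`, the representation of each next speaker probability as a Python callable, and the discrete list of sample times are all assumptions. The text only requires that some participant y exceed x's probability at t2; choosing the highest such participant is also an assumption.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical representation: next speaker probability as a function of time.
Prob = Callable[[float], float]


def find_fallback_target(
    probs: Dict[str, Prob],
    x: str,
    t1: float,
    threshold_o: float,
    times: Sequence[float],
) -> Optional[Tuple[str, float]]:
    """Return (participant y, time t2) to prompt, or None if no fallback applies.

    t2 is the first sampled time >= t1 at which x's probability falls below
    threshold_o. y is the other participant with the highest probability at
    t2, provided that probability exceeds x's probability at t2.
    """
    for t in times:
        if t < t1:
            continue
        p_x = probs[x](t)
        if p_x < threshold_o:
            others = {name: f(t) for name, f in probs.items() if name != x}
            if not others:
                return None
            y = max(others, key=others.get)
            return (y, t) if others[y] > p_x else None
    return None
```

The sketch assumes participant x stays silent throughout; in the flow described above, a detected utterance would end the search before t2 is reached.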

制御部１０９は、次の発話誘導タイミングが経過するまでの間に、発話区間検出部１０７が発話区間の開始を検出した場合（ステップＳ１０９のＹＥＳ）、参加者Ａ〜Ｄのいずれかが発話したと判断し、ステップＳ１０７の処理を実行する。 When the utterance section detection unit 107 detects the start of an utterance section before the next utterance guidance timing elapses (YES in step S109), the control unit 109 determines that one of the participants A to D has spoken and executes the process of step S107.

ステップＳ１０５において、制御部１０９は、予測次話者がロボット１００であると判断した場合（ステップＳ１０５：ＹＥＳ）、ロボット１００に発話を行わせるよう制御する発話制御信号を出力する。音制御部１１０は、制御部１０９からの発話制御信号に基づいて発話を行わせると判断し、ロボット１００に発話させるための会話情報を生成し、生成した会話情報に基づいた音信号をスピーカ１１５へ出力する（ステップＳ１１０）。これにより、ロボット１００は、音信号に応じた発話をスピーカ１１５から発音する。 In step S105, when the control unit 109 determines that the predicted next speaker is the robot 100 (step S105: YES), the control unit 109 outputs an utterance control signal for controlling the robot 100 to utter. The sound control unit 110 determines that speech is to be performed based on the speech control signal from the control unit 109, generates conversation information for causing the robot 100 to speak, and outputs a sound signal based on the generated conversation information to the speaker 115. (Step S110). As a result, the robot 100 generates an utterance corresponding to the sound signal from the speaker 115.

音制御部１１０は、制御部１０９からの発話制御信号に基づいて、ロボット１００の発話を終了するか否かを判断する（ステップＳ１１１）。ここで、ロボット１００の発話を終了しない場合（ステップＳ１１１のＮＯ）には、音制御部１１０は、ステップＳ１１０の処理に戻る。ロボット１００の発話を終了する場合（ステップＳ１１１のＹＥＳ）には、音制御部１１０は、会話情報の生成を停止することに応じて音信号の出力を停止する。 The sound control unit 110 determines whether or not to end the utterance of the robot 100 based on the utterance control signal from the control unit 109 (step S111). If the utterance of the robot 100 is not terminated (NO in step S111), the sound control unit 110 returns to the process in step S110. When the utterance of the robot 100 is ended (YES in step S111), the sound control unit 110 stops outputting the sound signal in response to stopping the generation of the conversation information.

ステップＳ１０６、ステップＳ１０９、又はステップＳ１１１においてＹＥＳと判断された後、ロボット１００は、複数の参加者と会話を行う会話動作を終了するか否かを判断する（ステップＳ１０７）。ここで、会話動作を終了しないと判断した場合（ステップＳ１０７のＮＯ）には、ステップＳ１０１の処理に戻る。会話動作を終了すると判断した場合（ステップＳ１０７のＹＥＳ）には、ロボット１００は、会話動作を終了する。例えば、参加者が電源スイッチ（図示せず）を入れたタイミングや会話モードのスイッチ（図示せず）をオンにしたタイミングで、ロボット１００は、会話動作を開始し、参加者が電源スイッチを切ったタイミングや会話モードのスイッチをオフにしたタイミングで、ロボット１００は、会話動作を終了する。 After YES is determined in step S106, step S109, or step S111, the robot 100 determines whether or not to end the conversation operation of conversing with a plurality of participants (step S107). If it determines that the conversation operation is not to be ended (NO in step S107), the process returns to step S101. If it determines that the conversation operation is to be ended (YES in step S107), the robot 100 ends the conversation operation. For example, the robot 100 starts the conversation operation when a participant turns on a power switch (not shown) or a conversation mode switch (not shown), and ends the conversation operation when a participant turns off the power switch or the conversation mode switch.

以上に説明したとおり、第１の実施形態におけるロボット１００は、複数の参加者と会話する際に、各参加者の次話者確率に基づいて次話者を推定し、推定された次話者が発話のタイミングを逸した場合、次話者に発話を促す。これにより、発話のタイミングを逸した参加者が発話しやすいように誘導することができる。また、推定された次話者が発話のタイミングを逸した場合、他の話者に発話を促すことも可能である。例えば、参加者は意図的に発話を控えていることもある。そこで、他の参加者に発話を促すことにより、会話中に沈黙が発生して、参加者が気まずさを感じたりすることが少なくなる。 As described above, when conversing with a plurality of participants, the robot 100 according to the first embodiment estimates the next speaker based on the next speaker probability of each participant and, if the estimated next speaker misses the utterance timing, prompts that next speaker to speak. This makes it easier for a participant who has missed the utterance timing to speak. When the estimated next speaker misses the utterance timing, the robot can also prompt another speaker to speak. For example, a participant may be intentionally refraining from speaking. Prompting another participant to speak in such a case makes it less likely that silence falls during the conversation and that the participants feel awkward.

なお、上記のステップＳ１０９において、次の発話誘導タイミングが経過するまでの間に、発話区間検出部１０７が発話区間の開始を検出しない場合、ロボット１００は、ステップＳ１０３からの処理を行い、各参加者Ａ〜Ｄの次話者確率を算出しなおしてもよい。 In step S109 above, if the utterance section detection unit 107 does not detect the start of an utterance section before the next utterance guidance timing elapses, the robot 100 may perform the processing from step S103 and recalculate the next speaker probabilities of the participants A to D.

また、上記のステップＳ１０６において、制御部１０９は、いずれかの参加者の発話を検出したと判断した場合（ステップＳ１０６のＹＥＳ）、さらに、発話者が予測次話者であるか否かを判断するようにしてもよい。制御部１０９は、発話者が予測次話者であると判断した場合、ステップＳ１０７の処理を実行する。一方、制御部１０９は、発話者が予測次話者ではないと判断した場合、予測次話者である参加者ｘが発話行う予定だったにもかかわらず、他の参加者ｙが割り込んで発話を行ったとみなし、参加者ｘに発話を促すようロボット１００を制御する。促すタイミングは任意とすることができる。例えば、参加者ｙの発話の切れ目を検出し、この切れ目を検出した直後、又は、切れ目から所定時間後に、予測次話者を発話誘導対象者として発話誘導処理を行う。切れ目とは、例えば、「〜です。」といった語尾が発話された際や、無音区間がある任意の時間Ｄｓを超えた時とすることができる。また、制御部１０９は、参加者ｙの発話を検出した直後、あるいは、参加者ｙの発話開始時刻から所定時間後に、参加者ｙの発話を制止する内容の音声を出力するよう指示する制御信号を音制御部１１０に出力してもよい。これにより、音制御部１１０は、「ＹＹさん、ちょっと待ってください」といった内容の発話の音声をスピーカ１１５から出力する。その後、ロボット１００は、予測次話者を発話誘導対象者として、ステップＳ１０８からの処理を実行してもよい。このように、参加者ｙの発話を制止する内容の音声によって、予測次話者の発話を促してもよい。 In step S106 above, when the control unit 109 determines that an utterance by any participant has been detected (YES in step S106), it may further determine whether or not the speaker is the predicted next speaker. When the control unit 109 determines that the speaker is the predicted next speaker, it executes the process of step S107. On the other hand, when the control unit 109 determines that the speaker is not the predicted next speaker, it regards another participant y as having interrupted and spoken even though the participant x, the predicted next speaker, was about to speak, and controls the robot 100 to prompt the participant x to speak. The prompt may be given at any timing. For example, the control unit 109 detects a break in the utterance of the participant y and, immediately after detecting the break or a predetermined time after the break, performs the utterance guidance process with the predicted next speaker as the utterance guidance target person. A break can be taken to be, for example, the moment a sentence ending such as “... desu.” is uttered, or the moment a silent section exceeds an arbitrary time Ds.
Further, immediately after detecting the utterance of the participant y, or a predetermined time after the utterance start time of the participant y, the control unit 109 may output to the sound control unit 110 a control signal instructing it to output speech that restrains the utterance of the participant y. In response, the sound control unit 110 outputs from the speaker 115 speech such as “YY-san, please wait a moment.” Thereafter, the robot 100 may execute the processing from step S108 with the predicted next speaker as the utterance guidance target person. In this way, speech that restrains the utterance of the participant y may be used to prompt the predicted next speaker to speak.

また、上記のステップＳ１０９において、制御部１０９は、いずれかの参加者の発話を検出したと判断した場合（ステップＳ１０９のＹＥＳ）、発話者が発話誘導対象者であるか否かを判断するようにしてもよい。制御部１０９は、発話者が発話誘導対象者であると判断した場合、ステップＳ１０７の処理を実行する。一方、制御部１０９は、発話者が発話誘導対象者ではないと判断した場合、発話誘導対象者である参加者ｘが発話行う予定だったにもかかわらず、他の参加者ｙが割り込んで発話を行ったとみなし、参加者ｘに発話を促すようロボット１００を制御する。例えば、上記と同様に、制御部１０９は、参加者ｙの発話の切れ目を検出した直後、又は、切れ目から所定時間後に、同じ発話誘導対象者について発話誘導処理を行う。あるいは、制御部１０９は、参加者ｙの発話を検出した直後、あるいは、参加者ｙの発話開始時刻から所定時間後に、参加者ｙの発話を制止する内容の音声を出力するよう指示する制御信号を音制御部１１０に出力する。 In step S109 above, when the control unit 109 determines that an utterance by any participant has been detected (YES in step S109), it may determine whether or not the speaker is the utterance guidance target person. When the control unit 109 determines that the speaker is the utterance guidance target person, it executes the process of step S107. On the other hand, when the control unit 109 determines that the speaker is not the utterance guidance target person, it regards another participant y as having interrupted and spoken even though the participant x, the utterance guidance target person, was about to speak, and controls the robot 100 to prompt the participant x to speak. For example, as described above, the control unit 109 performs the utterance guidance process for the same utterance guidance target person immediately after detecting a break in the utterance of the participant y or a predetermined time after the break. Alternatively, immediately after detecting the utterance of the participant y, or a predetermined time after the utterance start time of the participant y, the control unit 109 outputs to the sound control unit 110 a control signal instructing it to output speech that restrains the utterance of the participant y.

なお、本実施形態では、ロボット１００が会話に参加する場合を例に記載したが、ロボット１００は、会話に参加せず、参加者の発話を促す動作のみを行ってもよい。 In this embodiment, the case where the robot 100 participates in the conversation has been described as an example. However, the robot 100 may perform only the operation of prompting the participant to speak without participating in the conversation.

（第２の実施形態）
第２の実施形態では、ロボット自身の動き（呼吸動作、視線動作、頭部動作）からロボット自身の次話者確率Ｐ^ｎｓ _Ｒ（ｔ）を求める。ロボットは、求めた次話者確率Ｐ^ｎｓ _Ｒ（ｔ）と他の参加者の次話者確率とに基づいて、予測次話者及び発話開始タイミングを推定する。そのため、ロボットは、会話に参加し、会話中に、会話中の人間同様の動きを行う。つまり、ロボットは、会話中に、呼吸音を発したり胸の膨らみを変化させたりする呼吸動作、視線を話者に向ける等の視線動作、会話に応じて頷いたりする頭部動作を行う。以下では、第１の実施形態との差分を中心に説明する。 (Second Embodiment)
In the second embodiment, the next speaker probability P ^ns _R (t) of the robot itself is obtained from the robot's own movements (breathing motion, gaze motion, head motion). The robot estimates the predicted next speaker and the utterance start timing based on the obtained next speaker probability P ^ns _R (t) and the next speaker probabilities of the other participants. To that end, the robot takes part in the conversation and, during the conversation, moves in the same way as a person engaged in conversation. That is, during the conversation the robot performs breathing motions such as producing a breathing sound or changing the swelling of its chest, gaze motions such as directing its line of sight toward the speaker, and head motions such as nodding in response to the conversation. The following description focuses on the differences from the first embodiment.

図７は、第２の実施形態におけるロボット１００Ａが備える機能構成の概略を示す図である。図７に示す第２の実施形態におけるロボット１００Ａは、第１の実施形態におけるロボット１００と同じ構成要素を含む。よって、ロボット１００Ａの説明においては、第１の実施形態におけるロボット１００と同じ構成要素については、同じ符号を付与して説明を省略する。 FIG. 7 is a diagram illustrating an outline of a functional configuration provided in the robot 100A according to the second embodiment. A robot 100A in the second embodiment shown in FIG. 7 includes the same components as the robot 100 in the first embodiment. Therefore, in the description of the robot 100A, the same components as those of the robot 100 according to the first embodiment are denoted by the same reference numerals and description thereof is omitted.

図７に示すように、ロボット１００Ａは、マイク１０１と、カメラ１０２と、センサ１０３と、音声入力部１０４と、映像入力部１０５と、センサ入力部１０６と、発話区間検出部１０７と、次話者確率推定部１０８Ａと、制御部１０９Ａと、音制御部１１０と、口部制御部１１１と、視線制御部１１２と、頭部制御部１１３と、胴部制御部１１４と、スピーカ１１５と、口部駆動部１１６と、眼部駆動部１１７と、頭部駆動部１１８と、胴部駆動部１１９と、センサ信号変換部１２０とを備える。 As shown in FIG. 7, the robot 100A includes a microphone 101, a camera 102, a sensor 103, an audio input unit 104, a video input unit 105, a sensor input unit 106, an utterance section detection unit 107, a next speaker probability estimation unit 108A, a control unit 109A, a sound control unit 110, a mouth control unit 111, a line-of-sight control unit 112, a head control unit 113, a torso control unit 114, a speaker 115, a mouth drive unit 116, an eye drive unit 117, a head drive unit 118, a torso drive unit 119, and a sensor signal conversion unit 120.

次話者確率推定部１０８Ａは、音声入力部１０４からの音声信号と、映像入力部１０５からの映像信号と、センサ入力部１０６からのセンサ信号と、発話区間検出部１０７からの発話区間情報と、制御部１０９Ａからの疑似センサ信号とを入力とし、各参加者及びロボット１００Ａのそれぞれが時刻ｔに次話者となる確率である次話者確率を出力する。疑似センサ信号は、制御部１０９Ａが生成する動作制御信号に基づいてロボット１００を動作させ、かつ、そのロボット１００Ａの動作をセンサ１０３で検出したと仮定した場合に、センサ１０３が出力するセンサ信号である。 The next speaker probability estimation unit 108A receives as input the audio signal from the audio input unit 104, the video signal from the video input unit 105, the sensor signal from the sensor input unit 106, the utterance section information from the utterance section detection unit 107, and the pseudo sensor signal from the control unit 109A, and outputs the next speaker probability, which is the probability that each participant and the robot 100A will become the next speaker at time t. The pseudo sensor signal is the sensor signal that the sensor 103 would output if the robot 100 were operated based on the operation control signal generated by the control unit 109A and that operation of the robot 100A were detected by the sensor 103.

次話者確率推定部１０８Ａは、音声信号、映像信号、センサ信号及び発話区間情報に基づいて、発話区間情報で特定される発話区間の発話者を示す発話者情報を取得する。次話者確率推定部１０８Ａは、音声信号、映像信号、センサ信号、疑似センサ信号及び取得した発話者情報に基づいて、ロボット１００Ａが時刻ｔに次話者となる確率であるＰ^ｎｓ _R（ｔ）及び各参加者ｉが時刻ｔに次話者となる確率である次話者確率Ｐ^ｎｓ _ｉ（ｔ）を算出して、制御部１０９Ａへ出力する。次話者確率推定部１０８Ａは、次話者確率Ｐ^ｎｓ _R（ｔ）及びＰ^ｎｓ _ｉ（ｔ）の他に、発話者情報及び参加者の位置情報を制御部１０９Ａへ出力する。 The next speaker probability estimation unit 108A acquires speaker information indicating the speaker in the utterance section specified by the utterance section information, based on the audio signal, the video signal, the sensor signal, and the utterance section information. Based on the audio signal, the video signal, the sensor signal, the pseudo sensor signal, and the acquired speaker information, the next speaker probability estimation unit 108A calculates P ^ns _R (t), the probability that the robot 100A will become the next speaker at time t, and the next speaker probability P ^ns _i (t), the probability that each participant i will become the next speaker at time t, and outputs them to the control unit 109A. In addition to the next speaker probabilities P ^ns _R (t) and P ^ns _i (t), the next speaker probability estimation unit 108A outputs the speaker information and the participant position information to the control unit 109A.

次話者確率推定部１０８Ａは、参加者の位置情報を、例えば、センサ１０３の参加者の位置を計測したセンサ信号に基づいて取得してもよいし、映像信号に基づいて取得してもよいし、センサ１０３の参加者の位置を計測したセンサ信号及び映像信号に基づいて取得してもよい。 The next speaker probability estimation unit 108A may acquire the position information of the participant based on, for example, a sensor signal obtained by measuring the position of the participant of the sensor 103 or based on a video signal. Alternatively, the position of the participant of the sensor 103 may be acquired based on the sensor signal and the video signal.

制御部１０９Ａは、次話者確率推定部１０８Ａからの次話者確率Ｐ^ｎｓ _ｉ（ｔ）、発話者情報及び参加者の位置情報を入力とし、発話制御信号又は発話誘導動作制御信号を出力する。制御部１０９Ａは、各参加者及びロボット１００Ａの次話者確率Ｐ^ｎｓ _ｉ（ｔ）に基づいて予測次話者と発話開始タイミングを推定する。制御部１０９Ａは、具体的には、以下に示す第６〜第１０の次話者選択方法のいずれかを用いて次話者を選択する。なお、以下の説明においては、参加者Ａ、Ｂ、Ｃ、Ｄの４名とロボット１００Ａとが会話を行う場合について説明する。制御部１０９Ａは、次話者確率推定部１０８Ａから次話者確率Ｐ^ｎｓ _ｉ（ｔ），（ｉ∈｛Ａ，Ｂ，Ｃ，Ｄ，Ｒ｝）を取得する。 The control unit 109A receives as input the next speaker probability P ^ns _i (t), the speaker information, and the participant position information from the next speaker probability estimation unit 108A, and outputs an utterance control signal or an utterance guidance operation control signal. The control unit 109A estimates the predicted next speaker and the utterance start timing based on the next speaker probabilities P ^ns _i (t) of each participant and the robot 100A. Specifically, the control unit 109A selects the next speaker using any of the sixth to tenth next speaker selection methods described below. The following description covers the case where four participants A, B, C, and D converse with the robot 100A. The control unit 109A acquires the next speaker probabilities P ^ns _i (t), (i ∈ {A, B, C, D, R}) from the next speaker probability estimation unit 108A.

（第６の次話者選択方法）
制御部１０９Ａは、参加者Ａ〜Ｄ及びロボット１００Ａの次話者確率Ｐ^ｎｓ _ｉ（ｔ），（ｉ∈｛Ａ，Ｂ，Ｃ，Ｄ，Ｒ｝）を比較する。制御部１０９Ａは、Ｐ^ｎｓ _Ｒ（ｔ）が最大であると判断した場合は、ロボット１００Ａを予測次話者とする。制御部１０９Ａは、Ｐ^ｎｓ _Ｒ（ｔ）が最大ではないと判断した場合は、次話者確率Ｐ^ｎｓ _ｉ（ｔ）の最大値が最も高い参加者Ａ〜Ｄのいずれかを予測次話者と判断する。制御部１０９Ａは、予測次話者の次話者確率Ｐ^ｎｓ _ｉ（ｔ）が最大値を取るときの時刻ｔを予測次話者の発話開始タイミングとする。 (Sixth speaker selection method)
The control unit 109A compares the next speaker probabilities P ^ns _i (t), (i ∈ {A, B, C, D, R}) of the participants A to D and the robot 100A. When the control unit 109A determines that P ^ns _R (t) is the maximum, it sets the robot 100A as the predicted next speaker. When it determines that P ^ns _R (t) is not the maximum, it determines that whichever of the participants A to D has the highest maximum value of the next speaker probability P ^ns _i (t) is the predicted next speaker. The control unit 109A sets the time t at which the next speaker probability P ^ns _i (t) of the predicted next speaker takes its maximum value as the utterance start timing of the predicted next speaker.

（第７の次話者選択方法）
制御部１０９Ａは、次話者確率Ｐ^ｎｓ _ｉ（ｔ），（ｉ∈｛Ａ，Ｂ，Ｃ，Ｄ，Ｒ｝）が最も早い時刻に最大値をとる参加者又はロボット１００Ａのいずれかを予測次話者と判断する。制御部１０９Ａは、予測次話者の次話者確率Ｐ^ｎｓ _ｉ（ｔ）が最大値を取るときの時刻ｔを予測次話者の発話開始タイミングとする。 (Seventh speaker selection method)
The control unit 109A determines that whichever of the participants and the robot 100A has a next speaker probability P ^ns _i (t), (i ∈ {A, B, C, D, R}) that takes its maximum value at the earliest time is the predicted next speaker. The control unit 109A sets the time t at which the next speaker probability P ^ns _i (t) of the predicted next speaker takes its maximum value as the utterance start timing of the predicted next speaker.
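
The sixth and seventh selection methods above can be sketched as follows. This is an illustrative sketch only: it assumes each next speaker probability is available as a list of samples on a shared time grid, and it reads "P_R(t) is the maximum" in the sixth method as a comparison of peak values, which the text does not state explicitly. The names `peak`, `select_highest_peak`, and `select_earliest_peak` are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def peak(times: Sequence[float], series: Sequence[float]) -> Tuple[float, float]:
    """Return (peak value, time of the first occurrence of that peak)."""
    best = max(range(len(series)), key=lambda k: series[k])
    return series[best], times[best]


def select_highest_peak(
    times: Sequence[float], probs: Dict[str, Sequence[float]]
) -> Tuple[str, float]:
    """Sixth-method sketch: the candidate whose probability peak is highest."""
    peaks = {name: peak(times, s) for name, s in probs.items()}
    speaker = max(peaks, key=lambda name: peaks[name][0])
    return speaker, peaks[speaker][1]


def select_earliest_peak(
    times: Sequence[float], probs: Dict[str, Sequence[float]]
) -> Tuple[str, float]:
    """Seventh-method sketch: the candidate whose peak occurs earliest."""
    peaks = {name: peak(times, s) for name, s in probs.items()}
    speaker = min(peaks, key=lambda name: peaks[name][1])
    return speaker, peaks[speaker][1]
```

Both functions return the predicted next speaker together with the time of that speaker's peak, which the text uses as the utterance start timing. The key "R" can stand for the robot 100A alongside the participants.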

（第８の次話者選択方法）
制御部１０９Ａは、参加者Ａ〜Ｄ及びロボット１００Ａの次話者確率Ｐ^ｎｓ _ｉ（ｔ），（ｉ∈｛Ａ，Ｂ，Ｃ，Ｄ，Ｒ｝）それぞれを、時刻ｔについて所定時間（例えば、発話終了から３〜４秒以上の時間）積分して、積分値Ｐ^ｎｓ _ｉを取得する。なお、積分区間を発話終了から無限時間としてもよく、全参加者の次話者確率Ｐ^ｎｓ _ｉ（ｔ）が所定値未満となり有意な値ではなくなる時間までとしてもよい。制御部１０９Ａは、この積分値Ｐ^ｎｓ _ｉが最も大きい参加者Ａ〜Ｄ又はロボット１００Ａのいずれかを予測次話者と判断する。制御部１０９Ａは、予測次話者の次話者確率Ｐ^ｎｓ _ｉ（ｔ）が最大値を取るときの時刻ｔを予測次話者の発話開始タイミングとする。 (Eighth next speaker selection method)
The control unit 109A integrates each of the next speaker probabilities P ^ns _i (t), (i ∈ {A, B, C, D, R}) of the participants A to D and the robot 100A with respect to time t over a predetermined time (for example, 3 to 4 seconds or more from the end of the utterance) to obtain the integral values P ^ns _i. The integration interval may extend from the end of the utterance to infinity, or up to the time at which the next speaker probabilities P ^ns _i (t) of all the participants fall below a predetermined value and are no longer significant. The control unit 109A determines that whichever of the participants A to D and the robot 100A has the largest integral value P ^ns _i is the predicted next speaker. The control unit 109A sets the time t at which the next speaker probability P ^ns _i (t) of the predicted next speaker takes its maximum value as the utterance start timing of the predicted next speaker.
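
The eighth selection method can be sketched as below, assuming the probabilities are sampled on a shared time grid that already covers the chosen integration interval. The trapezoidal rule is an assumed numerical scheme; the text does not say how the integral is computed. The names `integrate` and `select_by_integral` are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def integrate(times: Sequence[float], series: Sequence[float]) -> float:
    """Trapezoidal approximation of the integral of series over times."""
    return sum(
        0.5 * (series[k] + series[k + 1]) * (times[k + 1] - times[k])
        for k in range(len(times) - 1)
    )


def select_by_integral(
    times: Sequence[float], probs: Dict[str, Sequence[float]]
) -> Tuple[str, float]:
    """Pick the candidate with the largest integral; start timing is its peak time."""
    totals = {name: integrate(times, s) for name, s in probs.items()}
    speaker = max(totals, key=totals.get)
    series = probs[speaker]
    peak_index = max(range(len(series)), key=lambda k: series[k])
    return speaker, times[peak_index]
```

Note that the candidate with the largest integral need not be the one with the highest peak: a broad, moderate curve can outweigh a short spike.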

（第９の次話者選択方法）
制御部１０９Ａは、参加者Ａ〜Ｄの次話者確率Ｐ^ｎｓ _ｉ（ｔ），（ｉ∈｛Ａ，Ｂ，Ｃ，Ｄ｝）を加算した加算値（Ｐ^ｎｓ _Ａ（ｔ）＋Ｐ^ｎｓ _Ｂ（ｔ）＋Ｐ^ｎｓ _Ｃ（ｔ）＋Ｐ^ｎｓ _Ｄ（ｔ））を取得する。制御部１０９Ａは、この加算値と、ロボット１００Ａの次話者確率Ｐ^ｎｓ _Ｒ（ｔ）に定数ιを乗算したＰ^ｎｓ _Ｒ（ｔ）・ιと比較する（ιは正の値となる任意の定数）。制御部１０９Ａは、加算値（Ｐ^ｎｓ _Ａ（ｔ）＋Ｐ^ｎｓ _Ｂ（ｔ）＋Ｐ^ｎｓ _Ｃ（ｔ）＋Ｐ^ｎｓ _Ｄ（ｔ））＜Ｐ^ｎｓ _Ｒ（ｔ）・ιと判断した場合は、ロボット１００Ａを予測次話者とする。制御部１０９Ａは、加算値（Ｐ^ｎｓ _Ａ（ｔ）＋Ｐ^ｎｓ _Ｂ（ｔ）＋Ｐ^ｎｓ _Ｃ（ｔ）＋Ｐ^ｎｓ _Ｄ（ｔ））≧Ｐ^ｎｓ _Ｒ（ｔ）・ιと判断した場合は、第１の実施形態の第１〜第３のいずれかの次話者選択方法によって、予測次話者と発話開始タイミングを得る。ただし、第１〜第３の次話者選択方法において、第１〜第３の閾値との比較は行わなくてもよい。このときの予測次話者は、参加者Ａ〜Ｄのいずれかである。 (9th next speaker selection method)
The control unit 109A obtains the sum (P ^ns _A (t) + P ^ns _B (t) + P ^ns _C (t) + P ^ns _D (t)) of the next speaker probabilities P ^ns _i (t), (i ∈ {A, B, C, D}) of the participants A to D. The control unit 109A compares this sum with P ^ns _R (t) · ι, obtained by multiplying the next speaker probability P ^ns _R (t) of the robot 100A by a constant ι (ι is an arbitrary positive constant). When the control unit 109A determines that (P ^ns _A (t) + P ^ns _B (t) + P ^ns _C (t) + P ^ns _D (t)) < P ^ns _R (t) · ι, it sets the robot 100A as the predicted next speaker. When the control unit 109A determines that (P ^ns _A (t) + P ^ns _B (t) + P ^ns _C (t) + P ^ns _D (t)) ≧ P ^ns _R (t) · ι, it obtains the predicted next speaker and the utterance start timing by any one of the first to third next speaker selection methods of the first embodiment. In that case, the comparison with the first to third threshold values in the first to third next speaker selection methods may be omitted. The predicted next speaker is then one of the participants A to D.

（第１０の次話者選択方法）
制御部１０９Ａは、参加者Ａ〜Ｄ及びロボット１００Ａの次話者確率Ｐ^ｎｓ _ｉ（ｔ），（ｉ∈｛Ａ，Ｂ，Ｃ，Ｄ，Ｒ｝）それぞれを、時刻ｔについて所定時間（例えば、３〜４秒以上の時間）積分して、積分値Ｐ^ｎｓ _ｉを取得する。制御部１０９Ａは、参加者Ａ〜Ｄの全員の積分値Ｐ^ｎｓ _ｉを加算した加算値（Ｐ^ｎｓ _Ａ＋Ｐ^ｎｓ _Ｂ＋Ｐ^ｎｓ _Ｃ＋Ｐ^ｎｓ _Ｄ）と、ロボット１００Ａの積分値Ｐ^ｎｓ _Ｒに定数ζを乗算したＰ^ｎｓ _Ｒ・ζと比較する（ζは正の値となる任意の定数）。制御部１０９Ａは、（Ｐ^ｎｓ _Ａ＋Ｐ^ｎｓ _Ｂ＋Ｐ^ｎｓ _Ｃ＋Ｐ^ｎｓ _Ｄ）＜Ｐ^ｎｓ _Ｒ・ζと判断した場合は、ロボット１００Ａを予測次話者とする。制御部１０９Ａは、（Ｐ^ｎｓ _Ａ＋Ｐ^ｎｓ _Ｂ＋Ｐ^ｎｓ _Ｃ＋Ｐ^ｎｓ _Ｄ）≧Ｐ^ｎｓ _Ｒ・ζと判断した場合は、第１の実施形態の第１〜第３のいずれかの次話者選択方法によって、予測次話者と発話開始タイミングを得る。ただし、第１〜第３の次話者選択方法において、第１〜第３の閾値との比較は行わなくてもよい。このときの予測次話者は、参加者Ａ〜Ｄのいずれかである。 (10th next speaker selection method)
The control unit 109A integrates each of the next speaker probabilities P ^ns _i (t), (i ∈ {A, B, C, D, R}) of the participants A to D and the robot 100A with respect to time t over a predetermined time (for example, 3 to 4 seconds or more) to obtain the integral values P ^ns _i. The control unit 109A compares the sum (P ^ns _A + P ^ns _B + P ^ns _C + P ^ns _D) of the integral values P ^ns _i of all the participants A to D with P ^ns _R · ζ, obtained by multiplying the integral value P ^ns _R of the robot 100A by a constant ζ (ζ is an arbitrary positive constant). When the control unit 109A determines that (P ^ns _A + P ^ns _B + P ^ns _C + P ^ns _D) < P ^ns _R · ζ, it sets the robot 100A as the predicted next speaker. When the control unit 109A determines that (P ^ns _A + P ^ns _B + P ^ns _C + P ^ns _D) ≧ P ^ns _R · ζ, it obtains the predicted next speaker and the utterance start timing by any one of the first to third next speaker selection methods of the first embodiment. In that case, the comparison with the first to third threshold values in the first to third next speaker selection methods may be omitted. The predicted next speaker is then one of the participants A to D.
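
The robot-versus-participants comparison shared by the ninth and tenth selection methods can be sketched as a single predicate. This is a sketch under stated assumptions: the function name `robot_takes_turn` and the dictionary input are hypothetical, and the same function is reused for instantaneous probabilities (ninth method, constant ι) and for integral values (tenth method, constant ζ).

```python
from typing import Dict


def robot_takes_turn(
    participant_values: Dict[str, float], robot_value: float, scale: float
) -> bool:
    """Return True when the robot should be the predicted next speaker.

    participant_values holds P_i(t) (ninth method) or the integral values
    P_i (tenth method) for the human participants. scale is the positive
    constant (iota or zeta) applied to the robot's value.
    """
    if scale <= 0:
        raise ValueError("scale must be a positive constant")
    return sum(participant_values.values()) < robot_value * scale
```

When the function returns False, the text falls back to one of the first to third selection methods among the participants. A larger constant makes the robot more willing to take the turn, so the constant acts as a tunable bias.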

次話者確率Ｐ^ｎｓ _ｉ（ｔ），（ｉ∈｛Ａ，Ｂ，Ｃ，Ｄ，Ｒ｝）は、図３に示したように、発話終了から所定時間後にピークを有する場合が多い。そこで、制御部１０９Ａは、第６〜第１０の次話者選択方法において、次話者確率Ｐ^ｎｓ _ｉ（ｔ）を求める時刻ｔを含む窓幅を設けて、その窓幅の中における次話者確率の最大値を、時刻ｔにおける次話者確率Ｐ^ｎｓ _ｉ（ｔ）として用いるようにしてもよい。また、制御部１０９Ａは、第６〜第１０の次話者選択方法において、次話者確率Ｐ^ｎｓ _ｉ（ｔ）を求める時刻ｔを含む窓幅を設けて、その窓幅の中における次話者確率に複数のピークがある場合に、ｎ番目（ｎは１以上の整数）のピークの次話者確率を、時刻ｔにおける次話者確率Ｐ^ｎｓ _ｉ（ｔ）として用いるようにしてもよい。 As shown in FIG. 3, the next speaker probability P ^ns _i (t), (i ∈ {A, B, C, D, R}) often has a peak a predetermined time after the end of an utterance. Therefore, in the sixth to tenth next speaker selection methods, the control unit 109A may set a window containing the time t for which the next speaker probability P ^ns _i (t) is sought, and use the maximum value of the next speaker probability within that window as the next speaker probability P ^ns _i (t) at time t. Alternatively, in the sixth to tenth next speaker selection methods, the control unit 109A may set a window containing the time t for which the next speaker probability P ^ns _i (t) is sought and, when the next speaker probability has a plurality of peaks within that window, use the next speaker probability of the n-th peak (n is an integer of 1 or more) as the next speaker probability P ^ns _i (t) at time t.
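
The two windowing variants described above can be sketched as follows. The text only requires a window that contains time t; the window centred on t, the sampled time grid, the definition of a peak as a strict rise followed by a non-rise, and the function names are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def windowed_max(
    times: Sequence[float], series: Sequence[float], t: float, half_width: float
) -> float:
    """Maximum probability among samples within half_width of time t."""
    values = [p for tau, p in zip(times, series) if abs(tau - t) <= half_width]
    if not values:
        raise ValueError("no samples fall inside the window")
    return max(values)


def nth_peak(
    times: Sequence[float],
    series: Sequence[float],
    t: float,
    half_width: float,
    n: int,
) -> Optional[float]:
    """Value of the n-th local peak (1-based) inside the window, or None."""
    peaks = [
        series[k]
        for k in range(1, len(series) - 1)
        if abs(times[k] - t) <= half_width
        and series[k - 1] < series[k] >= series[k + 1]
    ]
    return peaks[n - 1] if 1 <= n <= len(peaks) else None
```

Either value can then stand in for the next speaker probability at time t in the sixth to tenth selection methods.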

制御部１０９Ａが備える動作パターン情報格納部１０９１Ａは、第１の実施形態の動作パターン情報格納部１０９１が記憶する動作パターンに加え、ロボット１００Ａが会話中に行う動作の動作パターン情報を格納する。ロボット１００Ａが会話中に行う動作とは、例えば、発話を開始する前に、これから発話を行うことを周りの人に察知させるよう人が行っている動作と同様の動作である。例えば、複数人が会話している際に、非話者である人が次話者として発話する直前に行う行動を解析した結果、以下の（１）〜（３）の行動が「次は私が話を始めます」ということを周囲に示す行動であると考えられる。
（１）吸気音又はフィラーを発声する
（２）現話者に視線向ける
（３）現話者の会話に頷く The operation pattern information storage unit 1091A included in the control unit 109A stores, in addition to the operation patterns stored in the operation pattern information storage unit 1091 of the first embodiment, operation pattern information for the operations that the robot 100A performs during conversation. The operations that the robot 100A performs during conversation are, for example, the same as the operations a person performs before starting to speak so that those around sense that he or she is about to speak. For example, an analysis of the behavior that a non-speaker exhibits immediately before speaking as the next speaker in a multi-party conversation suggests that the following behaviors (1) to (3) signal to those around that “I am going to start speaking next.”
(1) Producing an inspiratory sound or a filler
(2) Directing the gaze toward the current speaker
(3) Nodding at the current speaker's speech

上述した解析結果を参考にして、制御部１０９Ａは、ロボット１００Ａの発話前に、ロボット１００Ａに上述した（１）〜（３）の動作を行わせるよう制御することで、ロボット１００Ａがもうすぐ発話を開始することを参加者に予見させることができる。ロボット１００Ａが上述した（１）〜（３）の動作を行うと次話者確率推定部１０８Ａが推定するロボット１００Ａの次話者確率Ｐ^ｎｓ _Ｒ（ｔ）が上昇する。すなわち、発話を行うことを周りの人に察知させる動作とは、例えば、現話者に視線を移動させる動作、頭を頷かせる動作、吸気音とともに吸気する動作等を含む。 With reference to the analysis result described above, the control unit 109A controls the robot 100A to perform the operations (1) to (3) described above before the robot 100A speaks, which lets the participants foresee that the robot 100A is about to start speaking. When the robot 100A performs the operations (1) to (3) described above, the next speaker probability P ^ns _R (t) of the robot 100A estimated by the next speaker probability estimation unit 108A increases. That is, the operations that let those around sense that an utterance is coming include, for example, moving the line of sight to the current speaker, nodding the head, and inhaling with an audible inspiratory sound.

制御部１０９Ａは、以下の公知文献に記載の技術を用いてロボット１００Ａに上述した（１）〜（３）の動作を行わせるよう制御してもよい。
（１）の吸気音を発声する動作をロボット１００Ａに行わせるための技術として以下の参考文献３に記載された公知技術がある。
参考文献３：吉田直人、外３名、“吐息と腹部運動を伴う呼吸表現に関する因子分析に基づいた生物的身体感情インタラクションの設計”、ＨＡＩシンポジウム２０１４、２０１４年
（２）の現話者に視線を向ける動作をロボット１００Ａに行わせるための技術として上記の参考文献２に記載された公知技術がある。
（３）の現話者の会話に頷く動作をロボット１００Ａに行わせるための技術として以下の参考文献４に記載された公知技術がある。
参考文献４：渡辺富夫、外３名、“InterActorを用いた発話音声に基づく身体的インタラクションシステム”、ヒューマンインタフェース学会論文誌、Ｖｏｌ．２、Ｎｏ．２、ｐｐ．２１−２９、２０００年 The control unit 109A may control the robot 100A to perform the above-described operations (1) to (3) using a technique described in the following publicly known document.
As a technique for causing the robot 100A to perform operation (1), producing an inspiratory sound, there is the known technique described in Reference 3 below.
Reference 3: Naoto Yoshida and 3 others, “Design of biological body emotion interaction based on factor analysis on breathing expression with breathing and abdominal movement”, HAI Symposium 2014, 2014
As a technique for causing the robot 100A to perform operation (2), directing its gaze toward the current speaker, there is the known technique described in Reference 2 above.
As a technique for causing the robot 100A to perform operation (3), nodding at the current speaker's speech, there is the known technique described in Reference 4 below.
Reference 4: Tomio Watanabe and 3 others, “Physical interaction system based on speech using InterActor”, Journal of Human Interface Society, Vol. 2, No. 2, pp. 21-29, 2000

制御部１０９Ａは、予測次話者がいずれかの参加者である場合、第１の実施形態の制御部１０９と同様の動作を行う。制御部１０９Ａは、予測次話者がロボット１００Ａの場合、ロボット１００Ａの発話の制御を行う発話制御信号を音制御部１１０に出力する。さらに、制御部１０９Ａは、呼吸音やフィラーを発音するよう指示する発音指示信号を音制御部１１０へ出力する。ここで、フィラーとは、言い淀み時などに出現する場つなぎのための発声であり、例えば、「あのー」、「そのー」、「えっと」、等の音声である。また、制御部１０９Ａは、次話者確率推定部１０８Ａからの発話者情報及び参加者の位置情報に基づいて、動作パターン情報格納部１０９１Ａから動作パターン情報を取得して動作制御信号を生成し、生成した動作制御信号を口部制御部１１１、視線制御部１１２、頭部制御部１１３及び胴部制御部１１４へ出力する。 When the predicted next speaker is one of the participants, the control unit 109A performs the same operation as the control unit 109 of the first embodiment. When the predicted next speaker is the robot 100A, the control unit 109A outputs to the sound control unit 110 an utterance control signal for controlling the utterance of the robot 100A. The control unit 109A further outputs to the sound control unit 110 a sound generation instruction signal instructing it to produce a breathing sound or a filler. Here, a filler is a gap-filling vocalization that appears, for example, at moments of hesitation, such as the Japanese “ano”, “sono”, or “etto”. In addition, based on the speaker information and the participant position information from the next speaker probability estimation unit 108A, the control unit 109A acquires operation pattern information from the operation pattern information storage unit 1091A, generates an operation control signal, and outputs the generated operation control signal to the mouth control unit 111, the line-of-sight control unit 112, the head control unit 113, and the torso control unit 114.

センサ信号変換部１２０は、制御部１０９Ａが生成した動作制御信号を疑似センサ信号に変換して次話者確率推定部１０８Ａに出力する。 The sensor signal conversion unit 120 converts the motion control signal generated by the control unit 109A into a pseudo sensor signal and outputs the pseudo sensor signal to the next speaker probability estimation unit 108A.

第２の実施形態におけるロボット１００Ａの外観は、図２に示したロボット１００と同一である。 The appearance of the robot 100A in the second embodiment is the same as that of the robot 100 shown in FIG.

以上の構成により、ロボット１００Ａは、発話を行いたい場合に、発話前に、動作制御信号に基づいて視線を参加者に向けたり、呼吸音やフィラーを発音したりすることができる。参加者は、ロボット１００Ａが発話を開始する前に、ロボット１００Ａがまもなく発話することを予見することができる。この予見により、参加者とロボット１００Ａとの発話衝突を防ぎ、スムーズな会話を実現することができる。 With the above configuration, the robot 100A can turn the line of sight toward the participant based on the operation control signal, or can generate a breathing sound or a filler before speaking, when it is desired to speak. The participant can foresee the robot 100A speaking soon before the robot 100A starts speaking. By this prediction, it is possible to prevent a speech collision between the participant and the robot 100A and realize a smooth conversation.

次に、第２の実施形態におけるロボット１００Ａの動作について説明する。
図８は、第２の実施形態におけるロボット１００Ａの動作を示すフロー図である。図８に示す処理は、図６に示した処理と同様に、ロボット１００Ａにおいて、複数の参加者と会話を行う動作を開始した際に行う処理である。 Next, the operation of the robot 100A in the second embodiment will be described.
FIG. 8 is a flowchart showing the operation of the robot 100A in the second embodiment. The process illustrated in FIG. 8 is a process performed when the robot 100A starts an operation of having conversations with a plurality of participants, similarly to the process illustrated in FIG.

音声入力部１０４は、マイク１０１からの音声信号が入力され、映像入力部１０５は、カメラ１０２からの映像信号が入力され、センサ入力部１０６は、センサ１０３からのセンサ信号が入力される。また、制御部１０９Ａの制御によりロボット１００Ａの会話動作を行う（ステップＳ２０１）。ロボット１００Ａの会話動作には、上述した（１）〜（３）の動作が含まれる。このロボット１００Ａの会話動作に応じて、センサ信号変換部１２０は、疑似センサ信号を次話者確率推定部１０８Ａに出力する。 The audio input unit 104 receives the audio signal from the microphone 101, the video input unit 105 receives the video signal from the camera 102, and the sensor input unit 106 receives the sensor signal from the sensor 103. Further, the conversation operation of the robot 100A is performed under the control of the control unit 109A (step S201). The conversation operation of the robot 100A includes the operations (1) to (3) described above. In response to the conversation operation of the robot 100A, the sensor signal conversion unit 120 outputs a pseudo sensor signal to the next speaker probability estimation unit 108A.

The utterance section detection unit 107 calculates a speech feature from the audio signal supplied by the audio input unit 104 and compares the calculated feature with a predetermined threshold to detect utterance sections (step S202). Based on the audio signal, the video signal, the sensor signal, the pseudo sensor signal, and the speaker information, the next speaker probability estimation unit 108A calculates, for the robot 100A and for each participant i, the next speaker probability P^ns_i(t), that is, the probability of becoming the next speaker at time t (step S203).
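As an illustration of the threshold comparison in step S202, the following Python sketch flags a frame as speech when a per-frame feature exceeds a threshold. The use of short-time energy as the feature, the non-overlapping framing, and the names `frame_energy` and `detect_speech_frames` are assumptions made for this sketch; the text does not specify which speech feature the utterance section detection unit 107 uses.

```python
from typing import List


def frame_energy(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame (assumed feature)."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_speech_frames(samples: List[float], frame_len: int,
                         threshold: float) -> List[bool]:
    """Mark a frame as speech when its feature exceeds the threshold."""
    return [e > threshold for e in frame_energy(samples, frame_len)]


# Silence, a burst of signal, then silence again (synthetic data).
signal = [0.0] * 4 + [1.0] * 4 + [0.0] * 4
flags = detect_speech_frames(signal, frame_len=4, threshold=0.5)
```

In this synthetic case only the middle frame is flagged. A real detector would typically add smoothing over neighboring frames, which is omitted here.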

Based on the next speaker probabilities of the robot 100A and of each participant supplied by the next speaker probability estimation unit 108A, the control unit 109A obtains the predicted next speaker and the utterance start timing of the predicted next speaker using any of the sixth to tenth next speaker selection methods described above (step S204).
The processing of steps S205 to S211 of the robot 100A is the same as that of steps S105 to S111 of the first embodiment. However, before the processing of step S210, the robot 100A turns its line of sight toward a participant based on the operation control signal, or produces a breathing sound or a filler based on the sound generation instruction signal.

As described above, the robot 100A in the second embodiment reduces the occurrence of utterance collisions, in which its utterance timing overlaps with that of another participant, and, while itself speaking at appropriate timings, can prompt a participant to speak when that participant has missed the timing to speak.

(Specific example of processing for estimating the next speaker, common to the first and second embodiments)
Next, a specific example of the next speaker estimation process common to the robot 100 described above and the robot 100A of the second embodiment will be described. For next speaker estimation in the robot 100 and the robot 100A, the techniques of References 5 and 6 below can be applied, for example, but any existing technique may be used. When the techniques described in References 5 and 6 are used, the next speaker probability estimation unit 108 or the next speaker probability estimation unit 108A predicts the next speaker and the utterance timing using transition patterns of the gaze behavior of the speaker and the non-speakers, based on the gaze target information output by the gaze target detection device 203.

Reference 5: Japanese Patent Application Laid-Open No. 2014-238525
Reference 6: Ryo Ishii and four others, "Prediction of next speaker and utterance timing based on gaze transition patterns in multi-party dialogue", Japanese Society for Artificial Intelligence SIG technical report, SIG-SLUD-B301-06, pp. 27-34, 2013

Examples of next speaker estimation techniques other than those of References 5 and 6 that are applicable to the present embodiment are described below.
The breathing behavior of conversation participants is closely related to who speaks next and when. Exploiting this, the technique measures the participants' breathing in real time, detects from the measured breathing the characteristic breathing motion performed immediately before an utterance starts, and on that basis calculates the next speaker and the utterance timing with high accuracy. Specifically, the breathing performed just before an utterance has the following characteristics. A speaker who is going to keep speaking (speaker continuation) inhales sharply immediately after finishing the utterance. Conversely, a speaker who is not going to speak next (speaker change) waits longer after the end of the utterance than in the continuation case and inhales slowly. In addition, at a speaker change, the next speaker inhales more deeply than the non-speakers who do not speak. Such pre-utterance breaths occur at a roughly fixed timing relative to the start of the utterance. Because the next speaker takes a characteristic breath immediately before speaking, information about inhalation is useful for predicting the next speaker and the utterance timing. This next speaker estimation technique therefore focuses on a person's inhalation and predicts the next speaker and the utterance timing using information such as the amount of inhalation, the length of the inhalation interval, and its timing.

The following assumes a situation in which A participants P_1, ..., P_A communicate face to face. Each participant P_a (where a = 1, ..., A and A ≥ 2) wears a respiratory motion measuring device 202 and a microphone 101. The respiratory motion measuring device 202 measures the breathing of participant P_a, obtains respiration information B_{a,t} representing the measurement result at each discrete time t, and outputs it to the next speaker probability estimation unit 108 or 108A. A configuration in which the respiratory motion measuring device 202 includes a band-type respiration device is described here. The band-type device outputs a value indicating the depth of breathing according to how strongly the band is stretched: the deeper the inhalation, the more the band stretches, and the deeper the exhalation, the more the band contracts (the less it is stretched). This value is hereinafter called the RSP value. The RSP value takes a different magnitude for each participant P_a depending on the strength of expansion and contraction of the band. To remove the resulting differences in RSP values between participants, the RSP values are normalized for each participant P_a using the mean μ_a and the standard deviation δ_a of that participant's RSP values so that μ_a + δ_a becomes 1 and μ_a − δ_a becomes −1. This makes it possible to analyze the breathing data of all participants P_a in the same way. Each respiratory motion measuring device 202 sends the normalized RSP value to the next speaker probability estimation unit 108 or 108A as the respiration information B_{a,t}.
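The normalization described above maps μ_a + δ_a to 1 and μ_a − δ_a to −1, which corresponds to the linear map (R − μ_a) / δ_a. A minimal Python sketch follows. The function name `normalize_rsp`, the use of the population standard deviation, and the handling of a constant signal are assumptions; the text does not say which estimator of the standard deviation is used.

```python
from statistics import mean, pstdev
from typing import List


def normalize_rsp(values: List[float]) -> List[float]:
    """Normalize one participant's RSP values so that mu + delta -> 1
    and mu - delta -> -1, i.e. (v - mu) / delta."""
    mu = mean(values)
    delta = pstdev(values)  # population standard deviation (assumption)
    if delta == 0:
        # Constant signal: no breathing variation to scale (assumed handling).
        return [0.0 for _ in values]
    return [(v - mu) / delta for v in values]


# Two readings with mean 3 and standard deviation 1 (synthetic data).
normalized = normalize_rsp([2.0, 4.0])
```

Here the readings sit exactly at μ_a − δ_a and μ_a + δ_a, so they are mapped to −1 and 1.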

Furthermore, the microphone 101 acquires the voice of participant P_a, obtains an audio signal V_{a,t} representing the voice of participant P_a at each discrete time t, and outputs it to the next speaker probability estimation unit 108 or 108A. The next speaker probability estimation unit 108 or 108A removes noise from the input audio signals V_{a,t} (where a = 1, ..., A) and extracts each utterance section U_k (where k is the identifier of the utterance section U_k) and its speaker P_uk. Here the subscript "uk" of "P_uk" stands for u_k = 1, ..., A. One utterance section U_k is defined as a section bounded by silent sections that last Td [ms] continuously, and this utterance section U_k is treated as one unit of utterance. The next speaker probability estimation unit 108 or 108A thereby obtains utterance section information representing each utterance section U_k and speaker information representing its speaker P_uk (that is, information indicating which of the participants P_1, ..., P_A is the speaker P_uk in utterance section U_k).
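The definition of an utterance section as a stretch of speech bounded by at least Td [ms] of continuous silence can be sketched as follows. The per-frame voice activity flags, the frame period `frame_ms`, and the function name `extract_utterances` are assumptions introduced for illustration; noise removal and speaker attribution are outside the scope of this sketch.

```python
from typing import List, Tuple


def extract_utterances(active: List[bool], frame_ms: int,
                       td_ms: int) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs, inclusive.

    Speech runs separated by less than td_ms of silence are merged
    into a single utterance section.
    """
    min_gap = -(-td_ms // frame_ms)  # ceiling division: silent frames required
    utterances = []
    start = None
    last_active = None
    for t, is_active in enumerate(active):
        if is_active:
            if start is None:
                start = t
            last_active = t
        elif start is not None and t - last_active >= min_gap:
            utterances.append((start, last_active))
            start = None
    if start is not None:
        # Data ended before Td of silence was observed; close the section anyway.
        utterances.append((start, last_active))
    return utterances


flags = [False, True, True, False, True, False, False, False, True, True, False]
sections = extract_utterances(flags, frame_ms=100, td_ms=200)
```

With 100 ms frames and Td = 200 ms, the one-frame pause at frame 3 does not split the first section, whereas the three-frame pause at frames 5 to 7 does, giving sections (1, 4) and (8, 9).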

Using the respiration information B_{a,t} of each participant P_a, the next speaker probability estimation unit 108 or 108A extracts the inhalation intervals I_{a,k} of each participant P_a and further obtains parameters λ_{a,k} relating to the inhalation. An inhalation interval is the interval between the start position, at which the participant switches from exhaling to inhaling, and the end position, at which the inhalation finishes.

FIG. 9 is a diagram showing an example of an inhalation interval. A method of calculating the inhalation interval I_{a,k} is illustrated with reference to FIG. 9. Here, the RSP value of participant P_a at discrete time t is denoted R_{a,t}. The RSP value R_{a,t} corresponds to the respiration information B_{a,t}. As illustrated in FIG. 9, when (Equation 1) holds, for example, the RSP value R_{a,t} decreases continuously over the two frames before discrete time t = t_s(k) and rises continuously over the two frames after it, so the discrete time t_s(k) is taken as the start position of the inhalation. Likewise, when (Equation 2) holds, the RSP value R_{a,t} rises continuously over the two frames before discrete time t = t_e(k) and decreases continuously over the two frames after it, so the discrete time t_e(k) is taken as the end position of the inhalation. The inhalation interval I_{a,k} of participant P_a is then the interval from t_s(k) to t_e(k), and the length of the inhalation interval is t_e(k) − t_s(k).
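The two-frame conditions for the start and end of an inhalation can be sketched in Python as below. The strict inequalities, the pairing of each start with the next detected end, and the function name `find_inhalations` are assumptions made for this sketch; they are one reading of the verbal description, not a reproduction of (Equation 1) and (Equation 2).

```python
from typing import List, Tuple


def find_inhalations(r: List[float]) -> List[Tuple[int, int]]:
    """Return (t_s, t_e) pairs of inhalation start and end frames.

    t_s: the RSP value falls over the two preceding frames and rises
         over the two following frames.
    t_e: the RSP value rises over the two preceding frames and falls
         over the two following frames.
    """
    intervals = []
    start = None
    for t in range(2, len(r) - 2):
        falling_before = r[t - 2] > r[t - 1] > r[t]
        rising_after = r[t] < r[t + 1] < r[t + 2]
        rising_before = r[t - 2] < r[t - 1] < r[t]
        falling_after = r[t] > r[t + 1] > r[t + 2]
        if falling_before and rising_after:
            start = t  # candidate start of an inhalation
        elif rising_before and falling_after and start is not None:
            intervals.append((start, t))
            start = None
    return intervals


# Exhale down to frame 3, inhale up to frame 7, then exhale (synthetic data).
rsp = [3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0]
inhalations = find_inhalations(rsp)
```

For this synthetic trace the single inhalation interval runs from frame 3 to frame 7, so its length t_e(k) − t_s(k) is 4 frames. Measured signals are noisier, and some smoothing before the comparison would likely be needed in practice.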

When an inhalation interval I_{a,k} has been extracted, the next speaker probability estimation unit 108 or 108A extracts parameters λ'_{a,k} relating to the inhalation using at least some of the inhalation interval I_{a,k}, the respiration information B_{a,t}, and the utterance section U_k. The parameters λ'_{a,k} represent at least some of the following: the amount of breath inhaled by participant P_a in the inhalation interval I_{a,k}, the length of I_{a,k}, the change over time of the amount inhaled in I_{a,k}, and the temporal relationship between the utterance section U_k and I_{a,k}. The parameters λ'_{a,k} may represent just one of these, several of them, or all of them. The parameters λ'_{a,k} include, for example, at least some of the following parameters MIN_{a,k}, MAX_{a,k}, AMP_{a,k}, DUR_{a,k}, SLO_{a,k}, and INT1_{a,k}. They may include just one of these, several of them, or all of them.
・MIN_{a,k}: the RSP value R_{a,t} at the start of participant P_a's inhalation, that is, the minimum of the RSP values R_{a,t} in the inhalation interval I_{a,k}.
・MAX_{a,k}: the RSP value R_{a,t} at the end of participant P_a's inhalation, that is, the maximum of the RSP values R_{a,t} in the inhalation interval I_{a,k}.
・AMP_{a,k}: the amplitude of the RSP values R_{a,t} in participant P_a's inhalation interval I_{a,k}, that is, the value calculated as MAX_{a,k} − MIN_{a,k}. It represents the amount of breath inhaled in the inhalation interval I_{a,k}.
・DUR_{a,k}: the length of participant P_a's inhalation interval I_{a,k}, that is, the value t_e(k) − t_s(k) obtained by subtracting the discrete time t_s(k) of the start position of I_{a,k} from the discrete time t_e(k) of its end position.
・SLO_{a,k}: the average slope per unit time of the RSP values R_{a,t} in participant P_a's inhalation interval I_{a,k}, that is, the value calculated as AMP_{a,k} / DUR_{a,k}. It represents the change over time of the amount of breath inhaled in the inhalation interval I_{a,k}.
・INT1_{a,k}: the interval from the end time t_{ue(k)} of the preceding utterance section U_k (the end of the utterance section) to the start of participant P_a's inhalation, that is, the value t_s(k) − t_{ue(k)} obtained by subtracting the end time t_{ue(k)} of the utterance section U_k from the discrete time t_s(k) of the start position of the inhalation interval I_{a,k}. It represents the temporal relationship between the utterance section U_k and the inhalation interval I_{a,k}.
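The six parameters listed above follow directly from the RSP sequence, the inhalation boundaries, and the end time of the preceding utterance section. The Python sketch below computes them under the assumption that times are expressed as frame indices; the class name `InhalationParams` and the function name `inhalation_params` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InhalationParams:
    min_rsp: float  # MIN: RSP value at the start of the inhalation
    max_rsp: float  # MAX: RSP value at the end of the inhalation
    amp: float      # AMP = MAX - MIN (amount inhaled)
    dur: int        # DUR = t_e - t_s (length of the interval)
    slo: float      # SLO = AMP / DUR (average slope)
    int1: int       # INT1 = t_s - t_ue (gap after the utterance end)


def inhalation_params(r: List[float], t_s: int, t_e: int,
                      t_ue: int) -> InhalationParams:
    """Compute MIN, MAX, AMP, DUR, SLO and INT1 for one inhalation interval."""
    if t_e <= t_s:
        raise ValueError("inhalation end must come after its start")
    min_rsp = r[t_s]
    max_rsp = r[t_e]
    amp = max_rsp - min_rsp
    dur = t_e - t_s
    return InhalationParams(min_rsp, max_rsp, amp, dur, amp / dur, t_s - t_ue)


rsp = [3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0]
params = inhalation_params(rsp, t_s=3, t_e=7, t_ue=1)
```

In this synthetic case the inhalation starts 2 frames after the utterance ends and rises by 4 units over 4 frames, so SLO is 1.0.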

The next speaker probability estimation unit 108 or 108A may additionally generate the following parameter INT2_{a,k}.
・INT2_{a,k}: the interval from the end of participant P_a's inhalation to the start of the next speaker's utterance section U_{k+1}, that is, the value t_{us(k+1)} − t_e(k) obtained by subtracting the discrete time t_e(k) of the end position of the inhalation interval I_{a,k} from the start time t_{us(k+1)} of the next speaker's utterance section U_{k+1}. It represents the temporal relationship between the utterance section U_{k+1} and the inhalation interval I_{a,k}. The parameters λ'_{a,k} with INT2_{a,k} added are denoted as parameters λ_{a,k}.

For example, once information representing the utterance section U_{k+1} has been obtained and the parameters λ_{a,k} have been obtained (that is, after the utterance section U_{k+1} has started), the next speaker probability estimation unit 108 or 108A records the parameters in a database together with information representing the utterance section U_k and its speaker P_uk, and the utterance section U_{k+1}, its speaker P_{uk+1}, and its utterance start timing T_{uk+1}. The utterance timing of the next speaker P_{uk+1} may be any point in the utterance section U_{k+1} or a point corresponding to it. The utterance timing T_{uk+1} may be the start time t_{us(k+1)} of the utterance section U_{k+1}, the time t_{us(k+1)} + γ (where γ is a positive or negative constant), the end time t_{ue(k+1)} of the utterance section U_{k+1}, the time t_{ue(k+1)} + γ, or the center time t_{us(k+1)} + (t_{ue(k+1)} − t_{us(k+1)})/2 of the utterance section U_{k+1}. Some or all of the information representing λ_{a,k}, U_k, P_uk, P_{uk+1}, and T_{uk+1} is held in the database and is used by the next speaker probability estimation unit 108 or 108A to predict the next speakers and utterance timings that follow the utterance section U_{k+1}.

Based on at least some of the speaker information P_uk, the utterance section U_k, the amount of breath inhaled by participant P_a in the inhalation interval I_{a,k}, the length of I_{a,k}, the change over time of the amount inhaled in I_{a,k}, and the temporal relationship between U_k and I_{a,k}, the next speaker probability estimation unit 108 or 108A obtains estimation information representing at least one of which of the participants P_1, ..., P_A is the next speaker P_{uk+1} and the utterance timing of the next speaker P_{uk+1}. Here the subscript "uk+1" of "P_{uk+1}" stands for u_{k+1}. When the speaker P_uk of the utterance section U_k also speaks in the utterance section U_{k+1} (speaker continuation), the next speaker is the same as the speaker P_uk of U_k. When a participant other than the speaker P_uk of U_k speaks in U_{k+1} (speaker change), the next speaker is a participant other than the speaker P_uk of U_k.

The next speaker probability estimation unit 108 or 108A machine-learns a model for obtaining estimation information from a feature f_{a,k} that corresponds to at least some of the speaker information P_uk, the utterance section U_k, the amount of breath inhaled by participant P_a in the inhalation interval I_{a,k}, the length of I_{a,k}, the change over time of the amount inhaled in I_{a,k}, and the temporal relationship between U_k and I_{a,k}, and uses this model to obtain estimation information for a feature. The feature f_{a,k} may correspond to just one of these items, to several of them, or to all of them. For the machine learning of the model, the learning data consist of, for example, features f_{a,k} corresponding to at least some of the amount of breath inhaled in past inhalation intervals I_{a,i} (where i < k), the length of I_{a,i}, the change over time of the amount inhaled in I_{a,i}, and the temporal relationship between the utterance section U_i and I_{a,i}, together with information on the utterance sections U_i and U_{i+1} and their speakers P_uk and P_{uk+1}.

The next speaker and utterance timing estimation processing performed by the next speaker probability estimation unit 108 or 108A is illustrated below. In this example, a next speaker estimation model, which estimates the next speaker P_{uk+1}, and an utterance timing estimation model, which estimates the utterance timing of the next speaker P_{uk+1}, are generated, and the next speaker P_{uk+1} and the utterance timing are estimated using the respective models.

When learning the next speaker estimation model, the next speaker probability estimation unit 108 or 108A reads from the database, as learning data, at least some of the past parameters λ_{a,i} (where a = 1, ..., A and i < k) and information representing the utterance sections U_i and U_{i+1} and their speakers P_ui and P_{ui+1}. It then machine-learns the next speaker estimation model using, as learning data, features F1_{a,i} corresponding to at least some of the parameters λ_{a,i} together with U_i, U_{i+1}, P_ui, and P_{ui+1}. For the next speaker estimation model, an SVM (Support Vector Machine), a GMM (Gaussian Mixture Model), an HMM (Hidden Markov Model), or the like can be used, for example.

The next speaker probability estimation unit 108 or 108A applies a feature F1_{a,k} corresponding to at least some of the parameters λ'_{a,k} to the next speaker estimation model and treats the information representing the next speaker P_{uk+1} estimated in this way as part of the "estimation information". The information representing the next speaker P_{uk+1} may identify one of the participants P_a deterministically or may express it probabilistically. The probability that participant P_a becomes the next speaker is denoted P1_a.

When learning the utterance timing estimation model, the next speaker probability estimation unit 108 or 108A reads from the database, as learning data, at least some of the past parameters λ_{a,i} (where a = 1, ..., A and i < k) and information representing the utterance sections U_i and U_{i+1}, their speakers P_ui and P_{ui+1}, and the utterance start timing T_{ui+1} of the utterance section U_{i+1}. It then machine-learns the utterance timing estimation model using, as learning data, features F2_{a,i} corresponding to at least some of the parameters λ_{a,i} together with U_i, U_{i+1}, P_ui, P_{ui+1}, and T_{ui+1}. For the utterance timing estimation model, an SVM, a GMM, an HMM, or the like can be used, for example.

When the speaker P_uk, at least some of the parameters λ'_{a,k}, and the next speaker P_{uk+1} estimated by the next speaker estimation model have been obtained, the next speaker probability estimation unit 108 or 108A applies a feature F2_{a,k} corresponding to at least some of the parameters λ'_{a,k} to the utterance timing estimation model. It outputs, as part of the "estimation information", information representing the utterance timing T_{uk+1} of the next utterance section U_{k+1} (for example, the start time of U_{k+1}) estimated by applying the feature F2_{a,k} to the utterance timing estimation model. The information representing the utterance timing may identify a timing deterministically or may express it probabilistically. The probability that participant P_a starts speaking at time t (the probability that time t is the utterance timing of participant P_a) is denoted P2_a(t).
The next speaker probability P^ns_i(t) of participant i at time t, as estimated by the next speaker probability estimation unit 108 or 108A of the embodiments described above, is calculated as P1_a × P2_a(t) when participant i is the participant P_a of this next speaker estimation technique.
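The product P1_a × P2_a(t) can be written out as a short Python sketch. The dictionary layout, the discretization of P2_a(t) into a list indexed by t, and the function name `next_speaker_probability` are assumptions for illustration; the probability values are made up.

```python
from typing import Dict, List


def next_speaker_probability(p1: Dict[str, float],
                             p2: Dict[str, List[float]],
                             t: int) -> Dict[str, float]:
    """P^ns_a(t) = P1_a * P2_a(t) for every participant a."""
    return {a: p1[a] * p2[a][t] for a in p1}


# Hypothetical outputs of the two estimation models.
p1 = {"P_1": 0.5, "P_2": 0.25}                  # next speaker model
p2 = {"P_1": [0.0, 0.5], "P_2": [1.0, 0.0]}     # timing model over t = 0, 1
at_t0 = next_speaker_probability(p1, p2, 0)
at_t1 = next_speaker_probability(p1, p2, 1)
```

With these made-up values, P_2 has the higher next speaker probability at t = 0 and P_1 at t = 1, which shows how the timing model shifts the result over time even though P1_a is fixed.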

The next speaker probability estimation unit 108 or 108A described above estimates the participant who will start speaking next and the timing from observed breathing, but it may additionally use observed gaze.
When gaze behavior is also used, each participant P_a (where a = 1, ..., A) additionally wears a gaze target detection device 203. The gaze target detection device 203 detects whom participant P_a is gazing at (the gaze target) and sends information representing participant P_a and the gaze target G_{a,t} at each discrete time t to the next speaker probability estimation unit 108 or 108A. The next speaker probability estimation unit 108 or 108A takes as input the gaze target information G_{1,t}, ..., G_{A,t}, the utterance section U_k, and the speaker information P_uk, and generates gaze target label information θ_{v,k} (where v = 1, ..., V and V is the total number of gaze target labels) for the period around the end of the utterance section. The gaze target label information represents the participants' gaze targets in a time interval corresponding to the end time T_se of the utterance section U_k. Here, gaze target label information θ_{v,k} obtained by labeling the gaze targets of participant P_a in a finite time interval containing the end time T_se is taken as an example. In this case, for example, the gaze behavior that appears in the interval from time T_se − T_b, before the end time T_se of the utterance section U_k, to time T_se + T_a, after the end time T_se, is handled. T_b and T_a may be any values of 0 or more; as a guide, about 0 to 2.0 seconds is appropriate for T_b and about 0 to 3.0 seconds for T_a.

The next speaker probability estimation unit 108 or 108A classifies the participants being gazed at into the following types and labels the gaze targets. The label symbols themselves have no meaning; any notation may be used as long as the labels can be distinguished.
・Label S: the speaker (that is, the participant P_uk who is the speaker)
・Label L_ξ: a non-speaker (where ξ identifies mutually different non-speaker participants and ξ = 1, ..., A−1. For example, when a participant has gazed at non-speaker P_2 and then non-speaker P_3, in that order, the label L_1 is assigned to non-speaker P_2 and the label L_2 to non-speaker P_3.)
・Label X: not looking at anyone

When the label is S or L_ξ, information indicating whether mutual gaze (eye contact) occurred is attached. In this embodiment, when mutual gaze occurs, an M is appended to the end of the label S or L_ξ, giving S_M or L_ξM (the subscript "ξM" stands for ξ_M).
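The labeling rules above (S, L_ξ, X, and the M suffix for mutual gaze) could be sketched in Python as follows. This is a minimal sketch under stated assumptions: the input format (a dictionary of time-ordered `(target, start, end)` tuples per participant), the function name `label_gaze`, the plain-string labels such as `"L2M"`, and the rule that any temporal overlap counts as mutual gaze are all choices made for this example, not details given in the embodiment.

```python
def label_gaze(gaze, speaker):
    """Assign gaze target labels to every participant.

    gaze: {participant: [(target_or_None, start, end), ...]} in temporal order.
    speaker: id of the current speaker P_uk.
    Returns {participant: [label, ...]}.
    """
    labels = {}
    for person, segments in gaze.items():
        listener_index = {}  # non-speaker target -> xi, numbered in order of first gaze
        out = []
        for target, start, end in segments:
            if target is None:
                out.append("X")  # looking at no one
                continue
            if target == speaker:
                base = "S"
            else:
                xi = listener_index.setdefault(target, len(listener_index) + 1)
                base = f"L{xi}"
            # Assumed rule: mutual gaze if the target looks back at `person`
            # during any overlapping stretch of time.
            mutual = any(
                t == person and s < end and start < e
                for t, s, e in gaze.get(target, [])
            )
            out.append(base + ("M" if mutual else ""))
        labels[person] = out
    return labels


# Hypothetical timings for a four-person conversation in which P1 is the speaker
gaze = {
    "P1": [("P4", 9.0, 9.8), ("P2", 9.8, 11.0)],
    "P2": [("P4", 9.0, 10.2), ("P1", 10.2, 11.0)],
    "P3": [("P1", 9.0, 11.0)],
    "P4": [(None, 9.0, 11.0)],
}
result = label_gaze(gaze, speaker="P1")
```

With these made-up timings, the speaker P1 receives the labels L1 and L2M, and the six labels in total correspond to V = 6.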

FIG. 10 shows a specific example of gaze target labels. FIG. 10 is an example with A = 4, in which the utterance sections U_k and U_{k+1} and the gaze targets of each participant are shown in time series. The example shows the situation in which, after participant P_1 speaks, a change of speaker occurs and participant P_2 newly speaks. Here, participant P_1, the speaker, gazes at participant P_4 and then at participant P_2. In the interval from T_se - T_b to T_se + T_a, participant P_2 is looking at participant P_1 while participant P_1 is looking at participant P_2. This means that mutual gaze occurs between participants P_1 and P_2. In this case, two gaze target labels, L_1 and L_2M, are generated from the gaze target information G_{1,t} of participant P_1. In the same interval, participant P_2 gazes at participant P_4 and then at participant P_1, the speaker. In this case, participant P_2 has two gaze target labels, L_1 and S_M. In the same interval, participant P_3 gazes at participant P_1, the speaker, so the gaze target label of participant P_3 is S. In the same interval, participant P_4 looks at no one, so the gaze target label of participant P_4 is X. Therefore, V = 6 in the example of FIG. 10.

The next speaker probability estimation unit 108 (or 108A) also acquires the start time and end time of each gaze target label. Here, R_GL is defined as the symbol indicating whose (R ∈ {S, L}) gaze target label and which label (GL ∈ {S, S_M, L_1, L_1M, L_2, L_2M, ...}) it is, ST_R_GL as its start time, and ET_R_GL as its end time. R represents the utterance state of the participant (speaker or non-speaker): S is the speaker and L is a non-speaker. For example, in FIG. 10, the first gaze target label of participant P_1 is S_L1, its start time is ST_S_L1, and its end time is ET_S_L1. The gaze target label information θ_{v,k} is information that includes the gaze target label R_GL, the start time ST_R_GL, and the end time ET_R_GL.

The next speaker probability estimation unit 108 (or 108A) uses the gaze target label information θ_{v,k} to generate a gaze target transition pattern E_{a,k} for each participant P_a. The gaze target transition pattern is generated as a transition n-gram that takes the gaze target labels R_GL as its elements and preserves their temporal order, where n is a positive integer. For example, in FIG. 10, the gaze target transition pattern E_{1,k} generated from the gaze target labels of participant P_1 is L_1-L_2M. Similarly, the gaze target transition pattern E_{2,k} of participant P_2 is L_1-S_M, the gaze target transition pattern E_{3,k} of participant P_3 is S, and the gaze target transition pattern E_{4,k} of participant P_4 is X.
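To make the transition n-gram concrete, here is a small Python sketch. The function names and the use of plain strings joined with "-" are assumptions for illustration; the embodiment specifies only that the pattern is an n-gram over the gaze target labels in temporal order.

```python
def transition_pattern(labels):
    """Join a participant's time-ordered labels into one pattern string, e.g. 'L1-L2M'."""
    return "-".join(labels)


def transition_ngrams(labels, n):
    """Return all length-n windows over the time-ordered labels (n is a positive integer)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return [tuple(labels[i:i + n]) for i in range(len(labels) - n + 1)]


# Label sequences matching the FIG. 10 description
pattern_p1 = transition_pattern(["L1", "L2M"])  # speaker P1
pattern_p3 = transition_pattern(["S"])          # P3 looks only at the speaker
bigrams = transition_ngrams(["L1", "L2M", "S"], 2)
```

A longer label sequence simply yields more windows, as in the `bigrams` example.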

The gaze target transition pattern E_{a,k} is sent to a database, for example after the utterance section U_{k+1} has started, together with information representing the utterance section U_k and its speaker P_uk, the next speaker P_{uk+1} who makes the utterance corresponding to the utterance section U_{k+1}, and the next utterance start timing T_{uk+1}. In the database, the gaze target transition pattern E_{a,k} is merged with the parameter λ_{a,k}, and part or all of the information representing E_{a,k}, λ_{a,k}, U_k, P_uk, P_{uk+1}, and T_{uk+1} is retained.

The next speaker probability estimation unit 108 (or 108A) takes the gaze target label information θ_{v,k} as input and generates time structure information Θ_{v,k} for each gaze target label. The time structure information represents the temporal relationships in the participants' gaze behavior and has the following as parameters: (1) the duration of the gaze target label, (2) the interval between the gaze target label and the start time or end time of the utterance section, and (3) the interval between the start time or end time of the gaze target label and the start time or end time of another gaze target label.

The specific parameters of the time structure information are listed below. In the following, the start time of the utterance section is defined as ST_U and its end time as ET_U.
・ INT1 (= ET_R_GL - ST_R_GL): the interval between the start time ST_R_GL and the end time ET_R_GL of the gaze target label R_GL
・ INT2 (= ST_U - ST_R_GL): how long before the start time ST_U of the utterance section the start time ST_R_GL of the gaze target label R_GL occurred
・ INT3 (= ET_U - ST_R_GL): how long before the end time ET_U of the utterance section the start time ST_R_GL of the gaze target label R_GL occurred
・ INT4 (= ET_R_GL - ST_U): how long after the start time ST_U of the utterance section the end time ET_R_GL of the gaze target label R_GL occurred
・ INT5 (= ET_U - ET_R_GL): how long before the end time ET_U of the utterance section the end time ET_R_GL of the gaze target label R_GL occurred
・ INT6 (= ST_R_GL - ST_R_GL'): how long after the start time ST_R_GL' of another gaze target label R_GL' the start time ST_R_GL of the gaze target label R_GL occurred
・ INT7 (= ET_R_GL' - ST_R_GL): how long before the end time ET_R_GL' of another gaze target label R_GL' the start time ST_R_GL of the gaze target label R_GL occurred
・ INT8 (= ET_R_GL - ST_R_GL'): how long after the start time ST_R_GL' of the gaze target label R_GL' the end time ET_R_GL of the gaze target label R_GL occurred
・ INT9 (= ET_R_GL - ET_R_GL'): how long after the end time ET_R_GL' of the gaze target label R_GL' the end time ET_R_GL of the gaze target label R_GL occurred

INT6 to INT9 are obtained for every combination with the gaze target labels of all participants. In the example of FIG. 10, there are six pieces of gaze target label information in total (L_1, L_2M, L_1, S_M, S, X), so 6 × 5 = 30 values are generated for each of INT6 to INT9.
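The count of 6 × 5 = 30 is the number of ordered pairs of distinct labels. A short Python check, assuming each label is identified by a hypothetical (participant, label) pair so that the two L_1 labels of different participants remain distinct:

```python
from itertools import permutations

# The six gaze target labels from the FIG. 10 description, keyed by participant
labels = [
    ("P1", "L1"), ("P1", "L2M"),
    ("P2", "L1"), ("P2", "SM"),
    ("P3", "S"),
    ("P4", "X"),
]

# Ordered pairs (R_GL, R_GL') with R_GL != R_GL'; one INT6..INT9 value per pair
pairs = list(permutations(labels, 2))
```

Each of INT6 to INT9 would then be evaluated once per element of `pairs`.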

The time structure information Θ_{v,k} consists of the parameters INT1 to INT9 for the gaze target label information θ_{v,k}. Each of these parameters is illustrated concretely with reference to FIG. 11. FIG. 11 shows the time structure information for the gaze target label L_1 of participant P_1, the speaker (R = S), that is, the time structure information for R_GL = S_L1. For INT6 to INT9, only the relationship with the gaze target label L_1 of participant P_2, that is, R_GL = L_L1, is shown in order to simplify the illustration. In the example of FIG. 11, INT1 to INT9 are obtained as follows.
・ INT1 = ET_S_L1 - ST_S_L1
・ INT2 = ST_U - ST_S_L1
・ INT3 = ET_U - ST_S_L1
・ INT4 = ET_S_L1 - ST_U
・ INT5 = ET_U - ET_S_L1
・ INT6 = ST_S_L1 - ST_L_L1
・ INT7 = ET_L_L1 - ST_S_L1
・ INT8 = ET_S_L1 - ST_L_L1
・ INT9 = ET_S_L1 - ET_L_L1
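The nine parameters follow directly from the start and end times, as the Python sketch below shows. The `Interval` type, the function name `time_structure`, and the numeric times are hypothetical; only the nine formulas come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: float  # ST_..., seconds
    end: float    # ET_..., seconds


def time_structure(label, utterance, other):
    """Compute INT1..INT9 for gaze label R_GL, utterance section U, and another label R_GL'."""
    return {
        "INT1": label.end - label.start,        # duration of R_GL
        "INT2": utterance.start - label.start,  # ST_U - ST_R_GL
        "INT3": utterance.end - label.start,    # ET_U - ST_R_GL
        "INT4": label.end - utterance.start,    # ET_R_GL - ST_U
        "INT5": utterance.end - label.end,      # ET_U - ET_R_GL
        "INT6": label.start - other.start,      # ST_R_GL - ST_R_GL'
        "INT7": other.end - label.start,        # ET_R_GL' - ST_R_GL
        "INT8": label.end - other.start,        # ET_R_GL - ST_R_GL'
        "INT9": label.end - other.end,          # ET_R_GL - ET_R_GL'
    }


# Made-up times: utterance 4.0 to 10.0 s, S_L1 from 9.0 to 9.5 s, L_L1 from 9.25 to 10.25 s
params = time_structure(
    label=Interval(9.0, 9.5),
    utterance=Interval(4.0, 10.0),
    other=Interval(9.25, 10.25),
)
```

A negative value simply means that the ordering of the two time points is the reverse of the one named in the parameter description.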

The time structure information Θ_{v,k} is sent to the database, for example after the utterance section U_{k+1} has started, together with information representing the utterance section U_k and its speaker P_uk, the next speaker P_{uk+1} who makes the utterance corresponding to the utterance section U_{k+1}, and the next utterance start timing T_{uk+1}. In the database, the time structure information Θ_{v,k} is merged with the parameter λ_{a,k}, and part or all of the information representing Θ_{v,k}, λ_{a,k}, U_k, P_uk, U_{k+1}, P_{uk+1}, and T_{uk+1} is retained.

The next speaker probability estimation unit 108 (or 108A) performs machine learning of a model for obtaining estimation information from a feature quantity f_{a,k} that corresponds to at least some of the following: the gaze target transition pattern E_{a,k}, the time structure information Θ_{v,k}, the speaker information P_uk, the utterance section U_k, the amount of breath inhaled by participant P_a in the inhalation section I_{a,k}, the length of the inhalation section I_{a,k}, the temporal change of the amount of breath inhaled in the inhalation section I_{a,k}, and the temporal relationship between the utterance section U_k and the inhalation section I_{a,k}. Using this model, the unit obtains and outputs the next speaker probability P^ns_i(t), which is the estimation information for the feature quantity.

The next speaker probability estimation unit 108 (or 108A) described above estimates the participant who will start the next utterance and its timing based on observed values of the breathing motion and of gaze, but it may additionally use information on the movement of the participants' heads. This exploits the tendency of people to nod deeply just before speaking. The next speaker probability estimation unit 108 (or 108A) analyzes the image data of each participant from the video input unit 105 and determines whether the participant nodded from whether the head moved up and down. When the unit determines that participant i nodded a few seconds before time t, it performs processing such as adding a predetermined value to the next speaker probability P^ns_i(t) of participant i at time t. The next speaker probability estimation unit 108 (or 108A) may also calculate the next speaker probability P^ns_i(t) based on at least one of the observed values of the breathing motion, the observed values of gaze, and the information on the movement of the participants' heads.
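The nod-based adjustment can be sketched in Python as below. The window of 2.0 seconds for "a few seconds before", the bonus of 0.1 as the "predetermined value", and the clipping of the result to 1.0 are all assumptions for this example; the embodiment does not give concrete values or say how the sum is bounded.

```python
def apply_nod_bonus(prob, nod_times, t, window=2.0, bonus=0.1):
    """Raise participant i's next speaker probability at time t after a recent nod.

    prob: P^ns_i(t) before the adjustment.
    nod_times: times (seconds) at which participant i was judged to have nodded.
    window: how far back a nod still counts (assumed value).
    bonus: the predetermined value to add (assumed value).
    """
    nodded_recently = any(0.0 <= t - nod <= window for nod in nod_times)
    if not nodded_recently:
        return prob
    return min(1.0, prob + bonus)  # clipping to 1.0 is an assumption of this sketch


boosted = apply_nod_bonus(0.5, nod_times=[9.0], t=10.0, bonus=0.25)
unchanged = apply_nod_bonus(0.5, nod_times=[3.0], t=10.0, bonus=0.25)
```

If the probabilities of all participants must sum to 1, a renormalization step would be needed after this adjustment; the text does not say whether that is the case.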

When the next speaker probability estimation unit 108 (or 108A) uses at least one of the observed values of the breathing motion, the observed values of gaze, and the information on the movement of the participants' heads, the sensor 103 may comprise any one or more of the position measurement device 201, the respiratory motion measurement device 202, the gaze target detection device 203, and the head motion detection device 204, according to the information that the unit uses.

The robot 100 of the first embodiment and the robot 100A of the second embodiment described above incorporate the microphone 101, the camera 102, the sensor 103, the audio input unit 104, the video input unit 105, the sensor input unit 106, the utterance section detection unit 107, the next speaker probability estimation unit 108 (or 108A), and the control unit 109 (or 109A), but the configuration is not limited to this. A conversation support device comprising at least some of the microphone 101, the camera 102, the sensor 103, the audio input unit 104, the video input unit 105, the sensor input unit 106, the utterance section detection unit 107, the next speaker probability estimation unit 108 (or 108A), and the control unit 109 (or 109A) may be provided as a device separate from the robot 100 (or 100A). The conversation support device is configured to communicate with the robot 100 (or 100A) and controls the utterances of the robot 100 (or 100A) by transmitting control signals from the control unit 109 (or 109A) to the robot 100 (or 100A).

The robot 100 and the robot 100A may be configured so that part of the body is displayed on a display unit such as a display, or may be displayed on the display unit as an agent whose whole body is a virtual person. Representing part of the body of the robot 100 or the robot 100A on a display unit means, for example, a configuration in which the entire face is a display unit on which an image of a face is shown. Various expressions can be produced by changing the face image shown on the display unit. The robot 100 and the robot 100A need not themselves include the plurality of microphones 101 and sensors 103; for example, they may be configured to exchange signals, by wire or wirelessly, with a plurality of microphones 101 and sensors 103 installed outside the robot 100 and the robot 100A.

The robot 100 of the first embodiment and the robot 100A of the second embodiment may have functions of ordinary robots other than those shown in FIGS. 1 and 7, as long as these do not interfere with the utterance control processing described above. The robot 100 of the first embodiment may also be configured to perform the same motions as a human in conversation, such as breathing motions, as the robot 100A of the second embodiment does.

According to the embodiments described above, the conversation support system is, for example, a robot, and estimates the next speaker probability, which is the probability that each participant will become the next speaker at an arbitrary time, based on measurements of nonverbal behavior such as the gaze, breathing, and head movement of each participant during the conversation. Based on the next speaker probability of each participant, the conversation support system estimates the predicted next speaker, who is the participant that should speak next, and the timing at which the predicted next speaker will start speaking. When it detects that the predicted next speaker did not speak at the estimated timing, it prompts an utterance, targeting either the predicted next speaker or a participant other than the predicted next speaker. To prompt the utterance, the conversation support system controls the robot, or the speaker shown on a display device (an agent whose whole body is a virtual person), so that it performs a motion indicating that the right to speak is being handed to the target person. For example, the robot, or the speaker shown on the display device, outputs speech that prompts the target person to speak, or takes nonverbal actions such as moving its eyes, head, or torso to direct its gaze or face toward the target person, or extending an upper limb toward the target person.
According to the embodiments described above, the robot, or the speaker shown on the display device, prompts a participant who has missed the timing to speak, thereby encouraging that participant's utterance. Participants can also be prompted to speak so that silences in the conversation do not grow long and spoil its atmosphere.
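The prompting decision summarized above can be sketched in Python as follows. This is only a sketch under stated assumptions: choosing the predicted next speaker as the participant with the highest next speaker probability, the `grace` period, and the function name `choose_prompt_target` are illustrative choices, not the estimation procedure of the control unit 109.

```python
def choose_prompt_target(next_speaker_prob, estimated_start, now, speaking, grace=1.0):
    """Return the participant to prompt, or None if no prompt is needed yet.

    next_speaker_prob: {participant: next speaker probability}.
    estimated_start: estimated time (seconds) at which the predicted next speaker starts.
    now: current time (seconds).
    speaking: set of participants currently speaking.
    grace: extra waiting time before prompting (assumed value).
    """
    # Assumption: the predicted next speaker is the most probable participant.
    predicted = max(next_speaker_prob, key=next_speaker_prob.get)
    if now < estimated_start + grace:
        return None  # the estimated timing has not passed yet
    if predicted in speaking:
        return None  # the predicted next speaker did start speaking
    return predicted


probs = {"P1": 0.1, "P2": 0.7, "P3": 0.2}
target = choose_prompt_target(probs, estimated_start=10.0, now=12.0, speaking=set())
```

The returned participant would then be passed to the utterance guidance side, for example to direct the robot's gaze toward that person. Targeting a participant other than the predicted next speaker, which the embodiments also allow, is not covered by this sketch.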

Each functional unit of the robot 100 or the robot 100A in the embodiments described above can be realized by, for example, a computer. In that case, a program for realizing these functions may be recorded on a computer-readable recording medium, and the program recorded on the medium may be read into a computer system and executed. The "computer system" here includes an OS and hardware such as peripheral devices. A "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. A "computer-readable recording medium" may further include media that hold a program dynamically for a short time, such as a communication line used when a program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and media that hold a program for a certain period, such as volatile memory inside a computer system that acts as the server or client in that case. The program may realize only some of the functions described above, may realize them in combination with a program already recorded in the computer system, or may be realized using a programmable logic device such as an FPGA (Field Programmable Gate Array).

Embodiments of the present invention have been described in detail above with reference to the drawings. The specific configuration is not limited to these embodiments, however, and also includes designs and the like that do not depart from the gist of the invention.

The present invention can be applied to the control of a robot that converses with participants, and to the control of the movement of an agent (a virtual person) shown on a display device that converses with participants.

51a ... right eye, 51b ... left eye, 52 ... mouth, 53 ... head, 54 ... neck, 55 ... torso, 55a ... right arm, 55b ... left arm, 100, 100A ... robot, 101 ... microphone, 102 ... camera, 103 ... sensor, 104 ... audio input unit, 105 ... video input unit, 106 ... sensor input unit, 107 ... utterance section detection unit, 108, 108A ... next speaker probability estimation unit, 109, 109A ... control unit, 110 ... sound control unit, 111 ... mouth control unit, 112 ... gaze control unit, 113 ... head control unit, 114 ... torso control unit, 115 ... loudspeaker, 116 ... mouth drive unit, 117 ... eye drive unit, 118 ... head drive unit, 119 ... torso drive unit, 120 ... sensor signal conversion unit, 201 ... position measurement device, 202 ... respiratory motion measurement device, 203 ... gaze target detection device, 204 ... head motion detection device, 401 ... speech analysis unit, 402 ... conversation information generation unit, 403 ... conversation information DB, 404 ... utterance information generation unit, 405 ... sound signal generation unit, 1091, 1091A ... motion pattern information storage unit

Claims

A conversation support system comprising:
a next speaker probability estimation unit that estimates, based on measurement results of the nonverbal behavior of each participant in a conversation, a next speaker probability that is the probability that each of the participants will become the next speaker at an arbitrary time;
a control unit that estimates, based on the next speaker probability of each participant, a predicted next speaker who is the participant to speak next and a timing at which the predicted next speaker starts speaking, and that, upon detecting that the predicted next speaker did not speak at the estimated timing, issues an instruction to prompt an utterance with the predicted next speaker as a target person; and
an utterance guidance unit that receives the instruction from the control unit and performs processing for prompting the target person to speak.

The conversation support system according to claim 1, wherein, upon detecting that the predicted next speaker did not speak at the estimated timing, the control unit instructs the utterance guidance unit to prompt an utterance with a participant other than the predicted next speaker as the target person.

The conversation support system according to claim 1 or 2, wherein the utterance guidance unit controls a robot, or a speaker displayed on a display device, so that it performs a motion indicating transfer of the right to speak to the target person.

The conversation support system according to claim 3, wherein the utterance guidance unit controls one or more of the eyes, head, and torso of the robot, or of the speaker displayed on the display device, so as to direct its line of sight toward the target person.

The conversation support system according to claim 3 or 4, wherein the utterance guidance unit controls the robot, or the speaker displayed on the display device, so as to extend an upper limb toward the target person.

The conversation support system according to any one of claims 1 to 5, wherein the utterance guidance unit outputs speech prompting the target person to speak.

A conversation support device comprising:
a next speaker probability estimation unit that estimates, based on measurement results of the nonverbal behavior of each participant in a conversation, a next speaker probability that is the probability that each of the participants will become the next speaker at an arbitrary time; and
a control unit that estimates, based on the next speaker probability of each participant, a predicted next speaker who is the participant to speak next and a timing at which the predicted next speaker starts speaking, and that, upon detecting that the predicted next speaker did not speak at the estimated timing, instructs an utterance guidance unit, which performs processing for prompting an utterance, to prompt an utterance with the predicted next speaker as a target person.

A conversation support program for causing a computer to execute:
a next speaker probability estimation step of estimating, based on measurement results of the nonverbal behavior of each participant in a conversation, a next speaker probability that is the probability that each of the participants will become the next speaker at an arbitrary time; and
a control step of estimating, based on the next speaker probability of each participant, a predicted next speaker who is the participant to speak next and a timing at which the predicted next speaker starts speaking, and, upon detecting that the predicted next speaker did not speak at the estimated timing, instructing an utterance guidance unit, which performs processing for prompting an utterance, to prompt an utterance with the predicted next speaker as a target person.